Skip to content
Data & Analytics

Data Engineering

Clean, reliable pipelines that make your data trustworthy before anyone touches it.

Every data question eventually runs into a data quality problem. Pipelines that silently drop rows, transformation logic duplicated across three different dashboards, warehouse tables without owners — these are the real reasons analytics takes three days and nobody quite trusts the number. We fix the foundations before anything else.

We build the full data engineering layer: extraction from every source you have, a clean transformation model in dbt, orchestration that retries failures rather than silently stopping, and a warehouse schema designed so adding a new report does not require touching the core models. The work is documented as code — every transformation is version-controlled, testable and readable by the next engineer who joins your team.

what's included

What this covers

Source connectors: REST APIs, databases, SaaS tools, flat files and event streams

ELT pipeline development with incremental loading and idempotent design

dbt transformation layer: staging, intermediate and mart models with documented lineage

Orchestration via Airflow or Prefect: scheduling, retries, alerting on failures

Data warehouse setup and schema design on Snowflake, BigQuery or Redshift

Data quality checks and freshness assertions built into the pipeline — not manual audits

Streaming pipelines for real-time or near-real-time use cases via Kafka or Kinesis

what you get

Deliverables

  • A running ELT pipeline with orchestration, monitoring and failure alerting
  • A dbt project in source control with documented models, tests and lineage
  • Data warehouse configured with roles, access controls and cost guardrails
  • A data catalogue or inline documentation so any analyst can find and trust what they need
tools & stack

What we build with

dbtApache AirflowPrefectApache SparkKafkaSnowflakeBigQueryRedshiftPostgreSQLFivetranAirbytePythonSQLAWS Glue
what we mean

Data engineering

Data engineering is the discipline of building and operating the infrastructure that moves data from operational sources into an analytical form that can be queried, trusted, and acted on. It sits upstream of everything else: dashboards, ML models, and product analytics are only as reliable as the pipelines that feed them. The work spans ingestion (getting data out of source systems reliably), transformation (reshaping and business-logic-encoding in a warehouse layer), orchestration (scheduling, retrying, and observing the full DAG), and quality enforcement (asserting contracts at every layer boundary).

Modern data engineering is defined by two shifts. First, the movement from ETL to ELT: raw data lands in the warehouse before any transformation runs, so the audit trail is complete and bugs are fixed by re-running SQL rather than re-extracting from source. Second, the adoption of change-data capture (CDC) over batch polling: rather than querying the source database on a schedule, CDC reads the database write-ahead log or binlog and streams row-level changes continuously, eliminating the full-table scans that slow source systems and the hourly delay that makes dashboards stale. The scope of data engineering ends at the serving layer — the clean, trusted mart tables that analysts and BI tools consume. Metric definition and dashboard design are adjacent disciplines.

how we work

Pipeline engineering lifecycle

Stages run roughly in sequence on a new pipeline, but quality and observability gates apply continuously once the pipeline is live. Skipping a stage does not accelerate delivery — it defers a production incident.

    01

    Source profiling & contract definition

    Before writing a single connector, understand what the source actually emits and agree on what it promises to emit in future. Source schema assumptions that are undocumented will break the pipeline silently the first time a column is renamed or a nullable field becomes null.

    • Profile every source table: row counts, null rates per column, cardinality of candidate keys, and the distribution of updated_at timestamps
    • Identify the correct primary key (or composite key) for deduplication — operational systems often have soft deletes and duplicate rows that are not obvious from the schema
    • Document freshness expectations per source domain: e.g., orders table is written continuously; finance tables are batch-loaded nightly at 02:00 UTC
    • Agree a schema contract with the source team: which columns are guaranteed stable, which may change, and what the notification SLA is for breaking changes
    • Capture the agreed contract in a dbt source YAML block with freshness max_loaded_at_field and warn/error thresholds
    02

    Ingestion design: managed EL vs custom CDC

    The extraction mechanism determines your pipeline's latency floor, operational overhead, and resilience to source-side changes. Choosing the wrong tool here means rebuilding later under pressure.

    • Evaluate managed connectors (Airbyte, Fivetran) against the source catalogue: if 80% of sources have a stable connector, managed EL reduces engineering time significantly
    • Identify sources that require CDC: high-write-volume transactional databases where batch polling would impact source performance, or use cases where sub-minute freshness is a hard requirement
    • Design the CDC connector stack for relational sources: Debezium on the WAL (PostgreSQL) or binlog (MySQL), publishing to Kafka topics; or Airbyte CDC mode for simpler deployments that do not need a persistent Kafka cluster
    • Implement idempotent load patterns for every source: append-only with a dedup window for event streams, merge/upsert keyed on the documented primary key for entity tables, truncate-and-reload for reference data under 1 M rows
    • Store raw payloads in a bronze layer with ingestion metadata columns (_extracted_at, _source_file, _cdc_operation) before any transformation runs
    03

    Transformation modelling in dbt

    All business logic belongs in dbt, not in ingestion scripts, BI tools, or ad hoc queries. A dbt model is a SELECT statement in version control — testable, reviewable, and diffable.

    • Scaffold the project into four layers: sources (declarations only), staging (one-to-one with sources: cast types, rename columns, no joins), intermediate (cross-source joins, window functions, deduplication), and marts (grain-specific fact and dimension tables for reporting)
    • Enforce one-model-one-grain: a fact table must never silently aggregate; if a mart needs pre-aggregated rows, that is a separate model with a distinct grain documented in its description block
    • Apply incremental materialisation to high-volume fact tables using is_incremental() with a lookback window wide enough to reprocess late-arriving data; set full-refresh as a scheduled weekly job rather than per-run
    • Handle slowly-changing dimensions (SCDs) explicitly: SCD Type 1 (overwrite) for attributes where history is not required, SCD Type 2 (row versioning with valid_from / valid_to) for attributes where historical snapshots drive reporting
    • Generate dbt docs and column-level lineage on every CI run so that impact analysis ('what breaks if I rename this column?') is always current and queryable
    04

    Orchestration & failure handling

    An orchestrator that only runs tasks is not doing its job. The orchestrator must retry intelligently, surface failures immediately, and enforce execution order so downstream models never run on stale inputs.

    • Model the pipeline as a DAG in Airflow or Prefect: ingestion tasks upstream, dbt source freshness checks as a gate, dbt run and dbt test as sequential children, with BI refresh or downstream notification as the terminal node
    • Configure retry policies per task type: network-bound ingestion tasks warrant 3 retries with exponential back-off; idempotent dbt runs can retry once immediately before alerting
    • Send structured alerts to a dedicated data-ops channel on every DAG failure: task name, run ID, error snippet, and a deep link to the log — not just 'pipeline failed'
    • Implement SLA miss alerts separately from task failures: a DAG that completes at 07:45 when stakeholders expect data by 07:00 is a failure even if every task succeeded
    • Parameterise backfills as first-class operations so re-processing a date range is a single orchestrator API call, not a manual intervention
    05

    Data quality gates & observability

    Quality tests run as part of the pipeline, blocking promotion to the next layer if they fail. Observability covers the pipeline itself (task latency, row-count throughput) and the data (freshness, distribution drift, referential integrity).

    • Add dbt generic tests to every staging model primary key (not_null + unique at minimum) before any model merges to main; treat CI test failures as blocking
    • Write custom singular tests for business invariants that generic tests cannot express: e.g., daily_revenue > 0, refund_amount <= original_charge_amount, event_timestamp between '2020-01-01' and current_date
    • Instrument row-count and null-rate metrics per model run and push to a time-series store (Prometheus or BigQuery audit table) so anomaly detection can compare today's load against a rolling 14-day baseline
    • Track pipeline latency end-to-end: time from source record created_at to the record being queryable in the gold mart — and alert when that SLA is breached
    • Run dbt source freshness as a pre-condition gate that aborts the transformation DAG if any source has not loaded within its declared SLA, preventing stale data from silently flowing into marts
decision guides

How we'd choose

There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.

Managed ingestion (Airbyte / Fivetran) vs custom Debezium CDC

The choice is not binary — most production platforms use managed connectors for SaaS sources and CDC for high-volume transactional databases. The criteria below help you decide which pattern a specific source warrants.

CriterionManaged EL (Airbyte / Fivetran)Custom CDC (Debezium + Kafka)
Source typeSaaS APIs (Salesforce, HubSpot, Stripe, Zendesk) and low-to-medium volume databases where a connector already existsHigh-write-volume relational databases (PostgreSQL, MySQL, SQL Server) where polling would degrade source performance or miss deletes
Freshness achievableTypically 15-minute to 1-hour sync intervals; some Fivetran connectors support 5-minute micro-batch for high-priority sourcesSub-second event delivery from the WAL/binlog into the Kafka topic; warehouse sink latency adds seconds, not minutes
Operational overheadNear-zero: connector upgrades, API version changes, and schema drift are handled by the vendor; on-call burden is minimalSignificant: Debezium connector configuration, Kafka cluster management, Schema Registry, consumer-group lag monitoring, and WAL retention policies all require active ownership
Delete and update captureDepends on source API: most SaaS connectors cannot capture hard deletes; soft-delete flags are exposed only if the API exposes themFull fidelity: the binlog emits INSERT, UPDATE, and DELETE events with before/after row images; deletes are first-class citizens in the change stream
Cost modelPer-row or per-connector pricing (Fivetran) or self-hosted compute (Airbyte OSS on Kubernetes); predictable for stable sync volumesKafka infrastructure cost (MSK, Confluent Cloud, or self-managed) plus Debezium worker compute; cost is flat relative to row volume once the cluster is sized
When to chooseDefault choice for SaaS sources and databases under ~10 M rows/day where sub-minute freshness is not required and engineering capacity is the binding constraintRequired when the source is a transactional database under write load, when hard deletes must be captured, or when downstream consumers need a durable, replayable event stream
what we avoid

Anti-patterns to avoid

These failure modes appear specifically in pipeline engineering work — distinct from the broader data-platform anti-patterns covered in the pillar overview.

Watermark-based incrementals without a late-arriving data window

Incremental pipeline designs that filter on updated_at >= last_run_timestamp assume that every row's timestamp reflects when it was last modified in the source. In practice, event streams have late-arriving data: a mobile event captured offline, a transactional system that back-dates corrections, or a CDC connector that briefly falls behind and replays. Without a lookback window, these rows are silently skipped on the incremental run and never appear in the warehouse.

Always apply a lookback window on incremental models — typically 3 to 7 days depending on the source's observed late-arrival distribution. In dbt incremental models, this means filtering on updated_at >= (select max(updated_at) from {{ this }}) - interval '3 days'. Accept the small cost of reprocessing recent rows in exchange for correctness. For event sources with known late-arrival characteristics, measure the 99th-percentile arrival delay empirically before setting the window.

Hardcoding pipeline logic in the orchestrator

When Airflow DAGs or Prefect flows contain transformation SQL, business rules, or data manipulation logic inline — rather than delegating to dbt — that logic becomes invisible to the data team, untested by dbt's test framework, and untracked in lineage graphs. The orchestrator becomes a second place where business logic lives, and the two versions eventually diverge. Debugging requires reading Python, not SQL, and the logic is not reproducible outside the orchestration context.

Treat orchestration as scheduling and dependency management only. A DAG node should call a dbt run --select model_name, a dbt test, or a well-defined Python function with a documented interface — not execute inline SQL against the warehouse. If a data transformation does not fit neatly into a dbt model, it belongs in a standalone Python script with its own test suite, called as a task, not embedded in the DAG definition.

No idempotency — running the pipeline twice produces wrong data

A pipeline that cannot be re-run safely is fragile by design. Append-only loads without deduplication, INSERT statements without conflict handling, or transformations that accumulate state across runs will produce duplicate rows the moment a retry occurs — which it will, whether due to a transient network failure, a manual backfill, or a deployment rollback. The resulting duplicates often go undetected until a metric looks unexpectedly high.

Design every load and transformation for idempotency from the start. For entity tables, use MERGE or INSERT ... ON CONFLICT (upsert) keyed on the documented primary key. For event tables, use append-only loads but add a dedup step in the dbt staging model that partitions by event_id and picks the latest record. For full-refresh jobs, always truncate-and-reload atomically within a transaction. Add a uniqueness test on every primary key in dbt so that duplicate rows cause an explicit test failure rather than a silent data quality regression.

good to know

Common questions

We use Fivetran already — do you replace it or build on top of it?

We work alongside managed connectors like Fivetran where they save time. We build custom connectors for sources Fivetran does not cover, and we always own the transformation layer in dbt regardless of where raw data comes from.

How do you handle pipeline failures without someone manually monitoring it at all times?

Orchestration tools like Airflow and Prefect have built-in retry policies, and we configure alerting to a Slack channel or email for any DAG failure. We also add data quality tests that run after each load and alert separately if the data looks wrong even when the pipeline itself succeeded.

more in Data & Analytics

Related capabilities

Have something in mind?

Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.