A data pipeline is an automated workflow that extracts data from one or more source systems, optionally transforms it, and loads it into one or more destination systems. It moves data from where it is generated (operational apps, event streams, files) to where it is consumed (data warehouses, analytics tools, machine learning models).
Why Data Pipelines Matter
Modern businesses generate data in dozens of operational systems — Salesforce, Stripe, Postgres, Segment, HubSpot, Zendesk, internal apps. Without data pipelines, that data lives in silos. Data pipelines stitch the operational landscape together, feeding analytics, BI, ML, and reverse ETL workflows from a single connected data flow.
The 2026 data stack is essentially a graph of data pipelines: extract-load pipelines (Fivetran, Airbyte) move data into the warehouse; transformation pipelines (dbt) reshape it; reverse ETL pipelines push it back to operational tools. Pipeline reliability is data reliability.
How a Data Pipeline Works
A typical data pipeline has three stages plus orchestration:
- Extract: Pull data from source systems — operational databases, SaaS APIs, event streams, files. Often incremental (only new/changed records).
- Transform (optional): Clean, deduplicate, join, aggregate, or reshape the data. Modern pipelines often defer transformation to in-warehouse dbt models (the ELT pattern).
- Load: Write to destinations — data warehouse, data lake, operational system, or feature store.
- Orchestration: Schedule, monitor, alert, retry, and track lineage across pipeline stages. Tools like Airflow, Dagster, and Prefect orchestrate complex multi-step pipelines.
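The extract, transform, and load stages above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production design: the `SOURCE_ROWS` data, the `updated_at` change-tracking column, and the `Destination` class are all hypothetical stand-ins for a real source system and warehouse table.

```python
from dataclasses import dataclass, field

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
SOURCE_ROWS = [
    {"id": 1, "email": " Ann@Example.com ", "updated_at": 10},
    {"id": 2, "email": "bob@example.com", "updated_at": 20},
    {"id": 2, "email": "bob@example.com", "updated_at": 20},  # duplicate record
    {"id": 3, "email": "cy@example.com", "updated_at": 30},
]


@dataclass
class Destination:
    """Stands in for a warehouse table keyed by id."""
    rows: dict = field(default_factory=dict)
    watermark: int = 0  # highest updated_at value loaded so far


def extract(source, watermark):
    """Incremental extract: return only rows changed since the last run."""
    return [r for r in source if r["updated_at"] > watermark]


def transform(rows):
    """Normalise emails and deduplicate by id (the last record wins)."""
    cleaned = {}
    for r in rows:
        cleaned[r["id"]] = {**r, "email": r["email"].strip().lower()}
    return list(cleaned.values())


def load(dest, rows):
    """Upsert rows into the destination and advance the watermark."""
    for r in rows:
        dest.rows[r["id"]] = r
        dest.watermark = max(dest.watermark, r["updated_at"])
    return len(rows)


def run_pipeline(source, dest):
    """One pipeline run: extract, then transform, then load."""
    return load(dest, transform(extract(source, dest.watermark)))
```

Calling `run_pipeline(SOURCE_ROWS, Destination())` loads the three distinct customers. A second run against the same destination loads nothing, because the stored watermark filters out rows that were already processed.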
Modern pipelines split into batch (run on schedule, e.g. hourly) and streaming (process events continuously). The choice depends on data freshness requirements.
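To make the batch versus streaming distinction concrete, the sketch below computes the same revenue total both ways. The event shape (`{"amount": ...}`) and both function names are assumptions made for illustration; a real streaming pipeline would read from a broker rather than a Python iterable.

```python
from typing import Iterable, Iterator


def batch_total(events: list[dict]) -> int:
    """Batch: the full set of events is available, so process it in one pass."""
    return sum(e["amount"] for e in events)


def streaming_totals(events: Iterable[dict]) -> Iterator[int]:
    """Streaming: update state per event and emit a fresh result each time."""
    total = 0
    for e in events:
        total += e["amount"]
        yield total


events = [{"amount": 5}, {"amount": 7}, {"amount": 3}]
```

Both approaches reach the same final total. The difference is when results become available: the batch function answers once per scheduled run, while the streaming generator yields an updated total after every event.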
Real-World Example
A SaaS company’s data pipeline runs every 15 minutes: Fivetran extracts new customer records from Salesforce and Stripe and loads them into Snowflake (the EL stages). dbt then transforms the raw data into clean fact and dimension tables, runs data quality tests, and materialises an analytics.customers table (T stage). Hightouch reverse-ETL pipelines sync customer health scores back to Salesforce. Airflow orchestrates the whole flow with retries and Slack alerts on failure.
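The orchestration part of this example (run steps in dependency order, retry on failure, alert when retries run out) can be sketched with the standard library alone. This is not how Airflow is implemented or configured; `run_dag`, the task names, and the `alert` callback are hypothetical, and the sketch retries immediately where a real orchestrator would wait between attempts.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_retries=2, alert=print):
    """Run tasks in dependency order with retries.

    tasks: mapping of task name -> zero-argument callable
    deps:  mapping of task name -> set of upstream task names
    Returns the names of tasks that completed. Stops at the first task
    that still fails after max_retries retries and calls alert() once.
    """
    done = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                done.append(name)
                break
            except Exception as exc:
                if attempt > max_retries:
                    alert(f"{name} failed after {attempt} attempts: {exc}")
                    return done  # downstream tasks are skipped
    return done


# Assumed task graph mirroring the example: EL, then T, then reverse ETL.
DEPS = {
    "extract_load": set(),
    "dbt_transform": {"extract_load"},
    "reverse_etl": {"dbt_transform"},
}
```

In the scenario above, `alert` would post to Slack and each callable would trigger the corresponding Fivetran, dbt, or Hightouch job. Here they are left as plain Python callables so the control flow stays visible.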
Common Data Pipeline Tools and Platforms in 2026
2026 data pipeline tool landscape:
- Fivetran: Managed ELT service. 300+ connectors, fully automated, premium pricing.
- Airbyte: Open-source ELT. Self-host or managed cloud. Hundreds of connectors.
- dbt: Industry-standard transformation tool for in-warehouse pipelines.
- Apache Airflow: Open-source orchestration platform. One of the most widely deployed workflow tools.
- Dagster / Prefect: Modern alternatives to Airflow with better Python ergonomics and asset-based modelling.
- Estuary Flow / Materialize: Streaming-first data pipelines for real-time use cases.
Frequently Asked Questions About Data Pipelines
What is the difference between an ETL pipeline and an ELT pipeline?
ETL transforms before loading (older pattern). ELT loads raw and transforms inside the warehouse (modern default). Both are types of data pipelines.
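The difference is easiest to see side by side. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse. The table names and the `RAW` sample rows are assumptions for illustration, and in practice the in-warehouse SQL step would typically live in a dbt model.

```python
import sqlite3

# Hypothetical raw records as they arrive from a source system.
RAW = [(1, " Ann@Example.com "), (2, "BOB@example.com")]


def etl(conn):
    """ETL: transform in application code, then load only the cleaned rows."""
    cleaned = [(i, email.strip().lower()) for i, email in RAW]
    conn.execute("CREATE TABLE customers_etl (id INTEGER, email TEXT)")
    conn.executemany("INSERT INTO customers_etl VALUES (?, ?)", cleaned)


def elt(conn):
    """ELT: load raw rows untouched, then transform with SQL in the database."""
    conn.execute("CREATE TABLE raw_customers (id INTEGER, email TEXT)")
    conn.executemany("INSERT INTO raw_customers VALUES (?, ?)", RAW)
    conn.execute(
        "CREATE TABLE customers_elt AS "
        "SELECT id, lower(trim(email)) AS email FROM raw_customers"
    )


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
```

Both paths end with the same cleaned customer table. The ELT path also keeps `raw_customers`, so the transformation can be changed and re-run later without extracting from the source again.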
What is a streaming data pipeline?
A pipeline that processes events continuously as they arrive, with latency ranging from sub-second to a few seconds. Typical tools: Kafka, Estuary Flow, Materialize, Apache Flink, Confluent.
Do I need an orchestrator for data pipelines?
For more than a few pipelines, yes. Airflow, Dagster, and Prefect handle scheduling, retries, alerting, lineage, and dependencies. Without orchestration, pipeline failures can easily go unnoticed.
What is data pipeline observability?
Continuous monitoring of pipeline freshness, volume, schema, and quality, with alerts when anomalies appear. See also: data observability.
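Two of these checks, freshness and volume, are simple enough to sketch directly. The function names, the z-score approach, and the threshold of 3 are assumptions chosen for illustration; observability tools use their own, usually more sophisticated, anomaly models.

```python
from statistics import mean, pstdev


def check_freshness(last_loaded_at, now, max_age_seconds):
    """Pass if the most recent load is no older than max_age_seconds.

    Timestamps are plain epoch seconds to keep the sketch simple.
    """
    return (now - last_loaded_at) <= max_age_seconds


def check_volume(history, today, z_threshold=3.0):
    """Pass if today's row count is within z_threshold standard
    deviations of the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today == mu  # no historical variation: require an exact match
    return abs(today - mu) / sigma <= z_threshold


# Hypothetical daily row counts for one table.
ROW_COUNT_HISTORY = [1000, 1020, 980, 1010, 990]
```

With this history, a load of 1,005 rows passes the volume check while a load of 200 rows fails it, which is the kind of silent partial load that otherwise shows up only as wrong numbers on a dashboard.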
How do I choose a data pipeline tool?
For extract-load, choose Fivetran (managed, premium) or Airbyte (open-source). For transformation, dbt is the standard. For orchestration, choose Airflow if you need its broad ecosystem, or Dagster if you want a more modern UX.
What are common data pipeline failure modes?
Source schema changes, API rate limits, network failures, downstream warehouse outages, and bad data corrupting downstream tables. Data observability tools help catch these early.
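Source schema changes are among the easier failure modes to detect before they corrupt downstream tables. The sketch below compares an expected column-to-type mapping with what the source currently reports. The `diff_schema` helper and the sample schemas are hypothetical, and how the actual schema is obtained depends on the source system.

```python
def diff_schema(expected, actual):
    """Compare two column-name -> type mappings and report drift."""
    expected_cols, actual_cols = set(expected), set(actual)
    return {
        "missing": sorted(expected_cols - actual_cols),
        "unexpected": sorted(actual_cols - expected_cols),
        "type_changed": sorted(
            c for c in expected_cols & actual_cols if expected[c] != actual[c]
        ),
    }


def has_drift(diff):
    """True if any category of schema drift was found."""
    return any(diff.values())


# Assumed schemas: the source renamed a column and changed a type.
EXPECTED = {"id": "int", "email": "text", "created_at": "timestamp"}
ACTUAL = {"id": "text", "email_address": "text", "created_at": "timestamp"}
```

Running a check like this at the start of each pipeline run lets the pipeline fail fast, with a specific message, instead of loading rows with null or mistyped columns.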