A data lake is a centralised repository that stores large volumes of raw data — structured, semi-structured, and unstructured — in its native format until needed, typically built on inexpensive object storage like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage.
Why Data Lakes Matter
The data lake emerged in the 2010s as data volumes exploded beyond what traditional data warehouses could economically store. Where warehouses store clean, structured data optimised for SQL queries, data lakes store everything: log files, JSON event streams, images, video, machine learning training data, plus structured tables.
The key economic argument for data lakes is storage cost: object storage like S3 costs roughly $23 per terabyte per month, versus $30-100+ for warehouse storage. At petabyte scale (1,000 TB), even the low end of that gap is about $7,000 per month. The trade-off is query performance: querying raw lake data is generally slower than querying structured warehouse tables.
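As a rough worked example of the storage arithmetic above, the sketch below compares monthly storage bills at the per-terabyte rates quoted in this section. The function name and the 1 PB volume are illustrative assumptions, and request, egress, and compute charges are deliberately ignored, so treat the output as an order-of-magnitude estimate rather than a pricing quote.

```python
# Illustrative storage-only comparison using the per-TB rates quoted above.
# Ignores request, egress, and compute charges (a simplifying assumption).

LAKE_USD_PER_TB_MONTH = 23               # object storage figure from the text
WAREHOUSE_USD_PER_TB_MONTH = (30, 100)   # low and high ends of the quoted range


def monthly_storage_cost(terabytes: float, usd_per_tb: float) -> float:
    """Return the monthly storage cost in USD for a given volume."""
    return terabytes * usd_per_tb


volume_tb = 1000  # hypothetical 1 PB data set (decimal: 1 PB = 1,000 TB)

lake_cost = monthly_storage_cost(volume_tb, LAKE_USD_PER_TB_MONTH)
warehouse_low = monthly_storage_cost(volume_tb, WAREHOUSE_USD_PER_TB_MONTH[0])
warehouse_high = monthly_storage_cost(volume_tb, WAREHOUSE_USD_PER_TB_MONTH[1])

print(f"Lake:      ${lake_cost:,.0f} per month")
print(f"Warehouse: ${warehouse_low:,.0f} to ${warehouse_high:,.0f} per month")
```

Changing `volume_tb` shows how the absolute gap grows linearly with volume, which is why the argument matters most at petabyte scale.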
How a Data Lake Works
A modern data lake architecture typically includes:
- Raw storage layer: Object storage (S3, ADLS, GCS) holding files in formats like Parquet, ORC, JSON, CSV, or Avro.
- Catalog layer: Metadata catalog (AWS Glue, Hive Metastore, Unity Catalog) describing what data lives where.
- Query engines: Tools like Athena, Presto, Trino, Spark, or BigQuery that query the lake in place without moving data.
- Governance: Tools enforcing access controls, data quality, and lineage.
- Optional table formats: Apache Iceberg, Delta Lake, or Apache Hudi add ACID transactions, time travel, and schema evolution to the data lake — turning it into a “lakehouse.”
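The layers listed above can be sketched in miniature with nothing but the Python standard library. This is a toy model, not a real deployment: a temporary local directory stands in for the object store, a plain dictionary stands in for the metadata catalog, and the hypothetical `scan` function stands in for a query engine that reads files in place. All table names, paths, and records are invented for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Raw storage layer: a local temp directory stands in for an object store bucket.
lake_root = Path(tempfile.mkdtemp(prefix="toy_lake_"))

events_path = lake_root / "raw" / "events" / "2026-01-01.jsonl"
events_path.parent.mkdir(parents=True)
events_path.write_text(
    "\n".join(json.dumps(e) for e in [
        {"user": "a", "action": "click"},
        {"user": "b", "action": "view"},
        {"user": "a", "action": "view"},
    ])
)

users_path = lake_root / "raw" / "users" / "users.csv"
users_path.parent.mkdir(parents=True)
with users_path.open("w", newline="") as f:
    csv.writer(f).writerows([["user", "plan"], ["a", "pro"], ["b", "free"]])

# Catalog layer: metadata describing which data lives where and in what format.
catalog = {
    "events": {"path": events_path.parent, "format": "jsonl"},
    "users": {"path": users_path.parent, "format": "csv"},
}


def scan(table: str):
    """Query-engine stand-in: read a table's files in place, yielding dict rows."""
    entry = catalog[table]
    for file in sorted(entry["path"].iterdir()):
        with file.open(newline="") as f:
            if entry["format"] == "jsonl":
                for line in f:
                    if line.strip():
                        yield json.loads(line)
            elif entry["format"] == "csv":
                yield from csv.DictReader(f)


# A "query": count view events per user without copying data out of the lake.
views_per_user = {}
for row in scan("events"):
    if row["action"] == "view":
        views_per_user[row["user"]] = views_per_user.get(row["user"], 0) + 1

print(views_per_user)
```

The point of the sketch is the separation of concerns: files can be added under a table's path without touching the query code, because the catalog entry, not the query, records where the data lives and how to read it.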
Most modern data teams use a hybrid pattern: store everything in the data lake (cheap, flexible), then load transformed, high-value data into a data warehouse (Snowflake, BigQuery) for fast analytical queries.
Real-World Example
A SaaS company stores 5 TB of raw clickstream events per month in S3 as Parquet files (the data lake). Nightly Spark jobs aggregate the raw events into hourly summary tables, also in the lake. The summary tables are then loaded into Snowflake, where dbt models shape them for BI consumption (the warehouse). Data scientists query the raw lake directly for ML training. In this scenario, the total cost is about one-tenth of what storing all of the raw data in Snowflake would cost.
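To make the nightly aggregation step concrete, here is a minimal sketch of rolling raw events up into hourly summary rows. It uses plain Python rather than Spark so that it runs anywhere; the event fields (`ts`, `event`), the sample records, and the `hourly_summary` function are assumptions made for illustration, not the company's actual schema or job.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw clickstream events (in the example these would be Parquet rows).
raw_events = [
    {"ts": "2026-01-01T09:05:00", "event": "page_view"},
    {"ts": "2026-01-01T09:40:00", "event": "page_view"},
    {"ts": "2026-01-01T09:59:59", "event": "signup"},
    {"ts": "2026-01-01T10:01:00", "event": "page_view"},
]


def hourly_summary(events):
    """Count events per (hour, event type), truncating timestamps to the hour."""
    counts = Counter()
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(hour.isoformat(), e["event"])] += 1
    # Return sorted rows shaped like a small summary table.
    return [
        {"hour": hour, "event": event, "count": n}
        for (hour, event), n in sorted(counts.items())
    ]


summary = hourly_summary(raw_events)
for row in summary:
    print(row)
```

The summary is far smaller than the raw input, which is what makes it cheap to load into a warehouse while the full-detail events stay in the lake.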
Common Data Lake Tools and Platforms in 2026
2026 data lake and lakehouse tool landscape:
- Amazon S3 + Athena: Most common AWS data lake. Object storage + serverless SQL query engine.
- Azure Data Lake Storage + Synapse: Microsoft equivalent. Tight Azure integration.
- Google Cloud Storage + BigQuery: GCP data lake. BigQuery can query lake files directly via external tables.
- Databricks (Delta Lake): Lakehouse pioneer. Delta Lake adds ACID + time travel to the data lake.
- Apache Iceberg: Open table format becoming the 2026 standard for lakehouse architectures.
- Trino / Presto: Open-source distributed SQL engine for querying data lakes in place.
Frequently Asked Questions About Data Lakes
What is the difference between a data lake and a data warehouse?
Data lakes store raw data in any format on cheap object storage; warehouses store structured tables optimised for SQL queries. Lakes are cheaper to store but slower to query; warehouses are more expensive but faster.
What is a lakehouse?
A lakehouse architecture combines the flexibility and economics of a data lake with the performance and governance of a data warehouse. The approach was pioneered by Databricks and is powered by table formats like Delta Lake, Iceberg, and Hudi.
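One way to build intuition for what a table format adds is a toy model in which a table's state is a series of immutable snapshots, each listing the data files that make up the table at that point. The `ToyTable` class below is a hypothetical illustration of that general idea only; the real formats differ substantially in detail, and none of these names come from their APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table whose state is a list of immutable snapshots of file names."""
    snapshots: list = field(default_factory=lambda: [frozenset()])

    def commit(self, add=(), remove=()):
        """Publish a new snapshot derived from the latest one; return its id."""
        current = self.snapshots[-1]
        new = (current - frozenset(remove)) | frozenset(add)
        self.snapshots.append(new)  # one append: readers see all of it or none
        return len(self.snapshots) - 1

    def files(self, snapshot_id=None):
        """List data files as of a snapshot (the latest by default)."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return sorted(self.snapshots[snapshot_id])


table = ToyTable()
v1 = table.commit(add=["part-0001.parquet", "part-0002.parquet"])
v2 = table.commit(add=["part-0003.parquet"], remove=["part-0001.parquet"])

print(table.files())    # current state of the table
print(table.files(v1))  # "time travel" back to the earlier snapshot
```

Because old snapshots are never modified, a reader pinned to `v1` keeps seeing a consistent table while `v2` is written, which is the intuition behind both the transactional guarantees and time travel.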
Should I use a data lake or a data warehouse?
Most modern data teams use both: a lake for cheap raw storage and ML, and a warehouse for fast SQL analytics. The lakehouse pattern increasingly merges them: store everything in the lake, then query it with warehouse-grade performance via Iceberg or Delta.
Is a data lake just a folder of files?
Conceptually yes — object storage with files in Parquet, ORC, or similar formats. The difference between “a folder of files” and “a data lake” is the metadata catalog, governance, and query engines on top.
Do data lakes support SQL?
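Yes. Engines like Athena, Trino, Presto, BigQuery, and Snowflake can query lake data using SQL. Modern lakehouse table formats (Iceberg, Delta) add ACID transactions, so engines that support them can also run transactional SQL writes such as updates, deletes, and merges against lake tables.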
How do data lakes handle data quality?
Through a combination of schema enforcement (via lakehouse table formats), data quality tools (Great Expectations, dbt tests), and curated zones (raw / cleaned / curated). Many lakes follow a “medallion architecture” with bronze, silver, and gold layers.
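The bronze, silver, and gold flow can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the order records, the `SCHEMA` mapping, and the `to_silver` and `to_gold` functions are all hypothetical, and a real pipeline would typically rely on a table format's schema enforcement plus a dedicated data quality tool rather than hand-written checks.

```python
# Bronze: raw records exactly as they arrived, including malformed ones.
bronze = [
    {"order_id": "1", "amount": "19.99", "country": "DE"},
    {"order_id": "2", "amount": "not-a-number", "country": "FR"},
    {"order_id": "3", "amount": "5.00"},  # missing country
    {"order_id": "4", "amount": "30.01", "country": "DE"},
]

# Hypothetical schema: column name -> function that parses and validates it.
SCHEMA = {"order_id": int, "amount": float, "country": str}


def to_silver(records):
    """Enforce the schema; return (clean rows, rejected rows)."""
    clean, rejected = [], []
    for rec in records:
        try:
            clean.append({col: cast(rec[col]) for col, cast in SCHEMA.items()})
        except (KeyError, ValueError):
            rejected.append(rec)  # keep bad rows for inspection, not deletion
    return clean, rejected


def to_gold(rows):
    """Aggregate cleaned rows into revenue per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return {country: round(total, 2) for country, total in totals.items()}


silver, rejected = to_silver(bronze)
gold = to_gold(silver)

print(len(silver), "clean rows;", len(rejected), "rejected")
print(gold)
```

Keeping rejected rows alongside the clean ones mirrors a common design choice: the bronze layer stays untouched as the source of truth, so a fixed parser can reprocess it later.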