Overview of the main idea (3 sentences)

Key findings / takeaways from the paper (2-3 sentences)

Data lakes can now be used somewhat like data warehouses, with SQL access and features such as schema enforcement, schema evolution, and data governance. Databricks has been “leading the charge” here with Delta Lake and Spark SQL (though Apache Iceberg and Hive are also popular in this space).
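
As a concrete illustration (my own sketch, not taken from the paper): with the open-source delta-spark package, PySpark can define a SQL-addressable table directly over files in object storage and evolve its schema on write. The table location and column names here are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session configured for Delta Lake; these two settings are needed when
# using the open-source delta-spark package outside of Databricks.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A SQL-addressable table whose data lives as open-format files in object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (user_id BIGINT, action STRING)
    USING DELTA
    LOCATION 's3://my-bucket/lakehouse/events'
""")

# Schema evolution: append rows carrying an extra column and let the table
# schema grow, rather than having the write rejected.
new_rows = spark.createDataFrame(
    [(1, "click", "2021-01-01")], ["user_id", "action", "event_date"]
)
(new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("events"))
```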

With this architecture, developers can directly access the storage layer (open-format files such as Parquet) of their analytics data repositories. That enables common ML workflows, such as running Python scripts with TensorFlow, pandas, etc., on top of those files.
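
For example (again my own sketch, reusing the hypothetical table path from above and assuming pandas with pyarrow/s3fs installed): a plain Python script can read the table's Parquet files straight out of object storage and feed them into an ML pipeline, with no SQL engine in between.

```python
import pandas as pd
import tensorflow as tf

# Read the table's underlying Parquet files directly from object storage;
# no warehouse or SQL endpoint sits in the middle.
df = pd.read_parquet("s3://my-bucket/lakehouse/events/")

# Hand the data to ordinary Python ML tooling, here a tf.data input pipeline.
features = df[["user_id"]].to_numpy()
labels = (df["action"] == "click").astype("int64").to_numpy()
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)
```

This is the same directory that the SQL table points at, which is what gives the "one copy of the data, two interfaces" story.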

My main “open question” after reading this paper is how a data lakehouse can ensure that schemas are respected when, alongside regular SQL, direct access to the underlying files is possible. I do not currently fully understand how that can work.
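
To make the enforcement half of this concrete (my own sketch, continuing the Spark session and hypothetical table from above): writes that go through the Delta writer are validated against the schema recorded in the table's transaction log and rejected on a mismatch. What I do not see is what constrains a process that writes Parquet files into the same directory without going through that writer.

```python
# Continuing the session above: an append whose schema does not match the table
# is rejected by the Delta writer (it raises an AnalysisException), because the
# expected schema is recorded in the table's _delta_log transaction log.
bad_rows = spark.createDataFrame([(42, 3.14)], ["user_id", "unexpected_metric"])
try:
    bad_rows.write.format("delta").mode("append").saveAsTable("events")
except Exception as err:  # AnalysisException: schema mismatch
    print("write rejected:", err)

# A job that writes Parquet files straight into the table directory never
# touches _delta_log, which is exactly the gap my question is about.
```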

System used in evaluation and how it was modified/extended (1 sentence)

The authors mainly discuss Databricks and the components it is built from (Delta Lake, the Delta Engine, and Spark SQL).

Workload Evaluated (1 sentence)

The paper evaluates TPC-DS on a data lakehouse (Databricks) and on several unnamed data warehouses. In general, the paper focuses only on pure analytical workloads.