Overview of the main idea (3 sentences)

Key findings / takeaways from the paper (2-3 sentences)

Data lakes can now be used somewhat like data warehouses, with SQL access and features such as schema enforcement, schema evolution, and data governance. Databricks has been “leading the charge” here with Delta Lake and Spark SQL (though Apache Iceberg and Hive are also popular in this space).
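
As a concrete illustration (my own sketch, not taken from the paper): with the open-source delta-spark package, PySpark can define a SQL-addressable table directly over files in object storage and evolve its schema on write. The table location and column names here are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session configured for Delta Lake; these two settings are needed when
# using the open-source delta-spark package outside of Databricks.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A SQL-addressable table whose data lives as open-format files in object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (user_id BIGINT, action STRING)
    USING DELTA
    LOCATION 's3://my-bucket/lakehouse/events'
""")

# Schema evolution: append rows carrying an extra column and let the table
# schema grow, rather than having the write rejected.
new_rows = spark.createDataFrame(
    [(1, "click", "2021-01-01")], ["user_id", "action", "event_date"]
)
(new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("events"))
```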

With this architecture, developers can directly access the storage layer (open-format files such as Parquet) of their analytics data repositories. That enables common ML workflows, such as running Python scripts with TensorFlow, pandas, etc., on top of those files.
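
For example (again my own sketch, reusing the hypothetical table path from above and assuming pandas with pyarrow/s3fs installed): a plain Python script can read the table's Parquet files straight out of object storage and feed them into an ML pipeline, with no SQL engine in between.

```python
import pandas as pd
import tensorflow as tf

# Read the table's underlying Parquet files directly from object storage;
# no warehouse or SQL endpoint sits in the middle.
df = pd.read_parquet("s3://my-bucket/lakehouse/events/")

# Hand the data to ordinary Python ML tooling, here a tf.data input pipeline.
features = df[["user_id"]].to_numpy()
labels = (df["action"] == "click").astype("int64").to_numpy()
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)
```

This is the same directory that the SQL table points at, which is what gives the "one copy of the data, two interfaces" story.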

My main “open question” after reading this paper is how a data lakehouse can ensure that schemas are respected when, alongside regular SQL, direct access to the underlying files is possible. I do not currently fully understand how that can work.
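
To make the enforcement half of this concrete (my own sketch, continuing the Spark session and hypothetical table from above): writes that go through the Delta writer are validated against the schema recorded in the table's transaction log and rejected on a mismatch. What I do not see is what constrains a process that writes Parquet files into the same directory without going through that writer.

```python
# Continuing the session above: an append whose schema does not match the table
# is rejected by the Delta writer (it raises an AnalysisException), because the
# expected schema is recorded in the table's _delta_log transaction log.
bad_rows = spark.createDataFrame([(42, 3.14)], ["user_id", "unexpected_metric"])
try:
    bad_rows.write.format("delta").mode("append").saveAsTable("events")
except Exception as err:  # AnalysisException: schema mismatch
    print("write rejected:", err)

# A job that writes Parquet files straight into the table directory never
# touches _delta_log, which is exactly the gap my question is about.
```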

System used in evaluation and how it was modified/extended (1 sentence)

The authors mainly discuss Databricks and the components it is built from (Delta Lake, the Delta Engine, and Spark SQL).

Workload Evaluated (1 sentence)

The paper evaluates TPC-DS on a data lakehouse (Databricks) and on several unnamed data warehouses. In general, the paper focuses only on pure analytical workloads.