Overview of the main idea (3 sentences)

The authors argue that the data industry should move toward the lakehouse model and away from the warehouse model. They contend that this is the right direction for innovation at every level of the stack when it comes to building large-scale data analysis systems. They give examples of how modularization is already happening.

Key findings / takeaways from the paper (2-3 sentences)

I personally don’t have many takeaways. I think the authors’ vision is optimistic, because business incentives are not aligned to make this future happen. Still, I understand that from Meta’s point of view, a more modular and open ecosystem would be ideal. It would suit them to be able to “plug and play” different modules to build their dream data warehouse.

That said, it’s very hard for me to see how an execution engine could realistically be completely decoupled from a storage engine. The two are much easier to build hand-in-hand, because each can be optimized around the other.
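
To make the concern concrete, here is a minimal Python sketch of what a boundary between storage and execution might look like. Everything in it (`StorageLayer`, `Filter`, `InMemoryStorage`, `sum_column`) is hypothetical and made up for illustration; none of it comes from the paper or from any real engine. It only shows that even a basic optimization like filter pushdown has to be written into the shared interface.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol

Row = dict[str, float]


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter: keep rows where row[column] >= min_value."""
    column: str
    min_value: float

    def matches(self, row: Row) -> bool:
        return row[self.column] >= self.min_value


class StorageLayer(Protocol):
    """Hypothetical storage interface that an engine would code against."""

    def scan(self, columns: list[str], flt: Optional[Filter] = None) -> Iterator[Row]:
        ...


@dataclass
class InMemoryStorage:
    """Toy storage layer; counts how many rows it returns to the engine."""
    rows: list[Row]
    rows_returned: int = 0

    def scan(self, columns: list[str], flt: Optional[Filter] = None) -> Iterator[Row]:
        for row in self.rows:
            if flt is not None and not flt.matches(row):
                continue  # row is dropped inside the storage layer
            self.rows_returned += 1
            yield {c: row[c] for c in columns}


def sum_column(storage: StorageLayer, column: str,
               flt: Optional[Filter] = None, pushdown: bool = True) -> float:
    """Toy 'execution engine': sums one column, optionally filtered."""
    if flt is None or pushdown:
        # The storage layer applies the filter for us.
        rows = storage.scan([column], flt)
    else:
        # No pushdown: fetch every row, then filter in the engine.
        needed = sorted({column, flt.column})
        rows = (r for r in storage.scan(needed) if flt.matches(r))
    return sum(r[column] for r in rows)


data = [
    {"id": 1.0, "amount": 10.0},
    {"id": 2.0, "amount": 25.0},
    {"id": 3.0, "amount": 40.0},
]

with_pushdown = InMemoryStorage(list(data))
without_pushdown = InMemoryStorage(list(data))

total_a = sum_column(with_pushdown, "amount", Filter("amount", 20.0), pushdown=True)
total_b = sum_column(without_pushdown, "amount", Filter("amount", 20.0), pushdown=False)
```

Both calls compute the same total, but the storage layer hands back fewer rows when the filter is pushed down. In this toy the interface is easy to agree on; my worry is that the deeper optimizations (data layout, caching, statistics) are much harder to express through a generic interface like this one.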

Nevertheless, some modularization will happen and is already happening. For example, data warehouses and lakehouses are now broadly expected to support Iceberg. Even HTAP systems like SingleStoreDB will have to support it. Eventually, Iceberg could take over as the de facto table format for large-scale data storage.

System used in evaluation and how it was modified/extended (1 sentence)

The paper covers many different sub-systems of data lakehouses, such as the Velox execution engine, catalogs like the Hive Metastore (HMS), and storage layers like Iceberg and Delta Lake.

Workload Evaluated (1 sentence)

None really.