Lecture #18: System Analysis (Databricks / Spark)

Date: Apr 10 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/18-databricks.pdf

Reading

Photon: A Fast Query Engine for Lakehouse Systems (A. Behm, et al., SIGMOD 2022)
Analyzing and Comparing Lakehouse Storage Systems (P. Jain, et al., CIDR 2023) (Optional)
Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores (M. Armbrust, et al., VLDB 2020) (Optional)

Raw Lecture Notes

Spark is going to be the first system we study that runs on the JVM (it is written mostly in Scala, with Java APIs). This matters, as we’ll see later.
Spark SQL (2015)
- row-based SQL engine natively inside the Spark runtime
- in-memory columnar representation of data between query operators
- some query codegen
Problems with JVM
- garbage collector slowdown
- JIT codegen limitations for large methods
So this is where Databricks Photon comes in. Single-threaded C++ execution engine embedded into Databricks Runtime (DBR) via JNI.
- It’s even more low-level than Velox, but also a library.
- It helps accelerate execution of query plans over raw/uncurated files in a data lake.
Features of Photon
- Shared-Disk / Disaggregated Storage
- Pull-based Vectorized Query Processing
- Precompiled primitives + Expression Fusion
- Shuffle-based distributed QE
- sort-merge and hash joins
- unified query optimizer + adaptive optimizations
(similar to Dremel in many ways. e.g., the “Driver” in Photon is like the “Coordinator” in Dremel)
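To make "pull-based vectorized query processing" concrete, here is a minimal sketch in Python. It is not Photon code (Photon is C++). The class names `Scan` and `Filter`, the `get_next()` method, and the helper `sum_all` are hypothetical names chosen for illustration. The point is only that each operator pulls from its child, and that each call returns a batch of values rather than a single tuple.

```python
from typing import List, Optional

Batch = List[int]  # a single column vector; real batches carry many columns


class Scan:
    """Leaf operator: hands out its input in fixed-size batches."""

    def __init__(self, values: List[int], batch_size: int) -> None:
        self.values = values
        self.batch_size = batch_size
        self.pos = 0

    def get_next(self) -> Optional[Batch]:
        if self.pos >= len(self.values):
            return None  # end of stream
        batch = self.values[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch


class Filter:
    """Pulls a batch from its child and keeps the values above a threshold."""

    def __init__(self, child: Scan, threshold: int) -> None:
        self.child = child
        self.threshold = threshold

    def get_next(self) -> Optional[Batch]:
        while True:
            batch = self.child.get_next()  # pull from the child operator
            if batch is None:
                return None
            selected = [v for v in batch if v > self.threshold]
            if selected:  # skip batches in which nothing qualified
                return selected


def sum_all(root: Filter) -> int:
    """Root of the plan: keeps pulling batches until the stream ends."""
    total = 0
    batch = root.get_next()
    while batch is not None:
        total += sum(batch)
        batch = root.get_next()
    return total


plan = Filter(Scan(list(range(10)), batch_size=4), threshold=5)
result = sum_all(plan)
```

The per-call overhead (the `get_next()` dispatch) is paid once per batch instead of once per tuple, which is the main reason for vectorizing.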
The Photon engine uses precompiled operator kernels (primitives) and does not generate code for individual queries at runtime. In the opinion of Andy and people from Databricks, the software engineering overhead of code generation is not worth it: systems that do codegen are really hard to debug.
With vectorization alone, the achievable performance is roughly the same as with code generation, but working on the system is much easier.
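As a rough illustration of what "precompiled primitives, no query codegen" means, the sketch below interprets an expression tree by looking up kernels that exist before any query arrives. Everything here (`KERNELS`, `evaluate`, the tuple encoding of expressions, the column names) is a hypothetical stand-in in Python, not Photon's actual design.

```python
from typing import Callable, Dict, List, Union

Vector = List[float]


def add_kernel(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def mul_kernel(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


# The kernels are fixed ahead of time; a query only selects among them.
KERNELS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "add": add_kernel,
    "mul": mul_kernel,
}

# An expression is a column name or a tuple (op, left, right).
Expr = Union[str, tuple]


def evaluate(expr: Expr, batch: Dict[str, Vector]) -> Vector:
    """Interpret the expression tree: one kernel call per node per batch."""
    if isinstance(expr, str):
        return batch[expr]
    op, left, right = expr
    return KERNELS[op](evaluate(left, batch), evaluate(right, batch))


batch = {"price": [1.0, 2.0], "qty": [3.0, 4.0], "tax": [0.5, 0.5]}
total = evaluate(("add", ("mul", "price", "qty"), "tax"), batch)
```

The interpretation cost (tree walk and kernel lookup) is amortized over the whole batch, and every kernel is ordinary code that can be stepped through in a debugger, which is the engineering argument made above.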
Expression Fusion in Photon
- when they see two operators that are used together, they run a single fused primitive instead of two separate precompiled primitives
- because Databricks is a managed service, they have looked at which operators are most commonly used together and already ship those combinations as precompiled primitives.
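A small sketch of the idea behind fusion, using a range check as the example. This is an illustrative assumption rather than Photon's kernel set: the function names are hypothetical, and the real kernels are C++. The unfused version makes three passes and materializes two intermediate vectors, while the fused version does the same work in one pass.

```python
from typing import List


def ge_kernel(col: List[int], lo: int) -> List[bool]:
    return [v >= lo for v in col]


def le_kernel(col: List[int], hi: int) -> List[bool]:
    return [v <= hi for v in col]


def and_kernel(a: List[bool], b: List[bool]) -> List[bool]:
    return [x and y for x, y in zip(a, b)]


def between_unfused(col: List[int], lo: int, hi: int) -> List[bool]:
    # Three kernel calls and two intermediate vectors.
    return and_kernel(ge_kernel(col, lo), le_kernel(col, hi))


def between_fused(col: List[int], lo: int, hi: int) -> List[bool]:
    # One pass over the column, no intermediates.
    return [lo <= v <= hi for v in col]


col = [1, 5, 10, 15]
fused = between_fused(col, 5, 10)
unfused = between_unfused(col, 5, 10)
```

Both versions return the same result. The fused one simply avoids the extra kernel dispatches and intermediate buffers.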
Adaptivity
- like Dremel, they might adjust the number of workers for different stages while query execution progresses
- but they also adapt how an individual operator executes on a specific batch of tuples depending on how the data is laid out
  - similar to some things Velox does
  - they might switch between a shuffle join and a broadcast join, etc.
- This is particularly important in Databricks because there are very few statistics for the underlying data.
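The batch-level adaptivity described above can be sketched as choosing a specialized kernel per batch based on cheap metadata. The example below assumes, for illustration only, that each batch records whether it contains NULLs. The names `Batch`, `make_batch`, and `add_one` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Batch:
    values: List[Optional[int]]
    has_nulls: bool  # metadata recorded when the batch is produced


def make_batch(values: List[Optional[int]]) -> Batch:
    return Batch(values, any(v is None for v in values))


def add_one_no_nulls(values: List[Optional[int]]) -> List[Optional[int]]:
    # Fast path: no per-value NULL check.
    return [v + 1 for v in values]


def add_one_with_nulls(values: List[Optional[int]]) -> List[Optional[int]]:
    # General path: NULLs pass through unchanged.
    return [None if v is None else v + 1 for v in values]


def add_one(batch: Batch) -> Tuple[str, List[Optional[int]]]:
    """Pick the kernel for this batch; return its name and the result."""
    kernel = add_one_with_nulls if batch.has_nulls else add_one_no_nulls
    return kernel.__name__, kernel(batch.values)


dense = add_one(make_batch([1, 2, 3]))
sparse = add_one(make_batch([1, None, 3]))
```

Because the decision is made from the data actually seen at runtime, it does not depend on table statistics being available.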
Photon costs more on Databricks than Spark SQL.
- (even though it uses fewer resources)
(Andy goes on a rant on official TPC auditing, and benchmark wars. Very fun.)
DataFusion is an open-source, Velox-like engine that can be used to speed up Spark (via the DataFusion Comet project), in much the same way that Photon does.