Lecture #07: Code Generation & Compilation

Date: Feb 14 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/07-compilation.pdf

Reading

Efficiently Compiling Efficient Query Plans for Modern Hardware (T. Neumann, VLDB 2011)
Generating Code for Holistic Query Evaluation (K. Krikellas, et al., ICDE 2010) (Optional)
Bringing Compiling Databases to RISC Architectures (F. Gruber, et al., VLDB 2023) (Optional)
How to Architect a Query Compiler (A. Shaikhha, et al., SIGMOD 2016) (Optional)
Adaptive Execution of Compiled Queries (A. Kohn, et al., ICDE 2018) (Optional)

Raw Lecture Notes

The 2011 paper from Thomas Neumman is a “seminal” paper, despite it not being the introduction of query compilation for DBMSs.
- (MS has a paper from 2011 too, the original academic paper is Hique from 2016)
Code compilation/specialization is all about reducing instruction count.
There’s 2 variations:
- Source-to-source or transpilation (this might be a full generation like generating C++, or using LLVM)
- JIT compilation
The generation process makes the code harder to debug, but very good for performance. It’s especially hard to debug LLVM code, raw C++ generated code is a easier to reason about.
(Postgres is doing some JIT? WHERE clause? I need to look into this, I’m curious)
Typically, DBMS systems codegen only a few things, in this order:
- predicate evaluation
- query operators
- stored procedures
- …
With shared/linked code like in Hique (see paper above), the generated code can call any regular function in the rest of the DBMS code like network or I/O stuff.
Query Compilation Cost
- Big topic. How can systems prevent high code compilation costs that can easily be larger than the query execution time if interpreted?
- The more complex the query, the bigger the compilation time.
- Idea from 2018 paper: run an LLVM IR interpreter in parallell to the compilation.
- (I also know that SingleStore supports an interpret_first mode where the first time a query is run, it is interpreted. And the compilation runs async and will be used for later executions. SingleStore supports other modes too.)
Redshift
- query fragments are compiled to templated C++ code
- paper from SIGMOD 2022 talks about a global cache of these for all customers of Redshift with a 99.9+% hit rate
SQLite
- they also do some compilation
- it generates some bytecode for a custom VM
The VLDB 2021 paper on Umbra is probably good for me to read.