Date: Feb 14 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/07-compilation.pdf
Reading
- Efficiently Compiling Efficient Query Plans for Modern Hardware (T. Neumann, VLDB 2011)
- Generating Code for Holistic Query Evaluation (K. Krikellas, et al., ICDE 2010) (Optional)
- Bringing Compiling Databases to RISC Architectures (F. Gruber, et al., VLDB 2023) (Optional)
- How to Architect a Query Compiler (A. Shaikhha, et al., SIGMOD 2016) (Optional)
- Adaptive Execution of Compiled Queries (A. Kohn, et al., ICDE 2018) (Optional)
Raw Lecture Notes
- The 2011 paper from Thomas Neumman is a “seminal” paper, despite it not being the introduction of query compilation for DBMSs.
- (MS has a paper from 2011 too, the original academic paper is Hique from 2016)
- Code compilation/specialization is all about reducing instruction count.
- There’s 2 variations:
- Source-to-source or transpilation (this might be a full generation like generating C++, or using LLVM)
- JIT compilation
- The generation process makes the code harder to debug, but very good for performance. It’s especially hard to debug LLVM code, raw C++ generated code is a easier to reason about.
- (Postgres is doing some JIT?
WHERE
clause? I need to look into this, I’m curious)
- Typically, DBMS systems codegen only a few things, in this order:
- predicate evaluation
- query operators
- stored procedures
- …
- With shared/linked code like in Hique (see paper above), the generated code can call any regular function in the rest of the DBMS code like network or I/O stuff.
- Query Compilation Cost
- Big topic. How can systems prevent high code compilation costs that can easily be larger than the query execution time if interpreted?
- The more complex the query, the bigger the compilation time.
- Idea from 2018 paper: run an LLVM IR interpreter in parallell to the compilation.
- (I also know that SingleStore supports an interpret_first mode where the first time a query is run, it is interpreted. And the compilation runs async and will be used for later executions. SingleStore supports other modes too.)
- Redshift
- query fragments are compiled to templated C++ code
- paper from SIGMOD 2022 talks about a global cache of these for all customers of Redshift with a 99.9+% hit rate
- SQLite
- they also do some compilation
- it generates some bytecode for a custom VM
- The VLDB 2021 paper on Umbra is probably good for me to read.