Date: Apr 17 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/20-duckdb.pdf
Reading
Raw Lecture Notes
- In 2017, CWI researchers created an embeddable version of MonetDB called MonetDBLite
- there is no
main
function, it’s just used as a library (it can spin up its own threads, etc.)
- also called “in-process DB”
- but MonetDB had too much legacy
- So they created DuckDB
- Core is pure C++ with almost no dependencies
- SQL dialect is based on Postgres (they took the Postgres grammar file)
- has some very nice quality-of-life enhancements
- it is basically “SQLite for analytics”
- DuckDB runs quite easily on wasm compared to say, Postgres
- Properties
- Shared-Everything
- Push-based Vectorized Query Processing
- was pull-based originally
- Precompiled Primitives
- MVCC
- Morsel Parallelism + Scheduling
- PAX Columnar Storage (of course)
- Sort-Merge + Hash Joins
- Stratified Query Optimizer
- with support for unnesting of arbitrary subqueries, which we studied before and the only other system that does it is Umbra
- Push-Based vs Pull-Based
- Initially was pull-based, now it’s push-based since 2021
- The authors found pull-based was hard to expand to new types of parallelism
- Query Engine
- tl;dr: lots of cool stuff
- Storage is all Vectors
- 4 different physical representations depending on the data that’s being stored
- DuckDB can be used directly by dataframe libraries like Python’s Ibis, their connectors generate DuckDB logical plans thus bypassing the SQL parser. The results are zero-copied with Apache arrow.
- DuckDB is very extensible. For instance, there’s an extension that allows it to parse Substrait plans. This is pretty advanced stuff to be able to build through extensions.
- DuckDB supports Parquet/SQLite/JSON files/Arrow…