Lecture #20: System Analysis (DuckDB)

Date: Apr 17 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/20-duckdb.pdf

Reading

DuckDB: an Embeddable Analytical Database (M. Raasveldt, et al., SIGMOD 2019)
Data Management for Data Science Towards Embedded Analytics (M. Raasveldt, et al., CIDR 2020) (Optional)
MotherDuck: DuckDB in the cloud and in the client (RJ. Atwal, et al., CIDR 2024) (Optional)

Raw Lecture Notes

In 2017, CWI researchers created an embeddable version of MonetDB called MonetDBLite
- there is no main function, it’s just used as a library (it can spin up its own threads, etc.)
- also called “in-process DB”
- but MonetDB had too much legacy
So they created DuckDB
Core is pure C++ with almost no dependencies
SQL dialect is based on Postgres (they took the Postgres grammar file)
- has some very nice quality-of-life enhancements
it is basically “SQLite for analytics”
DuckDB runs quite easily on wasm compared to say, Postgres
Properties
- Shared-Everything
- Push-based Vectorized Query Processing
  - was pull-based originally
- Precompiled Primitives
- MVCC
- Morsel Parallelism + Scheduling
- PAX Columnar Storage (of course)
- Sort-Merge + Hash Joins
- Stratified Query Optimizer
  - with support for unnesting of arbitrary subqueries, which we studied before and the only other system that does it is Umbra
Push-Based vs Pull-Based
- Initially was pull-based, now it’s push-based since 2021
- The authors found pull-based was hard to expand to new types of parallelism
Query Engine
- tl;dr: lots of cool stuff
Storage is all Vectors
- 4 different physical representations depending on the data that’s being stored
DuckDB can be used directly by dataframe libraries like Python’s Ibis, their connectors generate DuckDB logical plans thus bypassing the SQL parser. The results are zero-copied with Apache arrow.
DuckDB is very extensible. For instance, there’s an extension that allows it to parse Substrait plans. This is pretty advanced stuff to be able to build through extensions.
DuckDB supports Parquet/SQLite/JSON files/Arrow…