Date: Apr 8 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/17-bigquery.pdf
Reading
Raw Lecture Notes
- observation: in the early 2000s, Google had a huge influence on the data engineering ecosystem. Whenever Google put out a paper, the industry would copy it “because Google”
- a lot of the initial Google systems were NoSQL, but that’s no longer the case
- “Dremel was one of the first systems to reintroduce SQL for Big Data at Google”
- Let’s learn about Dremel:
- context: there was all this data being generated into GFS (Google File System), and the only way to study it was by writing MapReduce jobs in C++; engineers at Google wanted to use SQL instead
- initial version didn’t have joins
- released in 2012 as the commercial product BigQuery
- Dremel is an in-situ data processing engine, basically a data lake engine that works on files already in GFS
- Dremel’s key design points (its columnar storage model inspired formats like Parquet and ORC):
- shared-disk architecture
- vectorized query processing
- shuffle-based distributed query execution
- columnar storage (zone maps, filters, RLE; see the sketch after this list)
- hash joins only
- heuristic optimizer + adaptive optimizations
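- A minimal sketch (my own illustration in Python, not Dremel’s actual code; all function names are made up) of two of the columnar-storage tricks above: zone maps let a scan skip whole blocks, and RLE compresses runs of repeated values:

```python
# Hypothetical sketch of two columnar-storage techniques: zone maps and RLE.

def rle_encode(values):
    """Run-length encode a column into [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def build_zone_map(blocks):
    """One (min, max) summary per block; scans use it to skip blocks."""
    return [(min(b), max(b)) for b in blocks]

def scan_greater_than(blocks, zone_map, threshold):
    """Return values > threshold, reading only blocks the zone map can't rule out."""
    out = []
    for block, (_, block_max) in zip(blocks, zone_map):
        if block_max <= threshold:  # no row in this block can qualify; skip it
            continue
        out.extend(v for v in block if v > threshold)
    return out

blocks = [[1, 2, 3], [4, 4, 4], [9, 10, 12]]
print(scan_greater_than(blocks, build_zone_map(blocks), 8))  # [9, 10, 12]
print(rle_encode([4, 4, 4, 7, 7]))                           # [(4, 3), (7, 2)]
```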
- Dremel’s query architecture: a coordinator node splits the query into tasks and distributes them across worker nodes in multiple stages (a toy sketch of this flow follows the list below)
- e.g. stage 1: partial group by
- e.g. stage 2: group by / sort / limit
- e.g. stage 3: sort/limit
- (Some parts of stage 2 can be started before all of stage 1 is done. This reminds me a lot of GitLab CI stages where you can have jobs in a later stage start before an entire previous stage is done.)
- All the worker nodes between two consecutive stages communicate through a shared memory buffer.
- The workers of the last stage write the results to filesystem/return to client.
- This shared memory buffer can be spilled to disk as needed.
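- To make the staged flow concrete, here’s a toy Python sketch (all names are mine, not Dremel’s API): stage 1 workers do partial group-bys on their input splits, results land in a hash-partitioned shuffle buffer, and stage 2 workers finish the group-by, sort, and apply the limit:

```python
from collections import defaultdict

def partial_group_by(rows):
    """Stage 1 on one worker: pre-aggregate its input split."""
    acc = defaultdict(int)
    for key, val in rows:
        acc[key] += val
    return acc

def shuffle(partials, num_partitions):
    """Shared buffer between stages: hash-partition partial sums by key."""
    buffer = [defaultdict(int) for _ in range(num_partitions)]
    for acc in partials:
        for key, val in acc.items():
            buffer[hash(key) % num_partitions][key] += val
    return buffer

def final_stage(partition, limit):
    """Stage 2 on one worker: the merge finished the group-by; sort and limit."""
    return sorted(partition.items(), key=lambda kv: -kv[1])[:limit]

splits = [[("a", 1), ("b", 2)], [("a", 5), ("c", 1)], [("b", 3)]]
partials = [partial_group_by(s) for s in splits]      # stage 1: one task per split
partitions = shuffle(partials, num_partitions=2)      # write into the shuffle buffer
print([final_stage(p, limit=2) for p in partitions])  # stage 2: one task per partition
```

- Note that a stage-2 task only needs its own partition, which is why parts of stage 2 can start before all of stage 1 finishes (the pipelining noted above); this toy version runs the stages strictly in order.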
- What is “the shuffle”? A shuffle is a checkpoint in the middle of the query. There can be many shuffles, and it’s the coordinator’s responsibility to make sure all tasks for each stage are completed.
- The coordinator may choose to kill a task on one node and give the same task to another node that is making faster progress.
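- A hedged sketch of that speculation idea (generic Python, nothing Dremel-specific): run a backup copy of a straggling task and take whichever copy finishes first, which is safe because a task’s output for the shuffle checkpoint is deterministic:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def task(simulated_delay):
    """A deterministic task; the delay stands in for a slow vs. fast worker."""
    time.sleep(simulated_delay)
    return sum(range(1000))

with ThreadPoolExecutor(max_workers=2) as pool:
    primary = pool.submit(task, 2.0)  # the straggler
    backup = pool.submit(task, 0.1)   # speculative copy on a faster node
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    print(done.pop().result())        # coordinator keeps the first result: 499500
```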
- Why have stages and not just a full-blown DAG?
- Because as each stage completes, the coordinator can decide to change the query plan for the stages ahead.
- Also easier to manage data cleanup and probably simplifies other software engineering things.
- GFS is like S3
- Query Optimizer
- it’s adaptive, stage by stage
- it might change # of workers per stage, or change the join order, but not so much the entire query plan
- another clever thing Dremel/BigQuery does is adapt to partition skew in intermediate results: it instructs worker nodes to change their partitioning scheme based on statistics from previous stages
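- Roughly how that adaptation could look (a toy heuristic I made up, not BigQuery’s actual algorithm): after a stage finishes, inspect the observed partition sizes and fan out any hot partition before the next stage:

```python
import math

def plan_fanout(partition_sizes, skew_threshold=2.0):
    """Fan-out per partition: split partitions much larger than the average."""
    avg = sum(partition_sizes) / len(partition_sizes)
    return [math.ceil(s / avg) if s > skew_threshold * avg else 1
            for s in partition_sizes]

observed = [100, 90, 4000, 110]   # partition 2 is badly skewed
print(plan_fanout(observed))      # [1, 1, 4, 1] -> split the hot partition 4 ways
```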
- Storage
- “Colossus” (the successor to GFS) is Google’s S3-like storage system
- they use the Capacitor file format, which is like ORC/Parquet
- it supports some type of filtering on compressed data, which ORC/Parquet don’t (yet) support
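- One way filtering on compressed data can work in general (a sketch of the technique, not Capacitor’s actual format): with RLE, evaluate the predicate once per run instead of once per row:

```python
def count_matches_rle(runs, predicate):
    """Count qualifying rows from [(value, run_length), ...] without decompressing."""
    return sum(length for value, length in runs if predicate(value))

runs = [(4, 1000), (7, 3), (4, 500)]              # 1503 rows, only 3 predicate calls
print(count_matches_rle(runs, lambda v: v == 4))  # 1500
```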
- There are now open source systems inspired by Dremel:
- Apache Impala
- Presto (Meta)
- forked into Trino (commercialized by Starburst)
- (presto→prestoDB→prestoSQL→Trino 😕)
- Hive, Presto, and Trino are all query engines
- Apache Drill
- Dremio
- has materialized views and its own query engine