Date: Apr 8 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/17-bigquery.pdf
Reading
Raw Lecture Notes
- observation: in the early 2000s, Google had a huge influence on the data engineering ecosystem. Whenever Google put out a paper, the industry would copy it “because Google”
- a lot of the initial Google systems were NoSQL, but that’s no longer the case
- “Dremel was one of the first systems to reintroduce SQL for Big Data at Google”
- Let’s learn about Dremel:
- context: there was all this data being generated into GFS (Google File System), and the only way to study it was by writing MapReduce jobs in C++; engineers at Google wanted to use SQL instead
- initial version didn’t have joins
- released in 2012 as the commercial product BigQuery
- Dremel is an in-situ data processing engine, basically a data lake engine that works on files already in GFS
- Dremel’s key design points (its columnar storage model inspired formats like Parquet and ORC):
- shared-disk architecture
- vectorized query processing
- shuffle-based distributed query execution
- columnar storage (zone maps, filters, RLE; see the sketch after this list)
- hash joins only
- heuristic optimizer + adaptive optimizations
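- A minimal sketch (my own illustration in Python, not Dremel’s actual code; all function names are made up) of two of the columnar-storage tricks above: zone maps let a scan skip whole blocks, and RLE compresses runs of repeated values:

```python
# Hypothetical sketch of two columnar-storage techniques: zone maps and RLE.

def rle_encode(values):
    """Run-length encode a column into [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def build_zone_map(blocks):
    """One (min, max) summary per block; scans use it to skip blocks."""
    return [(min(b), max(b)) for b in blocks]

def scan_greater_than(blocks, zone_map, threshold):
    """Return values > threshold, reading only blocks the zone map can't rule out."""
    out = []
    for block, (_, block_max) in zip(blocks, zone_map):
        if block_max <= threshold:  # no row in this block can qualify; skip it
            continue
        out.extend(v for v in block if v > threshold)
    return out

blocks = [[1, 2, 3], [4, 4, 4], [9, 10, 12]]
print(scan_greater_than(blocks, build_zone_map(blocks), 8))  # [9, 10, 12]
print(rle_encode([4, 4, 4, 7, 7]))                           # [(4, 3), (7, 2)]
```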
- Dremel’s query architecture: a coordinator node splits the query into tasks and distributes them across worker nodes in multiple stages (a toy sketch of this flow follows the list below)
- e.g. stage 1: partial group by
- e.g. stage 2: group by / sort / limit
- e.g. stage 3: sort/limit
- (Some parts of stage 2 can be started before all of stage 1 is done. This reminds me a lot of GitLab CI stages where you can have jobs in a later stage start before an entire previous stage is done.)
- All the worker nodes between two consecutive stages communicate through a shared memory buffer.
- The workers of the last stage write the results to filesystem/return to client.
- This shared memory buffer can be spilled to disk as needed.
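- To make the staged flow concrete, here’s a toy Python sketch (all names are mine, not Dremel’s API): stage 1 workers do partial group-bys on their input splits, results land in a hash-partitioned shuffle buffer, and stage 2 workers finish the group-by, sort, and apply the limit:

```python
from collections import defaultdict

def partial_group_by(rows):
    """Stage 1 on one worker: pre-aggregate its input split."""
    acc = defaultdict(int)
    for key, val in rows:
        acc[key] += val
    return acc

def shuffle(partials, num_partitions):
    """Shared buffer between stages: hash-partition partial sums by key."""
    buffer = [defaultdict(int) for _ in range(num_partitions)]
    for acc in partials:
        for key, val in acc.items():
            buffer[hash(key) % num_partitions][key] += val
    return buffer

def final_stage(partition, limit):
    """Stage 2 on one worker: the merge finished the group-by; sort and limit."""
    return sorted(partition.items(), key=lambda kv: -kv[1])[:limit]

splits = [[("a", 1), ("b", 2)], [("a", 5), ("c", 1)], [("b", 3)]]
partials = [partial_group_by(s) for s in splits]      # stage 1: one task per split
partitions = shuffle(partials, num_partitions=2)      # write into the shuffle buffer
print([final_stage(p, limit=2) for p in partitions])  # stage 2: one task per partition
```

- Note that a stage-2 task only needs its own partition, which is why parts of stage 2 can start before all of stage 1 finishes (the pipelining noted above); this toy version runs the stages strictly in order.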
- What is “the shuffle”? A shuffle is a checkpoint in the middle of the query. There can be many shuffles, and it’s the coordinator’s responsibility to make sure all tasks for each stage are completed.
- The coordinator may choose to kill a task on one node and give the same task to another node that is making faster progress.
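- A hedged sketch of that speculation idea (generic Python, nothing Dremel-specific): run a backup copy of a straggling task and take whichever copy finishes first, which is safe because a task’s output for the shuffle checkpoint is deterministic:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def task(simulated_delay):
    """A deterministic task; the delay stands in for a slow vs. fast worker."""
    time.sleep(simulated_delay)
    return sum(range(1000))

with ThreadPoolExecutor(max_workers=2) as pool:
    primary = pool.submit(task, 2.0)  # the straggler
    backup = pool.submit(task, 0.1)   # speculative copy on a faster node
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    print(done.pop().result())        # coordinator keeps the first result: 499500
```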
- Why have stages and not just a full-blown DAG?
- Because as each stage completes, the coordinator can decide to change the query plan for the stages ahead.
- Also easier to manage data cleanup and probably simplifies other software engineering things.
- GFS is like S3
- Query Optimizer
- it’s adaptive, stage by stage
- it might change # of workers per stage, or change the join order, but not so much the entire query plan
- another clever thing Dremel/BigQuery does is adapt to partition skew in intermediate results: it instructs worker nodes to change their partitioning scheme based on statistics from previous stages
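- Roughly how that adaptation could look (a toy heuristic I made up, not BigQuery’s actual algorithm): after a stage finishes, inspect the observed partition sizes and fan out any hot partition before the next stage:

```python
import math

def plan_fanout(partition_sizes, skew_threshold=2.0):
    """Fan-out per partition: split partitions much larger than the average."""
    avg = sum(partition_sizes) / len(partition_sizes)
    return [math.ceil(s / avg) if s > skew_threshold * avg else 1
            for s in partition_sizes]

observed = [100, 90, 4000, 110]   # partition 2 is badly skewed
print(plan_fanout(observed))      # [1, 1, 4, 1] -> split the hot partition 4 ways
```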
- Storage
- “Colossus” (the successor to GFS) is Google’s S3-like storage system
- they use the Capacitor file format, which is like ORC/Parquet
- it supports some type of filtering on compressed data, which ORC/Parquet don’t (yet) support
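- One way filtering on compressed data can work in general (a sketch of the technique, not Capacitor’s actual format): with RLE, evaluate the predicate once per run instead of once per row:

```python
def count_matches_rle(runs, predicate):
    """Count qualifying rows from [(value, run_length), ...] without decompressing."""
    return sum(length for value, length in runs if predicate(value))

runs = [(4, 1000), (7, 3), (4, 500)]              # 1503 rows, only 3 predicate calls
print(count_matches_rle(runs, lambda v: v == 4))  # 1500
```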
- There are now open source systems inspired by Dremel:
- Apache Impala
- Presto (Meta)
- forked into Trino (commercialized by Starburst)
- (presto→prestoDB→prestoSQL→Trino 😕)
- Hive, Presto, and Trino are all query engines
- Apache Drill
- Dremio
- has materialized views and its own query engine