Lecture #21: System Analysis (Yellowbrick)

Date: Apr 22 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/21-yellowbrick.pdf

Reading

Yellowbrick: An Elastic Data Warehouse on Kubernetes (M. Cusack, et al., CIDR 2024)

Raw Lecture Notes

Andy thinks Yellowbrick is very interesting, but it is not very well known. The most interesting parts are some low-level hackery they do for ultimate performance.
Historical context: a lot of DBMSs were built for a specific hardware config and sold as a hardware+software bundle.
- e.g. Oracle Exadata
Yellowbrick started like this!
They switched to a cloud DBaaS in 2021, using Kubernetes for everything.
They use the PG frontend (networking, parsing, catalog) for SQL.
This lecture is only about Yellowbrick Cloud.
Overview/Features
- Shared-Disk / Disaggregated Storage
- Push-based vectorized query processing
- Query Codegen (C++)
- Compute-side Caching
- Separate Row + PAX Columnar Storage
  - ingestion in rowstore and then compacted to columnar
- Sort-Merge + Hash Joins
- PG query optimizer, but improved by them
- Lots of cool systems engineering stuff
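To make the "push-based vectorized" item above concrete, here is a minimal sketch of the idea: the source operator drives the pipeline and pushes whole batches downstream, rather than the sink pulling one tuple at a time. The operator names (`Filter`, `Sum`, `scan`) and `BATCH_SIZE` are hypothetical and purely illustrative; this is not Yellowbrick's engine.

```python
BATCH_SIZE = 4  # hypothetical vector size; real engines use much larger batches


class Filter:
    """Forwards only the values that satisfy `predicate` to the next operator."""

    def __init__(self, predicate, downstream):
        self.predicate = predicate
        self.downstream = downstream

    def push(self, batch):
        selected = [v for v in batch if self.predicate(v)]
        if selected:  # skip empty batches entirely
            self.downstream.push(selected)


class Sum:
    """Pipeline sink that accumulates a running total."""

    def __init__(self):
        self.total = 0

    def push(self, batch):
        self.total += sum(batch)


def scan(values, downstream):
    """Source operator: drives the pipeline by pushing fixed-size batches."""
    for start in range(0, len(values), BATCH_SIZE):
        downstream.push(values[start:start + BATCH_SIZE])


sink = Sum()
scan(list(range(10)), Filter(lambda v: v % 2 == 0, sink))
print(sink.total)  # 0 + 2 + 4 + 6 + 8 = 20
```

Control flow lives in the producer, so each operator is just a `push` method that processes a batch and hands the result on.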
Architecture
- Rowstore → Columnstore Disk Caches → Object Storage
- 1 worker pod per worker node to guarantee exclusive hardware access (not very efficient IMO)
Query Execution
- They make sure data is sitting in CPU L3 cache, not just in memory.
- They take the query plan and transpile fragments into C++ code, which is then compiled with LLVM.
- A separate service handles this compilation and caches the compiled output per fragment (with the engine version as part of the cache key). This cache is probably per tenant, but Redshift has a similar service which caches across all customers.
- One of the things that Yellowbrick added to the PG query optimizer is cost-based join-order selection using statistics collected during the rowstore→columnstore compaction.
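The compilation cache described above can be sketched roughly as follows. This is a guess at the shape of such a service, not Yellowbrick's implementation: `FragmentCompileCache`, `fake_compile`, and the version strings are all hypothetical, and a real service would invoke a compiler instead of returning a string.

```python
import hashlib


class FragmentCompileCache:
    """Caches compiled fragments keyed by (engine version, hash of the source)."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.misses = 0  # number of times we actually had to compile

    def get(self, fragment_source, engine_version):
        digest = hashlib.sha256(fragment_source.encode("utf-8")).hexdigest()
        key = (engine_version, digest)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compile_fn(fragment_source)
        return self._cache[key]


def fake_compile(source):
    # Stand-in for the expensive C++ compilation step (assumption).
    return f"<compiled {len(source)} bytes of source>"


cache = FragmentCompileCache(fake_compile)
fragment = "for (row : batch) if (row.x > 10) emit(row);"
cache.get(fragment, engine_version="6.0")  # miss: compiles
cache.get(fragment, engine_version="6.0")  # hit: reuses cached output
cache.get(fragment, engine_version="6.1")  # miss: new engine version
print(cache.misses)  # 2
```

Including the engine version in the key means an upgrade never reuses code generated for an older engine, at the price of a cold cache after each release.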
Storage
- managed storage (bring your own S3 bucket)
- proprietary format (dictionary encoding, …)
- sharding key / sort key supported
  - the shard key determines which worker node each file is mapped to
- they somewhat support ingesting parquet files
- row-store in frontend and columnar data in object store
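Two of the storage ideas above, dictionary encoding and shard-key placement, can be illustrated with a short sketch. The notes do not say how Yellowbrick maps shard keys to workers, so the hash-modulo scheme here is an assumption, and `worker_for` and `dictionary_encode` are hypothetical helper names.

```python
import zlib


def worker_for(shard_key, num_workers):
    """Map a shard key to a worker index with a stable hash (assumed scheme)."""
    # zlib.crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(shard_key.encode("utf-8")) % num_workers


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


dictionary, codes = dictionary_encode(["US", "DE", "US", "FR", "DE"])
print(dictionary)  # ['US', 'DE', 'FR']
print(codes)       # [0, 1, 0, 2, 1]

# The original column is recovered by looking each code up in the dictionary.
decoded = [dictionary[code] for code in codes]
```

Dictionary encoding pays off when a column has few distinct values: the repeated strings are stored once and the column itself shrinks to small integers.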
OS Optimizations (low level systems engineering, similar to things high frequency trading software does - to take over as much as possible of what the underlying OS is doing)
- Memory Allocator
  - custom NUMA-aware allocator
  - everything happens in user space: the program allocates its memory at startup and never calls malloc / new again
  - Huge Pages!
    - larger pages (2MB to 1GB instead of the default 4KB) are good for OLAP performance
    - but not THP (transparent huge pages), those are bad!
- Thread Scheduler
  - All cores on the same worker execute different parts of the same query at the same time, so that the data is always in CPU L3 cache (instead of memory)
  - Open question from the class: so they don’t allow more than one query to be running at the same time? That seems quite limiting?? 🤷
- Device Drivers
  - They built their own NVMe / NIC drivers that run in user space, replacing the kernel drivers used for reading/writing disks (SSDs) and for networking. The replacements use kernel bypass for internal communication.
- Network Protocols
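The "allocate everything at startup, never call malloc again" idea from the Memory Allocator item can be sketched as a bump (arena) allocator. This only illustrates the allocation pattern: the `Arena` class is hypothetical, and a Python `bytearray` is neither NUMA-aware nor backed by huge pages, both of which the real allocator relies on.

```python
class Arena:
    """Bump allocator over one buffer that is reserved once, up front."""

    def __init__(self, capacity):
        self._buffer = bytearray(capacity)  # the only real allocation
        self._offset = 0

    @property
    def used(self):
        return self._offset

    def alloc(self, size, align=8):
        """Return a writable view of `size` bytes, aligned to `align`."""
        start = (self._offset + align - 1) // align * align  # round up
        if start + size > len(self._buffer):
            raise MemoryError("arena exhausted")
        self._offset = start + size
        return memoryview(self._buffer)[start:start + size]

    def reset(self):
        """Release everything at once, e.g. when a query finishes."""
        self._offset = 0


arena = Arena(capacity=64)
a = arena.alloc(3)      # occupies bytes 0..2
b = arena.alloc(8)      # aligned up to start at byte 8
a[:] = b"abc"
print(arena.used)       # 16
arena.reset()
```

Allocation is just a pointer bump and freeing is a single reset, which is why this pattern suits engines that want predictable latency and no calls into the OS on the hot path.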
Parting Thoughts from Andy
- What the team at Yellowbrick did doesn’t matter that much if the optimizer picks bad query plans.
  - (if the join order is wrong, all the low-level optimizations are irrelevant)
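A toy calculation shows why join order dominates. The cost model below (sum of intermediate result sizes), the table names, the cardinalities, and the selectivities are all made up for illustration; this is not Yellowbrick's or PostgreSQL's cost model.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical base-table cardinalities.
rows = {"orders": 1_000_000, "customers": 10_000, "nations": 25}

# Hypothetical join selectivities; pairs without a join predicate are
# cross products (selectivity 1).
selectivity = {
    frozenset({"orders", "customers"}): Fraction(1, 10_000),
    frozenset({"customers", "nations"}): Fraction(1, 25),
}


def plan_cost(order):
    """Cost of a left-deep plan = total rows of all intermediate results."""
    joined = [order[0]]
    cardinality = Fraction(rows[order[0]])
    cost = Fraction(0)
    for table in order[1:]:
        cardinality *= rows[table]
        for previous in joined:
            cardinality *= selectivity.get(frozenset({previous, table}), 1)
        joined.append(table)
        cost += cardinality
    return cost


plans = list(permutations(rows))
best = min(plans, key=plan_cost)
worst = max(plans, key=plan_cost)
print(best, int(plan_cost(best)))    # cost 1,010,000
print(worst, int(plan_cost(worst)))  # cost 26,000,000
```

Even in this three-table example, the worst order produces roughly 25x more intermediate rows than the best one, because it starts with a cross product of `orders` and `nations`. No amount of cache-friendly execution recovers a gap like that, and it grows quickly with more tables.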