Date: Apr 24 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/22-redshift.pdf
Reading
Raw Lecture Notes
- A lot of people have tried to make OLAP versions of Postgres
- Vertica, Citus, Greenplum, ParAccel
- Redshift is based on ParAccel
- shared-nothing architecture
- they added S3 support (and gave it separation of storage and compute) in 2017
- Redshift is a more traditional data warehouse since it is more geared towards a closed data format (like Yellowbrick)
- (But now they have Iceberg tables too!)
- AWS also has…
- Athena - serverless query engine built with Presto/Trino (lakehouse style)
- Redshift Spectrum - lakehouse-version of Redshift (bring your own S3 without having to load it into the warehouse)
- High-Level Overview of Redshift
- Shared-Disk / Separation of Storage and Compute (Disaggregated Storage)
- Push-based Vectorized Query Processing
- Transpilation Query codegen (C++)
- Precompiled Primitives
- Compute-side Caching
- PAX Columnar Storage
- Sort-Merge + Hash Joins
- Hardware Acceleration (AQUA)
- Stratified Query Optimizer
- Architecture
- Customer chooses underlying EC2 instance type that the redshift software runs on
- Query Engine
- hybrid query compilation using both source-to-source (C++) and precompiled primitives
- SIMD via AVX2 instances
- late materialization
- SIMD fast paths for ASCII data…
- dynamic bloom filter sizing for join filters…
- Shared Compilation Caching Service
- as we learnt before, they cache templated fragments → C++ codegen’d code across all customers
- (background workers proactively recompile plans when a new version of the engine comes out)
- Query Optimizer
- postgres is still at the very core of it, but probably over the years it’s been mostly scrapped out
- cost-based query optimizer that is guided by statistics extracted from data (if the customer is using redshift managed storage, not spectrum)
- for spectrum, they just leverage metadata like zone maps in the headers of parquet/orc files, and use those to pushdown filters
- “Query Rewriting Framework” - little is known about what this is and how it works
- DSL-based approach to specify patterns and transformation rules
- Storage
- RMS (Redshift Managed Storage) are separate nodes storing data in Redshift’s proprietary data format on direct-attached SSDs
- compute nodes also maintain local cache (SSD or not?)
- these can spill data to S3
- they support dictionary encoding, delta encoding, RLE per column and the user can tweak this
- Notes on Benchmarks
- Everybody claims they’re the best on their own papers
- Nobody (except Yellowbrick) puts real runtime numbers (query durations) in their papers so they’re not quoted against those numbers by competitors later
- (Almost) Everybody has “serverless”-style sizes (units/t-shirt size/etc.), so comparing apples to apples is very hard
- Great question from class student
- “How can Snowflake succeed and even beat AWS/Azure/GCP if they need to make margins on top of AWS/Azure/GCP”
- Andy responds: because they started early
- (My personal response: user experience is way better than Redshift)
- Parting Thoughts
- Amazon makes maybe up to 100B a year on Redshift!