Lecture #22: System Analysis (Amazon Redshift)

Date: Apr 24 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/22-redshift.pdf

Reading

Amazon Redshift Re-Invented (N. Armenatzoglou, et al., SIGMOD 2022)
Amazon Redshift and the Case for Simpler Data Warehouses (A. Gupta, et al., SIGMOD 2015) (Optional)

Raw Lecture Notes

A lot of people have tried to make OLAP versions of Postgres
- Vertica, Citus, Greenplum, ParAccel
Redshift is based on ParAccel
- shared-nothing architecture
- they added S3 support (and gave it separation of storage and compute) in 2017
Redshift is a more traditional data warehouse since it is more geared towards a closed data format (like Yellowbrick)
- (But now they have Iceberg tables too!)
AWS also has…
- Athena - serverless query engine built with Presto/Trino (lakehouse style)
- Redshift Spectrum - lakehouse-version of Redshift (bring your own S3 without having to load it into the warehouse)
High-Level Overview of Redshift
- Shared-Disk / Separation of Storage and Compute (Disaggregated Storage)
- Push-based Vectorized Query Processing
- Transpilation Query codegen (C++)
- Precompiled Primitives
- Compute-side Caching
- PAX Columnar Storage
- Sort-Merge + Hash Joins
- Hardware Acceleration (AQUA)
- Stratified Query Optimizer
Architecture
- Customer chooses underlying EC2 instance type that the redshift software runs on
Query Engine
- hybrid query compilation using both source-to-source (C++) and precompiled primitives
- SIMD via AVX2 instances
- late materialization
- SIMD fast paths for ASCII data…
- dynamic bloom filter sizing for join filters…
Shared Compilation Caching Service
- as we learnt before, they cache templated fragments → C++ codegen’d code across all customers
- (background workers proactively recompile plans when a new version of the engine comes out)
Query Optimizer
- postgres is still at the very core of it, but probably over the years it’s been mostly scrapped out
- cost-based query optimizer that is guided by statistics extracted from data (if the customer is using redshift managed storage, not spectrum)
  - for spectrum, they just leverage metadata like zone maps in the headers of parquet/orc files, and use those to pushdown filters
- “Query Rewriting Framework” - little is known about what this is and how it works
  - DSL-based approach to specify patterns and transformation rules
Storage
- RMS (Redshift Managed Storage) are separate nodes storing data in Redshift’s proprietary data format on direct-attached SSDs
  - compute nodes also maintain local cache (SSD or not?)
  - these can spill data to S3
- they support dictionary encoding, delta encoding, RLE per column and the user can tweak this
Notes on Benchmarks
- Everybody claims they’re the best on their own papers
- Nobody (except Yellowbrick) puts real runtime numbers (query durations) in their papers so they’re not quoted against those numbers by competitors later
- (Almost) Everybody has “serverless”-style sizes (units/t-shirt size/etc.), so comparing apples to apples is very hard
Great question from class student
- “How can Snowflake succeed and even beat AWS/Azure/GCP if they need to make margins on top of AWS/Azure/GCP”
  - Andy responds: because they started early
  - (My personal response: user experience is way better than Redshift)
Parting Thoughts
- Amazon makes maybe up to 100B a year on Redshift!