Date: Apr 15 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/19-snowflake.pdf
Reading
Raw Lecture Notes
- VCs at Sutter Hill Ventures recruited two Oracle engineers (Dageville, Cruanes) and a Vectorwise co-founder (Zukowski) to build Snowflake in 2012.
- Redshift actually beat Snowflake to market by a few months. (Redshift was built from ParAccel’s source code.)
- Snowflake is an OLAP DBMS written in C++
- shared-disk architecture, with aggressive compute-side local caching
- written from scratch, did not borrow any components from any systems
- custom SQL dialect and client-server network protocol
- More properties
- precompiled primitives
- separate table data from metadata!
- no buffer pool
- PAX columnar storage
- supports both their proprietary storage format and Iceberg
- sort-merge + hash joins
- unified query optimizer + adaptive optimization
- push-based vectorized query processing
- storage is S3 or an equivalent object store
- worker nodes are VM instances running Snowflake’s software, with attached disks
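To make "push-based vectorized query processing" and "precompiled primitives" concrete, here is a minimal sketch, not Snowflake's actual code. All names (`sel_gt_int`, `sum_int`, `Filter`, `SumSink`, `run_pipeline`) are hypothetical. The idea it illustrates: the scan pushes fixed-size batches downstream, and each operator calls a small type-specialized loop picked ahead of time instead of interpreting one tuple at a time.

```python
from typing import List

BATCH_SIZE = 4  # tiny for illustration; real engines typically use vectors of roughly a thousand values


# "Precompiled primitives": small loops specialized per type and operation.
def sel_gt_int(col: List[int], bound: int) -> List[int]:
    """Return the positions in the batch where col[i] > bound."""
    return [i for i, v in enumerate(col) if v > bound]


def sum_int(col: List[int], positions: List[int]) -> int:
    """Sum only the selected positions of the batch."""
    return sum(col[i] for i in positions)


class SumSink:
    """Pipeline breaker: accumulates a running SUM over pushed batches."""

    def __init__(self) -> None:
        self.total = 0

    def push(self, col: List[int], positions: List[int]) -> None:
        self.total += sum_int(col, positions)


class Filter:
    """Filter operator: computes a selection vector and pushes it downstream."""

    def __init__(self, bound: int, downstream: SumSink) -> None:
        self.bound = bound
        self.downstream = downstream

    def push(self, col: List[int]) -> None:
        positions = sel_gt_int(col, self.bound)
        if positions:  # skip batches with no qualifying values
            self.downstream.push(col, positions)


def run_pipeline(values: List[int], bound: int) -> int:
    """SELECT SUM(v) FROM values WHERE v > bound, driven from the scan upward."""
    sink = SumSink()
    flt = Filter(bound, sink)
    for start in range(0, len(values), BATCH_SIZE):
        flt.push(values[start:start + BATCH_SIZE])  # the scan pushes one batch at a time
    return sink.total
```

In a pull-based (Volcano-style) engine the sink would instead call `next()` on its child; here control flow starts at the scan, which is what "push-based" refers to.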
- Work Stealing
- super important part of Snowflake’s architecture: they can steal idle worker nodes that are assigned to other customers
- this is very different from Yugabyte/SingleStore, and probably quite valuable in terms of efficiency for the overall service
- when Snowflake does this type of work stealing, the query operators on the borrowed node do not write to local disk but instead to S3. This is slower, but it’s the only workable option, because otherwise the local disk contents would have to be flushed out if the dormant customer were to come back
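The spill rule for borrowed nodes can be sketched as below. This is an illustrative model only, assuming a hypothetical `Worker` record and hypothetical `spill_target` / `pick_workers` helpers; it does not reflect Snowflake's real scheduler. The one point it captures is that a node owned by the querying customer may spill to local disk, while a node borrowed from another customer spills to the object store.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Worker:
    node_id: str
    owner: str          # customer the node is assigned to (hypothetical field)
    busy: bool = False  # whether the owner is currently using it


def spill_target(worker: Worker, query_customer: str) -> str:
    """Local disk for the owner's own queries, object storage on borrowed nodes."""
    return "local_disk" if worker.owner == query_customer else "object_store"


def pick_workers(workers: List[Worker], query_customer: str,
                 needed: int) -> List[Tuple[str, str]]:
    """Prefer the customer's own idle nodes, then borrow idle nodes from others."""
    idle = [w for w in workers if not w.busy]
    own = [w for w in idle if w.owner == query_customer]
    borrowed = [w for w in idle if w.owner != query_customer]
    chosen = (own + borrowed)[:needed]
    return [(w.node_id, spill_target(w, query_customer)) for w in chosen]
```

Because a borrowed node keeps nothing on its local disk for the borrower, it can be handed back to its owner without any cleanup of that disk.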
- interesting tidbit: they store parts of JSON/XML as binary columns (auto-detecting dates and similar types), but keep the unparsed data around in case the data drifts and can no longer be stored in that type
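A rough sketch of that tidbit, assuming ISO-formatted date strings and a hypothetical `shred_field` helper (the real detection logic and storage layout are not described in these notes): a typed column is materialized only while every value parses, and the raw documents are always retained so nothing is lost when the data drifts.

```python
import json
from datetime import date
from typing import Dict, List, Optional


def try_date(value) -> Optional[date]:
    """Return a date if the value is an ISO date string, else None."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):  # non-string or non-date content
        return None


def shred_field(raw_docs: List[str], field: str) -> Dict[str, object]:
    """Extract `field` into a typed date column when every value qualifies."""
    raw = list(raw_docs)  # unparsed originals are always kept
    values = [json.loads(doc).get(field) for doc in raw]
    parsed = [try_date(v) for v in values]
    # One value that no longer fits is enough to drop the typed column.
    typed = parsed if all(p is not None for p in parsed) else None
    return {"raw": raw, "typed": typed}
```

For example, two documents with `"d": "2024-04-15"`-style values yield a typed date column; adding a third document with `"d": "soon"` makes `typed` fall back to `None`, while `raw` still holds all three documents.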
- Unistore
- for TPC-C-style transactional workloads
- it’s a rowstore in front of a columnstore, so basically they have “hybrid tables”
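As a toy model of "a rowstore in front of a columnstore" (an assumption-laden sketch, not how hybrid tables are actually implemented; `HybridTable` and its methods are hypothetical names): writes land in a row-oriented structure keyed by primary key, point lookups check it first, and a flush step merges rows into columnar arrays that analytical scans read.

```python
from typing import Dict, List, Optional, Tuple


class HybridTable:
    def __init__(self, columns: List[str]) -> None:
        self.columns = columns
        self.row_store: Dict[int, Tuple] = {}                    # pk -> latest unflushed row
        self.col_store: Dict[str, list] = {c: [] for c in columns}
        self.col_index: Dict[int, int] = {}                      # pk -> position in col_store

    def upsert(self, pk: int, row: Tuple) -> None:
        """Transactional write path: touch only the row store."""
        self.row_store[pk] = row

    def get(self, pk: int) -> Optional[Tuple]:
        """Point lookup: newest version in the row store wins."""
        if pk in self.row_store:
            return self.row_store[pk]
        pos = self.col_index.get(pk)
        if pos is None:
            return None
        return tuple(self.col_store[c][pos] for c in self.columns)

    def flush(self) -> None:
        """Merge buffered rows into the columnar arrays."""
        for pk, row in self.row_store.items():
            pos = self.col_index.get(pk)
            if pos is None:  # new key: append to every column
                self.col_index[pk] = len(self.col_store[self.columns[0]])
                for c, v in zip(self.columns, row):
                    self.col_store[c].append(v)
            else:            # existing key: overwrite in place
                for c, v in zip(self.columns, row):
                    self.col_store[c][pos] = v
        self.row_store.clear()

    def scan_column(self, column: str) -> list:
        """Analytical scan: columnar data overlaid with unflushed rows, in pk order."""
        idx = self.columns.index(column)
        merged = {pk: self.col_store[column][pos] for pk, pos in self.col_index.items()}
        merged.update({pk: row[idx] for pk, row in self.row_store.items()})
        return [merged[pk] for pk in sorted(merged)]
```

The sketch ignores concurrency, durability, and deletes; it only shows why the same table can serve both single-row updates and column scans.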
- Parting Thoughts from Andy
- Snowpark, Snowpipe, and everything else Snowflake has built on top of the warehouse are what distinguish it, since the underlying query performance is becoming commoditized across Databricks, Redshift, Yellowbrick, etc.