Date: Jan 24 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/02-data1.pdf
Reading
- An Empirical Evaluation of Columnar Storage Formats (X. Zeng, et al., VLDB 2023)
- A Deep Dive into Common Open Formats for Analytical DBMSs (C. Liu, et al., VLDB 2023) (Optional)
Raw Lecture Notes
- OLAP systems mostly do large sequential scans, as opposed to OLTP systems, which mostly do small point reads and writes
- Different storage models
- N-ary Storage Model (NSM, rowstore)
- Decomposition Storage Model (DSM, columnstore)
- Hybrid Storage Model (PAX) - a hybrid of columnstore and rowstore that leans much more toward columnstore. Invented at CMU in the early 2000s; modern file formats like Parquet and ORC all use it.
- In either DSM or PAX, all cell values are stored with a fixed length so offsets can be computed directly for fast lookups. How does this work for strings and other variable-length types? Dictionary encoding and similar techniques (see the encoding sketch further below).
- PAX is made up of row groups (called stripes in ORC, but called row groups in Parquet).
- I think SingleStoreDB used some version of PAX before, but its columnstore implementation is now much more complex: it is LSM-based and really more of a “Universal Storage” that unifies rowstore and columnstore.
- The PAX hybrid rowstore/columnstore is good because it accelerates most queries: most queries touch multiple columns, not just one, and PAX keeps all columns of a row group together. This is the main differentiator from the naive columnstore implementation (DSM), which scatters a row’s columns across separate files or regions (layout sketch below).
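- To make the PAX layout concrete, here is a minimal sketch (not any real format’s code; the row-group size and table contents are made up for illustration): rows are chunked into row groups, and each group is transposed so values within the group are stored column by column.

```python
# Toy PAX-style layout: fixed-size row groups, columnar inside each group.
ROW_GROUP_SIZE = 2  # tiny for illustration; real formats use ~100K+ rows

rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0), (4, "d", 40.0)]

row_groups = []
for start in range(0, len(rows), ROW_GROUP_SIZE):
    group = rows[start:start + ROW_GROUP_SIZE]
    # Transpose the group: one contiguous list per column.
    row_groups.append([list(col) for col in zip(*group)])

# row_groups[0] == [[1, 2], ['a', 'b'], [10.0, 20.0]]
# A single-column scan reads contiguous values, yet reconstructing a full
# row only touches one row group, not separate per-column files.
```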
- When designing a data format, there’s a set of things to think about:
- File Metadata
- Format Layout (e.g., what should be the maximum size for each “row group”?)
- Type System (which physical types to have, how to represent logical types on top of physical types)
- Parquet has very few physical types (boolean, int32, int64, int96, float, double, byte array, fixed-length byte array). Everything else is built as a logical type on top of these physical types (see the pyarrow example below).
- ORC has a lot more physical types.
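- A quick way to see the physical/logical split, assuming pyarrow is installed (the file name and column values here are arbitrary): write a table with logical types, then print the Parquet schema, which shows each logical type annotating one of the few physical types.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "name": pa.array(["ada", "bob"], type=pa.string()),  # logical String
    "ts":   pa.array([0, 1], type=pa.timestamp("ms")),   # logical Timestamp
    "day":  pa.array([0, 1], type=pa.date32()),          # logical Date
})
pq.write_table(table, "types.parquet")

# Prints the physical types with their logical annotations, e.g.:
#   binary name (String); int64 ts (Timestamp(...)); int32 day (Date)
print(pq.ParquetFile("types.parquet").schema)
```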
- Encoding Schemes
- Dictionary Encoding (DE), a.k.a. Dictionary Compression, is the most widely used form of encoding in big data file formats.
- We can apply run-length encoding (RLE) on top of DE, since the integer codes often repeat (toy sketch below).
- Parquet DEs everything, even numbers; ORC mainly uses DE for strings. This gives Parquet better data compression and thus smaller file sizes.
- DE ultimately allows us to have fixed-length blocks even for TEXT columns.
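- Here is a toy sketch of dictionary encoding with RLE layered on top (pure illustration; real formats additionally bit-pack the codes):

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of repeated codes into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

cities = ["NYC", "NYC", "NYC", "LA", "LA", "NYC"]
dictionary, codes = dictionary_encode(cities)
print(dictionary)                # {'NYC': 0, 'LA': 1}
print(run_length_encode(codes))  # [[0, 3], [1, 2], [0, 1]]
# Note the codes are fixed-width integers even though the values are TEXT.
```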
- Block Compression
- LZO, LZ4, Snappy, and Zstd are some of the popular algorithms.
- Andy Pavlo recommends against leaning on block-level compression too much: it slows things down because every read and write pays compression/decompression cost. This trade-off was different in the old days, when cheap, effectively unlimited storage wasn’t as available as it is in the cloud today. Furthermore, bandwidth keeps getting faster and is less and less of a bottleneck (size comparison below).
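- A hedged pyarrow example of the trade-off (file names and data are arbitrary): the same table written with and without a general-purpose codec. The compressed file is smaller, but every subsequent scan pays to decompress it.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

pq.write_table(table, "plain.parquet", compression="none")
pq.write_table(table, "zstd.parquet", compression="zstd")

print(os.path.getsize("plain.parquet"))  # larger, but reads need no codec
print(os.path.getsize("zstd.parquet"))   # smaller, but reads must decompress
```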
- Filters
- Filters are things like min/max ranges for column values stored at the row-group level. They let the system skip entire row groups whose ranges cannot satisfy a query’s predicate (sketch below).
- Bloom filters and other auxiliary structures are used too, e.g., for equality predicates.
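- A minimal sketch of min/max-based row-group skipping (the data structures here are invented for illustration):

```python
# Each row group carries precomputed min/max stats for a column.
row_groups = [
    {"min": 1,  "max": 50,  "values": [1, 17, 50]},
    {"min": 60, "max": 90,  "values": [60, 75, 90]},
    {"min": 5,  "max": 200, "values": [5, 120, 200]},
]

def scan_greater_than(row_groups, threshold):
    for rg in row_groups:
        if rg["max"] <= threshold:
            continue  # no row here can satisfy value > threshold: skip it
        yield from (v for v in rg["values"] if v > threshold)

# Only the third group is scanned; the first two are skipped entirely.
print(list(scan_greater_than(row_groups, 100)))  # [120, 200]
```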
- Nested Data
- Shredding is where you take something like a JSON document and map it out to columnar data, one column per key path (toy sketch below). This is how SingleStore and many other systems store JSON. It works very well since most real-world JSON documents share a mostly stable schema.
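- A toy version of shredding, assuming scalar leaves only (real systems also handle arrays, missing keys, and type conflicts):

```python
import json

def shred(obj, prefix=""):
    """Flatten one JSON object into {dotted.path: scalar} pairs."""
    flat = {}
    for key, val in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            flat.update(shred(val, path))
        else:
            flat[path] = val
    return flat

docs = [json.loads(s) for s in (
    '{"user": {"id": 1, "name": "ana"}, "clicks": 3}',
    '{"user": {"id": 2, "name": "bob"}, "clicks": 7}',
)]
flats = [shred(d) for d in docs]
paths = sorted({p for f in flats for p in f})
columns = {p: [f.get(p) for f in flats] for p in paths}
print(columns)
# {'clicks': [3, 7], 'user.id': [1, 2], 'user.name': ['ana', 'bob']}
```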
- (Fun fact that I didn’t fully know about before: S3 supports ranged reads, i.e., you can GET a specific byte range of any object. This is how readers fetch just a Parquet footer or a single column chunk.)
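- For example, with boto3 (the bucket and key are placeholders), GetObject accepts the standard HTTP Range header:

```python
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-bucket",          # placeholder bucket
    Key="data/example.parquet",  # placeholder key
    Range="bytes=1000-1999",     # standard HTTP Range syntax, inclusive
)
chunk = resp["Body"].read()      # exactly the requested 1,000 bytes
```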
- Parquet vs ORC in real life
- ORC is better for geospatial
- Parquet compresses better (see the note above about dictionary-encoding numbers)
- Generally, Parquet allows for better performance than ORC, but it’s not a huge difference.
- Main class takeaways
- DE works for more than just strings!
- Simple encoding schemes are better than block compression
- Avoid general-purpose block compression