Lecture #03: Data Formats & Encoding II

Date: Jan 29 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/03-data2.pdf

Reading

The FastLanes Compression Layout: Decoding > 100 Billion Integers per Second with Scalar Code (A. Afroozeh, et al., VLDB 2023)
BtrBlocks: Efficient Columnar Compression for Data Lakes (M. Kuschewski, et al., SIGMOD 2023) (Optional)
BitWeaving: Fast Scans for Main Memory Data Processing (Y. Li, et al., SIGMOD 2013) (Optional)

Raw Lecture Notes

good initial reminder: most data is semi-structured or unstructured
brush up on shredding (already learnt this last week)
reminder: when Parquet/ORC were invented (10-12 years ago), networks were slower and storage was more expensive. Some of the hardware assumptions made when they were designed no longer hold. For instance, designing a file format around SIMD wasn’t really a thing back then.
In this lecture, we’ll learn about 3 modern encoding schemes/file formats that can be used for storage:
- BtrBlocks (TUM) - basically parquet++
- FastLanes (CWI) - very different storage layout, focused on SIMD
- BitWeaving (Wisconsin) - very niche and unique, mainly interesting to study
BtrBlocks
- more aggressive nested encoding schemes than parquet/orc
- no naive block compression (as we learnt in the paper from last class, this is not very good anymore)
- metadata is in a separate file (makes this format less portable). The argument from the authors is that it allows clients to read zone maps from the metadata file and skip reading the main data file entirely. However, with S3 we can issue byte-range reads and fetch only the metadata portion of a single file, so the authors' argument is somewhat moot.
- to pick the best encoding scheme, they try the candidate schemes on a sample of about 1% of the data (with the samples taken from various locations in the column) and keep the one that compresses best
- one of the encoding schemes they use is FSST (VLDB 2020) which is a very clever string encoding scheme that appears to be better than naive dictionary encoding.
- Observation: what FSST does is not that different from what LZ4/ZSTD do, but in a way where the auxiliary data structures (dictionaries and whatnot) are open to the reader. This allows certain queries to be optimized using these data structures. As an example, for certain queries, like checking if any string in a column begins with “foo”, it may be possible to just check the dictionary of encodings and not the raw data.
- Another encoding scheme used is Roaring Bitmaps, a compressed bitmap format for storing sets of integers. Conceptually not that different from Bloom filters, but exact (no false positives).
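The sampling idea from the BtrBlocks bullets can be sketched roughly as follows. This is a toy and not BtrBlocks' code: the three size estimators, their made-up cost units, and the names `sample_runs` and `choose_scheme` are all hypothetical. As far as I understand the paper, the real system actually compresses the sample with each candidate scheme instead of estimating sizes.

```python
import random


def plain_size(values):
    """Uncompressed: one slot per value."""
    return len(values)


def rle_size(values):
    """Run-length encoding: one (value, count) pair, i.e. 2 slots, per run."""
    runs = 0
    prev = object()  # sentinel that never equals a real value
    for v in values:
        if v != prev:
            runs += 1
            prev = v
    return runs * 2


def dict_size(values):
    """Dictionary: one slot per distinct value plus a small code per value.

    Assumption for illustration: a code costs half a slot.
    """
    return len(set(values)) + 0.5 * len(values)


SCHEMES = {"plain": plain_size, "rle": rle_size, "dict": dict_size}


def sample_runs(values, num_runs=10, run_len=64, seed=0):
    """Take short contiguous runs from different parts of the column.

    Contiguous runs (rather than single random values) keep local patterns
    such as repeated values visible to the estimators.
    """
    if len(values) <= num_runs * run_len:
        return list(values)
    rng = random.Random(seed)
    seg = len(values) // num_runs  # one run per equal-sized segment
    sample = []
    for i in range(num_runs):
        start = i * seg + rng.randrange(seg - run_len + 1)
        sample.extend(values[start:start + run_len])
    return sample


def choose_scheme(values, **sample_kwargs):
    """Return the name of the scheme with the smallest estimated size."""
    sample = sample_runs(values, **sample_kwargs)
    return min(SCHEMES, key=lambda name: SCHEMES[name](sample))
```

With these toy estimators, a sorted column with long runs picks "rle", a column cycling through four values picks "dict", and a column of all-distinct values stays "plain".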
FastLanes (VLDB 2023)
- One issue with BtrBlocks, Parquet and ORC is that they all generate variable-length runs of values, which makes it hard to apply SIMD.
- FastLanes is an encoding scheme optimized for SIMD processing. It is also designed to be future-proof: the layout should keep working when 1024-bit SIMD registers appear.
- FastLanes works by heavily reordering the data to suit SIMD. I need to go and read the paper to fully understand this.
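To see why fixed-width layouts are friendlier to SIMD than variable-length runs, here is a minimal sketch of plain fixed-width bit-packing. It is NOT the FastLanes layout (the paper's reordering is more involved); the helper names `pack`, `unpack_at` and `unpack_lane` are made up for illustration. The point is only that when every value has the same bit width, the position of value i is computable, so independent "lanes" can be decoded without looking at each other's data.

```python
def pack(values, width):
    """Pack non-negative ints into one big integer, `width` bits per value."""
    mask = (1 << width) - 1
    packed = 0
    for i, v in enumerate(values):
        if not 0 <= v <= mask:
            raise ValueError(f"value {v} does not fit in {width} bits")
        packed |= v << (i * width)
    return packed


def unpack_at(packed, width, i):
    """Random access: value i always starts at bit offset i * width."""
    return (packed >> (i * width)) & ((1 << width) - 1)


def unpack_lane(packed, width, lane, num_lanes, count):
    """Decode every num_lanes-th value starting at `lane`.

    Each lane touches a disjoint, predictable set of bit offsets, which is
    the property a SIMD decoder needs (here simulated with a plain loop).
    """
    return [unpack_at(packed, width, i) for i in range(lane, count, num_lanes)]


# Illustrative block of 1024 three-bit values.
values = [i % 8 for i in range(1024)]
packed = pack(values, width=3)
lanes = [unpack_lane(packed, 3, lane, 4, len(values)) for lane in range(4)]
```

With run-length encoding, by contrast, finding value i means walking the runs before it, so the lanes cannot be decoded independently.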
BitWeaving (SIGMOD 2013)
- One observation about all encoding schemes discussed so far is that they encode the entire value. This is NOT optimal for many queries/comparisons that could be much faster if only a subset of the value was checked. We don’t need to read a full number to know that 1050 is larger than 5; in some cases checking just the most significant bits is enough.
- BitWeaving is built on the idea of bit slicing from the 1990s. It’s a bit niche, and was eventually implemented in the Quickstep system (later Apache Quickstep), which has since been retired.
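The bit-slicing idea can be sketched like this, in the spirit of a vertical (bit-sliced) layout. This is a simplified illustration and not the paper's actual algorithm or layout: Python integers stand in for packed bit vectors, and the names `slice_bits`, `less_than` and `to_indices` are hypothetical. Plane j holds bit j of every value, and the predicate `value < c` is evaluated for all values at once, from the most significant plane down, stopping as soon as no value can still be equal to `c`.

```python
def slice_bits(values, width):
    """Build `width` bit planes; bit i of plane j is bit j of values[i]."""
    planes = [0] * width
    for i, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                planes[j] |= 1 << i
    return planes


def less_than(planes, n, c):
    """Evaluate `value < c` for n values at once (c assumed non-negative).

    Returns (bitmask of matching rows, number of planes actually read).
    """
    width = len(planes)
    all_ones = (1 << n) - 1
    if c >= (1 << width):
        return all_ones, 0  # c is larger than any representable value
    lt = 0          # rows already known to be < c
    eq = all_ones   # rows whose bits seen so far equal c's bits
    planes_read = 0
    for j in reversed(range(width)):  # most significant plane first
        plane = planes[j]
        planes_read += 1
        if (c >> j) & 1:
            lt |= eq & ~plane  # row has 0 where c has 1: row < c
            eq &= plane
        else:
            eq &= ~plane       # row has 1 where c has 0: row > c
        if eq == 0:
            break              # every row is decided, skip remaining planes
    return lt, planes_read


def to_indices(mask, n):
    """Convert a result bitmask into a list of row indices."""
    return [i for i in range(n) if (mask >> i) & 1]


values = [1050, 5, 3, 700, 4, 6]
planes = slice_bits(values, width=11)
mask, _ = less_than(planes, len(values), 5)
matches = to_indices(mask, len(values))  # -> [2, 4], i.e. the values 3 and 4
```

If every value in the column is at least 1024 (say 1050, 1200, 1900), the same comparison against 5 is decided after reading a single plane, which is the "1050 vs 5" intuition above.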
Parting thoughts from Andy: the separation of how data is queried from how it is stored is what allowed all the innovation we discussed in the last 2 classes. Having SQL be completely decoupled from how the data is stored was really, really important.