First of all, this paper was extremely hard for me to read. I scanned through it a few times, but it wasn't until I re-watched the relevant section of Andy's lecture that I was able to grok it.

(54:45 to 01:04:20)

Overview of the main idea (3 sentences)

The main idea behind FastLanes is that Delta-encoded data cannot easily be decoded with SIMD instructions, because there is a sequential dependency between values (each value depends on the immediately preceding one). The FastLanes encoding re-orders the data within a block so that it breaks into many independent chains, each of which can be decoded in parallel with SIMD instructions (sketched below). Interestingly, this trades a slightly larger storage footprint (one base value per chain instead of one per block) for much faster decoding.
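
To make this concrete, here is a minimal C sketch of the dependency problem and the re-ordering trick. This is my own simplified illustration, not the paper's actual layout: the `LANES` count and the simple interleaved ordering are placeholders, while real FastLanes works on 1024-value blocks with a transposed layout.

```c
#include <stdint.h>

#define N      1024  /* values per block (matches the paper's block size) */
#define LANES  16    /* illustrative lane count, not the real layout */

/* Plain delta decoding: out[i] depends on out[i-1], so the loop
 * carries a dependency from one iteration to the next and cannot
 * be vectorized. */
void delta_decode_scalar(const int32_t deltas[N], int32_t base,
                         int32_t out[N]) {
    int32_t prev = base;
    for (int i = 0; i < N; i++) {
        prev += deltas[i];      /* loop-carried dependency */
        out[i] = prev;
    }
}

/* FastLanes-style idea, simplified: split the block into LANES
 * independent chains, each with its own stored base value. The inner
 * loop has no cross-lane dependency, so the compiler can turn it into
 * one SIMD add per step. The extra per-lane bases are why storage
 * grows a bit. */
void delta_decode_lanes(const int32_t deltas[N], const int32_t bases[LANES],
                        int32_t out[N]) {
    int32_t acc[LANES];
    for (int l = 0; l < LANES; l++)
        acc[l] = bases[l];
    for (int i = 0; i < N; i += LANES) {
        for (int l = 0; l < LANES; l++) {   /* lanes are independent */
            acc[l] += deltas[i + l];
            out[i + l] = acc[l];
        }
    }
}
```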

Key findings / takeaways from the paper (2-3 sentences)

Optimizing for the smallest possible storage footprint is not necessarily the right goal for a data format or encoding scheme. Especially nowadays, with storage being really cheap, it makes more sense to optimize for faster decoding/encoding and other operations. The FastLanes encoding scheme achieves this by optimizing specifically for SIMD decoding.

Furthermore, this paper introduces a novel idea: what if we design an algorithm to be flexible enough for future SIMD registers that don't exist yet? The authors do this by writing their kernels against a virtual 1024-bit SIMD register, wider than anything current hardware offers, so that the same code maps onto today's 128/256/512-bit registers and onto wider future ones.
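
As a rough sketch of what coding against a virtual register looks like (my interpretation, not the paper's actual kernels): for 32-bit values a 1024-bit register holds 32 lanes, and writing scalar loops with that fixed trip count lets the compiler auto-vectorize to whatever real SIMD width the CPU has.

```c
#include <stdint.h>

/* A virtual 1024-bit register holds 1024/32 = 32 lanes of 32-bit
 * values. Writing kernels as fixed-width scalar loops over these
 * lanes lets the compiler map them onto real 128/256/512-bit SIMD
 * registers, and onto wider ones whenever they appear. */
#define VIRT_LANES (1024 / 32)

void virt_add(const int32_t a[VIRT_LANES], const int32_t b[VIRT_LANES],
              int32_t out[VIRT_LANES]) {
    for (int l = 0; l < VIRT_LANES; l++)    /* trivially auto-vectorized */
        out[l] = a[l] + b[l];
}
```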

System used in evaluation and how it was modified/extended (1 sentence)

The main evaluation is a micro-benchmark written by the authors that performs bit-unpacking. They measure CPU cycles instead of wall-clock time, since cycle counts are a more objective and reproducible metric.
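
For intuition, a cycle-based measurement might look roughly like this. This is a hypothetical harness, not the authors' benchmark: `__rdtsc` is x86-only, and a real harness would also serialize the pipeline and average many runs.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), x86 only */

/* Hypothetical harness: report cycles per decoded value instead of
 * wall-clock time, which is less sensitive to frequency scaling and
 * easier to reproduce across machines. */
typedef void (*unpack_fn)(const uint8_t *packed, int32_t *out);

double cycles_per_value(unpack_fn unpack, const uint8_t *packed,
                        int32_t *out, int n_values) {
    uint64_t start = __rdtsc();
    unpack(packed, out);
    uint64_t end = __rdtsc();
    return (double)(end - start) / n_values;
}
```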

The authors also integrated FastLanes into Tectorwise, an experimental query processor, in order to measure end-to-end performance on a real SELECT SUM(COL) FROM TAB query.

Workload Evaluated (1 sentence)

  1. Bit-unpacking, as mentioned above
  2. End-to-end query performance (SELECT SUM(COL) FROM TAB)