Personal note: this paper was very hard to read. Even harder than The FastLanes Compression Layout: Decoding > 100 Billion Integers per Second with Scalar Code. Even after watching Andy’s lecture on this, I don’t understand everything in the paper. Specifically, I did not fully understand the differences among the divergent, partial, buffered, and materialization strategies as they pertain to the AVX-512 implementations.


Overview of the main idea (3 sentences)

The main idea discussed in this paper is how to maximize SIMD lane utilization for data that has yet to be filtered out by a query, while also avoiding too-frequent (and costly) pipeline breaks and refills.

The authors introduce several algorithms and heuristics for doing this, then benchmark them across a number of tests and runtime variables: different Intel processors, different threshold percentages for their algorithms, different hyper-threading configurations, and so on.

Key findings / takeaways from the paper (2-3 sentences)

Perhaps the main key finding is that all of the optimizations discussed in the paper are very hard to implement and test, and they are subject to a lot of variability across different processors and configurations. Moreover, the speedups they generate are not even that spectacular (“our refill algorithms merely achieve a 2x speedup over scalar code”), where scalar code basically means Volcano-model-style code that produces tuples one at a time.

System used in evaluation and how it was modified/extended (1 sentence)

The system used is basically raw C code running on Intel Skylake and Intel Knights Landing processors.

Workload Evaluated (1 sentence)

They evaluate TPC-H query 1, linear probing of a hash table, and an approximate geospatial join.