Personal note: this paper was very hard to read. Even harder than "The FastLanes Compression Layout: Decoding > 100 Billion Integers per Second with Scalar Code." Even after watching Andy's lecture on this, I don't understand everything in the paper. Specifically, I did not fully understand the difference between the divergent, partial, buffered, and materialization strategies as they pertain to the paper's AVX-512 implementations.
The main idea discussed in this paper is how to keep SIMD lanes fully utilized when a query's filters disqualify some tuples mid-pipeline (leaving their lanes idle), while also avoiding overly frequent (and costly) pipeline breaks and refills.
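To make the lane-utilization problem concrete, here is a minimal sketch (my own, not code from the paper) of the AVX-512 compress-store primitive that refill strategies build on: after a vector comparison, only the surviving lanes are written out contiguously, so the next iteration starts with all 16 lanes full again. It assumes AVX-512F, a simple less-than predicate, and skips the `n % 16` tail loop for brevity.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Filter in[0..n) for values < threshold, compacting survivors into out.
 * Compile with -mavx512f. Returns the number of qualifying elements.
 * Tail handling (n % 16 leftover elements) is omitted for brevity. */
size_t filter_lt(const int32_t *in, size_t n, int32_t threshold, int32_t *out) {
    const __m512i t = _mm512_set1_epi32(threshold);
    size_t n_out = 0;
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m512i v = _mm512_loadu_si512((const void *)(in + i));
        /* One bit per lane: set if that lane's value qualifies. */
        __mmask16 m = _mm512_cmp_epi32_mask(v, t, _MM_CMPINT_LT);
        /* Contiguously store only the qualifying lanes, so downstream
         * operators never see "dead" lanes. */
        _mm512_mask_compressstoreu_epi32(out + n_out, m, v);
        n_out += (size_t)_mm_popcnt_u32((unsigned)m);
    }
    return n_out;
}
```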
The authors introduce some algorithms and heuristics for doing all of this, then benchmark them across a number of tests and runtime variables: different Intel processors, different threshold percentages for their algorithms, different hyper-threading configurations, and so on.
Perhaps the key finding is that all of the optimizations discussed in the paper are very hard to implement and test, and they are subject to a lot of variability across different processors and configurations. Yet the speedups they generate are not even that spectacular ("our refill algorithms merely achieve a 2x speedup over scalar code"), where scalar code basically means "Volcano model" style code in which tuples are processed one at a time.
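For contrast, this is roughly what that scalar baseline looks like as I understand it: one tuple per loop iteration with a branch per element, the counterpart of the compress sketch above (again my own illustration, not the paper's code).

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar baseline: one tuple per iteration, one branch per element.
 * This is the kind of code the paper's SIMD variants are measured against. */
size_t filter_lt_scalar(const int32_t *in, size_t n, int32_t threshold,
                        int32_t *out) {
    size_t n_out = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < threshold)      /* per-tuple branch */
            out[n_out++] = in[i];
    }
    return n_out;
}
```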
The system used is basically raw C code running on Intel Skylake and Intel Knights Landing processors.
They evaluate TPC-H Query 1, linear probing of a hash table, and an approximate geospatial join.