Overview of the main idea (3 sentences)
The authors propose a user-space algorithm for parallelizing DBMS query execution. The “tasks” to be parallelized are partitions of the individual “pipelines” that make up queries (see previous notes on what constitutes a pipeline): each pipeline’s input is broken into partitions (round-robin, “keyless sharding”, or based on some data attribute, which is even better because of joins), and each of those “pipeline-partition jobs” is treated as an individual, parallelizable task. The authors also propose scheduling these tasks across NUMA-aware worker threads, so that each worker thread should mostly access NUMA-local data. They start one thread per physical core on the machine. If a thread runs out of local work, however, it may pick up work that is not NUMA-local (not ideal, but better than sitting idle).
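To make the scheme concrete, here is a minimal Python sketch of partition-wise dispatch with a NUMA-local preference and a steal-from-remote fallback. All names (`Dispatcher`, `make_morsels`, `worker`) are mine, not the paper’s; the paper’s implementation is C++ inside HyPer, and the `Lock` here is only a stand-in for its lock-free dispatch structure.

```python
import threading

PARTITION_SIZE = 4  # tiny for illustration; a real system uses much larger chunks

def make_morsels(rows, size=PARTITION_SIZE):
    """Split a pipeline's input into fixed-size partitions (the schedulable tasks)."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

class Dispatcher:
    """Per-socket task queues; workers prefer their own socket, then steal."""
    def __init__(self, queues_by_socket):
        self.queues = queues_by_socket   # socket id -> list of partitions
        self.lock = threading.Lock()     # stand-in for a lock-free structure

    def next_partition(self, socket):
        with self.lock:
            if self.queues.get(socket):
                return self.queues[socket].pop()   # NUMA-local work first
            for q in self.queues.values():         # otherwise steal remote work
                if q:
                    return q.pop()
        return None  # pipeline is done

def worker(dispatcher, socket, emit):
    """Run one pipeline-partition job at a time until no work remains."""
    while (p := dispatcher.next_partition(socket)) is not None:
        for row in p:
            emit(row * 2)  # hypothetical pipeline body: a trivial map stage
```

A worker bound to socket 0 drains socket 0’s queue before touching socket 1’s, which is the “better than sitting idle” behavior described above.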
Key findings / takeaways from the paper (2-3 sentences)
- My personal main takeaway from this paper is that “partitioning”, which I’ve always associated with distributed databases, is a completely valuable concept even in single-node systems. It’s as Andy mentioned in class: distributed databases are basically the same thing as single-node systems, because multi-core is equivalent to multi-server (with many caveats). But understanding single-node systems and optimizing them to their utmost is really, really important for building a great distributed database.
- Another takeaway for me is that, in order to parallelize each pipeline, every operator must be able to accept tuples in parallel, and operators that start a new pipeline must be able to produce tuples in parallel. This is obvious in hindsight. The paper describes algorithms to do this well for each operator. I was surprised by the implementation of `ORDER BY`, which is a parallel local sort plus a merge, where the merge itself is also parallel! (Very cool)
- Finally, the authors argue for thread-local scheduling, meaning there is no centralized dispatcher thread: workers pull their next task themselves, which is made efficient with lock-free data structures.
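The `ORDER BY` point can be sketched as follows: sort partitions locally in parallel, then merge runs pairwise, where each pairwise merge is itself split at a separator key so both halves are independent merge tasks. This is a simplified Python illustration, not the paper’s actual C++ algorithm; the function names are mine.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def _merge(a, b):
    """Standard sequential merge of two sorted runs."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def _parallel_merge(a, b, ex):
    """Split run `a` at its median key and find the matching split point in
    `b`, so the low and high halves of the output are independent tasks."""
    mid = len(a) // 2
    k = bisect.bisect_left(b, a[mid])
    lo = ex.submit(_merge, a[:mid], b[:k])   # everything < a[mid]
    hi = ex.submit(_merge, a[mid:], b[k:])   # everything >= a[mid]
    return lo.result() + hi.result()

def parallel_sort(rows, workers=4):
    """Phase 1: parallel local sorts. Phase 2: parallel pairwise merges."""
    chunk = max(1, -(-len(rows) // workers))  # ceiling division
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(workers) as ex:
        runs = list(ex.map(sorted, parts))
        while len(runs) > 1:
            nxt = [_parallel_merge(a, b, ex)
                   for a, b in zip(runs[::2], runs[1::2])]
            if len(runs) % 2:       # odd run out survives to the next round
                nxt.append(runs[-1])
            runs = nxt
    return runs[0] if runs else []
```

The separator trick is the interesting part: because `b` is split with `bisect_left` at `a`’s median key, the concatenation of the two sub-merges is already globally sorted, so the merge step scales with workers instead of being a sequential bottleneck.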
System used in evaluation and how it was modified/extended (1 sentence)
The authors used HyPer and tested the following scheduling mechanisms:
- Adaptive NUMA-aware scheduling
- Adaptive but non-NUMA-aware scheduling
- Non-adaptive scheduling
Workload Evaluated (1 sentence)
TPC-H and the Star Schema Benchmark (SSB).