Date: Feb 21 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/08-scheduling.pdf
Reading
- An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory (S. Schuh, et al., SIGMOD 2016)
- A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing (S. Richter, et al., VLDB 2015) (Optional)
- Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs (S. Blanas, et al., SIGMOD 2011) (Optional)
- Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware (C. Balkesen, et al., ICDE 2013) (Optional)
- To Partition, or Not to Partition, That is the Join Question in a Real System (M. Bandle, et al., SIGMOD 2021) (Optional)
Raw Lecture Notes
- brief follow-up on Work Stealing and Dynamic Scaling from last class
- (Snowflake allows resource borrowing: a shared cluster that multiple tenants can draw on for excess capacity)
- Join is the most important operator!
- Parallel Hash Join / Hash Functions / Hashing Schemes
- How can we do a 2-way join in parallel?
- 3+-way (multi way join) will be next class
- 3 types of joins:
- Hash Join (fastest in most cases)
- Sort Merge Join (slower, but perfect if data is already sorted)
- Nested-Loop Join (basically irrelevant except for tables with very few rows)
- (Fun fact: SingleStoreDB implements all 3 for different cases. Sort Merge is great for SingleStore’s columnstore because data is sorted by the sort keys.)
- Over the years, there were lots of discussions on whether Hash Join or Sort Merge Join was faster. Nowadays the consensus is Hash Join is faster.
- Another historical discussion is whether Hash Join should have a pre-partition step or not. The consensus has shifted back and forth on this over the years in academia.
- Goals:
- 1 minimize synchronization (less latches)
- 2 minimize memory access cost (increase data locality and utilization of CPU caches)
- Hash Join R ⋈ S Phases
- Phase 1 (Optional) Partition
- Phase 2 Build the HT
- Phase 3 Probe the HT, i.e., for each tuple in S, check HT, combine tuples accordingly
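The Build and Probe phases above can be sketched in a few lines (a single-threaded toy; the relation contents and key indices are made up for illustration):

```python
# Minimal sketch of the Build and Probe phases of a hash join.
def hash_join(r, s, r_key, s_key):
    """Join relations r and s (lists of tuples) on the given key positions."""
    # Phase 2: Build -- hash every tuple of R on its join key.
    ht = {}
    for tup in r:
        ht.setdefault(tup[r_key], []).append(tup)
    # Phase 3: Probe -- for each tuple in S, check the HT and combine matches.
    out = []
    for tup in s:
        for match in ht.get(tup[s_key], []):
            out.append(match + tup)
    return out

r = [(1, "a"), (2, "b")]
s = [(2, "x"), (2, "y"), (3, "z")]
print(hash_join(r, s, 0, 0))  # [(2, 'b', 2, 'x'), (2, 'b', 2, 'y')]
```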
- Partition step for Hash Joins
- goal: distribute work/data across cores
- the cost of this step must be lower than the cost of the cache misses it avoids during the Build phase; otherwise partitioning hurts performance.
- 2 general types of algorithms for partitioning
- blocking (radix)
- for Radix, we first build a histogram of key counts that is shared with all threads. Taking its prefix sum tells each thread exactly where it can safely write.
- each thread should have a local buffer and then batch writes to the global shared buffer
- the threads should ideally write to the global shared buffer using non-temporal (streaming) writes that bypass the CPU cache
- non-blocking (build can start in the meantime)
- shared partitions (latches)
- private partitions (gotta join the data at the end)
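The histogram + prefix-sum steps of radix partitioning can be sketched as follows (a single-threaded toy partitioning on the low bits of the key; in the parallel version, each thread histograms its own chunk and the prefix sum gives it a latch-free write region):

```python
# Sketch of blocking radix partitioning: histogram, prefix sum, scatter.
def radix_partition(keys, bits):
    n_parts = 1 << bits
    mask = n_parts - 1
    # 1. Histogram: count how many keys land in each partition.
    hist = [0] * n_parts
    for k in keys:
        hist[k & mask] += 1
    # 2. Prefix sum: starting write offset of each partition in the output.
    offsets = [0] * n_parts
    for p in range(1, n_parts):
        offsets[p] = offsets[p - 1] + hist[p - 1]
    # 3. Scatter: each key goes to its partition's next free slot.
    out = [None] * len(keys)
    cursor = offsets[:]
    for k in keys:
        p = k & mask
        out[cursor[p]] = k
        cursor[p] += 1
    return out, offsets

keys = [5, 2, 8, 1, 4, 7]
parts, offs = radix_partition(keys, 1)  # 2 partitions on the lowest bit
print(parts)  # [2, 8, 4, 5, 1, 7] -- even keys first, then odd keys
```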
- Build Phase
- Build R → store everything in a hash table
- hash table:
- hash function (32 bit or 64 bit int output)
- hashing scheme (how we deal with inevitable collisions)
- chained hashing
- linear probe hashing
- robin hood hashing
- hopscotch hashing (too fancy)
- cuckoo hashing (too fancy)
- We should build a HT with roughly 2x as many slots as there are elements in R (to keep the load factor low).
- Robin Hood hashing makes building the HT slower but it leads to a much more even distribution which is good for retrieval (i.e., the Probe phase)
- Hopscotch hashing and cuckoo hashing are too fancy for their own good. Simpler is better! That said, hopscotch is basically a fancier version of Robin Hood hashing.
- Hash Table Contents
- (I didn’t pay too much attention to this part…)
- But basically there's a decision to be made: store the entire tuple in the HT, or store only a pointer to the tuple held in some other data structure.