Date: Feb 21 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/08-scheduling.pdf
Reading
- An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory (S. Schuh, et al., SIGMOD 2016)
- A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing (S. Richter, et al., VLDB 2015) (Optional)
- Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs (S. Blanas, et al., SIGMOD 2011) (Optional)
- Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware (C. Balkesen, et al., ICDE 2013) (Optional)
- To Partition, or Not to Partition, That is the Join Question in a Real System (M. Bandle, et al., SIGMOD 2021) (Optional)
Raw Lecture Notes
- brief follow-up on Work Stealing and Dynamic Scaling from last class
- (Snowflake allows resource borrowing: a shared cluster that multiple tenants can draw on for excess capacity)
- Join is the most important operator!
- Parallel Hash Join / Hash Functions / Hashing Schemes
- How can we do a 2-way join in parallel?
- 3+-way (multi way join) will be next class
- 3 types of joins:
- Hash Join (fastest in most cases)
- Sort Merge Join (slower, but perfect if data is already sorted)
- Nested-Loop Join (basically irrelevant except for tables with very few rows)
- (Fun fact: SingleStoreDB implements all 3 for different cases. Sort Merge is great for SingleStore’s columnstore because data is sorted by the sort keys.)
- Over the years, there were lots of discussions on whether Hash Join or Sort Merge Join was faster. Nowadays the consensus is Hash Join is faster.
- Another historical discussion is whether Hash Join should have a pre-partition step or not. The consensus has shifted back and forth on this over the years in academia.
- Goals:
- 1 minimize synchronization (less latches)
- 2 minimize memory access cost (increase data locality and utilization of CPU caches)
- Hash Join R ⋈ S Phases
- Phase 1 (Optional) Partition
- Phase 2 Build the HT
- Phase 3 Probe the HT, i.e., for each tuple in S, check HT, combine tuples accordingly
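The Build and Probe phases above can be sketched in a few lines (a single-threaded toy; the relation contents and key indices are made up for illustration):

```python
# Minimal sketch of the Build and Probe phases of a hash join.
def hash_join(r, s, r_key, s_key):
    """Join relations r and s (lists of tuples) on the given key positions."""
    # Phase 2: Build -- hash every tuple of R on its join key.
    ht = {}
    for tup in r:
        ht.setdefault(tup[r_key], []).append(tup)
    # Phase 3: Probe -- for each tuple in S, check the HT and combine matches.
    out = []
    for tup in s:
        for match in ht.get(tup[s_key], []):
            out.append(match + tup)
    return out

r = [(1, "a"), (2, "b")]
s = [(2, "x"), (2, "y"), (3, "z")]
print(hash_join(r, s, 0, 0))  # [(2, 'b', 2, 'x'), (2, 'b', 2, 'y')]
```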
- Partition step for Hash Joins
- goal: distribute work/data across cores
- the cost of this step must be lower than the cost of the cache misses it avoids during the Build phase; otherwise partitioning hurts performance.
- 2 general types of algorithms for partitioning
- blocking (radix)
- for Radix, we first build a histogram of key counts that is shared with all threads. Taking its prefix sum tells each thread exactly where it can safely write.
- each thread should have a local buffer and then batch writes to the global shared buffer
- the threads should ideally write to the global shared buffer using non-temporal (streaming) writes that bypass the CPU cache
- non-blocking (build can start in the meantime)
- shared partitions (latches)
- private partitions (gotta join the data at the end)
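The histogram + prefix-sum steps of radix partitioning can be sketched as follows (a single-threaded toy partitioning on the low bits of the key; in the parallel version, each thread histograms its own chunk and the prefix sum gives it a latch-free write region):

```python
# Sketch of blocking radix partitioning: histogram, prefix sum, scatter.
def radix_partition(keys, bits):
    n_parts = 1 << bits
    mask = n_parts - 1
    # 1. Histogram: count how many keys land in each partition.
    hist = [0] * n_parts
    for k in keys:
        hist[k & mask] += 1
    # 2. Prefix sum: starting write offset of each partition in the output.
    offsets = [0] * n_parts
    for p in range(1, n_parts):
        offsets[p] = offsets[p - 1] + hist[p - 1]
    # 3. Scatter: each key goes to its partition's next free slot.
    out = [None] * len(keys)
    cursor = offsets[:]
    for k in keys:
        p = k & mask
        out[cursor[p]] = k
        cursor[p] += 1
    return out, offsets

keys = [5, 2, 8, 1, 4, 7]
parts, offs = radix_partition(keys, 1)  # 2 partitions on the lowest bit
print(parts)  # [2, 8, 4, 5, 1, 7] -- even keys first, then odd keys
```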
- Build Phase
- Build R → store everything in a hash table
- hash table:
- hash function (32 bit or 64 bit int output)
- hashing scheme (how we deal with inevitable collisions)
- chained hashing
- linear probe hashing
- robin hood hashing
- hopscotch hashing (too fancy)
- cuckoo hashing (too fancy)
- We should build a HT with roughly 2x as many slots as there are elements in R (to keep the load factor low).
- Robin Hood hashing makes building the HT slower but it leads to a much more even distribution which is good for retrieval (i.e., the Probe phase)
- Hopscotch hashing and cuckoo hashing are too fancy for their own good. Simpler is better! That said, hopscotch is basically a fancier version of Robin Hood hashing.
- Hash Table Contents
- (I didn’t pay too much attention to this part…)
- But basically there's a decision to be made: store the entire tuple in the HT, or store only a pointer to the tuple held in some other data structure.