Lecture #14: Query Optimizer Implementation 2

Date: Mar 20 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/14-optimizer2.pdf

Reading

**Unnesting Arbitrary Queries (T. Neumann, et al., BTW 2015)**
Dynamic Programming Strikes Back (G. Moerkotte, et al., SIGMOD 2008) (Optional)
The Complete Story of Joins (in HyPer) (T. Neumann, et al., BTW 2017) (Optional)
Adaptive Optimization ofVery Large Join Queries (T. Neumann, et al., SIGMOD 2018) (Optional)

Raw Lecture Notes

top-down optimization
- start from the outcome and then find the plan that gets us there
- volcano, cascades
bottom-up optimization
- build up the plan to get the outcome we want
agenda
- unified search
- randomized search
- real-word implementations
- unnesting subqueries
cascades optimizer (recap from last week)
- 3rd generation of a query optimizer
- key ideas:
  - patterns to match + transformation rule to apply
  - dynamic task priorities to find optimal plan more quickly
  - use same pattern/rule engine for expressions (predicates as logical/physical operators)
Key idea: Memo Table
- for every expression (which is a higher-level representation of a block of a query), store the actual physical operators (sub-plan) that have been used for that expression before and how well they have performed in the past
- this table can be used in the future to pick a physical plan for an expression (based on what’s been learned so far)
- principle of optimality: every sub-plan of an optimal plan is itself optimal
- (paper: IDEAS 2001: exploting upper and lower bounds in top-down query optimization)
cascades implementations in the real world
- sql server (most mature, no DSL actually)
- cockroach
- apache calcite (pure query engine, we studied this in the first few lectures as part of the decoupled data platform/database narrative)
SQL server
- they actually pre-populate the memo table with certain “universal truths”
Cockroach
- custom cascades implementation in 2018
- DSL (OptGen)
Postgres Genetic Optimizer
- if joining (R<>S)<>T for the first time, postgres will attribute random costs to different ways of joining these relations. As it learns, it will adjust.
Subqueries
- SQL allows a nested SELECT subquery to exist almost anywhere in another query. Examples: Projection, FROM, WHERE, LIMIT, HAVING, …
- Result of the inner query is passed to the outer query
- if the subquery is uncorrelated to the parent query, a good query optimizer will run it once and cache for all tuples that are evaluated
- THIS PAPER’s CLASS: Unnesting Arbitrary Queries
  - this is the 2015 paper, there’s a follow-up one with more details
- can we move up the subqueries so that the DBMS can execute it as join?
- Yes. an important implementation detail includes introducing a dependent join operator that denotes a correlated subquery.
- Only HyPer, Umbra and DuckDB do this fully. (Databricks might do a bit as well)
Next class we’ll start looking at changing the execution strategy during the execution