Date: Mar 20 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/14-optimizer2.pdf
Reading
Raw Lecture Notes
- top-down optimization
- start from the outcome and then find the plan that gets us there
- volcano, cascades
- bottom-up optimization
- build up the plan to get the outcome we want
- agenda
- unified search
- randomized search
- real-word implementations
- unnesting subqueries
- cascades optimizer (recap from last week)
- 3rd generation of a query optimizer
- key ideas:
- patterns to match + transformation rule to apply
- dynamic task priorities to find optimal plan more quickly
- use same pattern/rule engine for expressions (predicates as logical/physical operators)
- Key idea: Memo Table
- for every expression (which is a higher-level representation of a block of a query), store the actual physical operators (sub-plan) that have been used for that expression before and how well they have performed in the past
- this table can be used in the future to pick a physical plan for an expression (based on what’s been learned so far)
- principle of optimality: every sub-plan of an optimal plan is itself optimal
- (paper: IDEAS 2001: exploting upper and lower bounds in top-down query optimization)
- cascades implementations in the real world
- sql server (most mature, no DSL actually)
- cockroach
- apache calcite (pure query engine, we studied this in the first few lectures as part of the decoupled data platform/database narrative)
- SQL server
- they actually pre-populate the memo table with certain “universal truths”
- Cockroach
- custom cascades implementation in 2018
- DSL (OptGen)
- Postgres Genetic Optimizer
- if joining (R<>S)<>T for the first time, postgres will attribute random costs to different ways of joining these relations. As it learns, it will adjust.
- Subqueries
- SQL allows a nested SELECT subquery to exist almost anywhere in another query. Examples: Projection, FROM, WHERE, LIMIT, HAVING, …
- Result of the inner query is passed to the outer query
- if the subquery is uncorrelated to the parent query, a good query optimizer will run it once and cache for all tuples that are evaluated
- THIS PAPER’s CLASS: Unnesting Arbitrary Queries
- this is the 2015 paper, there’s a follow-up one with more details
- can we move up the subqueries so that the DBMS can execute it as join?
- Yes. an important implementation detail includes introducing a dependent join operator that denotes a correlated subquery.
- Only HyPer, Umbra and DuckDB do this fully. (Databricks might do a bit as well)
- Next class we’ll start looking at changing the execution strategy during the execution