Date: Mar 18 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/14-optimizer2.pdf
Reading
- An Overview of Query Optimization in Relational Systems (S. Chaudhuri, PODS 1998)
- The Cascades Framework for Query Optimization (G. Graefe, IEEE Data Engineering Bulletin 1995) (Optional)
- Orca: A Modular Query Optimizer Architecture for Big Data (M.A. Soliman, et al., SIGMOD 2014) (Optional)
- Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources (E. Begoli, et al., SIGMOD 2018) (Optional)
Raw Lecture Notes
- QO is the part of DBMSs that Andy knows the least about
- next 2 weeks (4 lectures) will all be about QO
- (we will learn, for instance, about adaptive query planning where the query runs for a bit and the plan is generated in chunks)
- the query planner/optimizer is very hard to build. Its goal is to find a correct physical execution plan for a query with the lowest “cost”.
- use heuristics
- use estimation techniques to guess the real plan cost along the way
- the “cost” is an internal metric the DBMS computes for itself (an estimate/heuristic, not the actual runtime; toy example below)
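- A toy illustration of what such an internal cost metric could look like; the operators, constants, and selectivities below are invented for illustration and are not taken from any real system:

```python
# Toy cost model: estimate the "cost" of scans, a filter, and a nested-loop
# join from table cardinalities and guessed selectivities. Every constant
# here is invented; real systems calibrate such numbers from statistics.

def scan_cost(num_rows: int, io_cost_per_row: float = 1.0) -> float:
    # Sequential scan: cost grows linearly with the number of rows read.
    return num_rows * io_cost_per_row

def filter_cost(input_rows: int, selectivity: float) -> tuple[float, int]:
    # A predicate keeps roughly selectivity * input_rows rows.
    output_rows = int(input_rows * selectivity)
    return input_rows * 0.01, output_rows  # small CPU cost per row evaluated

def nested_loop_join_cost(left_rows: int, right_rows: int) -> float:
    # Naive nested-loop join: every left row is compared with every right row.
    return left_rows * right_rows * 0.01

# Example: SELECT ... FROM A JOIN B ON ... WHERE A.x = 5
a_rows, b_rows = 1_000_000, 10_000
f_cost, a_filtered = filter_cost(a_rows, selectivity=0.001)
total = (scan_cost(a_rows) + f_cost + scan_cost(b_rows)
         + nested_loop_join_cost(a_filtered, b_rows))
print(f"estimated plan cost: {total:,.0f}")
```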
- some systems will not even send certain queries to the optimizer (if they’re super super simple)
- agenda
- heuristics
- heuristics + cost-based
- stratified search
- unified search
- randomized search (in practice basically only PostgreSQL does this, via its genetic optimizer GEQO, and only once a join involves roughly a dozen or more tables)
- it is possible to generate a query plan without any cost model at all; very basic, early DBMSs did this, using just heuristics
- a very simple heuristic algorithm (sketched below):
  - perform the most restrictive selection first
  - apply all selections before joins
  - push all predicates and limits down as far as possible
  - choose join orderings with a very basic comparison of the two join sides (e.g., their estimated sizes)
- this actually works pretty well; MongoDB did nothing more than this for many years
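- A rough sketch of such a heuristic-only planner; the Table class, the selectivities, and the join-size guess are all illustrative, not any real system's API:

```python
# Heuristic-only planning: no cost model. Push predicates down to their
# tables, apply the most restrictive predicate first, and order joins from
# the smallest input upward (a left-deep plan). Everything is illustrative.
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    est_rows: int
    predicates: list = field(default_factory=list)  # (predicate, est_selectivity)

def plan(tables: list[Table]) -> list[str]:
    steps = []
    for t in tables:
        # Predicate pushdown: filter each table before any join,
        # most restrictive (lowest selectivity) predicate first.
        for pred, sel in sorted(t.predicates, key=lambda p: p[1]):
            steps.append(f"FILTER {t.name} ON {pred}")
            t.est_rows = int(t.est_rows * sel)
    # Join ordering heuristic: start from the smallest input and join the
    # rest in increasing size order, based only on a comparison of sizes.
    remaining = sorted(tables, key=lambda t: t.est_rows)
    left = remaining.pop(0)
    while remaining:
        right = remaining.pop(0)
        steps.append(f"JOIN {left.name} WITH {right.name}")
        # Crude stand-in for the join's output size (no real estimation).
        left = Table(f"({left.name} JOIN {right.name})",
                     max(left.est_rows, right.est_rows))
    return steps

for step in plan([
    Table("orders", 1_000_000, [("status = 'open'", 0.1), ("amount > 100", 0.5)]),
    Table("customers", 50_000, [("country = 'US'", 0.2)]),
    Table("regions", 100),
]):
    print(step)
```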
- but then came cost-based optimizer + heuristics
- 2 high-level approaches:
  - Bottom-up: start with nothing and then iteratively assemble and add building blocks to generate a query plan.
  - Top-down: start with the outcome that the query wants, and then transform it into equivalent alternative sub-plans to find the optimal plan that gets to that goal.
- IBM System R went with the first (bottom-up) approach: generate a set of physical operators for each “logical block” of the query, then iteratively choose one option based on the estimated amount of work to execute the plan (a simplified dynamic-programming sketch of this style of join enumeration is below)
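- A drastically simplified, Selinger/System R-style bottom-up join enumeration; the cost and cardinality formulas are toy placeholders, and the real System R also tracks physical operators and interesting orders:

```python
# Bottom-up (System R style) join enumeration: compute the best plan for
# every subset of tables, smallest subsets first, reusing the best sub-plans.
# Cost and output-cardinality formulas below are toy placeholders.
from itertools import combinations

def best_join_order(card: dict[str, int]) -> tuple[float, str]:
    tables = sorted(card)
    # best[subset] = (estimated cost, plan description, estimated output rows)
    best = {frozenset([t]): (float(card[t]), t, card[t]) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            candidates = []
            # Split the subset into (left, right) and combine the best sub-plans.
            for right_size in range(1, size):
                for right in combinations(subset, right_size):
                    r = frozenset(right)
                    l = s - r
                    lc, lp, lrows = best[l]
                    rc, rp, rrows = best[r]
                    rows = max(lrows * rrows // 10, 1)    # toy join cardinality
                    cost = lc + rc + lrows * rrows / 100  # toy join cost
                    candidates.append((cost, f"({lp} JOIN {rp})", rows))
            best[s] = min(candidates)
    total_cost, plan_str, _ = best[frozenset(tables)]
    return total_cost, plan_str

total_cost, order = best_join_order({"A": 1000, "B": 200, "C": 50})
print(f"{order}  (est. cost {total_cost:.0f})")
```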
- But this optimization process can, in principle, run for an unbounded amount of time. We can do a few things (or a combination of them), as in the loop sketched after this list:
- Stop the optimizer after a certain fixed amount of wall-clock time.
- Stop when the optimizer finds a plan that has a lower estimated cost than some threshold.
- Stop when the plan space has been fully enumerated (no more alternative plans left to consider).
- Stop after a certain number of transformations have been considered.
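- A minimal sketch of how such termination policies could be wired into a transformation-based search loop; the function and parameter names are hypothetical, not from any real optimizer:

```python
# Hypothetical search loop combining several termination policies: a wall-clock
# budget, a "good enough" cost threshold, a cap on the number of transformations
# considered, and exhaustion of the plan space (empty frontier).
import time

def optimize(initial_plan, apply_transformations, estimate_cost,
             time_budget_s=0.05, good_enough_cost=100.0, max_transforms=10_000):
    start = time.monotonic()
    best_plan, best_cost = initial_plan, estimate_cost(initial_plan)
    transforms = 0
    frontier = [initial_plan]
    while frontier:                                    # stop: plan space exhausted
        if time.monotonic() - start > time_budget_s:   # stop: out of time
            break
        if best_cost <= good_enough_cost:              # stop: plan is cheap enough
            break
        if transforms >= max_transforms:               # stop: transformation cap hit
            break
        plan = frontier.pop()
        for alt in apply_transformations(plan):
            transforms += 1
            cost = estimate_cost(alt)
            if cost < best_cost:
                best_plan, best_cost = alt, cost
            frontier.append(alt)
    return best_plan, best_cost

# Demo with toy "plans" (just integers) where a transformation halves the cost.
plan, cost = optimize(
    initial_plan=1024,
    apply_transformations=lambda p: [p // 2] if p > 1 else [],
    estimate_cost=float,
)
print(plan, cost)
```

- (A real optimizer would also de-duplicate equivalent plans, e.g., with a memo table, so the frontier can actually be exhausted.)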
- Writing a query optimizer directly in procedural code is extremely hard and error-prone: the code that maps logical operators to physical operators becomes decoupled from the deeper semantics of the query, which can very easily lead to incorrect query plans.
- So, a better approach is to come up with some sort of DSL for writing the transformation rules and then have the optimizer apply them during planning.
- The DSL declares the patterns we want to optimize for, and the optimizer takes those pattern/transformation rules and uses them to generate the right query plans (see the toy rule sketch below).
- So the concept of “optimizer generator” arises…
- And both stratified search and unified search follow this model
- (they codegen the optimizer based on some higher-level patterns)
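- A toy sketch of what such declarative transformation rules could look like: each rule declares a pattern it matches and a transform producing an equivalent plan, and a tiny driver applies rules to a fixpoint. The Node/Rule classes are invented for illustration; real frameworks (Cascades, Orca, Calcite) have richer rule APIs:

```python
# Toy rule-based rewriting: a rule = a pattern (which plan shapes it matches)
# plus a transform (the equivalent plan it produces). All classes are invented.
from dataclasses import dataclass

@dataclass
class Node:
    op: str               # e.g. "FILTER", "JOIN", "SCAN"
    children: tuple = ()
    args: tuple = ()

class Rule:
    def matches(self, node: Node) -> bool: ...
    def transform(self, node: Node) -> Node: ...

class PushFilterBelowJoin(Rule):
    """FILTER(JOIN(a, b), pred-on-left)  =>  JOIN(FILTER(a, pred), b)"""
    def matches(self, node: Node) -> bool:
        return bool(node.op == "FILTER" and node.children
                    and node.children[0].op == "JOIN"
                    and node.args and node.args[0] == "left")  # predicate touches left side only
    def transform(self, node: Node) -> Node:
        join = node.children[0]
        left, right = join.children
        return Node("JOIN", (Node("FILTER", (left,), node.args), right))

def rewrite(plan: Node, rules: list[Rule]) -> Node:
    # Rewrite children first, then apply the first matching rule and repeat
    # until no rule matches (a fixpoint).
    plan = Node(plan.op, tuple(rewrite(c, rules) for c in plan.children), plan.args)
    for rule in rules:
        if rule.matches(plan):
            return rewrite(rule.transform(plan), rules)
    return plan

query = Node("FILTER",
             (Node("JOIN", (Node("SCAN", args=("A",)), Node("SCAN", args=("B",)))),),
             ("left", "A.x = 5"))
print(rewrite(query, [PushFilterBelowJoin()]))
```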
- stratified search runs the heuristic rewrites as a first phase and only then does the cost-based search; unified search performs all transformations inside a single cost-based search (contrasted in the sketch below)
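- A toy contrast of the two styles, using strings as stand-in “plans” and plain functions as rules; the rules, cost numbers, and plan encoding are all invented purely to show the phase structure:

```python
# Stratified vs. unified search over a tiny, fake plan space. Plans are strings,
# rules map a plan to zero or more alternatives, cost() is a made-up metric.

def push_filter(plan):
    # "Logical" rewrite rule: push the filter below the join (exact-match toy).
    return ["LogicalJoin(Filter(A),B)"] if plan == "Filter(LogicalJoin(A,B))" else []

def use_hash_join(plan):
    # "Physical" implementation rule: implement a logical join as a hash join.
    return [plan.replace("LogicalJoin", "HashJoin", 1)] if "LogicalJoin" in plan else []

def cost(plan):
    if "HashJoin(Filter" in plan: return 1    # filtered before the join: cheapest
    if "HashJoin" in plan:        return 10
    return 100                                # still purely logical: not executable

def apply_until_fixpoint(plan, rules):
    # Stratified phase 1: rewrite rules are assumed to always help, so each
    # result simply replaces the plan; nothing is costed, no alternatives kept.
    changed = True
    while changed:
        changed = False
        for rule in rules:
            alts = rule(plan)
            if alts and alts[0] != plan:
                plan, changed = alts[0], True
    return plan

def explore(start, rules):
    # Exhaustively apply rules, keeping every reachable alternative.
    seen, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for rule in rules:
            for alt in rule(p):
                if alt not in seen:
                    seen.add(alt)
                    frontier.append(alt)
    return seen

logical = "Filter(LogicalJoin(A,B))"

# Stratified: heuristic rewrite phase first, then a cost-based phase.
rewritten = apply_until_fixpoint(logical, [push_filter])
stratified_best = min(explore(rewritten, [use_hash_join]), key=cost)

# Unified: both kinds of rules applied inside one search, everything costed.
unified_best = min(explore(logical, [push_filter, use_hash_join]), key=cost)

print(stratified_best, unified_best)
```

- (In this toy case both styles find the same plan; the usual argument for unified search is that a separate heuristic phase applies rewrites without cost information and can therefore commit to a rewrite that rules out the cheapest overall plan.)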
- Cascades