Date: Mar 13 2024
Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/12-networking.pdf
Reading
- Don’t Hold My Data Hostage – A Case For Client Protocol Redesign (M. Raasveldt, et al., VLDB 2017)
- ConnectorX: Accelerating Data Loading From Databases to Dataframes (X. Wang, et al., VLDB 2022) (Optional)
- Tigger: A Database Proxy That Bounces With User-Bypass (M. Butrovich, et al., VLDB 2023) (Optional)
- Accelerating Relational Databases by Leveraging Remote Memory and RDMA (F. Li, et al., SIGMOD 2016) (Optional)
- The End of Slow Networks: It's Time for a Redesign (C. Binnig, et al., VLDB 2016) (Optional)
Raw Lecture Notes
- How can we bring data from the database to the application so the application can process it?
- We’ll talk about Apache Arrow (this class’s paper came before the Arrow protocol, but the point is the same)
- We’ll also talk about kernel/user bypass methods and other optimizations we can make client-side
- ODBC / JDBC
- ODBC
- Standard API for accessing a DBMS. Designed to be independent of the DBMS and OS.
- Originally developed in the early 1990s by Microsoft and Simba Technologies.
- Every major DBMS has an ODBC implementation.
- JDBC → basically ODBC but for Java instead of C
- Most databases implement their own proprietary TCP protocol
- results typically get serialized, there may be a TLS handshake, and there is almost certainly some authentication
- some systems have cursors and send results in real-time while the query is running
- A lot of new DBMSs just reuse the wire protocol of an existing DBMS (MySQL or Postgres)
- Protocol Design Space
- row vs. column layout
- ODBC/JDBC are row-oriented
- a PAX model is probably the best solution here (Arrow does this)
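To make the row vs. column vs. PAX distinction concrete, here is a minimal sketch in Python. The function name `to_pax_chunks` and the `chunk_size` parameter are made up for illustration; this is not how Arrow or any DBMS actually lays out its buffers, only the general idea of splitting rows into groups and storing each group column by column.

```python
from typing import Any, List, Tuple

Row = Tuple[Any, ...]


def to_pax_chunks(rows: List[Row], chunk_size: int) -> List[List[List[Any]]]:
    """Split rows into groups of chunk_size; store each group column by column.

    Hypothetical helper for illustration only.
    """
    chunks = []
    for start in range(0, len(rows), chunk_size):
        group = rows[start:start + chunk_size]
        # zip(*group) transposes the row group into its columns.
        chunks.append([list(column) for column in zip(*group)])
    return chunks


# Row-oriented result set, as an ODBC/JDBC-style client would see it.
rows = [(1, "a"), (2, "b"), (3, "c")]

# PAX-style: rows are grouped, and values inside each group are columnar.
chunks = to_pax_chunks(rows, chunk_size=2)
# chunks == [[[1, 2], ["a", "b"]], [[3], ["c"]]]
```

With `chunk_size` equal to 1 this degenerates to a row layout, and with `chunk_size` equal to the number of rows it becomes a pure column layout, which is why the chunked form sits between the two.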
- compression
- Snowflake, Oracle, MySQL do something generic like LZ4
- the best solution here is encoding (which Arrow ADBC does), unless the network is very slow in which case compression is better
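A rough sketch of the difference between encoding and general-purpose compression follows. The notes do not say which encodings are meant, so dictionary and run-length encoding are assumed here as typical examples. `zlib` stands in for LZ4 only because it ships with Python; it is a different algorithm.

```python
import zlib
from typing import Any, List, Tuple


def dictionary_encode(values: List[Any]) -> Tuple[List[Any], List[int]]:
    """Replace each value with a small integer code into a dictionary."""
    dictionary: List[Any] = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[Any, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


column = ["US", "US", "US", "DE", "DE", "US"]

# Encoding: the client can still read values without a decompression pass.
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

# General-purpose compression: opaque bytes that must be decompressed first.
raw = ",".join(column).encode("utf-8")
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)
```

The encoded form stays cheap to produce and consume, which matters when the network is fast and CPU is the bottleneck. On a slow network, spending more CPU on heavier compression to shrink the payload can be the better trade.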
- data serialization
- (option 1) binary encoding
- DBMS can implement its own or use Protocol Buffers or FlatBuffers, but the closer to the internal representation of the data the better because there will be less conversion overhead
- (option 2) text encoding
- everything comes as a string, so a missing value could just be “NULL”
- endianness is not an issue either
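The two serialization options can be compared with a small sketch. The helper names below are hypothetical, and the 4-byte little-endian integer format is an assumption chosen for the example; real wire protocols define their own formats.

```python
import struct
from typing import Optional


def encode_binary(value: int) -> bytes:
    # "<i" is a little-endian signed 32-bit integer. Client and server must
    # agree on the byte order, which is where endianness becomes an issue.
    return struct.pack("<i", value)


def decode_binary(payload: bytes) -> int:
    return struct.unpack("<i", payload)[0]


def encode_text(value: Optional[int]) -> bytes:
    # Text encoding: every value is a string, and a missing value is "NULL".
    return b"NULL" if value is None else str(value).encode("ascii")


def decode_text(payload: bytes) -> Optional[int]:
    # The client pays a string-to-integer parse for every value.
    return None if payload == b"NULL" else int(payload.decode("ascii"))


value = 1234567
binary = encode_binary(value)  # always 4 bytes
text = encode_text(value)      # one byte per digit, 7 bytes here

# The same integer packed big-endian gives different bytes.
big_endian = struct.pack(">i", value)
```

The binary form is fixed-size and needs no parsing, but both sides have to agree on width and byte order. The text form avoids that agreement at the cost of conversion work and, for large numbers, more bytes on the wire.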
- string handling
- (option 1) null termination
- (option 2) length prefixes
- (option 3) fixed width
- which is fastest? depends on the strings, no universal answer really
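The three string-handling options can be sketched as follows. All function names are hypothetical, and the 4-byte little-endian length prefix and zero-byte padding are assumptions for the example, not any specific DBMS's format.

```python
import struct
from typing import Tuple


def encode_null_terminated(s: str) -> bytes:
    data = s.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("string contains the terminator byte")
    return data + b"\x00"


def decode_null_terminated(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    # The reader must scan for the terminator to find the end of the string.
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("utf-8"), end + 1


def encode_length_prefixed(s: str) -> bytes:
    data = s.encode("utf-8")
    # Assumed format: 4-byte little-endian unsigned length, then the bytes.
    return struct.pack("<I", len(data)) + data


def decode_length_prefixed(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length].decode("utf-8"), start + length


def encode_fixed_width(s: str, width: int) -> bytes:
    data = s.encode("utf-8")
    if len(data) > width:
        raise ValueError("string is longer than the fixed width")
    # Pad with zero bytes up to the declared width.
    return data.ljust(width, b"\x00")


def decode_fixed_width(buf: bytes, width: int, offset: int = 0) -> Tuple[str, int]:
    field = buf[offset:offset + width]
    return field.rstrip(b"\x00").decode("utf-8"), offset + width


short = "ab"
sizes = {
    "null_terminated": len(encode_null_terminated(short)),  # 2 + 1 terminator
    "length_prefixed": len(encode_length_prefixed(short)),  # 4 + 2
    "fixed_width_8": len(encode_fixed_width(short, 8)),     # always 8
}
```

For a short string, the length prefix and the fixed-width padding cost more bytes than the payload itself, while null termination costs one byte but forces a scan on decode. For long or uniformly sized strings the balance shifts, which is one way to see why there is no universal answer.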
- (In the lambda function model, because clients are paying per CPU time, it’s very important that the protocol is lean.)
- https://arrow.apache.org/docs/format/ADBC.html Arrow Database Connectivity (ADBC) specification
- the Postgres wire protocol does not have any type of compression (but MySQL and Oracle do)