Lecture #12: Networking Protocols

Date: Mar 13 2024

Slides: https://15721.courses.cs.cmu.edu/spring2024/slides/12-networking.pdf

Reading

Don’t Hold My Data Hostage – A Case For Client Protocol Redesign (M. Raasveldt, et al., VLDB 2017)
ConnectorX: Accelerating Data Loading From Databases to Dataframes (X. Wang, et al., VLDB 2022) (Optional)
Tigger: A Database Proxy That Bounces With User-Bypass (M. Butrovich, et al., VLDB 2023) (Optional)
Accelerating Relational Databases by Leveraging Remote Memory and RDMA (F. Li, et al., SIGMOD 2016) (Optional)
The End of Slow Networks: It's Time for a Redesign (C. Binnig, et al., VLDB 2016) (Optional)

Raw Lecture Notes

How can we bring data from the database to the application so that the application can process it?
- We’ll talk about Apache Arrow (this class’s paper came before the Arrow protocol, but the point is the same)
We’ll also talk about kernel/user bypass methods and other optimizations we can make client-side
ODBC / JDBC
ODBC
- Standard API for accessing a DBMS. Designed to be independent of the DBMS and OS.
- Originally developed in the early 1990s by Microsoft and Simba Technologies.
- Every major DBMS has an ODBC implementation.
JDBC → basically ODBC but for Java instead of C
Most databases implement their own proprietary TCP protocol
- results typically get serialized, there may be a TLS handshake, and there is almost certainly some authentication
- some systems have cursors and send results in real-time while the query is running
A lot of new DBMSs just implement a wire protocol of existing DBs (MySQL or Postgres)
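To make the idea of a wire protocol concrete, here is a minimal sketch of how one result row is framed in the PostgreSQL v3 protocol's DataRow message: a type byte, a 4-byte length, a 2-byte column count, then one length-prefixed value per column, with length -1 marking NULL. The function names `encode_data_row` and `decode_data_row` are made up for illustration, and the sketch assumes every column is sent in text format.

```python
import struct


def encode_data_row(values):
    """Frame one row as a PostgreSQL v3 DataRow ('D') message.

    `values` is a list of str or None. Text format is assumed for every column.
    """
    body = struct.pack("!h", len(values))  # Int16: number of columns
    for v in values:
        if v is None:
            body += struct.pack("!i", -1)  # NULL: length -1, no value bytes
        else:
            data = v.encode("utf-8")
            body += struct.pack("!i", len(data)) + data
    # The Int32 length counts itself (4 bytes) but not the type byte.
    return b"D" + struct.pack("!i", len(body) + 4) + body


def decode_data_row(msg):
    """Parse a DataRow message back into a list of str or None."""
    if msg[0:1] != b"D":
        raise ValueError("not a DataRow message")
    (length,) = struct.unpack_from("!i", msg, 1)
    if length != len(msg) - 1:
        raise ValueError("length field does not match message size")
    (ncols,) = struct.unpack_from("!h", msg, 5)
    offset = 7
    values = []
    for _ in range(ncols):
        (n,) = struct.unpack_from("!i", msg, offset)
        offset += 4
        if n == -1:
            values.append(None)
        else:
            values.append(msg[offset:offset + n].decode("utf-8"))
            offset += n
    return values


row = ["1", "abc", None]
message = encode_data_row(row)
decoded = decode_data_row(message)
```

Every value is sent row by row with its own length prefix, which is the row-oriented design discussed in the protocol design space.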
Protocol Design Space
- row vs. column layout
  - ODBC/JDBC are row-oriented
  - a PAX model is probably the best solution here (Arrow does this)
- compression
  - Snowflake, Oracle, and MySQL do something generic like LZ4
  - the best solution here is encoding (which Arrow ADBC does), unless the network is very slow, in which case compression is better
- data serialization
  - (option 1) binary encoding
    - the DBMS can implement its own format or use Protocol Buffers or FlatBuffers, but the closer the format is to the internal representation of the data the better, because there will be less conversion overhead
  - (option 2) text encoding
    - everything is sent as a string, so a missing value could just be “NULL”
    - endianness is not an issue either
- string handling
  - (option 1) null termination
  - (option 2) length-prefixes
  - (option 3) fixed width
  - which is fastest? depends on the strings, no universal answer really
(In the lambda function model, because clients are paying per CPU time, it’s very important that the protocol is lean.)
Arrow Database Connectivity (ADBC): https://arrow.apache.org/docs/format/ADBC.html
- Snowflake supports this
The Postgres wire protocol does not have any type of compression (but MySQL and Oracle do)
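To illustrate the difference between generic compression and encoding from the design-space notes, the sketch below sends the same integer column three ways: raw, generically compressed, and run-length encoded. Two assumptions: `zlib` stands in for LZ4 only because it ships with the Python standard library, and the run-length scheme (pairs of value and run length) is a generic illustration, not the format Arrow or any DBMS actually uses.

```python
import struct
import zlib


def rle_encode(values):
    """Run-length encode int32 values as (value, run_length) pairs, 8 bytes each."""
    out = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += struct.pack("!iI", values[i], j - i)
        i = j
    return bytes(out)


def rle_decode(buf):
    """Expand (value, run_length) pairs back into the original list."""
    values = []
    for offset in range(0, len(buf), 8):
        v, n = struct.unpack_from("!iI", buf, offset)
        values.extend([v] * n)
    return values


# A low-cardinality column with long runs, e.g. a sorted status code.
column = [1] * 1000 + [2] * 500 + [3] * 2500

raw = struct.pack(f"!{len(column)}i", *column)  # 4 bytes per value
compressed = zlib.compress(raw)                 # generic byte-level compression
encoded = rle_encode(column)                    # type-aware encoding

print(len(raw), len(compressed), len(encoded))
```

The encoded form is three runs of 8 bytes each, and the receiver can expand it (or operate on the runs directly) without a decompression pass over every byte. On data without runs, this encoding would be larger than the raw column, so the choice depends on the data and on how slow the network is.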