Don’t Hold My Data Hostage – A Case For Client Protocol Redesign

Overview of the main idea (3 sentences)

The main idea behind this paper is that exporting data from a database (materializing the result set, serializing it, and sending it over the wire) is very expensive. For many queries, the export costs more than the query execution itself. This is a real problem for ML and other data science use cases, because not all computation can or should be moved into the database. So the authors examine the most common database wire protocols, benchmark them thoroughly, and look for areas where they could be improved. They then implement a new protocol on top of MonetDB and PostgreSQL and evaluate their improvements.

Key findings / takeaways from the paper (2-3 sentences)

A PAX layout (rows split into chunks, with each chunk stored column by column) is probably the best option for how rows should be streamed, instead of a pure row store or pure column store layout.
Other smaller optimizations should be made to existing protocols.
In terms of the existing protocols, the MySQL one seems to be quite good all around! (But the Postgres one is much more widely used)
In general, database protocols are not well optimized for modern workloads (they were fine when all databases were OLTP-only and result sets were never very large). It’s probably time for a do-over: a lot of small things need to be rethought. This change will be painful (because of backwards compatibility) but worth it.
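
To make the PAX takeaway above concrete, here is a minimal sketch that contrasts a row-at-a-time encoding with a chunked, column-within-chunk encoding. The wire layout is an assumption made purely for illustration (int32 values only, a 4-byte length prefix per field in the row format, a 4-byte row count per chunk in the PAX format); it is not the paper's actual protocol, and the function names are hypothetical.

```python
import struct


def serialize_row_major(rows):
    """Row-at-a-time: every int32 field carries a 4-byte length prefix (assumed layout)."""
    out = bytearray()
    for row in rows:
        for value in row:
            out += struct.pack("<ii", 4, value)  # length prefix, then the value
    return bytes(out)


def serialize_pax(rows, chunk_rows):
    """PAX-style: split rows into chunks and store each column contiguously per chunk."""
    out = bytearray()
    ncols = len(rows[0]) if rows else 0
    for start in range(0, len(rows), chunk_rows):
        chunk = rows[start:start + chunk_rows]
        out += struct.pack("<i", len(chunk))  # chunk header: number of rows in this chunk
        for col in range(ncols):
            # All values of one column, back to back, with no per-value overhead.
            out += struct.pack(f"<{len(chunk)}i", *(row[col] for row in chunk))
    return bytes(out)


def deserialize_pax(buf, ncols):
    """Rebuild the row tuples from the chunked column layout."""
    rows = []
    offset = 0
    while offset < len(buf):
        (n,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        columns = []
        for _ in range(ncols):
            columns.append(struct.unpack_from(f"<{n}i", buf, offset))
            offset += 4 * n
        rows.extend(zip(*columns))  # transpose the columns back into rows
    return rows


rows = [(i, i * 2, i * 3) for i in range(10)]
row_bytes = serialize_row_major(rows)
pax_bytes = serialize_pax(rows, chunk_rows=4)
print(len(row_bytes), len(pax_bytes))
```

In this toy layout the per-field prefix doubles the size of the row-major output, while the chunked layout only pays one small header per chunk. The chunk size bounds how much the sender and receiver need to buffer at once, which is the tradeoff a pure column-at-a-time format lacks.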

System used in evaluation and how it was modified/extended (1 sentence)

The authors test a variety of DBMSs (Oracle, MonetDB, PostgreSQL, MySQL), and they specifically modify PostgreSQL and MonetDB with a new serializer.

Workload Evaluated (1 sentence)

lineitem from the TPC-H benchmark
American Community Survey dataset (274 columns, with the majority of type INTEGER)
Airline On-Time Statistics (109 columns, 10 million rows, 3.6GB as CSV)

Raw Notes

this paper is from 2017
the authors show that different DBMSs have wildly different wall clock times for returning the same amount of data
- furthermore, for this specific query the return time is MUCH larger than the execution time
returning large results is indeed important. Not all computation can and should be moved to the DBMS
they will implement in pg and monetdb
this paper will not focus on odbc/jdbc drivers, just on time to get the query results with the raw protocol