Overview of the main idea (3 sentences)
Andy and team study the high-level design of both Parquet and ORC. Then, they run a series of workload tests against both file formats. The lessons-learnt section at the end is a great starting point for anybody designing a replacement for Parquet/ORC.
Key findings / takeaways from the paper (2-3 sentences)
- Beware of block compression!
- Dictionary Encoding is king
- More filtering: zone maps and other filters could be used to skip irrelevant data and accelerate query processing on top of Parquet/ORC files or future formats
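To make two of these takeaways concrete, here is a minimal Python sketch of dictionary encoding and of zone-map block skipping. It is an illustration written for these notes, not code from the paper and not how Parquet or ORC implement either technique internally; the names `dictionary_encode`, `ZoneMap`, `build_zone_maps`, and `blocks_to_scan`, the sample data, and the block size of 3 are all assumptions.

```python
from dataclasses import dataclass


def dictionary_encode(values):
    """Replace each value with a small integer code, assigned in order of first appearance."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    # The dict preserves insertion order, so its keys double as the code -> value lookup list.
    return list(dictionary), codes


@dataclass
class ZoneMap:
    lo: int  # minimum value in the block
    hi: int  # maximum value in the block


def build_zone_maps(column, block_size):
    """Record the min and max of each fixed-size block of the column."""
    return [
        ZoneMap(min(column[i:i + block_size]), max(column[i:i + block_size]))
        for i in range(0, len(column), block_size)
    ]


def blocks_to_scan(zone_maps, lo, hi):
    """Return indices of blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return [i for i, z in enumerate(zone_maps) if z.hi >= lo and z.lo <= hi]


# Hypothetical sample data, purely for illustration.
cities = ["NYC", "SF", "NYC", "NYC", "LA", "SF"]
dictionary, codes = dictionary_encode(cities)

ages = [21, 25, 23, 40, 45, 41, 60, 66, 62]
zone_maps = build_zone_maps(ages, block_size=3)
survivors = blocks_to_scan(zone_maps, 42, 50)  # only blocks that may hold ages in [42, 50]
```

In this toy example the string column shrinks to a three-entry dictionary plus a list of small integers, and a range predicate on `ages` only needs to read one of the three blocks; the other two are skipped based on their min/max alone.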
System used in evaluation and how it was modified/extended (1 sentence)
Parquet and ORC were evaluated as-is, without modification.
Workload Evaluated (1 sentence)
Several datasets were evaluated:
- Public BI Benchmark
- ClickHouse docs/tutorials
- UCI-ML
- Yelp
- LOG
- Geonames
- IMDb