Justin's Linklog

Questioning an Interface: From Parquet to Vortex

Interesting -- a new, GPU-optimised storage format:

Like Parquet, Vortex minimizes bytes on disk. However, Vortex is also designed with a core use-case in mind: decoding and querying data directly from object storage on GPUs. This key idea translates very well to our use-case even though we don’t run our queries on GPUs (yet?). Specifically, the file format is designed to maximize throughput and parallelism from the metadata format to the SIMD/SIMT friendly encodings used.

Crucially, it also acknowledges that part of making queries fast is not only good filter pushdown, but also general-purpose compute pushdown. If anything cannot be pushed down, Vortex’s encodings can be tuned to offer zero-copy conversion to Arrow for further query execution using any general-purpose query execution engine.
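As a rough illustration of what filter pushdown buys you, here is a minimal sketch in plain Python. It is not Vortex's or Parquet's API: the `Chunk` class, `make_chunks` and `scan_greater_than` are hypothetical names made up for this example. It assumes each chunk of a column carries min/max statistics, so a reader can skip chunks that cannot match a predicate without decoding them.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical column chunk with min/max statistics stored alongside it."""
    values: list[int]
    lo: int
    hi: int


def make_chunks(values: list[int], size: int) -> list[Chunk]:
    """Split a column into fixed-size chunks and record min/max per chunk."""
    chunks = []
    for i in range(0, len(values), size):
        part = values[i:i + size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks: list[Chunk], threshold: int) -> tuple[list[int], int]:
    """Return matching values and the number of chunks that had to be read."""
    matches = []
    decoded = 0
    for chunk in chunks:
        # Pushdown step: the statistics alone prove nothing here can match.
        if chunk.hi <= threshold:
            continue
        decoded += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, decoded


chunks = make_chunks(list(range(100)), 10)
matches, decoded = scan_greater_than(chunks, 84)
print(f"read {decoded} of {len(chunks)} chunks, {len(matches)} matching rows")
```

The same idea extends to pushing other compute down to the storage layer; the sketch only covers the simplest case of a single comparison on sorted-ish data.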

Vortex also learns from Parquet’s limitations around extensibility and aims to be as future-proof as possible. New encodings can ship with WASM decoders so encoding adoption is not limited by reader libraries having to implement support. The main Rust library is also designed to be fully extensible, so you can write your own layouts/encodings and plug them in as first-class citizens.
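To make the "plug in your own encodings" idea concrete, here is a small sketch of a decoder registry in plain Python. This is an assumption-laden toy, not the Vortex Rust API and not how its WASM decoders are shipped: `DECODERS`, `register` and `decode` are hypothetical names, and run-length encoding stands in for a "new" encoding. The point is only that the reader looks decoders up by name, so adding an encoding does not require changing the reader itself.

```python
from typing import Callable

# Hypothetical registry mapping an encoding name to its decoder function.
DECODERS: dict[str, Callable[[list], list]] = {}


def register(name: str):
    """Decorator that registers a decoder under the given encoding name."""
    def wrap(fn):
        DECODERS[name] = fn
        return fn
    return wrap


@register("plain")
def decode_plain(payload: list) -> list:
    """Values are stored as-is."""
    return list(payload)


@register("rle")
def decode_rle(payload: list) -> list:
    """Run-length encoding: payload is a list of (value, run_length) pairs."""
    out = []
    for value, count in payload:
        out.extend([value] * count)
    return out


def decode(encoding: str, payload: list) -> list:
    """Dispatch to whichever decoder is registered for this encoding."""
    try:
        return DECODERS[encoding](payload)
    except KeyError:
        raise ValueError(f"no decoder registered for {encoding!r}") from None


column = decode("rle", [("a", 3), ("b", 2)])
print(column)
```

A reader built this way fails loudly on an unknown encoding name rather than misreading the data, which is roughly the problem that shipping a decoder alongside the data is meant to avoid.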

Given how well Vortex’s design matched our needs, we tried it out and got a 70% average performance improvement on all our queries. With the newer encodings that Vortex offers, we got 10% better uncompressed storage size and only 3% larger compressed storage size compared to snappy-compressed Parquet.

Tags: gpu vortex parquet compression storage file-formats files pushdown simd