Justin's Linklog

DeepSeek’s smallpond

Some interesting notes about smallpond, a new high-performance DuckDB-based distributed data lake query system from DeepSeek:

DeepSeek is introducing smallpond, a lightweight open-source framework, leveraging DuckDB to process terabyte-scale datasets in a distributed manner. Their benchmark states: “Sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min.”

The benchmark on 100TB mentioned is actually using the custom DeepSeek 3FS framework: Fire-Flyer File System is a high-performance distributed file system designed to address the challenges of AI training and inference workloads. [...] compared to AWS S3, 3FS is built for speed, not just storage. While S3 is a reliable and scalable object store, it comes with higher latency and eventual consistency [...] 3FS, on the other hand, is a high-performance distributed file system that leverages SSDs and RDMA networks to deliver low-latency, high-throughput storage. It supports random access to training data, efficient checkpointing, and strong consistency.

So -- this is very impressive. However!
1. RDMA (remote direct memory access) networking for a large-scale storage system! That is absolutely bananas. I wonder how much that benchmark cluster cost to run... still, this is a very interesting technology for massive-scale super-low-latency storage. https://www.definite.app/blog/smallpond also notes "3FS achieves a remarkable read throughput of 6.6 TiB/s on a 180-node cluster, which is significantly higher than many traditional distributed file systems."
2. it seems smallpond operates strictly with partition-level parallelism, so if your data isn't partitioned in exactly the right way, you may still find your query bottlenecked:
Smallpond’s distribution leverages Ray Core at the Python level, using partitions for scalability. Partitioning can be done manually, and Smallpond supports:
- Hash partitioning (based on column values);
- Even partitioning (by files or row counts);
- Random shuffle partitioning
As I understand it, Trino has a better idea of how to scale out queries across worker nodes even without careful pre-partitioning, which is handy.

Tags: data-lakes deepseek duckdb rdma networking 3fs smallpond trino ray

Archives

DeepSeek’s smallpond