Skip to content

Archives

DeepSeek’s smallpond

  • DeepSeek’s smallpond

    Some interesting notes about smallpond, a new high-performance DuckDB-based distributed data lake query system from DeepSeek:

    DeepSeek is introducing smallpond, a lightweight open-source framework, leveraging DuckDB to process terabyte-scale datasets in a distributed manner. Their benchmark states: “Sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min.”

    The benchmark on 100TB mentioned is actually using the custom DeepSeek 3FS framework: Fire-Flyer File System is a high-performance distributed file system designed to address the challenges of AI training and inference workloads. […] compared to AWS S3, 3FS is built for speed, not just storage. While S3 is a reliable and scalable object store, it comes with higher latency and eventual consistency […] 3FS, on the other hand, is a high-performance distributed file system that leverages SSDs and RDMA networks to deliver low-latency, high-throughput storage. It supports random access to training data, efficient checkpointing, and strong consistency.

    So — this is very impressive. However!

    1. RDMA (remote direct memory access) networking for a large-scale storage system! That is absolutely bananas. I wonder how much that benchmark cluster cost to run… still, this is a very interesting technology for massive-scale super-low-latency storage. https://www.definite.app/blog/smallpond also notes “3FS achieves a remarkable read throughput of 6.6 TiB/s on a 180-node cluster, which is significantly higher than many traditional distributed file systems.”

    2. it seems smallpond operates strictly with partition-level parallelism, so if your data isn’t partitioned in exactly the right way, you may still find your query bottlenecked:

    Smallpond’s distribution leverages Ray Core at the Python level, using partitions for scalability. Partitioning can be done manually, and Smallpond supports:

    • Hash partitioning (based on column values);
    • Even partitioning (by files or row counts);
    • Random shuffle partitioning

    As I understand it, Trino has a better idea of how to scale out queries across worker nodes even without careful pre-partitioning, which is handy.

    Tags: data-lakes deepseek duckdb rdma networking 3fs smallpond trino ray