Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
(tags: rdd spark streaming fault-tolerance batch distcomp papers big-data scalability)
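The lineage idea is simple enough to sketch in a few lines. This is a toy illustration (not Spark's actual API): each dataset records its parent and the function that derived it, so a lost in-memory partition can be recomputed from the parent rather than restored from a replica.

```python
# Toy sketch of lineage-based recovery; ToyRDD is a made-up class,
# not Spark's RDD API.
class ToyRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # in-memory data, may be lost
        self.parent = parent          # lineage: where the data came from
        self.fn = fn                  # lineage: how it was derived

    def map(self, fn):
        return ToyRDD([[fn(x) for x in p] for p in self.partitions],
                      parent=self, fn=fn)

    def recover(self, i):
        # Rebuild partition i by re-applying the recorded function
        # to the parent's copy of that partition.
        self.partitions[i] = [self.fn(x) for x in self.parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[0] = None   # simulate losing a partition
doubled.recover(0)             # recompute it from lineage
print(doubled.partitions)      # [[2, 4], [6, 8]]
```

No replication is needed as long as the lineage chain bottoms out in data that can itself be re-read or recomputed, which is what lets Spark keep recovery cheap.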
Improving testing by using real traffic from production
Gor, a very nice-looking tool to log and replay HTTP traffic, specifically designed to “tee” live traffic from production to staging for pre-release testing.
(tags: gor performance testing http tcp packet-capture tests staging tee)
Git team workflows: merge or rebase?
Well-written description of the pros and cons. I’m a rebaser, fwiw. (via Darrell)
(tags: via:darrell git merging rebasing history git-log coding workflow dev teams collaboration github)
How to receive a million packets per second on Linux
To sum up, if you want perfect performance you need to: ensure traffic is distributed evenly across many RX queues and SO_REUSEPORT processes (in practice the load is usually well distributed as long as there are a large number of connections, or flows); have enough spare CPU capacity to actually pick up the packets from the kernel; and, to make things harder, keep both the RX queues and the receiver processes on a single NUMA node.
(tags: linux networking performance cloudflare packets numa so_reuseport sockets udp)
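The SO_REUSEPORT half of that recipe looks roughly like this in Python (a minimal sketch, assuming Linux; in the real setup each receiver would be a separate process, pinned to a core on the NIC's NUMA node):

```python
import socket

def make_receiver(port):
    # With SO_REUSEPORT set on every socket, the Linux kernel spreads
    # incoming UDP packets across all sockets bound to the same port.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("0.0.0.0", port))
    return s

a = make_receiver(0)              # port 0: let the kernel pick one
port = a.getsockname()[1]
b = make_receiver(port)           # a second socket sharing the same port
```

Spreading traffic across the RX queues is a separate knob (NIC hash/indirection settings, e.g. via ethtool), and the CPU pinning would be done outside the program, e.g. with taskset.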