Debugging Azure Networking for Elastic Cloud Serverless
Good writeup of fixing a Linux packet loss issue in Azure, using low-level access to the VMs running k8s nodes.
Elastic's Site Reliability Engineering team (SRE) observed unstable throughput and packet loss in Elastic Cloud Serverless running on Azure Kubernetes Service (AKS). After investigation, we identified the primary contributing factors to be RX ring buffer overflows and kernel input queue saturation on SR-IOV interfaces. To address this, we increased RX buffer sizes and adjusted the netdev backlog, which significantly improved network stability.
Tags: sr-iov linux networking bugs azure debugging ops sre drivers