
Container Networking Fixes: sysctl Tweaks to Stop TCP Retransmits in Pods

By Noman Mohammad


Why your pods feel like dial-up internet

Picture this: it’s 2 a.m., your Slack is on fire, and every health-check is failing. The app? Fine on your laptop. In prod? Timeouts everywhere.

Here’s the twist — your code isn’t slow. It’s just yelling the same sentence fifteen times because Linux thinks the network is playing telephone.

I’ve debugged this exact scenario three times this year. Same root cause: too many TCP retransmits. Once we fixed the kernel settings, response times dropped from 3 seconds to 200 ms. No redeploy, no rewrite, just three YAML lines.

What on earth is a “TCP retransmit”?

Simple story: the internet loses packets. TCP notices and says, “Hey, send that again.” Helpful… until it happens on every hop inside your cluster. Then each microservice chat turns into:

  • “Hi”
  • “What?”
  • “HI”
  • “Still didn’t catch that.”
  • “FOR THE LOVE OF— HI!”

Laughable in a comic strip. Murder on latency.

The sysctl cheat-sheet that actually works

Below are the five knobs I always twist first. Copy-paste friendly, tested on GKE, EKS, and bare-metal.

1. Stop the endless “are you still there?” loop

Linux retransmits an unacknowledged packet on an established connection up to 15 times (the tcp_retries2 default), and the retry timeout doubles on each attempt. Per the kernel docs that works out to roughly 15 minutes before the connection is finally declared dead. For containers talking to each other in the same rack, that’s ridiculous.

securityContext:
  sysctls:
    - name: net.ipv4.tcp_retries2
      value: "5"

Five tries is plenty. After that, fail fast and let the app retry or circuit-break.
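How bad is the default, exactly? A back-of-the-envelope calculation, assuming a 200 ms initial RTO that doubles each retry and caps at 120 s (the kernel’s TCP_RTO_MAX):

```shell
# cumulative wait across 15 retransmits: 0.2 + 0.4 + 0.8 + ... capped at 120 s
awk 'BEGIN { rto = 0.2; total = 0
  for (i = 1; i <= 15; i++) { total += rto; rto *= 2; if (rto > 120) rto = 120 }
  printf "%.0f seconds\n", total }'
# prints: 805 seconds
```

Over 13 minutes of hanging under these assumptions; with tcp_retries2=5 the same math gives about 6 seconds.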

2. Kill slow-start after coffee breaks

If a connection sits idle for longer than one retransmission timeout — often well under a second — TCP collapses the congestion window back toward its initial size. The next request crawls until the window ramps up again.

securityContext:
  sysctls:
    - name: net.ipv4.tcp_slow_start_after_idle
      value: "0"

Result: consistent throughput for long-lived connections (think gRPC or Postgres).
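You can watch this happen on a live connection. A quick way to eyeball the congestion window (run inside the pod; `ss` comes from iproute2, so the image needs it installed):

```shell
# -t: TCP sockets, -i: internal TCP info (cwnd, rtt, retrans) per connection
# with slow_start_after_idle=1, cwnd collapses back toward ~10 after an idle gap;
# with 0, it holds its value between requests
ss -ti state established
```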

3. Detect dead peers before your pager screams

Default keepalive waits two hours. Two. Hours. By then your users are long gone.

securityContext:
  sysctls:
    - name: net.ipv4.tcp_keepalive_time
      value: "600"      # 10 minutes
    - name: net.ipv4.tcp_keepalive_probes
      value: "5"
    - name: net.ipv4.tcp_keepalive_intvl
      value: "30"       # 30 seconds between probes

Dead connection? Gone in 12.5 minutes max.
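That 12.5-minute figure is just the three knobs added up — the worst case, where the peer dies right after its last good packet:

```shell
# keepalive_time + keepalive_probes * keepalive_intvl, in minutes
awk 'BEGIN { printf "%.1f minutes\n", (600 + 5 * 30) / 60 }'
# prints: 12.5 minutes
```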

4. Give the buffers some leg room

Small socket buffers = dropped packets = retransmits. 16 MB sounds huge, but on 10 GbE with microbursts it’s just right. One heads-up: on many kernels net.core.* is not network-namespaced, so the kubelet will refuse it in a pod spec — if that bites you, set it on the node (an /etc/sysctl.d/ drop-in or a privileged DaemonSet) instead.

securityContext:
  sysctls:
    - name: net.core.rmem_max
      value: "16777216"
    - name: net.core.wmem_max
      value: "16777216"
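One companion setting worth knowing about: the caps above only raise the ceiling for apps calling setsockopt(SO_RCVBUF/SO_SNDBUF); the kernel’s buffer autotuning works from net.ipv4.tcp_rmem / tcp_wmem instead. The values below simply mirror the 16 MB ceiling — an assumption to benchmark, not a tested recommendation:

```yaml
securityContext:
  sysctls:
    # min / default / max, in bytes; max mirrors the 16 MB cap above
    - name: net.ipv4.tcp_rmem
      value: "4096 87380 16777216"
    - name: net.ipv4.tcp_wmem
      value: "4096 65536 16777216"
```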

5. Keep window scaling on (it usually is, but pin it)

securityContext:
  sysctls:
    - name: net.ipv4.tcp_window_scaling
      value: "1"

Window scaling — windows bigger than 64 KB on high-latency links — has been enabled by default in Linux for years, so treat this line as insurance against a hardened base image or node profile switching it off. Losing it can easily halve cross-AZ throughput, which is where the “instant 2×” war stories come from.
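Putting the pod-scoped knobs together, a minimal manifest sketch — the name and image are placeholders, and note that the net.core.* buffer caps usually aren’t network-namespaced, so they may need to live on the node instead:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tuned-app                # placeholder
spec:
  securityContext:               # sysctls are pod-level, not per-container
    sysctls:
      - name: net.ipv4.tcp_retries2
        value: "5"
      - name: net.ipv4.tcp_slow_start_after_idle
        value: "0"
      - name: net.ipv4.tcp_keepalive_time
        value: "600"
      - name: net.ipv4.tcp_keepalive_probes
        value: "5"
      - name: net.ipv4.tcp_keepalive_intvl
        value: "30"
      - name: net.ipv4.tcp_window_scaling
        value: "1"
  containers:
    - name: app
      image: nginx:1.25          # placeholder
```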

Before you apply anything

Your cluster needs to allow these knobs; otherwise the kubelet rejects the pod outright and you’ll see a SysctlForbidden status instead of a running container.

Edit the kubelet args (PodSecurityPolicy is gone as of Kubernetes 1.25, so the allow-list lives with the kubelet now):

--allowed-unsafe-sysctls 'net.ipv4.*,net.core.*'

Then roll nodes or create a new node pool. On managed services, check the docs — GKE, for example, exposes node-level sysctls through the node pool’s Linux node configuration.
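If you manage nodes through a kubelet config file rather than flags, the same allow-list goes in KubeletConfiguration (wildcard patterns like "net.ipv4.*" work here too):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
allowedUnsafeSysctls:
  - "net.ipv4.tcp_retries2"
  - "net.ipv4.tcp_slow_start_after_idle"
  - "net.ipv4.tcp_keepalive_time"
  - "net.ipv4.tcp_keepalive_probes"
  - "net.ipv4.tcp_keepalive_intvl"
```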

Spot-check your fix

Exec into a pod and run:

awk '/^Tcp:/ {v=$13} END {print v " retransmitted segments"}' /proc/net/snmp

That reads RetransSegs — field 13 of the Tcp: line in /proc/net/snmp — the cumulative count of retransmitted segments.

Watch that number. After the tweak, retransmits should flat-line under load. I usually run a 5-minute iperf3 test between two pods to confirm.
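The iperf3 run, sketched with kubectl — pod names are placeholders, and both images need iperf3 installed:

```shell
# terminal 1: iperf3 server in the first pod
kubectl exec -it netperf-a -- iperf3 -s

# terminal 2: a 5-minute, 4-stream test from the second pod
SERVER_IP=$(kubectl get pod netperf-a -o jsonpath='{.status.podIP}')
kubectl exec -it netperf-b -- iperf3 -c "$SERVER_IP" -t 300 -P 4
```

The client’s per-interval output includes a Retr column, which should line up with the /proc counter above.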

Watch out for these gotchas

  • MTU weirdness with Calico VXLAN? Turn on MTU probing: net.ipv4.tcp_mtu_probing=1.
  • Cross-region flows love BBR: net.ipv4.tcp_congestion_control=bbr (needs the tcp_bbr module and root on the node).
  • Too aggressive? Drop tcp_retries2 to 3 and see if apps handle it. Some legacy JDBC drivers hate early RSTs.
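For the BBR item, a node-level sketch — run as root on the node, and assuming the kernel ships the tcp_bbr module (most 4.9+ kernels do). BBR is usually paired with the fq qdisc:

```shell
modprobe tcp_bbr
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# verify it took
sysctl net.ipv4.tcp_congestion_control
```

To persist across reboots, put the two sysctls in an /etc/sysctl.d/ drop-in on the node image.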

Parting thought

Next time the on-call phone buzzes with “timeouts again,” don’t reach for the profiler first. Pop open /proc/net/netstat and look at those retransmit counters. One YAML file and your containers might just stop sounding like a broken record.

Happy tuning!
