Why your 3 AM nightmare keeps happening
Picture this. It’s 3:17 AM. Your phone is buzzing. Slack is on fire. Users are furious. Your dashboards? All green.
Been there. Last winter I spent four hours chasing a “phantom” latency spike that cost my team thousands in lost sales. The culprit? A single Python microservice doing batch uploads every 15 minutes. Traditional tools showed normal traffic patterns. Meanwhile, our checkout flow crawled.
Here’s what they don’t tell you: 60% of network latency hides where most tools can’t see it. A USENIX study proved this. One bad process. That’s all it takes.
The tools aren’t broken. They’re blind.
I used to love iftop. Thought it was magic. Then I realized…
It shows interfaces. Not processes. It’s like having a city’s traffic report when you need to know which specific driver keeps blocking the bridge.
What happens next is predictable:
- We restart services randomly
- Add more servers (expensive guesswork)
- Watch users leave when 3-second delays become minutes
Sound familiar?
Meet your new best friend: bpftool
Think of eBPF as X-ray vision for your network. bpftool is the remote control.
Best part? It’s already on your Linux box. Free. No vendors. No sales calls.
Let’s set this up in 10 minutes
Step 1: Check your kernel
grep bpf /proc/filesystems
See nodev bpf? You’re golden. If not, grab coffee and update your kernel.
Step 2: Install bpftool
# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-$(uname -r)
sudo apt install clang llvm libbpf-dev   # compiler + headers for Step 3
# RHEL/CentOS
sudo yum install bpftool
sudo yum install clang llvm libbpf-devel
Step 3: The actual magic
Create latency.c:
// latency.c: time each tcp_sendmsg() call, per process.
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Entry timestamps keyed by PID. One slot per PID, so overlapping sends
// from threads of the same process can overwrite each other.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, u32);
    __type(value, u64);
} latency_map SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int BPF_KPROBE(tcp_sendmsg_entry)
{
    u64 start_time = bpf_ktime_get_ns();
    u32 pid = bpf_get_current_pid_tgid() >> 32;

    bpf_map_update_elem(&latency_map, &pid, &start_time, BPF_ANY);
    return 0;
}

SEC("kretprobe/tcp_sendmsg")
int BPF_KRETPROBE(tcp_sendmsg_exit)
{
    u64 end_time = bpf_ktime_get_ns();
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *start_time = bpf_map_lookup_elem(&latency_map, &pid);

    if (start_time) {
        u64 latency = end_time - *start_time;
        bpf_printk("PID %d latency: %llu ns", pid, latency);
        bpf_map_delete_elem(&latency_map, &pid);
    }
    return 0;
}

// kprobes and bpf_printk require a GPL-compatible license.
char LICENSE[] SEC("license") = "GPL";
Step 4: Compile and run
# Generate vmlinux.h from the running kernel (needs CONFIG_DEBUG_INFO_BTF)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
clang -O2 -g -target bpf -I. -c latency.c -o latency.o
# Load both programs, pin them, and auto-attach the kprobe and kretprobe
# (the autoattach keyword needs a reasonably recent bpftool)
sudo bpftool prog loadall latency.o /sys/fs/bpf/latency autoattach
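Before moving on, it's worth a quick sanity check that both programs actually loaded and attached (exact output varies by kernel and bpftool version):
sudo bpftool prog show | grep tcp_sendmsg
sudo bpftool link show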
Step 5: Watch the culprits
sudo cat /sys/kernel/debug/tracing/trace_pipe
You’ll see lines like:
PID 2341 latency: 125000 ns
PID 5678 latency: 890000 ns
That's 0.125 ms and 0.89 ms, tagged by the process that made the call (trace_pipe also prefixes each line with the task name and a timestamp).
Making it actually useful
Raw numbers are nice. Context is better.
I pipe this to a simple script that:
- Maps PIDs to service names
- Tracks 95th percentile latency
- Alerts when any service hits 5ms+
The difference? Instead of guessing, I know exactly which container needs attention.
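Here's a minimal sketch of that kind of consumer, kept in C to match the rest of the post. Everything in it is illustrative: the latency_watch.c name, the 1,000-sample window, and the 5 ms threshold are my assumptions, it resolves PIDs to process names via /proc rather than to real service names, and it computes one global p95 instead of one per service.

/* latency_watch.c (hypothetical): consume the probe's trace_pipe output.
 * Usage: sudo cat /sys/kernel/debug/tracing/trace_pipe | ./latency_watch
 * Build: cc -O2 -o latency_watch latency_watch.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW   1000                 /* samples per percentile report */
#define ALERT_NS (5ULL * 1000 * 1000) /* flag anything over 5 ms */

static int cmp_u64(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Best-effort PID -> name lookup; the process may already be gone. */
static void pid_name(int pid, char *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/comm", pid);
    FILE *f = fopen(path, "r");
    if (!f || !fgets(buf, (int)len, f))
        snprintf(buf, len, "pid-%d", pid);
    if (f)
        fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
}

int main(void)
{
    static unsigned long long samples[WINDOW];
    size_t n = 0;
    char line[512];

    while (fgets(line, sizeof(line), stdin)) {
        /* trace_pipe prefixes each line; match the bpf_printk part only. */
        char *msg = strstr(line, "PID ");
        int pid;
        unsigned long long ns;
        if (!msg || sscanf(msg, "PID %d latency: %llu ns", &pid, &ns) != 2)
            continue;

        char name[64];
        pid_name(pid, name, sizeof(name));

        if (ns >= ALERT_NS)
            printf("ALERT %s (pid %d): %.2f ms\n", name, pid, ns / 1e6);

        samples[n++] = ns;
        if (n == WINDOW) {            /* one p95 report per full window */
            qsort(samples, n, sizeof(samples[0]), cmp_u64);
            printf("p95 over last %d sends: %.2f ms\n",
                   WINDOW, samples[(size_t)(WINDOW * 0.95)] / 1e6);
            n = 0;
        }
    }
    return 0;
}

In practice you'd map PIDs to container or service names through your orchestrator's metadata instead of /proc, and push the alerts somewhere noisier than stdout.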
Real-world wins
Last month, this caught a logging service that spiked to 200ms every 5 minutes. Turned out someone enabled debug mode. Fixed in 30 seconds.
Another time, it exposed a Redis client that wasn’t pooling connections. Saved us from a $12k/month over-provision.
Beyond the basics
Once you’re comfortable:
- Swap tcp_sendmsg for udp_sendmsg to catch UDP issues
- Add a BPF_MAP_TYPE_PERCPU_ARRAY for better performance at scale (see the sketch below)
- Set latency thresholds to reduce noise
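On the per-CPU point: one common pattern, and this is an assumption about where you'd take it rather than anything from the setup above, is to swap the bpf_printk for a per-CPU latency histogram that user space dumps on its own schedule. A sketch that could sit alongside the existing map in latency.c (needs a kernel new enough for bounded loops, roughly 5.3+):

// Per-CPU histogram: 64 log2 buckets; each CPU updates its own copy,
// so there is no cross-CPU contention on the hot path.
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 64);
    __type(key, u32);
    __type(value, u64);
} lat_hist SEC(".maps");

static __always_inline void record_latency(u64 ns)
{
    // Bucket n holds latencies of roughly [2^n, 2^(n+1)) nanoseconds.
    u32 bucket = 0;
    while (ns > 1 && bucket < 63) {
        ns >>= 1;
        bucket++;
    }
    u64 *count = bpf_map_lookup_elem(&lat_hist, &bucket);
    if (count)
        (*count)++;    // per-CPU copy, so a plain increment is fine
}

Call record_latency(latency) from the kretprobe in place of the bpf_printk, and read the buckets with sudo bpftool map dump name lat_hist; bpftool prints each CPU's copy, so sum them per bucket when you aggregate.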
Remember: Every millisecond you save is a millisecond your users don’t wait.
Your turn
Try this on a test server first. Run a few curl commands. Watch the output. Then imagine having this running 24/7.
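For a first test, something as dumb as this in one terminal while trace_pipe is open in another will do (the URL is just an example):
curl -s https://example.com > /dev/null
curl -s https://example.com > /dev/null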
The 3 AM calls? They become 3 PM coffee breaks.
Questions? Hit me up. I’ve got the scars to prove this works.







