1. Why eBPF Feels Like Having X-Ray Vision For Your Linux Box
2. Traditional vs. eBPF – A 60-second Comparison
3. My 15-minute Start-Up Routine Anytime Something Feels Sluggish
4. Two Mini Case Studies (Copy-paste to Try)
5. BCC or bpftrace – Which to Reach For?
6. Quick Safety and Setup Notes
7. Cheat-Sheet of My Top 5 Tools
8. One Last Thought
Why eBPF Feels Like Having X-Ray Vision For Your Linux Box
Ever watched a server grind to a crawl and thought, *what on Earth is it doing in there?*
Old tools like top or strace give you the **what**, but rarely the **why**.
I hit this wall last year when a customer’s database started stalling every few minutes.
perf said *kernel time – 78 %*. Nice, but where inside the kernel?
A friend nudged me toward eBPF. Two hours later I was staring at the exact line of kernel code that held a spin-lock too long. **Problem fixed before dinner.**
eBPF is basically a tiny, super-fast virtual machine that lives inside the Linux kernel.
It lets us drop little probes in there **while the machine is running**.
Two easy ways to talk to it are:
- BCC – big toolbox written in Python/C: loads of ready-made commands.
- bpftrace – mini scripting language for “explain this weird blip **now**”.
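If you've never run one, here is about the smallest useful bpftrace invocation I know – a sketch with nothing customer-specific in it. It counts syscalls per process until you hit Ctrl-C, then prints a table:

```
# Count every syscall, keyed by process name; Ctrl-C prints the totals.
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```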
Traditional vs. eBPF – A 60-second Comparison
**Classic profiler**
Collect stack traces → dump 5 MB/s to disk → crunch for ten minutes → maybe find the bottleneck.
**eBPF one-liner**
Count how many times every process hits a slow path function, **live in RAM**, zero disk IO. Ctrl-C to print a table, done.
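For a concrete flavour, here's a sketch of that kind of one-liner – vfs_fsync is just a stand-in for whichever slow path you actually suspect:

```
# Count how often each process enters a suspected slow-path function,
# aggregated entirely in kernel memory; Ctrl-C prints the table.
sudo bpftrace -e 'kprobe:vfs_fsync { @[comm] = count(); }'
```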
That order-of-magnitude reduction in effort? It changes how you think about debugging.
My 15-minute Start-Up Routine Anytime Something Feels Sluggish
- Run:
  ```
  sudo biolatency-bpfcc 1
  ```
  Shows a histogram of disk latency every second. A quick eyeball test for "the disk is thrashing".
- If the disk is clean, try:
  ```
  sudo execsnoop-bpfcc
  ```
  Tells me which new commands just spawned. Often it's a rogue cron job or healthcheck script.
- Still no clue?
  ```
  sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
  ```
  Samples every running CPU 99 times a second and prints the hottest kernel stacks.
I see a wall of *spinlock*, notice it's the same filesystem code line every time – an SMR disk firmware bug.
Zero restarts, zero downtime, answers in under a minute.
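When the stack map is too noisy to eyeball, I sometimes cap the run and only print the hottest entries. A minimal sketch – the 10-second window and top-20 cut-off are arbitrary choices, not anything canonical:

```
# Sample kernel stacks for 10 seconds, print only the 20 hottest, then exit.
sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }
interval:s:10 { print(@, 20); clear(@); exit(); }'
```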
Two Mini Case Studies (Copy-paste to Try)
Case 1 – Finding the Chatty Container
```
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /cgroup == $1/
{ @[comm, str(args->filename)] = count(); }' <container-cgroup-id>
```
I feed the container's cgroup ID into the filter. Within 30 seconds I spot the log spitter that opens /var/log/debug.log 42,000 times an hour.
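The cgroup ID the filter expects is just the inode number of the container's cgroup directory. Where that directory lives depends on your runtime and cgroup layout; on a cgroup-v2 host running Docker with the systemd driver it's roughly this (the path is an assumption – adjust for your setup):

```
# Hypothetical lookup: cgroup ID == inode number of the container's cgroup directory.
stat -c %i /sys/fs/cgroup/system.slice/docker-<full-container-id>.scope
```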
Case 2 – Unexplained TCP Retransmits
```
sudo bpftrace -e 'kprobe:tcp_retransmit_skb {
  @retransmits[comm] = count();
  @total = count();
}'
```
A single Go binary accounts for 5% of all retransmits. Turns out the dev had forgotten to enable GSO offload. Fixing that cut latency by 25 ms at the 95th percentile.
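When I also need to know *where* those retransmits are going, I extend the probe to pull the peer address out of the socket. A sketch that assumes a reasonably recent kernel with BTF, so the struct sock members resolve without extra headers, and IPv4 traffic:

```
# Count retransmits per process and remote IPv4 address.
sudo bpftrace -e 'kprobe:tcp_retransmit_skb {
  $sk = (struct sock *)arg0;
  @[comm, ntop($sk->__sk_common.skc_daddr)] = count();
}'
```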
BCC or bpftrace – Which to Reach For?
- **BCC** when I need a reusable, one-binary tool. Example: I always keep tcptop aliased so I can see, by connection, who is chewing bandwidth in real time.
- **bpftrace** when the problem is new, weird, and small. One Friday I randomly traced brk syscalls inside Elasticsearch to prove the JVM wasn't resizing the heap after all – it was a transparent hugepage compaction issue instead.
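For the curious, that Friday experiment was basically a one-liner along these lines – the comm name is an assumption, so match whatever your JVM shows up as in top:

```
# Count brk() calls made by the JVM; a silent map means the heap isn't being resized.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_brk /comm == "java"/ { @[comm, pid] = count(); }'
```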
Quick Safety and Setup Notes
Kernel check:
```
uname -r
```
If it's 5.x or newer, you're golden. 4.x may need backports.
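If you want to double-check the kernel was built with the relevant knobs, the config usually lives next to the boot image – the exact path varies by distro, so treat this as a sketch:

```
# These should all be =y on an eBPF-capable kernel.
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_BPF_EVENTS=' /boot/config-$(uname -r)
```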
Install chain (Ubuntu/Debian one-liner):
```
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)
```
That’s it. No recompilation, no kernel modules.
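bpftrace itself ships as a separate package on recent Ubuntu/Debian releases (an assumption about your distro version), and a one-line smoke test tells you the whole stack works:

```
sudo apt-get install bpftrace
# Should print "hello" and exit as soon as the probe attaches.
sudo bpftrace -e 'BEGIN { printf("hello\n"); exit(); }'
```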
Cheat-Sheet of My Top 5 Tools
- opensnoop-bpfcc – see every file open call in real time
- biolatency-bpfcc -Q – disk I/O latency histogram, including time the request sat queued in the kernel
- execsnoop-bpfcc – catch short-lived processes
- tcplife-bpfcc – lifespan and traffic of each TCP flow
- profile-bpfcc -F 99 -adf – whole-system stack samples in folded format, ready for a flame graph
Pin those behind aliases, and you’ve got a portable MRI for almost any Linux box.
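And when profile-bpfcc's folded output needs to become a picture, the usual route is Brendan Gregg's FlameGraph scripts. A sketch of the pipeline – the 30-second capture is an arbitrary choice:

```
# Capture 30 s of folded stacks, then render an SVG flame graph.
git clone https://github.com/brendangregg/FlameGraph
sudo profile-bpfcc -adf -F 99 30 > out.folded
./FlameGraph/flamegraph.pl out.folded > profile.svg
```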
One Last Thought
eBPF isn’t some next-gen magic only kernel hackers should touch.
It’s more like strace got supercharged and moved to kernel mode.
The first time you find a 3-line script that saves you a 2-hour outage, you’ll never **not** have eBPF in your back pocket.
Useful links:
- BCC Documentation
- bpftrace Reference Guide
- Brendan Gregg’s eBPF Blog (the textbook on real-world tricks)
- eBPF.io
Go grab one command, run it, and see what surprises your server has to show you tonight.







