Stop Guessing Where Your Disk is Dying
I once watched a team spend two weeks tuning their MySQL server. They tweaked every buffer, index, and cache they could find. Yet every night at 2:17 a.m. the queries crawled. CPU was fine, RAM was fine, disk usage only 40 %. They were chasing phantoms.
The real culprit? A single nightly backup job that flushed the RAID cache and turned the storage layer into molasses. Classic story, right? The twist: I found the spike in eleven minutes once I pointed biosnoop at the box.
Today I’ll show you the simple setup I used—no PhD in kernel internals required.
Why iotop Lies to You
Old tools like iostat or iotop give you one number: “disk busy 35 %”. That’s like saying “the highway is 35 % full” while ignoring the mile-long pile-up causing the traffic jam.
What we need is a per-operation movie, not a single snapshot. That’s where eBPF comes in. Think of it as strapping a GoPro to every read and write that hits your kernel.
- Near-zero overhead. eBPF probes run sandboxed in the kernel; your app won't notice.
- Per-process view. Spot the exact PID that’s hammering the disk.
- Microsecond resolution. You’ll see the 9 ms spike the old tools round down to zero.
Install It in Three Commands
Ubuntu / Debian
sudo apt install bpfcc-tools linux-headers-$(uname -r)
RHEL / CentOS / Alma
sudo yum install bcc-tools kernel-devel-$(uname -r)
That’s it. If you’re on a recent kernel (4.1+) you’re ready to roll.
Your First Five Minutes
Open two terminal windows.
In the first, run:
sudo biosnoop > io.log
(On Ubuntu/Debian the bpfcc-tools package suffixes the tools, so the command is biosnoop-bpfcc.)
In the second, start your slow job: a backup, a batch import, whatever. Let it run for thirty seconds then kill biosnoop with Ctrl-C.
You now have a file that looks like:
TIME          PID   COMM    DISK  T  SECTOR    BYTES  LAT(ms)
19:02:01.123  892   mysqld  sdb   W  88172664  4096   47.92
19:02:01.124  892   mysqld  sdb   W  88172672  4096   48.11
...
Every row is a single disk operation. PID, disk, latency, timestamp. No guessing.
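If you only need one number per line, you don't even need pandas; a minimal sketch in plain Python, assuming the column layout shown in the sample above (the helper name is mine, not biosnoop's):

```python
def parse_biosnoop_line(line):
    """Split one biosnoop data row into a dict (column order as in the sample)."""
    time, pid, comm, disk, op, sector, nbytes, lat_ms = line.split()
    return {
        'time': time,
        'pid': int(pid),
        'comm': comm,
        'disk': disk,
        'op': op,               # R = read, W = write
        'sector': int(sector),
        'bytes': int(nbytes),
        'lat_ms': float(lat_ms),
    }

row = parse_biosnoop_line("19:02:01.123 892 mysqld sdb W 88172664 4096 47.92")
print(row['comm'], row['lat_ms'])  # → mysqld 47.92
```

Handy for quick one-off filters, e.g. printing only rows with lat_ms above 10.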
Turn Raw Lines into a Picture
Numbers are boring. Colors are fast. Paste this tiny Python script (I keep it as heatmap.py):
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt
# The log is whitespace-separated; sep=r'\s+' replaces the deprecated delim_whitespace=True
df = pd.read_csv('io.log', sep=r'\s+', parse_dates=['TIME'])
# Bucket each operation into a 2-second window so the heatmap stays readable
df['bucket'] = df['TIME'].dt.floor('2s')
# One row per window, one column per disk, mean latency per cell
pivot = df.pivot_table(index='bucket', columns='DISK', values='LAT(ms)', aggfunc='mean')
plt.figure(figsize=(12, 4))
sns.heatmap(pivot, cmap='RdYlBu_r', linewidths=.5)
plt.title('Disk Latency Heatmap (darker = slower)')
plt.tight_layout()
plt.savefig('io_heatmap.png')
Run:
python3 heatmap.py
Open io_heatmap.png. Dark red stripes? That’s pain. Bright blue? All good.
Reading the Pain
- Vertical red streak on sdb at 02:17–02:19? That's your backup.
- Diagonal red line? A sequential scan that's turned random: an index missing its cache.
- Single bright cell? One process doing a huge synchronous write. Probably logging.
Overlay the heatmap with your cron schedule. You’ll see the match in seconds.
Real-World Fix in 11 Minutes
Back to that MySQL story:
- I ran biosnoop for one backup cycle.
- The heatmap lit up sdb exactly at 02:17.
- Latency jumped from 1 ms to 50 ms for the entire window.
- We moved the backup to 04:00 and added ionice -c 3.
- Problem gone. Two weeks of tuning avoided.
Sometimes the fastest optimization is not running the wrong job at the wrong time.
Next Steps
You’re already faster than iostat. To go deeper:
- Filter by PID: sudo biosnoop -p $(pgrep mysqld)
- Live view in the terminal: sudo biolatency -m 1
- Cron it: dump daily logs and auto-mail the heatmap.
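For the cron idea, you don't have to mail a picture; a one-screen text summary is often enough. A sketch (function name is mine) that pulls mean and p99 latency out of the LAT(ms) column of a day's log, assuming the column layout from the sample earlier:

```python
def latency_summary(lines):
    """Return (mean_ms, p99_ms) from biosnoop data rows; LAT(ms) is the last column."""
    lats = sorted(float(line.split()[-1]) for line in lines if line.strip())
    if not lats:
        return (0.0, 0.0)
    mean = sum(lats) / len(lats)
    # Nearest-rank p99: index 99% of the way through the sorted latencies
    p99 = lats[min(len(lats) - 1, int(0.99 * len(lats)))]
    return (mean, p99)

rows = [
    "19:02:01.123 892 mysqld sdb W 88172664 4096 47.92",
    "19:02:01.124 892 mysqld sdb W 88172672 4096 48.11",
]
print(latency_summary(rows))
```

Pipe yesterday's io.log through it in a cron job and mail the two numbers; when p99 jumps an order of magnitude, regenerate the heatmap and dig in.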
Storage mysteries hate sunlight. Shine the eBPF flashlight and they disappear.