
Real-Time I/O Heatmaps with eBPF and bcc to Spot Storage Bottlenecks


By Noman Mohammad


Stop Guessing Where Your Disk is Dying

I once watched a team spend two weeks tuning their MySQL server. They tweaked every buffer, index, and cache they could find. Yet every night at 2:17 a.m. the queries crawled. CPU was fine, RAM was fine, disk utilization only 40%. They were chasing phantoms.

The real culprit? A single nightly backup job that flushed the RAID cache and turned the storage layer into molasses. Classic story, right? The twist: I found the spike in eleven minutes once I pointed biosnoop at the box.

Today I’ll show you the simple setup I used—no PhD in kernel internals required.


Why iotop Lies to You

Old tools like iostat or iotop give you one aggregate number: “disk busy 35%”. That’s like saying “the highway is 35% full” while ignoring the mile-long pile-up causing the traffic jam.

What we need is a per-operation movie, not a single snapshot. That’s where eBPF comes in. Think of it as strapping a GoPro to every read and write that hits your kernel.

  • Near-zero overhead. Your app won’t even notice it’s being watched.
  • Per-process view. Spot the exact PID that’s hammering the disk.
  • Microsecond resolution. You’ll see the 9 ms spike that averaged counters smear into nothing.

Install It in Three Commands

Ubuntu / Debian

sudo apt install bpfcc-tools linux-headers-$(uname -r)

RHEL / CentOS / Alma

sudo yum install bcc-tools kernel-devel-$(uname -r)

That’s it. If you’re on a reasonably recent kernel (4.1 or newer), you’re ready to roll.


Your First Five Minutes

Open two terminal windows.

In the first, run:

sudo biosnoop > io.log

In the second, start your slow job: a backup, a batch import, whatever. Let it run for thirty seconds then kill biosnoop with Ctrl-C.

You now have a file that looks like:

TIME           PID COMM           DISK T  SECTOR    BYTES   LAT(ms)
19:02:01.123   892 mysqld         sdb  W  88172664  4096    47.92
19:02:01.124   892 mysqld         sdb  W  88172672  4096    48.11
...

Every row is a single disk operation: timestamp, PID, disk, latency. No guessing. (The exact columns and time format vary slightly between bcc versions, so check the header line of your log.)
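Before breaking out plotting libraries, you can triage a log like this with a few lines of stdlib Python. This is a minimal sketch that assumes the column layout shown above (timestamp, PID, COMM, disk, type, sector, bytes, latency) and uses a small inline sample instead of reading io.log; it sums latency per process so the worst offender floats to the top.

```python
# Quick triage of a biosnoop-style log: which PID accounts for the most
# total I/O latency? Column layout assumed to match the sample above;
# adjust the field indices if your bcc version prints a different order.
from collections import defaultdict

sample = """\
19:02:01.123   892 mysqld         sdb  W  88172664  4096    47.92
19:02:01.124   892 mysqld         sdb  W  88172672  4096    48.11
19:02:01.130   215 jbd2/sdb1-8    sdb  W  10241024  8192     0.41
"""

totals = defaultdict(float)  # (pid, comm) -> summed latency in ms
for line in sample.splitlines():
    fields = line.split()
    pid, comm, lat_ms = fields[1], fields[2], float(fields[-1])
    totals[(pid, comm)] += lat_ms

# Worst offenders first.
for (pid, comm), ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{pid:>6} {comm:<15} {ms:8.2f} ms total latency")
```

Swap `sample.splitlines()` for `open('io.log')` (skipping the header line) to run it against a real capture.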


Turn Raw Lines into a Picture

Numbers are boring. Colors are fast. Paste this tiny Python script (I keep it as heatmap.py):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# biosnoop's header line doubles as the CSV header. Split on runs of
# whitespace (the old delim_whitespace= option is deprecated in pandas 2.x).
df = pd.read_csv('io.log', sep=r'\s+', parse_dates=['TIME'])

# Bucket operations into 2-second windows so the heatmap stays readable.
df['bucket'] = df['TIME'].dt.floor('2s')
pivot = df.pivot_table(index='bucket', columns='DISK',
                       values='LAT(ms)', aggfunc='mean')

plt.figure(figsize=(12, 4))
sns.heatmap(pivot, cmap='RdYlBu_r', linewidths=.5)
plt.title('Disk Latency Heatmap (darker = slower)')
plt.tight_layout()
plt.savefig('io_heatmap.png')

Run:

python3 heatmap.py

Open io_heatmap.png. Dark red stripes? That’s pain. Bright blue? All good.


Reading the Pain

  • Vertical red streak on sdb at 02:17–02:19? That’s your backup.
  • Diagonal red line? Sequential scan that’s turned random—index missing its cache.
  • Single bright cell? One process doing a huge synchronous write. Probably logging.

Overlay the heatmap with your cron schedule. You’ll see the match in seconds.
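If you'd rather have the cron comparison done for you, the same 2-second bucketing the heatmap uses can flag the hot windows directly. A minimal stdlib sketch, assuming the timestamp format from the sample log and an illustrative 20 ms threshold (pick one that matches your storage):

```python
# Flag "pain windows": bucket each operation into 2-second windows and
# report any window whose mean latency exceeds a threshold. These are
# the same windows that show up as dark stripes on the heatmap.
from collections import defaultdict
from datetime import datetime, timedelta

THRESHOLD_MS = 20.0  # illustrative; tune for your hardware

# (timestamp, latency-ms) pairs; in practice, parse these out of io.log
rows = [
    ("19:02:01.123", 47.92),
    ("19:02:01.124", 48.11),
    ("19:02:05.800", 0.41),
]

buckets = defaultdict(list)
for ts, lat in rows:
    t = datetime.strptime(ts, "%H:%M:%S.%f")
    # Floor to the start of the 2-second window.
    floored = t - timedelta(seconds=t.second % 2, microseconds=t.microsecond)
    buckets[floored].append(lat)

for start in sorted(buckets):
    mean = sum(buckets[start]) / len(buckets[start])
    flag = "  <-- pain" if mean > THRESHOLD_MS else ""
    print(f"{start:%H:%M:%S}  mean {mean:6.2f} ms{flag}")
```

Line up the flagged timestamps against `crontab -l` and the guilty job usually names itself.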


Real-World Fix in 11 Minutes

Back to that MySQL story:

  1. I ran biosnoop for one backup cycle.
  2. The heatmap lit up sdb exactly at 02:17.
  3. Latency jumped from 1 ms to 50 ms for the entire window.
  4. We moved the backup to 04:00 and ran it under ionice -c 3 (the idle I/O class, so it yields to real traffic).
  5. Problem gone. Two weeks of tuning avoided.

Sometimes the fastest optimization is not running the wrong job at the wrong time.


Next Steps

You’re already faster than iostat. To go deeper:

  • Filter the stream to one process: sudo biosnoop | grep mysqld
  • Live latency histograms in the terminal: sudo biolatency -m 1 (millisecond buckets, refreshed every second)
  • Cron it: Dump daily logs and auto-mail the heatmap.
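For the cron idea, the check itself can be tiny. A hedged sketch, again assuming the log layout from earlier: compute a nearest-rank 95th percentile per disk and print an alert line when a disk crosses a threshold; cron's MAILTO (or your mail step) takes care of delivery.

```python
# Cron-friendly latency check: per-disk p95 from a biosnoop-style dump,
# with an alert line for any disk over the threshold. The 25 ms threshold
# and the column layout are assumptions; adjust for your environment.
ALERT_P95_MS = 25.0

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    s = sorted(values)
    return s[min(len(s) - 1, int(0.95 * len(s)))]

def check(lines):
    """Map each disk to its p95 latency in ms."""
    per_disk = {}
    for line in lines:
        fields = line.split()
        disk, lat = fields[3], float(fields[-1])
        per_disk.setdefault(disk, []).append(lat)
    return {d: p95(v) for d, v in per_disk.items()}

# Inline sample; in the cron job, read the day's io.log instead.
sample = [
    "19:02:01.123   892 mysqld   sdb  W  88172664  4096  47.92",
    "19:02:01.124   892 mysqld   sdb  W  88172672  4096  48.11",
    "19:02:03.010   310 rsyslogd sda  W  512       4096   0.35",
]

stats = check(sample)
for disk, val in stats.items():
    if val > ALERT_P95_MS:
        print(f"ALERT {disk}: p95 latency {val:.2f} ms")
```

Dump `sudo biosnoop > io.log` on a schedule, feed the file to this script, and mail the output plus io_heatmap.png.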

Storage mysteries hate sunlight. Shine the eBPF flashlight and they disappear.
