1. “Why is my server crawling?” – the question that keeps me up at night
2. Meet perf trace—quiet, low-drama, high-insight
3. Stop drowning in noise—filter like a pro
4. Real story—debugging that July outage
5. Level-up tricks that save even more time
6. Quick comparison cheat-sheet
7. Common gotchas and how to dodge them
8. Still curious? Read the manuals
“Why is my server crawling?” – the question that keeps me up at night
I was on-call last July when the alerts started. Page-load times tripled, CPU looked fine, memory looked fine, but users were furious. After two hours of educated guesses—restart services, tweak nginx, scroll through logs—I still had no clue.
Sound familiar?
In my experience, most first “diagnoses” turn out to be flat-out wrong. Every extra minute of downtime costs real money (think a year’s coffee budget gone in sixty seconds). The old go-to, strace, is like using a fire hose on a houseplant: it stops the process at every single syscall, so a busy app can slow to a crawl, and the timings you read are skewed by the tracing itself.
There is a better way, and it’s already on your machine.
Meet perf trace—quiet, low-drama, high-insight
perf trace is part of the perf family that ships with modern Linux kernels. Think of it as a stethoscope that doesn’t wake the patient.
Installation in 30 seconds
sudo apt install linux-tools-generic # Ubuntu / Debian
sudo dnf install perf # Fedora / RHEL
Done. Type perf --version to confirm you’re set.
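It also doesn’t hurt to confirm that perf can see the syscall tracepoints it relies on. On most kernels, a glob like the one below should list the open/openat entry points (exact tracepoint names can vary slightly by kernel version):
sudo perf list 'syscalls:sys_enter_open*'   # should print a handful of open/openat tracepoints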
The commands you’ll actually use
- Trace a single command
  perf trace ls -la
  See every syscall `ls` fires and how long each one takes.
- Attach to a running process
  sudo perf trace -p 1234
  Replace 1234 with the PID that’s misbehaving.
- System-wide snapshot
  sudo perf trace -a
  Warning: floods the screen—filter first.
Always run with sudo when you’re looking beyond your own processes.
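If reaching for sudo every time bothers you, the usual gatekeeper is the kernel.perf_event_paranoid sysctl. A quick sketch for a debugging session (note that unprivileged access to tracepoints generally needs the most permissive setting, and your distro’s hardening may differ):
cat /proc/sys/kernel/perf_event_paranoid     # 2 is a common default; lower is more permissive
sudo sysctl kernel.perf_event_paranoid=-1    # -1 allows unprivileged use of tracepoints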
Stop drowning in noise—filter like a pro
Raw output is useless if you can’t read it. Narrow the view:
perf trace -e 'open*' -p 1234
This zeroes in on every open and openat call your app makes. Timing comes for free: perf trace prints each call’s duration by default. To hide the fast calls and keep only the slow ones, give --duration a threshold in milliseconds:
perf trace --duration 5 -p 1234
Now only calls that took longer than 5 ms are shown, and the stragglers jump out like a sore thumb.
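And when you only need the big picture rather than every line, summary mode swaps the scrolling log for a per-syscall table of counts and timings once the trace stops (PID 1234 is a stand-in, as above):
sudo perf trace -s -p 1234   # let it run for a while, then Ctrl-C to print the summary table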
Real story—debugging that July outage
Back to my midnight crisis. I attached perf trace to the web-app PID and filtered for slow reads:
sudo perf trace -e read --duration 5 -p 1122
One line hit me in the face:
1122 read(42, "/var/cache/slow.db-journal", 4096) = 4096 <8.2 ms>
Eight milliseconds for a read? That’s forever. A quick ls showed the journal file had ballooned to 3 GB. Truncating it and moving the cache to SSD fixed the issue in minutes—not hours.
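If you’re wondering how to confirm a find like that, /proc makes it a one-liner: the fd symlink names the file, and a plain ls -lh shows how big it has grown (PID, fd number, and path here are the ones from the trace above):
ls -l /proc/1122/fd/42              # the symlink names the file behind descriptor 42
ls -lh /var/cache/slow.db-journal   # the bloated journal that was dragging every read down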
Without perf trace, I’d still be guessing.
Level-up tricks that save even more time
- Record first, analyze later (a tiny wrapper for this is sketched after the list)
  sudo perf record -e 'sched:*' -a -- sleep 10
  sudo perf report
  Great for catching elusive latency spikes.
- Export for post-processing and pretty graphs
  perf script --input perf.data > trace.txt
  This dumps every recorded event as plain text for your own tooling; recent perf builds can also emit JSON via perf data convert --to-json.
- Install debug symbols
  sudo apt install linux-image-$(uname -r)-dbgsym
  Suddenly cryptic addresses turn into readable function names. (On Ubuntu this needs the ddebs repository enabled.)
- Increase the buffer on noisy servers
  sudo perf trace --mmap-pages 8192 -p 1234
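To make the record-then-analyze trick a one-liner on the boxes I babysit, I keep a tiny wrapper around it. A minimal sketch, with a hypothetical name (sched-snapshot.sh); adjust the events and duration to taste:
#!/usr/bin/env bash
# sched-snapshot.sh: capture N seconds of scheduler events system-wide,
# then print the top of the report. Usage: sudo ./sched-snapshot.sh 10
set -euo pipefail
secs="${1:-10}"
out="sched-$(date +%Y%m%d-%H%M%S).data"
perf record -e 'sched:*' -a -o "$out" -- sleep "$secs"
perf report --stdio -i "$out" | head -n 40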
Quick comparison cheat-sheet
| Tool | Overhead | Best for |
|---|---|---|
| perf trace | Low | Production, system-wide, latency hunting |
| strace | High | Quick one-off checks on dev boxes |
| ftrace | Ultra-low | Kernel developers, kernel internals |
Common gotchas and how to dodge them
- Permission denied?
  sudo setcap cap_perfmon+ep /usr/bin/perf
  (cap_perfmon needs kernel 5.8 or newer; on older kernels, stick with sudo.)
- Nothing but hex addresses?
  Install debug packages for your kernel and apps.
- Container confusion?
  Run perf trace on the host and target the container’s processes by their host-side PIDs (one way to find them is sketched below).
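For the container case, here is one way to turn a container name into a host-side PID that perf trace will accept. A sketch assuming Docker and a hypothetical container named myapp:
pid=$(docker inspect --format '{{.State.Pid}}' myapp)   # main process of the container, as the host sees it
sudo perf trace -p "$pid"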
Still curious? Read the manuals
The official docs are short and sweet: man perf-trace covers every flag in a few pages, and the perf wiki at https://perf.wiki.kernel.org goes deeper when you’re ready.
Next time your system feels sluggish, skip the guesswork. Fire up perf trace, filter for the calls that matter, and watch the real culprit appear in seconds.
You’ll sleep better. I promise.