- It’s 3 a.m. Your App Just Got Painfully Slow
- The Needle-in-a-Haystack Problem
- Install perf in 30 Seconds
- Find the Slow Spots in Real Time
- Recording for Deep Dives
- Make a Flame Graph in Two Commands
- Real-World Example: The Cache That Wasn’t
- Three Quick Wins with perf
- Questions I Get All the Time
- Your Next Five Minutes
It’s 3 a.m. Your App Just Got Painfully Slow
You pushed the release yesterday. Everything worked on your laptop. Now users in Tokyo say the page takes **ten seconds to load.**
You stare at the logs. Nothing looks off. The CPU meter sits at 97 %. You’re out of coffee. What do you check next?
Skip the wild goose chase. **There’s a tool called perf** that will tell you *exactly* which function is melting the fan on your server.
The Needle-in-a-Haystack Problem
I once wasted three days adding print statements—only to discover the bug lived on **line seven of the string parser.** One line. Three days. That stings.
Linux perf fixes this. Instead of guessing, you get **a thermal map** of every line of code:
- The hot function names.
- How many CPU cycles each one eats.
- The exact call stack that brought you there.
Turns out **68 % of slowdowns hide in less than 5 % of code,** according to a 2023 NIST study. Let the computer tell you where that 5 % lives.
Install perf in 30 Seconds
On Ubuntu or Debian:
sudo apt install linux-tools-common linux-tools-$(uname -r)
On CentOS or RHEL:
sudo yum install perf
Install it on the host, not inside a container, or you’ll miss events from the rest of the machine.
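Before profiling, it can help to confirm the tool actually landed and note the kernel’s permission knob. A minimal sketch — the check_perf helper is my own name, not part of perf:

```shell
# Hypothetical sanity check: is perf on PATH, and what does the kernel allow?
check_perf() {
  if command -v perf >/dev/null 2>&1; then
    echo "ok: $(perf --version)"
  else
    echo "missing: try installing linux-tools-$(uname -r)"
  fi
}
check_perf
# perf_event_paranoid gates unprivileged profiling: 2 = own processes only,
# 1 = adds kernel samples, 0 or -1 = system-wide.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
  echo "perf_event_paranoid=$(cat /proc/sys/kernel/perf_event_paranoid)"
else
  echo "perf_event_paranoid=unknown"
fi
```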
Find the Slow Spots in Real Time
Open two terminals:
- In the first, start your misbehaving app.
- In the second, run perf top—a live table of the hottest functions blazing by.
- Wait ten seconds. If you see your function in **bold red at the top,** that’s the culprit.
Example output:
62 %  my_app  my_app        [.] json_parse_utf8
19 %  my_app  libc-2.31.so  [.] malloc
JSON parser wins the race to the bottom this time. (We fixed it by switching libraries, saved 300 ms. Users cheered.)
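When the live table is scrolling fast, it can help to grab just the top symbol from a saved copy of the output. A toy helper — hottest is my name, and the column layout mirrors the sample above but varies across perf versions, so treat the field positions as an assumption:

```shell
# Assumed row format: overhead, command, object, [.] symbol.
# The symbol is the last whitespace-separated field; row one is hottest.
hottest() {
  awk 'NR == 1 { print $NF; exit }'
}
printf '%s\n' \
  '62 %  my_app  my_app        [.] json_parse_utf8' \
  '19 %  my_app  libc-2.31.so  [.] malloc' | hottest
```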
Recording for Deep Dives
Need more detail? Record a full run of your app, or attach to a live process for 60 seconds:
sudo perf record -F 99 -g ./your_app --run=production-config
sudo perf record -F 99 -g -p <pid> -- sleep 60
sudo perf report
What you get:
- A tree of function call stacks.
- Percentage of time spent inside each one.
- DWARF-based backtraces if debug symbols are present.
Tip: pass --call-graph dwarf instead of the default frame-pointer unwinding if the stacks look truncated—it copes with binaries built without frame pointers.
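Those flags are easy to fat-finger at 3 a.m., so I keep them in a tiny wrapper. A sketch that prints the command for review instead of running it — record_pid and its two arguments are my invention, while -F, --call-graph, -p, and the `-- sleep` idiom are standard perf usage:

```shell
# Build a perf record invocation for an already-running process.
record_pid() {
  pid=$1
  secs=${2:-60}   # default: sample for 60 seconds
  echo "sudo perf record -F 99 --call-graph dwarf -p $pid -- sleep $secs"
}
record_pid 1234 30
```

Pipe the printed line into `sh` once you are happy with it.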
Make a Flame Graph in Two Commands
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg
xdg-open cpu.svg
(The scripts come from Brendan Gregg’s FlameGraph repo—clone it once into your working directory.)
A wide box is hot code. A sudden skyscraper of a stack is a deep, expensive call chain you can flatten or cache.
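If the stackcollapse step feels like magic, here is the idea in miniature: identical stacks are counted and emitted as `frame;frame;frame count` lines, which is the folded format flamegraph.pl consumes. A toy sketch, not the real Perl script:

```shell
# Fold duplicate semicolon-joined stacks into counted lines.
fold_stacks() {
  sort | uniq -c | awk '{ print $2, $1 }'
}
# Three fake samples: two share a stack, one does not.
printf '%s\n' 'main;parse;malloc' 'main;parse;malloc' 'main;render' | fold_stacks
```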
Real-World Example: The Cache That Wasn’t
We profiled an image-resizing server. The flame graph broke down like this:
- 40 % malloc inside the JPEG resize loop.
- 30 % memcpy after resizing.
The fix: instead of allocating new buffers every loop iteration, pre-allocate a per-thread buffer pool. **Speed: 4× faster, CPU: 60 % lower.** Tuesday saved.
Three Quick Wins with perf
- Run perf stat -a sleep 5 to eyeball cache-miss ratios versus CPU cycles.
- Add -e cache-misses to focus on slow memory reads.
- Use perf mem record, then perf mem report, to see which addresses miss the most.
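To turn the raw counters from perf stat into the ratio the first bullet talks about, a quick awk helper works. The miss_ratio name and the sample numbers are made up for illustration — substitute the cache-misses and cache-references counts from your own run:

```shell
# Ratio of cache-misses to cache-references, as a percentage.
miss_ratio() {
  awk -v m="$1" -v r="$2" 'BEGIN { printf "%.1f%%\n", 100 * m / r }'
}
miss_ratio 1200000 48000000   # hypothetical counters from a perf stat run
```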
Questions I Get All the Time
“Can perf run in production?”
Yes—keep the sampling rate at 49 Hz or 99 Hz and overhead typically stays under 2 %. Stop the recording when you’re done.
“Do I have to be root?”
For system-wide profiling, yes. For a single process, perf record -p <pid> works as the user who owns that process (subject to the kernel’s perf_event_paranoid setting).
“Does it profile Python, Go, Java?”
Yes, but you need debug symbols or frame pointers turned on. Python needs the interpreter’s debug symbols (the python3-dbg package on Debian/Ubuntu). Java needs -XX:+PreserveFramePointer plus a perf-map agent so JIT-compiled frames resolve. Go keeps frame pointers on by default on amd64, so perf record -g works out of the box.
Your Next Five Minutes
Install perf. Run perf top on your slow box right now. Spot **the worst one percent** and fix it. Push the patch. Ping me on x.com with your time saved. I’ll celebrate with you.
Happy hunting.
