Linux SRE Runbook: Systematic High CPU Troubleshooting
Scenario
In production environments, high CPU usage is a common yet critical issue that can degrade application performance and trigger alerts. As an SRE, you need a structured approach to diagnose and mitigate CPU spikes without causing further disruption. This runbook covers the essential steps for troubleshooting high CPU on Linux servers.
Symptoms
- Monitoring alerts: CPU utilization exceeds thresholds (e.g., >90% for 5 minutes).
- User complaints: Application response times increase; users experience timeouts.
- System metrics: In `top` or `htop`, %idle is very low, the load average is above the number of cores, and the process list shows high CPU consumption.
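The load-versus-cores symptom can be checked numerically. The sketch below is illustrative only: `load_per_core` is a hypothetical helper name, and it assumes a Linux host with `awk` available. It takes the load average and the core count as arguments so the arithmetic is easy to verify.

```shell
# Sketch: express the load average as a ratio of the core count.
# load_per_core is a hypothetical helper name, not a standard tool.

# Usage: load_per_core <load_average> <cores>
# Prints the ratio with two decimals. A value above 1.00 means there are more
# runnable (or uninterruptibly blocked) tasks than cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live host (reads /proc/loadavg; nproc comes from coreutils):
#   load_per_core "$(cut -d " " -f1 /proc/loadavg)" "$(nproc)"
```

A ratio that stays above 1.00 matches the symptom described above. A brief spike above 1.00 is not necessarily a problem on its own.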
Diagnosis
- Identify top CPU consumers: Run `ps aux --sort=-%cpu | head -20` to list processes sorted by CPU usage. Note the PID, %CPU, and command.
- Check system-wide CPU breakdown: Use `mpstat -P ALL 1 3` to see per-core utilization and identify whether a single core is pegged (e.g., by kernel work or interrupt handling).
- Differentiate user vs kernel time: `vmstat 1 5` shows the columns `us` (user) and `sy` (system). A high `sy` indicates kernel-level activity (system calls, interrupts).
- Inspect I/O wait: A high `wa` in `vmstat`, or a high `%iowait` in `iostat -x 1`, means tasks are blocked on I/O rather than computing. Blocked tasks also count toward the load average, so this can masquerade as CPU saturation.
- Look at thread-level details: For a specific PID, `top -H -p <PID>` or `ps -Lp <PID> -o pid,tid,%cpu,comm` reveals the threads causing the issue.
- Trace system calls (advanced): Use `strace -c -p <PID>` to see syscall counts, but only if you have permission and the impact is acceptable, because strace can slow the traced process considerably. Prefer `perf top -p <PID>`, which samples instead of intercepting every call.
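The user, system, and I/O wait split above can also be derived from two samples of the aggregate `cpu` line in `/proc/stat`, which is useful when `vmstat` or `mpstat` is not installed. This is a minimal sketch under stated assumptions: `cpu_breakdown` is a hypothetical helper name, and grouping nice time with user time, and irq plus softirq time with system time, is an assumption meant to approximate how `vmstat` reports these columns.

```shell
# Sketch: percentage breakdown of CPU time between two /proc/stat samples.
# cpu_breakdown is a hypothetical helper name.
# Field order on the "cpu" line: user nice system idle iowait irq softirq steal ...

# Usage: cpu_breakdown "<first cpu line>" "<second cpu line>"
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i }
        NR == 2 {
            # Tick deltas for user..steal; their sum is the elapsed CPU time.
            for (i = 2; i <= 9; i++) { d[i] = $i - first[i]; total += d[i] }
            if (total == 0) { print "no elapsed ticks"; exit 1 }
            printf "us=%.1f sy=%.1f id=%.1f wa=%.1f\n",
                100 * (d[2] + d[3]) / total,
                100 * (d[4] + d[7] + d[8]) / total,
                100 * d[5] / total,
                100 * d[6] / total
        }'
}

# On a live host:
#   first=$(grep "^cpu " /proc/stat); sleep 1; second=$(grep "^cpu " /proc/stat)
#   cpu_breakdown "$first" "$second"
```

Reading `/proc/stat` does not modify system state, so it carries the same low risk as the other read-only commands in this runbook.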
Commands (with safety notes)
# Safe: list top processes without modifying state
ps aux --sort=-%cpu | head -20
# Safe periodic monitoring (non-invasive)
top -b -d 5 -n 10
# Per-core CPU breakdown (safe)
mpstat -P ALL 1 3
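# Check for process creation bursts (execsnoop from bcc-tools, if available; usually needs root)
# Note: -T is the bcc-tools time column; the perf-tools version of execsnoop uses -t for timestamps.
execsnoop -T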
# To reduce CPU impact of a specific process (use with caution):
# Lower priority:
renice +10 -p <PID>
# Note: niceness only matters when other runnable tasks compete for the same cores. On an otherwise idle host, a reniced CPU-bound process still uses a full core.
# Kill process only if it's confirmed non-critical and after approval:
# kill <PID> # SIGTERM first, so the process can shut down cleanly
# kill -9 <PID> # SIGKILL; avoid unless absolutely necessary
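# If you suspect an infinite loop in a JVM process, generate a thread dump:
# kill -3 <PID> # sends SIGQUIT; the JVM prints its thread stacks to stdout and keeps running
# Do not send SIGQUIT to a Python process: by default the signal terminates it.
# For Python, use a sampling tool instead, e.g. py-spy dump --pid <PID> (if py-spy is installed).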
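To connect a hot thread found with `top -H` to a stack in a JVM thread dump, the decimal thread ID usually has to be matched against the `nid` field of the dump. The sketch below assumes a JVM that prints `nid` in hexadecimal (`nid=0x...`). Some newer JVM versions print it in decimal, in which case no conversion is needed. `tid_to_nid` is a hypothetical helper name, and `threads.txt` is a placeholder for wherever the dump was saved.

```shell
# Sketch: convert a decimal thread ID from top -H or ps -L into the
# hexadecimal form used by the nid= field of many JVM thread dumps.
# tid_to_nid is a hypothetical helper name.

# Usage: tid_to_nid <decimal_tid>
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Example: show the stack of thread 12345 in a saved dump
# (threads.txt is a placeholder file name):
#   grep -A 20 "nid=$(tid_to_nid 12345)" threads.txt
```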
Risk Controls
- Verify before killing: Use `systemctl status <service>` or consult monitoring dashboards to understand the process's role.
- Document actions: Log every command and observation for the post-mortem.
- Avoid `strace` and `perf` in production unless the team has given explicit approval, and prefer low-overhead sampling (e.g., `perf record -F 99 -p <PID> -g -- sleep 10` is generally safe for short bursts).
- Set up alerts for CPU usage so recurring patterns can be addressed before they become incidents.
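One way to keep the action log described above is a small wrapper that records each command before running it. This is only a sketch: `run_logged` and the `ACTION_LOG` variable are hypothetical names, and the default log path is an assumption to adjust for your environment.

```shell
# Sketch: record every mitigation command with a UTC timestamp, the invoking
# user, and the exit status. run_logged and ACTION_LOG are hypothetical names.
ACTION_LOG="${ACTION_LOG:-./cpu-incident-actions.log}"

# Usage: run_logged <command> [args...]
run_logged() {
    printf '%s user=%s cmd=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(id -un)" "$*" >> "$ACTION_LOG"
    status=0
    "$@" || status=$?    # keep going even if the command fails, so the status is logged
    printf '%s exit=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" >> "$ACTION_LOG"
    return "$status"
}

# Example:
#   run_logged renice +10 -p <PID>
```

The wrapper returns the wrapped command's exit status, so it can be used in scripts without hiding failures.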
Rollback
- If you killed a critical process, restart it immediately: `systemctl start <service>`.
- If you changed niceness, revert to the previous value (0 by default): `renice 0 -p <PID>`. Lowering a nice value usually requires root.
- If you applied temporary configuration changes, revert them.
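Reverting niceness is only reliable if the original value was recorded before the change. The sketch below reads it from `/proc/<PID>/stat`, where the nice value is field 19. `nice_from_stat` is a hypothetical helper name. On hosts with procps installed, `ps -o ni= -p <PID>` should give the same information.

```shell
# Sketch: extract the nice value (field 19) from a /proc/<PID>/stat line.
# nice_from_stat is a hypothetical helper name.

# Usage: nice_from_stat < /proc/<PID>/stat
nice_from_stat() {
    # Drop everything up to the last ") " so a command name containing spaces
    # or parentheses cannot shift the field positions. After that, the first
    # remaining field is field 3 (state), so nice is the 17th remaining field.
    sed 's/^.*) //' | awk '{ print $17 }'
}

# Example: record the value before changing it.
#   old_nice=$(nice_from_stat < /proc/<PID>/stat)
#   renice +10 -p <PID>
```

For rollback, use the recorded value in place of 0 in the renice command.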
Verification
- After mitigation, monitor CPU usage for 5-10 minutes using `top` or your preferred tool.
- Check application health endpoints (e.g., `/healthz`).
- Ensure the load average drops below the number of cores.
- Confirm that other metrics (memory, disk I/O) are not affected.
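The load check above is more convincing when it holds over the whole observation window rather than at a single moment. This is a minimal sketch: `all_below` is a hypothetical helper name, and the 30-second interval in the usage comment is an arbitrary choice.

```shell
# Sketch: succeed only if every sample read from stdin is below a threshold.
# all_below is a hypothetical helper name.

# Usage: <producer of one number per line> | all_below <threshold>
all_below() {
    awk -v max="$1" '
        { n++; if ($1 + 0 >= max + 0) bad++ }
        END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}

# Example: sample the 1-minute load average every 30 seconds for 5 minutes
# and compare each sample with the core count.
#   for _ in $(seq 10); do cut -d " " -f1 /proc/loadavg; sleep 30; done | all_below "$(nproc)"
```

An empty input is treated as a failure, so a broken sampling loop is not mistaken for a recovered system.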
When to Submit an OpsGlobal Ticket
- If CPU remains high after applying basic mitigation (renice, restart).
- If the root cause is unclear (e.g., kernel bug, hardware interrupt storm).
- If the issue requires code profiling or changes to application logic.
- If you need an SRE with deeper Linux kernel expertise or vendor support.
OpsGlobal can provide 24/7 remote SRE assistance to diagnose and resolve complex production issues, helping your team maintain high availability.
Use cases
Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.
Problem background
A practical guide for SREs to diagnose and resolve high CPU usage on Linux production servers, with step-by-step commands, safety controls, and rollback procedures.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.