Linux SRE Runbook: Systematic High CPU Troubleshooting

Scenario

In production environments, high CPU usage is a common yet critical issue that can degrade application performance and trigger alerts. As an SRE, you need a structured approach to diagnose and mitigate CPU spikes without causing further disruption. This runbook covers the essential steps for troubleshooting high CPU on Linux servers.

Symptoms

Monitoring alerts: CPU utilization exceeds thresholds (e.g., >90% for 5 minutes).
User complaints: Application response times increase; users experience timeouts.
System metrics: In top or htop, idle time (id) is near zero, load average stays above the number of cores, and one or more processes show sustained high %CPU.

Diagnosis

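Identify top CPU consumers: Run ps aux --sort=-%cpu | head -20 to list processes sorted by CPU usage. Note the PID, %CPU, and command. Keep in mind that ps reports %CPU averaged over the process's lifetime, so a recent spike in a long-running process may rank low; confirm with top, which shows usage over the last refresh interval.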
Check system-wide CPU breakdown: Use mpstat -P ALL 1 3 to see per-core utilization and identify if one core is pegged (e.g., due to kernel or interrupt handling).
Differentiate user vs kernel time: vmstat 1 5 shows columns us (user) and sy (system). High sy indicates kernel-level activity (system calls, interrupts).
Inspect I/O wait: High wa in vmstat or iostat -x 1 means CPUs are idle while waiting for outstanding I/O. Tasks blocked on I/O still raise the load average, so an I/O bottleneck can masquerade as CPU saturation.
Look at thread-level details: For a specific PID, top -H -p <PID> or ps -Lp <PID> -o pid,tid,%cpu,comm reveals threads causing the issue.
Trace system calls (advanced): strace -c -p <PID> summarizes syscall counts and time, but strace slows the traced process considerably, so attach only briefly and only with permission. For CPU profiling, sampling with perf top -p <PID> has much lower overhead.
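The us, sy, and wa figures used in the steps above can also be derived from two reads of /proc/stat when vmstat or mpstat is not installed. The following is a minimal sketch, not a replacement for those tools; the function name cpu_breakdown is hypothetical, and the grouping of irq and softirq time under sy is an assumption about how the columns are usually combined.

```shell
# cpu_breakdown SAMPLE1 SAMPLE2
# Each argument is the aggregate "cpu ..." line (first line of /proc/stat).
# Prints the share of user, system, iowait and idle time between the samples.
cpu_breakdown() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    # First sample: remember the counters user..steal (fields 2 to 9).
    NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
    # Second sample: compute per-counter deltas and their sum.
    NR == 2 {
      for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
      if (total <= 0) { print "no CPU time elapsed between samples"; exit 1 }
      printf "us=%.1f sy=%.1f wa=%.1f id=%.1f\n",
        100 * (d[2] + d[3]) / total,
        100 * (d[4] + d[7] + d[8]) / total,
        100 * d[6] / total,
        100 * d[5] / total
    }'
}

# Example usage on a Linux host:
#   s1=$(head -n 1 /proc/stat); sleep 1; s2=$(head -n 1 /proc/stat)
#   cpu_breakdown "$s1" "$s2"
```

Steal time is counted in the total but not printed, so the four values may sum to less than 100 on a virtual machine.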

Commands (with safety notes)

# Safe: list top processes without modifying state
ps aux --sort=-%cpu | head -20

# Safe periodic monitoring (non-invasive)
top -b -d 5 -n 10

# Per-core CPU breakdown (safe)
mpstat -P ALL 1 3

# Check for process creation bursts (execsnoop from bcc-tools, if available;
# the perf-tools version of execsnoop uses -t for timestamps instead of -T)
execsnoop -T

# To reduce CPU impact of a specific process (use with caution):
# Lower priority:
renice +10 -p <PID>
# Note: niceness only matters when processes compete for the same CPUs.
# On a host with idle cores, a reniced process still uses as much CPU as before.

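# Terminate a process only if it is confirmed non-critical and after approval.
# Prefer SIGTERM so the process can shut down cleanly:
# kill <PID>
# kill -9 <PID>   # Last resort: SIGKILL gives the process no chance to clean up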

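# If you suspect an infinite loop in a JVM process, request a thread dump:
# kill -3 <PID>   # sends SIGQUIT; the JVM prints thread stacks to its stdout and keeps running
# Do not send SIGQUIT to a Python process: by default it terminates the process.
# If py-spy is installed, py-spy dump --pid <PID> prints Python thread stacks instead.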
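When matching a hot thread from top -H against a JVM thread dump, note that top shows the thread ID in decimal, while many JDK versions print the native thread ID in hexadecimal as nid=0x... (check the format your JDK uses). A small sketch of the conversion, with a hypothetical helper name tid_to_nid:

```shell
# tid_to_nid TID
# Convert a decimal thread ID (as shown by top -H or ps -L) to the
# hexadecimal form that can be searched for in a thread dump.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Example: thread 12345 from top -H
tid_to_nid 12345   # prints 0x3039
```

Searching the dump for the printed value should lead to the stack of the thread that is consuming CPU.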

Risk Controls

Verify before killing: Use systemctl status <service> or consult monitoring dashboards to understand the process's role.
Document actions: Log every command and observation for post-mortem.
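Avoid strace and perf in production without explicit team approval. When profiling is approved, prefer low-frequency sampling over tracing; for example, perf record -F 99 -p <PID> -g -- sleep 10 samples at 99 Hz for 10 seconds and typically adds little overhead.
Set up CPU usage alerts so recurring patterns are caught before they become incidents.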
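One way to follow the "document actions" control is to run commands through a small wrapper that records what was run and what it printed. The sketch below is an illustration only: the function name run_logged, the RUNBOOK_LOG variable, and the default log path are all hypothetical. It buffers output until the command exits, so it suits short, non-interactive commands rather than top or perf.

```shell
# Hypothetical log location; override by exporting RUNBOOK_LOG.
RUNBOOK_LOG="${RUNBOOK_LOG:-./incident-actions.log}"

# run_logged COMMAND [ARGS...]
run_logged() {
  # Record a UTC timestamp and the exact command line before running it.
  printf '%s $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$RUNBOOK_LOG"
  # Capture stdout and stderr together and remember the exit status.
  rl_output=$("$@" 2>&1)
  rl_status=$?
  # Show the output on the terminal and append it to the log.
  if [ -n "$rl_output" ]; then
    printf '%s\n' "$rl_output" | tee -a "$RUNBOOK_LOG"
  fi
  # Return the command's own exit status to the caller.
  return "$rl_status"
}

# Example usage:
#   run_logged uptime
#   run_logged renice +10 -p <PID>
```

The resulting file gives a timestamped sequence of actions that can be attached to the post-mortem.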

Rollback

If you killed a critical process, restart it immediately: systemctl start <service>.
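If you changed niceness, restore the value the process had before the change (0 is the default for most processes): renice 0 -p <PID>. Lowering a nice value back toward 0 requires root or CAP_SYS_NICE.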
If you applied temporary configuration changes, revert the changes.
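Reverting niceness is only reliable if the original value was recorded first. A minimal sketch, assuming a Linux /proc filesystem: the nice value is field 19 of /proc/<PID>/stat. The helper name nice_from_stat, the PID 4242, and the /tmp path in the example are hypothetical.

```shell
# nice_from_stat
# Reads one /proc/<PID>/stat line on stdin and prints the nice value.
nice_from_stat() {
  # Strip "pid (comm) " first: the command name may contain spaces or
  # parentheses, so cut at the last ") ". Field 19 of the full line is
  # then field 17 of the remainder.
  sed 's/^.*) //' | awk '{ print $17 }'
}

# Example usage before and after a change:
#   nice_from_stat < /proc/4242/stat > /tmp/nice-4242.before
#   renice +10 -p 4242
#   # Rollback with util-linux renice, run as root:
#   renice -n "$(cat /tmp/nice-4242.before)" -p 4242
```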

Verification

After mitigation, monitor CPU usage for 5-10 minutes using top or your preferred tool.
Check application health endpoints (e.g., /healthz).
Ensure load average drops below number of cores.
Confirm that other metrics (memory, disk I/O) are not affected.
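The load-average check above can be scripted so it gives a clear pass or fail. This is a sketch under the assumption that the 1-minute load average and the online core count are the right inputs for your host; the function name load_below_cores is hypothetical.

```shell
# load_below_cores LOAD CORES
# Succeeds (exit 0) when the load average is below the number of cores.
load_below_cores() {
  # awk handles the floating-point comparison that plain sh cannot.
  awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 < cores + 0) }'
}

# Example usage on a Linux host:
#   if load_below_cores "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"; then
#     echo "load average is below core count"
#   else
#     echo "load average is still at or above core count"
#   fi
```

Because the load average is a decaying average, it can take a minute or more to reflect a successful mitigation.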

When to Submit an OpsGlobal Ticket

If CPU remains high after applying basic mitigation (renice, restart).
If the root cause is unclear (e.g., kernel bug, hardware interrupt storm).
If the issue requires code profiling or changes to application logic.
If you need an SRE with deeper Linux kernel expertise or vendor support.

OpsGlobal can provide 24/7 remote SRE assistance to diagnose and resolve complex production issues, helping your team maintain high availability.

Use cases

Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.

Problem background

A practical guide for SREs to diagnose and resolve high CPU usage on Linux production servers, with step-by-step commands, safety controls, and rollback procedures.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
