Linux SRE Runbook: Production Troubleshooting Deep Dive

This article walks through a real-world scenario of high CPU load on a production Linux server, covering symptoms, diagnostic commands, risk controls, rollback steps, and when to escalate to OpsGlobal.

DevOps 6min 3 views 2026-06-14
Tags: Linux, SRE, Troubleshooting

Scenario: A production Linux server (e.g., a Kubernetes worker node) running a Java application suddenly shows high CPU load, with a load average more than 4x the number of CPU cores. Users report slow responses and timeouts, and monitoring (e.g., Prometheus + Alertmanager) fires alerts.

Symptoms:
- High load average: uptime shows load > 16 on a 4-core machine (more than 4x the core count)
- High CPU usage: top shows >90% us (user) or sy (system)
- Application response latency spikes
- Possibly increased context switches: vmstat shows cs > 10000 per second
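
The load-average symptom can be checked mechanically. The following is a minimal sketch, not a standard tool: the function names load_exceeds and check_load are hypothetical, and the factor of 4 comes from the scenario above. It reads the 1-minute load from /proc/loadavg and the core count from nproc.

```bash
# Return 0 (true) when <load> is greater than <cores> * <factor>.
# awk performs the floating-point comparison that plain sh cannot.
load_exceeds() {
    awk -v load="$1" -v cores="$2" -v factor="$3" \
        'BEGIN { exit !(load > cores * factor) }'
}

# Compare the live 1-minute load average against 4x the core count.
check_load() {
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(nproc)
    if load_exceeds "$load" "$cores" 4; then
        echo "HIGH: load $load on $cores cores"
    else
        echo "ok: load $load on $cores cores"
    fi
}
```

Calling check_load on the affected host prints one line. Adjust the factor if your alerting uses a different threshold.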

Diagnosis:
1. Check overall system health: top (sort by CPU), htop for per-core view.
2. Identify the culprit process: ps aux --sort=-%cpu | head -5
3. Drill down with strace -c -p <PID> to see system call counts, or strace -T -p <PID> to see timing.
4. Use perf top to see which functions are hot.
5. Check if it's kernel-related: vmstat 1 and watch for high sy (system) time; mpstat -P ALL 1 to see per-CPU breakdown.
6. For Java apps, use jstack or jcmd to capture thread dumps and analyze with tools like FastThread.
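
For step 5, reading vmstat by eye is error-prone under pressure. The sketch below averages the cs and sy columns from vmstat output. The function name vmstat_avg is hypothetical. It locates columns by header name rather than position, and it skips the first data row, because that row holds averages since boot rather than a live sample.

```bash
# Average the "cs" and "sy" columns of vmstat output read from stdin.
vmstat_avg() {
    awk '
        $1 == "r" && $2 == "b" {            # header row with column names
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("cs" in col) && $1 ~ /^[0-9]+$/ {  # data rows only
            if (seen++ == 0) next           # skip the since-boot row
            cs += $(col["cs"]); sy += $(col["sy"]); n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "cs=%d sy=%d\n", cs / n, sy / n
        }
    '
}
```

A typical call would be vmstat 1 6 | vmstat_avg, which averages five live samples. A high sy value alongside a high cs value points toward kernel-side work or lock contention rather than pure application compute.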

Commands:

# Quick assessment
uptime
top -b -n1 | head -20

# Find top CPU-consuming processes
ps aux --sort=-%cpu | head -10

# Monitor system stats
vmstat 1 5
mpstat -P ALL 1 5

# Profile a specific process
sudo strace -c -p <PID>  # run for 10 seconds, then Ctrl+C
sudo perf top -p <PID>   # real-time profiling

# For Java: get thread dump (run as the same user that owns the JVM process)
jstack <PID> > threaddump.txt

Risk Controls:
- Never kill a process without understanding its role. If it is a critical service, restart it gracefully instead.
- Use strace with caution in production: it can slow the traced process considerably. Bound the duration, e.g., sudo timeout 10 strace -c -p <PID>.
- perf needs elevated privileges (use sudo) and adds overhead of its own, so keep profiling sessions short.
- For Java, make sure logging and heap dump settings are in place before forcing a restart, so the evidence survives.
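
The strace precaution can be wrapped in a small helper so that nobody attaches to the wrong PID or forgets to detach. This is a sketch under assumptions: the function name trace_for and the default output file name are hypothetical, and it is expected to run as root, or with sudo placed in front of timeout.

```bash
# Attach strace to a process for a bounded time and save the syscall summary.
# Arguments: <seconds> <pid> [output-file]
# Returns 2 without attaching if the PID does not exist.
trace_for() {
    secs=$1 pid=$2 out=${3:-strace-summary.txt}
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 2
    fi
    # timeout sends SIGTERM after $secs; strace then detaches from the
    # process and writes the -c summary to the output file.
    timeout "$secs" strace -c -p "$pid" -o "$out"
}
```

Because timeout ends the trace, the helper normally exits with status 124. The summary file is what matters.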

Rollback:
- If the issue started after a recent deployment or configuration change, revert that change.
- Restart the service: systemctl restart <service> or kubectl rollout restart deployment/<name>.
- If necessary, scale up replicas to distribute load temporarily.
- For kernel- or system-level issues, consider rebooting into a known good kernel version.
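
If the workload is a Kubernetes Deployment and the bad change was a rollout, reverting it might look like the sketch below. That is an assumption: the article itself only names kubectl rollout restart. The helper names run and rollback_deployment, the default namespace, and the 120s timeout are all hypothetical. Setting DRY_RUN=1 prints the commands instead of executing them, which is useful for review before touching production.

```bash
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Roll a Deployment back to its previous revision and wait for it to settle.
# Arguments: <deployment-name> [namespace]
rollback_deployment() {
    name=$1 ns=${2:-default}
    run kubectl -n "$ns" rollout history "deployment/$name" &&
    run kubectl -n "$ns" rollout undo "deployment/$name" &&
    run kubectl -n "$ns" rollout status "deployment/$name" --timeout=120s
}
```

The history step comes first so the revision list is captured in the terminal log before anything changes.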

Verification:
- After remediation, check uptime and top to confirm load has reduced.
- Verify application health via endpoints or monitoring dashboards.
- Run a few test transactions to ensure response times are normal.
- Monitor for the next 15-30 minutes for recurrence.
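
The 15-30 minute watch can be scripted so that a recurrence is not missed. A minimal sketch follows. The function name watch_load and the LOADAVG_FILE override (there only to make the function testable) are hypothetical, and the threshold is whatever your team considers healthy.

```bash
# Sample the 1-minute load average <count> times, <interval> seconds apart,
# and return 1 as soon as one sample exceeds <threshold>.
# Arguments: <threshold> <count> <interval-seconds>
watch_load() {
    threshold=$1 count=$2 interval=$3
    file=${LOADAVG_FILE:-/proc/loadavg}
    i=0
    while [ "$i" -lt "$count" ]; do
        load=$(cut -d ' ' -f 1 "$file")
        if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
            echo "recurrence: load $load > $threshold"
            return 1
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "stable: $count samples at or below $threshold"
}
```

For instance, watch_load 4 30 60 takes one sample per minute for 30 minutes on a 4-core host and stops early if the load climbs above 4.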

When to Submit an OpsGlobal Ticket:
- If the root cause is unclear despite thorough diagnostics.
- If the issue persists after applying standard rollback steps.
- If kernel-level tuning is needed (e.g., scheduler, IRQ balance).
- If you suspect a memory leak or hardware issue.
- Any time you need a second pair of eyes or 24/7 coverage.

Use cases

Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.


Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
