Linux SRE Runbook: Production Troubleshooting

DevOps 2026-06-16

Tags: Linux, SRE, Troubleshooting, Performance Tuning

Scenario

A production Linux server running a critical web service becomes unresponsive: health checks fail, response times spike, and users report errors.

Symptoms

uptime shows load average exceeding CPU core count (e.g., load 20 on a 16-core machine).
top reveals processes consuming high CPU/memory or in uninterruptible sleep (D state).
vmstat indicates run queue (r) consistently above core count, or significant swapping (si/so nonzero).
iostat -x reports disk utilization (%util) near 100% or await times above 100 ms.
External curl requests time out or return 5xx errors.
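
The first symptom, load average above core count, can be checked mechanically. The following is a minimal sketch, assuming a Linux host that provides /proc/loadavg and nproc; load_exceeds_cores and check_load_now are hypothetical helper names, not standard tools.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when the load average exceeds the core count.
# $1 = load average (e.g. 20.15), $2 = number of CPU cores (e.g. 16)
load_exceeds_cores() {
    # awk compares the two values numerically, so 9.5 is not "greater" than 16
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Hypothetical wrapper for live use: 1-minute load from /proc/loadavg, cores from nproc.
check_load_now() {
    load=$(cut -d' ' -f1 /proc/loadavg)
    cores=$(nproc)
    if load_exceeds_cores "$load" "$cores"; then
        echo "OVERLOADED: load $load on $cores cores"
    else
        echo "ok: load $load on $cores cores"
    fi
}
```

Calling check_load_now on the affected host prints one line that can be pasted into the incident timeline. Note that on Linux the load average also counts tasks in uninterruptible sleep (D state), so a high value can point to I/O stalls rather than CPU saturation.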

Diagnosis

Quick check: uptime, free -h, df -h.
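CPU analysis: top -bn1 -o %CPU | head -20 lists the heaviest CPU consumers (use -o %MEM for memory; in interactive top, press P or M to re-sort). Identify abnormal processes.
Memory analysis: vmstat 1 5; look at swap activity (si/so) and ignore the first sample, which is the average since boot. cat /proc/meminfo for details.
Disk analysis: iostat -x 1 5; focus on %util, r/s, w/s and await (reported as r_await and w_await in recent sysstat versions). Use iotop -o (as root) to find I/O-heavy PIDs.
Network analysis: ss -Htan 'sport = :80' | wc -l counts sockets on local port 80 and is faster than netstat -tan; a plain grep :80 would also match ports such as 8080. ss -s gives a per-state summary. Check for connection exhaustion.
System logs: journalctl -u <service> --since "5 minutes ago" or tail -n 100 /var/log/syslog (/var/log/messages on RHEL-family systems).
Process tracing: strace -c -p <PID> summarizes syscalls (press Ctrl-C to print the summary); strace -p <PID> -e trace=network shows network calls. strace slows the traced process noticeably, so keep sessions short in production.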
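
The memory and run-queue checks in the list above can be turned into a small filter over vmstat output. This is a sketch only: vmstat_flags is a hypothetical name, the core count is passed in by the caller, and the column positions are looked up from the header row rather than hard-coded because the layout varies slightly between procps versions.

```shell
#!/bin/sh
# Hypothetical helper: read `vmstat 1 5` output on stdin and flag pressure.
# $1 = number of CPU cores on the host
vmstat_flags() {
    awk -v cores="$1" '
        # Header row: remember which column holds r, si, so, ...
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Skip the banner line and anything else that is not a data row
        $1 !~ /^[0-9]+$/ { next }
        # The first data row is the average since boot, not a live sample
        !skipped { skipped = 1; next }
        {
            n++
            if ($(col["r"]) > cores) busy++
            if ($(col["si"]) + $(col["so"]) > 0) swapping++
        }
        END {
            if (n && busy == n) print "run queue above core count in every sample"
            if (swapping) print "swap activity in " swapping " of " n " samples"
        }'
}
```

A typical invocation would be vmstat 1 5 | vmstat_flags "$(nproc)". No output means neither condition was observed in the sampled window.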

Risk Controls

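Never kill -9 blindly. kill -0 <PID> only tests that the process exists and can be signalled. kill -3 <PID> (SIGQUIT) prints a thread dump for JVM processes, but most other programs terminate and dump core on SIGQUIT, so use it only on a JVM.
Avoid restarting services until the root cause is understood; snapshot state first (ps aux > /tmp/ps.before).
If a memory leak is suspected, capture top -bn1 and pmap -x <PID> before any action.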
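
Snapshotting state is easier to do consistently with a small script. The sketch below is one possible shape, not a standard tool: capture and snapshot_state are hypothetical names, and the set of commands captured is an assumption that should be adjusted per service.

```shell
#!/bin/sh
# Hypothetical helper: run a command and save stdout+stderr to
# $SNAP_DIR/<name>.txt. A missing tool is recorded instead of aborting.
# $1 = output name, remaining arguments = command to run
capture() {
    name=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$SNAP_DIR/$name.txt" 2>&1 || true
    else
        echo "skipped: $1 not installed" > "$SNAP_DIR/$name.txt"
    fi
}

# Hypothetical wrapper: take a pre-intervention snapshot and print its directory.
# $1 = optional PID of the suspect process (adds a pmap capture)
snapshot_state() {
    SNAP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/snapshot.XXXXXX") || return 1
    capture ps      ps aux
    capture top     top -bn1
    capture meminfo cat /proc/meminfo
    if [ -n "${1:-}" ]; then
        capture pmap pmap -x "$1"
    fi
    echo "$SNAP_DIR"
}
```

Running dir=$(snapshot_state <PID>) before a restart leaves a directory that can be compared with a second snapshot taken afterwards.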

Rollback

Configuration change: restore from backup (e.g., cp /etc/nginx/nginx.conf.bak /etc/nginx/nginx.conf).
Software update: downgrade via package manager (apt-get install <pkg>=<version> or yum downgrade <pkg>).
If the cause is unknown: snapshot state, then restart the service or scale out (if orchestrated).
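
A configuration rollback only works if the backup was taken before the change. The following sketch pairs the two steps; backup_config and restore_config are hypothetical helper names, and the .bak and .failed suffixes are assumptions that follow the nginx.conf.bak convention used above.

```shell
#!/bin/sh
# Hypothetical helper: copy <file> to <file>.bak before editing it,
# preserving mode and timestamps.
backup_config() {
    cp -p "$1" "$1.bak"
}

# Hypothetical helper: restore <file> from <file>.bak and keep the broken
# version as <file>.failed for the post-incident review.
restore_config() {
    [ -f "$1.bak" ] || { echo "no backup for $1" >&2; return 1; }
    cp -p "$1" "$1.failed" &&
    cp -p "$1.bak" "$1"
}
```

For nginx, a rollback could then read restore_config /etc/nginx/nginx.conf && nginx -t && systemctl reload nginx, so the restored file is syntax-checked before the service picks it up.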

Verification

Service health: curl -I http://localhost:80 should return HTTP 200.
Load normalized: uptime shows the load average back at baseline.
Metrics: CPU/memory usage drops, disk I/O normalizes.
Run existing monitoring checks.
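
A single curl immediately after a restart can fail while the service is still warming up, so verification is usually a short retry loop. The sketch below assumes curl is installed; wait_until and http_200 are hypothetical helper names, and the attempt count and delay are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: retry a probe command until it succeeds.
# $1 = max attempts, $2 = seconds between attempts, rest = probe command
wait_until() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt is coming
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    echo "still failing after $attempts attempt(s)" >&2
    return 1
}

# Hypothetical probe: succeed only when the URL answers with HTTP 200
# within 5 seconds.
http_200() {
    [ "$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1")" = "200" ]
}
```

For example, wait_until 10 3 http_200 http://localhost:80 probes the service up to ten times, three seconds apart, and returns nonzero if it never answers with 200.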

When to Submit an OpsGlobal Ticket

After 30 minutes of troubleshooting without clear root cause.
Need advanced kernel debugging (perf, ftrace, crash dump analysis).
Issue spans multiple nodes or clusters.
Need hardware or vendor escalation.

Attach to ticket:
Incident timeline and symptoms.
Output of executed commands (top -bn1, vmstat, iostat).
Relevant log snippets.
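
The attachments are easier to collect if the timeline and command outputs live in one directory from the start. This is a sketch under assumptions: note and bundle_evidence are hypothetical helpers, EVIDENCE_DIR is a variable introduced here for illustration, and the archive naming scheme is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: append a UTC-timestamped line to the incident timeline.
# Requires EVIDENCE_DIR to point at an existing directory.
note() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$EVIDENCE_DIR/timeline.txt"
}

# Hypothetical helper: pack an evidence directory into a tarball and print its path.
# $1 = evidence directory, $2 = optional output directory (default /tmp)
bundle_evidence() {
    src=$1
    out="${2:-/tmp}/incident-$(uname -n)-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" && echo "$out"
}
```

During the incident, note "restarted nginx, load still 20" records each action as it happens, and bundle_evidence "$EVIDENCE_DIR" produces a single file to attach to the ticket.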

Use cases

Useful for teams that handle production Linux incidents and need a clear troubleshooting and escalation workflow.

Problem background

A practical guide for diagnosing Linux server performance issues in production, including symptoms, commands, risk controls, rollback, and when to escalate to OpsGlobal.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
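
One way to enforce the environment-variable rule is to have scripts fail fast when a required secret is missing, instead of falling back to a hard-coded value. The sketch below assumes printenv is available; require_env is a hypothetical helper, and API_TOKEN and DB_PASSWORD are placeholder names.

```shell
#!/bin/sh
# Hypothetical helper: return nonzero if any named environment variable is
# unset or empty. Only the names are printed, never the values.
require_env() {
    missing=0
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A script would start with require_env API_TOKEN DB_PASSWORD || exit 1 and then reference "$API_TOKEN" rather than embedding the secret in the command line or the script itself.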

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
