Scenario
A production Linux server running a critical web service becomes unresponsive: health checks fail, response times spike, and users report errors.
Symptoms
uptimeshows load average exceeding CPU core count (e.g., load 20 on a 16-core machine).topreveals processes consuming high CPU/memory or in uninterruptible sleep (D state).vmstatindicates run queue (r) consistently above core count, or significant swapping (si/sononzero).iostatreports disk utilization near 100% or await times > 100ms.- External
curlrequests time out or return 5xx errors.
Diagnosis
- Quick check:
uptime,free -h,df -h. - CPU analysis:
top -bn1 | head -20(sort by CPU with P, memory with M). Identify abnormal processes. - Memory analysis:
vmstat 1 5; look at swap activity.cat /proc/meminfofor details. - Disk analysis:
iostat -x 1 5; focus on%util,r/s,w/s,await. Useiotopto find I/O-heavy PIDs. - Network analysis:
netstat -tan | grep :80 | wc -lor fasterss -tn. Check for connection exhaustion. - System logs:
journalctl -u <service> --since "5 minutes ago"ortail -100 /var/log/syslog. - Process tracing:
strace -p <PID> -cto summarize syscalls;strace -p <PID> -e trace=networkfor network calls.
Risk Controls
- Never
kill -9blindly; usekill -0 <PID>to test, orkill -3 <PID>for thread dump. - Avoid restarting services unless root cause is understood; snapshot state first (
ps aux > /tmp/ps.before). - If memory leak suspected, capture
topandpmapbefore any action.
Rollback
- Configuration change: restore from backup (e.g.,
cp /etc/nginx/nginx.conf.bak /etc/nginx/nginx.conf). - Software update: downgrade via package manager (
apt-get install <pkg>=<version>oryum downgrade). - If unknown, restart service or scale out (if orchestrated).
Verification
- Service health:
curl -I http://localhost:80should return 200. - Load normalised:
uptimeload back to baseline. - Metrics: CPU/memory usage drops, disk I/O normalizes.
- Run existing monitoring checks.
When to Submit an OpsGlobal Ticket
- After 30 minutes of troubleshooting without clear root cause.
- Need advanced kernel debugging (
perf,ftrace, crash dump analysis). - Issue spans multiple nodes or clusters.
- Need hardware or vendor escalation.
Attach to ticket:
- Incident timeline and symptoms.
- Output of executed commands (top -bn1, vmstat, iostat).
- Relevant log snippets.
Use cases
Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.
Problem background
A practical guide for diagnosing Linux server performance issues in production, including symptoms, commands, risk controls, rollback, and when to escalate to OpsGlobal.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
Need help with a similar technical issue?
If your servers, Kubernetes, Docker, CI/CD, databases or monitoring systems have similar issues, submit logs and config files for remote diagnosis.