Kubernetes Incident Response and Cluster Reliability: Handling Node Pressure and Pod Evictions

Kubernetes | 6 min read | 2026-06-15

Tags: Kubernetes, Incident Response, Node Pressure, Pod Eviction, SRE

Scenario

A node in a Kubernetes cluster runs low on disk space or memory and crosses a kubelet eviction threshold. The kubelet then forcibly terminates running pods to reclaim resources. Pods managed by a controller such as a Deployment are recreated, usually on other nodes.

Symptoms

- Pod status shows Evicted (the pod phase is Failed, with reason Evicted)
- kubectl describe node shows the DiskPressure or MemoryPressure condition as True
- On the node, df -h shows disk usage above 85% or free -m shows critically low available memory
- Cluster monitoring alerts (e.g., Prometheus + Alertmanager) such as NodeDiskPressure or NodeMemoryPressure fire
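The disk symptom above can be checked mechanically. The following is a minimal sketch, not a kubelet feature: `disks_over_threshold` is a hypothetical helper name, and the 85% default simply mirrors the figure in the symptom list rather than any configured eviction threshold. It reads `df -P` style output on stdin so it can be tried without a live node.

```shell
#!/bin/sh
# Print every mount point whose usage is at or above a percentage threshold.
# Input: `df -P` output on stdin. Argument: threshold in percent (default 85).
disks_over_threshold() {
    threshold="${1:-85}"
    awk -v limit="$threshold" '
        NR == 1 { next }            # skip the header row
        {
            use = $5                # Capacity column, e.g. "91%"
            sub(/%$/, "", use)      # strip the percent sign
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Usage on the node (assumes mount points contain no spaces):
#   df -P | disks_over_threshold 85
```

An empty result means no filesystem has reached the threshold. Memory pressure needs a separate check, since `df` only covers disk.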

Diagnosis

1. Identify affected nodes and pods:

       kubectl get pods --all-namespaces | grep Evicted
       kubectl describe node <node-name> | grep -A5 Conditions

2. Log into the node and check resource usage:

       ssh <node-ip>
       df -h
       free -m
       docker system df      # if using Docker
       crictl imagefsinfo    # if using containerd or CRI-O

3. Review kubelet eviction logs on the node:

       journalctl -u kubelet -n 200 --no-pager | grep -E "(eviction|pressure|threshold)"
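To see where the evictions are concentrated, the pod listing from the first diagnosis step can be summarized per namespace. This is a small sketch under the assumption that the input is the default `kubectl get pods --all-namespaces` table (namespace in column 1, status in column 4); `evicted_by_namespace` is a hypothetical helper name.

```shell
#!/bin/sh
# Count evicted pods per namespace.
# Input: default `kubectl get pods --all-namespaces` table on stdin.
evicted_by_namespace() {
    awk '
        NR > 1 && $4 == "Evicted" { count[$1]++ }   # skip header, match STATUS
        END { for (ns in count) print ns, count[ns] }
    ' | sort
}

# Usage:
#   kubectl get pods --all-namespaces | evicted_by_namespace
```

A namespace with many evictions is often a good starting point for checking which workloads lack resource requests and limits.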

Risk Controls

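- Evict non-critical pods or clean up old logs/images to free space:

      # Removes ALL unused images older than 24h, not just dangling ones; they are re-pulled on next use
      docker image prune -a --force --filter "until=24h"
      # Delete rotated container logs older than 7 days. /var/log/containers holds only symlinks,
      # so target the real files under /var/log/pods and leave active logs untouched
      find /var/log/pods -type f -name "*.log.*" -mtime +7 -delete

- Temporarily relax the kubelet eviction thresholds (--eviction-hard, or evictionHard in the kubelet configuration file). This requires a kubelet restart and is high risk, because the node is then allowed to run closer to exhaustion.
- If the node is critical, use kubectl cordon to stop new pods from being scheduled onto it, and use kubectl drain only with caution, since draining evicts the remaining pods on the node.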
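Because deletions cannot be undone, it can help to list what a log cleanup would remove before running it for real. The sketch below is one way to do that; `preview_old_logs` is a hypothetical helper, and the directory passed to it is an assumption about where log files live on your nodes.

```shell
#!/bin/sh
# List log files older than N days without deleting anything.
# Arguments: directory to scan, minimum age in days (default 7).
preview_old_logs() {
    dir="$1"
    days="${2:-7}"
    # -type f skips symlinks, so only real files that occupy space are listed
    find "$dir" -type f -name '*.log*' -mtime "+$days" -print
}

# Usage on the node: review the list first, then delete deliberately.
#   preview_old_logs /var/log/pods 7
```

Once the list looks right, the same `find` expression can be rerun with `-delete` in place of `-print`.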

Rollback

Cleanup actions are usually irreversible, so rollback focuses on the evicted pods:

- Pods managed by a Deployment are replaced automatically with new pods. The evicted pod objects themselves stay in the Failed state until they are deleted or garbage-collected. Verify:

      kubectl rollout status deployment/<name> -n <namespace>

- If pods are not recreated, manually scale or restart the Deployment (replace 3 with the intended replica count):

      kubectl scale deployment <name> --replicas=3 -n <namespace>
      kubectl rollout restart deployment <name> -n <namespace>

- To roll back node-level changes (e.g., kubelet config), restore the original values and restart the kubelet.
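When several Deployments were affected, checking each rollout by hand is slow. As a rough sketch, the default `kubectl get deployments --all-namespaces` table can be filtered for Deployments whose ready count is below the desired count. `deployments_not_ready` is a hypothetical helper, and it assumes the READY column (`ready/desired`) is the third field.

```shell
#!/bin/sh
# Print Deployments whose ready replicas are below the desired count.
# Input: default `kubectl get deployments --all-namespaces` table on stdin.
deployments_not_ready() {
    awk '
        NR > 1 {
            split($3, r, "/")       # READY column looks like "2/3"
            if (r[1] + 0 < r[2] + 0) print $1 "/" $2, $3
        }
    '
}

# Usage:
#   kubectl get deployments --all-namespaces | deployments_not_ready
```

An empty result suggests every Deployment has recovered. Pods owned by other controllers (StatefulSets, DaemonSets, Jobs) are not covered by this check.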

Verification

- Node conditions return to False:

      kubectl describe node <node-name> | grep -A5 Conditions | grep -E "(DiskPressure|MemoryPressure)"

- All expected pods are Running:

      kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

- Monitoring alerts are cleared.
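The node condition check above can also be turned into a pass/fail test for use in a script. This is a sketch only: `pressure_cleared` is a hypothetical helper that reads `kubectl describe node` output and assumes each condition appears on a line whose first field is the condition name and whose second field is its status.

```shell
#!/bin/sh
# Exit 0 only if both DiskPressure and MemoryPressure are reported as False.
# Input: `kubectl describe node <node-name>` output on stdin.
pressure_cleared() {
    awk '
        $1 == "DiskPressure" || $1 == "MemoryPressure" {
            seen++
            if ($2 != "False") { bad = 1; print $1 "=" $2 }
        }
        # Fail if a condition is not False, or if either condition was missing
        END { if (bad || seen < 2) exit 1 }
    '
}

# Usage:
#   kubectl describe node <node-name> | pressure_cleared && echo "pressure cleared"
```

Treating missing conditions as a failure avoids reporting success when the command returned no usable output.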

When to Submit an OpsGlobal Ticket

- Node pressure recurs despite temporary cleanup
- Multiple nodes are under pressure, affecting cluster availability
- Long-term optimization is needed: cluster sizing, eviction policies, or autoscaling
- The team lacks deep Kubernetes ops experience and needs expert assistance with configuration and the postmortem

When submitting to OpsGlobal, include diagnostic command outputs, node resource trend graphs, and names of affected pods for faster response.
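The diagnostic outputs mentioned above could be gathered into one directory before opening a ticket. The following is a minimal sketch: `collect_node_diagnostics` and the output file names are hypothetical, and it only covers what `kubectl` can see from a workstation. Node-local data such as `df -h`, `free -m`, and kubelet logs would still need to be collected on the node itself.

```shell
#!/bin/sh
# Save cluster-side diagnostics for one node into a directory.
# Arguments: node name, output directory (default ./diag-<node>).
collect_node_diagnostics() {
    node="$1"
    diag_dir="${2:-./diag-$node}"
    mkdir -p "$diag_dir" || return 1
    # Node conditions, capacity, and recent node events
    kubectl describe node "$node" > "$diag_dir/node-describe.txt" 2>&1
    # Pods bound to this node, including evicted ones still recorded
    kubectl get pods --all-namespaces -o wide \
        --field-selector "spec.nodeName=$node" > "$diag_dir/pods-on-node.txt" 2>&1
    # Eviction events across the cluster
    kubectl get events --all-namespaces \
        --field-selector reason=Evicted > "$diag_dir/eviction-events.txt" 2>&1
    echo "$diag_dir"
}

# Usage:
#   collect_node_diagnostics <node-name> ./ticket-attachments
```

Review the files for secrets or internal hostnames before attaching them to a ticket.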

Use cases

Useful for teams that handle Kubernetes incidents and need a clear workflow from troubleshooting through to documented deliverables.

Problem background

Learn how to diagnose and resolve node pressure incidents causing pod evictions in Kubernetes clusters, with practical commands and rollback strategies.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
