Scenario
A node in a Kubernetes cluster experiences disk space shortage or memory pressure, triggering the kubelet eviction mechanism. Running pods are forcibly terminated and rescheduled.
Symptoms
- Pod status shows Evicted or Failed
- kubectl describe node shows the DiskPressure or MemoryPressure condition as True
- On the node, df -h shows disk usage > 85% or free -m shows critically low memory
- Cluster monitoring alerts (e.g., Prometheus + Alertmanager) fire NodeDiskPressure or NodeMemoryPressure
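The disk-usage symptom can be checked mechanically rather than by eye. The following is a minimal sketch, not an official tool: the function name mounts_over_threshold is hypothetical, the 85% default mirrors the figure used above, and it assumes POSIX-format df -P output with mount points that contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `df -P` output on stdin and prints
# "<mount point> <use%>" for every filesystem above the threshold.
# Assumption: mount points contain no spaces.
mounts_over_threshold() {
  local threshold="${1:-85}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header row
      use = $5
      sub(/%$/, "", use)          # "90%" -> "90"
      if (use + 0 > t + 0) print $6, use "%"
    }'
}

# Usage on the affected node:
#   df -P | mounts_over_threshold 85
```

An empty result means no filesystem is above the threshold, in which case memory pressure is the more likely trigger.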
Diagnosis
- Identify affected nodes and pods:
    kubectl get pods --all-namespaces | grep Evicted
    kubectl describe node <node-name> | grep -A5 Conditions
- Log into the node and check resource usage:
    ssh <node-ip>
    df -h
    free -m
    docker system df   # if using Docker
    journalctl -u kubelet -n 100 --no-pager | grep -i evict
- Review kubelet eviction logs:
    journalctl -u kubelet -n 200 --no-pager | grep -E "(eviction|pressure|threshold)"
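When many pods were evicted, a per-namespace count shows the blast radius faster than a raw grep. This is a rough sketch under stated assumptions: evicted_by_namespace is a hypothetical name, and the input is assumed to be the default table printed by kubectl get pods --all-namespaces, where STATUS is the fourth column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the default `kubectl get pods --all-namespaces`
# table on stdin and prints "<namespace> <evicted pod count>", sorted.
# Assumption: STATUS is the fourth whitespace-separated column.
evicted_by_namespace() {
  awk 'NR > 1 && $4 == "Evicted" { count[$1]++ }
       END { for (ns in count) print ns, count[ns] }' | sort
}

# Usage:
#   kubectl get pods --all-namespaces | evicted_by_namespace
```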
Risk Controls
- Evict non-critical pods or clean up old logs/images to free space:
    # Lower risk: remove unused Docker images created more than 24h ago (they are re-pulled when next needed)
    docker image prune -a --force --filter "until=24h"
    # Clean up stale container logs (be cautious not to delete active logs).
    # Entries under /var/log/containers are normally symlinks, so deleting them frees no space;
    # target the regular files under /var/log/pods instead.
    find /var/log/pods -type f -name "*.log" -mtime +7 -delete
- Temporarily relax the kubelet eviction thresholds (e.g., adjust --eviction-hard; requires a kubelet restart; high risk)
- If the node is critical, use kubectl cordon and kubectl drain with caution
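Because log deletion cannot be undone, it can help to list the candidates before removing anything. The sketch below is one possible dry-run step. The name list_stale_logs is hypothetical, the 7-day default follows the cleanup example above, and only regular files are matched, so symlinks are never reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints regular *.log files under <dir> that were
# last modified more than <days> days ago. Nothing is deleted.
list_stale_logs() {
  local dir="$1" days="${2:-7}"
  find "$dir" -type f -name '*.log' -mtime +"$days" -print
}

# Usage: review the list first, and delete only after confirming that
# none of the files belongs to a running container:
#   list_stale_logs <log-dir> 7
```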
Rollback
Cleanup actions are usually irreversible; rollback focuses on evicted pods:
- Evicted pods managed by a Deployment are recreated automatically by the Deployment controller and scheduled onto nodes that are not under pressure. Verify:
    kubectl rollout status deployment/<name> -n <namespace>
- If pods are not recreated, manually scale or restart the Deployment:
    kubectl scale deployment <name> --replicas=<desired-count> -n <namespace>
    kubectl rollout restart deployment <name> -n <namespace>
- To roll back node-level changes (e.g., kubelet config), restore the original values and restart the kubelet.
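Restoring original values is only possible if they were saved before the change. A minimal sketch of that habit follows. backup_config and restore_config are hypothetical names, and the kubelet config path in the usage comment is an assumption that varies by distribution and install method.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for keeping a timestamped copy of a config file.

# backup_config <file>: copies the file and prints the backup path.
backup_config() {
  local src="$1" dst
  dst="${src}.bak.$(date +%Y%m%d%H%M%S)"
  cp -p -- "$src" "$dst" && printf '%s\n' "$dst"
}

# restore_config <backup> <file>: puts the saved copy back in place.
restore_config() {
  cp -p -- "$1" "$2"
}

# Usage (the path is an assumption; check where your kubelet reads its config):
#   saved=$(backup_config /var/lib/kubelet/config.yaml)
#   ...edit the file, restart the kubelet, observe...
#   restore_config "$saved" /var/lib/kubelet/config.yaml
#   systemctl restart kubelet
```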
Verification
- Node conditions return to False:
    kubectl describe node <node-name> | grep -A5 Conditions | grep -E "(DiskPressure|MemoryPressure)"
- All expected pods are Running:
    kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
- Monitoring alerts are cleared.
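The two checks above can also be scripted so that the result is a pass/fail exit code instead of output to read. This is a sketch under assumptions: pressure_cleared and pods_not_healthy are hypothetical names, the first expects Type=Status lines (produced here with a kubectl jsonpath template), and the second expects the default kubectl get pods --all-namespaces table with STATUS in the fourth column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "Type=Status" lines on stdin and succeeds only
# if both DiskPressure and MemoryPressure are present and reported as False.
pressure_cleared() {
  awk -F= '
    $1 == "DiskPressure" || $1 == "MemoryPressure" {
      seen++
      if ($2 != "False") bad++
    }
    END { exit !(seen == 2 && bad == 0) }'
}

# Hypothetical helper: prints "<namespace> <pod> <status>" for every pod
# whose STATUS column is neither Running nor Completed.
pods_not_healthy() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}

# Usage:
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | pressure_cleared && echo "pressure cleared"
#   kubectl get pods --all-namespaces | pods_not_healthy
```

Matching on the STATUS column avoids false negatives from the grep -v approach when a pod or namespace name happens to contain the word Running.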
When to Submit an OpsGlobal Ticket
- Node pressure recurs despite temporary cleanup
- Multiple nodes are under pressure, affecting cluster availability
- Long-term optimization needed: cluster sizing, eviction policies, or autoscaling
- Team lacks deep Kubernetes ops experience; expert assistance required for configuration and postmortem
When submitting to OpsGlobal, include diagnostic command outputs, node resource trend graphs, and names of affected pods for faster response.
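Collecting the diagnostic outputs into one directory makes them easier to attach to a ticket. The following is a minimal sketch, assuming a hypothetical capture helper and a hypothetical output directory name. The commands in the usage comment are the ones from the Diagnosis section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: capture <outdir> <name> <command...>
# Runs the command and stores its stdout and stderr in <outdir>/<name>.txt.
# The return status is that of the command itself.
capture() {
  local outdir="$1" name="$2"
  shift 2
  mkdir -p "$outdir"
  "$@" > "$outdir/$name.txt" 2>&1
}

# Usage:
#   out="node-pressure-$(date +%Y%m%d%H%M%S)"
#   capture "$out" pods          kubectl get pods --all-namespaces
#   capture "$out" node          kubectl describe node <node-name>
#   capture "$out" disk          df -h
#   capture "$out" memory        free -m
#   capture "$out" kubelet-log   journalctl -u kubelet -n 200 --no-pager
#   tar -czf "$out.tar.gz" "$out"
```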
Use cases
Useful for teams handling Kubernetes issues and needing a clear troubleshooting and delivery workflow.
Problem background
Learn how to diagnose and resolve node pressure incidents causing pod evictions in Kubernetes clusters, with practical commands and rollback strategies.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.