Kubernetes Incident Response and Cluster Reliability: Handling Node Pressure and Pod Evictions

Learn how to diagnose and resolve node pressure incidents causing pod evictions in Kubernetes clusters, with practical commands and rollback strategies.

Kubernetes | 6 min read | 2026-06-15
Tags: Kubernetes, Incident Response, Node Pressure, Pod Eviction, SRE

Scenario

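A node in a Kubernetes cluster runs short of disk space or memory, and the kubelet's eviction mechanism starts terminating pods on that node to reclaim resources. Evicted pods are not restarted in place; if a controller such as a Deployment manages them, it creates replacement pods that the scheduler places on other nodes.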

Symptoms

  • Pod status shows Evicted (the pod phase is Failed with reason Evicted)
  • kubectl describe node shows DiskPressure or MemoryPressure condition as True
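  • On the node, df -h shows disk usage above 85%, or free -m shows critically low available memory (with default settings, the kubelet starts hard eviction when less than 10% of the node filesystem, 15% of the image filesystem, or 100Mi of memory remains available)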
  • Cluster monitoring alerts (e.g., Prometheus + Alertmanager) fire NodeDiskPressure or NodeMemoryPressure
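The disk check in the list above can be applied to every mount at once. The following is a minimal sketch, not part of the original runbook: the function name disk_usage_over is hypothetical, the 85% figure is the one quoted above, and the parsing assumes POSIX df -P output with mount points that contain no spaces.

```shell
# Print "<mount point> <usage>" for every filesystem at or above a usage threshold.
# Reads `df -P` formatted text on stdin so it can be tried with canned input.
# disk_usage_over is a hypothetical helper name, not a standard tool.
disk_usage_over() {
  threshold="$1"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      use = $5                    # "Capacity" column, e.g. "90%"
      sub(/%/, "", use)
      if (use + 0 >= t) print $6, use "%"
    }'
}

# On the affected node:
#   df -P | disk_usage_over 85
```

An empty result means no mount has reached the threshold; inode exhaustion is not covered by this check.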

Diagnosis

  1. Identify affected nodes and pods:
         kubectl get pods --all-namespaces | grep Evicted
         kubectl describe node <node-name> | grep -A5 Conditions
  2. Log into the node and check resource usage:
         ssh <node-ip>
         df -h
         free -m
         docker system df   # if using Docker
         journalctl -u kubelet -n 100 --no-pager | grep -i evict
  3. Review kubelet eviction logs:
         journalctl -u kubelet -n 200 --no-pager | grep -E "(eviction|pressure|threshold)"
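When many pods are evicted, it helps to see which nodes account for them before logging into any node. The sketch below counts evicted pods per node; evictions_per_node is a hypothetical helper, and it assumes the four-column custom-columns layout shown in the usage comment.

```shell
# Count evicted pods per node from `kubectl get pods -o custom-columns=...` output.
# Expects columns: NS NAME NODE REASON (see the usage comment below).
# evictions_per_node is a hypothetical helper name.
evictions_per_node() {
  awk 'NR > 1 && $4 == "Evicted" { count[$3]++ }
       END { for (n in count) print n, count[n] }' | sort
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,REASON:.status.reason \
#     | evictions_per_node
```

A node that dominates the output is the first candidate for the on-node checks in step 2.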

Risk Controls

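  • Evict non-critical pods or clean up old images and logs to free space:
         # Remove images not used by any container and created more than 24h ago (Docker nodes only)
         docker image prune -a --force --filter "until=24h"
         # Remove container log files not modified for more than 7 days.
         # /var/log/containers holds only symlinks into /var/log/pods, so deleting there frees no space;
         # target the real files, and never delete logs that are still being written.
         find /var/log/pods -type f -name "*.log" -mtime +7 -delete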
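  • Temporarily relax the kubelet eviction thresholds (e.g., adjust --eviction-hard). This does not add resources; it only delays eviction, requires a kubelet restart, and is high risk because the node has less headroom before it runs out of disk or memory entirely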
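  • If the node is critical, use kubectl cordon and kubectl drain with caution: cordon only stops new pods from being scheduled onto the node, while drain evicts the pods already running there and moves their load to the remaining nodes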
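The cordon-then-drain sequence from the last item can be wrapped so that draining is attempted only if cordoning succeeds. This is a sketch under assumptions: drain_node is a hypothetical name, the 300s timeout is an arbitrary example value, and --delete-emptydir-data (which discards emptyDir contents of the drained pods) may not be acceptable for every workload.

```shell
# Cordon a node, then drain it; stop if cordon fails.
# drain_node is a hypothetical helper; the timeout value is an example only.
drain_node() {
  node="$1"
  kubectl cordon "$node" &&
    kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=300s
}

# Usage:
#   drain_node <node-name>
# After the node is healthy again:
#   kubectl uncordon <node-name>
```

Confirm that the remaining nodes have enough capacity first; draining onto nodes that are already close to their own thresholds can spread the pressure.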

Rollback

Cleanup actions (deleted images and logs) are usually irreversible, so rollback focuses on evicted pods and node-level settings:

  • Evicted pods managed by a Deployment are replaced automatically by new pods. Verify:
         kubectl rollout status deployment/<name> -n <namespace>
  • If pods are not recreated, manually scale the Deployment back to its intended replica count or restart it:
         kubectl scale deployment <name> --replicas=<desired-count> -n <namespace>
         kubectl rollout restart deployment <name> -n <namespace>
  • To roll back node-level changes (e.g., kubelet config), restore the original values and restart the kubelet.
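Evicted pod objects stay in the API in phase Failed after their replacements start, which clutters pod listings. The sketch below removes only pods whose status reason is Evicted in one namespace; both function names are hypothetical, and it is worth running evicted_pod_names alone first to review what would be deleted.

```shell
# List the names of evicted pods in a namespace (phase Failed, reason Evicted).
# Both function names here are hypothetical helpers.
evicted_pod_names() {
  kubectl get pods -n "$1" --field-selector=status.phase=Failed \
    -o jsonpath='{range .items[?(@.status.reason=="Evicted")]}{.metadata.name}{"\n"}{end}'
}

# Delete each evicted pod record; other Failed pods are left untouched.
delete_evicted_pods() {
  namespace="$1"
  evicted_pod_names "$namespace" | while read -r pod; do
    [ -n "$pod" ] && kubectl delete pod "$pod" -n "$namespace" </dev/null
  done
}

# Usage:
#   evicted_pod_names <namespace>     # review first
#   delete_evicted_pods <namespace>
```

Deleting these records discards their status messages, so capture any details needed for the postmortem beforehand.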

Verification

  • Node conditions return to False:
         kubectl describe node <node-name> | grep -A5 Conditions | grep -E "(DiskPressure|MemoryPressure)"
  • All expected pods are Running (ideally only the header line remains; leftover Evicted pod records also appear here until they are deleted):
         kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
  • Monitoring alerts are cleared.
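The node-condition check can also be scripted so that it returns an exit code, which is convenient in a recovery checklist or a polling loop. This is a minimal sketch; node_pressure_cleared is a hypothetical name, and it reads the condition status directly with a jsonpath query rather than parsing describe output.

```shell
# Exit 0 only when DiskPressure and MemoryPressure are both "False" on the node.
# Exit 1 if a condition is still set, 2 if kubectl itself fails.
# node_pressure_cleared is a hypothetical helper name.
node_pressure_cleared() {
  node="$1"
  for cond in DiskPressure MemoryPressure; do
    status=$(kubectl get node "$node" \
      -o jsonpath="{.status.conditions[?(@.type==\"$cond\")].status}") || return 2
    if [ "$status" != "False" ]; then
      echo "$cond is $status on $node"
      return 1
    fi
  done
  echo "no disk or memory pressure on $node"
}

# Usage:
#   node_pressure_cleared <node-name>
#   until node_pressure_cleared <node-name>; do sleep 30; done
```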

When to Submit an OpsGlobal Ticket

  • Node pressure recurs despite temporary cleanup
  • Multiple nodes are under pressure, affecting cluster availability
  • Long-term optimization needed: cluster sizing, eviction policies, or autoscaling
  • Team lacks deep Kubernetes ops experience; expert assistance required for configuration and postmortem

When submitting to OpsGlobal, include diagnostic command outputs, node resource trend graphs, and names of affected pods for faster response.
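The command outputs mentioned above can be gathered into one directory before they scroll away. The sketch below is one possible layout, not an OpsGlobal requirement: collect_node_diagnostics and the file names are hypothetical, and trend graphs still need to be exported from the monitoring system separately.

```shell
# Save node description, pods on the node, and eviction events to a directory.
# Prints the directory path on success. Names are hypothetical examples.
collect_node_diagnostics() {
  node="$1"
  outdir="${2:-node-diag-$node}"
  mkdir -p "$outdir" || return 1
  kubectl describe node "$node" > "$outdir/describe-node.txt"
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$node" > "$outdir/pods-on-node.txt"
  kubectl get events --all-namespaces \
    --field-selector reason=Evicted > "$outdir/eviction-events.txt"
  echo "$outdir"
}

# Usage:
#   collect_node_diagnostics <node-name>
#   tar czf node-diag.tar.gz node-diag-<node-name>
```

Review the files for secrets or internal hostnames before attaching them to a ticket.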

Use cases

Useful for teams that handle Kubernetes incidents and need a clear workflow from troubleshooting through to delivery of findings.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
