Mastering Kubernetes Incident Response: Handling Node Disk Pressure

A practical guide to diagnosing and resolving a 'Node Not Ready' incident caused by disk pressure, with commands and risk controls.

Kubernetes | 6 min | 2026-06-17
Tags: Kubernetes, SRE, Incident Response, Cluster Reliability

Scenario

You are on call and receive an alert that a node in your cluster has become NotReady. Pods running on that node become unhealthy, and some are evicted. Colleagues report timeouts from applications hosted on that node.

Symptoms

  • kubectl get nodes shows a node as NotReady
  • kubectl describe node <node> shows DiskPressure condition as True
  • Pods on the node are stuck in Terminating or show as Evicted, and their replacements may sit in Pending until they can be scheduled elsewhere
  • kubelet logs contain eviction manager: attempting to reclaim ephemeral-storage
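
The DiskPressure symptom can be checked across every node in one pass instead of describing nodes one at a time. The following is a minimal sketch, assuming kubectl access to the cluster; the helper name nodes_with_disk_pressure is hypothetical, and it reads tab-separated "node, status" lines so the filter can be tried without a cluster.

```shell
# Hypothetical helper: reads "node<TAB>DiskPressure status" lines on stdin
# and prints the names of nodes whose DiskPressure condition is True.
nodes_with_disk_pressure() {
  awk -F '\t' '$2 == "True" { print $1 }'
}

# Usage against a live cluster (requires kubectl and cluster access):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
#   | nodes_with_disk_pressure
```

If more than one node is listed, the problem is probably not local to a single machine, which matters for the escalation criteria later in this guide.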

Diagnosis

  1. Confirm node status: kubectl get nodes -o wide
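  2. Check node conditions: kubectl describe node <node> | grep -A8 Conditions (eight lines of context are enough to show every condition row, including DiskPressure and Ready)
  3. SSH to the node and check disk usage: df -h and du -sh /var/lib/kubelet. Also run df -i, because running out of inodes triggers DiskPressure as well.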
  4. View kubelet logs: journalctl -u kubelet -n 100 --no-pager | grep -i pressure
  5. Check pod eviction events: kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node>
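
Step 3 usually comes down to finding which filesystem is nearly full. The sketch below flags mount points above a usage limit by parsing df -P output. The helper name mounts_over_threshold and the 85% default are illustrative assumptions, not kubelet settings; compare the result against the eviction thresholds configured on your nodes.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints every mount
# point whose used percentage is at or above the given limit (default 85).
mounts_over_threshold() {
  awk -v limit="${1:-85}" 'NR > 1 {
    used = $5
    sub(/%/, "", used)               # "95%" -> "95"
    if (used + 0 >= limit + 0) print $6, used "%"
  }'
}

# Usage on the node:
# df -P | mounts_over_threshold 85
```

The -P flag keeps each filesystem on a single line, so the column positions stay stable for awk. Mount points that contain spaces are not handled by this sketch.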

Commands

# Cordon the node to prevent new pods
kubectl cordon <node>

# Drain pods from the node (ignore DaemonSets, delete emptyDir data)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

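# (SSH to node) Free up disk space
ssh user@<node-ip>
sudo journalctl --vacuum-time=1d   # delete journal entries older than 1 day
sudo docker system prune -a -f     # remove stopped containers and all unused images (Docker nodes only)
# or, on containerd nodes: sudo crictl rmi --prune   # remove images not used by any container
sudo du -sh /var/log               # check how much space the logs take
sudo find /var/log -type f -name "*.log" -mtime +7 -delete   # delete *.log files older than 7 days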

# Verify freed space
df -h /var/lib/kubelet
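
Because find with -delete is irreversible, it can help to list the matching files first. This is a minimal sketch that uses the same match criteria as the cleanup command above but only prints the candidates; the helper name list_old_logs is hypothetical.

```shell
# Hypothetical helper: print (do not delete) *.log files under a directory
# that were last modified more than N days ago.
list_old_logs() {
  find "$1" -type f -name '*.log' -mtime +"$2" -print
}

# Usage on the node: review the list, then run the find command with
# -delete only once the list looks safe to remove.
# sudo sh -c 'find /var/log -type f -name "*.log" -mtime +7 -print'
# list_old_logs /var/log 7
```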

Risk Controls

  • Ensure PodDisruptionBudgets (PDBs) protect critical workloads before draining.
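  • If pods on the node use local storage (hostPath or local volumes), that data stays on the node and does not follow rescheduled pods. It is lost for good if the node is later replaced, so back it up first.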
  • --delete-emptydir-data clears emptyDir volumes; confirm no important temporary data.
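  • Cordon first so that no new pods land on the node while you investigate. kubectl drain also cordons the node itself, and it evicts pods through the eviction API so that PDBs are honored.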
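
The PDB check can be partly scripted before a drain. The sketch below lists PodDisruptionBudgets that currently allow zero disruptions; it assumes kubectl access, and the helper name pdbs_blocking_drain is hypothetical. It does not check whether the protected pods actually run on the affected node, so treat the output as a list of candidates to review.

```shell
# Hypothetical helper: reads "namespace/name<TAB>disruptionsAllowed" lines
# on stdin and prints the PDBs that currently allow zero disruptions.
pdbs_blocking_drain() {
  awk -F '\t' '$2 == "0" { print $1 }'
}

# Usage against a live cluster (requires kubectl and cluster access):
# kubectl get pdb --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' \
#   | pdbs_blocking_drain
```

A PDB with zero allowed disruptions makes the drain wait on the pods it protects, so it is worth knowing about before the drain starts.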

Rollback

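  1. If the node recovers after cleanup (DiskPressure clears and the node returns to Ready):
     • Uncordon the node: kubectl uncordon <node>
     • Verify pods are scheduled on it again: kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>
  2. If disk pressure persists or the hardware has failed:
     • Fully drain the node: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force (use with caution: --force also deletes pods that no controller will recreate)
     • Decommission and replace the node.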
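
Before uncordoning, it can be useful to wait until the node actually reports Ready rather than checking once. The following is a minimal polling sketch, assuming kubectl access; the function names, the attempt count and the delay are hypothetical choices. The built-in kubectl wait --for=condition=Ready node/<node> --timeout=300s covers the same need in a single command.

```shell
# Hypothetical helper: print the status of the node's Ready condition
# ("True", "False" or "Unknown").
node_ready_status() {
  kubectl get node "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Hypothetical helper: poll until the node is Ready.
# Arguments: node name, max attempts (default 30), delay in seconds (default 10).
# Returns 0 once the node is Ready, 1 if every attempt fails.
wait_until_ready() {
  node=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(node_ready_status "$node")" = "True" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: uncordon only after the node is Ready again.
# wait_until_ready <node> && kubectl uncordon <node>
```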

Verification

  • Node status becomes Ready: kubectl get nodes
  • All workloads are running: kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
  • Alerts are resolved.
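
The workload check above can be made a little stricter by matching on the STATUS column instead of grepping whole lines. This is a minimal sketch, assuming the default column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, ...); the helper name pods_not_healthy is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces --no-headers`
# output on stdin and prints "namespace/name status" for every pod whose
# STATUS column is neither Running nor Completed.
pods_not_healthy() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage against a live cluster (requires kubectl and cluster access):
# kubectl get pods --all-namespaces --no-headers | pods_not_healthy
```

An empty result means every pod is either Running or Completed. A Running pod can still have containers that are not ready, so check the READY column as well for critical workloads.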

When to Submit an OpsGlobal Ticket

  • The node cannot be recovered via cleanup (hardware failure/persistent issue).
  • Multiple nodes experience disk pressure simultaneously, requiring scaling or storage strategy changes.
  • Uncertainty about the impact of cleanup operations on production.
  • A permanent infrastructure change is needed (e.g., resizing node disks).

OpsGlobal's SRE team provides 24/7 support, automated remediation via Terraform, and designs self-healing cluster architectures.

Use cases

Useful for teams that handle Kubernetes incidents and need a clear troubleshooting and delivery workflow.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
