Kubernetes Node Disk Pressure Incident Response Guide

Mastering Kubernetes Incident Response: Handling Node Disk Pressure

Kubernetes | 6 min read | 2026-06-17

Tags: Kubernetes, SRE, Incident Response, Cluster Reliability

Scenario

You are on call and receive an alert that a node in your cluster has become NotReady. Pods running on that node become unhealthy and some are evicted, and colleagues report timeouts from applications scheduled there.

Symptoms

kubectl get nodes shows a node as NotReady
kubectl describe node <node> shows DiskPressure condition as True
Pods on the node are stuck in Terminating, evicted pods show a status of Evicted, and their replacements may sit in Pending
kubelet logs contain eviction manager: attempting to reclaim ephemeral-storage
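
To check every node for this condition in one pass, the sketch below reads each node's DiskPressure status through a kubectl JSONPath expression. The function names `filter_disk_pressure` and `list_disk_pressure_nodes` are hypothetical helpers introduced for illustration; they are not part of kubectl.

```shell
# Hypothetical helpers: list nodes that currently report DiskPressure=True.

# Reads "<node-name> <status>" lines on stdin and prints the matching node names.
filter_disk_pressure() {
  awk '$2 == "True" { print $1 }'
}

# Asks the API server for every node's DiskPressure status, then filters the result.
list_disk_pressure_nodes() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
    | filter_disk_pressure
}
```

Calling `list_disk_pressure_nodes` with a working kubeconfig prints one node name per line, or nothing when no node reports the condition.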

Diagnosis

Confirm node status: kubectl get nodes -o wide
Check node conditions: kubectl describe node <node> | grep -A8 Conditions
SSH to the node and check disk and inode usage: df -h, df -i and sudo du -sh /var/lib/kubelet
View kubelet logs: journalctl -u kubelet -n 100 --no-pager | grep -i pressure
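Check node events (for example NodeHasDiskPressure or EvictionThresholdMet): kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node>
Check pod eviction events: kubectl get events --all-namespaces --field-selector reason=Evicted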
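
Once on the node, it can help to compare usage against the eviction threshold rather than reading df output by eye. The sketch below assumes the kubelet's documented default hard eviction threshold on Linux of nodefs.available<10% (roughly 90% used); your cluster may override it. `df_use_pct` and `check_usage` are hypothetical helper names.

```shell
# Hypothetical helper: extract the usage percentage from POSIX-format df output on stdin.
df_use_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Warn when a filesystem reaches a usage limit.
# The default limit of 90 mirrors the assumed nodefs.available<10% threshold.
check_usage() {
  local path="$1" limit="${2:-90}" pct
  pct="$(df -P "$path" | df_use_pct)"
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $path is ${pct}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path is ${pct}% full"
}
```

For example, `check_usage /var/lib/kubelet` reports on the kubelet's filesystem, and `check_usage /var/lib/containerd 85` applies a stricter limit to an image store at that (assumed) path.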

Commands

# Cordon the node to prevent new pods
kubectl cordon <node>

# Drain pods from the node (ignore DaemonSets, delete emptyDir data)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# (SSH to node) Free up disk space
ssh user@<node-ip>
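sudo journalctl --vacuum-time=1d   # delete archived journal files older than 1 day
sudo docker system prune -a -f     # remove all unused images, stopped containers, unused networks and build cache (if using Docker)
# or, with containerd or CRI-O: sudo crictl rmi --prune
# show the size of /var/log, then delete *.log files last modified more than 7 days ago
sudo du -sh /var/log && sudo find /var/log -type f -name "*.log" -mtime +7 -delete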

# Verify freed space
df -h /var/lib/kubelet
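
Because find with -delete is irreversible, a dry run is a cheap safeguard. The sketch below uses the same match criteria as the command above but only prints the candidates; `list_old_logs` is a hypothetical helper name.

```shell
# Hypothetical helper: print *.log files last modified more than N days ago,
# without deleting anything. Defaults match the cleanup command: /var/log, 7 days.
list_old_logs() {
  local dir="${1:-/var/log}" days="${2:-7}"
  find "$dir" -type f -name '*.log' -mtime +"$days" -print
}
```

Run `sudo bash -c "$(declare -f list_old_logs); list_old_logs"` or simply paste the find line with sudo, review the output, and only then rerun the original command with -delete.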

Risk Controls

Ensure PodDisruptionBudgets (PDBs) protect critical workloads before draining.
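If pods on the node use local storage (hostPath or local PersistentVolumes), draining does not delete that data, but pods rescheduled to other nodes lose access to it, and it is lost for good if the node is later replaced; back it up first.
--delete-emptydir-data allows drain to evict pods that use emptyDir volumes, and their contents are deleted with the pod; confirm they hold no data you still need.
Cordon the node as soon as you start investigating so that no new pods land on it; kubectl drain also cordons the node automatically before it evicts anything.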
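
These checks can be run before any eviction happens. The sketch below lists PodDisruptionBudgets, shows the pods scheduled on the node, and performs a client-side dry run of the drain; `preflight_drain` is a hypothetical helper name, and the drain flags simply mirror the ones used in this guide.

```shell
# Hypothetical helper: show what a drain would touch, without evicting anything.
preflight_drain() {
  local node="$1"
  # PodDisruptionBudgets in all namespaces (check the ALLOWED DISRUPTIONS column).
  kubectl get pdb --all-namespaces
  # Pods currently scheduled on the node.
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=${node}"
  # Client-side dry run: prints the pods drain would evict and changes nothing.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --dry-run=client
}
```

A PDB that allows zero disruptions will block the real drain for the pods it covers, so it is worth resolving before you start.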

Rollback

If the node recovers after cleanup (the NotReady and DiskPressure conditions clear):
- Uncordon the node: kubectl uncordon <node>
- Verify that pods are scheduled on it again: kubectl get pods --all-namespaces -o wide | grep <node>
If disk pressure persists or the hardware has failed:
- Fully drain the node: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force (use with caution: --force also deletes pods that no controller will recreate)
- Decommission and replace the node.
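
To avoid uncordoning a node that is still under pressure, the first path can be guarded with a condition check. This is a minimal sketch; `uncordon_if_clear` is a hypothetical helper name.

```shell
# Hypothetical helper: uncordon a node only when DiskPressure reports "False".
uncordon_if_clear() {
  local node="$1" status
  status="$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}')"
  if [ "$status" = "False" ]; then
    kubectl uncordon "$node"
  else
    echo "DiskPressure is '${status}' on ${node}; leaving it cordoned" >&2
    return 1
  fi
}
```

The non-zero return value lets a wrapper script stop and escalate instead of putting the node back into rotation.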

Verification

Node status becomes Ready: kubectl get nodes
All workloads are running: kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
Alerts are resolved.
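
The first two checks can be scripted so they block until the node is actually Ready. The sketch below is one way to do it; `verify_recovery` is a hypothetical helper name and the 120-second timeout is an arbitrary assumption.

```shell
# Hypothetical helper: wait for the node to become Ready, then list unhealthy pods.
verify_recovery() {
  local node="$1"
  # Block until the node reports Ready, or fail after the timeout.
  kubectl wait --for=condition=Ready "node/${node}" --timeout=120s || return 1
  # List pods in any namespace whose phase is neither Running nor Succeeded.
  kubectl get pods --all-namespaces \
    --field-selector 'status.phase!=Running,status.phase!=Succeeded'
}
```

A pod in the Running phase can still have containers that are not ready, so treat an empty list as a necessary check rather than proof that every workload is healthy.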

When to Submit an OpsGlobal Ticket

The node cannot be recovered via cleanup (hardware failure/persistent issue).
Multiple nodes experience disk pressure simultaneously, requiring scaling or storage strategy changes.
You are unsure how the cleanup operations will affect production.
A permanent infrastructure change is needed (e.g., resizing node disks).

OpsGlobal's SRE team provides 24/7 support, automated remediation via Terraform, and designs self-healing cluster architectures.

Use cases

Useful for teams that handle Kubernetes incidents and need a clear troubleshooting and handover workflow.

Problem background

A practical guide to diagnosing and resolving a 'Node Not Ready' incident caused by disk pressure, with commands and risk controls.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.

