Kubernetes Incident Response and Cluster Reliability: A Practical Guide

This article walks through a real-world Kubernetes incident where a node faces disk pressure, covering symptom identification, diagnosis, commands, risk controls, rollback, verification, and criteria for OpsGlobal ticket submission.

Kubernetes | 6 min | 2026-06-13
Tags: Kubernetes, incident response, disk pressure, pod eviction, cluster reliability

Scenario

A worker node runs low on disk space and reports disk pressure. The kubelet evicts pods from the node, and their replacements stay in the Pending state because the scheduler cannot place them.

Symptoms

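  • kubectl describe node <node-name> shows the DiskPressure condition set to True (the STATUS column of kubectl get nodes reports only readiness and scheduling state)
  • Evicted pods show status Evicted and their replacements stay Pending; kubectl describe pod on a Pending pod reveals 0/1 nodes are available: 1 node had taint {node.kubernetes.io/disk-pressure: }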
  • kubectl get events -n <namespace> shows Evicted events
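
As a rough sketch of the first symptom check, the shell function below lists every node whose DiskPressure condition is True. The function name nodes_with_disk_pressure and the KUBECTL override variable are illustrative assumptions rather than anything kubectl provides; the jsonpath query assumes the standard node status layout.

```shell
# KUBECTL is a hypothetical override hook; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Print the name of every node whose DiskPressure condition is True.
nodes_with_disk_pressure() {
  # Emit "<node name><TAB><DiskPressure status>" per node, then keep the True rows.
  "$KUBECTL" get nodes -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
    | awk -F'\t' '$2 == "True" { print $1 }'
}
```

Calling nodes_with_disk_pressure with no arguments prints one node name per line, and prints nothing when no node is under disk pressure.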

Diagnosis

  1. Check node conditions:
       kubectl describe node <node-name> | grep -A5 Conditions
     Look for DiskPressure set to True.
  2. SSH into the node and inspect disk usage:
       ssh <node-ip> df -h
     Identify partitions near 100% (commonly /var/lib/docker or /var/lib/kubelet).
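  3. Optional: Run kubectl top nodes to check CPU and memory usage. It requires metrics-server and does not report disk usage, so it only helps rule out other resource pressure.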
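
The disk inspection in step 2 can be narrowed down with a small filter. This is a minimal sketch, assuming POSIX-format df -P output and mount points without spaces; the function name full_filesystems and the 85 percent default are illustrative choices, not kubelet settings.

```shell
# Print "<mount point> <usage>" for filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; the first line (the header) is skipped.
# The default of 85 is an assumption for illustration only; the kubelet's real
# eviction thresholds come from its own configuration.
full_filesystems() {
  threshold="${1:-85}"
  awk -v t="$threshold" '
    NR > 1 {
      use = $5            # capacity column, e.g. "95%"
      sub(/%/, "", use)   # strip the percent sign
      if (use + 0 >= t) print $6, $5
    }'
}
```

On the node, df -P | full_filesystems 90 would print only the mounts that are at least 90% full.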

Risk Controls

  1. Stop scheduling: Cordon the node immediately to prevent new pods from being scheduled on it.
       kubectl cordon <node-name>
  2. Drain workloads: If recovery is not immediate, safely evict all non-DaemonSet pods:
       kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
     Safety note: --delete-emptydir-data removes emptyDir data; ensure it is backed up or disposable.
  3. Preserve system pods: DaemonSet pods (e.g., kube-proxy) remain on the node, so its basic functions stay intact.
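
One way to lower the risk of the drain step is to preview it first. The sketch below cordons the node and then runs the drain as a client-side dry run unless told otherwise. The function name isolate_node, the "apply" argument, the KUBECTL override, and the 300-second timeout are all assumptions made for illustration.

```shell
# KUBECTL is a hypothetical override hook; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Cordon a node, then preview or perform a drain.
# Usage: isolate_node <node-name> [apply]
# Without "apply" the drain is a client-side dry run and evicts nothing.
isolate_node() {
  node="$1"
  mode="${2:-preview}"
  "$KUBECTL" cordon "$node" || return 1
  if [ "$mode" = "apply" ]; then
    # The timeout value is an illustrative choice, not a recommendation.
    "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  else
    "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data --dry-run=client
  fi
}
```

Running isolate_node <node-name> shows which pods a drain would evict; isolate_node <node-name> apply performs the eviction.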

Recovery Actions

  1. Free disk space (requires node access):
       - Remove unused container images: docker image prune -a (on nodes that run containerd rather than Docker, crictl rmi --prune)
       - Truncate container logs: truncate -s 0 /var/lib/docker/containers/*/*-json.log
       - Shrink the systemd journal, which includes kubelet logs: journalctl --vacuum-size=500M
  2. Verify pressure relief:
       kubectl describe node <node-name> | grep -A5 Conditions
     Confirm DiskPressure is False.
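
The condition may not flip to False the moment space is freed, because the kubelet holds a pressure condition for a transition period (five minutes by default) before clearing it. A polling loop along these lines can replace repeated manual checks. It is a sketch only: the function name wait_disk_pressure_clear, the KUBECTL override, and the default attempt count and interval are assumptions.

```shell
# KUBECTL is a hypothetical override hook; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Poll a node until its DiskPressure condition reports False.
# Usage: wait_disk_pressure_clear <node-name> [attempts] [interval-seconds]
# The defaults (30 attempts, 10 seconds apart) are illustrative, not recommendations.
wait_disk_pressure_clear() {
  node="$1"
  attempts="${2:-30}"
  interval="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$("$KUBECTL" get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}')"
    if [ "$status" = "False" ]; then
      echo "DiskPressure cleared on $node"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "DiskPressure is still $status on $node" >&2
  return 1
}
```

The function returns 0 once the condition is False and 1 if it never clears within the allowed attempts, so it can gate the uncordon step in a script.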

Rollback

  1. Re-enable scheduling:
       kubectl uncordon <node-name>
  2. Restore evicted pods: Usually automatic via controllers; if not, manually scale:
       kubectl scale deployment <deployment-name> --replicas=<original-count>
  3. Verify pod status:
       kubectl get pods --all-namespaces -o wide | grep <node-name>
     Ensure pods are Running and not re-evicted.
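
Evicted pods are not removed automatically when their replacements start; they remain as Failed records and clutter later status checks. The sketch below prints a delete command for each one so the list can be reviewed before anything is removed. The function name evicted_pod_cleanup_commands and the KUBECTL override are hypothetical.

```shell
# KUBECTL is a hypothetical override hook; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Print a delete command for every Failed pod whose status.reason is Evicted.
# Commands are printed, not executed, so they can be reviewed first.
evicted_pod_cleanup_commands() {
  "$KUBECTL" get pods --all-namespaces --field-selector=status.phase=Failed -o \
    jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.reason}{"\n"}{end}' \
    | awk -F'\t' '$3 == "Evicted" { printf "kubectl delete pod -n %s %s\n", $1, $2 }'
}
```

After reviewing the output, piping it to sh runs the deletions: evicted_pod_cleanup_commands | sh.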

Verification

  1. Node health: Re-check node conditions. DiskPressure should be False and Ready should be True.
  2. Pod status: All pods should be Running or Completed.
  3. Cluster events: kubectl get events --all-namespaces shows no abnormal evictions or errors.
  4. Business metrics: Confirm application latency, error rates, etc., have returned to baseline.
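
The pod status check in step 2 can be approximated with a filter over kubectl's table output. This is a minimal sketch that assumes the default column order of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, ...); the function name unhealthy_pods is an illustrative choice.

```shell
# Print "<namespace>/<pod> <status>" for pods that are neither Running nor Completed.
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin,
# where the fourth column is STATUS.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

For example, kubectl get pods --all-namespaces --no-headers | unhealthy_pods prints nothing when every pod is in one of the two expected states.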

When to Submit an OpsGlobal Ticket

  • Recovery fails: Node disk pressure persists despite the steps above.
  • Multiple nodes affected: Two or more worker nodes show similar pressure, possibly indicating a storage issue.
  • Data loss concern: Critical data was lost due to eviction (e.g., a persistent volume needs recovery).
  • Cluster-level impact: Control plane nodes are also under pressure, or the API server is slow to respond.

When creating the ticket, include: node name, event timestamps, output of kubectl describe node, and disk usage screenshots.
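
Gathering that evidence can be scripted so nothing is forgotten under pressure. The sketch below writes the node description, the Evicted events, and a collection timestamp into one directory. The function name collect_ticket_evidence, the KUBECTL override, and the file names are assumptions, not an OpsGlobal requirement.

```shell
# KUBECTL is a hypothetical override hook; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Collect node diagnostics into a directory for attaching to a ticket.
# Usage: collect_ticket_evidence <node-name> [output-dir]
# The directory layout and file names are illustrative assumptions.
collect_ticket_evidence() {
  node="$1"
  out_dir="${2:-./diskpressure-$node}"
  mkdir -p "$out_dir" || return 1
  # Record when the evidence was gathered, in UTC.
  date -u +"%Y-%m-%dT%H:%M:%SZ" > "$out_dir/collected-at.txt"
  "$KUBECTL" describe node "$node" > "$out_dir/describe-node.txt"
  "$KUBECTL" get events --all-namespaces --field-selector reason=Evicted \
    > "$out_dir/evicted-events.txt"
  # Print the directory so callers can archive or attach it.
  echo "$out_dir"
}
```

Disk usage has to be captured from the node itself, for example with ssh <node-ip> df -h redirected into the same directory.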

Use cases

Useful for teams that handle Kubernetes issues and need a clear troubleshooting and delivery workflow.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
