Scenario
You receive a Prometheus alert: HighMemoryUsage – Kubernetes node <NODE_NAME> has memory usage > 90%.
Symptoms
- The alert reports less than 10% of node memory remaining.
- The node memory panel on the Grafana dashboard spikes.
- Some Pods are OOMKilled or Evicted.
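The first symptom can be cross-checked on the node itself. The sketch below derives used memory from /proc/meminfo as 1 - MemAvailable/MemTotal. That formula is an assumption, since the alert's actual expression is not shown here, and the function name mem_used_pct is illustrative.

```shell
# Hypothetical helper: print used memory as a percentage, reading
# /proc/meminfo-formatted text on stdin.
# Assumes "used" means 1 - MemAvailable/MemTotal; your alert rule may differ.
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "MemTotal not found" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", (1 - avail / total) * 100
    }'
}

# Usage, from a shell on the node:
#   mem_used_pct < /proc/meminfo
```

A value above 90 is consistent with the alert threshold in the scenario, provided the alert uses the same definition of "used".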
Diagnosis
Step 1: Identify Node and Pods
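kubectl get node <NODE_NAME> -o wide
kubectl describe node <NODE_NAME> | grep -A5 "Allocated resources"
# List the Pods scheduled on the affected node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<NODE_NAME>
# Top memory consumers across the whole cluster (not limited to this node)
kubectl top pod --all-namespaces --sort-by=memory | head -20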
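Because kubectl top pod reports on the whole cluster, it helps to narrow its output to Pods that actually run on the affected node. The sketch below does that with a small awk join. The function name filter_top_by_node and the temporary file path are illustrative assumptions.

```shell
# Hypothetical helper: keep only the `kubectl top pod -A` rows whose
# namespace/pod pair appears in a list of Pods scheduled on the node.
#   $1    : file with one "<namespace> <pod>" pair per line
#   stdin : output of `kubectl top pod -A --no-headers`
filter_top_by_node() {
  awk 'NR == FNR { on_node[$1 "/" $2] = 1; next }
       { key = $1 "/" $2; if (key in on_node) print }' "$1" -
}

# Usage (replace <NODE_NAME> with the node from the alert):
#   kubectl get pods -A --no-headers \
#     --field-selector spec.nodeName=<NODE_NAME> \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
#     > /tmp/pods_on_node.txt
#   kubectl top pod -A --no-headers --sort-by=memory \
#     | filter_top_by_node /tmp/pods_on_node.txt | head -20
```

The sort order from kubectl top is preserved, so the first rows are the heaviest Pods on that node.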
Step 2: Leverage OpenTelemetry Traces
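If your application is instrumented with OpenTelemetry, search Grafana Tempo for traces from workloads on that node, for example by filtering on the k8s.node.name resource attribute if your collector attaches it. Spans rarely carry memory figures themselves, so look for unusually slow or high-volume requests around the time of the spike to pinpoint the service and API call.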
Step 3: Query Prometheus Metrics
In Grafana Explore, run:
sum(container_memory_working_set_bytes{node="<NODE_NAME>",container!=""}) by (pod)
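Identify the top memory-consuming pod; wrapping the query in topk(5, ...) returns only the largest consumers. The node label depends on how your scrape configuration relabels cAdvisor metrics, so adjust the selector if the query returns no series.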
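The same query can be run from a terminal through the Prometheus HTTP API. The sketch below only builds the PromQL string; the function name top_pods_query and the PROM_URL variable are assumptions, and the node label must match your scrape configuration.

```shell
# Hypothetical helper: print a PromQL query for the N Pods using the most
# working-set memory on a node.
#   $1 : node name
#   $2 : how many Pods to return (default 10)
top_pods_query() {
  printf 'topk(%s, sum by (pod) (container_memory_working_set_bytes{node="%s",container!=""}))\n' \
    "${2:-10}" "$1"
}

# Usage, assuming PROM_URL points at your Prometheus server:
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(top_pods_query <NODE_NAME> 5)"
```

Keeping the query in a function makes it easy to reuse the exact expression in Grafana Explore and in scripts.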
Commands & Actions
# Check pod resource limits
kubectl get pod <POD_NAME> -n <NAMESPACE> -o yaml | grep -A5 resources
# Exec into the container to check processes (no -it, because the output is piped; requires top in the image)
kubectl exec <POD_NAME> -n <NAMESPACE> -- top -b -n 1 | head -10
# If a memory leak is suspected, restart the workload with a rolling update
kubectl rollout restart deployment <DEPLOYMENT_NAME> -n <NAMESPACE>
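To judge how close a Pod is to its limit, the usage reported by kubectl top and the limit from the Pod spec can be compared once both are converted to bytes. The sketch below handles integer quantities with the common suffixes only; the function names to_bytes and mem_pct_of_limit are illustrative, and fractional values such as 1.5Gi are rejected.

```shell
# Hypothetical helper: convert an integer Kubernetes memory quantity to bytes.
# Supports no suffix, k/M/G (decimal) and Ki/Mi/Gi (binary) only.
to_bytes() {
  num=${1%%[!0-9]*}     # leading digits
  suffix=${1#"$num"}    # whatever follows them
  case $suffix in
    '') mult=1 ;;
    k)  mult=1000 ;;
    M)  mult=1000000 ;;
    G)  mult=1000000000 ;;
    Ki) mult=1024 ;;
    Mi) mult=1048576 ;;
    Gi) mult=1073741824 ;;
    *)  echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
  [ -n "$num" ] || { echo "unsupported quantity: $1" >&2; return 1; }
  echo $(( num * mult ))
}

# Print usage as an integer percentage of the limit (rounded down).
mem_pct_of_limit() {
  used=$(to_bytes "$1") || return 1
  limit=$(to_bytes "$2") || return 1
  [ "$limit" -gt 0 ] || { echo "limit must be positive" >&2; return 1; }
  echo $(( used * 100 / limit ))
}

# Usage: mem_pct_of_limit 900Mi 1Gi
```

A Pod that sits persistently near 100% of its limit is a likely OOMKill candidate, whether the cause is a leak or a limit that is simply too low.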
Risk Controls
- Reproduce in a non-production environment first.
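- Before draining a node, check PodDisruptionBudgets and plan for DaemonSet-managed Pods: kubectl drain does not evict them and refuses to proceed unless --ignore-daemonsets is set.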
- Capture memory snapshots before restarting Pods.
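As one way to act on the last control, the sketch below saves the state that kubectl can see for a Pod before it is restarted. It is not a heap dump, which is runtime-specific and outside this guide. The function name capture_pod_state, the output layout, and the KUBECTL override are assumptions.

```shell
# Hypothetical helper: save kubectl-visible state for one Pod before a restart.
#   $1 : Pod name   $2 : namespace   $3 : output directory (optional)
# KUBECTL can be overridden (for example KUBECTL=echo) to dry-run the calls.
capture_pod_state() {
  pod=$1; ns=$2; out=${3:-"./snapshot-$1"}
  mkdir -p "$out" || return 1
  # Per-container memory and CPU as reported by the metrics API.
  "${KUBECTL:-kubectl}" top pod "$pod" -n "$ns" --containers > "$out/top.txt"
  # Limits, restart counts, last termination reason and recent events.
  "${KUBECTL:-kubectl}" describe pod "$pod" -n "$ns" > "$out/describe.txt"
  # Recent log lines from every container in the Pod.
  "${KUBECTL:-kubectl}" logs "$pod" -n "$ns" --all-containers --tail=500 > "$out/logs.txt"
}

# Usage: capture_pod_state <POD_NAME> <NAMESPACE> ./snapshot-before-restart
```

The saved files give a baseline to compare against after the restart and can be attached to an escalation ticket.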
Rollback
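If memory usage keeps rising after the restart and a recent release is the suspected cause, roll the Deployment back. Note that kubectl rollout restart itself creates a new revision, so an undo immediately after a restart returns to the same image; check kubectl rollout history and add --to-revision=<N> when the release you want is further back: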
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
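A rollback is easier to trust when the revision history is reviewed first and the rollout is watched to completion. The sketch below chains those steps; the function name rollback_and_wait, the default timeout, and the KUBECTL override are assumptions.

```shell
# Hypothetical helper: show revision history, undo to the previous revision,
# then wait for the rollout to finish.
#   $1 : Deployment name   $2 : namespace   $3 : timeout (default 120s)
# KUBECTL can be overridden (for example KUBECTL=echo) to dry-run the calls.
rollback_and_wait() {
  deploy=$1; ns=$2; timeout=${3:-120s}
  "${KUBECTL:-kubectl}" rollout history "deployment/$deploy" -n "$ns" || return 1
  "${KUBECTL:-kubectl}" rollout undo "deployment/$deploy" -n "$ns" || return 1
  "${KUBECTL:-kubectl}" rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"
}

# Usage: rollback_and_wait <DEPLOYMENT_NAME> <NAMESPACE>
```

The function returns a non-zero status if the rollout does not complete within the timeout, which makes it usable in scripted remediation.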
Verification
- Grafana dashboard shows node memory dropping to normal range.
- Prometheus alert resolves automatically.
- No OOM events in logs.
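The last check can be partly automated by looking at each container's most recent termination reason. The sketch below filters a three-column listing for OOMKilled; the function name oomkilled_pods and the column layout are assumptions. The field reflects only the last termination, so an old OOMKilled entry can remain until the Pod is replaced.

```shell
# Hypothetical helper: read "<namespace> <pod> <reasons>" lines on stdin and
# print namespace/pod for every Pod whose reasons include OOMKilled.
oomkilled_pods() {
  awk '$3 ~ /OOMKilled/ { print $1 "/" $2 }'
}

# Usage (replace <NODE_NAME> with the node from the alert):
#   kubectl get pods -A --no-headers \
#     --field-selector spec.nodeName=<NODE_NAME> \
#     -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,LAST:.status.containerStatuses[*].lastState.terminated.reason' \
#     | oomkilled_pods
```

Empty output means no container on the node currently reports OOMKilled as its last termination reason.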
When to Submit an OpsGlobal Ticket
- Node memory remains abnormal despite restarts.
- Root cause involves third-party components requiring patches.
- The fix requires changes to cluster resource quotas or Pod limits.
OpsGlobal’s SRE team provides 24/7 support to quickly identify and fix complex infrastructure issues.
Use cases
Useful for teams that handle observability issues and need a clear workflow for troubleshooting and handing over results.
Problem background
A hands-on guide to using Prometheus, Grafana, and OpenTelemetry for triaging a high memory alert on a Kubernetes node, covering symptom identification, diagnosis, rollback, and when to escalate to OpsGlobal.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.