Prometheus Grafana OpenTelemetry Alert Triage Guide for SREs

Prometheus, Grafana & OpenTelemetry in Practice: SRE Alert Triage Workflow

Observability 6min 62 views 2026-06-12

Tags: Prometheus, Grafana, OpenTelemetry, SRE, Kubernetes, alert triage

Scenario

You receive a Prometheus alert: HighMemoryUsage – Kubernetes node <NODE_NAME> has memory usage > 90%.

Symptoms

The alert reports less than 10% of node memory remaining.
The node memory panel on the Grafana dashboard shows a spike.
Some Pods are OOMKilled or Evicted.

Diagnosis

Step 1: Identify Node and Pods

kubectl get nodes -o wide | grep <NODE_NAME>
kubectl describe node <NODE_NAME> | grep -A5 "Allocated resources"
kubectl top pod --all-namespaces --sort-by=memory | head -20
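
Note that kubectl top pod --all-namespaces ranks Pods across the whole cluster, not only those on the alerting node. The following is a minimal sketch of narrowing the ranking to one node. The function names filter_top_by_node_pods and top_pods_on_node are hypothetical, and the sketch assumes metrics-server is installed so that kubectl top returns data.

```shell
# Keep only the `kubectl top pod` rows whose namespace/pod pair is in a list.
# $1: file with "namespace pod" pairs, one per line.
# stdin: rows from `kubectl top pod --all-namespaces --no-headers`.
filter_top_by_node_pods() {
  awk 'NR == FNR { on_node[$1 "/" $2] = 1; next }
       ($1 "/" $2) in on_node' "$1" -
}

# Hypothetical wrapper: list Pods scheduled on the node, then filter the ranking.
top_pods_on_node() {
  node_name="$1"
  pairs_file="$(mktemp)"
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=${node_name}" \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name > "$pairs_file"
  kubectl top pod --all-namespaces --no-headers --sort-by=memory \
    | filter_top_by_node_pods "$pairs_file" | head -20
  rm -f "$pairs_file"
}
```

Call it as top_pods_on_node <NODE_NAME>. The filter is kept separate from the kubectl calls so it can be checked against saved output without cluster access.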

Step 2: Leverage OpenTelemetry Traces

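If your application is instrumented with OpenTelemetry, search Grafana Tempo for traces from services running on that node, for example by filtering on the k8s.node.name resource attribute if your collector sets it. Traces do not record memory usage directly, so look for unusually slow spans, failing spans, or requests that handle large payloads to pinpoint the service and API call behind the memory growth.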

Step 3: Query Prometheus Metrics

In Grafana Explore, run:

sum(container_memory_working_set_bytes{node="<NODE_NAME>",container!=""}) by (pod)

Identify the top memory-consuming pod.
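
To rank Pods outside Grafana, the same expression can be wrapped in topk and sent to the Prometheus HTTP API. This is a sketch under assumptions: the helper names are hypothetical, PROM_URL is a placeholder for your Prometheus address, and the node label is only present if your scrape configuration attaches it to cAdvisor metrics.

```shell
# Build a PromQL expression ranking Pods on one node by working-set memory.
# $1: node name, $2: how many Pods to return (defaults to 5).
build_top_pods_query() {
  node_name="$1"
  limit="${2:-5}"
  printf 'topk(%s, sum by (pod) (container_memory_working_set_bytes{node="%s",container!=""}))' \
    "$limit" "$node_name"
}

# Hypothetical wrapper around the instant-query endpoint (/api/v1/query).
# PROM_URL is an assumption; point it at your own Prometheus server.
query_top_pods() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$(build_top_pods_query "$1" "$2")"
}
```

For example, query_top_pods <NODE_NAME> 10 returns a JSON result with up to ten Pods. If the result is empty, check which labels the metric actually carries in your setup before assuming the node is idle.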

Commands & Actions

# Check pod resource limits
kubectl get pod <POD_NAME> -n <NAMESPACE> -o yaml | grep -A5 resources
# Exec into the container to check processes (requires top in the image; no TTY is needed when piping)
kubectl exec <POD_NAME> -n <NAMESPACE> -- top -b -n 1 | head -10
# If memory leak is suspected, safely restart via rolling update
kubectl rollout restart deployment <DEPLOYMENT_NAME> -n <NAMESPACE>

Risk Controls

Reproduce in a non-production environment first.
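Before draining a node, account for DaemonSet-managed Pods: kubectl drain refuses to proceed unless --ignore-daemonsets is set, and those Pods stay on the node.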
Capture memory snapshots before restarting Pods.
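
One way to capture a snapshot before a restart is to save the Pod's spec, status, current usage, and prior logs to a directory. This is only a sketch: capture_pod_snapshot is a hypothetical helper, it records Kubernetes-level state rather than a heap dump, and kubectl top needs metrics-server.

```shell
# Save Pod state to a directory before restarting it.
# $1: Pod name, $2: namespace, $3: output directory (created if missing).
capture_pod_snapshot() {
  pod_name="$1"
  namespace="$2"
  out_dir="$3"
  mkdir -p "$out_dir" || return 1
  kubectl get pod "$pod_name" -n "$namespace" -o yaml > "$out_dir/pod.yaml"
  kubectl describe pod "$pod_name" -n "$namespace" > "$out_dir/describe.txt"
  kubectl top pod "$pod_name" -n "$namespace" --containers > "$out_dir/top.txt"
  # Logs from a previous container instance exist only after a restart,
  # so a failure here is tolerated.
  kubectl logs "$pod_name" -n "$namespace" --all-containers --previous \
    > "$out_dir/previous.log" 2>/dev/null || true
}
```

For example: capture_pod_snapshot <POD_NAME> <NAMESPACE> ./snapshots/<POD_NAME>. A language-level heap or memory profile, if your runtime supports one, would have to be collected separately.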

Rollback

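If memory usage began climbing after a recent deployment, revert that change. Note that kubectl rollout restart itself creates a new revision, so a plain undo issued after the restart only returns to the pre-restart Pod template. Check the history and target the last known-good revision explicitly:

kubectl rollout history deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --to-revision=<REVISION>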
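
The rollback can also be wrapped in a small helper that rejects a malformed revision number and waits for the rollout to settle. This is a sketch, not a prescribed procedure: rollback_to_revision is a hypothetical name, and the 5 minute timeout is an arbitrary choice.

```shell
# Roll a Deployment back to an explicit revision and wait for it to settle.
# $1: Deployment name, $2: namespace, $3: revision from `kubectl rollout history`.
rollback_to_revision() {
  deploy_name="$1"
  namespace="$2"
  revision="$3"
  case "$revision" in
    ''|*[!0-9]*)
      echo "revision must be a number" >&2
      return 2
      ;;
  esac
  kubectl rollout undo "deployment/${deploy_name}" -n "$namespace" \
    --to-revision="$revision" || return 1
  # Blocks until the rollout completes or the timeout expires.
  kubectl rollout status "deployment/${deploy_name}" -n "$namespace" --timeout=5m
}
```

Naming the revision explicitly avoids relying on whatever happens to be the previous entry in the rollout history.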

Verification

Grafana dashboard shows node memory dropping to normal range.
Prometheus alert resolves automatically.
No OOM events in logs.
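
The OOM check can be made concrete by listing Pods whose containers were last terminated with reason OOMKilled. The helper names below are hypothetical, and the sketch assumes the container status is still present on the Pod object; Pods that were deleted and replaced no longer carry it.

```shell
# stdin: rows of "namespace pod reasons", where reasons is a comma-separated
# list of last termination reasons (or <none>).
# Prints namespace/pod for every row that includes OOMKilled.
filter_oomkilled() {
  awk '$3 ~ /OOMKilled/ { print $1 "/" $2 }'
}

# Hypothetical wrapper: pull the last termination reason of every container.
list_oomkilled_pods() {
  kubectl get pods --all-namespaces --no-headers \
    -o 'custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REASON:.status.containerStatuses[*].lastState.terminated.reason' \
    | filter_oomkilled
}
```

An empty result from list_oomkilled_pods after the fix is consistent with the OOM kills having stopped, but it is worth rechecking after the workload has run under normal load for a while.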

When to Submit an OpsGlobal Ticket

Node memory remains abnormal despite restarts.
Root cause involves third-party components requiring patches.
The fix requires changes to cluster resource quotas or Pod limits.

OpsGlobal’s SRE team provides 24/7 support to quickly identify and fix complex infrastructure issues.

Use cases

Useful for teams that handle observability issues and need a clear troubleshooting and delivery workflow.

Problem background

A hands-on guide to using Prometheus, Grafana, and OpenTelemetry for triaging a high memory alert on a Kubernetes node, covering symptom identification, diagnosis, rollback, and when to escalate to OpsGlobal.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.

Related service CTA

If you are facing a similar alert triage issue with Prometheus, Grafana, or OpenTelemetry, submit a ticket for remote OpsGlobal support.

