Prometheus, Grafana, and OpenTelemetry: A Practical Guide to Production Observability

Observability 6min 18 views 2026-06-18

Kubernetes, SRE

Scenario

You manage a microservices application running on Kubernetes. Users report increased response times and occasional timeouts. You need to quickly pinpoint the root cause.

Symptoms

Overall application latency spikes: P99 jumps from 200ms to 2s.
Some requests return HTTP 500 errors.
Monitoring alerts fire, but existing metrics (CPU, memory) show no anomaly.
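To put numbers on these symptoms, the P99 latency and the 5xx error ratio can be queried from Prometheus directly. The sketch below is illustrative only: the metric name http_server_request_duration_seconds_bucket, the http_response_status_code label, the job grouping label, and the PROM_URL default are all assumptions based on OpenTelemetry HTTP semantic conventions and may differ in your setup.

```shell
# Assumption: Prometheus is reachable at PROM_URL (for example via kubectl port-forward).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Assumption: services export the OpenTelemetry HTTP server duration histogram
# under this Prometheus name; adjust metric and label names to what you see.
P99_QUERY='histogram_quantile(0.99, sum by (le, job) (rate(http_server_request_duration_seconds_bucket[5m])))'

# Share of requests that ended with a 5xx status over the last 5 minutes.
ERROR_RATIO_QUERY='sum by (job) (rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m])) / sum by (job) (rate(http_server_request_duration_seconds_count[5m]))'

# Run an instant query against the Prometheus HTTP API and print the JSON response.
prom_query() {
  curl -fsS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

Calling prom_query "$P99_QUERY" or prom_query "$ERROR_RATIO_QUERY" returns the raw JSON result, which is enough to confirm which service crossed the 2s mark.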

Diagnosis

Verify that the OpenTelemetry Collector is deployed and receiving traces/metrics from applications.
In Grafana, check the pre-built "Service Latency" dashboard. Notice a database query latency spike in one microservice.
Switch to trace view, find the corresponding trace, and identify a slow SQL query (execution time 1.8s).
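Query Prometheus for that microservice's database connection pool metrics; the pool is exhausted. The slow query holds connections longer than usual, so other requests wait for a free connection and eventually time out or fail with HTTP 500, which is why CPU and memory look normal.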
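The first and last checks above can be sketched as two small helpers. This is a minimal example under stated assumptions: the pool metrics db_client_connections_usage and db_client_connections_max follow OpenTelemetry database semantic conventions and may be named differently (or be absent) depending on your instrumentation, and PROM_URL is a hypothetical local endpoint.

```shell
NAMESPACE="${NAMESPACE:-observability}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Confirm the Collector pods exist and show their most recent log lines.
collector_status() {
  kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/instance=otel-collector
  kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/instance=otel-collector --tail=50
}

# Assumption: pool metrics use these names; a ratio close to 1 means the pool is saturated.
POOL_SATURATION_QUERY='sum by (job) (db_client_connections_usage{state="used"}) / sum by (job) (db_client_connections_max)'

# Ask Prometheus how full each service's connection pool is.
pool_saturation() {
  curl -fsS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=${POOL_SATURATION_QUERY}"
}
```

Run collector_status first; if the Collector is not receiving data, the pool query will return an empty result rather than a misleading zero.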

Commands

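These commands assume a Kubernetes cluster with Helm and kubectl configured, plus the Prometheus Operator installed, because the ServiceMonitor in step 2 is one of its custom resources.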

1. Deploy OpenTelemetry Collector

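helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
# The chart requires an explicit mode; recent chart versions may also require image.repository.
# Helm --set expects list values in curly braces, not square brackets.
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  --set mode=deployment \
  --set config.exporters.prometheus.endpoint="0.0.0.0:8889" \
  --set "config.service.pipelines.metrics.exporters={prometheus}" \
  --namespace observability --create-namespace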
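The same settings can be kept in a values file, which is easier to review and version than a chain of --set flags. The sketch below mirrors the flags from step 1 and adds a deployment mode and an exposed exporter port. The file name otel-collector-values.yaml and the port name prom-exporter are hypothetical, and the example assumes the chart's ports map accepts additional named entries; check values.yaml for your chart version before relying on it.

```shell
# Write a Helm values file that mirrors the step 1 flags.
cat > otel-collector-values.yaml <<'EOF'
mode: deployment
config:
  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
  service:
    pipelines:
      metrics:
        exporters: [prometheus]
# Assumption: expose the exporter port on the Service so Prometheus can reach it.
ports:
  prom-exporter:
    enabled: true
    containerPort: 8889
    servicePort: 8889
    protocol: TCP
EOF

# Review the file, then apply it when ready:
#   helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
#     -f otel-collector-values.yaml --namespace observability --create-namespace
```

If you expose the exporter under a different port name, the ServiceMonitor's port field has to use that same name.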

2. Configure Prometheus to scrape metrics

cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: otel-collector
  endpoints:
  - port: metrics
    interval: 15s
EOF
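A ServiceMonitor only produces a scrape target when its port matches a named port on the selected Service. The helpers below are a minimal sketch for checking that, assuming the namespace and label used in this guide. Depending on chart values, a port named metrics may expose the Collector's own internal telemetry rather than the exporter on 8889, so it is worth confirming which port you are scraping. Some Prometheus Operator installations also select ServiceMonitors by label, which this sketch does not check.

```shell
NAMESPACE="${NAMESPACE:-observability}"
SELECTOR="${SELECTOR:-app.kubernetes.io/instance=otel-collector}"

# Print "name<TAB>port" for every port on the Collector Service(s).
list_collector_ports() {
  kubectl get svc -n "$NAMESPACE" -l "$SELECTOR" \
    -o jsonpath='{range .items[*].spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}'
}

# Succeed only if a Service port with the given name exists.
has_port() {
  list_collector_ports | cut -f1 | grep -qx "$1"
}
```

For example, has_port metrics || echo "no Service port named metrics" flags a mismatch before you spend time looking for a missing target in Prometheus.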

3. Import a Grafana dashboard

Use the Grafana UI or API to import a pre-built OpenTelemetry dashboard (ID: 15905).
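For the API route, one option is to download the dashboard JSON from its grafana.com page and post it to Grafana's dashboard endpoint. The helper below is a sketch, not a definitive procedure: GRAFANA_URL, GRAFANA_TOKEN, and the function name import_dashboard are hypothetical, and it assumes jq and curl are installed. Dashboards exported from grafana.com may contain data source placeholders that the UI import resolves but this endpoint does not, so you may need to edit the JSON first.

```shell
# Assumption: Grafana is reachable here and GRAFANA_TOKEN holds a token
# with permission to create dashboards.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# Usage: import_dashboard path/to/dashboard.json
# Wraps the dashboard JSON in the payload shape used by POST /api/dashboards/db.
import_dashboard() {
  : "${GRAFANA_TOKEN:?set GRAFANA_TOKEN before importing}"
  jq '{dashboard: (. + {id: null}), overwrite: true}' "$1" |
    curl -fsS -X POST "${GRAFANA_URL}/api/dashboards/db" \
      -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
      -H "Content-Type: application/json" \
      --data-binary @-
}
```

Setting id to null lets Grafana assign a new internal ID, and overwrite: true replaces an existing dashboard with the same uid, so take a backup first if that matters.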

Risk Controls

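Set an appropriate trace sampling rate (for example, head-based sampling) to limit the data volume reaching the Collector, and enable the Collector's memory_limiter processor so that it sheds load instead of running out of memory under high traffic.
Keep the Prometheus scrape interval reasonable (the 15s used above is a common choice); much shorter intervals increase load on both Prometheus and the Collector.
Back up existing Prometheus rules and Grafana dashboards before making changes.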
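These controls can be sketched as an extra Helm values file plus a backup helper. The values are illustrative assumptions, not tuned recommendations: the file name otel-collector-limits.yaml is hypothetical, the 10% sampling rate is arbitrary, and the probabilistic_sampler processor is only available in Collector distributions that bundle it (such as contrib). The example also assumes a traces pipeline with a batch processor is already defined, and that rules are stored as Prometheus Operator PrometheusRule objects.

```shell
# Sketch of extra Helm values: bound Collector memory and sample traces.
cat > otel-collector-limits.yaml <<'EOF'
config:
  processors:
    memory_limiter:
      check_interval: 1s
      limit_percentage: 80
      spike_limit_percentage: 25
    probabilistic_sampler:
      sampling_percentage: 10   # assumption: keep roughly 10% of traces
  service:
    pipelines:
      traces:
        # memory_limiter goes first so it can refuse data before other work is done
        processors: [memory_limiter, probabilistic_sampler, batch]
EOF

# Save all PrometheusRule objects to a dated file before changing anything.
backup_prometheus_rules() {
  kubectl get prometheusrules --all-namespaces -o yaml \
    > "prometheusrules-backup-$(date +%Y%m%d).yaml"
}
```

The limits file can be passed to helm upgrade with an additional -f flag once it has been reviewed.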

Rollback

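If the changes cause issues, roll the Helm release back to its previous revision and remove the ServiceMonitor (on a first-time install there is no previous revision, so run helm uninstall otel-collector -n observability instead of the rollback):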

helm rollback otel-collector -n observability
kubectl delete servicemonitor otel-collector -n observability

Restore Prometheus configuration and Grafana dashboards from backup.
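Whether to roll back or uninstall depends on how many revisions the release has. The helper below is a rough sketch of that decision; the function name rollback_or_uninstall is hypothetical, and it counts revisions by matching the "revision" key in helm history JSON output, which is a simplification.

```shell
RELEASE="${RELEASE:-otel-collector}"
NAMESPACE="${NAMESPACE:-observability}"

# Roll back to the previous revision if one exists; otherwise remove the release.
rollback_or_uninstall() {
  revisions=$(helm history "$RELEASE" -n "$NAMESPACE" -o json |
    grep -o '"revision"' | wc -l | tr -d ' ')
  if [ "$revisions" -gt 1 ]; then
    helm rollback "$RELEASE" -n "$NAMESPACE" --wait
  else
    helm uninstall "$RELEASE" -n "$NAMESPACE"
  fi
}
```

Running helm history otel-collector -n observability by hand first shows which revision you would be returning to.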

Verification

In Grafana, check that the new dashboard shows correct latency and error rates for all services.
Use kubectl logs to verify the OpenTelemetry Collector has no errors.
Simulate a slow request and confirm new traces are recorded correctly.
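The log check can be scripted so it is repeatable after each change. This is a minimal sketch assuming the namespace and label used in this guide; matching on the word "error" is a crude heuristic and can count benign lines, so treat a non-zero result as a prompt to read the logs rather than as proof of failure.

```shell
NAMESPACE="${NAMESPACE:-observability}"
SELECTOR="${SELECTOR:-app.kubernetes.io/instance=otel-collector}"

# Count Collector log lines from the last 10 minutes that mention "error".
# --tail=-1 is needed because kubectl limits output when a label selector is used.
collector_error_count() {
  kubectl logs -n "$NAMESPACE" -l "$SELECTOR" --since=10m --tail=-1 |
    grep -ci 'error' || true
}

# Block until the Collector pods report Ready, or fail after the timeout.
wait_for_collector() {
  kubectl wait --for=condition=Ready pod -n "$NAMESPACE" -l "$SELECTOR" --timeout=120s
}
```

A typical sequence is wait_for_collector followed by collector_error_count, then sending the slow test request and looking for its trace in Grafana.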

When to Submit an OpsGlobal Ticket

Submit a ticket if:
- Latency or error rates do not improve after integration.
- You need custom dashboards or alerting rules.
- Your cluster is large and requires performance tuning for the Collector.
- You encounter unexplained errors (e.g., Collector crashes, data loss).

Use cases

Useful for teams that handle observability issues and need a clear troubleshooting and delivery workflow.

Problem background

This post walks through a real-world scenario to build end-to-end observability with OpenTelemetry, Prometheus, and Grafana in Kubernetes, covering symptoms, diagnosis, commands, risk controls, rollback, verification, and when to submit an OpsGlobal ticket.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
