Building Observability with Prometheus, Grafana, and OpenTelemetry: A Practical Debugging Guide

Observability 6min 91 views 2026-06-14

Tags: Kubernetes, SRE, Prometheus, Grafana, OpenTelemetry, Observability

Scenario

Your team runs microservices on Kubernetes and has deployed OpenTelemetry Collector to capture traces, metrics, and logs. Prometheus scrapes metrics from the Collector, and Grafana provides dashboards. However, you notice increased response times for certain services, missing metrics in Grafana panels, and frequent errors in distributed traces.

Symptoms

Users report API response latency exceeding 5 seconds.
Grafana panels for "Request Latency" show empty or incomplete data.
Trace spans in Jaeger or Grafana Tempo show error status.
Prometheus targets report up as 0 (the scrape is failing) or fail intermittently.
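The target symptom can be confirmed from the Prometheus HTTP API rather than from a dashboard. The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (for example through a port-forward). PROM_URL and prom_query are names introduced here for illustration, not part of any tool.

```shell
# Base URL of the Prometheus HTTP API (assumption: a local port-forward).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query and print the raw JSON response.
# prom_query is a hypothetical helper name used only in this sketch.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage (needs a reachable Prometheus):
#   prom_query 'up == 0'              # targets whose last scrape failed
#   prom_query 'count by (job) (up)'  # number of targets per job
```

An empty result for up == 0 means every discovered target was scraped successfully; if the Collector job is missing from the second query altogether, the problem is more likely in service discovery or scrape_configs than in the Collector itself.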

Diagnosis

Check OpenTelemetry Collector Status: Use kubectl get pods -n observability to see if Collector pods are running. Check logs: kubectl logs -n observability <collector-pod> --tail=50 for any export errors or configuration issues.
Verify Prometheus Scrape Configuration: Examine scrape_configs in Prometheus configuration to ensure Collector endpoint is included. Run promtool check config /etc/prometheus/prometheus.yml if promtool is available.
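Test Collector Metrics Endpoint: Curl the Collector's /metrics endpoint to confirm Prometheus-format metrics are exposed. For example: kubectl exec -n observability <collector-pod> -- curl -s localhost:8888/metrics | head. The official Collector images are minimal and typically contain no shell or curl; if the exec fails for that reason, run kubectl port-forward -n observability <collector-pod> 8888:8888 and curl localhost:8888/metrics from your workstation instead. Port 8888 is the default for the Collector's own telemetry; metrics emitted through a prometheus exporter are served on whatever endpoint that exporter is configured with, so check both.
Check Grafana Datasource: In the Prometheus data source settings, confirm that the URL is reachable from the Grafana pod (typically the in-cluster Service address, not localhost) and click Save & test.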
Analyze Traces: In Grafana Explore, use the Tempo or Jaeger datasource to query recent traces, focusing on error spans and their parent spans.
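Once the Collector's /metrics output is available, its own counters usually show whether telemetry is being dropped before it reaches Prometheus or the trace backend. The sketch below filters that output for non-zero failure counters. The metric name prefixes are those used by the Collector's internal telemetry in common releases; exact names and suffixes vary by version, so treat the pattern as an assumption to adjust. collector_failures is a hypothetical helper name.

```shell
# Read Prometheus text-format metrics on stdin and print every Collector
# counter for failed, unqueued, or refused telemetry that is above zero.
# Assumes samples carry no trailing timestamp, so the last field is the value.
collector_failures() {
  awk '
    /^#/ { next }   # skip HELP and TYPE lines
    /^otelcol_(exporter_send_failed|exporter_enqueue_failed|receiver_refused)_/ {
      if ($NF + 0 > 0) print
    }
  '
}

# Usage (with a port-forward to the Collector telemetry port):
#   curl -s localhost:8888/metrics | collector_failures
```

No output means none of the matched counters has recorded a failure. A growing send_failed counter points at the exporter or its backend, while receiver_refused points at data being rejected on the way in.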

Commands (Examples)

# View Collector pod status
kubectl get pods -n observability -l app=otel-collector

# View Collector logs
LOG=$(kubectl logs -n observability --tail=100 -l app=otel-collector)
echo "$LOG" | grep -i error

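# Fetch metrics from the Collector. Official Collector images typically contain
# no shell or curl, so port-forward and curl from your workstation instead.
kubectl port-forward -n observability deploy/otel-collector 8888:8888 &
PF_PID=$!
sleep 2  # give the port-forward a moment to start listening
curl -s localhost:8888/metrics | grep -E "(otelcol|go_|process_)" | head -20
kill "$PF_PID"  # stop the background port-forward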

# Check Prometheus config with promtool (if installed)
promtool check config /etc/prometheus/prometheus.yml 2>&1 || true

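# Query Prometheus targets status
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring &
PF_PID=$!
sleep 2  # wait for the port-forward to start before querying
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
kill "$PF_PID"  # stop the background port-forward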
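The targets query above lists every target. When only the failures matter, the same API response also carries the scrape URL and the last scrape error, which is usually the fastest pointer to the cause. A minimal sketch, assuming jq is installed; unhealthy_targets is a hypothetical helper name.

```shell
# Read the JSON from /api/v1/targets on stdin and print one tab-separated
# line (job, scrape URL, last error) for each target that is not healthy.
unhealthy_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | [.labels.job, .scrapeUrl, .lastError]
         | @tsv'
}

# Usage (with the Prometheus port-forward running):
#   curl -s http://localhost:9090/api/v1/targets | unhealthy_targets
```

Errors such as "connection refused" suggest a wrong port or a Service that does not select the Collector pods, while timeouts more often indicate network policy or an overloaded target.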

Risk Controls

Replicate the issue in a non-production environment before making changes.
Back up the Collector's config.yaml before modifying it.
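The --web.enable-lifecycle flag lets you hot-reload Prometheus with an HTTP POST to /-/reload. It also enables /-/quit, so restrict network access to that port; in production, prefer rolling out configuration changes through a version-controlled ConfigMap.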
Never edit production Prometheus or Collector configurations directly; use version control (Git).
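The backup step above can be scripted so that every change starts from a timestamped copy of the live configuration. This is a sketch under stated assumptions: the Collector configuration lives in a ConfigMap, and the name otel-collector-config used in the usage line is hypothetical. backup_configmap is a helper name introduced here.

```shell
# Save the live ConfigMap to a timestamped YAML file and print the file name.
# Arguments: namespace, ConfigMap name (both placeholders to replace).
backup_configmap() {
  ns="$1"
  name="$2"
  out="backup-${name}-$(date +%Y%m%d-%H%M%S).yaml"
  kubectl get configmap "$name" -n "$ns" -o yaml > "$out" && echo "$out"
}

# Usage:
#   backup_configmap observability otel-collector-config
#   kubectl diff -f new-config.yaml   # preview a change before applying it
```

The exported YAML includes server-side metadata such as resourceVersion, and re-applying it unchanged may be rejected as a conflict; stripping that metadata, or restoring from the copy in Git, avoids the problem.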

Rollback

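If changes to the Collector config cause problems, restore the previous ConfigMap with kubectl apply -f backup-otel-collector-config.yaml, then restart the pods so they load it, for example kubectl rollout restart deployment/otel-collector -n observability (assuming the Collector runs as a Deployment with that name).
For Prometheus config, roll back the ConfigMap and trigger a reload: kubectl exec -n monitoring deploy/prometheus-server -- kill -HUP 1. kubectl exec needs a pod name or a resource reference such as deploy/prometheus-server, and the command assumes Prometheus is PID 1 in the container that exec selects (add -c <container> if the pod has sidecars). A mounted ConfigMap can take a short while to propagate to the pod, so confirm the file has changed before reloading.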
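In Grafana, use dashboard version history to revert a dashboard to a previous version, including its panel queries and the datasource each panel selects. Datasource settings themselves are not covered by dashboard history; restore those from provisioning files or Git.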
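The Collector rollback can be wrapped in one function so that the restart only happens when the apply succeeds and the command blocks until the new pods are ready. This is a sketch, assuming the Collector runs as a Deployment named otel-collector in the observability namespace (a DaemonSet would need daemonset/ in place of deployment/). rollback_collector is a hypothetical helper name.

```shell
# Re-apply a saved manifest, restart the Deployment, and wait for the rollout.
# Arguments: manifest file, namespace (optional), Deployment name (optional).
rollback_collector() {
  manifest="$1"
  ns="${2:-observability}"
  deploy="${3:-otel-collector}"
  kubectl apply -f "$manifest" -n "$ns" &&
    kubectl rollout restart "deployment/${deploy}" -n "$ns" &&
    kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=120s
}

# Usage:
#   rollback_collector backup-otel-collector-config.yaml
```

If Prometheus was started with --web.enable-lifecycle, an HTTP POST to /-/reload on the Prometheus port is an alternative to sending SIGHUP.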

Verification

Ensure all pods are running and ready: kubectl get pods -n observability.
Check Prometheus targets are all UP.
Refresh Grafana dashboards; metrics should display correctly.
Perform an end-to-end trace test: use a sample client to send a request and view the complete trace in Grafana Tempo.
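When no sample client is at hand, the end-to-end trace test can be approximated by posting a single hand-built span to the Collector. The sketch below assumes the Collector has an OTLP/HTTP receiver enabled on the default port 4318 and that it is reachable locally (for example through a port-forward). The service name smoke-test, the span name, and both function names are made up for illustration.

```shell
# Print an OTLP/JSON payload containing one server span that lasted a second.
build_test_span() {
  service="${1:-smoke-test}"
  # Random IDs, hex-encoded: 16 bytes for the trace ID, 8 for the span ID.
  trace_id=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
  span_id=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
  now=$(date +%s)
  start="$((now - 1))000000000"   # nanoseconds since the Unix epoch
  end="${now}000000000"
  cat <<EOF
{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"${service}"}}]},"scopeSpans":[{"scope":{"name":"manual-smoke-test"},"spans":[{"traceId":"${trace_id}","spanId":"${span_id}","name":"smoke-test-span","kind":2,"startTimeUnixNano":"${start}","endTimeUnixNano":"${end}"}]}]}]}
EOF
}

# Post the span to an OTLP/HTTP receiver.
# Arguments: endpoint (optional), service name (optional).
send_test_span() {
  endpoint="${1:-http://localhost:4318}"
  build_test_span "$2" | curl -s -X POST "${endpoint}/v1/traces" \
    -H 'Content-Type: application/json' --data-binary @-
}

# Usage:
#   send_test_span http://localhost:4318 smoke-test
```

After sending, search for the service name in Grafana Explore with the Tempo datasource. If the span never appears, the Collector's export counters and logs are the next place to look.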

When to Submit an OpsGlobal Ticket

Submit a ticket if:
- Basic configuration checks do not resolve the issue, and you suspect compatibility between Collector and Prometheus.
- The OpenTelemetry Collector keeps restarting or shows memory leaks.
- Grafana datasource connects successfully but no data appears, while Prometheus has data.
- You need to configure advanced features (e.g., sampling policies, multi-backend exporters) and lack experience.

When submitting, include:
- Relevant pod logs and describe output.
- Configuration files for Prometheus and Collector (without secrets).
- JSON model of the Grafana dashboard.
- Specific steps to reproduce the issue.
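The pod logs and describe output can be gathered in one pass. The following is a rough sketch, assuming the Collector pods carry the label app=otel-collector in the observability namespace, as in the commands earlier in this guide. collect_diagnostics and the diag- directory prefix are names invented for this example.

```shell
# Collect pod status, describe output, and logs into a compressed archive
# and print the archive name.
# Arguments: namespace (optional), label selector (optional).
collect_diagnostics() {
  ns="${1:-observability}"
  selector="${2:-app=otel-collector}"
  dir="diag-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dir" || return 1
  kubectl get pods -n "$ns" -l "$selector" -o wide > "$dir/pods.txt" 2>&1
  kubectl describe pods -n "$ns" -l "$selector" > "$dir/describe.txt" 2>&1
  kubectl logs -n "$ns" -l "$selector" --tail=500 --all-containers --prefix \
    > "$dir/logs.txt" 2>&1
  # Logs from the previous container instance help with restart loops;
  # this reports an error in the file when no container has restarted.
  kubectl logs -n "$ns" -l "$selector" --tail=500 --all-containers --prefix \
    --previous > "$dir/logs-previous.txt" 2>&1
  tar -czf "$dir.tar.gz" "$dir" && echo "$dir.tar.gz"
}

# Usage:
#   collect_diagnostics observability app=otel-collector
```

Review the archive before attaching it: describe output and logs can contain environment variable values, tokens, or internal hostnames that should be redacted first.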

Use cases

Useful for teams that handle observability issues and need a clear troubleshooting and delivery workflow.

Problem background

In Kubernetes microservices environments, integrating Prometheus, Grafana, and OpenTelemetry enhances observability. This article walks through a real-world scenario, diagnosing missing data and latency issues, with commands, risk controls, rollback steps, and guidance on when to engage OpsGlobal.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
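As one illustration of keeping credentials out of commands, the sketch below lists Grafana datasources through the Grafana HTTP API, reading the token from the environment instead of hard-coding it. GRAFANA_URL, GRAFANA_TOKEN, and grafana_datasources are names chosen for this example; the token is assumed to be a Grafana service account token with permission to read datasources.

```shell
# List Grafana datasources using a token taken from the environment.
# Refuses to run if either variable is unset, so nothing is sent by accident.
grafana_datasources() {
  if [ -z "${GRAFANA_URL:-}" ] || [ -z "${GRAFANA_TOKEN:-}" ]; then
    echo "set GRAFANA_URL and GRAFANA_TOKEN first" >&2
    return 1
  fi
  curl -s -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
    "${GRAFANA_URL}/api/datasources"
}

# Usage:
#   export GRAFANA_URL=http://localhost:3000
#   export GRAFANA_TOKEN=...   # paste from a secret store, not into a script
#   grafana_datasources
```

Because the token never appears in the command text, it is less likely to end up in scripts, tickets, or pasted terminal output.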

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
