Scenario
You manage a microservices application running on Kubernetes. Users report increased response times and occasional timeouts. You need to quickly pinpoint the root cause.
Symptoms
- Overall application latency spikes: P99 jumps from 200ms to 2s.
- Some requests return HTTP 500 errors.
- Monitoring alerts fire, but existing metrics (CPU, memory) show no anomaly.
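To see why a tail percentile can jump while averages and resource metrics stay calm, here is a minimal Python sketch. The latency samples are hypothetical (they are not taken from the incident), and the nearest-rank percentile used here is only one common definition; Prometheus' histogram_quantile, for instance, interpolates within histogram buckets instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: 97 fast requests, 3 slow ones.
latencies_ms = [120] * 97 + [2000] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
# mean=176.4ms p50=120ms p99=2000ms
```

With only 3% of requests slow, the mean and median still look healthy while P99 reflects the slow path, which is why alerts on tail latency can fire before CPU or memory graphs show anything.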
Diagnosis
- Verify that the OpenTelemetry Collector is deployed and receiving traces/metrics from applications.
- In Grafana, check the pre-built "Service Latency" dashboard. Notice a database query latency spike in one microservice.
- Switch to trace view, find the corresponding trace, and identify a slow SQL query (execution time 1.8s).
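- Inspect Prometheus for database connection pool metrics; connections are exhausted. This is consistent with the slow query holding connections open while other requests wait for a free connection and time out.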
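The last diagnosis step can also be scripted against Prometheus' HTTP API (the instant-query endpoint /api/v1/query). The sketch below is an illustration under assumptions: the Prometheus address and the metric names db_pool_active_connections and db_pool_max_connections are hypothetical, so substitute whatever your database client library actually exports.

```python
import json
import urllib.parse
import urllib.request


def build_instant_query_url(base_url, promql):
    """Build a URL for Prometheus' instant-query endpoint."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(payload):
    """Map each series' labels (without __name__) to its current sample value."""
    if payload.get("status") != "success":
        raise RuntimeError("query failed: " + str(payload.get("error", "unknown error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    values = {}
    for series in data["result"]:
        # Drop the metric name so 'active' and 'max' series can be matched by labels.
        labels = tuple(sorted(
            (k, v) for k, v in series["metric"].items() if k != "__name__"
        ))
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        values[labels] = float(raw_value)
    return values


def saturated_pools(active, limit, threshold=0.9):
    """Return the label sets whose active/limit ratio is at or above threshold."""
    return sorted(
        labels for labels, used in active.items()
        if limit.get(labels, 0) > 0 and used / limit[labels] >= threshold
    )


def fetch_instant_vector(base_url, promql, timeout=10.0):
    """Run one instant query and return the parsed vector (performs a network call)."""
    url = build_instant_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


# Example usage (hypothetical address and metric names):
#   base = "http://prometheus.observability.svc:9090"
#   active = fetch_instant_vector(base, "db_pool_active_connections")
#   limit = fetch_instant_vector(base, "db_pool_max_connections")
#   print(saturated_pools(active, limit))
```

Comparing active connections against the configured maximum, rather than looking at the active count alone, makes exhaustion visible even when different services use different pool sizes.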
Commands
These commands assume a Kubernetes environment with Helm and kubectl configured.
1. Deploy OpenTelemetry Collector
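helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
# Recent chart versions require mode and image.repository to be set explicitly;
# the values below are assumptions, so adjust them to your environment.
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=deployment \
  --set image.repository=otel/opentelemetry-collector-contrib \
  --set config.exporters.prometheus.endpoint="0.0.0.0:8889" \
  --set 'config.service.pipelines.metrics.exporters={prometheus}'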
2. Configure Prometheus to scrape metrics
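This step assumes the Prometheus Operator is installed, since ServiceMonitor is one of its custom resources.
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: otel-collector
  endpoints:
    - port: metrics  # must match a named port on the Collector Service
      interval: 15s
EOF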
3. Import a Grafana dashboard
Use the Grafana UI or API to import a pre-built OpenTelemetry dashboard (ID: 15905).
Risk Controls
- Set an appropriate trace sampling rate (e.g., head-based sampling) so the Collector does not run out of memory under high load.
- Keep the Prometheus scrape interval reasonable (for example, the 15s used above rather than a few seconds) to limit load on the Collector and Prometheus.
- Back up existing Prometheus rules and Grafana dashboards before making changes.
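To illustrate what head-based sampling means in the first control above, the following Python sketch makes a keep/drop decision from the trace ID alone, at the start of the trace. It is a simplified illustration of ratio-based sampling, not the implementation used by the Collector or by any OpenTelemetry SDK; the function names and the 10% ratio are assumptions chosen for the example.

```python
import random

LOW_64_BITS = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID


def should_sample(trace_id, ratio):
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64.

    The decision depends only on the trace ID, so every service that sees the
    same trace reaches the same decision without coordination.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & LOW_64_BITS) < bound


def sampled_fraction(ratio, trace_count=100_000, seed=0):
    """Estimate the kept fraction over randomly generated 128-bit trace IDs."""
    rng = random.Random(seed)
    kept = sum(should_sample(rng.getrandbits(128), ratio) for _ in range(trace_count))
    return kept / trace_count


print(f"kept fraction at ratio 0.1: {sampled_fraction(0.1):.3f}")
```

Because the decision is made before any spans are buffered, head-based sampling reduces the volume the Collector has to hold in memory; the trade-off is that rare slow or failing traces can be dropped along with the rest.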
Rollback
If the changes cause issues:
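helm rollback otel-collector -n observability
# If this was the release's first install, there is no earlier revision to roll back to; remove it instead:
# helm uninstall otel-collector -n observability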
kubectl delete servicemonitor otel-collector -n observability
Restore Prometheus configuration and Grafana dashboards from backup.
Verification
- In Grafana, check that the new dashboard shows correct latency and error rates for all services.
- Use kubectl logs to verify the OpenTelemetry Collector has no errors.
- Simulate a slow request and confirm new traces are recorded correctly.
When to Submit an OpsGlobal Ticket
Submit a ticket if:
- Latency or error rates do not improve after integration.
- You need custom dashboards or alerting rules.
- Your cluster is large and requires performance tuning for the Collector.
- You encounter unexplained errors (e.g., Collector crashes, data loss).
Use cases
Useful for teams that handle observability issues and need a clear troubleshooting and delivery workflow.
Problem background
This post walks through a real-world scenario to build end-to-end observability with OpenTelemetry, Prometheus, and Grafana in Kubernetes, covering symptoms, diagnosis, commands, risk controls, rollback, verification, and when to submit an OpsGlobal ticket.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.