Scenario
Your team runs microservices on Kubernetes and has deployed the OpenTelemetry Collector to capture traces, metrics, and logs. Prometheus scrapes metrics from the Collector, and Grafana provides dashboards. However, you notice increased response times for certain services, missing metrics in Grafana panels, and frequent errors in distributed traces.
Symptoms
- Users report API response latency exceeding 5 seconds.
- Grafana panels for "Request Latency" show empty or incomplete data.
- Trace spans in Jaeger or Grafana Tempo show error status.
- Prometheus targets report the up metric as 0 or show partial scrape failures.
Diagnosis
- Check OpenTelemetry Collector status: run kubectl get pods -n observability to see whether the Collector pods are running. Then check the logs for export errors or configuration issues: kubectl logs -n observability <collector-pod> --tail=50
- Verify the Prometheus scrape configuration: examine scrape_configs in the Prometheus configuration to ensure the Collector endpoint is included. If promtool is available, run promtool check config /etc/prometheus/prometheus.yml
- Test the Collector metrics endpoint: curl the Collector's /metrics endpoint to confirm that Prometheus-format metrics are exposed, for example: kubectl exec -n observability <collector-pod> -- curl localhost:8888/metrics | head (this requires curl inside the Collector image; minimal images may not include it, in which case kubectl port-forward to port 8888 is an alternative).
- Check the Grafana datasource: ensure the Prometheus datasource is configured correctly and test the connection.
- Analyze Traces: In Grafana Explore, use the Tempo or Jaeger datasource to query recent traces, focusing on error spans and their parent spans.
Commands (Examples)
# View Collector pod status
kubectl get pods -n observability -l app=otel-collector
# View Collector logs and filter for errors
LOG=$(kubectl logs -n observability --tail=100 -l app=otel-collector)
echo "$LOG" | grep -i error
# Curl metrics from the Collector (requires curl inside the Collector image)
METRICS=$(kubectl exec -n observability deploy/otel-collector -- curl -s localhost:8888/metrics)
echo "$METRICS" | grep -E "(otelcol|go_|process_)" | head -20
# Check Prometheus config with promtool (if installed)
promtool check config /etc/prometheus/prometheus.yml 2>&1 || true
# Query Prometheus target status through a temporary port-forward
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring &
PF_PID=$!
sleep 2  # give the port-forward time to start listening
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
kill "$PF_PID"  # stop the background port-forward
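Export failures do not only appear in logs; the Collector also counts them in its internal telemetry. The sketch below sums one counter across all label sets from /metrics text such as the METRICS variable captured above. The function name sum_metric is hypothetical, and the counter name used in the usage line is an assumption about the Collector version in use (some releases append a _total suffix), so check the names in the endpoint output first.

```shell
# sum_metric NAME: read Prometheus text-format metrics on stdin and print
# the sum of all samples whose metric name is exactly NAME (any labels).
# Hypothetical helper; assumes samples carry no trailing timestamp.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                          # skip HELP and TYPE comment lines
    {
      brace = index($1, "{")               # position of the label block, if any
      metric = (brace ? substr($1, 1, brace - 1) : $1)
      if (metric == name) total += $NF     # the sample value is the last field
    }
    END { printf "%d\n", total }
  '
}
```

Example use: echo "$METRICS" | sum_metric otelcol_exporter_send_failed_spans. A value that keeps growing between runs suggests the exporter cannot reach its backend.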
Risk Controls
- Replicate the issue in a non-production environment before making changes.
- Back up the Collector's config.yaml before modifying it.
- Use Prometheus's --web.enable-lifecycle flag for hot-reload, but prefer updating via ConfigMap in production.
- Never edit production Prometheus or Collector configurations directly; use version control (Git).
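As a sketch of the backup step, assuming the Collector configuration lives in a ConfigMap, the helper below saves the live object to a timestamped file before any change. The function name backup_configmap, the ConfigMap name otel-collector-config in the usage line, and the KUBECTL override variable are all hypothetical; substitute the names used in your cluster.

```shell
# backup_configmap NAMESPACE NAME [DIR]: save the live ConfigMap as YAML in DIR
# (default: current directory) and print the path of the backup file.
# KUBECTL may be set to a wrapper or stub; it defaults to kubectl.
backup_configmap() {
  local ns name dir out
  ns=$1
  name=$2
  dir=${3:-.}
  out="$dir/backup-$name-$(date +%Y%m%d-%H%M%S).yaml"
  # Write to a temporary file first so a failed call leaves no empty backup.
  if ${KUBECTL:-kubectl} get configmap "$name" -n "$ns" -o yaml > "$out.tmp"; then
    mv "$out.tmp" "$out" && echo "$out"
  else
    rm -f "$out.tmp"
    return 1
  fi
}
```

Example use: backup_configmap observability otel-collector-config ./backups. The exported YAML includes server-set fields such as metadata.resourceVersion; removing them before re-applying the file avoids a possible conflict error.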
Rollback
- If changes to the Collector config cause problems, restore the previous ConfigMap with kubectl apply -f backup-otel-collector-config.yaml, then restart the pods.
- For Prometheus config, roll back the ConfigMap and trigger a reload: kubectl exec -n monitoring deploy/prometheus-server -- kill -HUP 1 (the deploy/ prefix targets a pod of the Deployment; a bare prometheus-server only works if a pod has exactly that name).
- In Grafana, use dashboard version history to revert to a previous version of a dashboard, including its panel queries and datasource selections.
Verification
- Ensure all pods are running and ready: kubectl get pods -n observability
- Check that all Prometheus targets are UP.
- Refresh Grafana dashboards; metrics should display correctly.
- Perform an end-to-end trace test: use a sample client to send a request and view the complete trace in Grafana Tempo.
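These checks rarely pass the instant a change is applied: pods take time to restart, and Prometheus needs at least one scrape interval to update target health. A minimal polling sketch, assuming a hypothetical helper named retry_until; the attempt count and delay in the usage line are illustrative values:

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it exits 0 or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  local attempts delay i
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Example use, assuming a port-forward to Prometheus on localhost:9090 is active: retry_until 12 5 sh -c 'curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=up == 0" | jq -e ".data.result | length == 0" > /dev/null'. This polls for up to about a minute until no target reports up as 0.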
When to Submit an OpsGlobal Ticket
Submit a ticket if:
- Basic configuration checks do not resolve the issue, and you suspect a compatibility problem between the Collector and Prometheus.
- The OpenTelemetry Collector keeps restarting or shows signs of a memory leak.
- The Grafana datasource connects successfully but no data appears, while Prometheus has data.
- You need to configure advanced features (e.g., sampling policies, multi-backend exporters) and lack experience.
When submitting, include:
- Relevant pod logs and kubectl describe output.
- Configuration files for Prometheus and the Collector (without secrets).
- The JSON model of the Grafana dashboard.
- Specific steps to reproduce the issue.
Use cases
Useful for teams that handle observability issues and need a clear troubleshooting and delivery workflow.
Problem background
In Kubernetes microservices environments, integrating Prometheus, Grafana, and OpenTelemetry enhances observability. This article walks through a real-world scenario, diagnosing missing data and latency issues, with commands, risk controls, rollback steps, and guidance on when to engage OpsGlobal.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.