Middleware Reliability in Practice: Troubleshooting Redis, RabbitMQ, and Kafka

From an SRE perspective, this post walks through real-world scenarios of connection timeouts, message loss, and partition offset issues in Redis, RabbitMQ, and Kafka on Kubernetes, providing diagnostic commands, risk controls, rollback plans, verification steps, and guidance on when to submit an OpsGlobal ticket.

NoSQL, 2026-06-20
Tags: Middleware, Reliability, Troubleshooting, Kubernetes, SRE

Scenario

An e-commerce platform runs Redis, RabbitMQ, and Kafka on Kubernetes. During a flash sale, Redis cache timeouts spike, RabbitMQ queues back up, and Kafka consumer offsets reset, causing order processing delays.

Symptoms

  • Redis: Application logs show 'Connection refused' or 'Timeout'; cache hit rate drops sharply.
  • RabbitMQ: Queue depth grows rapidly; consumer throughput declines; some messages are requeued or lost.
  • Kafka: Consumer groups experience 'OffsetOutOfRange' or frequent rebalances; consumer lag increases.

Diagnosis

Redis

  1. Check Redis service health:
       kubectl exec -it redis-pod -- redis-cli ping
     This should return PONG. If it fails, inspect the pod logs:
       kubectl logs redis-pod --tail=50
  2. Check Sentinel or Cluster status:
       redis-cli -h redis-sentinel -p 26379 sentinel master mymaster
     In Cluster mode, run redis-cli cluster info instead.
  3. Check memory usage and slow queries:
       redis-cli info memory
       redis-cli slowlog get
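
The memory check in step 3 can be turned into a single number to watch. The following is a minimal sketch, assuming the output of redis-cli info memory is piped in on stdin; the function name redis_mem_pct is hypothetical and not part of any Redis tooling.

```bash
# Hypothetical helper: reads `redis-cli info memory` output on stdin and prints
# used_memory as a whole-number percentage of maxmemory.
# Prints "unlimited" when maxmemory is 0 or absent (no memory cap configured).
redis_mem_pct() {
  awk -F: '
    { sub(/\r$/, "") }                  # INFO lines end in CRLF; strip the CR
    $1 == "used_memory" { used = $2 }   # exact match skips used_memory_human etc.
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", used * 100 / max
    }'
}

# Example usage (pod name taken from the steps above):
#   kubectl exec redis-pod -- redis-cli info memory | redis_mem_pct
```

A value close to 100 together with a falling hit rate suggests that eviction is active; "unlimited" means the pod's memory limit, not Redis, is the effective cap.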

RabbitMQ

  1. View node status:
       kubectl exec rabbitmq-pod -- rabbitmqctl status
  2. List queues and consumers (run inside the RabbitMQ pod):
       rabbitmqctl list_queues name messages_ready messages_unacknowledged
       rabbitmqctl list_consumers
  3. Tail the logs:
       kubectl logs rabbitmq-pod --tail=100
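
To spot backed-up queues quickly, the list_queues output can be filtered by total backlog. This is a rough sketch that assumes tab-separated input in the column order used above (name, messages_ready, messages_unacknowledged); the function name rabbit_backlog and the default threshold of 1000 are illustrative choices.

```bash
# Hypothetical helper: prints "queue<TAB>backlog" for every queue whose
# ready + unacknowledged message count exceeds the given threshold.
# Header and informational lines are skipped because their counts are not numeric.
rabbit_backlog() {
  threshold="${1:-1000}"
  awk -F'\t' -v t="$threshold" '
    $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && ($2 + $3) > t {
      printf "%s\t%d\n", $1, $2 + $3
    }'
}

# Example usage (pod name taken from the steps above):
#   kubectl exec rabbitmq-pod -- rabbitmqctl -q list_queues \
#     name messages_ready messages_unacknowledged | rabbit_backlog 1000
```

A large unacknowledged count with a small ready count usually points at slow or stuck consumers rather than at the broker.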

Kafka

  1. Check broker and topic metadata:
       kubectl exec kafka-pod -- kafka-broker-api-versions.sh --bootstrap-server localhost:9092
       kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092
  2. Examine consumer group state:
       kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group order-group --describe
  3. Inspect log segments:
       kafka-run-class.sh kafka.tools.DumpLogSegments --files /data/kafka/orders-0/00000000000000000000.log --print-data-log
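
The consumer group output from step 2 can be reduced to a total lag figure for alerting or before/after comparison. A minimal sketch, assuming the table printed by kafka-consumer-groups.sh --describe includes a header row with a LAG column; the function name kafka_total_lag is hypothetical.

```bash
# Hypothetical helper: sums the LAG column of `kafka-consumer-groups.sh --describe`
# output read from stdin. The column is located by its header name, so the
# position of LAG does not need to be hard-coded.
kafka_total_lag() {
  awk '
    lag_col == 0 {                        # still looking for the header row
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    $lag_col ~ /^[0-9]+$/ { total += $lag_col }   # skip "-" (no committed offset)
    END { print total + 0 }'
}

# Example usage (group name taken from the steps above):
#   kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#     --group order-group --describe | kafka_total_lag
```

Partitions that show "-" in the LAG column have no committed offset and are not counted, so a total of 0 does not by itself prove the group is healthy.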

Risk Controls

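  • Redis: Deploy Sentinel or Cluster to avoid a single point of failure; set maxmemory and an eviction policy (e.g., allkeys-lru); enable RDB/AOF persistence.
  • RabbitMQ: Configure mirrored queues (ha-mode: all); enable publisher confirms and manual consumer acknowledgements; set a dead letter exchange for failed messages. Classic mirrored queues are deprecated and were removed in RabbitMQ 4.0, so prefer quorum queues on recent versions.
  • Kafka: Set the replication factor to at least 3; configure min.insync.replicas=2; set acks=all on producers; on consumers, set auto.offset.reset to none (fail loudly when no valid offset exists) or earliest (reprocess from the start of the log) rather than relying on the default of latest, which skips messages.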
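
Some of these controls can be applied at runtime from the command line. The sketch below prints the commands by default instead of running them, so they can be reviewed first. The 2gb limit, the policy name ha-all, the queue pattern ^orders\. and the exchange name dlx are placeholders, not values from this environment, and the ha-mode policy applies only to RabbitMQ 3.x.

```bash
# Print each command instead of running it unless DRY_RUN=0 is set.
# Note: the printed form drops shell quoting, so treat it as a preview only.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around real CLI commands; all values are placeholders.
apply_risk_controls() {
  # Redis: cap memory and evict least-recently-used keys when the cap is hit.
  run redis-cli config set maxmemory 2gb
  run redis-cli config set maxmemory-policy allkeys-lru
  # RabbitMQ 3.x: mirror matching queues and route rejected messages to a DLX.
  run rabbitmqctl set_policy ha-all '^orders\.' '{"ha-mode":"all","dead-letter-exchange":"dlx"}'
  # Kafka: require two in-sync replicas for writes to the orders topic.
  run kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type topics --entity-name orders --add-config min.insync.replicas=2
}

# Preview:  apply_risk_controls
# Apply:    DRY_RUN=0 apply_risk_controls   (from a shell where the CLIs can reach each service)
```

Settings changed with config set last only until the Redis process restarts, so the same values should also go into the persistent configuration.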

Rollback

  • Redis: Restore from the latest RDB snapshot or AOF file: stop the master, place the file in its data directory, and restart it so the data is loaded at startup.
  • RabbitMQ: If mirrored queues are in use, switch to a healthy node; otherwise restore queue data from backup.
  • Kafka: Use kafka-reassign-partitions to redistribute replicas; if data is corrupted, restore the topic from backup.

Verification

  • Redis: Run redis-cli --latency -h <host> to measure latency and redis-cli info persistence to check persistence status; then simulate a failover and verify automatic recovery.
  • RabbitMQ: Publish and consume test messages; check queue depth and ack rate.
  • Kafka: Produce/consume test messages; ensure offsets are contiguous; use kafka-verifiable-producer and kafka-verifiable-consumer for end-to-end validation.
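
The contiguity check for Kafka offsets can be automated once the consumed offsets are available as plain numbers. This is a minimal sketch that assumes one offset per line on stdin, in consumption order, for a single partition; how the offsets are extracted depends on your tooling, and the function name offset_gaps is hypothetical.

```bash
# Hypothetical helper: reads one offset per line and reports every place where
# the next offset is not exactly previous + 1.
# Exit status is 0 when the sequence is contiguous and 1 when gaps were found.
offset_gaps() {
  awk '
    NR > 1 && $1 != prev + 1 { printf "gap: %d -> %d\n", prev, $1; gaps++ }
    { prev = $1 }
    END { exit (gaps > 0) }'
}

# Example usage with a file of offsets collected during the test run:
#   offset_gaps < offsets.txt && echo "offsets contiguous"
```

Gaps are not always a sign of loss: compacted topics and transactional producers legitimately leave holes in the offset sequence, so interpret the result with the topic configuration in mind.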

When to Submit an OpsGlobal Ticket

  • When the issue exceeds team capacity (e.g., cross-AZ network partition, severe data corruption);
  • When authoritative architecture review is needed (e.g., cluster scaling plan);
  • When instability persists with no clear root cause;
  • When you want to build a more robust monitoring and alerting system.

Use cases

Useful for teams that handle NoSQL and middleware incidents and need a clear troubleshooting and delivery workflow.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
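
One way to keep secrets out of scripts and shell history is to fail fast when a required variable is missing. The sketch below assumes the Redis password is supplied through REDISCLI_AUTH, which redis-cli reads automatically; the function name require_env is hypothetical.

```bash
# Hypothetical helper: returns non-zero if any named environment variable
# is unset or empty, so a script stops before running commands without credentials.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"          # indirect lookup that also works in POSIX sh
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Example usage: redis-cli picks up the password from REDISCLI_AUTH,
# so it never appears on the command line.
#   require_env REDISCLI_AUTH && redis-cli -h <host> ping
```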

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
