Redis, RabbitMQ, Kafka: A Practical Guide to Middleware Reliability in Kubernetes

Learn how to diagnose and resolve common reliability issues in Redis, RabbitMQ, and Kafka running on Kubernetes, with step-by-step commands and risk controls.

Category: NoSQL | Published: 2026-06-14
Tags: Kubernetes, SRE, Middleware Reliability

Scenario

An e-commerce platform runs on Kubernetes, using Redis for caching, RabbitMQ for order processing, and Kafka for event streaming. During a flash sale, customers report checkout failures, slow page loads, and delayed order processing.

Symptoms

  • 99th percentile response time spikes from 50ms to 2s
  • Redis memory usage exceeds 80%, slow query log grows rapidly
  • RabbitMQ queue backlog: message count jumps from hundreds to hundreds of thousands
  • Kafka consumer lag increases from milliseconds to minutes, with some consumer groups rebalancing

Diagnosis

Redis

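  1. Check slow queries: redis-cli SLOWLOG GET 10 lists the ten most recent slow entries; look for O(N) commands such as KEYS and SORT, or operations on large keys.
  2. Monitor memory: redis-cli INFO memory shows used_memory, used_memory_rss, and maxmemory; redis-cli INFO stats shows evicted_keys, which rises when the eviction policy is actively removing keys.
  3. Inspect connections: redis-cli CLIENT LIST shows each client's address, idle time, and last command; redis-cli INFO clients reports connected_clients for a quick count.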
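
As a rough illustration of step 2, the following Python sketch parses captured INFO output and reports how close used_memory is to maxmemory. The sample text and the function names are hypothetical; in practice the input would be the text returned by redis-cli INFO memory.

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse `redis-cli INFO` output: key:value lines with '#' section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_ratio(info: Dict[str, str]) -> Optional[float]:
    """Return used_memory / maxmemory, or None when maxmemory is 0 (no limit set)."""
    maxmemory = int(info.get("maxmemory", "0"))
    if maxmemory == 0:
        return None
    return int(info["used_memory"]) / maxmemory


# Hypothetical sample, shaped like the Memory section of INFO output.
SAMPLE = """# Memory
used_memory:858993459
used_memory_rss:901943132
maxmemory:1073741824
maxmemory_policy:allkeys-lru
"""

info = parse_info(SAMPLE)
ratio = memory_usage_ratio(info)
if ratio is None:
    print("maxmemory is not set; Redis will not evict keys")
else:
    print(f"memory usage: {ratio:.0%} of maxmemory")
```

A maxmemory of 0 means no limit is configured, so the ratio is undefined and the sketch reports that case separately.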

RabbitMQ

  1. Check queue status: rabbitmqctl list_queues name messages_ready messages_unacknowledged to find backlogged queues.
  2. Examine consumers: rabbitmqctl list_consumers to ensure consumers are connected and prefetch settings are appropriate.
  3. Monitor node health: rabbitmqctl status for alarms (e.g., memory or disk alarms).
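
To show how the list_queues output from step 1 might be triaged, here is a minimal Python sketch that picks out queues above a backlog threshold. It assumes tab-separated output with the three columns requested above; the sample text, queue names, and threshold are hypothetical.

```python
from typing import List, Tuple


def backlogged_queues(output: str, threshold: int) -> List[Tuple[str, int, int]]:
    """Return (name, ready, unacked) for queues whose total backlog >= threshold."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        # Skip banner lines and the column header: only data rows have
        # exactly three fields with numeric counts.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        name, ready, unacked = parts[0], int(parts[1]), int(parts[2])
        if ready + unacked >= threshold:
            rows.append((name, ready, unacked))
    # Largest backlog first.
    return sorted(rows, key=lambda r: r[1] + r[2], reverse=True)


# Hypothetical sample, shaped like rabbitmqctl list_queues output.
SAMPLE = (
    "Listing queues for vhost / ...\n"
    "name\tmessages_ready\tmessages_unacknowledged\n"
    "emails\t120\t3\n"
    "payments\t15000\t200\n"
    "orders\t240000\t50\n"
)

for name, ready, unacked in backlogged_queues(SAMPLE, threshold=10_000):
    print(f"{name}: ready={ready} unacked={unacked}")
```

A high messages_ready count suggests too few or too slow consumers, while a high messages_unacknowledged count suggests consumers that have received messages but are stuck processing them.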

Kafka

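  1. View consumer group lag: kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group order-consumer and read the LAG column for each partition.
  2. Check partition leadership: kafka-topics --bootstrap-server localhost:9092 --describe --topic orders to confirm that leaders are spread evenly across brokers and that each partition's ISR is complete.
  3. Monitor disk: kafka-log-dirs --describe --bootstrap-server localhost:9092 reports log directory size per partition. This command does not cover network; use broker metrics for network throughput.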
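
The lag check in step 1 can be summarized with a short Python sketch that reads captured kafka-consumer-groups --describe output. Column order differs between Kafka versions, so the sketch locates columns by header name; the sample rows, consumer IDs, and hosts are hypothetical.

```python
from typing import Dict, Tuple


def lag_by_partition(output: str) -> Dict[Tuple[str, int], int]:
    """Map (topic, partition) to LAG, skipping rows with no committed offset."""
    lines = [line for line in output.splitlines() if line.strip()]
    header_idx = next(
        (i for i, line in enumerate(lines) if "LAG" in line.split()), None
    )
    if header_idx is None:
        return {}
    cols = lines[header_idx].split()
    topic_i, part_i, lag_i = cols.index("TOPIC"), cols.index("PARTITION"), cols.index("LAG")
    lags = {}
    for line in lines[header_idx + 1:]:
        fields = line.split()
        # LAG is "-" when the group has not committed an offset for the partition.
        if len(fields) <= lag_i or not fields[lag_i].isdigit():
            continue
        lags[(fields[topic_i], int(fields[part_i]))] = int(fields[lag_i])
    return lags


# Hypothetical sample, shaped like kafka-consumer-groups --describe output.
SAMPLE = """
GROUP           TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
order-consumer  orders  0          1000            1500            500  c-1          /10.0.0.5  client-1
order-consumer  orders  1          2000            2100            100  c-2          /10.0.0.6  client-2
order-consumer  orders  2          -               900             -    -            -          -
"""

lags = lag_by_partition(SAMPLE)
print("total lag:", sum(lags.values()))
for (topic, partition), lag in sorted(lags.items()):
    print(f"{topic}-{partition}: {lag}")
```

Lag concentrated on one partition usually points at a hot key or a stuck consumer, while lag spread evenly across partitions points at insufficient consumer throughput overall.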

Commands (with safety notes)

Redis

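  • redis-cli --bigkeys scans the keyspace and reports the largest key of each type (caution: avoid peak hours; add -i 0.1 to pause between batches of SCAN calls and reduce load).
  • redis-cli CONFIG SET timeout 30 closes client connections that have been idle for more than 30 seconds (caution: clients that keep idle pooled connections without reconnect logic may see errors; test first).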

RabbitMQ

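  • rabbitmqadmin declare queue name=backup durable=true arguments='{"x-max-length":100000}' creates a bounded queue that prevents unlimited backlog (caution: once the limit is reached, the default overflow behavior drops the oldest messages from the head of the queue).
  • Adjust prefetch: rabbitmqctl has no command for this. Prefetch is set by each consumer on its own channel with basic.qos (for example, a prefetch count of 50), so a change requires updating the consumer configuration and restarting the consumer.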

Kafka

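  • Increase partitions: kafka-topics --bootstrap-server localhost:9092 --alter --topic orders --partitions 6 (caution: the partition count can only be increased, never decreased; keyed messages may map to different partitions afterwards, so per-key ordering is not preserved across the change; confirm the key design tolerates this).
  • Reset consumer offsets: kafka-consumer-groups --bootstrap-server localhost:9092 --group order-consumer --topic orders --reset-offsets --to-earliest --execute (dangerous: replays every retained message and causes duplicate consumption; the consumer group must be stopped first; run with --dry-run in place of --execute to preview the result; test only).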
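
The key-design caveat on increasing partitions can be illustrated with a small Python sketch. It assumes the partition is chosen as hash(key) modulo the partition count and uses CRC32 purely as a stand-in hash (Kafka's Java client uses murmur2 for keyed records), so the exact keys that move will differ from a real cluster; the customer keys are hypothetical.

```python
import zlib
from typing import Iterable, List


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition as hash(key) mod N. CRC32 is a stand-in, not Kafka's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys: Iterable[str], old: int, new: int) -> List[str]:
    """Return keys whose partition changes when the count goes from old to new."""
    return [k for k in keys if partition_for(k, old) != partition_for(k, new)]


# Hypothetical order keys.
keys = [f"customer-{i}" for i in range(1000)]
moved = moved_keys(keys, old=3, new=6)
print(f"{len(moved)} of {len(keys)} keys map to a different partition after 3 -> 6")
```

Any key that moves will have its older messages on one partition and its newer messages on another, which is why consumers that rely on per-key ordering should drain the topic before the change or tolerate a brief reordering window.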

Risk Controls

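  • Redis: Keep the slow log enabled (slowlog-log-slower-than 10000, in microseconds, i.e. 10 ms), set maxmemory together with maxmemory-policy allkeys-lru for cache-only instances, and use connection pooling.
  • RabbitMQ: Cap queues by message count (x-max-length) and by total message body size (x-max-length-bytes), enable lazy queues on versions where classic queues do not already page messages to disk by default, and use manual acknowledgments.
  • Kafka: Configure min.insync.replicas=2 on topics with replication factor 3 and have producers use acks=all, tune consumer max.poll.records and session.timeout.ms so that slow batches do not trigger rebalances, and use monitoring (e.g., Prometheus + JMX Exporter).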
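
The effect of min.insync.replicas depends on the topic's replication factor and only applies to producers using acks=all. The Python sketch below works through that arithmetic; the replication factor of 3 is an assumption, since the original does not state one.

```python
def tolerated_replica_losses(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas a partition can lose while acks=all producers can still write."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


# Assumed replication factor of 3 with the min.insync.replicas=2 setting above.
for rf, min_isr in [(3, 2), (2, 2), (3, 1)]:
    losses = tolerated_replica_losses(rf, min_isr)
    print(f"replication.factor={rf} min.insync.replicas={min_isr}: "
          f"writes survive the loss of {losses} replica(s)")
```

With a replication factor of 2 and min.insync.replicas=2, a single broker outage blocks acks=all writes, so the setting is usually paired with a replication factor of at least 3.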

Rollback

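If configuration changes cause issues, revert immediately:

  • Redis: Restore the policy that was in place before the change, as recorded with redis-cli CONFIG GET maxmemory-policy (for example, redis-cli CONFIG SET maxmemory-policy volatile-lru if that was the original value).
  • RabbitMQ: Delete the newly created bounded queue and restore the original queue.
  • Kafka: Decreasing partitions is not supported; the only way back is to delete and recreate the topic, which discards its data. Consumer offsets can be reset to their previous positions if those were recorded beforehand.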
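
One way to make the Redis rollback reliable is to capture the current values before changing anything and generate the restore commands from that snapshot. The Python sketch below assumes the raw, non-interactive output of redis-cli CONFIG GET (parameter name and value on alternating lines); the sample values and function names are hypothetical.

```python
import shlex
from typing import Dict, List


def parse_config_get(output: str) -> Dict[str, str]:
    """Parse raw `redis-cli CONFIG GET` output: names and values on alternating lines."""
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def rollback_commands(snapshot: Dict[str, str]) -> List[str]:
    """Build one CONFIG SET command per recorded parameter."""
    return [
        f"redis-cli CONFIG SET {shlex.quote(name)} {shlex.quote(value)}"
        for name, value in snapshot.items()
    ]


# Hypothetical snapshot taken before the change window.
SAMPLE = "maxmemory-policy\nnoeviction\ntimeout\n0\n"

snapshot = parse_config_get(SAMPLE)
for command in rollback_commands(snapshot):
    print(command)
```

Storing the generated commands alongside the change record means the rollback does not depend on anyone remembering what the original values were.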

Operate during off-peak hours and keep configuration backups.

Verification

  • Run synthetic transactions: simulate user login, browse, and checkout flows, monitoring success rate.
  • Use Prometheus+Grafana dashboards: observe latency, queue length, consumer lag trends.
  • Perform light load testing: use Locust or JMeter to send simulated traffic and check that latency and error rates stay within normal ranges.
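
The success rate and 99th percentile latency from a synthetic run can be computed with a few lines of Python. This sketch uses the nearest-rank percentile method; the latency samples and request counts are hypothetical, and how the samples are collected is left to the load tool.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def success_rate(successes: int, total: int) -> float:
    """Fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


# Hypothetical latencies in milliseconds from 100 synthetic checkouts.
latencies_ms = [40] * 98 + [45, 1900]

print("p99 latency (ms):", percentile(latencies_ms, 99))
print("max latency (ms):", percentile(latencies_ms, 100))
print(f"success rate: {success_rate(997, 1000):.1%}")
```

Comparing the p99 value against the pre-incident baseline (50 ms in the scenario above) gives a concrete pass or fail signal for the verification step.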

When to Submit an OpsGlobal Ticket

If internal diagnosis exceeds 30 minutes without identifying the root cause, or if you need expert assistance for architecture changes, configuration optimization, or failover, submit a ticket immediately. OpsGlobal's SRE team provides 24/7 support to help you quickly restore service and optimize systems.

Use cases

Useful for teams that troubleshoot Redis, RabbitMQ, or Kafka on Kubernetes and need a clear troubleshooting and delivery workflow.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
