
Redis, RabbitMQ & Kafka Middleware Reliability: A Practical Guide

A deep dive into ensuring high availability of Redis, RabbitMQ, and Kafka in production Kubernetes environments. Covers real-world scenarios, symptoms, diagnosis, commands, risk controls, rollback, verification, and escalation criteria for OpsGlobal tickets.

NoSQL | 6 min read | 2026-06-18
Tags: Kubernetes, SRE, Redis, RabbitMQ, Kafka, Middleware Reliability

Scenario

You manage a multi-tenant Kubernetes cluster hosting Redis cache, RabbitMQ message queue, and Kafka event streaming. One day, users report severe order processing delays and payment confirmation failures. Initial observations show Redis master flapping, RabbitMQ queue backlog, and Kafka consumer group lag increasing.

Symptoms

  • Redis: INFO stats shows keyspace_misses spiking, redis-cli --latency reports latency above 100 ms, and repl_backlog_active in INFO replication fluctuates.
  • RabbitMQ: The Management UI shows queue length growing, rabbitmqctl list_queues shows message counts exceeding thresholds, and some queues hold unacknowledged messages.
  • Kafka: kafka-consumer-groups --describe shows LAG rising continuously, broker disk usage near 90%.
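
To track whether lag is actually rising, the per-partition LAG values can be collapsed into one number and sampled over time. The following is a minimal sketch, not a definitive tool: the function name total_lag is hypothetical, and it assumes the tabular output of kafka-consumer-groups --describe with a header row containing a LAG column.

```shell
#!/usr/bin/env bash
# total_lag (hypothetical helper): sum the LAG column of
# `kafka-consumer-groups --describe` output read from stdin.
total_lag() {
  awk '
    # Until the header row is seen, look for the LAG column;
    # its position differs between Kafka versions.
    lag_col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    # Partitions without a committed offset print "-" and are skipped.
    $lag_col ~ /^[0-9]+$/ { total += $lag_col }
    END { print total + 0 }
  '
}

# Usage (same placeholders as the commands in this guide):
#   kafka-consumer-groups --bootstrap-server <broker> --group <group> --describe | total_lag
```

Running it every minute and comparing consecutive values shows whether consumers are catching up or falling further behind.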

Diagnosis

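  1. Redis: Run redis-cli -h <host> -p <port> info replication to check master-replica status. Use SLOWLOG GET 10 to review the ten most recent slow commands. Check the Redis pod status and logs (kubectl get pods, kubectl logs <pod>); systemctl status redis applies only when Redis runs directly on a host.
  2. RabbitMQ: Execute rabbitmqctl list_queues name messages consumers to compare backlog against consumer count. Use rabbitmqctl status for memory/disk alarms. Run rabbitmqctl list_connections and rabbitmqctl list_channels for connection/channel checks, and rabbitmq-diagnostics check_local_alarms for node health.
  3. Kafka: Use kafka-topics --describe --topic <topic> --bootstrap-server <broker> to view partition distribution. kafka-log-dirs --describe --bootstrap-server <broker> reports log sizes per broker, which reveals disk imbalance. Check the broker's server.log (for example /var/log/kafka/server.log, or kubectl logs for a containerized broker) for errors.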
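
The RabbitMQ step above can be narrowed to the queues that need attention. This sketch filters the tab-separated output of rabbitmqctl list_queues name messages consumers; the function name backlogged_queues and the default threshold of 1000 messages are assumptions for illustration, not values from any RabbitMQ default.

```shell
#!/usr/bin/env bash
# backlogged_queues [THRESHOLD] (hypothetical helper): read
# `rabbitmqctl list_queues name messages consumers` output on stdin and
# print queues that exceed THRESHOLD messages or hold messages with no consumers.
backlogged_queues() {
  awk -F'\t' -v max="${1:-1000}" '
    # Keep only data rows: three tab-separated fields with numeric counts.
    # Banner and header lines do not match and are ignored.
    NF == 3 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
      if ($2 + 0 > max + 0)
        printf "%s\tmessages=%s\tover-threshold\n", $1, $2
      else if ($2 + 0 > 0 && $3 + 0 == 0)
        printf "%s\tmessages=%s\tno-consumers\n", $1, $2
    }
  '
}

# Usage:
#   rabbitmqctl list_queues name messages consumers | backlogged_queues 5000
```

A queue flagged no-consumers usually points at a crashed or disconnected consumer, while over-threshold with consumers attached suggests the consumers are too slow.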

Commands

Redis Fixes

# Manually promote a replica to master (only if automatic failover has failed)
redis-cli -h <slave_ip> -p 6379 replicaof no one

# Point the remaining replicas at the new master
redis-cli -h <other_slave> -p 6379 replicaof <new_master_ip> 6379

# Tune slow log (threshold is in microseconds: 10000 = 10 ms; keep the last 128 entries)
redis-cli CONFIG SET slowlog-log-slower-than 10000
redis-cli CONFIG SET slowlog-max-len 128

# Limit memory (allkeys-lru may evict any key, so use it only where Redis holds cache data)
redis-cli CONFIG SET maxmemory 4gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru

RabbitMQ Fixes

# Re-declare a durable queue (requires the management plugin; the internal
# rabbit_amqqueue:declare call via rabbitmqctl eval takes different arguments across releases)
rabbitmqadmin declare queue name=queue_name durable=true

# Close connections by pid (this closes ALL connections; filter the list first to target only stuck ones)
rabbitmqctl --quiet list_connections pid --no-table-headers | xargs -I {} rabbitmqctl close_connection "{}" "manually closed"

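# Apply HA policy (classic queue mirroring only; it was removed in RabbitMQ 4.0, where quorum queues replace it)
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 --apply-to queues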

Kafka Fixes

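# Skip the backlog by resetting offsets to latest (not a rebalance: the group must be stopped, and skipped messages are never consumed)
kafka-consumer-groups --bootstrap-server <broker> --group <group> --topic <topic> --reset-offsets --to-latest --execute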

# Increase partitions (irreversible; changes which partition keyed messages map to)
kafka-topics --alter --topic <topic> --partitions <new_count> --bootstrap-server <broker>

# Shorten retention to free space
kafka-configs --bootstrap-server <broker> --entity-type topics --entity-name <topic> --alter --add-config retention.ms=86400000
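
retention.ms is expressed in milliseconds, so 86400000 above corresponds to 24 hours. A small conversion helper avoids arithmetic slips when choosing a different window; the name retention_ms is hypothetical.

```shell
#!/usr/bin/env bash
# retention_ms HOURS (hypothetical helper): convert a retention window in
# hours to the millisecond value that the retention.ms topic config expects.
retention_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# Usage:
#   kafka-configs --bootstrap-server <broker> --entity-type topics \
#     --entity-name <topic> --alter --add-config retention.ms="$(retention_ms 24)"
```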

Risk Controls

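  • Redis: Before REPLICAOF NO ONE, ensure the replica is fully synced (master_link_status:up and matching replication offsets), because writes that were not replicated are lost on promotion. When adjusting maxmemory, leave headroom below the container memory limit to avoid an OOM kill.
  • RabbitMQ: Closing connections discards uncommitted transactions and unconfirmed publishes, and requeues unacknowledged messages for redelivery; back up policies (rabbitmqctl list_policies) before modifications.
  • Kafka: Partition increase is irreversible; assess downstream consumer compatibility. Resetting offsets to latest skips unprocessed messages, while resetting to earliest replays everything still retained and causes duplicates. Shortening retention permanently deletes older log segments.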
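
The Redis pre-check can be scripted against the replica's own INFO replication output. This is a sketch under stated assumptions: replica_ready is a hypothetical name, and it only confirms that the link is up and no full sync is running. It does not compare slave_repl_offset with the master's master_repl_offset, which is still worth doing by hand when the master is reachable.

```shell
#!/usr/bin/env bash
# replica_ready (hypothetical helper): read `INFO replication` output from a
# replica on stdin; exit 0 only if its link to the master is up and no sync
# is in progress, exit 1 otherwise.
replica_ready() {
  # redis-cli emits CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "role"                    { role = $2 }
    $1 == "master_link_status"      { link = $2 }
    $1 == "master_sync_in_progress" { syncing = $2 }
    END { exit !(role == "slave" && link == "up" && syncing == "0") }
  '
}

# Usage:
#   redis-cli -h <slave_ip> -p 6379 info replication | replica_ready \
#     && echo "replica looks safe to promote"
```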

Rollback

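  • Redis: Reattach the old master as a replica: redis-cli -h <old_master> -p 6379 replicaof <new_master> 6379. Its dataset is replaced by the new master's; promote it back only after it has fully resynced.
  • RabbitMQ: Remove policy: rabbitmqctl clear_policy ha-all. Restart the node if needed: restart its pod in Kubernetes, or run systemctl restart rabbitmq-server on a host install.
  • Kafka: Cannot reduce partitions; recreate topic and migrate. Offsets can be reset with --to-earliest, which replays all retained messages, so consumers must tolerate duplicates. A retention override is removed with kafka-configs --alter --delete-config retention.ms, but segments already deleted cannot be recovered.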

Verification

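  • Redis: Hit ratio, computed as keyspace_hits / (keyspace_hits + keyspace_misses) from INFO stats, is above 99%; redis-cli --latency reports under 10 ms; INFO replication shows master_link_status:up on every replica.
  • RabbitMQ: Queue length is back in its normal range, rabbitmqctl list_queues name messages_unacknowledged shows 0 unacked, and the consumer rate keeps up with the publish rate.
  • Kafka: kafka-consumer-groups --describe shows LAG near 0 and producer throughput meets the SLO.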
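
Redis does not report a hit ratio directly; it has to be derived from the keyspace_hits and keyspace_misses counters. A minimal sketch follows, with hit_ratio as a hypothetical helper name. Note that both counters are cumulative since the server started (or since the last CONFIG RESETSTAT), so a ratio taken right after an incident still includes the misses from the incident itself.

```shell
#!/usr/bin/env bash
# hit_ratio (hypothetical helper): read `INFO stats` output on stdin and print
# keyspace_hits / (keyspace_hits + keyspace_misses) as a percentage.
hit_ratio() {
  # redis-cli emits CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) print "n/a"   # no key lookups recorded yet
      else printf "%.2f\n", 100 * hits / total
    }
  '
}

# Usage:
#   redis-cli -h <host> -p <port> info stats | hit_ratio
```

For a post-fix check, sampling the two counters twice and computing the ratio over the difference gives a cleaner picture than the lifetime totals.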

When to Submit an OpsGlobal Ticket

  • Problem recurs after multiple manual interventions.
  • Requires hardware upgrade or cluster architecture changes.
  • Core business is impacted and internal team cannot resolve quickly.
  • Need professional audit of configuration and performance tuning.

Use cases

Useful for teams handling NoSQL issues and needing a clear troubleshooting and delivery workflow.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
