Scenario
You manage a multi-tenant Kubernetes cluster hosting Redis cache, RabbitMQ message queue, and Kafka event streaming. One day, users report severe order processing delays and payment confirmation failures. Initial observations show Redis master flapping, RabbitMQ queue backlog, and Kafka consumer group lag increasing.
Symptoms
- Redis: `INFO stats` shows `keyspace_misses` spiking, latency exceeds 100 ms, and `repl_backlog_active` (reported under `INFO replication`) fluctuates.
- RabbitMQ: Queue length keeps growing in the Management UI, `rabbitmqctl list_queues` shows message counts above thresholds, and some queues hold unacked messages.
- Kafka: `kafka-consumer-groups --describe` shows `LAG` rising continuously, and broker disk usage is near 90%.
Diagnosis
- Redis: Run `redis-cli -h <host> -p <port> info replication` to check master-slave status. Use `SLOWLOG GET 10` to analyze slow queries. Check `systemctl status redis` and logs.
- RabbitMQ: Execute `rabbitmqctl list_queues name messages consumers` to assess consumer efficiency. Use `rabbitmqctl status` for memory/disk alarms. Run `rabbitmq-diagnostics` for connection/channel checks.
- Kafka: Use `kafka-topics --describe --topic <topic> --bootstrap-server <broker>` to view partition distribution. `kafka-log-dirs --describe --bootstrap-server <broker>` detects disk imbalance. Check `/var/log/kafka/server.log` for errors.
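The RabbitMQ check above can be partly automated. The following is a minimal sketch, assuming the tab-separated output of `rabbitmqctl -q list_queues name messages consumers` is piped in; the function name `find_stalled_queues` is hypothetical and not part of any RabbitMQ tooling.

```shell
# Flag queues that hold messages but have no consumers attached.
# Input (stdin): tab-separated rows "name<TAB>messages<TAB>consumers".
# Output: "name<TAB>messages" for each stalled queue.
find_stalled_queues() {
  awk -F'\t' '
    # Header rows and non-numeric counters fail the regex tests and are skipped.
    $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 > 0 && $3 == 0 { print $1 "\t" $2 }
  '
}
```

A possible invocation is `rabbitmqctl -q list_queues name messages consumers | find_stalled_queues`. An empty result suggests the backlog comes from slow consumers rather than missing ones.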
Commands
Redis Fixes
# Manually promote a slave to master (if auto-failover fails)
redis-cli -h <slave_ip> -p 6379 replicaof no one
# Then point the other slaves to the new master
redis-cli -h <other_slave> -p 6379 replicaof <new_master_ip> 6379
# Tune slow log (threshold is in microseconds: 10000 = 10 ms)
redis-cli CONFIG SET slowlog-log-slower-than 10000
redis-cli CONFIG SET slowlog-max-len 128
# Limit memory
redis-cli CONFIG SET maxmemory 4gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
RabbitMQ Fixes
# Force re-declare queue
rabbitmqctl eval 'rabbit_amqqueue:declare(rabbit_misc:r(<<"/">>, queue, <<"queue_name">>), true, false, []).'
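# Close client connections (this closes ALL of them; filter the pid list first to target only stuck ones)
# close_connection expects a connection pid, not a connection name
rabbitmqctl -q list_connections pid | grep '^<' | xargs -I {} rabbitmqctl close_connection {} "manually closed"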
# Apply HA policy (classic mirrored queues; removed in RabbitMQ 4.0, where quorum queues replace them)
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1
Kafka Fixes
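# Skip the backlog by resetting offsets to latest (unconsumed messages are skipped; stop the group's consumers first)
kafka-consumer-groups --bootstrap-server <broker> --group <group> --topic <topic> --reset-offsets --to-latest --execute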
# Increase partitions (irreversible)
kafka-topics --alter --topic <topic> --partitions <new_count> --bootstrap-server <broker>
# Shorten retention to free space
kafka-configs --bootstrap-server <broker> --entity-type topics --entity-name <topic> --alter --add-config retention.ms=86400000
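The value `86400000` above is 24 hours expressed in milliseconds. The following is a small sketch for deriving `retention.ms` from a number of hours. It only prints the `kafka-configs` command so it can be reviewed before running; the function name `retention_cmd` is hypothetical.

```shell
# Print (do not run) the kafka-configs command that sets topic retention
# to the given number of hours.
# Usage: retention_cmd <broker> <topic> <hours>
retention_cmd() {
  broker=$1
  topic=$2
  hours=$3
  # Reject empty or non-integer input before doing arithmetic.
  case $hours in
    ''|*[!0-9]*) echo "hours must be a non-negative integer" >&2; return 1 ;;
  esac
  # hours -> milliseconds
  retention_ms=$((hours * 60 * 60 * 1000))
  echo "kafka-configs --bootstrap-server $broker --entity-type topics --entity-name $topic --alter --add-config retention.ms=$retention_ms"
}
```

Printing the command instead of executing it keeps the change reviewable, which fits a fix that deletes data once the shorter retention takes effect.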
Risk Controls
- Redis: Before `REPLICAOF`, ensure slave is fully synced; when adjusting `maxmemory`, leave headroom to avoid OOM.
- RabbitMQ: Closing connections may lose transactional messages; back up policies before modifications.
- Kafka: Partition increase is irreversible; assess downstream consumer compatibility. Offset resets may cause duplicates or message loss.
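The Redis precondition above (slave fully synced before promotion) can be checked from the replica's own `info replication` output. The following is a minimal sketch; the function name `replica_is_synced` is hypothetical. It treats a replica as synced when `master_link_status` is `up` and `master_sync_in_progress` is `0`, a reasonable but not exhaustive test, since it does not compare replication offsets.

```shell
# Return 0 only if the replica reports a healthy, completed sync.
# Input (stdin): output of `redis-cli -h <slave_ip> -p 6379 info replication`.
replica_is_synced() {
  # redis-cli emits CRLF line endings when piped, so strip \r before matching.
  tr -d '\r' | awk -F: '
    $1 == "role"                    { role = $2 }
    $1 == "master_link_status"      { link = $2 }
    $1 == "master_sync_in_progress" { sync = $2 }
    # Exit status 0 when all three conditions hold, 1 otherwise.
    END { exit !(role == "slave" && link == "up" && sync == "0") }
  '
}
```

One way to use it as a guard is `redis-cli -h <slave_ip> -p 6379 info replication | replica_is_synced && echo "safe to promote"`.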
Rollback
- Redis: Reattach old master: `redis-cli -h <old_master> replicaof <new_master> 6379`, then promote it back if needed.
- RabbitMQ: Remove policy: `rabbitmqctl clear_policy ha-all`. Restart node: `systemctl restart rabbitmq-server`.
- Kafka: Cannot reduce partitions; recreate topic and migrate. Reset offsets using `--to-earliest`.
Verification
- Redis: `INFO stats` shows `keyspace_hits` ratio >99%, latency <10ms, replication OK.
- RabbitMQ: Queue length normal, `rabbitmqctl list_queues` shows 0 unacked, consumer rate matches.
- Kafka: `kafka-consumer-groups --describe` LAG near 0, producer throughput meets SLO.
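Two of these checks reduce to simple arithmetic on CLI output. The following is a sketch with hypothetical helper names (`redis_hit_ratio`, `kafka_total_lag`). It assumes the hit ratio is `keyspace_hits / (keyspace_hits + keyspace_misses)` and that the `LAG` column is located from the header row, because its position differs between Kafka versions.

```shell
# Print the cache hit ratio as a percentage with two decimals.
# Input (stdin): output of `redis-cli -h <host> -p <port> info stats`.
redis_hit_ratio() {
  # Strip the CR characters that redis-cli emits when piped.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.2f\n", 100 * hits / total
    }
  '
}

# Print the summed consumer lag across all partitions.
# Input (stdin): output of `kafka-consumer-groups --describe ...`.
kafka_total_lag() {
  awk '
    # Until the header is seen, look for the LAG column and skip the line.
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    # Sum numeric values only; "-" (no committed offset) is ignored.
    $col ~ /^[0-9]+$/ { total += $col }
    END { print total + 0 }
  '
}
```

Both counters in `INFO stats` are cumulative since server start (or the last `CONFIG RESETSTAT`), so the ratio reflects the whole uptime rather than the last few minutes.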
When to Submit an OpsGlobal Ticket
- Problem recurs after multiple manual interventions.
- Requires hardware upgrade or cluster architecture changes.
- Core business is impacted and internal team cannot resolve quickly.
- Need professional audit of configuration and performance tuning.
Use cases
Useful for teams that run Redis, RabbitMQ or Kafka and need a clear troubleshooting and delivery workflow.
Problem background
A deep dive into ensuring high availability of Redis, RabbitMQ, and Kafka in production Kubernetes environments. Covers real-world scenarios, symptoms, diagnosis, commands, risk controls, rollback, verification, and escalation criteria for OpsGlobal tickets.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
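As one illustration of keeping secrets out of command lines, `redis-cli` reads its password from the `REDISCLI_AUTH` environment variable. The following is a minimal wrapper sketch; the function name `redis_cmd` and the `REDIS_HOST` / `REDIS_PORT` variables are hypothetical conventions, not something `redis-cli` itself defines.

```shell
# Run redis-cli with the password supplied through the environment.
# redis-cli picks up REDISCLI_AUTH on its own, so the secret never
# appears in the argument list or in shell history.
redis_cmd() {
  if [ -z "${REDISCLI_AUTH:-}" ]; then
    echo "REDISCLI_AUTH is not set" >&2
    return 1
  fi
  # REDIS_HOST and REDIS_PORT are assumed variable names with local defaults.
  redis-cli -h "${REDIS_HOST:-127.0.0.1}" -p "${REDIS_PORT:-6379}" "$@"
}
```

For example, after exporting `REDISCLI_AUTH` from a secret store, `redis_cmd info replication` runs the same diagnosis command without a `-a <password>` argument.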
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.