Scenario
An e-commerce platform runs on Kubernetes, using Redis for caching, RabbitMQ for order processing, and Kafka for event streaming. During a flash sale, customers report checkout failures, slow page loads, and delayed order processing.
Symptoms
- 99th percentile response time spikes from 50ms to 2s
- Redis memory usage exceeds 80%, slow query log grows rapidly
- RabbitMQ queue backlog: message count jumps from hundreds to hundreds of thousands
- Kafka consumer lag increases from milliseconds to minutes, with some consumer groups rebalancing
Diagnosis
Redis
- Check slow queries to identify slow commands (e.g., KEYS, SORT, operations on large keys):
  redis-cli SLOWLOG GET 10
- Monitor memory for used_memory_rss and maxmemory, and check whether eviction policies are being triggered:
  redis-cli INFO memory
- Inspect connections to see whether the connection count is abnormal:
  redis-cli CLIENT LIST
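The memory check above can be turned into a single number. The following is a minimal sketch, assuming the standard `name:value` layout of `INFO memory` output; the function name `redis_mem_pct` is hypothetical and not part of redis-cli.

```shell
# Reads `redis-cli INFO memory` output on stdin and prints used_memory as a
# percentage of maxmemory. Prints "unlimited" when maxmemory is 0 or absent.
redis_mem_pct() {
  # INFO output uses CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%.1f\n", used * 100 / max
    }'
}
```

Usage, assuming redis-cli can reach the instance: redis-cli INFO memory | redis_mem_pct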
RabbitMQ
- Check queue status to find backlogged queues:
  rabbitmqctl list_queues name messages_ready messages_unacknowledged
- Examine consumers to ensure they are connected and prefetch settings are appropriate:
  rabbitmqctl list_consumers
- Monitor node health for alarms (e.g., memory or disk alarms):
  rabbitmqctl status
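To pick out backlogged queues from a long listing, the queue status output can be filtered. This is a sketch, assuming tab-separated output in the column order name, messages_ready, messages_unacknowledged; the function name `backlogged_queues` and the default threshold are hypothetical.

```shell
# Reads `rabbitmqctl list_queues name messages_ready messages_unacknowledged`
# output on stdin and prints each queue whose ready + unacknowledged total
# exceeds the threshold given as $1 (default 10000, an arbitrary choice).
backlogged_queues() {
  local threshold="${1:-10000}"
  awk -F'\t' -v t="$threshold" '
    # Banner and header lines have no numeric counts, so they are skipped.
    $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
      total = $2 + $3
      if (total > t) print $1, total
    }'
}
```

Usage: rabbitmqctl list_queues name messages_ready messages_unacknowledged | backlogged_queues 50000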
Kafka
- View consumer group lag (the LAG column):
  kafka-consumer-groups --bootstrap-server localhost:9092 --group order-consumer --describe
- Check partition leadership to confirm balanced partition distribution:
  kafka-topics --bootstrap-server localhost:9092 --describe --topic orders
- Monitor disk and network; for disk usage per log directory:
  kafka-log-dirs --describe --bootstrap-server localhost:9092
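The per-partition LAG values can be summed into one figure per consumer group. The sketch below assumes the describe output has a header row containing a column named LAG; the column position is looked up because it differs between Kafka versions. The function name `total_lag` is hypothetical.

```shell
# Reads `kafka-consumer-groups --describe` output on stdin and prints the
# sum of the LAG column. Rows where LAG is "-" (no committed offset) are ignored.
total_lag() {
  awk '
    # Until the header row is seen, look for the LAG column index.
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    $col ~ /^[0-9]+$/ { sum += $col }
    END { print sum + 0 }'
}
```

Usage: kafka-consumer-groups --bootstrap-server localhost:9092 --group order-consumer --describe | total_lag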
Commands (with safety notes)
Redis
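- Scan for large keys (caution: avoid peak hours; use -i to sleep between SCAN calls and reduce load):
  redis-cli --bigkeys -i 0.1
- Set the idle client timeout in seconds (caution: connections idle longer than this are closed, which may disrupt clients; test first):
  redis-cli config set timeout 30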
RabbitMQ
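- Create a bounded queue to prevent unlimited backlog:
  rabbitmqadmin declare queue name=backup arguments='{"x-max-length":100000}'
- Adjust prefetch: there is no rabbitmqctl command for this. Prefetch is set by each consumer on its channel through basic.qos (for example, a prefetch count of 50), so changing it requires updating the consumer configuration and restarting the consumers.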
Kafka
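- Increase partitions (partitions can only be increased, never decreased; adding partitions changes which partition a key maps to, so proceed only if the key design supports it):
  kafka-topics --bootstrap-server localhost:9092 --alter --topic orders --partitions 6
- Reset consumer offsets (dangerous: causes duplicate consumption; test only; the consumer group must be stopped, and omitting --execute performs a dry run):
  kafka-consumer-groups --bootstrap-server localhost:9092 --group order-consumer --topic orders --reset-offsets --to-earliest --execute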
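The key-design caveat on increasing partitions can be illustrated with a toy partitioner. This is only a sketch: it uses the `cksum` CRC as a stand-in hash, whereas Kafka's default partitioner uses murmur2, so real assignments will differ. The function name `partition_for` and the sample keys are hypothetical.

```shell
# Stand-in partitioner: checksum of the key modulo the partition count.
# Assumption: cksum replaces Kafka's murmur2 hash purely for illustration.
partition_for() {
  local key="$1" partitions="$2" hash
  hash=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  echo $(( hash % partitions ))
}

# Show where each key lands before and after growing from 3 to 6 partitions.
for key in order-1001 order-1002 order-1003 order-1004; do
  echo "$key: 3 partitions -> $(partition_for "$key" 3), 6 partitions -> $(partition_for "$key" 6)"
done
```

Any key whose partition number differs between the two columns would have its new messages routed to a different partition than its existing ones, which breaks per-key ordering across the change.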
Risk Controls
- Redis: Enable slow query log (slowlog-log-slower-than 10000 microseconds), set maxmemory-policy allkeys-lru, use connection pooling.
- RabbitMQ: Set queue max length (x-max-length) and max total size of queued message bodies in bytes (x-max-length-bytes), enable lazy queues, use manual acknowledgments.
- Kafka: Configure min.insync.replicas=2 (effective only when producers use acks=all and the replication factor is at least 3), set consumer max.poll.records and session.timeout.ms, use monitoring (e.g., Prometheus + JMX Exporter).
Rollback
If configuration changes cause issues, revert immediately:
- Redis: restore the policy recorded before the change, for example redis-cli config set maxmemory-policy volatile-lru if volatile-lru was the original value.
- RabbitMQ: Delete the newly created bounded queue and restore the original queue.
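- Kafka: A partition increase cannot be reverted, because decreasing partitions is not supported; the only way back is to delete and recreate the topic, which discards its data. To undo an offset reset, reset the consumer group's offsets to the previously recorded positions.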
Operate during off-peak hours and keep configuration backups.
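One way to keep a configuration backup for Redis is to record the current values as ready-to-run rollback commands before changing anything. This is a sketch, assuming redis-cli's piped output for CONFIG GET, which prints each parameter name and its value on alternating lines; the function name `redis_rollback_cmds` is hypothetical.

```shell
# Reads `redis-cli CONFIG GET <pattern>` output on stdin (name line, then
# value line, repeated) and prints one CONFIG SET command per parameter.
redis_rollback_cmds() {
  awk '
    NR % 2 == 1 { name = $0; next }
    { printf "redis-cli CONFIG SET %s \"%s\"\n", name, $0 }'
}
```

Usage before a change: redis-cli CONFIG GET maxmemory-policy | redis_rollback_cmds > redis-rollback.sh
Review the generated file before running it during a rollback.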
Verification
- Run synthetic transactions: simulate user login, browse, and checkout flows, monitoring success rate.
- Use Prometheus+Grafana dashboards: observe latency, queue length, consumer lag trends.
- Perform light load testing: use locust or jmeter to send simulated traffic and check system response.
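The success rate of a synthetic check can be computed from the HTTP status codes it collects. The following sketch counts 2xx and 3xx responses as successes, which is an assumption that may not match every service; the function name `success_rate` is hypothetical.

```shell
# Reads one HTTP status code per line on stdin and prints the percentage of
# 2xx/3xx responses. Lines that are not three digits are ignored.
# Prints 0.0 when no status codes are read.
success_rate() {
  awk '
    /^[0-9][0-9][0-9]$/ { total++; if ($0 >= 200 && $0 < 400) ok++ }
    END {
      if (total == 0) print "0.0"
      else printf "%.1f\n", ok * 100 / total
    }'
}
```

Usage, where CHECKOUT_URL is a placeholder for a real health or checkout endpoint:
for i in $(seq 1 20); do curl -s -o /dev/null -w '%{http_code}\n' "$CHECKOUT_URL"; done | success_rate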
When to Submit an OpsGlobal Ticket
If internal diagnosis exceeds 30 minutes without identifying the root cause, or if you need expert assistance for architecture changes, configuration optimization, or failover, submit a ticket immediately. OpsGlobal's SRE team provides 24/7 support to help you quickly restore service and optimize systems.
Use cases
Useful for teams handling Redis, RabbitMQ, and Kafka reliability issues who need a clear troubleshooting and delivery workflow.
Problem background
Learn how to diagnose and resolve common reliability issues in Redis, RabbitMQ, and Kafka running on Kubernetes, with step-by-step commands and risk controls.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.