Scenario
In a production environment, a Docker container running a critical microservice suddenly becomes unresponsive. The application team reports errors, and health checks fail. The container might be stuck in a restart loop or stopped unexpectedly. This guide walks through systematic troubleshooting steps to identify and resolve runtime issues.
Symptoms
Common symptoms include:
- Container exits immediately or repeatedly restarts (Status: Exited or Restarting).
- docker logs returns empty or truncated output.
- Host resource exhaustion (CPU, memory, disk I/O).
- Container health check fails.
- Error messages like "Container killed by OOM killer" or "Insufficient disk space".
Diagnosis
Start by gathering basic information:
docker ps -a
Check the most recent container log lines:
docker logs --tail 50 <container>
Inspect container state and exit code:
docker inspect <container> --format '{{.State.Status}} {{.State.ExitCode}}'
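The exit code printed by the inspect command above often narrows the cause before you read any logs. The following is a minimal sketch that maps common exit codes to likely causes. The function name explain_exit_code is hypothetical, and the mapping follows general shell and Docker conventions (128 plus the signal number for signal deaths), so treat the output as a hint rather than a diagnosis.

```shell
# Map a container exit code to a likely cause.
# The list covers common conventions only and is not exhaustive.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "clean exit: the main process finished without error" ;;
    125) echo "docker run itself failed (daemon or flag error)" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "SIGKILL (128+9): possible OOM kill or docker kill" ;;
    143) echo "SIGTERM (128+15): graceful stop requested" ;;
    *)
      # Codes above 128 usually mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined error: check the logs"
      fi
      ;;
  esac
}

# Example (requires Docker; <container> is a placeholder):
#   explain_exit_code "$(docker inspect -f '{{.State.ExitCode}}' <container>)"
#   docker inspect -f '{{.State.OOMKilled}}' <container>   # true if the kernel OOM-killed it
```

An exit code of 137 together with OOMKilled set to true points at the memory limit rather than at the application itself.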
Take a one-time snapshot of resource usage (omit --no-stream for a continuously updating view):
docker stats --no-stream
If the container is still running, you can enter its namespaces from the host (nsenter typically requires root):
nsenter --target $(docker inspect -f '{{.State.Pid}}' <container>) --mount --uts --ipc --net --pid
Check system-level issues:
- Host disk space: df -h and inodes: df -i
- Kernel messages: dmesg | tail -20
- Docker daemon logs: sudo journalctl -u docker -n 50
- Check cgroup limits (cgroup v1 path shown; on cgroup v2 hosts the equivalent file is named memory.max): cat /sys/fs/cgroup/memory/docker/.../memory.limit_in_bytes
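The disk space check above can be turned into a quick filter when a host has many mounts. This is a sketch only: the function name mounts_over_threshold and the 90 percent threshold are assumptions for illustration, and the parsing assumes df -P (POSIX format) output with mount points that contain no spaces.

```shell
# Print "mountpoint usage%" for every filesystem at or above a threshold.
# Reads `df -P` style output on stdin, so it can be fed canned data too.
mounts_over_threshold() {
  local threshold="$1"
  awk -v limit="$threshold" '
    NR > 1 {                       # skip the header row
      use = $5
      sub(/%/, "", use)            # "95%" -> "95"
      if (use + 0 >= limit) print $6 " " $5
    }'
}

# Example:
#   df -P | mounts_over_threshold 90
```

An empty result means no filesystem has reached the threshold; inode exhaustion still needs a separate look at df -i.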
Commands (summary)
- docker ps -a: list all containers.
- docker logs --tail 100 <container>: view recent logs.
- docker inspect <container>: detailed configuration.
- docker stats --no-stream: one-time snapshot of resource usage.
- dmesg | grep -i oom: check for out-of-memory kills.
- df -h: disk usage.
- top -p $(docker inspect -f '{{.State.Pid}}' <container>): process-level view.
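When the OOM check in the summary returns many lines, it helps to pull out just the killed processes. The sketch below assumes the kernel log wording "Killed process <pid> (<name>)", which is what recent Linux kernels print; older kernels may word it differently, and the function name oom_victims is hypothetical.

```shell
# Extract "pid name" pairs for processes killed by the OOM killer.
# Reads kernel log lines on stdin; lines without a match are ignored.
oom_victims() {
  sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example (dmesg may require root on some hosts):
#   dmesg | oom_victims
```

Matching a reported PID or process name against the container's main process helps confirm whether the kill hit the container or something else on the host.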
Risk Controls
- Never run destructive commands like docker rm without verifying.
- Back up container data if possible (e.g., volumes).
- Use graceful shutdown: docker stop -t 30 instead of kill.
- Avoid modifying running container internals except via docker exec.
- If restarting, consider using --restart=on-failure:5 to avoid infinite loops.
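The controls above can be combined into one small wrapper that records the container's configuration before stopping it gracefully. This is a sketch under assumptions: the function name safe_stop, the backup file naming, and the DOCKER override variable are all hypothetical conveniences, not Docker features. Setting DOCKER=echo gives a dry run that only prints the commands.

```shell
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
# This variable is an assumption of this sketch, not part of Docker.
DOCKER="${DOCKER:-docker}"

safe_stop() {
  local container="$1"
  local backup_dir="${2:-.}"
  # Save the full configuration before touching the container.
  "$DOCKER" inspect "$container" > "$backup_dir/$container.inspect.json" || return 1
  # Allow 30 seconds for a clean exit before Docker sends SIGKILL.
  "$DOCKER" stop -t 30 "$container"
}

# Example (requires Docker; names are placeholders):
#   safe_stop <container> /var/backups
#   docker update --restart=on-failure:5 <container>   # cap restart attempts
```

Saving the inspect output first means the original ports, mounts, and environment can be reconstructed even if the container is later removed.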
Rollback
If the issue is deployment-related:
- Revert to a previous image version: docker pull <image>:<previous_tag> and docker run ...
- If using orchestration, rollback the deployment.
- For manual rollback, stop current container and start the old one.
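A manual rollback along the lines above might look like the following sketch. The function name rollback_to, the "-failed" suffix, and the DOCKER override variable are assumptions for illustration. The run options are passed through unchanged because they depend on how the container was originally started.

```shell
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
# This variable is an assumption of this sketch, not part of Docker.
DOCKER="${DOCKER:-docker}"

# Usage: rollback_to <name> <image>:<previous_tag> [docker run options...]
rollback_to() {
  local name="$1"
  local image="$2"
  shift 2
  # Pull first, so a registry failure leaves the current container untouched.
  "$DOCKER" pull "$image" || return 1
  "$DOCKER" stop -t 30 "$name" || return 1
  # Keep the failed container under a new name for later analysis.
  "$DOCKER" rename "$name" "$name-failed" || return 1
  # Remaining arguments are the original run options (ports, volumes, env).
  "$DOCKER" run -d --name "$name" "$@" "$image"
}
```

Renaming instead of removing follows the risk controls: nothing is deleted until the old version is confirmed healthy.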
Verification
After intervention:
- Confirm container state: docker ps --filter status=running
- Check logs: docker logs --tail 20 <container>
- Verify health endpoint: curl -f http://localhost:<port>/health
- Monitor resource usage for stabilization.
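A service often needs a few seconds after start before its health endpoint responds, so a single curl can report a false failure. The sketch below retries any command a fixed number of times. The function name retry and the attempt and delay values in the example are assumptions; tune them to the service's normal startup time.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (<port> is a placeholder):
#   retry 10 3 curl -fsS -o /dev/null "http://localhost:<port>/health"
```

A non-zero return after all attempts is a reasonable trigger for the rollback path.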
When to Submit an OpsGlobal Ticket
- Kernel-level issues (e.g., OOM killer, kernel panics).
- Persistent container crashes after exhausting basic fixes.
- Need for performance profiling (e.g., CPU/IO bottlenecks).
- Multiple containers affected suggesting host-level problems.
- Environments with strict SLA requirements where immediate expert assistance is needed.
OpsGlobal provides 24/7 remote SRE support to diagnose and resolve complex Docker runtime issues.
Use cases
Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.
Problem background
Learn how to diagnose and resolve common Docker container runtime issues in production. This guide covers symptoms, diagnosis commands, risk controls, rollback strategies, and when to seek expert assistance from OpsGlobal.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.