Scenario
Imagine your production Docker container suddenly exits, restarts repeatedly, or fails health checks. Common causes include application crashes, out-of-memory (OOM) errors, misconfigurations, or base image issues. As an OpsGlobal-supported DevOps/SRE engineer, you need to quickly diagnose and restore service.
Symptoms
- Container status shows `Exited (1)` or `Restarting`.
- `docker logs <container>` shows application errors or exit messages.
- `docker inspect <container>` shows `OOMKilled: true`.
- Health check fails (`docker ps` shows `unhealthy` in STATUS).
- Resource metrics (CPU/memory) are anomalous.
Diagnosis Steps
- Check container state: `docker ps -a --filter "name=my-service"`
- Read logs: `docker logs --tail 200 my-service`
- Examine exit code: `docker inspect my-service --format '{{.State.ExitCode}}'`
  A non-zero exit code typically indicates an application error.
- Detect OOM: `docker inspect my-service --format '{{.State.OOMKilled}}'`
  If `true`, increase memory limits or optimize the app.
- Monitor with docker events: `docker events --filter 'container=my-service' --since 1h`
  The command replays the last hour and then keeps streaming new events until interrupted with Ctrl+C.
- Deep host-level diagnosis
  - Check Docker daemon logs: `journalctl -u docker`
  - Check system resources: `free -m`, `top`, `df -h`
  - If using crictl (CRI-compatible runtime): `crictl ps`, `crictl logs <container-id>`
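The exit code and OOM flag from the steps above can be turned into a first guess at the cause. The sketch below is one way to do that, not a definitive mapping: `explain_exit` is a hypothetical helper name and the hint texts are illustrative wording, not Docker output. It assumes the documented `docker run` exit statuses (125, 126, 127) and the shell convention that codes above 128 mean 128 plus a signal number.

```bash
# Map a container exit code (plus the OOMKilled flag) to a likely cause.
# Hint texts are illustrative, not messages produced by Docker.
explain_exit() {
  local code="$1" oom="${2:-false}"
  if [ "$oom" = "true" ]; then
    echo "OOM-killed: the kernel killed the process at its memory limit"
    return 0
  fi
  case "$code" in
    0)   echo "clean exit: the main process finished on its own" ;;
    125) echo "docker run itself failed (bad flag or daemon error)" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "killed by SIGKILL (128+9): docker kill, stop timeout, or OOM" ;;
    143) echo "terminated by SIGTERM (128+15): often a normal docker stop" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error: check docker logs"
      fi ;;
  esac
}

# Usage against a real container (my-service is the sample name used above):
# explain_exit "$(docker inspect my-service --format '{{.State.ExitCode}}')" \
#              "$(docker inspect my-service --format '{{.State.OOMKilled}}')"
```

Treat the result as a starting point for reading the logs, since an application can also choose any of these codes itself.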
Key Commands
- `docker container restart my-service`: Restart the container (temporary recovery).
- `docker container stop my-service && docker container start my-service`: Step-by-step restart.
- `docker run --rm -it my-image bash`: Interactive debug container.
- `docker stats --no-stream`: One-time snapshot of container resource usage (omit `--no-stream` for a live view).
- `docker system df`: Check Docker disk usage.
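As a small illustration of how the `docker stats` snapshot can feed a quick check, the sketch below flags containers above a memory threshold. `flag_high_mem` is a hypothetical helper and the 80 percent default is an arbitrary placeholder; the only Docker feature it relies on is the `--format` option of `docker stats` with the `.Name` and `.MemPerc` fields.

```bash
# Print containers whose memory usage is at or above a threshold (percent).
# Reads "name percent%" lines on stdin, as produced by the docker stats
# format string in the usage line below.
flag_high_mem() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '{
    pct = $2
    sub(/%$/, "", pct)                 # strip the trailing percent sign
    if (pct + 0 >= t + 0) print $1 " is at " $2 " memory"
  }'
}

# Usage:
# docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | flag_high_mem 80
```

The percentage is relative to the container's memory limit, or to host memory when no limit is set, so a low value on an unlimited container does not rule out host-level pressure.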
Risk Controls
- Never delete containers or images without backup.
- Confirm no downstream dependencies (e.g., database connections) before stopping a container.
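- Use `--restart=on-failure` with a retry cap (for example `--restart=on-failure:5`) to avoid infinite restart loops; without a cap, `on-failure` keeps retrying indefinitely.
- Adjust resource limits gradually and monitor performance.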
- Notify the team before running diagnostic commands to avoid interference.
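The restart and resource-limit controls above can be applied to an existing container with `docker update`, which avoids recreating it. This is a minimal sketch under assumptions: `apply_guardrails` is a hypothetical wrapper, and the defaults (5 retries, 512m) are placeholders, not recommendations.

```bash
# Cap restarts and memory on an existing container without recreating it.
# The default values are placeholders; choose limits from observed usage.
apply_guardrails() {
  local name="$1" retries="${2:-5}" mem="${3:-512m}"
  # --memory-swap is set equal to --memory so the container gets no extra swap,
  # and so the update is not rejected when the old swap limit is lower.
  docker update \
    --restart "on-failure:${retries}" \
    --memory "$mem" --memory-swap "$mem" \
    "$name"
}

# Usage:
# apply_guardrails my-service 5 512m
```

Raising the limit in small steps and re-checking `docker stats` after each change follows the "adjust gradually" advice above.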
Rollback Procedure
- Stop the faulty container: `docker stop my-service`
- Pull the previous stable image: `docker pull my-registry/my-service:v1.0.0`
- Start the old version: `docker run -d --name my-service-old --restart=always my-registry/my-service:v1.0.0` (add the same port, environment and volume flags the original container used).
- Verify service restoration (see Verification below).
- Update orchestration (e.g., docker-compose or Kubernetes) to point to the old version.
- Keep the faulty container and its logs for post-mortem: `docker logs my-service > /tmp/crash.log 2>&1` (the `2>&1` also captures the container's stderr stream).
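The rollback steps above can be strung together so that a failure at any step stops the sequence. The following is a sketch, not a drop-in script: `rollback_service` and the `LOG_DIR` variable are hypothetical names, and it saves the crash log first so the evidence exists before anything is stopped. Port, environment and volume flags are left out because the source does not list them.

```bash
# Roll back to a known-good image tag while keeping the faulty container
# (stopped, not removed) for post-mortem.
rollback_service() {
  local name="$1" image="$2"
  local old="${name}-old"
  local log="${LOG_DIR:-/tmp}/${name}-crash.log"
  # Save stdout and stderr of the faulty container before touching it.
  docker logs "$name" > "$log" 2>&1 || return 1
  docker stop "$name" || return 1
  docker pull "$image" || return 1
  # Add -p/-e/-v flags here to match the original container's configuration.
  docker run -d --name "$old" --restart=always "$image"
}

# Usage:
# rollback_service my-service my-registry/my-service:v1.0.0
```

Because the old version runs under a new name (`my-service-old`), later checks and orchestration changes need to reference that name.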
Verification
- Check container running status: `docker ps -f status=running`
- Verify processes inside the container: `docker exec my-service ps aux`
- Send test requests to confirm application responses.
- Monitor logs for new errors: `docker logs --tail 50 my-service`
- Check that resource usage metrics are stable.
After a rollback, run these checks against `my-service-old`, the name given to the restored container in the rollback step.
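Where the image defines a health check, the verification above can be automated by polling the reported health status. The sketch below assumes that setup; `health_of` and `wait_healthy` are hypothetical helper names, and the attempt count and delay are placeholder defaults. The template guards against containers without a health check, for which `.State.Health` is absent.

```bash
# Print the health status, or "none" when no health check is configured.
health_of() {
  docker inspect "$1" \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}'
}

# Poll until the container reports "healthy" or the attempts run out.
# Returns 0 healthy, 1 timed out, 2 inspect failed, 3 no health check.
wait_healthy() {
  local name="$1" attempts="${2:-30}" delay="${3:-2}"
  local status="unknown" i=1
  while [ "$i" -le "$attempts" ]; do
    status="$(health_of "$name")" || return 2
    case "$status" in
      healthy) echo "healthy after $i check(s)"; return 0 ;;
      none)    echo "no health check defined" >&2; return 3 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still $status after $attempts checks" >&2
  return 1
}

# Usage:
# wait_healthy my-service 30 2 && echo "service restored"
```

A `healthy` status only reflects what the health check probes, so the test requests and log review listed above still apply.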
When to Submit an OpsGlobal Ticket
- Containers repeatedly exit without obvious cause after basic diagnosis.
- Need daemon-level or node-level debugging (e.g., kernel issues, storage driver failures).
- Similar symptoms appear on multiple nodes in a large cluster.
- Problem persists after rollback, or image rebuild and security patches are needed.
- Need to tune Docker runtime parameters (e.g., ulimit, cgroups).
- Involves Docker version compatibility or upgrade planning.
When submitting, provide: container ID, log snippets, docker inspect output, host system info, and timeline.
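The items requested above can be gathered in one pass. This is a sketch under assumptions: `collect_bundle` and the default directory name `opsglobal-bundle` are hypothetical, and the 500-line log tail is an arbitrary choice. The incident timeline still has to be written by hand.

```bash
# Gather container and host details for a support ticket into one directory.
# Prints the directory path when done.
collect_bundle() {
  local name="$1" out="${2:-./opsglobal-bundle}"
  mkdir -p "$out" || return 1
  docker inspect "$name"         > "$out/inspect.json"       2>&1
  docker logs --tail 500 "$name" > "$out/container.log"      2>&1
  docker version                 > "$out/docker-version.txt" 2>&1
  docker info                    > "$out/docker-info.txt"    2>&1
  uname -a                       > "$out/host.txt"
  date -u +"%Y-%m-%dT%H:%M:%SZ"  > "$out/collected-at.txt"
  echo "$out"
}

# Usage:
# collect_bundle my-service ./opsglobal-bundle
```

`docker inspect` output includes the container's environment variables, so review `inspect.json` and redact any passwords, tokens or keys before attaching it.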
Use cases
Useful for teams that handle DevOps incidents and need a clear troubleshooting and delivery workflow.
Problem background
Docker container runtime failures call for systematic diagnosis, a known set of commands, safety measures and a rollback procedure, along with a clear point at which to escalate to OpsGlobal's remote SRE support.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.