Scenario
You are on-call and receive an alert that a critical containerized application is crashing immediately after startup, or stuck in a restart loop. The container runs on a production Docker host, and application logs show no obvious error.
Symptoms
- Container exits with code 137 (SIGKILL, typically the kernel OOM killer), 139 (SIGSEGV, segmentation fault), or 1 (generic application error).
- In Kubernetes, container status shows CrashLoopBackOff; in Docker, it shows restarting.
- docker logs <container> returns empty or incomplete output.
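Exit codes above 128 conventionally mean the process was killed by signal (code minus 128). The sketch below decodes the codes listed above under that convention; the function name explain_exit_code is hypothetical, not part of Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a container exit code into a likely cause.
# Assumes the usual shell convention: exit code = 128 + signal number.
explain_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: killed by SIGKILL (often the OOM killer or docker kill)" ;;
      11) echo "exit $code: killed by SIGSEGV (segmentation fault)" ;;
      15) echo "exit $code: terminated by SIGTERM" ;;
      *)  echo "exit $code: killed by signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "exit 0: clean exit"
  else
    echo "exit $code: application-level error"
  fi
}

# Usage on a Docker host:
# explain_exit_code "$(docker inspect <container> --format '{{.State.ExitCode}}')"
```

A 137 only shows that SIGKILL was delivered; confirming an OOM kill still needs the kernel log or the container state.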
Diagnosis
- Check exit code:
docker inspect <container> --format '{{.State.ExitCode}}'
- Inspect resource limits (memory in bytes, and CPU shares):
docker inspect <container> --format '{{.HostConfig.Memory}} {{.HostConfig.CpuShares}}'
- Check system logs for OOM kills or driver errors:
journalctl -u docker.service -n 100 --no-pager
- See kernel OOM killer messages:
dmesg | grep -i kill
- Test runtime health:
docker run --rm -it --runtime=runc hello-world
- Verify storage driver:
docker info | grep "Storage Driver"
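The first two checks can be combined into a one-line summary. This is a minimal sketch: summarize_state is a hypothetical name, and it assumes the three values come from docker inspect fields State.ExitCode, State.OOMKilled and HostConfig.Memory (where 0 means no limit).

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize exit code, OOM flag and memory limit.
# Arguments: exit code, OOMKilled flag ("true"/"false"), memory limit in bytes.
summarize_state() {
  local exit_code="$1" oom_killed="$2" mem_bytes="$3"
  local limit
  if [ "$mem_bytes" -eq 0 ]; then
    limit="no memory limit"
  else
    limit="$((mem_bytes / 1024 / 1024)) MiB limit"
  fi
  if [ "$oom_killed" = "true" ]; then
    echo "OOM-killed (exit $exit_code, $limit)"
  else
    echo "not OOM-killed (exit $exit_code, $limit)"
  fi
}

# Usage on a Docker host (word splitting of the output is intentional):
# summarize_state $(docker inspect <container> \
#   --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.HostConfig.Memory}}')
```

Reading State.OOMKilled alongside the exit code helps separate a genuine OOM kill from a SIGKILL sent for another reason.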
Key Commands
- Restart the Docker daemon (with caution):
systemctl restart docker
- Force remove stuck container:
docker rm -f <container>
- Adjust memory limit:
docker update --memory=512m --memory-swap=512m <container>
- Debug cgroups (cgroup v1 layout):
find /sys/fs/cgroup/memory/docker -name "memory.limit_in_bytes" -exec cat {} \;
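The find command above targets the cgroup v1 layout. On hosts using cgroup v2 the limit lives in a file named memory.max instead. The sketch below picks a path based on whether cgroup.controllers exists at the cgroup root. The v2 path assumes the systemd cgroup driver, the function name cgroup_mem_limit_file is hypothetical, and the container ID must be the full ID, so treat the output as a starting point rather than a guaranteed location.

```shell
#!/usr/bin/env bash
# Sketch: print the memory-limit file for a container under a given cgroup root.
# Both paths are assumptions about common layouts, not guaranteed on every host.
cgroup_mem_limit_file() {
  local root="$1" id="$2"
  if [ -f "$root/cgroup.controllers" ]; then
    # cgroup v2 (unified hierarchy), assuming the systemd cgroup driver
    echo "$root/system.slice/docker-$id.scope/memory.max"
  else
    # cgroup v1, the same tree the find command searches
    echo "$root/memory/docker/$id/memory.limit_in_bytes"
  fi
}

# Usage on a real host (CONTAINER_ID is a hypothetical variable holding the full ID):
# cat "$(cgroup_mem_limit_file /sys/fs/cgroup "$CONTAINER_ID")"
```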
Risk Controls
- Before restarting Docker, drain the node of workloads or ensure replication (if using Kubernetes).
- Avoid killing containers that are part of a critical transaction; prefer graceful shutdown.
- Test all changes in staging first.
Rollback
- Revert memory limits (substitute the original values):
docker update --memory=2g --memory-swap=2g <container>
- Restore previous image:
docker pull <image>:old_tag && docker stop <container> && docker rm <container> && docker run --restart=always ...
- If the Docker daemon restart caused issues, reload the configuration without restarting:
systemctl reload docker
or
kill -HUP $(pidof dockerd)
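Reverting limits is only possible if the original values were recorded before the change. A minimal sketch, assuming the values were captured in bytes from HostConfig.Memory and HostConfig.MemorySwap; build_rollback_cmd is a hypothetical helper that only prints the command for review and does not run it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the rollback command from limits recorded before the change.
# Arguments: container name, original memory limit (bytes), original memory+swap limit (bytes).
build_rollback_cmd() {
  local container="$1" mem="$2" memswap="$3"
  if [ "$mem" -le 0 ]; then
    # 0 means no limit was set; this sketch does not guess how to restore that.
    echo "no original memory limit recorded for $container; verify manually" >&2
    return 1
  fi
  echo "docker update --memory=$mem --memory-swap=$memswap $container"
}

# Record the current values before changing anything (requires Docker):
# docker inspect <container> --format '{{.HostConfig.Memory}} {{.HostConfig.MemorySwap}}'
```

Printing the command instead of executing it leaves a line that can be pasted into the change record and reviewed before use.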
Verification
- After changes, monitor container start/stop events:
docker events --since 5m
- Check resource usage:
docker stats --no-stream
- Check the application health endpoint via curl.
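A container that has just restarted may need a few seconds before its health endpoint responds, so a single curl can give a false negative. The sketch below retries an arbitrary command a fixed number of times; the retry function and the URL in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to N times, sleeping between failed attempts.
# Arguments: attempts, delay in seconds, then the command and its arguments.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (URL is a placeholder for the application's real health endpoint):
# retry 5 3 curl -fsS http://localhost:8080/healthz
```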
When to Submit an OpsGlobal Ticket
- Persistent runtime errors even after resource adjustments.
- Corrupted Docker overlay filesystem requiring data recovery.
- Kernel or Docker engine bugs needing vendor escalation.
Use cases
Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.
Problem background
Step-by-step guide to diagnose and fix common Docker container runtime issues including scenario, symptoms, commands, risk controls, rollback, verification, and when to escalate to OpsGlobal.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.