Docker Container Runtime Troubleshooting

DevOps | 6 min read | 2026-06-12

Tags: Docker, Container Runtime, Troubleshooting, SRE

Scenario

You are on call and receive an alert that a critical containerized application is crashing immediately after startup or is stuck in a restart loop. The container runs on a production Docker host, and the application logs show no obvious error.

Symptoms

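Container exits with code 137 (SIGKILL, typically from the kernel OOM killer, reported as OOMKilled), 139 (SIGSEGV, segmentation fault), or 1 (generic application error).
In Kubernetes, kubectl get pods shows the pod in CrashLoopBackOff; in Docker, docker ps shows the container as Restarting.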
docker logs <container> returns empty or incomplete output.
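Exit codes above 128 follow the shell convention of 128 plus the signal number, so 137 is SIGKILL (9) and 139 is SIGSEGV (11). The OOM killer sends SIGKILL, but so do docker kill and a docker stop that exceeds its timeout, so 137 alone does not prove an OOM kill. A minimal bash sketch of a decoder follows; the function name explain_exit_code is hypothetical, and the 126/127 messages reflect the usual shell meaning rather than a Docker guarantee.

```bash
#!/usr/bin/env bash
# explain_exit_code <code>: hypothetical helper mapping a container exit code
# to a likely cause. Codes 129-192 mean "terminated by signal (code - 128)".
explain_exit_code() {
  local code=$1
  if (( code == 0 )); then
    echo "0: exited normally"
  elif (( code == 126 )); then
    echo "126: command found but not executable"
  elif (( code == 127 )); then
    echo "127: command not found"
  elif (( code > 128 && code <= 192 )); then
    local sig=$(( code - 128 ))
    # bash builtin: 'kill -l N' prints the signal name without the SIG prefix
    echo "$code: killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "$code: application error, check the application's own output"
  fi
}

for c in 1 137 139; do explain_exit_code "$c"; done
```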

Diagnosis

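Check the exit code and whether the OOM killer was involved: docker inspect <container> --format '{{.State.ExitCode}} {{.State.OOMKilled}}'
Inspect resource limits: docker inspect <container> --format '{{.HostConfig.Memory}} {{.HostConfig.MemorySwap}} {{.HostConfig.NanoCpus}} {{.HostConfig.CpuShares}}' (memory values are in bytes; 0 means no limit is set).
Check daemon logs: journalctl -u docker.service -n 100 --no-pager for storage driver, runtime or networking errors. OOM kills themselves are recorded in the kernel log, not the daemon log.
Check kernel OOM killer messages: dmesg -T | grep -iE 'out of memory|killed process' (or journalctl -k --no-pager | grep -i oom).
Test runtime health: docker run --rm --runtime=runc hello-world (without -it, so it also works from non-interactive shells and scripts).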
Verify storage driver: docker info | grep "Storage Driver"
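The first two diagnosis checks can be combined into one docker inspect call. A minimal bash sketch, assuming a local Docker CLI; triage_container and to_mib are hypothetical helper names, and the verdict lines are heuristics rather than a diagnosis.

```bash
#!/usr/bin/env bash
# to_mib <bytes>: HostConfig.Memory is reported in bytes; 0 means no limit.
to_mib() {
  if (( $1 == 0 )); then echo "unlimited"; else echo "$(( $1 / 1024 / 1024 )) MiB"; fi
}

# triage_container <container>: hypothetical helper; one inspect call, three fields.
triage_container() {
  local exit_code oom mem
  read -r exit_code oom mem < <(docker inspect "$1" \
    --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.HostConfig.Memory}}') \
    || { echo "docker inspect failed for $1" >&2; return 1; }
  echo "exit code: $exit_code"
  echo "OOMKilled: $oom"
  echo "memory limit: $(to_mib "$mem")"
  if [[ $oom == true ]]; then
    echo "verdict: kernel OOM kill; raise the limit or reduce memory use"
  elif (( exit_code == 137 )); then
    echo "verdict: SIGKILL without OOM; look for docker kill or a docker stop timeout"
  fi
}
```

Run it as triage_container <container> right after a crash, before docker rm -f removes the state it reads.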

Key Commands

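Restart the Docker daemon: systemctl restart docker. Unless live-restore is enabled in /etc/docker/daemon.json, this stops every running container on the host, so apply the Risk Controls first.
Force-remove a stuck container: docker rm -f <container> (kills it with SIGKILL, with no graceful shutdown).
Adjust the memory limit: docker update --memory=512m --memory-swap=512m <container> (setting --memory-swap equal to --memory disables swap for the container).
Debug cgroups (cgroup v1 with the cgroupfs driver only): find /sys/fs/cgroup/memory/docker -name memory.limit_in_bytes -exec grep -H . {} + (prints each path with its value).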
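The memory.limit_in_bytes file only exists on cgroup v1 hosts; cgroup v2 exposes the limit as memory.max. A sketch that picks the right file for either cgroup version and either cgroup driver, assuming rootful Docker with the default cgroup parent (no --cgroup-parent); cgroup_mem_file and show_cgroup_limit are hypothetical names. On cgroup v2, memory.max contains the word max when no limit is set.

```bash
#!/usr/bin/env bash
# cgroup_mem_file <cgroup-version> <cgroup-driver> <full-container-id>
# Hypothetical helper: prints where the kernel exposes the container's memory
# limit, ASSUMING the default cgroup parent and rootful Docker.
cgroup_mem_file() {
  local ver=$1 driver=$2 id=$3
  case "$ver/$driver" in
    1/cgroupfs) echo "/sys/fs/cgroup/memory/docker/$id/memory.limit_in_bytes" ;;
    1/systemd)  echo "/sys/fs/cgroup/memory/system.slice/docker-$id.scope/memory.limit_in_bytes" ;;
    2/cgroupfs) echo "/sys/fs/cgroup/docker/$id/memory.max" ;;
    2/systemd)  echo "/sys/fs/cgroup/system.slice/docker-$id.scope/memory.max" ;;
    *) echo "unsupported combination: $ver/$driver" >&2; return 1 ;;
  esac
}

# show_cgroup_limit <container>: read the limit the kernel actually enforces.
show_cgroup_limit() {
  local ver=1 driver id file
  # cgroup v2 mounts one unified hierarchy whose filesystem type is cgroup2fs
  [[ $(stat -fc %T /sys/fs/cgroup) == cgroup2fs ]] && ver=2
  driver=$(docker info --format '{{.CgroupDriver}}') || return 1
  id=$(docker inspect "$1" --format '{{.Id}}') || return 1
  file=$(cgroup_mem_file "$ver" "$driver" "$id") || return 1
  cat "$file"
}
```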

Risk Controls

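Before restarting Docker on a Kubernetes node, drain it (kubectl drain <node> --ignore-daemonsets) or confirm the workloads are replicated elsewhere.
Avoid killing containers in the middle of a critical transaction; prefer a graceful shutdown with docker stop, which sends SIGTERM and waits 10 seconds (adjustable with -t) before sending SIGKILL.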
Test all changes in staging first.
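The restart risk depends on the daemon's live-restore option: with it enabled, containers keep running while dockerd restarts; without it, every container on the host stops. A hedged bash sketch of a guard that checks docker info before restarting; safe_daemon_restart and its --force flag are hypothetical, not part of Docker.

```bash
#!/usr/bin/env bash
# safe_daemon_restart [--force]: hypothetical guard around 'systemctl restart docker'.
# Without live-restore, a daemon restart stops every container on the host.
safe_daemon_restart() {
  local live_restore
  live_restore=$(docker info --format '{{.LiveRestoreEnabled}}') || return 1
  if [[ $live_restore != true && ${1:-} != --force ]]; then
    echo "live-restore is disabled: restarting dockerd stops all containers." >&2
    echo "Drain the node or confirm replicas elsewhere, then re-run with --force." >&2
    return 2
  fi
  systemctl restart docker
}
```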

Rollback

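Revert memory limits to the values recorded before the change, e.g. if the original limit was 2g: docker update --memory=2g --memory-swap=2g <container>
Restore the previous image: docker pull <image>:old_tag && docker stop <container> && docker rm <container> && docker run --restart=always ... (save the output of docker inspect <container> before removing it, so the original run flags can be reproduced).
If a daemon restart caused issues, avoid restarting again: restore the previous /etc/docker/daemon.json and apply it with systemctl reload docker, which sends SIGHUP to dockerd (equivalent to kill -HUP $(pidof dockerd)). Only some options, such as debug, labels, live-restore and registry mirrors, are reloaded this way; others still require a restart.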
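Reverting limits is only reliable if the original values were saved before docker update ran. A minimal sketch, assuming the Docker CLI and a writable file path; save_limits and restore_limits are hypothetical helpers. It stores the raw byte values from docker inspect, which docker update accepts as plain numbers, and deliberately stops when the original memory limit was 0 (unlimited) instead of guessing how to remove a limit.

```bash
#!/usr/bin/env bash
# save_limits <container> <file>: hypothetical helper; record current limits (bytes).
save_limits() {
  docker inspect "$1" --format '{{.HostConfig.Memory}} {{.HostConfig.MemorySwap}}' > "$2"
}

# restore_limits <container> <file>: hypothetical helper; re-apply saved limits.
restore_limits() {
  local mem swap
  read -r mem swap < "$2" || return 1
  if (( mem == 0 )); then
    # No limit existed originally; leave that rollback to a manual decision
    echo "original memory limit was unlimited; not touching it" >&2
    return 1
  fi
  local args=(--memory="$mem")
  # MemorySwap 0 means "not set"; -1 (unlimited swap) is passed through as-is
  if (( swap != 0 )); then args+=(--memory-swap="$swap"); fi
  docker update "${args[@]}" "$1"
}
```

For example, run save_limits web /tmp/web.limits before docker update, then restore_limits web /tmp/web.limits to roll back.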

Verification

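After changes, run docker events --since 5m --filter type=container to review container start, die and oom events (the command keeps streaming until interrupted).
Run docker stats --no-stream to see resource usage.
Check the application health endpoint: curl -fsS http://<host>:<port>/<health-path> (-f makes curl exit non-zero on HTTP errors). If the image defines a HEALTHCHECK, docker inspect <container> --format '{{.State.Health.Status}}' should report healthy.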
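The verification checks can be scripted so they fail loudly instead of relying on reading output by eye. A sketch assuming the service exposes an HTTP health URL (the URL and the helper names wait_healthy and restarts_stable are hypothetical): the first polls the endpoint with curl, the second fails if the container's RestartCount grows during an observation window, which catches a restart loop that happens to pass a single health check.

```bash
#!/usr/bin/env bash
# wait_healthy <url> [attempts] [interval_seconds]: hypothetical health poller.
wait_healthy() {
  local url=$1 attempts=${2:-10} interval=${3:-3} i
  for (( i = 1; i <= attempts; i++ )); do
    # -f: fail on HTTP >= 400; -sS: quiet, but still print transport errors
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# restarts_stable <container> <seconds>: fail if RestartCount grows in the window.
restarts_stable() {
  local before after
  before=$(docker inspect "$1" --format '{{.RestartCount}}') || return 1
  sleep "$2"
  after=$(docker inspect "$1" --format '{{.RestartCount}}') || return 1
  (( after == before ))
}
```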

When to Submit an OpsGlobal Ticket

Persistent runtime errors even after resource adjustments.
Corrupted Docker overlay filesystem requiring data recovery.
Kernel or Docker engine bugs needing vendor escalation.

Use cases

Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.

Problem background

A step-by-step guide to diagnosing and fixing common Docker container runtime issues, covering the scenario, symptoms, diagnosis, key commands, risk controls, rollback, verification, and when to escalate to OpsGlobal.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
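The collection step can be sketched as a single bundle, which is also the kind of material to attach to an OpsGlobal ticket. collect_bundle, the file names and the 500/200 line limits are illustrative assumptions. Note that docker inspect output includes the container's environment variables, so review or redact the bundle before sharing it.

```bash
#!/usr/bin/env bash
# collect_bundle <container> [outdir]: hypothetical helper that gathers logs,
# configuration and metrics into one directory plus a .tar.gz of it.
collect_bundle() {
  local name=$1 dir=${2:-./triage-$1-$(date +%Y%m%d-%H%M%S)}
  mkdir -p "$dir" || return 1
  docker inspect "$name"           > "$dir/inspect.json"    2>&1  # config (contains env vars)
  docker logs --tail 500 "$name"   > "$dir/container.log"   2>&1  # stdout and stderr merged
  docker info                      > "$dir/docker-info.txt" 2>&1  # driver, cgroup, runtime
  docker stats --no-stream "$name" > "$dir/stats.txt"       2>&1  # one-off usage snapshot
  journalctl -u docker.service -n 200 --no-pager > "$dir/dockerd.log" 2>&1
  journalctl -k -n 200 --no-pager  > "$dir/kernel.log"      2>&1  # OOM killer messages
  tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")" && echo "$dir.tar.gz"
}
```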

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
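For registry credentials, docker login --password-stdin keeps a token out of the process list and shell history. A minimal sketch; REGISTRY_USER, REGISTRY_TOKEN and registry_login are hypothetical names. printf is a bash builtin, so the token is never passed as an argument to a separate process.

```bash
#!/usr/bin/env bash
# registry_login <registry>: hypothetical helper. The token comes from the
# environment and reaches docker on stdin, never on the command line.
registry_login() {
  if [[ -z ${REGISTRY_USER:-} || -z ${REGISTRY_TOKEN:-} ]]; then
    echo "set REGISTRY_USER and REGISTRY_TOKEN first" >&2
    return 1
  fi
  printf '%s' "$REGISTRY_TOKEN" | docker login --username "$REGISTRY_USER" --password-stdin "$1"
}
```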

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
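Keeping the original configuration and release version can be as simple as snapshotting them before each change. A sketch assuming the daemon config lives at the default /etc/docker/daemon.json; snapshot_release is a hypothetical helper. The recorded image ID still identifies the exact image after its tag has moved, as long as the image has not been pruned.

```bash
#!/usr/bin/env bash
# snapshot_release <container> <dir>: hypothetical helper that keeps what a
# rollback needs: full container config, the image it runs, and daemon.json.
snapshot_release() {
  local name=$1 dir=$2
  mkdir -p "$dir" || return 1
  # Run flags, env, mounts, restart policy and limits, as JSON
  docker inspect "$name" > "$dir/$name.inspect.json" || return 1
  # Image reference the container was started from, plus its immutable image ID
  docker inspect "$name" --format '{{.Config.Image}} {{.Image}}' > "$dir/$name.image" || return 1
  if [[ -f /etc/docker/daemon.json ]]; then
    cp /etc/docker/daemon.json "$dir/daemon.json"
  fi
}
```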

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.

Related service CTA

If you are facing an issue similar to the one described in Docker Container Runtime Troubleshooting: A Practical Guide for SREs, submit a ticket for remote OpsGlobal support.

Need help with a similar technical issue?

If your servers, Kubernetes, Docker, CI/CD, databases or monitoring systems have similar issues, submit logs and config files for remote diagnosis.
