Docker Container Runtime Troubleshooting: A Practical Guide for SREs

Step-by-step guide to diagnose and fix common Docker container runtime issues including scenario, symptoms, commands, risk controls, rollback, verification, and when to escalate to OpsGlobal.

DevOps 6min 4 views 2026-06-12
Tags: Docker, Container Runtime, Troubleshooting, SRE

Scenario

You are on call and receive an alert that a critical containerized application is crashing immediately after startup or is stuck in a restart loop. The container runs on a production Docker host, and the application logs show no obvious error.

Symptoms

  • Container exits with code 137 (killed by SIGKILL, most often the kernel OOM killer), 139 (SIGSEGV, a segmentation fault), or 1 (generic application error).
  • In Kubernetes, container status shows CrashLoopBackOff; in Docker, it shows restarting.
  • docker logs <container> returns empty or incomplete output.
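The exit codes listed above follow the usual shell convention that a code above 128 means "terminated by signal (code minus 128)". A minimal sketch of a decoder, assuming a POSIX shell; explain_exit_code is a hypothetical helper name, not a Docker command:

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    125) echo "docker run itself failed (daemon or CLI error)" ;;
    126) echo "container command could not be invoked" ;;
    127) echo "container command not found" ;;
    137) echo "SIGKILL (signal 9): OOM killer or an explicit kill" ;;
    139) echo "SIGSEGV (signal 11): segmentation fault" ;;
    143) echo "SIGTERM (signal 15): asked to stop" ;;
    *)
      # Non-numeric input makes the test fail and falls through to the else branch.
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $(($1 - 128))"
      else
        echo "application error (exit code $1)"
      fi
      ;;
  esac
}
```

It can be fed directly from the inspect output, for example: explain_exit_code "$(docker inspect <container> --format '{{.State.ExitCode}}')". A code of 137 only says the process received SIGKILL; confirming an OOM kill still requires the kernel log or the container's OOMKilled flag.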

Diagnosis

  1. Check the exit code and OOM flag: docker inspect <container> --format '{{.State.ExitCode}} {{.State.OOMKilled}}'
  2. Inspect resource limits: docker inspect <container> --format '{{.HostConfig.Memory}} {{.HostConfig.CpuShares}}' (Memory 0 means no limit; CpuShares 0 means the default weight).
  3. Check daemon logs: journalctl -u docker.service -n 100 --no-pager for daemon, runtime, or storage driver errors.
  4. Check kernel logs for OOM killer messages: dmesg | grep -i kill (or journalctl -k on systemd hosts).
  5. Test runtime health: docker run --rm --runtime=runc hello-world
  6. Verify storage driver: docker info | grep "Storage Driver"
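The first few diagnosis steps can be combined into two small helpers. This is a sketch under stated assumptions: container_summary and oom_lines are hypothetical names, and the grep patterns are a guess at common kernel OOM wording that may need adjusting for your kernel version.

```shell
# Hypothetical helper: print the key state fields for one container in one call.
# mem_limit=0 means no memory limit is set on the container.
container_summary() {
  docker inspect --format \
    'exit={{.State.ExitCode}} oom_killed={{.State.OOMKilled}} restarts={{.RestartCount}} mem_limit={{.HostConfig.Memory}}' \
    "$1"
}

# Hypothetical filter: keep only log lines that look like OOM killer activity.
# Reads stdin, so it works with dmesg, journalctl -k, or a saved log file.
# "|| true" keeps the exit status at 0 when nothing matches.
oom_lines() {
  grep -iE 'out of memory|oom-kill|oom_reaper|killed process' || true
}
```

Typical use would be container_summary <container> followed by dmesg | oom_lines. An empty result from oom_lines together with oom_killed=false suggests the SIGKILL came from somewhere other than the OOM killer.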

Key Commands

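  • Restart the Docker daemon: systemctl restart docker. This stops every running container on the host unless live-restore is enabled in /etc/docker/daemon.json, so treat it as a last resort.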
  • Force remove stuck container: docker rm -f <container>
  • Adjust memory limit: docker update --memory=512m --memory-swap=512m <container>
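  • Debug cgroups (cgroup v1 hosts): find /sys/fs/cgroup/memory/docker -name "memory.limit_in_bytes" -exec cat {} \; On cgroup v2 hosts the limit is stored in a memory.max file instead, and the directory layout depends on the cgroup driver.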
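docker update accepts sizes such as 512m, while docker inspect and the cgroup files report plain bytes, so comparing the two requires a conversion. A minimal sketch, assuming integer values and the binary multiples Docker uses for memory sizes; mem_to_bytes is a hypothetical helper name:

```shell
# Hypothetical helper: convert a Docker-style size (512m, 2g, 1024k, or plain
# bytes) to bytes. Integer values only; fractional sizes are not handled.
mem_to_bytes() {
  value="$1"
  number="${value%[bBkKmMgG]}"   # strip one trailing unit letter, if present
  unit="${value#"$number"}"      # whatever was stripped is the unit
  case "$unit" in
    ''|b|B) echo "$number" ;;
    k|K)    echo $((number * 1024)) ;;
    m|M)    echo $((number * 1024 * 1024)) ;;
    g|G)    echo $((number * 1024 * 1024 * 1024)) ;;
  esac
}
```

After running docker update --memory=512m, the value printed by docker inspect <container> --format '{{.HostConfig.Memory}}' should equal the output of mem_to_bytes 512m.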

Risk Controls

  • Before restarting Docker, drain the node (if using Kubernetes) or confirm that replicas on other hosts can carry the load.
  • Avoid killing containers that are part of a critical transaction; prefer graceful shutdown.
  • Test all changes in staging first.

Rollback

  • Revert memory limits to the values recorded before the change, for example: docker update --memory=2g --memory-swap=2g <container>
  • Restore previous image: docker pull <image>:old_tag && docker stop <container> && docker rm <container> && docker run --restart=always ...
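  • If a daemon restart with new configuration caused issues, revert the configuration and reload it without another restart: systemctl reload docker or kill -HUP $(pidof dockerd). Only a subset of daemon options is reloadable; the rest still require a restart.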
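Reverting memory limits is only possible if the original values were captured before the change. One possible sketch, assuming limits are stored in a small state file; save_limits, restore_cmd, and the file path are hypothetical, and the case of a container that had no limit at all is deliberately left to manual handling:

```shell
# Hypothetical helper: record the current memory and swap limits (in bytes)
# for container $1 into file $2 before changing anything.
save_limits() {
  docker inspect --format '{{.HostConfig.Memory}} {{.HostConfig.MemorySwap}}' "$1" > "$2"
}

# Print (do not run) the docker update command that restores the saved limits.
# Assumption: a saved memory value of 0 means no limit was set; this sketch
# does not try to restore that state and returns an error instead.
restore_cmd() {
  read -r mem swap < "$2"
  if [ "$mem" = "0" ]; then
    echo "no memory limit was recorded; restore manually" >&2
    return 1
  fi
  echo "docker update --memory=$mem --memory-swap=$swap $1"
}
```

For example, run save_limits <container> /tmp/limits.txt before the adjustment, then restore_cmd <container> /tmp/limits.txt to get the exact rollback command. Printing the command rather than executing it leaves the final decision with the operator.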

Verification

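  • After changes, watch for container start, die, and restart events: docker events --since 5m --filter container=<container> (the command streams until interrupted with Ctrl+C).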
  • Run docker stats --no-stream to see resource usage.
  • Check the application's health endpoint with curl -f, which returns a non-zero exit code on HTTP error responses.
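A container that has just restarted may need a few seconds before its health endpoint responds, so a single curl can give a false failure. A minimal sketch of a fixed-delay retry wrapper; retry is a hypothetical helper, and the URL in the comment is a placeholder rather than a real endpoint:

```shell
# Hypothetical helper: run a command until it succeeds, up to N attempts,
# sleeping a fixed number of seconds between attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
# Example (placeholder URL): retry 5 3 curl -fsS -o /dev/null http://localhost:8080/health
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

The function returns 0 as soon as one attempt succeeds and 1 if every attempt fails, which makes it usable as a gate in a verification script.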

When to Submit an OpsGlobal Ticket

  • Persistent runtime errors even after resource adjustments.
  • Corrupted Docker overlay filesystem requiring data recovery.
  • Kernel or Docker engine bugs needing vendor escalation.

Use cases

Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
