Docker Container Runtime Troubleshooting: A Practical Guide for DevOps/SRE Teams

Master the art of troubleshooting Docker container runtime issues with systematic diagnosis, commands, safety measures, and rollback procedures. Learn when to escalate to OpsGlobal's remote SRE support.

DevOps | 6 min read | 2026-06-19
Tags: Docker, Container Runtime, Troubleshooting, DevOps, SRE

Scenario

Imagine your production Docker container suddenly exits, restarts repeatedly, or fails health checks. Common causes include application crashes, out-of-memory (OOM) errors, misconfigurations, or base image issues. As an OpsGlobal-supported DevOps/SRE engineer, you need to quickly diagnose and restore service.

Symptoms

  • Container status shows Exited (1) or Restarting.
  • docker logs <container> shows application errors or exit messages.
  • docker inspect <container> shows OOMKilled: true.
  • Health check fails (docker ps shows unhealthy in STATUS).
  • Resource metrics (CPU/memory) are anomalous, for example memory climbing steadily toward the container limit or CPU pinned at its quota.

Diagnosis Steps

  1. Check container state: docker ps -a --filter "name=my-service"
  2. Read the logs: docker logs --tail 200 my-service
  3. Examine the exit code: docker inspect my-service --format '{{.State.ExitCode}}'. A non-zero code usually indicates an application error; 137 (128 + SIGKILL) often points to an OOM kill or a forced kill, and 143 (128 + SIGTERM) to a requested stop.
  4. Detect OOM: docker inspect my-service --format '{{.State.OOMKilled}}'. If true, raise the memory limit or reduce the application's memory use.
  5. Review recent events: docker events --filter 'container=my-service' --since 1h. The command keeps streaming new events; press Ctrl+C to stop, or add --until to bound it.
  6. Deep host-level diagnosis:
     - Check Docker daemon logs: journalctl -u docker
     - Check system resources: free -m, top, df -h
     - On hosts that run a CRI-compatible runtime (for example containerd or CRI-O under Kubernetes) rather than the Docker CLI: crictl ps, crictl logs <container-id>
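
The exit-code and OOM checks in steps 3 and 4 can be combined into one small triage helper. The sketch below is illustrative only: the function names explain_exit_code and triage are hypothetical, my-service is this guide's placeholder name, and the interpretations follow the usual "128 + signal number" convention rather than anything specific to your application.

```bash
#!/bin/sh
# explain_exit_code: map a container exit code to a likely cause.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
    case "$1" in
        0)   echo "clean exit" ;;
        126) echo "command found but not executable" ;;
        127) echo "command not found" ;;
        137) echo "SIGKILL (OOM killer or docker kill)" ;;
        143) echo "SIGTERM (graceful stop requested)" ;;
        *)   echo "application error" ;;
    esac
}

# triage: print status, exit code, and OOM flag for one container.
triage() {
    info=$(docker inspect \
        --format '{{.State.Status}} {{.State.ExitCode}} {{.State.OOMKilled}}' \
        "$1") || return 1
    # Split the three space-separated fields into $1 $2 $3.
    set -- $info
    echo "status:    $1"
    echo "exit code: $2 ($(explain_exit_code "$2"))"
    echo "OOMKilled: $3"
}
```

Calling triage my-service prints the three fields on separate lines. An exit code of 137 together with OOMKilled: true is a strong hint that the memory limit was hit; 137 with OOMKilled: false more likely means the container was killed by something else.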

Key Commands

  • docker container restart my-service: Restart container (temporary recovery).
  • docker container stop my-service && docker container start my-service: Step-by-step restart.
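  • docker run --rm -it my-image bash: Interactive debug container. Use sh if the image does not ship bash; if the image defines an ENTRYPOINT, override it with --entrypoint.
  • docker stats --no-stream: One-time snapshot of container resource usage (omit --no-stream for a continuously updating view).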
  • docker system df: Check Docker disk usage.
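
As one way to turn the docker stats snapshot into a quick check, the sketch below filters containers whose memory usage exceeds a percentage of their limit. The function names flag_high_mem and check_mem and the default threshold of 80 are assumptions for illustration; {{.Name}} and {{.MemPerc}} are standard docker stats format placeholders.

```bash
#!/bin/sh
# flag_high_mem: read "name percent%" lines on stdin and print those
# whose memory percentage is above the given threshold.
flag_high_mem() {
    awk -v limit="$1" '{
        pct = $2
        sub(/%$/, "", pct)            # strip the trailing percent sign
        if (pct + 0 > limit + 0) print $1, $2
    }'
}

# check_mem: take one stats snapshot and flag containers above the
# threshold (default 80, an arbitrary example value).
check_mem() {
    docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' |
        flag_high_mem "${1:-80}"
}
```

Running check_mem 90 lists only containers above 90% of their memory limit. Empty output means nothing crossed the threshold at the moment of the snapshot.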

Risk Controls

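  • Never delete containers or images without a backup: save the logs and docker inspect output first, and keep a copy of any image you may need to restore.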
  • Confirm no downstream dependencies (e.g., database connections) before stopping a container.
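  • Use --restart=on-failure:5 (on-failure with a maximum retry count) to avoid infinite restart loops; without a retry limit, on-failure keeps restarting a crashing container indefinitely.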
  • Adjust resource limits gradually and monitor performance.
  • Notify the team before running diagnostic commands to avoid interference.
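
To illustrate "adjust resource limits gradually", the sketch below raises a container's memory limit in fixed percentage steps instead of jumping to a large value. The helper names next_limit_mb and raise_memory and the 25% default step are assumptions; docker update with --memory and --memory-swap is the real CLI call. Setting both flags to the same value means the container gets no swap, so adjust that if you rely on swap.

```bash
#!/bin/sh
# next_limit_mb: increase a limit (in MiB) by a percentage,
# rounding up to a whole MiB.
next_limit_mb() {
    current=$1
    step_pct=${2:-25}
    echo $(( (current * (100 + step_pct) + 99) / 100 ))
}

# raise_memory: apply the next step to a running container.
# Arguments: container name, current limit in MiB, optional step percent.
raise_memory() {
    name=$1
    new_mb=$(next_limit_mb "$2" "${3:-25}")
    # --memory-swap is set together with --memory so the swap limit
    # never ends up below the new memory limit.
    docker update --memory "${new_mb}m" --memory-swap "${new_mb}m" "$name"
}
```

For example, raise_memory my-service 512 moves the limit from 512 MiB to 640 MiB. Watch docker stats for a while before applying the next step.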

Rollback Procedure

  1. Stop the faulty container: docker stop my-service
  2. Pull the previous stable image: docker pull my-registry/my-service:v1.0.0
  3. Start the old version with the same ports, environment variables, and volumes as the original container: docker run -d --name my-service-old --restart=on-failure:5 my-registry/my-service:v1.0.0
  4. Verify service restoration (see below).
  5. Update orchestration (e.g., docker-compose or Kubernetes) to point to old version.
  6. Keep the faulty container and its logs for the post-mortem: docker logs my-service > /tmp/crash.log 2>&1 (the 2>&1 also captures what the application wrote to stderr).
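
The rollback steps above can be wrapped in a script with a dry-run mode so the commands can be reviewed before anything is stopped. This is a sketch under assumptions: the run and rollback helpers and the DRY_RUN variable are hypothetical, the image tag is the guide's placeholder, and the script saves the logs first rather than last so the evidence is captured before any change is made. Add your own -p, -e, and -v flags to the docker run line.

```bash
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rollback: save logs, stop the faulty container, start the old image.
# Arguments: container name, previous stable image, optional log path.
rollback() {
    name=$1
    image=$2
    log=${3:-/tmp/crash.log}
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ docker logs $name > $log 2>&1"
    else
        docker logs "$name" > "$log" 2>&1
    fi
    run docker stop "$name" &&
    run docker pull "$image" &&
    run docker run -d --name "${name}-old" --restart=on-failure:5 "$image"
}
```

A dry run looks like DRY_RUN=1 rollback my-service my-registry/my-service:v1.0.0; drop DRY_RUN=1 to execute. The faulty container is stopped but not removed, so it stays available for the post-mortem.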

Verification

  • Check that the rolled-back container is running: docker ps -f status=running -f name=my-service-old
  • Verify processes inside the container: docker exec my-service-old ps aux (or docker top my-service-old if the image has no ps binary).
  • Send test requests to confirm application responses.
  • Monitor logs for new errors: docker logs --tail 50 my-service-old
  • Check resource usage metrics are stable.
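
If the image defines a HEALTHCHECK, verification can wait for Docker to report the container as healthy instead of relying on a single docker ps look. The sketch below is one possible approach; get_health and wait_healthy are hypothetical helper names, and the attempt count and delay defaults are arbitrary. The format template guards against containers without a health check, for which .State.Health is absent.

```bash
#!/bin/sh
# get_health: print the health status, or "none" when the container
# has no HEALTHCHECK configured.
get_health() {
    docker inspect \
        --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
        "$1"
}

# wait_healthy: poll until the container is healthy or attempts run out.
# Arguments: container name, attempts (default 30), delay in seconds (default 2).
wait_healthy() {
    name=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    status=unknown
    while [ "$i" -lt "$attempts" ]; do
        status=$(get_health "$name") || return 2
        case "$status" in
            healthy) echo "$name is healthy"; return 0 ;;
            none)    echo "$name has no HEALTHCHECK; verify manually"; return 3 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $name (last status: $status)"
    return 1
}
```

For example, wait_healthy my-service-old 30 2 waits up to about a minute. A return code of 0 means healthy, 1 a timeout, 2 an inspect failure, and 3 that there is no health check to wait on.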

When to Submit an OpsGlobal Ticket

  • Containers repeatedly exit without obvious cause after basic diagnosis.
  • Need daemon-level or node-level debugging (e.g., kernel issues, storage driver failures).
  • Similar symptoms appear on multiple nodes in a large cluster.
  • Problem persists after rollback, or image rebuild and security patches are needed.
  • Need to tune Docker runtime parameters (e.g., ulimit, cgroups).
  • Involves Docker version compatibility or upgrade planning.

When submitting, provide: container ID, log snippets, docker inspect output, host system info, and timeline.
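
The items listed above can be gathered into a single archive before opening a ticket. The following is a minimal sketch, not an official OpsGlobal tool: the collect_bundle name, the default directory under /tmp, and the file names are all assumptions. Note that docker inspect output includes environment variables, so review the bundle for secrets before sending it.

```bash
#!/bin/sh
# collect_bundle: gather inspect output, logs, recent events, and host
# info for one container, then pack them into a tarball.
# Arguments: container name, optional output directory.
collect_bundle() {
    name=$1
    dir=${2:-/tmp/diag-$name}
    mkdir -p "$dir" || return 1

    docker inspect "$name"          > "$dir/inspect.json"       2>&1
    docker logs --tail 500 "$name"  > "$dir/logs.txt"           2>&1
    # Bound the event stream with --until so the command returns.
    docker events --filter "container=$name" --since 1h \
        --until "$(date +%s)"       > "$dir/events.txt"         2>&1
    docker version                  > "$dir/docker-version.txt" 2>&1
    { uname -a; df -h; }            > "$dir/host.txt"           2>&1

    tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")" || return 1
    echo "$dir.tar.gz"
}
```

Running collect_bundle my-service prints the path of the archive to attach to the ticket. The timeline of what changed and when still has to be written by hand.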

Use cases

Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.

Need help with a similar technical issue?

If your servers, Kubernetes, Docker, CI/CD, databases or monitoring systems have similar issues, submit logs and config files for remote diagnosis.