Scenario
A team deploys multiple times daily but faces frequent production outages. After a deployment, monitoring shows a spike in error rates and health checks fail. The team scrambles to roll back, but the lack of automation leads to extended downtime.
Symptoms
- Error rate increases >20% within 5 minutes after deployment
- Kubernetes Pod health checks fail
- Users report feature unavailability
- Rollback takes >10 minutes manually
Diagnosis
Root cause: missing CI/CD guardrails. No automated pre-deployment tests, insufficient canary analysis, and immature rollback procedures. Specific issues:
1. Pipeline lacks integration tests or security scans
2. Direct deployment to production without canary
3. No automated rollback triggers
4. Health check thresholds misconfigured
Commands
Implement guardrails in GitHub Actions (steps of the deploy job):
  # Pre-deployment tests
  - name: Run integration tests
    run: make test-integration
  # Canary deployment
  - name: Deploy canary
    run: |
      kubectl set image deployment/app-canary app=myapp:${GITHUB_SHA}
      # --record is deprecated; set the change cause explicitly instead
      kubectl annotate deployment/app-canary kubernetes.io/change-cause="image myapp:${GITHUB_SHA}" --overwrite
      kubectl rollout status deployment/app-canary --timeout=5m
  # Health check (runs curl from a temporary pod so in-cluster DNS resolves)
  - name: Verify canary health
    run: |
      kubectl run curl --image=radial/busyboxplus:curl -i --rm --restart=Never -- curl -f http://canary-service/health
  # Auto-rollback on failure (runs only if an earlier step failed)
  - name: Rollback if failed
    if: failure()
    run: |
      kubectl rollout undo deployment/app-canary
      echo "Canary deployment failed, rolled back." && exit 1
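A single curl probe can fail on a transient blip while the canary is still warming up. One possible way to soften that is a small retry wrapper around the check. This is a minimal sketch, not part of the original pipeline: the function name `retry` and the attempt and delay values in the usage comment are illustrative assumptions.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${n}/${attempts} failed: $*" >&2
    # Sleep only between attempts, not after the last one.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}

# Hypothetical usage (5 attempts, 10 seconds apart; values are assumptions):
#   retry 5 10 curl -fsS http://canary-service/health
```

Because the wrapper returns non-zero only after every attempt has failed, a step that uses it still triggers the `if: failure()` rollback step when the canary is genuinely unhealthy.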
Risk Controls
- Canary releases: Route 10% traffic to new version, observe for 5 minutes
- Feature flags: Use LaunchDarkly or ConfigMap to disable problematic features dynamically
- Progressive delivery: Use Flagger or Argo Rollouts to gradually increase traffic automatically
- Resource limits: Set CPU/memory quotas to prevent resource exhaustion
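Without a service mesh or a progressive delivery controller, one rough way to approximate the 10% canary share is to size the two Deployments proportionally. The sketch below assumes that `app-canary` and `app-stable` sit behind one Service that balances traffic evenly across pods, which the setup described here may or may not do. The helper name `canary_split` is hypothetical.

```shell
# canary_split TOTAL_REPLICAS CANARY_PERCENT
# Prints "<canary> <stable>" replica counts. The canary share is rounded up,
# so any non-zero percentage gets at least one pod.
canary_split() {
  local total="$1" percent="$2"
  local canary=$(( (total * percent + 99) / 100 ))
  local stable=$(( total - canary ))
  echo "${canary} ${stable}"
}

# Hypothetical usage (assumes traffic share follows replica share):
#   set -- $(canary_split 10 10)
#   kubectl scale deployment/app-canary --replicas="$1"
#   kubectl scale deployment/app-stable --replicas="$2"
```

Replica-based splitting is coarse: with few pods the real share can be well above the target, which is one reason to move to Flagger or Argo Rollouts for weighted routing.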
Rollback
Automatic Rollback
When the canary health check fails, roll back automatically and restore the stable replica count:
kubectl rollout undo deployment/app-canary
kubectl scale deployment/app-stable --replicas=10
Manual Rollback
If automatic rollback fails, an SRE can run:
kubectl rollout undo deployment/app --to-revision=<previous-revision>
kubectl rollout status deployment/app
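The `<previous-revision>` value can be read from `kubectl rollout history deployment/app`, which lists revisions in ascending order with the current one last. The sketch below picks the second-to-last revision from that listing. The helper name `previous_revision` is hypothetical, and it assumes the usual two-column REVISION / CHANGE-CAUSE output.

```shell
# previous_revision: reads `kubectl rollout history` output on stdin and
# prints the revision listed just before the latest one.
# Exits non-zero when fewer than two revisions are present.
previous_revision() {
  awk '
    $1 ~ /^[0-9]+$/ { prev = last; last = $1 }  # track the last two revision rows
    END {
      if (prev == "") exit 1
      print prev
    }
  '
}

# Hypothetical usage:
#   rev="$(kubectl rollout history deployment/app | previous_revision)"
#   kubectl rollout undo deployment/app --to-revision="$rev"
#   kubectl rollout status deployment/app
```

When only the immediately preceding revision is wanted, `kubectl rollout undo deployment/app` without `--to-revision` has the same effect. An explicit revision is mainly useful for skipping over more than one bad release.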
Verification
- Monitor error rate (e.g., Prometheus + Alertmanager)
- Check pod status:
kubectl get pods -l app=app-stable
- Run synthetic transactions:
curl -f https://api.example.com/v1/health
- Verify SLO: error rate <1% within 5 minutes post-deployment
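The SLO check above can be turned into a pass/fail gate once error and request counts for the 5-minute window are available. The sketch below assumes those two counts have already been pulled from the monitoring system; how they are queried is not covered here. The helper name `error_rate_ok` is hypothetical.

```shell
# error_rate_ok ERRORS TOTAL MAX_PERCENT
# Exit 0 when ERRORS/TOTAL is strictly below MAX_PERCENT percent.
# Exit 2 when TOTAL is zero: without traffic the SLO cannot be verified.
# Exit 1 when the error rate meets or exceeds the limit.
error_rate_ok() {
  awk -v e="$1" -v t="$2" -v max="$3" 'BEGIN {
    if (t <= 0) exit 2
    # Compare e/t*100 < max without dividing, to avoid rounding surprises.
    exit !(e * 100 < max * t)
  }'
}

# Hypothetical usage with counts for the post-deployment window:
#   error_rate_ok "$errors" "$requests" 1 || echo "SLO violated, roll back"
```

Treating zero traffic as a failure is a deliberate choice in this sketch: a canary that received no requests has not demonstrated anything about its error rate.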
When to Submit an OpsGlobal Ticket
- Complex rollbacks across multiple clusters or regions
- Guardrails themselves fail (e.g., false positives or missed detections)
- Urgent hotfix needed but pipeline blocked
- Custom metrics or policies require expert tuning
Use cases
Useful for teams handling CI/CD issues and needing a clear troubleshooting and delivery workflow.
Problem background
Learn how to enforce safety gates in your CI/CD pipeline to prevent bad deployments, with practical commands and rollback strategies.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.