Scenario
A fast-growing SaaS company deploys dozens of times daily. The team notices that despite the CI pipeline always passing green, production incidents spike after each deployment. Alerts pour in within minutes: users experience timeouts and errors. Rollbacks are slow and manual, causing extended downtime.
Symptoms
- P1 alerts right after deployment.
- Error budget burns rapidly.
- Rollbacks take 30+ minutes.
- Dev and SRE teams are locked in a constant blame game.
Diagnosis
The root cause is missing CI/CD guardrails: no automated test gating, canary analysis, deployment window control, or health-check-based auto-rollback. The CI pipeline runs only unit tests, with no integration tests, load tests, or security scans, which is why it stays green while production breaks. The CD pipeline deploys directly to full production without a gradual rollout.
Commands & Configuration
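Below is an example guardrail setup using GitHub Actions and kubectl (ArgoCD appears in the Rollback section). Each job runs only if the previous one succeeds, so a failed test, scan, or canary rollout stops the release before full production: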
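# .github/workflows/deploy.yaml
name: Deploy with Guardrails
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: make test
      - name: Run integration tests
        run: make integration-test
      # Build the image first so the scan has something to inspect
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      # Assumes trivy is installed on the runner; fails the job on HIGH or CRITICAL findings
      - name: Security scan
        run: trivy image --severity HIGH,CRITICAL --exit-code 1 myapp:${{ github.sha }}
  # The deploy jobs assume the image has been pushed to a registry the cluster can pull from
  # and that kubectl is already authenticated against the cluster.
  deploy-canary:
    needs: test
    runs-on: ubuntu-latest
    steps:
      # The canary's share of traffic (about 10%) depends on its replica count relative to
      # the main deployment, assuming both sit behind the same Service
      - name: Deploy canary to 10%
        run: kubectl set image deployment/myapp-canary myapp=myapp:${{ github.sha }} -n production
      - name: Wait for health check
        run: sleep 60 && kubectl rollout status deployment/myapp-canary -n production --timeout=5m
  promote:
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
      - name: Promote to full
        run: kubectl set image deployment/myapp myapp=myapp:${{ github.sha }} -n production
      - name: Verify deployment
        run: kubectl rollout status deployment/myapp -n production --timeout=10m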
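The "Wait for health check" step confirms only that the canary pods became ready; it does not look at error rates. The following is a minimal sketch of a canary analysis gate that could run between the canary and promote jobs. It assumes request and error counts for the canary and the baseline are fetched from whatever metrics backend the team uses (that query is not shown). The function names, the 50 basis-point tolerance, and the 100-request minimum are illustrative assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical canary gate. Names and thresholds are illustrative assumptions.

# error_rate_bp ERRORS TOTAL
# Prints the error rate in basis points (1 bp = 0.01%) using integer math.
error_rate_bp() {
  local errors=$1 total=$2
  if [ "$total" -eq 0 ]; then
    echo 0   # no traffic observed; the caller decides how to treat this
    return
  fi
  echo $(( errors * 10000 / total ))
}

# canary_verdict CANARY_ERRORS CANARY_TOTAL BASE_ERRORS BASE_TOTAL [TOLERANCE_BP] [MIN_REQUESTS]
# Prints "promote" or "abort".
canary_verdict() {
  local c_err=$1 c_total=$2 b_err=$3 b_total=$4
  local tolerance_bp=${5:-50} min_requests=${6:-100}

  # Too little canary traffic to judge: fail closed.
  if [ "$c_total" -lt "$min_requests" ]; then
    echo abort
    return
  fi

  local c_bp b_bp
  c_bp=$(error_rate_bp "$c_err" "$c_total")
  b_bp=$(error_rate_bp "$b_err" "$b_total")

  # Abort when the canary is worse than the baseline by more than the tolerance.
  if [ $(( c_bp - b_bp )) -gt "$tolerance_bp" ]; then
    echo abort
  else
    echo promote
  fi
}
```

A CI step could then fail the job when `canary_verdict` prints `abort`, which keeps the promote job from running. Comparing against the live baseline rather than a fixed threshold avoids blaming the canary for errors that already exist in production.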
Risk Controls
- Feature Flags: Control exposure with LaunchDarkly or ConfigMap flags.
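- Gradual Rollout: ArgoCD on its own does not roll back when health degrades; pair it with Argo Rollouts (canary steps plus analysis) so a failed health check aborts the rollout and returns traffic to the stable version.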
- Deployment Windows: Check time against approved window in CI; reject if outside.
- Error Budget Gating: Block deployment if error budget consumption exceeds 70%.
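The deployment window and error budget controls above can be sketched as a single pre-deploy gate. This is a minimal sketch under stated assumptions: the approved window (Monday to Thursday, 09:00 to 16:00 UTC) is invented for illustration, the consumed budget is passed in as a whole-number percentage obtained from the team's SLO tooling, and all function names are hypothetical. Only the 70% limit comes from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy gate. The window and function names are assumptions.

# in_deploy_window DAY_OF_WEEK HOUR
# DAY_OF_WEEK: 1 (Mon) to 7 (Sun), as printed by `date -u +%u`
# HOUR: 00 to 23, as printed by `date -u +%H`
in_deploy_window() {
  local dow=$1
  local hour=$(( 10#$2 ))   # force base 10 so "08" and "09" are not read as octal
  [ "$dow" -ge 1 ] && [ "$dow" -le 4 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 16 ]
}

# budget_allows_deploy CONSUMED_PCT [LIMIT_PCT]
# Succeeds unless consumption exceeds the limit (70% by default).
budget_allows_deploy() {
  local consumed=$1 limit=${2:-70}
  [ "$consumed" -le "$limit" ]
}

# deploy_gate DAY_OF_WEEK HOUR CONSUMED_PCT
# Prints the decision; returns non-zero when the deployment is blocked.
deploy_gate() {
  if ! in_deploy_window "$1" "$2"; then
    echo "blocked: outside deployment window"
    return 1
  fi
  if ! budget_allows_deploy "$3"; then
    echo "blocked: error budget consumption above limit"
    return 1
  fi
  echo "allowed"
}
```

In a workflow step this might be called as `deploy_gate "$(date -u +%u)" "$(date -u +%H)" "$BUDGET_CONSUMED_PCT"`, where `BUDGET_CONSUMED_PCT` is a hypothetical variable populated earlier in the job. A non-zero return fails the step and stops the pipeline.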
Rollback
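# Rollback with kubectl (returns the Deployment to its previous revision)
kubectl rollout undo deployment/myapp -n production
kubectl rollout status deployment/myapp -n production --timeout=5m
# Rollback with Git revert (the revert commit on main re-triggers the pipeline with the prior state)
git revert --no-edit HEAD
git push origin main
# Rollback with ArgoCD (with no history ID it returns to the previous deployed version; automated sync must be disabled first)
argocd app rollback myapp --prune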
Verification
- Dashboard tracks: deployment frequency, failure rate, error budget, rollback count.
- Synthetic checks: run the critical user transaction paths every 5 minutes.
- Alert rules: if the 5xx error rate rises by more than 1 percentage point within 10 minutes of a deploy, trigger a rollback.
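The alert rule above can be expressed as a small decision function. This sketch reads the 1% threshold as percentage points (an assumption; the rule could also mean a relative increase) and works in basis points to stay within integer arithmetic. The function name and arguments are hypothetical, and the error rates are assumed to come from the monitoring system.

```shell
#!/usr/bin/env bash
# Hypothetical post-deploy rollback trigger. Names and defaults are assumptions.

# should_rollback PRE_BP POST_BP DEPLOYED_AT NOW [THRESHOLD_BP] [WINDOW_S]
# PRE_BP / POST_BP: 5xx error rate before and after the deploy, in basis points
#                   (100 bp = 1 percentage point)
# DEPLOYED_AT / NOW: Unix timestamps in seconds
# Succeeds (exit 0) when a rollback should be triggered.
should_rollback() {
  local pre_bp=$1 post_bp=$2 deployed_at=$3 now=$4
  local threshold_bp=${5:-100} window_s=${6:-600}
  local elapsed=$(( now - deployed_at ))

  # Only act inside the post-deploy window, and only on a rise above the threshold.
  [ "$elapsed" -ge 0 ] &&
    [ "$elapsed" -le "$window_s" ] &&
    [ $(( post_bp - pre_bp )) -gt "$threshold_bp" ]
}
```

A monitoring hook could call `should_rollback` and, on success, run `kubectl rollout undo deployment/myapp -n production`. Limiting the trigger to the 10-minute window keeps unrelated error spikes hours later from reverting a healthy release.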
When to Submit an OpsGlobal Ticket
- When you need to design a complete guardrail pipeline but lack internal expertise.
- When existing guardrails fail and cause repeated incidents.
- When you need complex progressive delivery strategies coordinated across multiple Kubernetes clusters.
Use cases
Useful for teams handling CI/CD issues and needing a clear troubleshooting and delivery workflow.
Problem background
Learn how to implement automated guardrails to prevent bad deployments, reduce downtime, and maintain SLOs in Kubernetes environments.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.