Scenario
A DevOps team manages a microservice architecture with 50+ daily deployments. P1 incidents occur frequently when code bypasses quality gates and is deployed directly to production. Manual code review is too slow, and rollbacks are not automated, which drives up MTTR.
Symptoms
- Immediate alerts after deployment with error rate spikes.
- Pipeline stages skipped (e.g., tests bypassed).
- Rollback operations take >30 minutes manually.
- Developers SSH into production to apply hotfixes, skipping CI/CD.
Diagnosis
- No automated quality gates: Pipeline lacks unit tests, security scans, etc.
- No canary deployment: All traffic shifts at once, risking full outage.
- Non-standardized rollback: Steps are manual and nothing verifies the result.
- Weak access controls: Developers can bypass pipeline and modify production directly.
Commands & Implementation
1. Add Quality Gates in GitLab CI
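stages:
  - test
  - build
  - deploy-canary
  - deploy-production

# Jobs in the test stage must pass before any later stage runs.
# The build and deploy-production jobs are omitted from this excerpt.
unit-test:
  stage: test
  script:
    - npm test
  only:
    - branches

security-scan:
  stage: test
  script:
    # Assumes the Snyk CLI is installed and SNYK_TOKEN is set as a masked CI/CD variable.
    - snyk test --all-projects

deploy-canary:
  stage: deploy-canary
  script:
    - kubectl apply -f k8s/canary-deployment.yaml
  environment:
    name: production/canary
  only:
    - main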
2. Implement Canary Releases with Flagger
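# Install Flagger with the kustomize overlay that matches your mesh or ingress.
# Istio is assumed here; the Flagger repository also ships linkerd and kubernetes overlays.
kubectl apply -k github.com/fluxcd/flagger//kustomize/istio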
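# Define the Canary resource
cat <<EOF | kubectl apply -f -
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    name: myapp
    port: 80
  analysis:
    interval: 30s      # time between analysis runs
    threshold: 5       # failed checks tolerated before rollback
    maxWeight: 50      # maximum share of traffic sent to the canary (percent)
    stepWeight: 10     # traffic increment per step (percent)
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99      # minimum request success rate (percent)
        interval: 1m
EOF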
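To make the analysis settings concrete, the following sketch works out the traffic schedule and timing they imply. It relies on the formulas given in the Flagger documentation (minimum promotion time = interval * (maxWeight / stepWeight), rollback time = interval * threshold). The function names are illustrative and not part of Flagger.

```shell
#!/bin/bash
# Sketch: arithmetic implied by the Canary analysis settings.
# All function names here are hypothetical helpers, not Flagger commands.

# canary_weights STEP MAX: print the canary traffic weights in order.
canary_weights() {
  local step=$1 max=$2 weight=$1 out=""
  while [ "$weight" -le "$max" ]; do
    out="${out:+$out }$weight"
    weight=$((weight + step))
  done
  echo "$out"
}

# min_promotion_seconds INTERVAL STEP MAX: shortest time before promotion.
min_promotion_seconds() {
  local interval=$1 step=$2 max=$3
  echo $(( interval * (max / step) ))
}

# rollback_seconds INTERVAL THRESHOLD: time until rollback when every check fails.
rollback_seconds() {
  local interval=$1 threshold=$2
  echo $(( interval * threshold ))
}
```

With the values used here, `canary_weights 10 50` prints `10 20 30 40 50`, `min_promotion_seconds 30 10 50` prints `150`, and `rollback_seconds 30 5` prints `150`. A healthy release therefore needs at least 2.5 minutes to promote, and a failing one is rolled back after about 2.5 minutes.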
3. Automated Rollback Script
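#!/bin/bash
# rollback.sh: roll a Deployment back and wait until the rollout is healthy.
# Usage: rollback.sh <namespace> <deployment> [revision]
set -euo pipefail
NAMESPACE=${1:?namespace required}
DEPLOYMENT=${2:?deployment required}
# --to-revision takes an integer; 0 (the default) means the previous revision.
REVISION=${3:-0}
kubectl rollout undo "deployment/$DEPLOYMENT" -n "$NAMESPACE" --to-revision="$REVISION"
# Verify the result: fail if the rollback is not ready within 5 minutes.
kubectl rollout status "deployment/$DEPLOYMENT" -n "$NAMESPACE" --timeout=300s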
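Because `kubectl rollout undo --to-revision` expects an integer revision (0 meaning the previous one), a pre-flight check can catch bad input before anything changes in the cluster. The helpers below are a minimal sketch with hypothetical names. The second one assumes `kubectl rollout history --revision=N` exits non-zero when the revision is no longer in the history.

```shell
#!/bin/bash
# Sketch: pre-flight checks to run before calling rollback.sh.
# Function names are illustrative assumptions.

# is_valid_revision REV: succeed only for a non-negative integer.
is_valid_revision() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# revision_exists NAMESPACE DEPLOYMENT REV: check the rollout history.
# Revision 0 means "previous" and is not looked up.
revision_exists() {
  local namespace=$1 deployment=$2 revision=$3
  if [ "$revision" -eq 0 ]; then
    return 0
  fi
  kubectl rollout history "deployment/$deployment" -n "$namespace" \
    --revision="$revision" >/dev/null 2>&1
}
```

A caller might run `is_valid_revision "$REV" && revision_exists production myapp "$REV"` and invoke the rollback script only when both succeed.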
4. Access Control with RBAC
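# RBAC is allow-only: there is no "deny" rule, and an empty verbs list is rejected.
# Grant developers read-only access and reserve write verbs for the CI/CD service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-viewer   # illustrative name; bind it to developers with a RoleBinding
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]   # no create, update, patch, or delete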
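One way to confirm that developers really cannot change Deployments directly is to ask the API server with `kubectl auth can-i`. The sketch below loops over the write verbs for a given user. The function name is hypothetical, and the `--as` flag requires the caller to hold impersonation rights.

```shell
#!/bin/bash
# Sketch: verify that a user has no write access to Deployments.
# check_no_write_access is an illustrative helper name.

# check_no_write_access USER NAMESPACE
# Prints one line per write verb that is still allowed; returns 0 if none are.
check_no_write_access() {
  local user=$1 namespace=$2 failed=0 verb
  for verb in create update patch delete; do
    # kubectl auth can-i exits 0 when the action is allowed.
    if kubectl auth can-i "$verb" deployments -n "$namespace" --as "$user" >/dev/null 2>&1; then
      echo "VIOLATION: $user can $verb deployments in $namespace"
      failed=1
    fi
  done
  return "$failed"
}
```

Running this check on a schedule, or as a pipeline job, gives early warning when a new RoleBinding quietly reopens direct access to production.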
Risk Controls
- Automated Testing: Enforce at least 80% unit test coverage and a 100% integration test pass rate.
- Canary Analysis: Abort the release when the error rate or latency breaches its threshold.
- Policy as Code: Use OPA to require that all changes go through CI/CD.
- Change Approval: Require Ops manager approval for privileged actions.
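The 80% coverage rule can be enforced with a small gate that fails the pipeline job when coverage is too low. This is a minimal sketch: `coverage_gate` is a hypothetical helper, and the measured percentage is assumed to be extracted from the test runner's coverage report by an earlier step.

```shell
#!/bin/bash
# Sketch: fail a CI job when coverage is below the required minimum.
# coverage_gate is an illustrative name; how the percentage is obtained
# depends on the test runner and is not shown here.

# coverage_gate MEASURED [MINIMUM]: succeed when MEASURED >= MINIMUM (default 80).
coverage_gate() {
  local measured=$1 minimum=${2:-80}
  # awk handles the decimal comparison that shell arithmetic cannot.
  awk -v m="$measured" -v min="$minimum" 'BEGIN { exit !(m + 0 >= min + 0) }'
}
```

In a job script this could read `coverage_gate "$COVERAGE" || { echo "coverage below 80%"; exit 1; }`, where `COVERAGE` is an assumed variable holding the measured percentage.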
Rollback Strategy
- Automatic Rollback: Flagger rolls the canary back automatically once the number of failed metric checks reaches the configured threshold (5 in the example above).
- Manual Rollback: Use the predefined script to roll back to a specific revision.
- Database Compatibility: Ensure migration scripts are reversible; apply data patches if needed.
Verification
- Monitoring Dashboard: Create a release-tracking dashboard in Grafana that shows error rates, latency, and traffic distribution.
- SLO Alignment: Define release policies per service SLO; block releases while the SLO is violated.
- Chaos Engineering: Regularly test rollback mechanisms and failure recovery through chaos experiments.
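One possible reading of "block releases if the SLO is violated" is an error-budget gate: releases proceed only while the service has failed fewer requests than its SLO tolerates. The sketch below assumes request totals and failure counts are already available (for example from the monitoring system). Both function names are hypothetical.

```shell
#!/bin/bash
# Sketch: error-budget gate for releases. Names and inputs are assumptions.

# error_budget_remaining TOTAL FAILED SLO_PERCENT
# Prints the unspent share of the error budget (1.00 = untouched, <= 0 = exhausted).
error_budget_remaining() {
  awk -v total="$1" -v failed="$2" -v slo="$3" 'BEGIN {
    allowed = total * (100 - slo) / 100   # failures the SLO tolerates
    if (allowed <= 0) { print "0.00"; exit }
    printf "%.2f\n", (allowed - failed) / allowed
  }'
}

# release_allowed TOTAL FAILED SLO_PERCENT: succeed only while budget remains.
release_allowed() {
  local remaining
  remaining=$(error_budget_remaining "$1" "$2" "$3")
  awk -v r="$remaining" 'BEGIN { exit !(r + 0 > 0) }'
}
```

For a 99.9% SLO over 100000 requests the budget is 100 failures, so `error_budget_remaining 100000 50 99.9` prints `0.50` and the release is allowed, while 150 failures would exhaust the budget and block it.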
When to Submit an OpsGlobal Ticket
- P0 incidents (complete service outage) with failed automated rollback.
- CI/CD guardrails bypassed and emergency security hardening needed.
- Expert assistance required for complex canary strategies or OPA rules.
- Compliance audit requiring third-party validation of release processes.
Use cases
Useful for teams handling CI/CD issues and needing a clear troubleshooting and delivery workflow.
Problem background
This guide is a deep dive into CI/CD guardrails, covering the scenario, symptoms, diagnosis, commands, risk controls, rollback, verification, and when to raise an OpsGlobal ticket.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
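The evidence-gathering step can be scripted so that the same data is captured during every incident. This is a minimal sketch: `collect_evidence` and the output file names are illustrative, and the 30-minute log window is an assumption to adjust per incident.

```shell
#!/bin/bash
# Sketch: capture recent changes, events, configuration and logs for one Deployment.
# collect_evidence and the file names are illustrative assumptions.

# collect_evidence NAMESPACE DEPLOYMENT OUTDIR
collect_evidence() {
  local namespace=$1 deployment=$2 outdir=$3
  mkdir -p "$outdir"
  # Recent changes: which revisions were rolled out.
  kubectl rollout history "deployment/$deployment" -n "$namespace" > "$outdir/rollout-history.txt"
  # Cluster events, oldest first.
  kubectl get events -n "$namespace" --sort-by=.lastTimestamp > "$outdir/events.txt"
  # Live configuration of the Deployment.
  kubectl get "deployment/$deployment" -n "$namespace" -o yaml > "$outdir/deployment.yaml"
  # Logs from the last 30 minutes, all containers of one pod of the Deployment.
  kubectl logs "deployment/$deployment" -n "$namespace" --since=30m --all-containers > "$outdir/logs.txt"
}
```

For example, `collect_evidence production myapp ./incident-evidence` writes four files that can be attached to a ticket.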
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
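Scripts that read secrets from the environment should fail early when one is missing instead of running half-configured. The helper below is a small sketch. `require_env` is a hypothetical name, and it checks exported variables only.

```shell
#!/bin/bash
# Sketch: abort early when required secrets are not in the environment.
# require_env is an illustrative helper name.

# require_env NAME...: report every unset or empty variable; return 1 if any.
require_env() {
  local name value missing=0
  for name in "$@"; do
    # printenv prints nothing and fails when the variable is not exported.
    value=$(printenv "$name" || true)
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A deployment script could start with `require_env SNYK_TOKEN KUBECONFIG || exit 1`, where the variable names depend on the tools in use.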
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.