Scenario
A company migrated its e-commerce platform to a cloud-native Kubernetes environment. Initially, node and Pod counts were manually configured, leading to performance issues during peak times and resource waste during off-peak hours. Monthly cloud bills increased by over 30%.
Symptoms
- Cloud bills continuously rise, yet average CPU/memory utilization is below 30%.
- User response times exceed 5 seconds during peak hours, with occasional timeouts.
- The node count is fixed, and Pod requests frequently hit rate limits.
Diagnosis
- Use kubectl top pods and kubectl top nodes to view resource usage.
- Analyze cluster autoscaler logs: kubectl logs -n kube-system deployment/cluster-autoscaler.
- Check HorizontalPodAutoscaler configuration: kubectl describe hpa <hpa-name>.
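To compare the kubectl top nodes output against the sub-30% utilization symptom, the CPU% column can be averaged with a small helper. This is a minimal sketch: the function name avg_node_cpu_percent is hypothetical, and it assumes the default column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%) and a running metrics-server.

```shell
#!/usr/bin/env bash
# Average the CPU% column of `kubectl top nodes` output read from stdin.
# Rows whose third column is not a plain percentage (the header row, or
# nodes reporting <unknown>) are skipped.
avg_node_cpu_percent() {
  awk '$3 ~ /^[0-9]+%$/ { sub(/%/, "", $3); sum += $3; n++ }
       END { if (n > 0) printf "%d\n", int(sum / n + 0.5); else print "n/a" }'
}

# Usage against a live cluster:
#   kubectl top nodes | avg_node_cpu_percent
```

A result well under 30 during normal traffic supports the over-provisioning diagnosis; a single sample is only a snapshot, so repeat it at peak and off-peak times.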
Commands
Configure HorizontalPodAutoscaler (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Apply: kubectl apply -f hpa.yaml.
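To reason about how many replicas this HPA will request, the documented scaling rule (desired = ceil(current replicas × current utilization / target utilization)) can be sketched in shell. The function name hpa_desired_replicas is hypothetical, and the sketch deliberately ignores the controller's tolerance band, Pod readiness handling, and stabilization windows.

```shell
#!/usr/bin/env bash
# Simplified HPA replica calculation:
#   desired = ceil(current_replicas * current_utilization / target_utilization)
# clamped to [min_replicas, max_replicas].
hpa_desired_replicas() {
  local current=$1 utilization=$2 target=$3 min=$4 max=$5
  # Integer ceiling division.
  local desired=$(( (current * utilization + target - 1) / target ))
  if (( desired < min )); then desired=$min; fi
  if (( desired > max )); then desired=$max; fi
  echo "$desired"
}

# 3 replicas at 140% average CPU with the 70% target and 3..10 bounds above.
hpa_desired_replicas 3 140 70 3 10   # prints 6
```

With the bounds from the manifest above, any load that would call for more than 10 replicas is capped at maxReplicas, which is worth checking against peak traffic before relying on the HPA alone.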
Configure Cluster Autoscaler (assuming AWS)
Ensure node groups have min/max sizes set. On AWS, when Cluster Autoscaler runs with node group auto-discovery, it reads these bounds from the Auto Scaling group rather than from a Kubernetes annotation, so set them on the group:
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <asg-name> --min-size 2 --max-size 20
Use VerticalPodAutoscaler to optimize resource requests
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/vertical-pod-autoscaler/deploy/vpa-v1-crd.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/vertical-pod-autoscaler/deploy/recommender-deployment.yaml
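updateMode "Auto" only takes effect when the VPA updater and admission controller are also running; with the recommender alone, the VPA only publishes recommendations. Avoid letting the VPA manage CPU requests on a workload whose HPA scales on CPU utilization, because the two controllers will work against each other. With that in mind, create a VPA: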
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 1
        memory: 512Mi
Risk Controls
- Use PodDisruptionBudget to ensure availability:
  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: web-app-pdb
  spec:
    minAvailable: 2
    selector:
      matchLabels:
        app: web-app
- Set HPA cool-down periods to avoid thrashing: add a behavior field to the HPA spec.
- Use Spot instances to reduce compute cost, but implement proper interruption handling (e.g., priority scheduling).
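The cool-down advice above maps to the behavior section of an autoscaling/v2 HPA. The sketch below writes a merge patch that slows scale-down and leaves scale-up immediate. The file name and the numeric values (a 10-minute window, at most 50% of replicas removed per minute) are illustrative assumptions, not tuned recommendations.

```shell
#!/usr/bin/env bash
# Write an HPA behavior patch; file name and values are assumptions.
PATCH_FILE="${PATCH_FILE:-hpa-behavior-patch.yaml}"

cat > "$PATCH_FILE" <<'EOF'
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # require 10 minutes of lower load
      policies:
      - type: Percent
        value: 50                       # remove at most 50% of replicas...
        periodSeconds: 60               # ...per minute
    scaleUp:
      stabilizationWindowSeconds: 0     # react to spikes immediately
EOF

echo "wrote $PATCH_FILE"
# Apply it to the HPA from this guide when ready:
#   kubectl patch hpa web-app-hpa --type merge --patch-file "$PATCH_FILE"
```

Keeping the patch in a file rather than inline makes it easy to review in a change window and to revert by applying the original hpa.yaml again.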
Rollback
- Delete or modify HPA/VPA config: kubectl delete hpa web-app-hpa.
- Revert node group to original sizes.
- Monitor metrics until stable.
Verification
- Check Pod count changes with load: kubectl get pods -w.
- Compare bills using cloud cost tools (e.g., AWS Cost Explorer).
- Performance metrics: response time <1s, error rate <0.1%.
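The performance targets in the checklist above can be turned into a pass/fail check for use after a load test. This is a sketch only: the function name meets_targets and its inputs (a response time in milliseconds plus error and request counts taken from your own monitoring) are assumptions.

```shell
#!/usr/bin/env bash
# Return 0 when both targets hold:
#   response time < 1s (given in milliseconds) and error rate < 0.1%.
# Integer arithmetic only: errors/requests < 0.001  <=>  errors * 1000 < requests.
meets_targets() {
  local response_ms=$1 errors=$2 requests=$3
  if (( requests <= 0 )); then
    return 2   # no traffic sampled, nothing to judge
  fi
  (( response_ms < 1000 && errors * 1000 < requests ))
}

if meets_targets 800 5 10000; then
  echo "targets met"
else
  echo "targets missed"
fi
```

Which latency figure to pass in (average or a high percentile) is a choice the source does not specify; a percentile is usually the stricter test.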
When to Submit an OpsGlobal Ticket
- Autoscaling fails to meet SLAs (e.g., requests still time out during peaks).
- Node groups fail to scale down (for example, because of unresolved Spot interruption issues).
- You need complex cost allocation or budgeting strategies.
Use cases
Useful for teams that handle cloud migration issues and need a clear troubleshooting and delivery workflow.
Problem background
A practical guide to implementing intelligent autoscaling strategies that optimize both performance and cost, with real-world commands and risk controls.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.