
Cloud Capacity Autoscaling and Cost Operations: A Practical SRE Guide

This article provides a hands-on guide to optimizing cloud capacity and costs using Kubernetes autoscaling during migration, covering scenario analysis, symptom diagnosis, commands, risk controls, rollback, verification, and when to submit an OpsGlobal ticket.

Cloud Migration | 6 min | 2026-06-18
Tags: Kubernetes, SRE, cloud migration, autoscaling, cost optimization

Scenario

A company is migrating workloads to Kubernetes clusters. It initially provisioned a large pool of fixed-size nodes, which left resource utilization low (<30%) and cloud bills high. The goal is autoscaling that tracks actual demand while keeping costs under control.

Symptoms

  • Cluster node CPU/memory usage consistently below 40%.
  • Month-over-month cloud bill increase >20% without business growth.
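  • Pods frequently stuck in Pending for lack of resources even though nodes are idle. The scheduler places pods by their resource requests, not their actual usage, so this often points to requests set well above real consumption or to scheduling constraints such as taints.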

Diagnosis

  1. Check Horizontal Pod Autoscaler (HPA) configuration:
       kubectl get hpa -A
       kubectl describe hpa <name> -n <namespace>
  2. Check Cluster Autoscaler logs:
       kubectl logs -n kube-system deployment/cluster-autoscaler
  3. Analyze node utilization:
       kubectl top nodes
       kubectl describe node <node-name>
  4. Use cloud cost tools (e.g., AWS Cost Explorer) to view spending by resource type.
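The node utilization check in step 3 can be turned into a quick filter. The following is a minimal sketch, assuming the usual five-column output of `kubectl top nodes --no-headers` (name, CPU cores, CPU%, memory, memory%) and a working metrics-server; `underutilized_nodes` is a hypothetical helper name, not part of kubectl.

```shell
# underutilized_nodes THRESHOLD
# Reads `kubectl top nodes --no-headers` output on stdin and prints the names
# of nodes whose CPU% and MEMORY% are both below THRESHOLD (hypothetical helper).
underutilized_nodes() {
  awk -v limit="$1" '
    {
      cpu = $3; mem = $5
      sub(/%$/, "", cpu); sub(/%$/, "", mem)
      # Skip nodes without numeric metrics (for example "<unknown>").
      if (cpu !~ /^[0-9]+$/ || mem !~ /^[0-9]+$/) next
      if (cpu + 0 < limit + 0 && mem + 0 < limit + 0) print $1
    }'
}

# Usage against a live cluster:
#   kubectl top nodes --no-headers | underutilized_nodes 40
```

Nodes listed here for several consecutive checks are candidates for scale-down, provided their pods can be rescheduled elsewhere.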

Commands

Set up HPA

kubectl autoscale deployment <deployment-name> --cpu-percent=50 --min=2 --max=10 -n <namespace>
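To see what the flags above imply, it helps to work through the replica calculation the HPA documentation describes: desired = ceil(currentReplicas × currentUtilization / targetUtilization), with no change while the ratio stays within the default 10% tolerance. The sketch below reproduces that arithmetic with whole-number percentages; `hpa_desired_replicas` is a hypothetical helper, and it ignores stabilization windows, pod readiness and missing metrics.

```shell
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Integer sketch of the documented HPA rule, clamped to [MIN, MAX].
# Utilization values are whole percentages of the CPU request.
hpa_desired_replicas() {
  current=$1; util=$2; target=$3; min=$4; max=$5
  # Default 10% tolerance: no change while util/target stays within 0.9 to 1.1.
  if [ $((util * 10)) -ge $((target * 9)) ] && [ $((util * 10)) -le $((target * 11)) ]; then
    desired=$current
  else
    # Ceiling division using integer arithmetic.
    desired=$(( (current * util + target - 1) / target ))
  fi
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# 4 replicas at 80% CPU against the 50% target, bounds 2..10
hpa_desired_replicas 4 80 50 2 10   # prints 7
```

Because utilization is measured against each container's CPU request, the target is only meaningful when the deployment's pods declare CPU requests.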

Enable Cluster Autoscaler (example for AWS EKS)

eksctl create iamserviceaccount --name cluster-autoscaler --namespace kube-system \
  --cluster <cluster-name> --attach-policy-arn <policy-arn> --approve
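# Download the manifest and replace its <YOUR CLUSTER NAME> placeholder before applying (GNU sed syntax).
curl -fsSLO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
sed -i 's/<YOUR CLUSTER NAME>/<cluster-name>/' cluster-autoscaler-autodiscover.yaml
kubectl apply -f cluster-autoscaler-autodiscover.yaml
# safe-to-evict is a pod annotation, so set it on the pod template to keep the autoscaler pod from being evicted.
kubectl -n kube-system patch deployment cluster-autoscaler -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'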

Use Spot instances (if the cloud provider supports them)

Create a Spot-backed managed node group:

eksctl create nodegroup --cluster <cluster-name> --node-type t3.medium --nodes-min 2 --nodes-max 20 \
  --node-volume-size 20 --spot --asg-access --managed

Risk Controls

  • Set Pod Disruption Budgets (PDB) to prevent critical service interruption:
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: app-pdb
      spec:
        minAvailable: 2
        selector:
          matchLabels:
            app: myapp
  • Set minimum and maximum replica limits for HPA.
  • Use node selectors and taints/tolerations for specialized workloads.
  • Test in a non-production environment before enabling Cluster Autoscaler in production.
  • Set budget alerts: use cloud monitoring and PagerDuty to notify the team of cost anomalies.
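One interaction between the PDB and the HPA limits above is worth checking with simple arithmetic. A PDB with `minAvailable` permits max(0, healthy pods − minAvailable) voluntary evictions. Assuming the `app-pdb` example covers a deployment whose HPA floor is also 2, no pod can be evicted while the deployment sits at its minimum, which can stop Cluster Autoscaler from draining the nodes those pods run on. The helper name below is hypothetical.

```shell
# pdb_allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
# Voluntary evictions a minAvailable PDB permits right now:
# max(0, healthy - minAvailable).
pdb_allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

pdb_allowed_disruptions 2 2   # prints 0: at 2 replicas no pod can be evicted
pdb_allowed_disruptions 5 2   # prints 3

# On a live cluster the ALLOWED DISRUPTIONS column reports the same figure:
#   kubectl get pdb -A
```

If scale-down matters more than the extra headroom, one option is to keep the HPA minimum above the PDB's `minAvailable` so that at least one eviction is always allowed.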

Rollback

If autoscaling causes performance degradation or cost increase:

  1. Revert HPA:
       kubectl delete hpa <name> -n <namespace>
  2. Restore node group to fixed size:
       eksctl scale nodegroup --cluster <cluster-name> --name <ng-name> --nodes 5 --nodes-min 5 --nodes-max 5
  3. Remove Cluster Autoscaler:
       kubectl delete -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
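Before deleting an HPA it may be worth saving its definition, since the delete is not reversible on its own. The sketch below wraps `kubectl get hpa -o yaml` in a hypothetical `backup_hpa` helper that only keeps the file when the export succeeds; the file naming scheme is an assumption.

```shell
# backup_hpa NAME NAMESPACE [DIR]
# Saves the HPA manifest to DIR/hpa-NAMESPACE-NAME.yaml and prints the path.
# Returns non-zero, leaving no file behind, if kubectl fails.
backup_hpa() {
  name=$1; namespace=$2; dir=${3:-.}
  file="$dir/hpa-$namespace-$name.yaml"
  mkdir -p "$dir" || return 1
  # Write to a temporary file first so a failed export never looks like a backup.
  if kubectl get hpa "$name" -n "$namespace" -o yaml > "$file.tmp"; then
    mv "$file.tmp" "$file"
    echo "$file"
  else
    rm -f "$file.tmp"
    return 1
  fi
}

# Usage:
#   backup_hpa <name> <namespace> ./rollback && kubectl delete hpa <name> -n <namespace>
```

The saved manifest records `minReplicas`, `maxReplicas` and the metric targets; cluster-managed fields such as `status` and `resourceVersion` may need to be removed before it is applied again. Deleting the HPA leaves the deployment at whatever replica count it had at that moment, so a fixed count can be set afterwards with `kubectl scale deployment <deployment-name> --replicas=<n> -n <namespace>`.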

Verification

  • Monitor HPA status:
      kubectl get hpa -w
  • Observe node count changes:
      kubectl get nodes -w
  • View cost trends in cloud console (daily/weekly/monthly).
  • Use load testing tools (e.g., Locust) to simulate traffic and confirm scaling response.
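During a load test, the node-count check above can be timed rather than watched by hand. The following is a rough sketch, assuming the second column of `kubectl get nodes --no-headers` is the node status; `wait_for_nodes` is a hypothetical helper, and the elapsed time it reports only counts the sleep intervals, so treat it as approximate.

```shell
# wait_for_nodes MIN_NODES TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls until at least MIN_NODES nodes report Ready, then prints the
# approximate elapsed seconds. Returns 1 if the timeout is reached first.
wait_for_nodes() {
  want=$1; timeout=$2; interval=${3:-15}
  elapsed=0
  while :; do
    # Count nodes whose STATUS column is exactly "Ready".
    ready=$(kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }')
    if [ "$ready" -ge "$want" ]; then
      echo "$elapsed"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Usage: start the load test, then wait up to 10 minutes for a sixth node.
#   wait_for_nodes 6 600 || echo "scale-up did not finish within 10 minutes"
```

A timeout here lines up with the 10-minute scale-up threshold listed among the ticket criteria, so the recorded result can be attached to the ticket as evidence.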

When to Submit an OpsGlobal Ticket

  • Cluster autoscaler fails to scale nodes within expected time (>10 minutes).
  • HPA target metrics are unreachable or oscillate wildly.
  • Cost spikes despite normal scaling behavior.
  • Misconfiguration of node groups or autoscaler causing service unavailability.
  • Need expert review for overall architecture cost optimization.

Use cases

Useful for teams that handle cloud migration issues and need a clear troubleshooting and delivery workflow.


Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.


Need help with a similar technical issue?

If your servers, Kubernetes, Docker, CI/CD, databases or monitoring systems have similar issues, submit logs and config files for remote diagnosis.
