The client came to us after their AWS bill crossed $700k for the year - up from $280k two years earlier, with roughly the same amount of actual user traffic. The engineering team knew Kubernetes was involved. They didn't know exactly where. Eight weeks later, their annualised run rate was under $290k. This is what we found and what we changed.
Where the waste actually lives
The first thing we do with any Kubernetes cost engagement is pull two weeks of actual CPU and memory utilisation data and lay it next to the resource requests configured on every workload. The gap is almost always shocking to the team that's been running the cluster.
In this case, average CPU utilisation across the cluster was 17% of requested. Memory was a bit better at 34%. The cluster was running on a mix of m5.2xlarge and m5.4xlarge nodes - not because the workloads needed that much compute, but because the requests had been set conservatively during initial deployment and nobody had revisited them.
That's the pattern we see repeatedly. Engineers set requests high to avoid OOMKills during development, the cluster autoscaler provisions nodes to satisfy those requests, and the actual workload uses a fraction of what's been reserved. The waste isn't visible because the cluster looks "healthy" - pods are running, no alerts are firing, nothing is obviously broken.
Before changing anything, instrument your cluster with Prometheus + kube-state-metrics and run at least two weeks of data collection across production load patterns. Optimising without this data is guesswork. The Kubernetes Dashboard utilisation numbers are not sufficient - you need percentile data (p50, p95, p99), not averages.
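As a rough illustration of the data worth collecting, here is a minimal sketch of a Prometheus rule file that records p95 CPU and memory usage per container and compares CPU usage against the configured request. It assumes the standard cAdvisor metrics (`container_cpu_usage_seconds_total`, `container_memory_working_set_bytes`) and kube-state-metrics v2 (`kube_pod_container_resource_requests`) are being scraped, and that Prometheus retention is at least 14 days. The rule names are our own invention, not something used on this engagement.

```yaml
# Hypothetical Prometheus rule file (rightsizing.rules.yml).
# Rule names and the 14-day window are illustrative assumptions.
groups:
  - name: rightsizing
    interval: 5m
    rules:
      # CPU cores actually used per container, as a 5-minute rate
      - record: container:cpu_usage_cores:rate5m
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
      # p95 of that rate over the last 14 days
      - record: container:cpu_usage_cores:p95_14d
        expr: |
          quantile_over_time(0.95, container:cpu_usage_cores:rate5m[14d])
      # p95 of the memory working set over the last 14 days
      - record: container:memory_working_set_bytes:p95_14d
        expr: |
          max by (namespace, pod, container) (
            quantile_over_time(0.95, container_memory_working_set_bytes{container!=""}[14d])
          )
      # CPU cores requested per container
      - record: container:cpu_request_cores
        expr: |
          sum by (namespace, pod, container) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
      # p95 usage as a fraction of the request (0.17 means 17% of requested)
      - record: container:cpu_p95_to_request:ratio
        expr: |
          container:cpu_usage_cores:p95_14d
            / on (namespace, pod, container)
          container:cpu_request_cores
```

Because these series are keyed by pod, a pod that restarted mid-window only contributes data for its own lifetime. For a per-workload view you would aggregate across pods, which is left out here to keep the sketch short.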
Requests and limits: the root cause of most Kubernetes waste
If you understand one thing about Kubernetes cost, it should be this: the cluster autoscaler makes node provisioning decisions based on resource requests, not actual utilisation. A pod requesting 4 CPU cores that actually uses 0.3 cores still reserves 4 cores on whichever node it lands on, and once no node has 4 unreserved cores left, the autoscaler adds another one. You're paying for the request, not the usage.
This creates a specific failure mode that's extremely common in teams that inherited a cluster rather than built it from scratch. Requests get set during initial deployment based on rough estimates or AWS documentation recommendations, they never get revisited because nothing is visibly broken, and the bill grows with every new service that follows the same pattern.
The distinction between requests and limits matters here too:
- Requests are what the scheduler uses to decide where to place a pod and what the autoscaler uses to decide when to add a node. Setting requests too high means you pay for capacity you don't use.
- Limits are the ceiling at which the kernel will throttle CPU or OOMKill a container. Setting limits too low causes application instability. Setting them too high wastes nothing - you only pay for requests, not limits.
In most rightsizing exercises, the correct approach is to set requests close to actual p95 utilisation and set limits 2–3x higher. This gives your workloads room to burst without causing OOMKills, while significantly reducing the reserved capacity that drives your node costs.
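To make that rule concrete, here is a minimal sketch of a Deployment whose requests sit near p95 and whose limits are roughly 2–3x higher. Every name and number is a placeholder: it assumes two weeks of data showed p95 usage of about 300m CPU and 400Mi memory for a hypothetical `example-api` service.

```yaml
# Hypothetical workload; names, image, and all resource values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: app
          image: registry.example.com/example-api:1.0.0
          resources:
            requests:
              cpu: 300m      # close to the assumed p95 CPU usage
              memory: 400Mi  # close to the assumed p95 memory usage
            limits:
              cpu: 900m      # 3x the request, leaves room to burst
              memory: 1Gi    # about 2.5x the request
```

The scheduler and autoscaler only see the 300m / 400Mi figures, so those are what drive node count. The limits only matter when the container actually bursts.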
A common mistake is setting CPU limits equal to CPU requests. Many teams fall into this posture by default, and it causes CPU throttling even when the node has spare capacity. CPU throttling shows up as increased latency at the application layer, which teams often misdiagnose as a code performance problem. Check your CPU throttling metrics before assuming your application is slow.
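One way to check throttling is an alerting rule on the cAdvisor CFS counters. The sketch below is illustrative, not the rule used on this engagement: it assumes `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` are being scraped, and the 25% threshold and 15-minute window are assumptions you would tune for your own workloads.

```yaml
# Hypothetical Prometheus alerting rule; threshold and duration are assumptions.
groups:
  - name: cpu-throttling
    rules:
      - alert: ContainerCPUThrottlingHigh
        # Fraction of CFS scheduling periods in which the container was throttled
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
          )
            /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total{container!=""}[5m])
          ) > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is throttled in more than 25% of CFS periods"
```

A container that trips this while its node still has idle CPU is usually a sign that the limit, not the code, is the bottleneck.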
Rightsizing in practice: what we actually changed
We used Goldilocks (from Fairwinds) to generate VPA recommendations across all workloads, then manually reviewed each recommendation before applying anything. Automated rightsizing without human review is how you accidentally throttle a payment processing service at 3am.
```shell
# Install the Vertical Pod Autoscaler (required by Goldilocks)
git clone https://github.com/kubernetes/autoscaler.git
./autoscaler/vertical-pod-autoscaler/hack/vpa-up.sh

# Install Goldilocks via Helm
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace

# Label the namespaces you want recommendations for
kubectl label ns production goldilocks.fairwinds.com/enabled=true
kubectl label ns staging goldilocks.fairwinds.com/enabled=true

# Access the dashboard
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80
```
The Goldilocks dashboard gives you a per-deployment view of current requests vs VPA recommendations, broken down by container. For each service, we applied the following decision rule:
- If the recommended request was more than 30% lower than current, we staged the change in a canary deployment first and watched for latency regressions over 48 hours before rolling to production.
- For stateful workloads (databases, message queues, caches), we were more conservative - VPA recommendations were used as a floor, not a ceiling.
- For batch workloads and internal tooling with no external SLAs, we applied recommendations directly.
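The "review before applying" step depends on VPA running in recommendation-only mode. For reference, a minimal sketch of such a VPA object looks like the one below. Goldilocks manages objects of this kind for labelled namespaces, so you would not normally write it by hand. The workload name `example-api` is hypothetical.

```yaml
# Hypothetical recommendation-only VPA; the target name is a placeholder.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  updatePolicy:
    updateMode: "Off"   # compute recommendations only, never evict or resize pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
```

With `updateMode: "Off"`, `kubectl -n production describe vpa example-api` shows lower bound, target, and upper bound recommendations per container. A human then decides what, if anything, to copy into the Deployment.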
The single largest source of waste in this cluster wasn't any individual service - it was the cultural norm that requests are set once during deployment and never reviewed. Rightsizing is a process, not a project.
- Sequere

Node group strategy: the wrong instance types cost more than you think
The client's cluster was running a flat mix of m5.2xlarge (8 vCPU / 32GB) and m5.4xlarge (16 vCPU / 64GB) on-demand instances across all workloads. This is a common starting configuration - general-purpose instances, reasonable size, easy to reason about. It's also rarely optimal.
The problem: different workloads have very different resource profiles. A high-throughput API server is CPU-bound with low memory needs. A machine learning inference service is memory-bound with moderate CPU. A background data processing job needs bursts of CPU with minimal base cost in between. Running all of them on the same instance type means you're either overpaying on one dimension or starving the other.
We restructured the cluster into three node groups:
- General purpose (m5 family): For services with balanced CPU/memory ratios. Kept as the default node group, but downsized to m5.xlarge (4 vCPU / 16GB) after rightsizing reduced the per-pod footprint.
- Compute-optimised (c5 family): For API servers and compute-heavy microservices. c5.xlarge costs roughly 15% less than an equivalent m5 and delivers better CPU performance for the workloads that needed it.
- Memory-optimised (r5 family): For caching layers and a data aggregation service with large working sets. Two r5.large nodes replaced four m5.xlarge nodes that were CPU-light but memory-constrained.
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-type
              operator: In
              values:
                - compute-optimised  # route API servers here
```

```shell
# Label nodes in the c5 group
kubectl label node <node-name> node-type=compute-optimised
```
Spot instances without the incidents
Spot instances are the highest-leverage cost reduction available in a Kubernetes cluster. AWS Spot prices for common instance types run 60–70% below on-demand rates. The tradeoff is that Spot instances can be reclaimed by AWS with a 2-minute warning when capacity is needed.
Most teams either avoid Spot entirely (leaving significant savings on the table) or run stateful workloads on Spot and get burned. The right approach is more granular than "Spot or not."
We categorised every workload in the cluster:
- Spot-safe: Stateless services with multiple replicas, batch jobs, CI runners, and background workers. For a service with 3+ replicas, a single Spot interruption causes a brief replica reduction - not an outage. These moved to 100% Spot.
- Mixed (Spot + On-Demand): Services where availability matters but the blast radius of a Spot interruption is acceptable if enough replicas survive. We used a 70/30 Spot/On-Demand split here, with pod topology spread constraints to ensure replicas weren't co-located on the same node.
- On-Demand only: Databases (RDS is external anyway), anything stateful, and single-replica services that couldn't be quickly horizontally scaled. This was a smaller list than the team expected - roughly 20% of the workload by CPU request.
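For the mixed category, the topology spread constraints mentioned above might look something like the following pod template fragment. This is a sketch under assumptions: the `app: example-api` label is hypothetical, and the choice of a hard constraint on hostname with a soft one on zone is ours, not necessarily what was deployed here.

```yaml
# Hypothetical fragment; goes under spec.template.spec of a Deployment.
topologySpreadConstraints:
  # Hard rule: keep replica counts per node within 1 of each other,
  # so a single reclaimed Spot node cannot take out most of the service
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: example-api
  # Soft rule: prefer spreading across availability zones as well
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: example-api
```

`DoNotSchedule` leaves a pod pending rather than stacking it on an already-loaded node, which in turn prompts the autoscaler to bring up another one.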
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-mixed
spec:
  template:
    spec:
      requirements:
        # Allow both Spot and On-Demand capacity in this pool
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.xlarge", "m5.2xlarge", "m4.xlarge"]
      nodeClassRef:
        name: default
  # Cap the total CPU this pool may provision
  limits:
    cpu: "100"
  disruption:
    consolidationPolicy: WhenUnderutilized
    # consolidateAfter is omitted: in v1beta1 it is only accepted with WhenEmpty
```
We also moved from the Cluster Autoscaler to Karpenter during this engagement. Karpenter provisions nodes in response to pending pods in around 30–60 seconds (vs 3–5 minutes with CA), and its consolidation logic is significantly better at bin-packing workloads onto fewer nodes during low-traffic periods. The switch itself took about two days and paid for the effort within the first billing cycle.
Autoscaling that actually works
The cluster had HPA configured on most services, but the HPA metrics were pointing at CPU utilisation - which, after our rightsizing pass, was now a reasonably accurate signal. Before rightsizing, CPU utilisation as a percentage of request was artificially low because requests were set so high. This meant the HPA was scaling out too late, adding replicas only when pods were genuinely under heavy load, rather than proactively.
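To see why the request size matters, consider a minimal CPU-based HPA like the sketch below. The names and the 70% target are assumptions. `averageUtilization` is measured against the pod's CPU request, so with a 4-core request and 0.3 cores of real usage (the example figures from earlier), the HPA sees 7.5% and would not scale out until each pod was burning about 2.8 cores.

```yaml
# Hypothetical HPA; target name, replica bounds, and 70% target are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, not of the node
```

Once requests track real usage, the same 70% target fires at a load level that actually reflects pressure on the service.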
For the two highest-traffic services, we supplemented CPU-based HPA with custom metrics (requests-per-second via KEDA pulling from a Prometheus metric), which gave an earlier signal for scale-out decisions than CPU alone. This had nothing to do with cost directly, but it allowed us to reduce the standing replica count during off-peak hours more aggressively, because we had confidence the scale-out would happen fast enough when traffic increased.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-scaler
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        query: sum(rate(http_requests_total[2m]))
        threshold: "150"  # scale out when RPS exceeds 150 per replica
```
The $420k breakdown
Here's where the savings actually came from, in order of impact:
| Change | Annual saving | Complexity | Risk |
|---|---|---|---|
| Rightsizing CPU/memory requests | ~$158,000 | Medium | Medium - needs validation |
| Spot instances for stateless workloads | ~$112,000 | Medium | Low–medium with multi-AZ spread |
| Instance type optimisation by workload | ~$74,000 | Medium | Low |
| Off-peak autoscaling (overnight + weekends) | ~$51,000 | Low | Low |
| Removing orphaned resources (PVs, LBs, snapshots) | ~$25,000 | Low | Low |
Total annualised saving: $420,000. The cluster went from $700k/year to approximately $280k/year. Application performance improved (lower latency due to less CPU throttling). Two incidents during the migration, both caught in staging, neither reaching production.
What not to cut
A few places where we deliberately left money on the table because the risk wasn't worth it:
- Observability infrastructure. Prometheus, Alertmanager, and the logging pipeline stayed fully provisioned on on-demand instances. Losing visibility during the optimisation work would have been expensive in a different way.
- PodDisruptionBudgets on core services. We set minimum available replicas conservatively. Saving $3k/year by running a payment service at a single replica is not a trade anyone should make.
- Cluster control plane costs. EKS control plane is $0.10/hour per cluster - about $876/year. There were suggestions to consolidate development clusters to save this. We kept them separate. The blast radius of a misconfigured policy hitting production because it was tested in the same cluster isn't worth the saving.
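As an illustration of the PodDisruptionBudget point above, a conservative budget for a core service could look like the sketch below. The `payments` name, namespace, and `minAvailable: 2` are placeholders, not the client's actual configuration.

```yaml
# Hypothetical PDB; service name, namespace, and minimum are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
  namespace: production
spec:
  minAvailable: 2          # never voluntarily drop below two ready replicas
  selector:
    matchLabels:
      app: payments
```

A budget like this constrains voluntary disruptions such as node drains and Karpenter consolidation, so aggressive bin-packing elsewhere in the cluster cannot shrink a critical service below its floor.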
Where to start if you're looking at your own bill
Not every cluster has $420k of waste in it. But most clusters with more than 20 workloads have been running long enough that original resource requests are substantially out of date. The two fastest places to look:
- Run Goldilocks or Kubecost against your largest namespaces and look at the ratio of actual usage to requested resources for your top 10 workloads by CPU request. If that ratio is below 40%, you have rightsizing work to do.
- Check your Spot adoption rate. If it's zero and you have stateless workloads, you're leaving 60%+ savings on the table for those services. Start with a single low-risk deployment as a proof of concept.
If you're running EKS and want a second set of eyes on your cluster configuration before making changes, we run free 60-minute cost review sessions for engineering teams - no sales process, just an engineer looking at your numbers with you.