Two weeks before go-live, a client's platform team ran a load test. The results looked fine. A week later, after deploying to production with real traffic, three of their five API pods were evicted within 90 minutes of a traffic spike. The Horizontal Pod Autoscaler had scaled up, but the nodes could not accommodate the new pods because resource requests were specified too generously, and the cluster autoscaler had not yet caught up.
It took four hours to stabilise. None of it was complicated to fix in hindsight, and all of it would have been preventable if someone had asked the right questions during cluster setup rather than after the first incident.
This guide is the set of things we wish someone had told us before our first production Kubernetes deployment - and the things we now check on every cluster review we do for clients.
Resource Requests and Limits: Getting the Numbers Right
The most common production Kubernetes problem is not a bug in application code. It is resource misconfiguration - requests set too low (causing overcommitted nodes and evictions under load), requests set too high (leaving pods that cannot be scheduled), or limits set too high (causing node pressure and CPU contention for other pods). All of these show up in production in ways that are genuinely hard to diagnose if you do not know what you are looking for.
The rule of thumb we use: set CPU requests based on average consumption from profiling, set memory requests based on peak observed usage, set CPU limits to 2-4× the request, and set memory limits equal to the request. Memory limits that are too tight cause OOMKilled restarts that look like application bugs.
```yaml
# Measured from staging: p95 CPU ~180m, peak memory ~420Mi
resources:
  requests:
    cpu: "200m"       # Slightly above p95
    memory: "512Mi"   # Peak + 20% headroom
  limits:
    cpu: "800m"       # 4× request - handles burst
    memory: "512Mi"   # Same as request - OOMKill is intentional
---
# LimitRange to enforce defaults for every container in a namespace
# (LimitRange is namespaced, so apply one per namespace):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:   # Applied when a container sets no requests
        cpu: "100m"
        memory: "128Mi"
      default:          # Applied when a container sets no limits
        cpu: "500m"
        memory: "256Mi"
```
Setting memory limits equal to requests means the scheduler knows exactly how much memory a pod will actually need. If it exceeds that amount, it gets OOMKilled - which is a clean, observable failure mode that restarts the pod. The alternative is a pod that slowly consumes all available node memory, triggering evictions of other pods without any obvious signal that your pod is the cause.
Pod Disruption Budgets: Configure Them Before You Need Them
A Pod Disruption Budget tells Kubernetes how many pods of a given deployment can be unavailable at any one time during voluntary disruptions - node drains, cluster upgrades, or maintenance operations. Without one, a node drain can terminate all pods of a deployment simultaneously if they happen to land on the same node.
The reason teams get caught out is timing. PDBs need to exist before a disruption event. Configuring them mid-upgrade - after the drain has already started - does not retroactively protect pods that were already evicted. And cluster upgrade windows are rarely forgiving enough to pause and wait for configuration changes to propagate.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2          # Always keep at least 2 replicas
  selector:
    matchLabels:
      app: api-server
---
# Alternatively, use maxUnavailable (spec fragment only):
spec:
  maxUnavailable: "25%"    # Never take down more than 25% at once
  selector:
    matchLabels:
      app: worker
```
Single-replica deployments and PDBs do not mix. If you configure minAvailable: 1 on a single-replica deployment, node drains will block indefinitely because Kubernetes cannot satisfy the PDB - it cannot evict the only pod while maintaining at least one running. Either run 2+ replicas for anything that needs a PDB, or accept that single-replica deployments will be disrupted during maintenance.
HPA vs VPA: Why Running Both Is Asking for Trouble
The Horizontal Pod Autoscaler scales the number of pod replicas based on CPU/memory utilisation or custom metrics. The Vertical Pod Autoscaler adjusts the resource requests of individual pods. Both are useful. Running both on the same deployment at the same time - without explicit configuration to prevent conflicts - is not.
The specific failure mode: VPA recommends increasing memory requests for a pod and evicts the pod to apply the new request. HPA measures utilisation as a percentage of the request, so the changed request shifts the utilisation it sees, and it adjusts the replica count in response. The new replica count changes per-pod usage, VPA revises its recommendation, and the evictions start again. This loop produces oscillating replica counts, billing spikes on your node pool, and confusing metrics in your observability stack.
| Scenario | Recommended Approach | Reason |
|---|---|---|
| Stateless API with variable traffic | HPA only | Scale replicas horizontally - resource requests should be stable. |
| Batch workers with unpredictable memory needs | VPA only | Right-size individual pods - replica count is less important. |
| Both traffic spikes and memory variability | HPA + VPA with Off mode | Use VPA in Off mode for recommendations only, apply manually. |
| Any deployment without profiling data | Neither yet | Profile first. Autoscalers amplify misconfiguration. |
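
The "HPA + VPA with Off mode" row can be sketched as two manifests. This is a minimal sketch rather than a drop-in configuration: the `api-server` Deployment name is borrowed from the PDB example, the replica bounds and the 70% CPU target are placeholder assumptions, and the `VerticalPodAutoscaler` kind is only available if the VPA components are installed in the cluster.

```yaml
# HPA owns the replica count (assumed bounds and target - tune from profiling).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# VPA only observes: "Off" produces recommendations but never evicts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"
```

With this split, the VPA's recommendations appear in the object's status, and someone applies them to the Deployment's resource requests by hand during a planned change rather than letting VPA evict pods while HPA is scaling.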
"Autoscalers are not a substitute for understanding your workload's resource profile. They are tools that make good configuration better and bad configuration worse."
- Internal review, post-incident, client cluster audit
Karpenter vs Cluster Autoscaler on AWS EKS
Cluster Autoscaler has been the standard since before most teams were running Kubernetes in production. It works by adjusting Auto Scaling Group sizes and is well understood. Karpenter, which AWS announced as production-ready in late 2021, takes a different approach - it provisions nodes directly using EC2 APIs, without ASG intermediaries, which makes it significantly faster at responding to unschedulable pods.
The practical difference: Cluster Autoscaler typically takes 2-3 minutes to provision a new node after a pod becomes unschedulable. Karpenter typically takes 30-60 seconds. For workloads with spiky traffic patterns, that gap matters. For steady-state workloads, it probably does not.
Karpenter is worth the migration cost if: (1) your workloads have short-duration traffic spikes where provisioning latency affects user experience, (2) you want fine-grained control over Spot instance selection across multiple instance families, or (3) you are starting a new cluster and can configure it properly from the beginning. If your existing cluster is stable with Cluster Autoscaler, the migration is probably not worth the disruption unless you have a specific performance problem to solve.
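
For point (2), Spot selection across instance families is expressed as requirements on a Karpenter NodePool. The sketch below assumes the `karpenter.sh/v1` API and an existing `EC2NodeClass` named `default`; older Karpenter releases use different API versions and field names, and the instance categories and CPU ceiling here are illustrative assumptions rather than recommendations.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose        # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed to exist already
      requirements:
        # Allow both Spot and On-Demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Let Karpenter choose across several instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
  limits:
    cpu: "200"                 # Upper bound on total CPU this pool may provision
```

The wider the set of allowed instance types, the more options Karpenter has when a particular Spot pool is unavailable, so check the exact schema against the Karpenter version you are running before applying anything like this.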
Namespace Isolation: Quotas and RBAC Before Access
The pattern that causes the most operational headaches in multi-team clusters is giving teams namespace access before ResourceQuotas and RBAC policies are in place. Once a team has deployed workloads without resource constraints, adding quotas retroactively causes immediate disruption - existing pods are not affected, but any new pod that would push the namespace over the newly applied quota is rejected at creation, including replacements for pods that restart or are rescheduled.
The sequence that works: create the namespace, apply the ResourceQuota and LimitRange, set up RBAC with least-privilege roles, then grant access. Not the other way around.
```yaml
# Apply this before granting any team access to a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "8"
    requests.memory: "16Gi"
    limits.cpu: "16"
    limits.memory: "32Gi"
    pods: "40"
    services: "10"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-developer
  namespace: team-payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    # The log subresource is "pods/log" (singular)
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
# Note: no delete on pods, no access to secrets
```
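
The last step in the sequence, granting access, is a RoleBinding that ties the `team-developer` Role to the team's identities. A minimal sketch, assuming the team is represented as a group; the group name `team-payments-developers` is hypothetical and depends on how your identity provider maps users into Kubernetes.

```yaml
# Apply last, once the quota, LimitRange, and Role are already in place
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-developer-binding
  namespace: team-payments
subjects:
  - kind: Group
    name: team-payments-developers   # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-developer
  apiGroup: rbac.authorization.k8s.io
```

Keeping the binding in a separate manifest from the quota and Role makes the ordering explicit: nothing in the namespace is usable by the team until this final object exists.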
Observability: You Cannot Debug What You Cannot See
Every production Kubernetes cluster needs three things before the first workload goes live: structured logging aggregated to a central store, metrics with alerts on the signals that actually matter, and distributed tracing if you are running microservices. Not "nice to have once things stabilise." Before go-live.
The specific metrics that matter in production are not the ones that come pre-configured in most Grafana dashboards. They are:
- Pod eviction rate per namespace - rising eviction rate is usually a resource pressure problem, not an application problem.
- Pending pod count and duration - pods stuck in Pending state indicate scheduling pressure, missing PVCs, or quota exhaustion.
- Node memory pressure and disk pressure conditions - these precede evictions by several minutes and are actionable.
- OOMKilled container rate - if this is non-zero, memory limits are wrong somewhere.
- HPA current/desired replica delta - a persistent gap means autoscaling is constrained by something (node capacity, PDBs, or quota limits).
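
The signals above can be expressed as alerting rules. The sketch below assumes Prometheus with kube-state-metrics being scraped, which this guide does not otherwise prescribe; the metric names are kube-state-metrics series, while the thresholds, `for` durations, and alert names are placeholder assumptions to adapt to your own stack.

```yaml
groups:
  - name: cluster-pressure
    rules:
      - alert: PodsEvicted
        expr: sum by (namespace) (kube_pod_status_reason{reason="Evicted"}) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Evicted pods in {{ $labels.namespace }}"
      - alert: PodsPendingTooLong
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pods stuck in Pending in {{ $labels.namespace }}"
      - alert: NodeUnderPressure
        expr: kube_node_status_condition{condition=~"MemoryPressure|DiskPressure", status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.node }} reports {{ $labels.condition }}"
      - alert: ContainerOOMKilled
        # A recent restart whose last termination reason was OOMKilled
        expr: |
          increase(kube_pod_container_status_restarts_total[15m]) > 0
          and on (namespace, pod, container)
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
      - alert: HPAReplicaGap
        expr: kube_horizontalpodautoscaler_status_desired_replicas != kube_horizontalpodautoscaler_status_current_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} cannot reach its desired replica count"
```

The `for` clauses matter as much as the expressions: a pod that is Pending for thirty seconds during a rollout is normal, while one that is Pending for ten minutes usually is not.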
The Pre-Go-Live Checklist
This is the specific list we run through on every cluster before production traffic goes anywhere near it. It is not exhaustive, but it covers the failures we have seen most often.
1. Every deployment has explicit resource requests and limits
No deployment should rely on namespace LimitRange defaults for its resource configuration. Defaults exist for unconfigured deployments that slip through - not as a substitute for explicit configuration on the workloads you actually care about.
2. PDBs exist for every deployment with more than one replica
If a deployment is worth running with multiple replicas, it is worth protecting during disruptions. PDBs take five minutes to write. Node drain incidents without them take much longer to recover from.
3. Autoscaling has been tested with simulated load, not assumed
Run a load test that actually triggers HPA scaling events before go-live. Verify that the cluster autoscaler (or Karpenter) responds within the time window your SLA allows. Discovering that node provisioning takes longer than expected during a real traffic spike is not a good way to spend an evening.
4. The runbook for a node failure exists and has been read by the on-call team
Kubernetes will handle many failure scenarios automatically. Some it will not. The on-call team should know the difference before midnight on a Saturday, not during it.
None of this is glamorous infrastructure work. It does not show up in talks about distributed systems architecture or the latest autoscaling algorithms. It is the unglamorous configuration layer that determines whether your Kubernetes cluster is reliable or whether it is a source of 2am alerts.
If you are working through cluster configuration before a production launch, or reviewing an existing cluster that has been causing operational problems, the DevOps team at Sequere does cluster audits as a standalone engagement, typically one week long, that produces a prioritised remediation list. Get in touch if that would be useful.