There is a version of the GitHub Actions Kubernetes tutorial that is everywhere. It uses azure/k8s-deploy or a raw kubectl set image, waits for rollout status, and declares success. If you have shipped an application that way, you have shipped it with a gap between "rollout complete" and "application actually working." Rollout status tells you that the new pods started and passed their readiness probes. It does not tell you that the new container can accept real traffic, authenticate correctly, connect to its database, or respond to your healthcheck endpoint inside the timeout window.
We learned this the uncomfortable way. A microservice that initialised an external cache connection on first request - rather than at startup - would consistently pass a rollout status check and fail silently for the first 20–30 seconds of live traffic as the connection pool warmed up. The pod was running. The readiness probe was passing. Users were getting errors. The deployment workflow had already marked itself green.
This guide documents the pattern we settled on after that incident and have used since. It is not the simplest version of a GitHub Actions Kubernetes pipeline. It is the version that catches the failures the simple version misses.
Workflow structure and what each stage is actually checking
The full workflow has five jobs that run in sequence with one conditional branch. Understanding what each job is checking - not just what it does - is the difference between a pipeline that gives you confidence and one that just produces green ticks.
| Job | What it checks | Failure means |
|---|---|---|
| test-matrix | Unit + integration tests across Node versions | Code is broken - stop here, do not build |
| build-push | Image builds cleanly, pushes to GHCR | Build failure or registry authentication problem |
| scan | Trivy CVE scan against pushed image | Known HIGH/CRITICAL vulnerabilities in image |
| canary-deploy | 10% canary receives traffic, smoke test passes | Application fails under real request conditions |
| full-rollout | 100% traffic promoted to new image | Stable rollout did not complete within its timeout |
The scan job starts as soon as the image is pushed, and canary-deploy lists it in needs, so the canary is never applied until the scan has passed. By the time the canary is ready to receive traffic, the CVE report has already been evaluated and has either blocked or cleared the image. The cost is the scan's own duration on the critical path, which is roughly a minute in our pipelines.
Parallel test matrix with dependency caching
The first job runs your test suite across a matrix of Node versions - or whatever runtime matrix is relevant to your stack. The reason to run a matrix rather than a single version in CI is not pedantry about compatibility. It is that matrix jobs in GitHub Actions run in parallel, and parallelism on the test job is the single highest-value optimisation available on a typical team's pipeline. Three Node versions running simultaneously takes roughly the same wall-clock time as one version.
```yaml
name: deploy

on:
  push:
    branches: [main]
  workflow_dispatch: # Allows: gh workflow run deploy.yml --ref main

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test-matrix:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18, 20, 22]
      fail-fast: true # Cancel siblings on first failure
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: npm # Caches npm's download cache, not node_modules
      - run: npm ci --prefer-offline
      - run: npm test -- --ci --coverage
      - name: Upload coverage
        uses: actions/upload-artifact@v4
        with:
          name: coverage-node${{ matrix.node-version }}
          path: coverage/
```
With fail-fast: true, if Node 18 fails its tests, GitHub Actions cancels the Node 20 and Node 22 jobs immediately rather than letting them run to completion. This cuts wasted runner minutes on a broken build. The trade-off is that you only see failures from the first job to fail, not all three. For most teams, that is the right trade-off - fix the first failure, re-run.
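If you would rather pay the extra runner minutes and see every failing Node version in a single run, the only change is the fail-fast flag. This is a sketch of that variation, not what our pipeline uses:

```yaml
jobs:
  test-matrix:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false # Let every matrix leg finish and report its own result
      matrix:
        node-version: [18, 20, 22]
    # steps: unchanged from the test-matrix job above
```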
Image build, GHCR push, and Trivy scan
The build job uses Docker Buildx with layer caching through the GitHub Actions cache backend. Without explicit cache configuration, every build on GitHub-hosted runners starts cold - every layer is rebuilt from scratch. With the cache-from and cache-to configuration below, unchanged layers are restored from the cache and only changed layers are rebuilt. On a medium-complexity Node application, this typically cuts build time from 3–4 minutes to 40–90 seconds.
```yaml
  build-push:
    needs: test-matrix
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write # GITHUB_TOKEN needs this scope to push to GHCR
    outputs:
      image-digest: ${{ steps.push.outputs.digest }}
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,format=short
      - name: Set up Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build and push
        id: push
        uses: docker/build-push-action@v6
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          provenance: false # Avoid multi-platform manifest on single-arch build

  scan:
    needs: build-push
    runs-on: ubuntu-latest
    steps:
      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ needs.build-push.outputs.image-tag }}
          format: table
          exit-code: '1' # Fail the job on any HIGH or CRITICAL finding
          severity: HIGH,CRITICAL
          ignore-unfixed: true # Skip CVEs with no available patch
```
ignore-unfixed: true is intentional, not permissive. CVEs with no available patch in the base image or dependency tree are noise in a deployment gate. They block every pipeline run without giving engineers anything actionable to fix. Unfixed CVEs belong in a separate tracking process - a weekly report, a Jira board, a Dependabot alert - not in the path between a code change and a deployment.
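One way to run that separate tracking process is a scheduled workflow that reports unfixed CVEs without gating anything. The sketch below is an assumption on our part: the file name, the cron cadence, and the image reference are placeholders, and it reuses only the Trivy inputs already shown in the scan job.

```yaml
# Hypothetical file: .github/workflows/cve-report.yml
name: cve-report

on:
  schedule:
    - cron: "0 6 * * 1" # Mondays 06:00 UTC; pick whatever cadence suits the team
  workflow_dispatch:

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - name: Trivy report including unfixed CVEs
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/OWNER/REPO:SHA # Placeholder: the image currently in production
          format: table
          exit-code: '0' # Report only, never fail the run
          severity: HIGH,CRITICAL
          ignore-unfixed: false # Include CVEs that have no patch yet
```

Because exit-code is '0', this job always succeeds; the table in the run log is the report.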
Canary deployment without a service mesh
A canary deployment splits traffic between the current stable version and the new version, so that a small percentage of real requests hit the new code before you commit to a full rollout. Most guides implement this with a service mesh like Istio or Linkerd, which gives you precise traffic weight control. What they do not tell you is that you can implement a functional canary in plain Kubernetes using two separate Deployments and a single Service - no mesh required.
The mechanism is the Kubernetes Service selector. A Service routes traffic to all pods matching its selector, proportionally by replica count. If your stable Deployment has nine replicas and your canary Deployment has one replica, the Service sends roughly 10% of traffic to the canary pod. This is not weighted routing in the Istio sense - it is simple probability based on available pod count - but it is sufficient for a health check gate without adding a service mesh dependency to your cluster.
```yaml
# This manifest is only applied during the canary window.
# It is deleted after promotion or rollback.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
  namespace: production
  labels:
    app: api
    track: canary
spec:
  replicas: 1 # 1 canary : 9 stable = ~10% traffic
  selector:
    matchLabels:
      app: api
      track: canary
  template:
    metadata:
      labels:
        app: api # Must match the Service selector
        track: canary
    spec:
      containers:
        - name: api
          image: ghcr.io/OWNER/REPO:SHA # Replaced by workflow
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          resources:
            requests: { cpu: "200m", memory: "256Mi" }
            limits: { cpu: "800m", memory: "256Mi" }
```
Your existing production Service should select on app: api only - not on track. Both the stable Deployment (labelled track: stable) and the canary Deployment (labelled track: canary) match the Service selector, so both receive traffic in proportion to their replica count. This is the entire mechanism - no additional configuration required.
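For reference, a Service that behaves this way could look like the sketch below. The name api-service and port 3000 are taken from the smoke test URL and the readiness probe in this guide; the port mapping itself is our assumption about your setup.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    app: api # No track label here, so stable and canary pods both match
  ports:
    - port: 3000
      targetPort: 3000 # Assumed to match the container port used by the readiness probe
```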
Automated smoke test and rollback trigger
The smoke test job is the part most pipelines skip and the part that catches the most production failures. A smoke test in this context is not a full integration test suite - it is a small number of HTTP requests against the canary pod's endpoint, with assertions on response codes and response time, run from inside the cluster after the canary pod reaches the Ready state.
Running the smoke test from inside the cluster rather than through an external URL is important. It tests the actual path traffic will take to the pod - through the Service, through kube-proxy - not a path through a load balancer or CDN that may have cached behaviour. A pod can be Ready and reachable internally while still not being routable externally due to an ingress misconfiguration. The smoke test catches internal routing problems; your existing synthetic monitoring catches external ones.
```yaml
  canary-deploy:
    needs: [build-push, scan]
    runs-on: ubuntu-latest
    env:
      KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_PROD }}
    steps:
      - uses: actions/checkout@v4
      - name: Write kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "$KUBECONFIG_DATA" | base64 -d > ~/.kube/config
          chmod 600 ~/.kube/config
      - name: Apply canary manifest
        run: |
          IMAGE_TAG="${{ needs.build-push.outputs.image-tag }}"
          sed "s|ghcr.io/OWNER/REPO:SHA|${IMAGE_TAG}|g" \
            k8s/canary-deployment.yaml | kubectl apply -f -
      - name: Wait for canary pod readiness
        run: |
          kubectl rollout status deployment/api-canary \
            -n production --timeout=120s
      - name: Smoke test via in-cluster curl
        id: smoke
        run: |
          # Run curl from a temp pod inside the cluster.
          # api-service fronts stable and canary pods, so any single
          # request may land on either track.
          kubectl run smoke-$GITHUB_RUN_ID \
            --image=curlimages/curl:8.7.1 \
            --restart=Never \
            --rm \
            --attach \
            -n production \
            -- sh -c '
              STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
                http://api-service.production.svc.cluster.local:3000/healthz)
              echo "Healthcheck status: $STATUS"
              [ "$STATUS" = "200" ] || exit 1
              LATENCY=$(curl -s -o /dev/null \
                -w "%{time_total}" \
                http://api-service.production.svc.cluster.local:3000/healthz)
              echo "Latency: ${LATENCY}s"
              awk "BEGIN{exit ($LATENCY > 1.5)}" || exit 1
            '
      - name: Rollback and notify on smoke failure
        if: failure() && steps.smoke.outcome == 'failure'
        run: |
          echo "Smoke test failed - deleting canary deployment"
          kubectl delete deployment api-canary -n production --ignore-not-found
          # Slack notification via webhook
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-type: application/json' \
            -d "{
              \"text\": \"⚠️ Canary deploy FAILED for \`${{ github.repository }}\`\",
              \"blocks\": [{
                \"type\": \"section\",
                \"text\": {
                  \"type\": \"mrkdwn\",
                  \"text\": \"*Canary smoke test failed*\nCommit: \`${{ github.sha }}\`\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"
                }
              }]
            }"
          exit 1 # Fail the job so full-rollout is blocked

  full-rollout:
    # build-push must be listed here: a job can only read outputs
    # from jobs named in its own needs.
    needs: [build-push, canary-deploy]
    runs-on: ubuntu-latest
    env:
      KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_PROD }}
    steps:
      - uses: actions/checkout@v4
      - name: Write kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "$KUBECONFIG_DATA" | base64 -d > ~/.kube/config
          chmod 600 ~/.kube/config
      - name: Update stable deployment image
        run: |
          IMAGE_TAG="${{ needs.build-push.outputs.image-tag }}"
          kubectl set image deployment/api \
            api="${IMAGE_TAG}" \
            -n production
      - name: Wait for stable rollout
        run: |
          kubectl rollout status deployment/api \
            -n production --timeout=300s
      - name: Clean up canary deployment
        run: |
          kubectl delete deployment api-canary \
            -n production --ignore-not-found
```
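One caveat with the smoke test as written: api-service selects both tracks, so with a 9:1 replica split a single request through it reaches the canary only about one time in ten. If you want every smoke request to hit the canary while still going through a Service and kube-proxy, one option is a second Service that selects the canary track only. This is a sketch under that assumption; api-canary-service is a hypothetical name, and the smoke test URL would need to point at it.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-canary-service # Hypothetical name, not part of the workflow above
  namespace: production
spec:
  selector:
    app: api
    track: canary # Only canary pods match
  ports:
    - port: 3000
      targetPort: 3000
```

A Service like this can stay in the cluster permanently. Outside the canary window it simply has no endpoints, so it can be created once by a cluster administrator and the CI runner needs no extra permissions for it.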
"Rollout status tells you pods are running. It does not tell you the application can handle a request. Those are different things. The smoke test checks the second one."
- Internal postmortem, cache warm-up incident, Q3 2025

RBAC and secrets - what the tutorials skip
Almost every GitHub Actions Kubernetes tutorial either uses a cluster-admin kubeconfig or quietly generates one with a tool like doctl or the AWS CLI without discussing what permissions are actually included. Cluster-admin access for a CI runner means that a compromised workflow or a supply chain attack on an action you use can do anything to your cluster. Restricting the runner to the minimum permissions required is not optional hardening - it is basic hygiene.
The workflow above requires exactly these permissions in the production namespace: get/list/watch/create/update/patch/delete on Deployments (delete is what removes the canary), get/list/watch/create/delete on Pods (create and delete are for the smoke test runner), get on pods/log, and get/list on Services. That is it. Nothing at the cluster level.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: github-actions-deployer
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-manager
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: github-actions-deployer-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: github-actions-deployer
    namespace: production
roleRef:
  kind: Role
  name: deployment-manager
  apiGroup: rbac.authorization.k8s.io
```
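To generate the kubeconfig secret from this ServiceAccount, create a long-lived token for it (since Kubernetes 1.24, token Secrets are no longer created automatically for ServiceAccounts, so you have to create one explicitly). Then build a kubeconfig that embeds that token together with the cluster's API server URL and CA certificate, base64-encode the whole kubeconfig file, and store the result as the KUBECONFIG_PROD GitHub Actions secret. The workflow decodes it back into ~/.kube/config. Do not use your personal kubeconfig. Do not use a kubeconfig that can list or modify resources in any namespace other than production.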
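As a sketch of the explicit token step, the Secret below asks Kubernetes to populate a long-lived token for the ServiceAccount defined earlier. The Secret name is hypothetical; the type and annotation key are the standard ones for ServiceAccount token Secrets.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: github-actions-deployer-token # Hypothetical name
  namespace: production
  annotations:
    kubernetes.io/service-account.name: github-actions-deployer
type: kubernetes.io/service-account-token
```

Once the control plane has filled in the Secret's token and ca.crt fields, a kubeconfig built from them could look like this. Every upper-case value is a placeholder, and the cluster and context names are illustrative.

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: prod
    cluster:
      server: https://CLUSTER_API_SERVER # Placeholder: your API server URL
      certificate-authority-data: BASE64_CA_CERT # Placeholder: ca.crt as stored in the Secret (already base64)
contexts:
  - name: deployer@prod
    context:
      cluster: prod
      namespace: production
      user: github-actions-deployer
current-context: deployer@prod
users:
  - name: github-actions-deployer
    user:
      token: SERVICE_ACCOUNT_TOKEN # Placeholder: the Secret's token field, base64-decoded
```

The base64-encoded form of this file is what goes into KUBECONFIG_PROD.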
What this pattern actually catches that the simple version does not
The value of this workflow is not in its complexity - it is in the specific failure modes it surfaces before they affect your full production traffic. These are the failures we have seen it catch in the six months since we standardised on this pattern across client deployments.
- Application startup failures that pass readiness probes. A readiness probe checks a fixed endpoint at a fixed interval. An application that passes the probe but crashes on the first real request - due to a missing environment variable, a misconfigured secrets mount, or a lazy-initialised external connection - passes a rollout status check and fails a smoke test.
- Response time regressions introduced by dependency updates. The smoke test's latency assertion (1.5 seconds in the example above) catches slow cold-start behaviour, new synchronous external calls introduced by a library update, or N+1 query patterns introduced by an ORM version change.
- CVEs introduced via base image updates or transitive dependency pulls. Trivy catches these before the image reaches your cluster. Without the scan step, an npm update that pulls in a vulnerable transitive dependency goes undetected until your next scheduled vulnerability scan.
- RBAC and secrets misconfiguration. Running smoke tests with a least-privilege ServiceAccount means that if a secrets mount is missing or an environment variable is not set, the application fails during the canary window rather than silently serving errors to the subset of users who happen to hit a code path that requires the missing configuration.
The workflow described here is not the fastest path from commit to deployment. It adds roughly four minutes to a typical pipeline - two minutes for the test matrix to run in parallel, one minute for the Trivy scan, and another minute for the canary to stabilise and the smoke test to complete. Those four minutes have a good return. The alternative cost of a production incident caused by something this pipeline catches is usually measured in hours.
If you are building a CI/CD pipeline for a Kubernetes-based platform and want a second opinion on the workflow design, security boundaries, or canary strategy, the DevOps team at Sequere reviews existing pipelines as a standalone engagement.