GitHub Actions Zero-Downtime Deployment to Kubernetes - A Working Pattern

Most GitHub Actions Kubernetes examples push an image and call it done. No canary stage, no health check gate, no rollback trigger. This is the workflow we actually run - parallel test matrix, Trivy container scan, canary deploy with a live smoke test, and an automated rollback that fires before a human is even paged.

What this guide covers - 5 things you will leave with
  • A complete, production-tested GitHub Actions workflow that runs a parallel test matrix, builds and pushes to GHCR, scans the image with Trivy, deploys a canary, runs a smoke test, and only then promotes to full rollout.
  • Why the standard "build → push → kubectl apply" pattern creates silent failure modes and what a health check gate actually catches that rollout status does not.
  • How to structure a canary deployment in Kubernetes using a separate Deployment manifest - without a service mesh - so that 10% of traffic reaches the new version before you commit to the full rollout.
  • An automated rollback step that triggers on smoke test failure, deletes the canary so that all traffic returns to the stable version, and posts a failure summary to Slack - all before an on-call alert fires.
  • The exact GitHub Actions secrets and RBAC configuration needed so your runner can deploy to a namespace without getting cluster-admin, which most tutorials silently require.

There is a version of the GitHub Actions Kubernetes tutorial that is everywhere. It uses azure/k8s-deploy or a raw kubectl set image, waits for rollout status, and declares success. If you have shipped an application that way, you have shipped it with a gap between "rollout complete" and "application actually working." Rollout status tells you that the new pods started and passed their readiness probes. It does not tell you that the new container can serve a real request: authenticate correctly, connect to its database, or answer within your latency budget.

We learned this the uncomfortable way. A microservice that initialised an external cache connection on first request - rather than at startup - would consistently pass a rollout status check and fail silently for the first 20–30 seconds of live traffic as the connection pool warmed up. The pod was running. The readiness probe was passing. Users were getting errors. The deployment workflow had already marked itself green.

This guide documents the pattern we settled on after that incident and have used since. It is not the simplest version of a GitHub Actions Kubernetes pipeline. It is the version that catches the failures the simple version misses.

Workflow structure and what each stage is actually checking

The full workflow has five jobs that run in sequence with one conditional branch. Understanding what each job is checking - not just what it does - is the difference between a pipeline that gives you confidence and one that just produces green ticks.

| Job | What it checks | Failure means |
|---|---|---|
| test-matrix | Unit + integration tests across Node versions | Code is broken - stop here, do not build |
| build-push | Image builds cleanly, pushes to GHCR | Build failure or registry authentication problem |
| scan | Trivy CVE scan against pushed image | Known HIGH/CRITICAL vulnerabilities in image |
| canary-deploy | 10% canary receives traffic, smoke test passes | Application fails under real request conditions |
| full-rollout | 100% traffic promoted to new image | Stable rollout failed or timed out after the canary passed |

The scan job sits between build-push and canary-deploy: canary-deploy declares needs: [build-push, scan], so no canary pod is created until the CVE report has been evaluated and the image has been either blocked or cleared. That puts Trivy on the critical path, at a cost of roughly a minute per run, in exchange for never sending production traffic to an unscanned image.
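
The ordering follows from the needs: keys in the job definitions in this guide. As a rough sketch, here is the same dependency chain with every job's real steps replaced by a placeholder echo; the workflow name is hypothetical and exists only for this illustration:

```yaml
# Dependency skeleton only: each job's real steps are replaced by a placeholder echo.
name: deploy-skeleton   # hypothetical name, not the real workflow
on: workflow_dispatch

jobs:
  test-matrix:
    runs-on: ubuntu-latest
    steps:
      - run: echo "tests"

  build-push:
    needs: test-matrix
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push"

  scan:
    needs: build-push
    runs-on: ubuntu-latest
    steps:
      - run: echo "trivy scan"

  canary-deploy:
    needs: [build-push, scan]   # waits for both; a failed scan skips this job
    runs-on: ubuntu-latest
    steps:
      - run: echo "canary + smoke test"

  full-rollout:
    needs: canary-deploy
    runs-on: ubuntu-latest
    steps:
      - run: echo "promote"
```

A job whose needs: list contains a failed job is skipped by default, which is what turns each stage into a gate for the next.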

Parallel test matrix with dependency caching

The first job runs your test suite across a matrix of Node versions - or whatever runtime matrix is relevant to your stack. The reason to run a matrix rather than a single version in CI is not pedantry about compatibility. It is that matrix jobs in GitHub Actions run in parallel, so the extra coverage costs almost no extra wall-clock time: three Node versions running simultaneously take roughly as long as one, provided enough runners are free. In our experience, parallelism on the test job is the highest-value optimisation available on a typical team's pipeline.

YAML - .github/workflows/deploy.yml (test job)
name: deploy
on:
  push:
    branches: [main]
  workflow_dispatch: # Allows: gh workflow run deploy.yml --ref main

env:
  REGISTRY:   ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:

  test-matrix:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18, 20, 22]
      fail-fast: true   # Cancel siblings on first failure
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: npm       # Caches npm's download cache (~/.npm), keyed on the lockfile; node_modules itself is not cached

      - run: npm ci --prefer-offline
      - run: npm test -- --ci --coverage

      - name: Upload coverage
        uses: actions/upload-artifact@v4
        with:
          name: coverage-node${{ matrix.node-version }}
          path: coverage/
Why fail-fast: true matters here

With fail-fast: true, if Node 18 fails its tests, GitHub Actions cancels the Node 20 and Node 22 jobs immediately rather than letting them run to completion. This cuts wasted runner minutes on a broken build. The trade-off is that you only see failures from the first job to fail, not all three. For most teams, that is the right trade-off - fix the first failure, re-run.
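
If you would rather see every failing version in a single run, for example on a scheduled compatibility build, the opposite setting is a one-line change. A minimal sketch of the strategy block only, assuming the same three Node versions as above:

```yaml
# Fragment of a job definition: only the strategy block differs from the test-matrix job.
strategy:
  fail-fast: false            # let every matrix leg run to completion
  matrix:
    node-version: [18, 20, 22]
```

The cost is runner minutes: a broken build now burns three full test runs instead of one partial run.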

Image build, GHCR push, and Trivy scan

The build job uses Docker Buildx with layer caching through the GitHub Actions cache backend. Without explicit cache configuration, every build on GitHub-hosted runners starts cold - every layer is rebuilt from scratch. With the cache-from and cache-to configuration below, unchanged layers are restored from the cache and only changed layers are rebuilt. On a medium-complexity Node application, this typically cuts build time from 3–4 minutes to 40–90 seconds.

YAML - build-push job + Trivy scan
  build-push:
    needs: test-matrix
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # Required for GITHUB_TOKEN to push to GHCR
    outputs:
      image-digest: ${{ steps.push.outputs.digest }}
      image-tag:    ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,format=short

      - name: Set up Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build and push
        id: push
        uses: docker/build-push-action@v6
        with:
          push:       true
          tags:       ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to:   type=gha,mode=max
          provenance: false  # Avoid multi-platform manifest on single-arch build

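  scan:
    needs: build-push
    runs-on: ubuntu-latest
    permissions:
      packages: read   # Lets GITHUB_TOKEN pull the image when the GHCR package is private
    steps:
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}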
      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master   # Consider pinning to a release tag or commit SHA
        with:
          image-ref:    ${{ needs.build-push.outputs.image-tag }}
          format:       table
          exit-code:    '1'      # Fail the job if any HIGH/CRITICAL finding is reported
          severity:     HIGH,CRITICAL
          ignore-unfixed: true  # Skip CVEs with no available patch

ignore-unfixed: true is intentional, not permissive. CVEs with no available patch in the base image or dependency tree are noise in a deployment gate. They block every pipeline run without giving engineers anything actionable to fix. Unfixed CVEs belong in a separate tracking process - a weekly report, a Jira board, a Dependabot alert - not in the path between a code change and a deployment.
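
One way to give unfixed CVEs that separate home is a scheduled workflow that reports them without blocking anything. The sketch below is an assumption about how such a report could look, not part of the pipeline described here: the workflow name, the cron schedule, and the report file name are all hypothetical, and the image reference is the same placeholder used in the canary manifest.

```yaml
name: weekly-cve-report        # hypothetical workflow name
on:
  schedule:
    - cron: "0 6 * * 1"        # Mondays at 06:00 UTC (arbitrary choice)
  workflow_dispatch:

jobs:
  report:
    runs-on: ubuntu-latest
    permissions:
      packages: read
    steps:
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Trivy report including unfixed CVEs
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/OWNER/REPO:SHA   # placeholder: substitute the deployed image
          format: table
          exit-code: '0'                      # report only, never fail the run
          severity: HIGH,CRITICAL
          ignore-unfixed: false               # include CVEs that have no patch yet
          output: trivy-report.txt            # hypothetical file name

      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: trivy-report
          path: trivy-report.txt
```

The deployment gate stays strict about what engineers can fix today, while the full list still lands somewhere a human will read it.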

Canary deployment without a service mesh

A canary deployment splits traffic between the current stable version and the new version, so that a small percentage of real requests hit the new code before you commit to a full rollout. Most guides implement this with a service mesh like Istio or Linkerd, which gives you precise traffic weight control. What they do not tell you is that you can implement a functional canary in plain Kubernetes using two separate Deployments and a single Service - no mesh required.

The mechanism is the Kubernetes Service selector. A Service spreads new connections roughly evenly across every ready pod that matches its selector, so each Deployment's share of traffic follows its share of the pods. If your stable Deployment has nine replicas and your canary Deployment has one, about one connection in ten (1 / (9 + 1) = 10%) lands on the canary pod. This is not weighted routing in the Istio sense - it is simple probability based on available pod count, and long-lived keep-alive connections can skew it - but it is sufficient for a health check gate without adding a service mesh dependency to your cluster.

YAML - k8s/canary-deployment.yaml
# This manifest is only applied during the canary window.
# It is deleted after promotion or rollback.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
  namespace: production
  labels:
    app: api
    track: canary
spec:
  replicas: 1           # 1 canary : 9 stable = ~10% traffic
  selector:
    matchLabels:
      app: api
      track: canary
  template:
    metadata:
      labels:
        app: api         # Must match the Service selector
        track: canary
    spec:
      containers:
      - name: api
        image: ghcr.io/OWNER/REPO:SHA   # Replaced by workflow
        readinessProbe:
          httpGet:
            path: /healthz
            port: 3000
          initialDelaySeconds: 5
          periodSeconds:       5
          failureThreshold:    3
        resources:
          requests: { cpu: "200m", memory: "256Mi" }
          limits:   { cpu: "800m", memory: "256Mi" }
The Service selector that makes this work

Your existing production Service should select on app: api only - not on track. Both the stable Deployment (labelled track: stable) and the canary Deployment (labelled track: canary) match the Service selector, so both receive traffic in proportion to their replica count. This is the entire mechanism - no additional configuration required.
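
The original manifests for the Service and the stable Deployment are not shown in this guide, so the following is a hedged reconstruction. The names api-service and api, the container name api, and port 3000 are taken from the workflow commands elsewhere in this guide; the replica count of nine comes from the 9:1 example, and everything else is a minimal assumption.

```yaml
# Service: selects on app only, so stable and canary pods both become endpoints.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    app: api            # deliberately no "track" key
  ports:
    - port: 3000
      targetPort: 3000
---
# Stable Deployment (abridged): same app label, different track label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 9           # 9 stable : 1 canary
  selector:
    matchLabels:
      app: api
      track: stable
  template:
    metadata:
      labels:
        app: api
        track: stable
    spec:
      containers:
        - name: api
          image: ghcr.io/OWNER/REPO:SHA   # placeholder, as in the canary manifest
          ports:
            - containerPort: 3000
```

Keeping track in each Deployment's own selector matters: it stops the two Deployments from ever claiming each other's pods, while the Service ignores the label entirely.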

Automated smoke test and rollback trigger

The smoke test is the part most pipelines skip and the part that catches the most production failures. In the workflow below it is a step inside the canary-deploy job rather than a separate job. A smoke test in this context is not a full integration test suite - it is a small number of HTTP requests, with assertions on response codes and response time, run from inside the cluster after the canary pod reaches the Ready state.

Running the smoke test from inside the cluster rather than through an external URL is important. It tests the actual path traffic will take to the pod - through the Service, through kube-proxy - not a path through a load balancer or CDN that may have cached behaviour. A pod can be Ready and reachable internally while still not being routable externally due to an ingress misconfiguration. The smoke test catches internal routing problems; your existing synthetic monitoring catches external ones. One caveat: the example sends its requests to api-service, which also fronts the nine stable pods, so any single request has only about a one-in-ten chance of landing on the canary. For a deterministic check, point the smoke test at a Service whose selector also includes track: canary, or send enough requests that a canary hit is near-certain.
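
To make the smoke test reach the canary on every request while still going through a Service and kube-proxy, one option is a second Service that selects only canary pods. This is a minimal sketch under stated assumptions: the name api-canary-service is hypothetical and does not appear in the original manifests.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-canary-service   # hypothetical name
  namespace: production
spec:
  type: ClusterIP
  selector:
    app: api
    track: canary            # only canary pods become endpoints
  ports:
    - name: http
      port: 3000
      targetPort: 3000
```

With this in place, the smoke test URL would become http://api-canary-service.production.svc.cluster.local:3000/healthz. If the Service is created once, out of band, the CI Role needs no extra permissions; if the workflow applies it on each run, the Role would also need create and patch on Services.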

YAML - canary-deploy job (smoke test and rollback steps) + full-rollout job
  canary-deploy:
    needs: [build-push, scan]
    runs-on: ubuntu-latest
    env:
      KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_PROD }}
    steps:
      - uses: actions/checkout@v4

      - name: Write kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "$KUBECONFIG_DATA" | base64 -d > ~/.kube/config
          chmod 600 ~/.kube/config

      - name: Apply canary manifest
        run: |
          IMAGE_TAG="${{ needs.build-push.outputs.image-tag }}"
          sed "s|ghcr.io/OWNER/REPO:SHA|${IMAGE_TAG}|g" \
            k8s/canary-deployment.yaml | kubectl apply -f -

      - name: Wait for canary pod readiness
        run: |
          kubectl rollout status deployment/api-canary \
            -n production --timeout=120s

      - name: Smoke test via in-cluster curl
        id: smoke
        run: |
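          # Run curl from a temp pod inside the cluster.
          # NOTE: api-service selects stable and canary pods alike, so only about
          # 1 request in 10 reaches the canary; target a canary-only Service
          # if you need every request to hit the new version.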
          kubectl run smoke-$GITHUB_RUN_ID \
            --image=curlimages/curl:8.7.1 \
            --restart=Never \
            --rm \
            --attach \
            -n production \
            -- sh -c '
              STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
                http://api-service.production.svc.cluster.local:3000/healthz)
              echo "Healthcheck status: $STATUS"
              [ "$STATUS" = "200" ] || exit 1

              LATENCY=$(curl -s -o /dev/null \
                -w "%{time_total}" \
                http://api-service.production.svc.cluster.local:3000/healthz)
              echo "Latency: ${LATENCY}s"
              awk "BEGIN{exit ($LATENCY > 1.5)}" || exit 1
            '

      - name: Rollback and notify on canary failure
        # failure() alone also covers a canary that never becomes Ready;
        # in that case the smoke step is skipped, so its outcome is not 'failure'.
        if: failure()
        run: |
          echo "Canary failed - deleting canary deployment"
          kubectl delete deployment api-canary -n production --ignore-not-found

          # Slack notification via webhook
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-type: application/json' \
            -d "{
              \"text\": \"⚠️ Canary deploy FAILED for \`${{ github.repository }}\`\",
              \"blocks\": [{
                \"type\": \"section\",
                \"text\": {
                  \"type\": \"mrkdwn\",
                  \"text\": \"*Canary readiness or smoke test failed*\nCommit: \`${{ github.sha }}\`\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"
                }
              }]
            }"
          exit 1   # Fail the job so full-rollout is blocked

  full-rollout:
    needs: canary-deploy
    runs-on: ubuntu-latest
    env:
      KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_PROD }}
    steps:
      - uses: actions/checkout@v4

      - name: Write kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "$KUBECONFIG_DATA" | base64 -d > ~/.kube/config
          chmod 600 ~/.kube/config

      - name: Update stable deployment image
        run: |
          IMAGE_TAG="${{ needs.build-push.outputs.image-tag }}"
          kubectl set image deployment/api \
            api="${IMAGE_TAG}" \
            -n production

      - name: Wait for stable rollout
        run: |
          kubectl rollout status deployment/api \
            -n production --timeout=300s

      - name: Clean up canary deployment
        run: |
          kubectl delete deployment api-canary \
            -n production --ignore-not-found

"Rollout status tells you pods are running. It does not tell you the application can handle a request. Those are different things. The smoke test checks the second one."

- Internal postmortem, cache warm-up incident, Q3 2025

RBAC and secrets - what the tutorials skip

Almost every GitHub Actions Kubernetes tutorial either uses a cluster-admin kubeconfig or quietly generates one with a tool like doctl or the AWS CLI without discussing what permissions are actually included. Cluster-admin access for a CI runner means that a compromised workflow or a supply chain attack on an action you use can do anything to your cluster. Restricting the runner to the minimum permissions required is not optional hardening - it is basic hygiene.

The workflow above needs only these permissions, all scoped to the production namespace: get/list/watch/create/update/patch/delete on Deployments (delete is what removes the canary), get/list/watch/create/delete on Pods (create and delete are for the smoke test pod), get on pods/log (to read the smoke test pod's output), and get/list on Services. That is it. Nothing at the cluster level.

YAML - k8s/cicd-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: github-actions-deployer
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-manager
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs:     ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs:     ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["services"]
  verbs:     ["get", "list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs:     ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: github-actions-deployer-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: github-actions-deployer
  namespace: production
roleRef:
  kind: Role
  name: deployment-manager
  apiGroup: rbac.authorization.k8s.io

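To generate the kubeconfig secret from this ServiceAccount, create a long-lived token (since Kubernetes 1.24, ServiceAccount token Secrets are no longer created automatically, so you must create one explicitly), embed that token in a kubeconfig alongside the cluster's API server URL and CA certificate, base64-encode the whole kubeconfig file, and store the result as the KUBECONFIG_PROD GitHub Actions secret. The workflow decodes the secret with base64 -d straight into ~/.kube/config, so it must be the encoded file, not the bare token. Do not use your personal kubeconfig. Do not use a kubeconfig that can list or modify resources in any namespace other than production.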
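
The explicit token creation can be done with a Secret of type kubernetes.io/service-account-token bound to the ServiceAccount by annotation. A minimal sketch; the Secret name is hypothetical:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: github-actions-deployer-token   # hypothetical name
  namespace: production
  annotations:
    # Binds the token to the ServiceAccount defined in k8s/cicd-rbac.yaml
    kubernetes.io/service-account.name: github-actions-deployer
type: kubernetes.io/service-account-token
```

Once applied, the control plane populates the Secret's token and ca.crt fields. Those two values, together with the API server URL, are what go into the kubeconfig. Because this token does not expire on its own, rotating it means deleting and recreating the Secret and then updating the GitHub Actions secret.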

What this pattern actually catches that the simple version does not

The value of this workflow is not in its complexity - it is in the specific failure modes it surfaces before they affect your full production traffic. These are the failures we have seen it catch in the six months since we standardised on this pattern across client deployments.

  • Application startup failures that pass readiness probes. A readiness probe checks a fixed endpoint at a fixed interval. An application that passes the probe but crashes on the first real request - due to a missing environment variable, a misconfigured secrets mount, or a lazy-initialised external connection - passes a rollout status check and fails a smoke test.
  • Response time regressions introduced by dependency updates. The smoke test's latency assertion (1.5 seconds in the example above) catches slow cold-start behaviour, new synchronous external calls introduced by a library update, or N+1 query patterns introduced by an ORM version change.
  • CVEs introduced via base image updates or transitive dependency pulls. Trivy catches these before the image reaches your cluster. Without the scan step, an npm update that pulls in a vulnerable transitive dependency goes undetected until your next scheduled vulnerability scan.
  • RBAC and secrets misconfiguration. The canary pod runs in the production namespace with the real secrets and configuration, so if a secrets mount is missing or an environment variable is not set, the application fails during the canary window rather than silently serving errors to the subset of users who happen to hit a code path that requires the missing configuration.

The workflow described here is not the fastest path from commit to deployment. It adds roughly four minutes to a typical pipeline - two minutes for the test matrix to run in parallel, one minute for the Trivy scan, and another minute for the canary to stabilise and the smoke test to complete. Those four minutes have a good return. The alternative cost of a production incident caused by something this pipeline catches is usually measured in hours.

If you are building a CI/CD pipeline for a Kubernetes-based platform and want a second opinion on the workflow design, security boundaries, or canary strategy, the DevOps team at Sequere reviews existing pipelines as a standalone engagement.