infra/cicd-plan.md

CI/CD Plan

Overview

Three distinct problems, tackled in phases:

  1. Does the config parse/validate without errors? (static, no credentials)
  2. Does the new Docker image actually exist and start? (pre-merge, needs Docker)
  3. Does the running service stay healthy through a deployment? (post-merge, needs Nomad canary)

The goal is: Renovate opens a PR → CI runs checks → you review → merge → canary starts automatically → you promote (or it auto-reverts).


Phase 1 — Static Validation (proves the runner works)

No secrets needed. Runs on every PR.

Infrastructure required

  • act_runner Nomad job (see below) with a Gitea runner token
  • .gitea/workflows/ci.yml in this repo

Checks

| Check | Command | Notes |
|---|---|---|
| HCL formatting | `terraform fmt -check -recursive` | Fails on whitespace/style drift |
| Terraform syntax | `terraform init -backend=false && terraform validate` | Catches wrong resource types, missing required args, bad references |
| Nomad job spec syntax | `nomad job validate <file>` | Catches Nomad-specific issues; needs NOMAD_ADDR + a low-privilege token |

terraform init -backend=false && terraform validate is the most valuable check: it catches roughly 90% of real mistakes with zero secret exposure (the -backend=false flag belongs to init, which skips backend setup so no state credentials are needed). The Nomad validate step requires a low-privilege token — worth adding once the runner is trusted.

Workflow sketch

# .gitea/workflows/ci.yml
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - name: fmt check
        run: terraform fmt -check -recursive
        working-directory: 2-nomad-config

      - name: init + validate (no backend)
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: 2-nomad-config

      - name: fmt check (nixos-node)
        run: terraform fmt -check -recursive
        working-directory: 1-nixos-node

  nomad-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nomad CLI
        run: |
          curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp.gpg
          echo "deb [signed-by=/usr/share/keyrings/hashicorp.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
          sudo apt-get update && sudo apt-get install -y nomad
      - name: validate all job specs
        env:
          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
          NOMAD_TOKEN: ${{ secrets.NOMAD_TOKEN }} # low-privilege policy scoped to job validation
        run: |
          find 2-nomad-config -name '*.nomad.hcl' | while read -r f; do
            echo "==> $f"
            nomad job validate "$f"
          done

act_runner Nomad job

# act-runner.nomad.hcl
job "act-runner" {
  group "act-runner" {
    network {
      mode = "bridge"
    }

    # Connect upstream to Gitea
    service {
      name = "act-runner"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "code-connect"
              local_bind_port  = 3000
            }
          }
        }
      }
    }

    task "act-runner" {
      driver = "docker"

      config {
        image   = "gitea/act_runner:latest"
        volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
      }

      env {
        GITEA_INSTANCE_URL = "http://localhost:3000"
      }

      template {
        data        = <<EOF
GITEA_RUNNER_REGISTRATION_TOKEN={{ with nomadVar "nomad/jobs/act-runner" }}{{ .registration_token }}{{ end }}
EOF
        destination = "secrets/runner.env"
        env         = true
      }

      resources {
        cpu        = 200
        memory     = 256
        memory_max = 512
      }
    }
  }
}

Security note: mounting /var/run/docker.sock gives the runner root-equivalent access to the host. Acceptable for a home server. Alternative: use docker:dind sidecar or Nomad's exec driver — more complex, lower risk.
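
For the one-time registration step, the token generated in Gitea's admin UI has to be stored at the Nomad variable path the template reads. A sketch — the angle-bracket placeholder stands in for whatever token Gitea generates:

```shell
# Store the Gitea runner registration token where the job template looks it up
nomad var put nomad/jobs/act-runner registration_token=<token-from-gitea-ui>
```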


Phase 2 — Docker Image Validation (pre-merge)

Runs on PRs that touch .nomad.hcl files. Catches: tag typos, deleted images, registry outages.

Requires the act_runner to have Docker access (same socket mount as above).

image-pull:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Pull changed images
      run: |
        # Extract image tags added or changed vs main
        git fetch origin main
        git diff origin/main...HEAD -- '*.nomad.hcl' \
          | grep '^\+\s*image\s*=' \
          | grep -oP '"[^"]+:[^"]+"' \
          | tr -d '"' \
          | sort -u \
          | while read -r image; do
              echo "==> Pulling $image"
              docker pull "$image" || exit 1  # fail the step on the first bad image
            done

This intentionally tests only the changed images — there's no value in pulling everything on every PR.
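
The extraction pipeline is easy to get subtly wrong, so it's worth smoke-testing against a fabricated diff hunk before trusting it (the image names below are made up):

```shell
# Run fake diff lines through the same filter the CI step uses
printf '%s\n' \
  '+  image   = "gitea/act_runner:0.2.11"' \
  '   cpu     = 200' \
  '+  image   = "ghcr.io/example/app:v1.2.3"' \
  | grep '^\+\s*image\s*=' \
  | grep -oP '"[^"]+:[^"]+"' \
  | tr -d '"' \
  | sort -u
# Prints:
#   ghcr.io/example/app:v1.2.3
#   gitea/act_runner:0.2.11
```

The unchanged cpu line is filtered out, and sort -u collapses duplicates, so each changed image is pulled exactly once.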


Phase 3 — Nomad Canary Deployments (post-merge gate)

Makes "merge" mean "start canary" rather than "go live". The old allocation keeps running until you promote.

Which jobs get canaries

Most jobs already have Consul health checks — these can use health_check = "checks" for automatic revert gating.

| Job | Health check | Shared writable volume | Canary safe? |
|---|---|---|---|
| ntfy | /healthz | no | yes |
| gitea | / | single-node-writer | ⚠️ volume blocks 2nd alloc from mounting — needs max_parallel=1 rolling instead |
| jellyfin | — | single-node-writer | ⚠️ same — rolling |
| immich | — | single-node-writer | ⚠️ same — rolling |
| sonarr | — | single-node-writer | ⚠️ same — rolling |
| prowlarr | — | single-node-writer | ⚠️ same — rolling |
| deluge | — | single-node-writer | ⚠️ same — rolling |
| frigate | — | single-node-writer | ⚠️ same — rolling |
| glance | — | no | yes |
| transfer | — | single-node-writer | ⚠️ rolling |
| openreader | — | single-node-writer | ⚠️ add check first, then rolling |
| unifi | — | single-node-writer | ⚠️ add check first, then rolling |
| traefik | (ingress) | — | ⚠️ rolling — downtime risk, promote quickly |
| authelia | (ingress) | — | stateless config, canary fine |
| renovate | batch job | n/a | n/a — no deployment model |
| postgres | (data layer) | — | never canary — single-writer DB |

Canary stanza (stateless jobs with no volume conflict)

update {
  canary           = 1
  auto_promote     = false
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}

Rolling stanza (jobs with single-node-writer volumes)

update {
  max_parallel     = 1
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}

Rolling with max_parallel=1 still gives auto-revert but doesn't attempt to run two allocations simultaneously — the old one stops before the new one mounts the volume.


Phase 4 — Automated terraform apply + Deployment Promotion

Full CD: merge triggers apply, which creates the canary, CI then watches it and promotes or reverts.

Flow

PR merged to main
      │
      ▼
Gitea Actions (on: push, branches: [main])
  - terraform init
  - terraform apply -auto-approve
      │
      ▼
Nomad canary starts (old allocation still live)
      │
      ▼
CI polls `nomad deployment list` for the new deployment ID
CI waits for canary allocation to reach "healthy" in Consul
      │ healthy within deadline
      ▼
CI runs: nomad deployment promote <id>
      │ or unhealthy → nomad deployment fail <id> (auto_revert fires)
      ▼
ntfy notification: "deployment promoted" or "deployment reverted"

Secrets required for full CD

| Secret | Used by | Risk level |
|---|---|---|
| NOMAD_ADDR | validate + apply + promote | Low (internal LAN addr) |
| NOMAD_TOKEN | terraform apply (write) + promote | High — grants full infra write |
| CLOUDFLARE_API_TOKEN | terraform apply | High — DNS write |
| SOPS_AGE_KEY | terraform apply (decrypt secrets) | High — decrypts all secrets |
| PG_PASSWORD | terraform apply (postgres provider) | High |

Full CD requires all of these in Gitea Actions secrets. This is acceptable for a self-hosted, non-public Gitea instance where you control runner access — but it's the trust boundary to be deliberate about. A reasonable middle ground: Phases 1–3 are fully automated, while Phase 4 (apply + promote) requires a manual trigger or approval step (e.g. a workflow_dispatch trigger, or required reviewers on environments if your Gitea version supports them).

Promote/revert script sketch

# In CI, after terraform apply completes ($JOB is the Nomad job name):
DEPLOY_ID=$(nomad deployment list -json \
  | jq -r --arg job "$JOB" '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID')
echo "Watching deployment $DEPLOY_ID..."

# 30 polls x 10s = 5m, matching healthy_deadline in the update stanza
for i in $(seq 1 30); do
  DEPLOY_JSON=$(nomad deployment status -json "$DEPLOY_ID")
  STATUS=$(echo "$DEPLOY_JSON" | jq -r '.Status')
  HEALTHY=$(echo "$DEPLOY_JSON" | jq -r '.TaskGroups[].HealthyAllocs')
  echo "[$i] status=$STATUS healthy=$HEALTHY"
  if [ "$STATUS" = "successful" ]; then exit 0; fi
  if [ "$STATUS" = "failed" ]; then exit 1; fi
  # Promote once every desired canary reports healthy
  CANARY_HEALTHY=$(echo "$DEPLOY_JSON" | jq -r '.TaskGroups[].DesiredCanaries == .TaskGroups[].HealthyAllocs')
  if [ "$CANARY_HEALTHY" = "true" ]; then
    nomad deployment promote "$DEPLOY_ID"
    exit 0
  fi
  sleep 10
done
nomad deployment fail "$DEPLOY_ID"
exit 1
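
The jq selection and canary-health logic can be exercised against canned API output before trusting them in CI (note the --arg form, which injects the job name safely; IDs and statuses below are fabricated):

```shell
# 1. Deployment-ID selection: picks the *running* deployment for the right job
SAMPLE='[{"ID":"aaa111","JobID":"glance","Status":"running"},
         {"ID":"bbb222","JobID":"ntfy","Status":"successful"},
         {"ID":"ccc333","JobID":"ntfy","Status":"running"}]'
JOB=ntfy
echo "$SAMPLE" | jq -r --arg job "$JOB" \
  '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID'
# → ccc333

# 2. Canary-health comparison: true once healthy allocs match desired canaries
echo '{"TaskGroups":{"ntfy":{"DesiredCanaries":1,"HealthyAllocs":1}}}' \
  | jq -r '.TaskGroups[].DesiredCanaries == .TaskGroups[].HealthyAllocs'
# → true
```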

Implementation Order

  • Phase 1a: Create act-runner.nomad.hcl + Terraform wrapper, register runner token in Gitea, get a hello-world workflow green
  • Phase 1b: Add terraform fmt + terraform validate -backend=false workflow — no secrets needed
  • Phase 1c: Add Nomad validate step — add NOMAD_ADDR + a low-privilege NOMAD_TOKEN to Gitea secrets
  • Phase 2: Add image pull validation step to the workflow
  • Phase 3a: Add update stanzas to ntfy and glance (simplest, no volume conflict)
  • Phase 3b: Add rolling update stanzas to remaining service jobs (jellyfin, sonarr, etc.)
  • Phase 3c: Add health checks to openreader and unifi before adding update stanzas
  • Phase 4a: Add on-push workflow that runs terraform apply -auto-approve using full credential set
  • Phase 4b: Add deployment promotion/revert polling script
  • Phase 4c: Wire ntfy notifications for promote/revert outcomes