# CI/CD Plan

## Overview

Three distinct problems, tackled in phases:

1. **Does the config parse/validate without errors?** (static, no credentials)
2. **Does the new Docker image actually exist and start?** (pre-merge, needs Docker)
3. **Does the running service stay healthy through a deployment?** (post-merge, needs Nomad canary)

The goal: Renovate opens a PR → CI runs checks → you review → merge → canary starts automatically → you promote (or it auto-reverts).

---

## Phase 1 — Static Validation (proves the runner works)

No secrets needed. Runs on every PR.

### Infrastructure required

- `act_runner` Nomad job (see below) with a Gitea runner token
- `.gitea/workflows/ci.yml` in this repo

### Checks

| Check                 | Command                                               | Notes                                                               |
| --------------------- | ----------------------------------------------------- | ------------------------------------------------------------------- |
| HCL formatting        | `terraform fmt -check -recursive`                     | Fails on whitespace/style drift                                     |
| Terraform syntax      | `terraform init -backend=false && terraform validate` | Catches wrong resource types, missing required args, bad references |
| Nomad job spec syntax | `nomad job validate`                                  | Catches Nomad-specific issues; needs `NOMAD_ADDR` + read token      |

`terraform init -backend=false && terraform validate` is the most valuable check: it catches roughly 90% of real mistakes with zero secret exposure (note that `-backend=false` is an `init` flag, not a `validate` flag). The Nomad validate step requires a low-privilege token — worth adding once the runner is trusted.
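The low-privilege token could be minted from a dedicated ACL policy. A minimal sketch — the policy name is illustrative, and note that depending on Nomad version the `/v1/validate/job` endpoint is gated behind the `submit-job` capability even though validation writes nothing, so verify against your cluster:

```hcl
# ci-validate.policy.hcl — policy name is illustrative
namespace "default" {
  # read-job covers status lookups; submit-job may be required by the
  # job-validate endpoint even though validation performs no writes
  capabilities = ["read-job", "submit-job"]
}
```

Register it with `nomad acl policy apply ci-validate ci-validate.policy.hcl`, then mint the CI token with `nomad acl token create -policy=ci-validate` and store it as the `NOMAD_TOKEN` secret.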
### Workflow sketch

```yaml
# .gitea/workflows/ci.yml
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: fmt check
        run: terraform fmt -check -recursive
        working-directory: 2-nomad-config
      - name: init + validate (no backend)
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: 2-nomad-config
      - name: fmt check (nixos-node)
        run: terraform fmt -check -recursive
        working-directory: 1-nixos-node

  nomad-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nomad CLI
        run: |
          curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp.gpg
          echo "deb [signed-by=/usr/share/keyrings/hashicorp.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
          sudo apt-get update && sudo apt-get install -y nomad
      - name: validate all job specs
        env:
          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
          NOMAD_TOKEN: ${{ secrets.NOMAD_TOKEN }} # read-only policy sufficient
        run: |
          find 2-nomad-config -name '*.nomad.hcl' | while read f; do
            echo "==> $f"
            nomad job validate "$f"
          done
```

### act_runner Nomad job

```hcl
# act-runner.nomad.hcl
job "act-runner" {
  group "act-runner" {
    network {
      mode = "bridge"
    }

    # Connect upstream to Gitea
    service {
      name = "act-runner"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "code-connect"
              local_bind_port  = 3000
            }
          }
        }
      }
    }

    task "act-runner" {
      driver = "docker"

      config {
        image   = "gitea/act_runner:latest"
        volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
      }

      env = {
        GITEA_INSTANCE_URL = "http://localhost:3000"
      }

      template {
        # NOTE: the template body was lost in extraction; this placeholder and
        # the destination/env attributes are a reconstruction. GITEA_RUNNER_
        # REGISTRATION_TOKEN is act_runner's documented registration variable.
        data        = <<-EOT
        GITEA_RUNNER_REGISTRATION_TOKEN=<runner token source here>
        EOT
        destination = "secrets/runner.env"
        env         = true
      }
    }
  }
}
```

---

## Phase 2 — Image Pull Validation (pre-merge gate)

Confirms that the image referenced by a Renovate bump actually exists and can be pulled, before merge. Needs a Docker-capable runner.

```bash
# Pull each image referenced by job specs this PR touches.
# The image-extraction half of this script was lost in extraction; the grep/sed
# below is a reconstruction — adapt the pattern to your spec layout.
git fetch origin main   # shallow checkout: make the base branch available
git diff --name-only origin/main...HEAD -- '*.nomad.hcl' \
  | xargs -r grep -hoE 'image *= *"[^"]+"' \
  | sed -E 's/image *= *"([^"]+)"/\1/' \
  | sort -u \
  | while read -r image; do
      echo "==> Pulling $image"
      docker pull "$image"
    done
```

This intentionally only tests _changed_ images — no value in pulling everything on every PR.

---

## Phase 3 — Nomad Canary Deployments (post-merge gate)

Makes "merge" mean "start canary" rather than "go live".
The old allocation keeps running until you promote.

### Which jobs get canaries

Most jobs already have Consul health checks — these can use `health_check = "checks"` for automatic revert gating.

| Job        | Health check  | Shared writable volume  | Canary safe?                                                                      |
| ---------- | ------------- | ----------------------- | --------------------------------------------------------------------------------- |
| ntfy       | ✅ `/healthz` | no                      | ✅ yes                                                                            |
| gitea      | ✅ `/`        | ✅ `single-node-writer` | ⚠️ volume blocks 2nd alloc from mounting — needs `max_parallel=1` rolling instead |
| jellyfin   | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| immich     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| sonarr     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| prowlarr   | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| deluge     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| frigate    | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| glance     | ✅            | no                      | ✅ yes                                                                            |
| transfer   | ✅            | ✅ `single-node-writer` | ⚠️ rolling                                                                        |
| openreader | ❌            | ✅ `single-node-writer` | ⚠️ add check first, then rolling                                                  |
| unifi      | ❌            | ✅ `single-node-writer` | ⚠️ add check first, then rolling                                                  |
| traefik    | (ingress)     | ✅                      | ⚠️ rolling — downtime risk, promote quickly                                       |
| authelia   | (ingress)     | ✅                      | ✅ stateless config, canary fine                                                  |
| renovate   | batch job     | n/a                     | n/a — no deployment model                                                         |
| postgres   | (data layer)  | ✅                      | ❌ never canary — single-writer DB                                                |

### Canary stanza (stateless jobs with no volume conflict)

```hcl
update {
  canary           = 1
  auto_promote     = false
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```

### Rolling stanza (jobs with single-node-writer volumes)

```hcl
update {
  max_parallel     = 1
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```

Rolling with `max_parallel=1` still gives auto-revert but doesn't attempt to run two allocations simultaneously — the old one stops before the new one mounts the volume.
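The canary-vs-rolling split in the table can be audited mechanically: any spec that claims a `single-node-writer` volume gets the rolling stanza. A sketch, assuming the `2-nomad-config` layout above and that the access mode appears literally in each spec:

```shell
# Print job specs that mount a single-node-writer volume — these need the
# rolling update stanza; everything else is a canary candidate.
audit_rolling_jobs() {
  # $1 = directory of job specs, e.g. 2-nomad-config
  grep -rl 'single-node-writer' "$1" --include='*.nomad.hcl' | sort
}
```

Run it as `audit_rolling_jobs 2-nomad-config` and diff the output against the table whenever a job is added.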
---

## Phase 4 — Automated terraform apply + Deployment Promotion

Full CD: merge triggers apply, which creates the canary; CI then watches it and promotes or reverts.

### Flow

```
PR merged to main
        │
        ▼
Gitea Actions (on: push, branches: [main])
  - terraform init
  - terraform apply -auto-approve
        │
        ▼
Nomad canary starts (old allocation still live)
        │
        ▼
CI polls `nomad deployment list` for the new deployment ID
CI waits for canary allocation to reach "healthy" in Consul
        │ healthy within deadline
        ▼
CI runs: nomad deployment promote
        │ or unhealthy → nomad deployment fail (auto_revert fires)
        ▼
ntfy notification: "deployment promoted" or "deployment reverted"
```

### Secrets required for full CD

| Secret                 | Used by                             | Risk level                         |
| ---------------------- | ----------------------------------- | ---------------------------------- |
| `NOMAD_ADDR`           | validate + apply + promote          | Low (internal LAN addr)            |
| `NOMAD_TOKEN`          | terraform apply (write) + promote   | **High** — grants full infra write |
| `CLOUDFLARE_API_TOKEN` | terraform apply                     | **High** — DNS write               |
| `SOPS_AGE_KEY`         | terraform apply (decrypt secrets)   | **High** — decrypts all secrets    |
| `PG_PASSWORD`          | terraform apply (postgres provider) | High                               |

Full CD requires all of these in Gitea Actions secrets. This is acceptable for a self-hosted, non-public Gitea instance where you control runner access — but it's the trust boundary to be deliberate about. A reasonable middle ground: **Phases 1–3 are fully automated; Phase 4 (apply + promote) runs automatically but requires a manual re-trigger or approval step** (Gitea supports required reviewers on environments).

### Promote/revert script sketch

```bash
# In CI, after terraform apply completes. Note the --arg binding: a bare
# "$JOB" inside single-quoted jq would never expand.
DEPLOY_ID=$(nomad deployment list -json \
  | jq -r --arg job "$JOB" \
      '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID')
echo "Watching deployment $DEPLOY_ID..."
for i in $(seq 1 30); do
  STATUS=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '.Status')
  HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '.TaskGroups[].HealthyAllocs')
  echo "[$i] status=$STATUS healthy=$HEALTHY"

  if [ "$STATUS" = "successful" ]; then exit 0; fi
  if [ "$STATUS" = "failed" ]; then exit 1; fi

  # Promote once every task group's canaries are healthy. Compare per group
  # (a bare .TaskGroups[].x == .TaskGroups[].y would cross-product groups).
  CANARY_HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" \
    | jq -r '[.TaskGroups[] | .DesiredCanaries == .HealthyAllocs] | all')
  if [ "$CANARY_HEALTHY" = "true" ]; then
    nomad deployment promote "$DEPLOY_ID"
    exit 0
  fi

  sleep 10
done

nomad deployment fail "$DEPLOY_ID"
exit 1
```

---

## Implementation Order

- [ ] **Phase 1a**: Create `act-runner.nomad.hcl` + Terraform wrapper, register runner token in Gitea, get a hello-world workflow green
- [ ] **Phase 1b**: Add `terraform fmt` + `terraform validate` (with `terraform init -backend=false`) workflow — no secrets needed
- [ ] **Phase 1c**: Add Nomad validate step — add `NOMAD_ADDR` + read-only `NOMAD_TOKEN` to Gitea secrets
- [ ] **Phase 2**: Add image pull validation step to the workflow
- [ ] **Phase 3a**: Add `update` stanzas to ntfy and glance (simplest, no volume conflict)
- [ ] **Phase 3b**: Add rolling `update` stanzas to remaining service jobs (jellyfin, sonarr, etc.)
- [ ] **Phase 3c**: Add health checks to openreader and unifi before adding update stanzas
- [ ] **Phase 4a**: Add on-push workflow that runs `terraform apply -auto-approve` using the full credential set
- [ ] **Phase 4b**: Add deployment promotion/revert polling script
- [ ] **Phase 4c**: Wire ntfy notifications for promote/revert outcomes
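The Phase 4c notification can be a single POST against ntfy's HTTP API. A sketch — the ntfy URL and topic name are placeholders for your internal instance:

```shell
# Post a deployment outcome to an ntfy topic.
# NTFY_URL is a placeholder; point it at your internal ntfy service.
NTFY_URL="${NTFY_URL:-http://ntfy.service.consul/deployments}"

notify() {
  # $1 = outcome ("promoted" or "reverted"), $2 = job name
  curl -fsS -H "Title: Nomad deployment $1" -d "$2 $1" "$NTFY_URL"
}
```

Call `notify promoted "$JOB"` after `nomad deployment promote` succeeds and `notify reverted "$JOB"` on the `deployment fail` path of the polling script.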