From 6c0b1c928110c364c8188c065b87d499d63eadc0 Mon Sep 17 00:00:00 2001
From: Adrian Cowan
Date: Sat, 18 Apr 2026 17:34:11 +1000
Subject: [PATCH] Add CI/CD plan documentation outlining phases for validation
 and deployment
---
 cicd-plan.md | 305 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 305 insertions(+)
 create mode 100644 cicd-plan.md

diff --git a/cicd-plan.md b/cicd-plan.md
new file mode 100644
index 0000000..d3bad6e
--- /dev/null
+++ b/cicd-plan.md
@@ -0,0 +1,305 @@
# CI/CD Plan

## Overview

Three distinct problems, tackled in phases:

1. **Does the config parse/validate without errors?** (static, no credentials)
2. **Does the new Docker image actually exist and start?** (pre-merge, needs Docker)
3. **Does the running service stay healthy through a deployment?** (post-merge, needs Nomad canary)

The goal is: Renovate opens a PR → CI runs checks → you review → merge → canary starts automatically → you promote (or it auto-reverts).

---

## Phase 1 — Static Validation (proves the runner works)

No secrets needed. Runs on every PR.

### Infrastructure required

- `act_runner` Nomad job (see below) with a Gitea runner token
- `.gitea/workflows/ci.yml` in this repo

### Checks

| Check                 | Command                                               | Notes                                                               |
| --------------------- | ----------------------------------------------------- | ------------------------------------------------------------------- |
| HCL formatting        | `terraform fmt -check -recursive`                     | Fails on whitespace/style drift                                     |
| Terraform syntax      | `terraform init -backend=false && terraform validate` | Catches wrong resource types, missing required args, bad references |
| Nomad job spec syntax | `nomad job validate <file>`                           | Catches Nomad-specific issues; needs `NOMAD_ADDR` + read token      |

`terraform validate` (after `terraform init -backend=false`) is the most valuable check: it catches ~90% of real mistakes with zero secret exposure.
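As a concrete illustration of the class of mistake `terraform validate` catches, consider a typo'd attribute value — the resource and attribute below are only illustrative, not taken from this repo:

```hcl
# Hypothetical example: validate fails on the bad reference without
# touching any backend or state.
resource "nomad_job" "ntfy" {
  jobspec = file("${path.module}/jobs/ntfy.nomad.hcl")
  detach  = fasle # typo for `false` — reported as an invalid reference
}
```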
The Nomad validate step requires a low-privilege read token — worth adding once the runner is trusted.

### Workflow sketch

```yaml
# .gitea/workflows/ci.yml
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - name: fmt check
        run: terraform fmt -check -recursive
        working-directory: 2-nomad-config

      - name: init + validate (no backend)
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: 2-nomad-config

      - name: fmt check (nixos-node)
        run: terraform fmt -check -recursive
        working-directory: 1-nixos-node

  nomad-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nomad CLI
        run: |
          curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp.gpg
          echo "deb [signed-by=/usr/share/keyrings/hashicorp.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
          sudo apt-get update && sudo apt-get install -y nomad
      - name: validate all job specs
        env:
          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
          NOMAD_TOKEN: ${{ secrets.NOMAD_TOKEN }} # read-only policy sufficient
        run: |
          find 2-nomad-config -name '*.nomad.hcl' | while read -r f; do
            echo "==> $f"
            nomad job validate "$f"
          done
```

### act_runner Nomad job

```hcl
# act-runner.nomad.hcl
job "act-runner" {
  group "act-runner" {
    network {
      mode = "bridge"
    }

    # Connect upstream to Gitea
    service {
      name = "act-runner"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "code-connect"
              local_bind_port  = 3000
            }
          }
        }
      }
    }

    task "act-runner" {
      driver = "docker"

      config {
        image   = "gitea/act_runner:latest"
        volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
      }

      env = {
        GITEA_INSTANCE_URL = "http://localhost:3000"
      }

      template {
        data = <<EOF
...
EOF
      }
    }
  }
}
```

---

## Phase 2 — Image Pull Validation (pre-merge)

Pulls any image changed in the PR to prove it actually exists:

```bash
# List images changed in this PR (detection command elided), then:
... | while read image; do
  echo "==> Pulling $image"
  docker pull "$image"
done
```

This intentionally only tests _changed_ images — no value in pulling everything on every PR.

---

## Phase 3 — Nomad Canary Deployments (post-merge gate)

Makes "merge" mean "start canary" rather than "go live". The old allocation keeps running until you promote.

### Which jobs get canaries

Most jobs already have Consul health checks — these can use `health_check = "checks"` for automatic revert gating.

| Job        | Health check  | Shared writable volume  | Canary safe?                                                                      |
| ---------- | ------------- | ----------------------- | --------------------------------------------------------------------------------- |
| ntfy       | ✅ `/healthz` | no                      | ✅ yes                                                                            |
| gitea      | ✅ `/`        | ✅ `single-node-writer` | ⚠️ volume blocks 2nd alloc from mounting — needs `max_parallel=1` rolling instead |
| jellyfin   | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| immich     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| sonarr     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| prowlarr   | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| deluge     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| frigate    | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
| glance     | ✅            | no                      | ✅ yes                                                                            |
| transfer   | ✅            | ✅ `single-node-writer` | ⚠️ rolling                                                                        |
| openreader | ❌            | ✅ `single-node-writer` | ⚠️ add check first, then rolling                                                  |
| unifi      | ❌            | ✅ `single-node-writer` | ⚠️ add check first, then rolling                                                  |
| traefik    | (ingress)     | ✅                      | ⚠️ rolling — downtime risk, promote quickly                                       |
| authelia   | (ingress)     | ✅                      | ✅ stateless config, canary fine                                                  |
| renovate   | batch job     | n/a                     | n/a — no deployment model                                                         |
| postgres   | (data layer)  | ✅                      | ❌ never canary — single-writer DB                                                |

### Canary stanza (stateless jobs with no volume conflict)

```hcl
update {
  canary           = 1
  auto_promote     = false
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```

### Rolling stanza (jobs with single-node-writer volumes)

```hcl
update {
  max_parallel     = 1
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```

Rolling with `max_parallel=1` still gives auto-revert but doesn't attempt to run two allocations simultaneously — the old one stops before the new one mounts the volume.

---

## Phase 4 — Automated terraform apply + Deployment Promotion

Full CD: merge triggers apply, which creates the canary; CI then watches it and promotes or reverts.

### Flow

```
PR merged to main
  │
  ▼
Gitea Actions (on: push, branches: [main])
  - terraform init
  - terraform apply -auto-approve
  │
  ▼
Nomad canary starts (old allocation still live)
  │
  ▼
CI polls `nomad deployment list` for the new deployment ID
CI waits for canary allocation to reach "healthy" in Consul
  │ healthy within deadline
  ▼
CI runs: nomad deployment promote
  │ or unhealthy → nomad deployment fail (auto_revert fires)
  ▼
ntfy notification: "deployment promoted" or "deployment reverted"
```

### Secrets required for full CD

| Secret                 | Used by                             | Risk level                         |
| ---------------------- | ----------------------------------- | ---------------------------------- |
| `NOMAD_ADDR`           | validate + apply + promote          | Low (internal LAN addr)            |
| `NOMAD_TOKEN`          | terraform apply (write) + promote   | **High** — grants full infra write |
| `CLOUDFLARE_API_TOKEN` | terraform apply                     | **High** — DNS write               |
| `SOPS_AGE_KEY`         | terraform apply (decrypt secrets)   | **High** — decrypts all secrets    |
| `PG_PASSWORD`          | terraform apply (postgres provider) | High                               |

Full CD requires all of these in Gitea Actions secrets. This is acceptable for a self-hosted, non-public Gitea instance where you control runner access — but it's the trust boundary to be deliberate about.
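The apply half of this flow can be sketched as an on-push workflow. The file name and step layout are assumptions; the secret names match the table above and the directory matches the Phase 1 workflow:

```yaml
# .gitea/workflows/deploy.yml (sketch — not yet in the repo)
on:
  push:
    branches: [main]

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: terraform apply
        working-directory: 2-nomad-config
        env:
          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
          NOMAD_TOKEN: ${{ secrets.NOMAD_TOKEN }}
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          SOPS_AGE_KEY: ${{ secrets.SOPS_AGE_KEY }}
          PG_PASSWORD: ${{ secrets.PG_PASSWORD }}
        run: |
          terraform init
          terraform apply -auto-approve
```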
A reasonable middle ground: **Phases 1-3 are fully automated; Phase 4 (apply + promote) runs automatically but requires a manual re-trigger or approval step** (for example, a separate `workflow_dispatch`-triggered promote job).

### Promote/revert script sketch

```bash
# In CI, after terraform apply completes ($JOB is set by the pipeline):
DEPLOY_ID=$(nomad deployment list -json | jq -r --arg job "$JOB" '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID')
echo "Watching deployment $DEPLOY_ID..."

for i in $(seq 1 30); do
  STATUS=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '.Status')
  HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '[.TaskGroups[].HealthyAllocs] | add')
  echo "[$i] status=$STATUS healthy=$HEALTHY"
  if [ "$STATUS" = "successful" ]; then exit 0; fi
  if [ "$STATUS" = "failed" ]; then exit 1; fi
  # Promote once every task group's canaries are healthy
  CANARY_HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '[.TaskGroups[] | .HealthyAllocs >= .DesiredCanaries] | all')
  if [ "$CANARY_HEALTHY" = "true" ]; then
    nomad deployment promote "$DEPLOY_ID"
    exit 0
  fi
  sleep 10
done
nomad deployment fail "$DEPLOY_ID"
exit 1
```

---

## Implementation Order

- [ ] **Phase 1a**: Create `act-runner.nomad.hcl` + Terraform wrapper, register runner token in Gitea, get a hello-world workflow green
- [ ] **Phase 1b**: Add `terraform fmt` + `terraform validate` (backend-less init) workflow — no secrets needed
- [ ] **Phase 1c**: Add Nomad validate step — add `NOMAD_ADDR` + read-only `NOMAD_TOKEN` to Gitea secrets
- [ ] **Phase 2**: Add image pull validation step to the workflow
- [ ] **Phase 3a**: Add `update` stanzas to ntfy and glance (simplest, no volume conflict)
- [ ] **Phase 3b**: Add rolling `update` stanzas to remaining service jobs (jellyfin, sonarr, etc.)
+- [ ] **Phase 3c**: Add health checks to openreader and unifi before adding update stanzas +- [ ] **Phase 4a**: Add on-push workflow that runs `terraform apply -auto-approve` using full credential set +- [ ] **Phase 4b**: Add deployment promotion/revert polling script +- [ ] **Phase 4c**: Wire ntfy notifications for promote/revert outcomes
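
The final checklist item can build on ntfy's HTTP publish API (message as POST body, `Title`/`Priority` as headers). A sketch — the `notify` helper, URL, and topic name are hypothetical:

```shell
#!/usr/bin/env sh
# Sketch: report promote/revert outcomes via ntfy's publish API.
# NTFY_URL (instance + topic) is a placeholder; point it at the real instance.
NTFY_URL="${NTFY_URL:-https://ntfy.example.internal/deployments}"

notify() {
  outcome="$1"
  job="$2"
  # ntfy convention: POST body is the message, metadata goes in headers
  curl -fsS -H "Title: Deployment $outcome: $job" \
       -H "Priority: ${3:-default}" \
       -d "Nomad deployment for $job was $outcome" \
       "$NTFY_URL"
}

# Usage from the promote/revert script's exit paths:
#   notify promoted "$JOB" || true
#   notify reverted "$JOB" high || true
```

The `|| true` keeps a notification failure from masking the deployment result, since the script's exit code is what CI reports.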