Add Gitea act-runner and test actions for the repo

Add CI/CD plan documentation outlining phases for validation and deployment
2026-04-18 18:12:39 +10:00 · 2026-04-18 17:34:11 +10:00
8 changed files with 440 additions and 24 deletions
--- a/.gitea/workflows/ci.yml
+++ b/.gitea/workflows/ci.yml
@@ -0,0 +1,31 @@
+name: CI
+
+on:
+  pull_request:
+  push:
+    branches:
+      - main
+
+jobs:
+  terraform-validate:
+    name: Terraform fmt + validate
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: hashicorp/setup-terraform@v3
+
+      - name: fmt check — 1-nixos-node
+        run: terraform fmt -check -recursive
+        working-directory: 1-nixos-node
+
+      - name: fmt check — 2-nomad-config
+        run: terraform fmt -check -recursive
+        working-directory: 2-nomad-config
+
+      - name: validate — 2-nomad-config (no backend)
+        run: |
+          terraform init -backend=false
+          terraform validate
+        working-directory: 2-nomad-config
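
The checks in this workflow can also be run locally before pushing. A minimal sketch, assuming a POSIX shell and `terraform` on `PATH`; the function name `run_ci_checks` and the `TERRAFORM_BIN` override are hypothetical helpers for illustration and are not part of this commit:

```shell
# Hypothetical local mirror of .gitea/workflows/ci.yml.
# TERRAFORM_BIN is an assumed override so the binary can be swapped or stubbed.
# The function body is a subshell, so cd and variables do not leak to the caller.
run_ci_checks() (
  root="${1:-.}"
  tf="${TERRAFORM_BIN:-terraform}"

  # Same fmt check the workflow runs in each Terraform directory
  for dir in 1-nixos-node 2-nomad-config; do
    echo "==> fmt check: $dir"
    (cd "$root/$dir" && "$tf" fmt -check -recursive) || exit 1
  done

  # Validate without configuring a backend, so no credentials are needed
  echo "==> validate: 2-nomad-config (no backend)"
  cd "$root/2-nomad-config" || exit 1
  "$tf" init -backend=false && "$tf" validate
)
```

Called from the repository root as `run_ci_checks .`, it stops at the first failing check and returns a non-zero status, in the same order as the workflow steps.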
--- a/1-nixos-node/configuration.nix
+++ b/1-nixos-node/configuration.nix
@@ -64,6 +64,7 @@
          cni_path = "$${pkgs.cni-plugins}/bin";
        };
        plugin.docker.config.allow_privileged = true;
+        plugin.docker.config.volumes.enabled = true;
      };
      extraPackages = with pkgs; [
        cni-plugins
--- a/1-nixos-node/terraform.tfstate
+++ b/1-nixos-node/terraform.tfstate
--- a/1-nixos-node/terraform.tfstate.backup
+++ b/1-nixos-node/terraform.tfstate.backup
--- a/2-nomad-config/act-runner.nomad.hcl
+++ b/2-nomad-config/act-runner.nomad.hcl
@@ -0,0 +1,66 @@
+job "act-runner" {
+  group "act-runner" {
+    network {
+      mode = "bridge"
+    }
+
+    # Consul Connect upstream to Gitea so the runner can register and receive jobs
+    service {
+      name = "act-runner"
+      connect {
+        sidecar_service {
+          proxy {
+            upstreams {
+              destination_name = "code-connect"
+              local_bind_port  = 3000
+            }
+          }
+        }
+      }
+    }
+
+    task "act-runner" {
+      driver = "docker"
+
+      config {
+        image   = "gitea/act_runner:latest"
+        volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
+      }
+
+      env = {
+        GITEA_INSTANCE_URL = "http://localhost:3000"
+        CONFIG_FILE        = "/secrets/runner-config.yml"
+      }
+
+      # Required SOPS key:
+      #   act-runner.registration_token — runner registration token from Gitea
+      #   Admin → Settings → Actions → Runners → Create new runner
+      template {
+        data        = <<EOF
+GITEA_RUNNER_REGISTRATION_TOKEN={{ with nomadVar "nomad/jobs/act-runner" }}{{ .registration_token }}{{ end }}
+EOF
+        destination = "secrets/runner.env"
+        env         = true
+      }
+
+      # Limit which images/labels the runner will accept so it doesn't pick up
+      # unrelated workloads if more runners are added later.
+      template {
+        data        = <<EOF
+runner:
+  labels:
+    - "ubuntu-latest:docker://node:20-bookworm"
+    - "ubuntu-22.04:docker://node:20-bookworm"
+    - "ubuntu-24.04:docker://node:20-bookworm"
+EOF
+        destination = "secrets/runner-config.yml"
+      }
+
+      resources {
+        cpu        = 200
+        memory     = 256
+        memory_max = 1024
+      }
+    }
+  }
+}
--- a/2-nomad-config/act-runner.tf
+++ b/2-nomad-config/act-runner.tf
@@ -0,0 +1,10 @@
+resource "nomad_job" "act_runner" {
+  jobspec = file("act-runner.nomad.hcl")
+}
+
+resource "nomad_variable" "act_runner" {
+  path = "nomad/jobs/act-runner"
+  items = {
+    registration_token = data.sops_file.secrets.data["act-runner.registration_token"]
+  }
+}
--- a/2-nomad-config/secrets/secrets.enc.json
+++ b/2-nomad-config/secrets/secrets.enc.json
@@ -56,6 +56,9 @@
 		"gitea_token": "ENC[AES256_GCM,data:/J3CDMgWZLe20oQ+ENKBMi8fs/+jgsARV7xihMq0OLmRk8C8ae/IXg==,iv:e7WYOanSOCZ/LhN6SKrH0VrR3xLPTTppOKpGpSl+oAc=,tag:XBAilRdK3jL7WtM+92Fsmg==,type:str]",
 		"github_token": "ENC[AES256_GCM,data:omZpdsTV1aFgQ9PjIApITEyIRKk6Z8QyvD2Kp5tJnBWzFCm4v2lRAg==,iv:cKL7z+CSChzF9eZEcske2lbmx9KV6CrWw0tn7rmP/10=,tag:gon3Sc1d3ntNSbWwenHuOw==,type:str]"
 	},
+	"act-runner": {
+		"registration_token": "ENC[AES256_GCM,data:RnDvcNh69lLlL/ms+sMPKhhc+ECtc5hUHSkAQZv8e77iTD/QPd356Q==,iv:sl2Aua8rTe6cKYQAUC7O4UyHajGy1LgG/ZNLTVP4SyE=,tag:JjdaQqZ4PaWjfoiVmBl6lQ==,type:str]"
+	},
 	"sops": {
 		"age": [
 			{
@@ -63,8 +66,8 @@
 				"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSByUWM4ZDVVbGFrUGdMRHBX\nUFBmU3Nlc0RBSzhFK0tHNHpkQXUvUVdiZUZJCmpRN1lFdENpWW0rcThjVlVQNUl6\nWnlLU0RnQ3FZby81Ly8xTFBrek9nMncKLS0tIFQ4UTRNOC9CRmx4OFJWem1wckZz\nUDFTSzdWZldFK3FqcTNWTWRyNDhHQ2MKS811mR5xn7qiC/aVgPFYJ5c6Q3zxRfcr\nHcvxUvB01vNJKZpRg92vvKPkV6lQO3DXCT98OdfwiymlEOvYxg71Pg==\n-----END AGE ENCRYPTED FILE-----\n"
 			}
 		],
-		"lastmodified": "2026-04-18T06:30:49Z",
-		"mac": "ENC[AES256_GCM,data:ZqT+lJxFOxbRaDkex8URHRRoNSoHVkB9tbMCDVWoln0otMUBFDnxa1Fqwzl77G+JxD/I7W5QX5qUx+oSoDxhyCvC97tjBfTZ+nlqTos25wLddSKwOfbvRNS7oZrzMt5AepgauApucNDjjUWtZB55mTV497PzESLBrZeI/4zpCU0=,iv:AVvlyJLyLJup2PtLt8NzZO+uCbuQKmUV0S2swwl6nME=,tag:HxywCeG6NQotrsN7ovDfrw==,type:str]",
+		"lastmodified": "2026-04-18T07:41:42Z",
+		"mac": "ENC[AES256_GCM,data:+HhhsiZXok4BZI05tG3p9veZaj51kELSQlWFYMSInv7bGfEadmOrJqCxaGrFcNkMmgVPx80jWQFrILfVLW5MUvEsHAhD4Vza2TSWeUq1HuL9DbMxsK2G9Y1fbthd12r/++dDcXxVnTUf/rCD70in/+g/zRObocAnUcFEcIqx1JE=,iv:pS+aj+47J4bYZYGlMVniQVTlLt4jtCLUT7oROJLUkZo=,tag:+lznxDhs2C3bcz5quxfHjA==,type:str]",
 		"encrypted_regex": "^(.*)$",
 		"version": "3.10.2"
 	}
--- a/cicd-plan.md
+++ b/cicd-plan.md
@@ -0,0 +1,305 @@
+# CI/CD Plan
+
+## Overview
+
+Three distinct problems, tackled in phases:
+
+1. **Does the config parse/validate without errors?** (static, no credentials)
+2. **Does the new Docker image actually exist and start?** (pre-merge, needs Docker)
+3. **Does the running service stay healthy through a deployment?** (post-merge, needs Nomad canary)
+
+The goal is: Renovate opens a PR → CI runs checks → you review → merge → canary starts automatically → you promote (or it auto-reverts).
+
+---
+
+## Phase 1 — Static Validation (proves the runner works)
+
+No secrets needed. Runs on every PR.
+
+### Infrastructure required
+
+- `act_runner` Nomad job (see below) with a Gitea runner token
+- `.gitea/workflows/ci.yml` in this repo
+
+### Checks
+
+| Check                 | Command                                               | Notes                                                               |
+| --------------------- | ----------------------------------------------------- | ------------------------------------------------------------------- |
+| HCL formatting        | `terraform fmt -check -recursive`                     | Fails on whitespace/style drift                                     |
+| Terraform syntax      | `terraform init -backend=false && terraform validate` | Catches wrong resource types, missing required args, bad references |
+| Nomad job spec syntax | `nomad job validate <file>`                           | Catches Nomad-specific issues; needs `NOMAD_ADDR` + read token      |
+
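+`terraform init -backend=false` followed by `terraform validate` is the most valuable check: it catches most real mistakes (wrong resource types, missing required arguments, bad references) with zero secret exposure. Note that `-backend=false` is a flag of `terraform init`, not of `terraform validate`. The Nomad validate step requires a low-privilege read token, so it is worth adding once the runner is trusted.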
+
+### Workflow sketch
+
+```yaml
+# .gitea/workflows/ci.yml
+on: [pull_request]
+
+jobs:
+  validate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: hashicorp/setup-terraform@v3
+
+      - name: fmt check
+        run: terraform fmt -check -recursive
+        working-directory: 2-nomad-config
+
+      - name: init + validate (no backend)
+        run: |
+          terraform init -backend=false
+          terraform validate
+        working-directory: 2-nomad-config
+
+      - name: fmt check (nixos-node)
+        run: terraform fmt -check -recursive
+        working-directory: 1-nixos-node
+
+  nomad-validate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install Nomad CLI
+        run: |
+          curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp.gpg
+          echo "deb [signed-by=/usr/share/keyrings/hashicorp.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
+          sudo apt-get update && sudo apt-get install -y nomad
+      - name: validate all job specs
+        env:
+          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
+          NOMAD_TOKEN: ${{ secrets.NOMAD_TOKEN }} # read-only policy sufficient
+        run: |
+          find 2-nomad-config -name '*.nomad.hcl' | while read f; do
+            echo "==> $f"
+            nomad job validate "$f"
+          done
+```
+
+### act_runner Nomad job
+
+```hcl
+# act-runner.nomad.hcl
+job "act-runner" {
+  group "act-runner" {
+    network {
+      mode = "bridge"
+    }
+
+    # Connect upstream to Gitea
+    service {
+      name = "act-runner"
+      connect {
+        sidecar_service {
+          proxy {
+            upstreams {
+              destination_name = "code-connect"
+              local_bind_port  = 3000
+            }
+          }
+        }
+      }
+    }
+
+    task "act-runner" {
+      driver = "docker"
+
+      config {
+        image   = "gitea/act_runner:latest"
+        volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
+      }
+
+      env = {
+        GITEA_INSTANCE_URL = "http://localhost:3000"
+      }
+
+      template {
+        data        = <<EOF
+GITEA_RUNNER_REGISTRATION_TOKEN={{ with nomadVar "nomad/jobs/act-runner" }}{{ .registration_token }}{{ end }}
+EOF
+        destination = "secrets/runner.env"
+        env         = true
+      }
+
+      resources {
+        cpu        = 200
+        memory     = 256
+        memory_max = 512
+      }
+    }
+  }
+}
+```
+
+**Security note**: mounting `/var/run/docker.sock` gives the runner root-equivalent access to the host. Acceptable for a home server. Alternative: use `docker:dind` sidecar or Nomad's `exec` driver — more complex, lower risk.
+
+---
+
+## Phase 2 — Docker Image Validation (pre-merge)
+
+Runs on PRs that touch `.nomad.hcl` files. Catches: tag typos, deleted images, registry outages.
+
+Requires the `act_runner` to have Docker access (same socket mount as above).
+
+```yaml
+image-pull:
+  runs-on: ubuntu-latest
+  steps:
+    - uses: actions/checkout@v4
+      with:
+        fetch-depth: 0 # full history, so origin/main...HEAD can find a merge base
+    - name: Pull changed images
+      run: |
+        # Extract image tags added or changed vs main
+        git fetch origin main
+        git diff origin/main...HEAD -- '*.nomad.hcl' \
+          | grep '^\+\s*image\s*=' \
+          | grep -oP '"[^"]+:[^"]+"' \
+          | tr -d '"' \
+          | sort -u \
+          | while read image; do
+              echo "==> Pulling $image"
+              docker pull "$image"
+            done
+```
+
+This intentionally only tests _changed_ images — no value in pulling everything on every PR.
+
+---
+
+## Phase 3 — Nomad Canary Deployments (post-merge gate)
+
+Makes "merge" mean "start canary" rather than "go live". The old allocation keeps running until you promote.
+
+### Which jobs get canaries
+
+Most jobs already have Consul health checks — these can use `health_check = "checks"` for automatic revert gating.
+
+| Job        | Health check  | Shared writable volume  | Canary safe?                                                                      |
+| ---------- | ------------- | ----------------------- | --------------------------------------------------------------------------------- |
+| ntfy       | ✅ `/healthz` | no                      | ✅ yes                                                                            |
+| gitea      | ✅ `/`        | ✅ `single-node-writer` | ⚠️ volume blocks 2nd alloc from mounting — needs `max_parallel=1` rolling instead |
+| jellyfin   | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
+| immich     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
+| sonarr     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
+| prowlarr   | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
+| deluge     | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
+| frigate    | ✅            | ✅ `single-node-writer` | ⚠️ same — rolling                                                                 |
+| glance     | ✅            | no                      | ✅ yes                                                                            |
+| transfer   | ✅            | ✅ `single-node-writer` | ⚠️ rolling                                                                        |
+| openreader | ❌            | ✅ `single-node-writer` | ⚠️ add check first, then rolling                                                  |
+| unifi      | ❌            | ✅ `single-node-writer` | ⚠️ add check first, then rolling                                                  |
+| traefik    | (ingress)     | ✅                      | ⚠️ rolling — downtime risk, promote quickly                                       |
+| authelia   | (ingress)     | ✅                      | ✅ stateless config, canary fine                                                  |
+| renovate   | batch job     | n/a                     | n/a — no deployment model                                                         |
+| postgres   | (data layer)  | ✅                      | ❌ never canary — single-writer DB                                                |
+
+### Canary stanza (stateless jobs with no volume conflict)
+
+```hcl
+update {
+  canary           = 1
+  auto_promote     = false
+  auto_revert      = true
+  health_check     = "checks"
+  healthy_deadline = "5m"
+  min_healthy_time = "30s"
+}
+```
+
+### Rolling stanza (jobs with single-node-writer volumes)
+
+```hcl
+update {
+  max_parallel     = 1
+  auto_revert      = true
+  health_check     = "checks"
+  healthy_deadline = "5m"
+  min_healthy_time = "30s"
+}
+```
+
+Rolling with `max_parallel=1` still gives auto-revert but doesn't attempt to run two allocations simultaneously — the old one stops before the new one mounts the volume.
+
+---
+
+## Phase 4 — Automated terraform apply + Deployment Promotion
+
+Full CD: merge triggers apply, which creates the canary, CI then watches it and promotes or reverts.
+
+### Flow
+
+```
+PR merged to main
+      │
+      ▼
+Gitea Actions (on: push, branches: [main])
+  - terraform init
+  - terraform apply -auto-approve
+      │
+      ▼
+Nomad canary starts (old allocation still live)
+      │
+      ▼
+CI polls `nomad deployment list` for the new deployment ID
+CI waits for canary allocation to reach "healthy" in Consul
+      │ healthy within deadline
+      ▼
+CI runs: nomad deployment promote <id>
+      │ or unhealthy → nomad deployment fail <id> (auto_revert fires)
+      ▼
+ntfy notification: "deployment promoted" or "deployment reverted"
+```
+
+### Secrets required for full CD
+
+| Secret                 | Used by                             | Risk level                         |
+| ---------------------- | ----------------------------------- | ---------------------------------- |
+| `NOMAD_ADDR`           | validate + apply + promote          | Low (internal LAN addr)            |
+| `NOMAD_TOKEN`          | terraform apply (write) + promote   | **High** — grants full infra write |
+| `CLOUDFLARE_API_TOKEN` | terraform apply                     | **High** — DNS write               |
+| `SOPS_AGE_KEY`         | terraform apply (decrypt secrets)   | **High** — decrypts all secrets    |
+| `PG_PASSWORD`          | terraform apply (postgres provider) | High                               |
+
+Full CD requires all of these in Gitea Actions secrets. This is acceptable for a self-hosted, non-public Gitea instance where you control runner access, but it is the trust boundary to be deliberate about. A reasonable middle ground: **Phases 1-3 are fully automated; Phase 4 (apply + promote) is gated behind a manual re-trigger or approval step**. Whether Gitea Actions supports required reviewers on environments depends on the Gitea version, so verify that before relying on it.
+
+### Promote/revert script sketch
+
+```bash
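+# In CI, after terraform apply completes. JOB must hold the Nomad job ID to watch.
+# --arg passes the shell variable into jq; "$JOB" inside single quotes would stay a literal string.
+DEPLOY_ID=$(nomad deployment list -json | jq -r --arg job "$JOB" '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID // empty')
+if [ -z "$DEPLOY_ID" ]; then
+  echo "No running deployment found for $JOB, nothing to promote"
+  exit 0
+fi
+echo "Watching deployment $DEPLOY_ID..."
+
+for i in $(seq 1 30); do
+  # Query once per iteration so every field comes from the same snapshot
+  DEPLOY_JSON=$(nomad deployment status -json "$DEPLOY_ID")
+  STATUS=$(echo "$DEPLOY_JSON" | jq -r '.Status')
+  HEALTHY=$(echo "$DEPLOY_JSON" | jq -r '[.TaskGroups[].HealthyAllocs] | add')
+  echo "[$i] status=$STATUS healthy=$HEALTHY"
+  if [ "$STATUS" = "successful" ]; then exit 0; fi
+  if [ "$STATUS" = "failed" ]; then exit 1; fi
+  # Promote only when at least one task group wants canaries and each such group has enough healthy allocations
+  CANARY_HEALTHY=$(echo "$DEPLOY_JSON" | jq -r '[.TaskGroups[] | select(.DesiredCanaries > 0) | .HealthyAllocs >= .DesiredCanaries] | length > 0 and all')
+  if [ "$CANARY_HEALTHY" = "true" ]; then
+    nomad deployment promote "$DEPLOY_ID"
+    exit 0
+  fi
+  sleep 10
+done
+nomad deployment fail "$DEPLOY_ID"
+exit 1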
+```
+
+---
+
+## Implementation Order
+
+- [ ] **Phase 1a**: Create `act-runner.nomad.hcl` + Terraform wrapper, register runner token in Gitea, get a hello-world workflow green
+- [ ] **Phase 1b**: Add `terraform fmt -check` + `terraform init -backend=false` + `terraform validate` workflow (no secrets needed)
+- [ ] **Phase 1c**: Add Nomad validate step — add `NOMAD_ADDR` + read-only `NOMAD_TOKEN` to Gitea secrets
+- [ ] **Phase 2**: Add image pull validation step to the workflow
+- [ ] **Phase 3a**: Add `update` stanzas to ntfy and glance (simplest, no volume conflict)
+- [ ] **Phase 3b**: Add rolling `update` stanzas to remaining service jobs (jellyfin, sonarr, etc.)
+- [ ] **Phase 3c**: Add health checks to openreader and unifi before adding update stanzas
+- [ ] **Phase 4a**: Add on-push workflow that runs `terraform apply -auto-approve` using full credential set
+- [ ] **Phase 4b**: Add deployment promotion/revert polling script
+- [ ] **Phase 4c**: Wire ntfy notifications for promote/revert outcomes
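
The Phase 2 step in `cicd-plan.md` extracts changed image references from a diff with a `grep -oP` pipeline, which depends on GNU grep's PCRE support. A minimal sketch of the same extraction as a standalone function that relies only on POSIX `sed` and `sort`; the name `changed_images` is hypothetical and this is one possible implementation, not the one the plan commits to:

```shell
# Hypothetical helper: read a unified diff on stdin and print the unique
# image references found on added `image = "..."` lines, one per line.
changed_images() {
  # ^\+ matches added lines only; file headers start with "+++" and are
  # skipped because the second "+" is neither whitespace nor "image".
  sed -nE 's/^\+[[:space:]]*image[[:space:]]*=[[:space:]]*"([^"]+)".*/\1/p' | sort -u
}
```

In the workflow it would be fed by `git diff origin/main...HEAD -- '*.nomad.hcl' | changed_images`, with each output line passed to `docker pull`. Unlike the `grep -oP` version, it also reports images written without an explicit tag.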
Author	SHA1	Message	Date
Adrian Cowan	a13f2cef25	Add Gitea act-runner and test actions for the repo (CI: Terraform fmt + validate, push, successful in 34s)	2026-04-18 18:12:39 +10:00
Adrian Cowan	6c0b1c9281	Add CI/CD plan documentation outlining phases for validation and deployment	2026-04-18 17:34:11 +10:00