# CI/CD Plan

## Overview
Three distinct problems, tackled in phases:
- Does the config parse/validate without errors? (static, no credentials)
- Does the new Docker image actually exist and start? (pre-merge, needs Docker)
- Does the running service stay healthy through a deployment? (post-merge, needs Nomad canary)
The goal is: Renovate opens a PR → CI runs checks → you review → merge → canary starts automatically → you promote (or it auto-reverts).
## Phase 1 — Static Validation (proves the runner works)
No secrets needed. Runs on every PR.
### Infrastructure required

- `act_runner` Nomad job (see below) with a Gitea runner token
- `.gitea/workflows/ci.yml` in this repo

### Checks
| Check | Command | Notes |
|---|---|---|
| HCL formatting | `terraform fmt -check -recursive` | Fails on whitespace/style drift |
| Terraform syntax | `terraform init -backend=false && terraform validate` | Catches wrong resource types, missing required args, bad references |
| Nomad job spec syntax | `nomad job validate <file>` | Catches Nomad-specific issues; needs `NOMAD_ADDR` + read token |
`terraform validate` (run after `terraform init -backend=false`) is the most valuable check: it catches ~90% of real mistakes with zero secret exposure. The Nomad validate step requires a low-privilege read token — worth adding once the runner is trusted.
### Workflow sketch
```yaml
# .gitea/workflows/ci.yml
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: fmt check
        run: terraform fmt -check -recursive
        working-directory: 2-nomad-config
      - name: init + validate (no backend)
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: 2-nomad-config
      - name: fmt check (nixos-node)
        run: terraform fmt -check -recursive
        working-directory: 1-nixos-node

  nomad-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nomad CLI
        run: |
          curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp.gpg
          echo "deb [signed-by=/usr/share/keyrings/hashicorp.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
          sudo apt-get update && sudo apt-get install -y nomad
      - name: validate all job specs
        env:
          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
          NOMAD_TOKEN: ${{ secrets.NOMAD_TOKEN }} # read-only policy sufficient
        run: |
          find 2-nomad-config -name '*.nomad.hcl' | while read f; do
            echo "==> $f"
            nomad job validate "$f"
          done
```
### act_runner Nomad job
```hcl
# act-runner.nomad.hcl
job "act-runner" {
  group "act-runner" {
    network {
      mode = "bridge"
    }

    # Connect upstream to Gitea
    service {
      name = "act-runner"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "code-connect"
              local_bind_port  = 3000
            }
          }
        }
      }
    }

    task "act-runner" {
      driver = "docker"

      config {
        image   = "gitea/act_runner:latest"
        volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
      }

      env {
        GITEA_INSTANCE_URL = "http://localhost:3000"
      }

      template {
        data        = <<EOF
GITEA_RUNNER_REGISTRATION_TOKEN={{ with nomadVar "nomad/jobs/act-runner" }}{{ .registration_token }}{{ end }}
EOF
        destination = "secrets/runner.env"
        env         = true
      }

      resources {
        cpu        = 200
        memory     = 256
        memory_max = 512
      }
    }
  }
}
```
Security note: mounting `/var/run/docker.sock` gives the runner root-equivalent access to the host. Acceptable for a home server. Alternative: a `docker:dind` sidecar or Nomad's `exec` driver — more complex, lower risk.
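A minimal sketch of the dind alternative, under stated assumptions: the image tag and port are illustrative, and `privileged = true` requires `allow_privileged = true` in the client's docker plugin config.

```hcl
# Hypothetical two-task group: the runner reaches an isolated dind
# daemon over TCP instead of mounting the host's /var/run/docker.sock.
task "dind" {
  driver = "docker"
  config {
    image      = "docker:dind"
    privileged = true            # still privileged, but scoped to this container
  }
  env {
    DOCKER_TLS_CERTDIR = ""      # plain TCP on 2375; acceptable inside the group network
  }
}

task "act-runner" {
  driver = "docker"
  config {
    image = "gitea/act_runner:latest"
  }
  env {
    DOCKER_HOST = "tcp://localhost:2375"
  }
}
```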
## Phase 2 — Docker Image Validation (pre-merge)

Runs on PRs that touch `.nomad.hcl` files. Catches tag typos, deleted images, and registry outages.

Requires the act_runner to have Docker access (same socket mount as above).
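To scope this job to the PRs that matter, a path filter can gate the trigger. A sketch, assuming Gitea Actions honors the GitHub-compatible `paths` key:

```yaml
on:
  pull_request:
    paths:
      - '**/*.nomad.hcl'
```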
```yaml
  image-pull:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull changed images
        run: |
          # Extract image tags added or changed vs main
          git fetch origin main
          git diff origin/main...HEAD -- '*.nomad.hcl' \
            | grep '^\+\s*image\s*=' \
            | grep -oP '"[^"]+:[^"]+"' \
            | tr -d '"' \
            | sort -u \
            | while read image; do
                echo "==> Pulling $image"
                docker pull "$image"
              done
```
This intentionally only tests changed images — no value in pulling everything on every PR.
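The extraction filters assume image references are quoted and tagged (`image = "name:tag"`). They can be sanity-checked locally by feeding a fabricated diff hunk through them; the hunk and image tags below are made up for illustration:

```shell
# Run the extraction filters over a fake diff hunk: only added
# (leading "+") image lines should survive, deduplicated.
printf '%s\n' \
  '+  image = "gitea/act_runner:0.2.11"' \
  '-  image = "gitea/act_runner:0.2.10"' \
  '+  command = "server"' \
  | grep '^+\s*image\s*=' \
  | grep -oP '"[^"]+:[^"]+"' \
  | tr -d '"' \
  | sort -u
# → gitea/act_runner:0.2.11
```

Note the removed line and the non-image line are both dropped, so only the newly introduced tag would be pulled.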
## Phase 3 — Nomad Canary Deployments (post-merge gate)
Makes "merge" mean "start canary" rather than "go live". The old allocation keeps running until you promote.
### Which jobs get canaries

Most jobs already have Consul health checks — these can use `health_check = "checks"` for automatic revert gating.
| Job | Health check | Shared writable volume | Canary safe? |
|---|---|---|---|
| ntfy | ✅ `/healthz` | no | ✅ yes |
| gitea | ✅ `/` | ✅ single-node-writer | ⚠️ volume blocks 2nd alloc from mounting — needs `max_parallel=1` rolling instead |
| jellyfin | ✅ | ✅ single-node-writer | ⚠️ same — rolling |
| immich | ✅ | ✅ single-node-writer | ⚠️ same — rolling |
| sonarr | ✅ | ✅ single-node-writer | ⚠️ same — rolling |
| prowlarr | ✅ | ✅ single-node-writer | ⚠️ same — rolling |
| deluge | ✅ | ✅ single-node-writer | ⚠️ same — rolling |
| frigate | ✅ | ✅ single-node-writer | ⚠️ same — rolling |
| glance | ✅ | no | ✅ yes |
| transfer | ✅ | ✅ single-node-writer | ⚠️ rolling |
| openreader | ❌ | ✅ single-node-writer | ⚠️ add check first, then rolling |
| unifi | ❌ | ✅ single-node-writer | ⚠️ add check first, then rolling |
| traefik | (ingress) | ✅ | ⚠️ rolling — downtime risk, promote quickly |
| authelia | (ingress) | ✅ | ✅ stateless config, canary fine |
| renovate | batch job | n/a | n/a — no deployment model |
| postgres | (data layer) | ✅ | ❌ never canary — single-writer DB |
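`health_check = "checks"` gates deployment health on whatever Consul checks the service registers, so each canary-able job needs a check in its `service` block. A sketch for ntfy, whose `/healthz` endpoint is noted above; the port label and timings are illustrative:

```hcl
service {
  name = "ntfy"
  port = "http"   # assumes a named port "http" in the group's network block

  check {
    type     = "http"
    path     = "/healthz"
    interval = "10s"
    timeout  = "2s"
  }
}
```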
### Canary stanza (stateless jobs with no volume conflict)
```hcl
update {
  canary           = 1
  auto_promote     = false
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```
### Rolling stanza (jobs with single-node-writer volumes)
```hcl
update {
  max_parallel     = 1
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```
Rolling with `max_parallel = 1` still gives auto-revert but doesn't attempt to run two allocations simultaneously — the old one stops before the new one mounts the volume.
## Phase 4 — Automated terraform apply + Deployment Promotion

Full CD: merge triggers apply, which creates the canary; CI then watches it and promotes or reverts.

### Flow
```text
PR merged to main
  │
  ▼
Gitea Actions (on: push, branches: [main])
  - terraform init
  - terraform apply -auto-approve
  │
  ▼
Nomad canary starts (old allocation still live)
  │
  ▼
CI polls `nomad deployment list` for the new deployment ID
CI waits for canary allocation to reach "healthy" in Consul
  │ healthy within deadline
  ▼
CI runs: nomad deployment promote <id>
  │ or unhealthy → nomad deployment fail <id> (auto_revert fires)
  ▼
ntfy notification: "deployment promoted" or "deployment reverted"
```
### Secrets required for full CD
| Secret | Used by | Risk level |
|---|---|---|
| `NOMAD_ADDR` | validate + apply + promote | Low (internal LAN addr) |
| `NOMAD_TOKEN` | terraform apply (write) + promote | High — grants full infra write |
| `CLOUDFLARE_API_TOKEN` | terraform apply | High — DNS write |
| `SOPS_AGE_KEY` | terraform apply (decrypt secrets) | High — decrypts all secrets |
| `PG_PASSWORD` | terraform apply (postgres provider) | High |
Full CD requires all of these in Gitea Actions secrets. This is acceptable for a self-hosted, non-public Gitea instance where you control runner access — but it's the trust boundary to be deliberate about. A reasonable middle ground: Phases 1–3 are fully automated, while Phase 4 (apply + promote) waits for a manual trigger or approval step (Gitea supports required reviewers on environments).
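If the approval-gated middle ground is chosen, the apply job can be pinned to a protected environment. A sketch, assuming Gitea Actions honors the GitHub-compatible `environment` key and that a `production` environment with required reviewers exists in the repo settings (both names hypothetical):

```yaml
  apply:
    runs-on: ubuntu-latest
    environment: production   # deployment pauses here until a reviewer approves
    steps:
      - uses: actions/checkout@v4
      - run: terraform apply -auto-approve
        working-directory: 2-nomad-config
```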
### Promote/revert script sketch
```bash
# In CI, after terraform apply completes ($JOB holds the Nomad job name).
# NOTE: pass $JOB into jq via --arg — a shell variable inside a
# single-quoted jq program would never expand.
DEPLOY_ID=$(nomad deployment list -json \
  | jq -r --arg job "$JOB" '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID')

echo "Watching deployment $DEPLOY_ID..."
for i in $(seq 1 30); do
  STATUS=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '.Status')
  HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '[.TaskGroups[].HealthyAllocs] | add')
  echo "[$i] status=$STATUS healthy=$HEALTHY"
  if [ "$STATUS" = "successful" ]; then exit 0; fi
  if [ "$STATUS" = "failed" ]; then exit 1; fi

  # Promote once every task group's canaries are healthy
  CANARY_HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" \
    | jq -r '[.TaskGroups[] | .HealthyAllocs >= .DesiredCanaries] | all')
  if [ "$CANARY_HEALTHY" = "true" ]; then
    nomad deployment promote "$DEPLOY_ID"
    exit 0
  fi
  sleep 10
done

nomad deployment fail "$DEPLOY_ID"
exit 1
```
## Implementation Order
- Phase 1a: Create `act-runner.nomad.hcl` + Terraform wrapper, register runner token in Gitea, get a hello-world workflow green
- Phase 1b: Add `terraform fmt` + `terraform validate` (with `-backend=false` init) workflow — no secrets needed
- Phase 1c: Add Nomad validate step — add `NOMAD_ADDR` + read-only `NOMAD_TOKEN` to Gitea secrets
- Phase 2: Add image pull validation step to the workflow
- Phase 3a: Add `update` stanzas to ntfy and glance (simplest, no volume conflict)
- Phase 3b: Add rolling `update` stanzas to remaining service jobs (jellyfin, sonarr, etc.)
- Phase 3c: Add health checks to openreader and unifi before adding update stanzas
- Phase 4a: Add on-push workflow that runs `terraform apply -auto-approve` using the full credential set
- Phase 4b: Add deployment promotion/revert polling script
- Phase 4c: Wire ntfy notifications for promote/revert outcomes