Add CI/CD plan documentation outlining phases for validation and deployment
# CI/CD Plan

## Overview

Three distinct problems, tackled in phases:

1. **Does the config parse/validate without errors?** (static, no credentials)
2. **Does the new Docker image actually exist and start?** (pre-merge, needs Docker)
3. **Does the running service stay healthy through a deployment?** (post-merge, needs Nomad canary)

The goal is: Renovate opens a PR → CI runs checks → you review → merge → canary starts automatically → you promote (or it auto-reverts).

---
## Phase 1 — Static Validation (proves the runner works)

No secrets needed. Runs on every PR.

### Infrastructure required

- `act_runner` Nomad job (see below) with a Gitea runner token
- `.gitea/workflows/ci.yml` in this repo

### Checks

| Check | Command | Notes |
| --- | --- | --- |
| HCL formatting | `terraform fmt -check -recursive` | Fails on whitespace/style drift |
| Terraform syntax | `terraform init -backend=false && terraform validate` | Catches wrong resource types, missing required args, bad references |
| Nomad job spec syntax | `nomad job validate <file>` | Catches Nomad-specific issues; needs `NOMAD_ADDR` + read token |
`terraform validate` (run after `terraform init -backend=false`) is the most valuable check: it catches ~90% of real mistakes with zero secret exposure. The Nomad validate step requires a low-privilege read token — worth adding once the runner is trusted.
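That read token can be scoped tightly. A minimal sketch of such a policy, assuming everything runs in the `default` namespace (the policy and file names here are illustrative, not existing files in this repo):

```hcl
# ci-read.policy.hcl — just enough access for `nomad job validate`
namespace "default" {
  policy = "read"
}
```

Registered with `nomad acl policy apply ci-read ci-read.policy.hcl`, then minted via `nomad acl token create -name=ci-read -policy=ci-read -type=client` and stored as the `NOMAD_TOKEN` secret in Gitea.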
### Workflow sketch
```yaml
# .gitea/workflows/ci.yml
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: fmt check
        run: terraform fmt -check -recursive
        working-directory: 2-nomad-config

      - name: init + validate (no backend)
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: 2-nomad-config

      - name: fmt check (nixos-node)
        run: terraform fmt -check -recursive
        working-directory: 1-nixos-node

  nomad-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nomad CLI
        run: |
          curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp.gpg
          echo "deb [signed-by=/usr/share/keyrings/hashicorp.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
          sudo apt-get update && sudo apt-get install -y nomad
      - name: validate all job specs
        env:
          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
          NOMAD_TOKEN: ${{ secrets.NOMAD_TOKEN }} # read-only policy sufficient
        run: |
          find 2-nomad-config -name '*.nomad.hcl' | while read -r f; do
            echo "==> $f"
            nomad job validate "$f"
          done
```
### act_runner Nomad job

```hcl
# act-runner.nomad.hcl
job "act-runner" {
  group "act-runner" {
    network {
      mode = "bridge"
    }

    # Connect upstream to Gitea
    service {
      name = "act-runner"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "code-connect"
              local_bind_port  = 3000
            }
          }
        }
      }
    }

    task "act-runner" {
      driver = "docker"

      config {
        image   = "gitea/act_runner:latest"
        volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
      }

      env = {
        GITEA_INSTANCE_URL = "http://localhost:3000"
      }

      template {
        data        = <<EOF
GITEA_RUNNER_REGISTRATION_TOKEN={{ with nomadVar "nomad/jobs/act-runner" }}{{ .registration_token }}{{ end }}
EOF
        destination = "secrets/runner.env"
        env         = true
      }

      resources {
        cpu        = 200
        memory     = 256
        memory_max = 512
      }
    }
  }
}
```
**Security note**: mounting `/var/run/docker.sock` gives the runner root-equivalent access to the host. Acceptable for a home server. Alternative: a `docker:dind` sidecar or Nomad's `exec` driver — more complex, lower risk.

---
## Phase 2 — Docker Image Validation (pre-merge)

Runs on PRs that touch `.nomad.hcl` files. Catches: tag typos, deleted images, registry outages.

Requires the `act_runner` to have Docker access (same socket mount as above).
```yaml
  image-pull:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull changed images
        run: |
          # Extract image tags added or changed vs main
          git fetch origin main
          git diff origin/main...HEAD -- '*.nomad.hcl' \
            | grep '^+[[:space:]]*image[[:space:]]*=' \
            | grep -oP '"[^"]+:[^"]+"' \
            | tr -d '"' \
            | sort -u \
            | while read -r image; do
                echo "==> Pulling $image"
                docker pull "$image"
              done
```
This intentionally only tests _changed_ images — no value in pulling everything on every PR.
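The grep pipeline is the fiddly part, and it can be exercised against a canned diff with no git or Docker involved. A sketch — `extract_images` is an illustrative helper name, not part of the workflow above, and the sample diff is fabricated:

```shell
#!/usr/bin/env bash
# extract_images: read a unified diff on stdin, print the unique image
# refs that appear on *added* `image = "..."` lines (same pipeline as
# the workflow step above, minus the docker pull).
extract_images() {
  grep '^+[[:space:]]*image[[:space:]]*=' \
    | grep -oP '"[^"]+:[^"]+"' \
    | tr -d '"' \
    | sort -u
}

# Demo against a fake diff fragment:
diff_sample='+++ b/ntfy.nomad.hcl
-        image = "binwiederhier/ntfy:v2.10.0"
+        image = "binwiederhier/ntfy:v2.11.0"'

printf '%s\n' "$diff_sample" | extract_images   # → binwiederhier/ntfy:v2.11.0
```

Note that the `+++ b/...` file header doesn't match `^+[[:space:]]*image`, so only genuine added lines survive.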

---
## Phase 3 — Nomad Canary Deployments (post-merge gate)

Makes "merge" mean "start canary" rather than "go live". The old allocation keeps running until you promote.

### Which jobs get canaries

Most jobs already have Consul health checks — these can use `health_check = "checks"` for automatic revert gating.

| Job | Health check | Shared writable volume | Canary safe? |
| --- | --- | --- | --- |
| ntfy | ✅ `/healthz` | no | ✅ yes |
| gitea | ✅ `/` | ✅ `single-node-writer` | ⚠️ volume blocks 2nd alloc from mounting — needs `max_parallel=1` rolling instead |
| jellyfin | ✅ | ✅ `single-node-writer` | ⚠️ same — rolling |
| immich | ✅ | ✅ `single-node-writer` | ⚠️ same — rolling |
| sonarr | ✅ | ✅ `single-node-writer` | ⚠️ same — rolling |
| prowlarr | ✅ | ✅ `single-node-writer` | ⚠️ same — rolling |
| deluge | ✅ | ✅ `single-node-writer` | ⚠️ same — rolling |
| frigate | ✅ | ✅ `single-node-writer` | ⚠️ same — rolling |
| glance | ✅ | no | ✅ yes |
| transfer | ✅ | ✅ `single-node-writer` | ⚠️ rolling |
| openreader | ❌ | ✅ `single-node-writer` | ⚠️ add check first, then rolling |
| unifi | ❌ | ✅ `single-node-writer` | ⚠️ add check first, then rolling |
| traefik | (ingress) | ✅ | ⚠️ rolling — downtime risk, promote quickly |
| authelia | (ingress) | ✅ | ✅ stateless config, canary fine |
| renovate | batch job | n/a | n/a — no deployment model |
| postgres | (data layer) | ✅ | ❌ never canary — single-writer DB |
### Canary stanza (stateless jobs with no volume conflict)

```hcl
update {
  canary           = 1
  auto_promote     = false
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```
### Rolling stanza (jobs with single-node-writer volumes)

```hcl
update {
  max_parallel     = 1
  auto_revert      = true
  health_check     = "checks"
  healthy_deadline = "5m"
  min_healthy_time = "30s"
}
```
Rolling with `max_parallel=1` still gives auto-revert but doesn't attempt to run two allocations simultaneously — the old one stops before the new one mounts the volume.
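For the canary jobs, promotion stays a manual step until Phase 4. The command sequence (a sketch — `<deploy-id>` comes from the job status output and is left as a placeholder):

```
nomad job status ntfy                 # "Latest Deployment" section shows the ID
nomad deployment status <deploy-id>   # is the canary placed and healthy?
nomad deployment promote <deploy-id>  # replace the old allocation
nomad deployment fail <deploy-id>     # or back out — auto_revert restores the old version
```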

---
## Phase 4 — Automated terraform apply + Deployment Promotion

Full CD: merge triggers apply, which creates the canary; CI then watches it and promotes or reverts.

### Flow

```
PR merged to main
        │
        ▼
Gitea Actions (on: push, branches: [main])
  - terraform init
  - terraform apply -auto-approve
        │
        ▼
Nomad canary starts (old allocation still live)
        │
        ▼
CI polls `nomad deployment list` for the new deployment ID
CI waits for canary allocation to reach "healthy" in Consul
        │ healthy within deadline
        ▼
CI runs: nomad deployment promote <id>
        │ or unhealthy → nomad deployment fail <id> (auto_revert fires)
        ▼
ntfy notification: "deployment promoted" or "deployment reverted"
```
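Finding the right deployment ID from `nomad deployment list -json` is a jq exercise; passing the job name via `--arg` avoids shell-quoting traps. A sketch — the helper name is illustrative, and the sample JSON only models the fields used, not the full API response:

```shell
#!/usr/bin/env bash
# latest_running_deployment: print the ID of the running deployment for
# a given job. Expects `nomad deployment list -json` output on stdin.
latest_running_deployment() {
  jq -r --arg job "$1" \
    '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID'
}

# Illustrative fragment of the list output:
sample='[{"ID":"aaa-111","JobID":"ntfy","Status":"successful"},
         {"ID":"bbb-222","JobID":"ntfy","Status":"running"}]'

printf '%s' "$sample" | latest_running_deployment ntfy   # → bbb-222
```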
### Secrets required for full CD

| Secret | Used by | Risk level |
| --- | --- | --- |
| `NOMAD_ADDR` | validate + apply + promote | Low (internal LAN addr) |
| `NOMAD_TOKEN` | terraform apply (write) + promote | **High** — grants full infra write |
| `CLOUDFLARE_API_TOKEN` | terraform apply | **High** — DNS write |
| `SOPS_AGE_KEY` | terraform apply (decrypt secrets) | **High** — decrypts all secrets |
| `PG_PASSWORD` | terraform apply (postgres provider) | High |

Full CD requires all of these in Gitea Actions secrets. This is acceptable for a self-hosted, non-public Gitea instance where you control runner access — but it's the trust boundary to be deliberate about. A reasonable middle ground: **Phases 1-3 are fully automated; Phase 4 (apply + promote) runs automatically but requires a manual re-trigger or approval step** (via required reviewers on environments if your Gitea version supports them, or a manually dispatched promote workflow otherwise).
### Promote/revert script sketch

```bash
# In CI, after terraform apply completes ($JOB is set by the workflow).
# Note: the job name is passed to jq via --arg — inside single quotes,
# a bare $JOB would never be expanded by the shell.
DEPLOY_ID=$(nomad deployment list -json \
  | jq -r --arg job "$JOB" \
      '[.[] | select(.JobID == $job and .Status == "running")] | first | .ID')
echo "Watching deployment $DEPLOY_ID..."

for i in $(seq 1 30); do
  STATUS=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '.Status')
  HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" | jq -r '.TaskGroups[].HealthyAllocs')
  echo "[$i] status=$STATUS healthy=$HEALTHY"
  if [ "$STATUS" = "successful" ]; then exit 0; fi
  if [ "$STATUS" = "failed" ]; then exit 1; fi
  # Promote once every task group's canaries are healthy
  CANARY_HEALTHY=$(nomad deployment status -json "$DEPLOY_ID" \
    | jq -r '[.TaskGroups[] | .HealthyAllocs >= .DesiredCanaries] | all')
  if [ "$CANARY_HEALTHY" = "true" ]; then
    nomad deployment promote "$DEPLOY_ID"
    exit 0
  fi
  sleep 10
done
nomad deployment fail "$DEPLOY_ID"
exit 1
```
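The final notification step can be a plain HTTP POST to ntfy. A sketch — the server URL and the `deployments` topic are placeholders for your instance, and `notify` is an illustrative helper name:

```shell
#!/usr/bin/env bash
# notify: publish a deployment outcome to an ntfy topic.
# NTFY_URL and the topic name are assumptions — point them at your server.
NTFY_URL="${NTFY_URL:-https://ntfy.example.com}"

notify() {
  local outcome="$1" job="$2"
  # ntfy takes the message as the POST body; Title sets the notification title.
  curl -fsS \
    -H "Title: nomad deploy: $job" \
    -d "deployment $outcome for $job" \
    "$NTFY_URL/deployments"
}

# e.g. in CI, after the polling loop:
#   notify promoted "$JOB"   # on exit 0
#   notify reverted "$JOB"   # on exit 1
```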
---

## Implementation Order

- [ ] **Phase 1a**: Create `act-runner.nomad.hcl` + Terraform wrapper, register runner token in Gitea, get a hello-world workflow green
- [ ] **Phase 1b**: Add `terraform fmt` + `terraform init -backend=false && terraform validate` workflow — no secrets needed
- [ ] **Phase 1c**: Add Nomad validate step — add `NOMAD_ADDR` + read-only `NOMAD_TOKEN` to Gitea secrets
- [ ] **Phase 2**: Add image pull validation step to the workflow
- [ ] **Phase 3a**: Add `update` stanzas to ntfy and glance (simplest, no volume conflict)
- [ ] **Phase 3b**: Add rolling `update` stanzas to remaining service jobs (jellyfin, sonarr, etc.)
- [ ] **Phase 3c**: Add health checks to openreader and unifi before adding update stanzas
- [ ] **Phase 4a**: Add on-push workflow that runs `terraform apply -auto-approve` using the full credential set
- [ ] **Phase 4b**: Add deployment promotion/revert polling script
- [ ] **Phase 4c**: Wire ntfy notifications for promote/revert outcomes