deploy workflows run `docker compose down` BEFORE `docker compose up -d` #12

Closed
opened 2026-05-13 19:01:43 +00:00 by brendan · 0 comments
Owner

Originally filed: 2026-05-10 in ~/bugs.md, block #7.
Cross-project companion issues: brendan/authd, brendan/buchinese, brendan/inventory, brendan/movement

<!-- Resolved 2026-05-11: PR https://gitea.bchen.dev/brendan/nanodrop/pulls/7 (merge_commit 398c008c329143910321a15f5ab58a7553611ae3). Final bullet (nanodrop) closed; all 5 projects (authd PR #12, buchinese PR #8, inventory PR #19, movement PR #16, nanodrop PR #7) now use `up -d --build --remove-orphans` without a preceding `compose down`. Cross-project deploy-down-before-up retrofit chore complete.

**2026-05-10, cross-project: deploy workflows run `docker compose down` BEFORE `docker compose up -d --build`, guaranteeing downtime on any failed rebuild; remove the pre-emptive `down` line so a failed build leaves the previous container running**

User directive (spawn-host, 2026-05-10) after the 2026-05-10 buchinese outage from PR #6: "flag the down before up pattern as something that should be fixed, for all projects." The buchinese outage made the cost concrete: a single `npm ci` lockfile mismatch took prod to HTTP 502 for the duration of the human-intervention turnaround, because the deploy script destroyed the running container BEFORE finding out whether the new image even builds.

**The anti-pattern (present in every deployed project):**

```yaml
- name: Deploy on server with Docker
  run: |
    ssh ... <<EOF
      cd ~/${{ vars.DIRECTORY_NAME }}
      ...
      docker compose -f compose.yaml down            # ← kills the running container UNCONDITIONALLY
      docker compose -f compose.yaml up -d --build   # ← only NOW we find out if the new image builds
    EOF
```

If `docker compose up --build` fails (npm ci mismatch, missing env var, syntax error in code, native-build break on alpine, etc.), the previous container is already gone. Result: HTTP 502 from the reverse proxy until a human notices and pushes a fix.
The 2026-05-10 buchinese outage and the 2026-05-09 lockfile incident (commit `4dcc9c3 Update lockfile`) both hit this; neither would have caused user-visible downtime if the previous container had stayed up while the build was attempted.

**The fix (minimal, ~1-line per project):** delete the `docker compose down` line. The remaining `docker compose up -d --build` is sufficient and superior:

```yaml
- name: Deploy on server with Docker
  run: |
    ssh ... <<EOF
      cd ~/${{ vars.DIRECTORY_NAME }}
      ...
      docker compose -f compose.yaml up -d --build --remove-orphans
    EOF
```

**Why `up --build` alone is correct:** `docker compose up -d --build` builds the image first; **only after a successful build** does compose stop+replace the existing container (compose calls this "recreate"). On a build failure, compose exits non-zero with the previous container still running and serving traffic. The `--remove-orphans` flag cleans up containers from services that were renamed/removed in `compose.yaml` (drop-in safe: a no-op if there are no orphans).

**Optional follow-up improvements (file as separate chore items if wanted, NOT in scope for this fix):**

- **Healthcheck-gated swap:** add a `healthcheck:` block to each service in `compose.yaml` and use `docker compose up -d --wait --build` so the deploy step only returns once the new container reports healthy. This catches runtime failures (env var unset, port collision, crash on startup) before the SSH step "succeeds". Compose v2 supports `--wait` out of the box.
- **Image-pre-build:** `docker compose build` then `docker compose up -d --no-build` as two separate steps, so build failure produces a clear "build failed" log line distinct from "deploy step ran but container is unhealthy". Slightly nicer signal in CI logs.
- **Blue-green / port-swap:** run new container on parallel port, healthcheck, flip the reverse proxy upstream, kill old. Overkill for personal-scale apps; skip unless one of the apps grows real availability requirements.
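
For reference, the image-pre-build option above could look roughly like the sketch below. It is illustrative only and still out of scope for this fix. The `deploy` function name, the `DOCKER` and `COMPOSE_YAML` variables, and the return codes are assumptions made for the example, not anything present in the existing workflows.

```sh
# Hypothetical helper: names and return codes are illustrative assumptions.
# DOCKER can be overridden (e.g. for a dry run); it defaults to the real CLI.
DOCKER="${DOCKER:-docker}"
COMPOSE_YAML="${COMPOSE_YAML:-compose.yaml}"

deploy() {
    # Step 1: build only. The running container is not touched here.
    if ! $DOCKER compose -f "$COMPOSE_YAML" build; then
        echo "build failed; previous container left running" >&2
        return 1
    fi
    # Step 2: recreate containers from the image built in step 1.
    if ! $DOCKER compose -f "$COMPOSE_YAML" up -d --no-build --remove-orphans; then
        echo "deploy failed after a successful build" >&2
        return 2
    fi
}
```

Splitting the steps this way only changes the signal in the logs: a return code of 1 means the build broke and nothing was swapped, while 2 means the build succeeded but the swap did not.
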

**Affected projects (audited 2026-05-10 via grep across `~/*/.gitea/workflows/*.yml` and `~/*/.github/workflows/*.yml`):**

- ~~`~/buchinese/.gitea/workflows/deploy.yml:69-70`~~ ✅ **DONE 2026-05-11** in PR https://gitea.bchen.dev/brendan/buchinese/pulls/8 (merge_commit `902e30f2635f34a5fe9bed23c10109a1ca423125`). Two-line diff matching the authd precedent: dropped `docker compose -f compose.yaml down`; added `--remove-orphans` to surviving `up -d --build`. Refactor noop. 1 cycle. Needs manual deploy verification by user post-merge (next push to main should show no `Stopping buchinese ...` line before the build step).
- ~~`~/inventory/.github/workflows/deploy.yml:63-64`~~ ✅ **DONE 2026-05-11** in PR https://gitea.bchen.dev/brendan/inventory/pulls/19 (merge_commit `ecc933cbed56bbaf9c9de62675a6f870f6e473c7`). Two-line diff matching the authd + buchinese precedents: dropped `docker compose -f compose.yaml down`; added `--remove-orphans` to surviving `up -d --build`. Refactor noop. 1 cycle. Needs manual deploy verification by user post-merge (next push to main should show no `Stopping inventory ...` line before the build step).
- ~~`~/authd/.github/workflows/deploy.yml:54-55`~~ ✅ **DONE 2026-05-11** in PR https://gitea.bchen.dev/brendan/authd/pulls/12 (merge_commit `ea19e09eedf632fb04c88bd16668f3831d25a435`). Two-line diff: dropped `docker compose -f compose.yaml down`; added `--remove-orphans` to surviving `up -d --build`. Refactor noop. 1 cycle. Needs manual deploy verification by user post-merge (next push to main should show no `Stopping authd ...` line before the build step).
- ~~`~/nanodrop/.github/workflows/deploy.yml:54-55`~~ ✅ **DONE 2026-05-11** in PR https://gitea.bchen.dev/brendan/nanodrop/pulls/7 (merge_commit `398c008c329143910321a15f5ab58a7553611ae3`). Two-line diff matching the authd + buchinese + inventory + movement precedents: dropped `docker compose -f compose.yaml down`; added `--remove-orphans` to surviving `up -d --build`. (Note: prior PR #5 had already renamed `deploy-homelab.yml` → `deploy.yml` and `docker-compose.yml` → `compose.yaml`, so the fix landed on the canonical filenames.) Refactor noop. 1 cycle. Needs manual deploy verification by user post-merge (next push to main should show no `Stopping nanodrop ...` line before the build step).
- ~~`~/movement/.github/workflows/deploy.yml:44-45`~~ ✅ **DONE 2026-05-11** in PR https://gitea.bchen.dev/brendan/movement/pulls/16 (merge_commit `e0863a6c9e4fe3c98fa77b06930da3319ddabc12`). Two-line diff matching the authd + buchinese + inventory precedents: dropped `docker compose -f compose.yaml down`; added `--remove-orphans` to surviving `up -d --build`. Refactor noop. 1 cycle. Needs manual deploy verification by user post-merge (next push to main should show no `Stopping movement ...` line before the build step).

**Not affected (verified: no deploy workflow exists yet):**

- `~/dashcam`: no `.gitea/workflows` or `.github/workflows` directory. When a deploy workflow is added, it MUST follow the `up --build` (no `down`) pattern from the start.
- `~/portman`, `~/tradebot`: README-only at filing time, no project code, no workflow. Same standing rule applies when they get one.

**Suggested PR sequencing for the bug-fixer:** 5 separate PRs (one per project), each a one-line removal. Tiny diffs, no test changes needed, no env-var changes, no migration. Auto-merge fine on all 5; these are pure CI-config tweaks.

**Order (highest cross-cutting risk first):**

1. **authd**: biggest blast radius (OAuth provider for the family).
2. **buchinese**: currently 502'd; the lockfile fix from the [URGENT] item above will fix the immediate outage, but this fix prevents the NEXT one. Land this right after the lockfile regen so the next deploy benefits from atomic swap.
3. **inventory**: no current outage, but it's the canonical-workflow reference for the family per the 2026-05-09 cross-project alignment item further down this file. Fixing it in the canonical also propagates the pattern to future workflow ports.
4. **movement, nanodrop**: same one-line fix, different filenames.

**Acceptance per PR:**

- Workflow yaml shows `docker compose ... up -d --build [--remove-orphans]` with NO preceding `docker compose ... down` line in the same `ssh` heredoc.
- A subsequent push to `main` that introduces a deliberate build failure (e.g. an intentionally broken `package.json` in a throwaway test branch) does NOT take down the running container. **Don't actually test this in prod**: the bug-fixer can simulate by reading the workflow yaml and confirming the `down` line is gone; the behavioral claim is mechanical (compose's `up --build` swap is well-documented), no live test needed. Production smoke = the next legitimate deploy completes without 502.
- **No regression** in normal-case deploys: a clean build still replaces the container (which compose does because `--build` triggers recreate when the image hash changes).

**Out-of-scope notes for the bug-fixer:**

- DO NOT add a `--wait` flag, healthcheck blocks, or blue-green logic in these PRs. Those are real improvements but they require `healthcheck:` config in each compose file and per-service health endpoints (much bigger diff, easier to get wrong). The 1-line `down` removal captures 90% of the safety benefit at 1% of the complexity.
- DO NOT change anything else in the workflow (env block, SSH setup, runner image). Each PR title should be `chore(deploy): drop pre-emptive 'compose down' so failed builds don't take prod offline` (or analogous wording).
- DO NOT bundle multiple projects into one PR. The repos are separate; the deploys are independent; the PRs should be too.

**Prevention going forward:** add to `~/roles/_sub-claude-rules.md` (or the deploy-workflow section of the new-project bootstrap rules) a standing rule that any new deploy workflow MUST use `docker compose up -d --build` WITHOUT a preceding `docker compose down`.
The reporter can wire this into the canonical-workflow alignment item that already exists further down this bugs file. Marked as a follow-up; not load-bearing for this fix.

Source: user via spawn-host (2026-05-10), triggered by the buchinese PR #6 outage. Audit done across all 5 deployed projects at filing time. -->
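
The per-PR acceptance check described in this issue (no `docker compose ... down` line ahead of `up` in the workflow) can be approximated with a small script. This is a naive line-based sketch, not the audit command that was actually run. The function name `check_no_down_before_up` and its exit codes are assumptions, and it does not parse YAML or understand heredoc boundaries.

```sh
# Hypothetical check: name and exit codes are illustrative assumptions.
# Exits 0 if no `docker compose ... down` line appears before the first
# `docker compose ... up` line in the given file; exits 1 (and prints the
# offending line with its location) otherwise.
check_no_down_before_up() {
    awk '
        /docker compose( .*)? down( |$)/ { print FILENAME ":" FNR ": " $0; found = 1 }
        /docker compose( .*)? up( |$)/   { exit }   # stop scanning at the first `up`
        END { exit (found ? 1 : 0) }
    ' "$1"
}
```

It could be looped over the same paths the audit grepped, for example `for f in ~/*/.gitea/workflows/*.yml ~/*/.github/workflows/*.yml; do check_no_down_before_up "$f"; done`, assuming those globs match files on the machine where it runs.
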
brendan added the bug label 2026-05-13 19:01:43 +00:00
Reference: brendan/nanodrop#12