Deploy pipeline verification: 2026-05-04

End-to-end verification of the RBO V2 deploy pipeline as it stands at Phase-0 kickoff. Filed under #87 so the next agent who needs to know "what runtime targets exist?" can find this without reading three weeks of CI commits.

Verified state

The most recent main-branch pipeline (#88, SHA c3bce97, 2026-05-03) ran end-to-end green:

Job	Stage	Brick	Outcome
`docs:build` (MR-only)	build	`.mkdocs_pages` (v5)	n/a on `main` (rules: `$CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH`)
`build:test`	build	`.buildx_build` (v5)	success — 12s warm
`deploy:test`	deploy-test	`.keystone_deploy` (v3 carry-forward)	success — 20s
`smoke:test`	smoke	inline (alpine + curl)	success — 7s
`deploy:prod`	deploy-prod	`.keystone_deploy` (manual gate)	success — 19s
`pages`	pages	`.mkdocs_pages` (v5)	success — 18s

All three live URLs probed by smoke:test returned 200:

https://rbo-test.wagen.io/ → FastAPI root ({"app":"racket-book","status":"skeleton"})
https://rbo-test.wagen.io/healthz → ok
https://rbo-test.wagen.io/auth/health → 200 (per-tenant gotrue)
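
The probes listed above can be reproduced by hand with a loop along these lines. This is a minimal sketch, not the actual smoke:test job script: the function name smoke_probe and the CURL override variable are illustrative assumptions, and it assumes curl is on the PATH.

```shell
#!/bin/sh
# Sketch of a smoke probe loop; NOT the real smoke:test job script.
# CURL is overridable (hypothetical knob) so the loop can be exercised
# without network access.
CURL="${CURL:-curl}"

# smoke_probe BASE_URL PATH...
# Prints "<status> <url>" per path; returns 1 if any status is not 200.
smoke_probe() (
    base="$1"
    shift
    failed=0
    for path in "$@"; do
        # -s: silent, -o /dev/null: discard the body,
        # -w '%{http_code}': print only the HTTP status code.
        code=$("$CURL" -s -o /dev/null -w '%{http_code}' --max-time 10 "${base}${path}") || code="000"
        printf '%s %s\n' "$code" "${base}${path}"
        [ "$code" = "200" ] || failed=1
    done
    return "$failed"
)
```

For the test environment this would be called as `smoke_probe https://rbo-test.wagen.io / /healthz /auth/health`; a non-zero exit status means at least one probe did not return 200.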

The mkdocs site is published at https://racket-book-7cc372.gitlab.io (GitLab Pages unique-domain enabled, force_https=true).

Pipeline shape

CI consumes shared bricks from wagen/keystone at ref: v5 (.gitlab/ci-templates/deploy.yml).

build           build:test              .buildx_build           kst1-shared buildkitd → registry
                docs:build  (MR only)   .mkdocs_pages           mkdocs --strict → public/
deploy-test     deploy:test             .keystone_deploy        ssh keystone@kst1 → 11-app-deploy.sh
smoke           smoke:test              inline                  curl /, /healthz, /auth/health
deploy-prod     deploy:prod  (manual)   .keystone_deploy        ssh keystone@kst1 → 11-app-deploy.sh
pages           pages                   .mkdocs_pages           public/ → GitLab Pages

Stages list (canonical from v4, inherited via include:): install → check → test → build → db-migrate → deploy-test → smoke → deploy-prod → pages. RBO does not currently use install, check, test, or db-migrate (no test suite, no migrations yet — see #89 for the migration wiring).

Why deploy:test / deploy:prod extend .keystone_deploy directly (NOT the v4 .deploy_test / .deploy_prod wrappers): the v4 wrappers pin needs: [db:migrate:<env>, build:<env>] (and for prod also test-e2e). RBO has no db:migrate:* and no build:prod job today; extending the wrappers would fail at pipeline-create with unresolved-need errors. Switching to the wrappers is part of #89.

Image build path

Builder: persistent kst1-shared buildkitd container on the kst1 host (created by keystone/scripts/13-runner-buildx-init.sh). Pruned weekly Sunday 04:00 UTC by /etc/cron.d/keystone-buildx-prune.
Runner: self-hosted on kst1, tag [kst1]. Shared GitLab.com runners are disabled on this project (shared_runners_enabled=false).
Image: <container-registry>/<project>:${CI_COMMIT_SHORT_SHA}-${APP_ENV} (project's own registry, tag per commit + env). RBO is env-agnostic at build time (FastAPI, no client-side bundling) so deploy:prod re-uses the <sha>-test artifact via an IMAGE_TAG override — saves a redundant build:prod.
Dockerfile final stage MUST be named runner: the .buildx_build brick builds with --target=runner by default (ADR-0010 Rule 2 #3). The v5 brick lets consumers override the target via TARGET:, but RBO inherits the default.
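
The tag scheme and the prod re-use of the test image can be summarised in a few lines of shell. This is a sketch under stated assumptions: image_tag and deploy_image_tag are hypothetical helper names, not functions from the keystone bricks, and the real pipeline achieves the same effect through the IMAGE_TAG override.

```shell
#!/bin/sh
# Sketch only: image_tag and deploy_image_tag are hypothetical helpers,
# not part of the keystone CI bricks.

# image_tag SHORT_SHA APP_ENV -> "<sha>-<env>", the tag format described above.
image_tag() {
    printf '%s-%s' "$1" "$2"
}

# deploy_image_tag SHORT_SHA APP_ENV -> the tag a deploy job would pull.
# Assumption: prod re-uses the test-built image; other envs use their own tag.
deploy_image_tag() (
    sha="$1"
    env="$2"
    case "$env" in
        prod) image_tag "$sha" test ;;
        *)    image_tag "$sha" "$env" ;;
    esac
)
```

With the SHA from the verified pipeline, `deploy_image_tag c3bce97 prod` prints `c3bce97-test`, which is why no build:prod job is needed.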

Deploy path (kst1 side)

The shared .keystone_deploy brick does:

chmod 0600 "$DEPLOY_SSH_KEY"   # variable_type=file; runner sets it 0644 by default
printf '%s' "$CI_JOB_TOKEN" | ssh -i "$DEPLOY_SSH_KEY" keystone@${DEPLOY_HOST} \
    "bash /opt/keystone/scripts/11-app-deploy.sh ${APP_SLUG} ${APP_ENV} ${IMAGE_TAG}"

On the host side:

The keystone user's ~/.ssh/authorized_keys carries a command=-restricted entry pinning the deploy key to /opt/keystone/scripts/keystone-deploy-entrypoint.sh.
The wrapper validates <app> against config/apps/<app>.env, validates <env> ∈ {prod, test, dev, staging}, validates <tag> against ^[A-Za-z0-9._-]{1,128}$, then exec's 11-app-deploy.sh.
11-app-deploy.sh does git pull --ff-only on /opt/keystone first (deploy script self-updates per call), then docker login against the registry using the piped CI_JOB_TOKEN (ephemeral DOCKER_CONFIG, per ADR-0006), then docker compose pull && up -d against /srv/apps/rbo/${APP_ENV}/compose.yml.
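
The three argument checks performed by the forced-command wrapper could look roughly like the following. This is an illustrative sketch, not the contents of keystone-deploy-entrypoint.sh: the function name validate_deploy_args, the optional APPS_DIR argument, the error messages, and the extra app-name guard are all assumptions.

```shell
#!/bin/sh
# Sketch of the argument checks described above; NOT the real
# keystone-deploy-entrypoint.sh. Names and messages are hypothetical.

# validate_deploy_args APP ENV TAG [APPS_DIR]
# Returns 0 when all three arguments pass, 1 otherwise.
validate_deploy_args() (
    app="$1"
    env="$2"
    tag="$3"
    apps_dir="${4:-config/apps}"

    # Extra guard (an assumption, not stated in the source): refuse app names
    # that are empty, contain a slash, or start with a dot.
    case "$app" in
        ''|*/*|.*) echo "rejected app name" >&2; return 1 ;;
    esac
    # <app> must have a matching <apps_dir>/<app>.env file.
    [ -f "${apps_dir}/${app}.env" ] || { echo "unknown app" >&2; return 1; }

    # <env> must be one of the four allowed environments.
    case "$env" in
        prod|test|dev|staging) ;;
        *) echo "unknown env" >&2; return 1 ;;
    esac

    # <tag> must match ^[A-Za-z0-9._-]{1,128}$: non-empty, no character
    # outside the allowed set, and at most 128 characters long.
    case "$tag" in
        ''|*[!A-Za-z0-9._-]*) echo "bad tag" >&2; return 1 ;;
    esac
    [ "${#tag}" -le 128 ] || { echo "tag too long" >&2; return 1; }
)
```

Validating before exec means a compromised deploy key can only trigger deploys of known apps, known environments, and well-formed tags; it cannot inject shell syntax through the arguments.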

Compose stacks live at /srv/apps/rbo/{test,prod}/; auto-generated by keystone/scripts/10-app-onboard.sh. Caddy snippets at /srv/platform/caddy/snippets/rbo-{test,prod}.caddy reverse-proxy https://rbo[-test].wagen.io/ to 127.0.0.1:${APP_PORT} and /auth/* to the per-tenant gotrue container on db_net.

DB connectivity

Per ADR-0006 §"App runtime contract — DATABASE_URL host", DATABASE_URL points at keystone-pgbouncer:6432 (the platform's shared PgBouncer container on db_net), NOT keystone-postgres:5432. PgBouncer runs in transaction-pool mode.
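
A quick way to confirm that a given DATABASE_URL follows this contract is to extract its host:port and compare. The sketch below is an assumption-laden helper, not part of keystone: db_hostport and db_url_uses_pgbouncer are hypothetical names, and the parsing assumes any special characters in the password are percent-encoded, as URLs require.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for checking the DATABASE_URL host.
# Assumes the password is percent-encoded (no raw "/" in the userinfo part).

# db_hostport URL -> prints the "host:port" part of a postgresql:// URL.
db_hostport() (
    rest="${1#*://}"              # drop the scheme
    authority="${rest%%/*}"       # keep everything before the first "/"
    authority="${authority%%\?*}" # drop a query string if there is no path
    printf '%s' "${authority##*@}" # drop "user:password@"
)

# db_url_uses_pgbouncer URL -> exit 0 only if the URL targets the shared PgBouncer.
db_url_uses_pgbouncer() (
    [ "$(db_hostport "$1")" = "keystone-pgbouncer:6432" ]
)
```

Running `db_url_uses_pgbouncer "$DATABASE_URL"` inside a job or container returns non-zero for a URL that still points at keystone-postgres:5432 or at a loopback address.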

/healthz/db (manual probe; not currently in smoke:test) runs SELECT 1 over a fresh psycopg async connection per request — verified working post-deploy on both envs at the time of the most recent prod deploy.

Note for Pax (BE): the V2 design must avoid session-scoped Postgres features that PgBouncer transaction-pool mode can't carry across transactions: server-side prepared statements (psycopg 3 does not detect PgBouncer on its own; it prepares statements automatically after repeated executions, so set prepare_threshold=None on the connection unless PgBouncer's own prepared-statement support is enabled), SET outside transactions, session-level advisory locks, LISTEN/NOTIFY, and long idle transactions. Single-transaction DDL (the Alembic default) IS supported by transaction-pool; see #89 for the migration-wiring decision record on this point.

CI variable surface

Provisioned by keystone/scripts/10-app-onboard.sh, env-scoped to test and prod:

Variable	Type	Masked	Protected	Purpose
`DATABASE_URL`	env_var	yes	yes	`postgresql://<role>:<pwd>@keystone-pgbouncer:6432/rbo_<env>`
`GOTRUE_JWT_SECRET`	env_var	yes	yes	Per-tenant gotrue JWT signing/verifying secret
`GOTRUE_URL`	env_var	no	no	`https://rbo[-test].wagen.io/auth`
`IMAGE`	env_var	no	no	Registry path (consumed by `11-app-deploy.sh`)
`APP_PORT`	env_var	no	no	`3010` (loopback bind on kst1; Caddy fronts)
`APP_HOSTNAME`	env_var	no	no	`rbo[-test].wagen.io`
`DEPLOY_HOST`	env_var	no	no	`kst1.wagen.io`
`DEPLOY_SSH_KEY`	file	no	yes	Deploy private key. `masked=false` is intentional (multi-line PEM keys can't be masked); the protected flag + the host-side `command=`-restricted forced-command wrapper are the security boundaries.

STORAGE_* and NEXT_PUBLIC_* keys appear in the project variable list (legacy from a partial onboard run) but are not consumed by RBO today; the pipeline and the FastAPI app use only the eight variables in the table above. Safe to ignore.
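
When a job fails in a way that suggests a missing variable, a presence check over the eight names in the table narrows it down quickly. This is a sketch, not an existing keystone script: REQUIRED_VARS and missing_vars are hypothetical names, and the check only tests that each variable is non-empty, not that its value is correct.

```shell
#!/bin/sh
# Sketch only: REQUIRED_VARS and missing_vars are hypothetical, not part of
# 10-app-onboard.sh. The list mirrors the eight rows of the table above.
REQUIRED_VARS="DATABASE_URL GOTRUE_JWT_SECRET GOTRUE_URL IMAGE APP_PORT APP_HOSTNAME DEPLOY_HOST DEPLOY_SSH_KEY"

# missing_vars -> prints the name of every required variable that is unset
# or empty, one per line. Prints nothing when all eight are present.
missing_vars() (
    for name in $REQUIRED_VARS; do
        # Indirect lookup; safe here because names come from the fixed list.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
)
```

Because masked and protected variables are only injected on protected refs, running `missing_vars` in a job on an unprotected branch is expected to list DATABASE_URL, GOTRUE_JWT_SECRET, and DEPLOY_SSH_KEY.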

How to debug a red pipeline

build:test failed? Check for buildkitd cache pressure first: ssh kst1 'sudo docker buildx du --builder kst1-shared'. Manual prune: ssh kst1 'sudo docker buildx prune --builder kst1-shared --keep-storage 20GB --force'.
deploy:test SSH failure? The host-side wrapper logs to journald: ssh kst1 'sudo journalctl -u ssh -t sshd | tail -50'. Common causes: DEPLOY_SSH_KEY rotation drift (re-run 10-app-onboard.sh), forced-command wrapper rejecting an unknown app/env/tag triplet, kst1 disk pressure preventing docker compose pull.
smoke:test 502/503? App container failed to come up. SSH kst1: cd /srv/apps/rbo/test && docker compose logs app --tail=100. Most-likely cause: missing/stale env var (DATABASE_URL pointing at pre-fix 127.0.0.1:6432 instead of keystone-pgbouncer:6432; 10-app-onboard.sh heals this drift on next run).
pages failed --strict? Cross-doc anchor or broken link. mkdocs prints the offending file:line in the job log; fix locally with mkdocs build --strict before pushing.
deploy:prod button greyed out? The job's rules require $CI_COMMIT_BRANCH == "main" OR $CI_COMMIT_TAG =~ /^v/. Check you're looking at the pipeline of a main push, not an MR pipeline.
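
The deploy:prod rule from the last item can be checked against a pipeline's ref with a small predicate. This is a sketch of the rule as stated, not the brick's actual rules: block; prod_gate_applies is a hypothetical name, and its two arguments stand in for $CI_COMMIT_BRANCH and $CI_COMMIT_TAG.

```shell
#!/bin/sh
# Sketch only: prod_gate_applies is a hypothetical helper that mirrors the
# rule "$CI_COMMIT_BRANCH == main OR $CI_COMMIT_TAG =~ /^v/".

# prod_gate_applies BRANCH TAG -> exit 0 if deploy:prod would be offered.
prod_gate_applies() (
    branch="${1:-}"
    tag="${2:-}"
    [ "$branch" = "main" ] && return 0
    # The regex /^v/ only requires a leading "v", same as the glob v*.
    case "$tag" in
        v*) return 0 ;;
    esac
    return 1
)
```

An MR pipeline has neither a branch of main nor a v-prefixed tag in these variables, so `prod_gate_applies "" ""` fails, which matches the greyed-out button.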

For anything else, start from the project's CI pipelines list. Nora monitors feature-branch and main pipeline health continuously; ping her before paging Kit.

Known gaps / follow-ups

Alembic migration step is not wired — Stefan flagged this at Phase-0 kickoff. Tracked in #89. The wiring is ready to land the moment Pax has a migration to apply (the job will be a safe no-op until then).
smoke:test does not cover /healthz/db — adding it would couple smoke success to DB reachability through PgBouncer (currently only the app's healthcheck verifies the deploy completed; DB reachability is a separate manual probe). Worth adding once the app actually queries the DB on real routes; defer until then.
No e2e job (Playwright/Cypress) — RBO has no UI yet. Defer; file when the first UI route lands.
Container-registry retention: the intent at onboard time was to keep the last 10 tags, but this is not currently being tracked. Worth a future audit but not blocking.