Deploy pipeline verification (2026-05-04)
End-to-end verification of the RBO V2 deploy pipeline as it stands at Phase-0 kickoff. Filed under #87 so the next agent who needs to know "what runtime targets exist?" can find this without reading three weeks of CI commits.
Verified state
The most recent main-branch pipeline (#88, SHA c3bce97, 2026-05-03) ran end-to-end green:
| Job | Stage | Brick | Outcome |
|---|---|---|---|
| `docs:build` (MR-only) | build | `.mkdocs_pages` (v5) | n/a on main (rules: `CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH`) |
| `build:test` | build | `.buildx_build` (v5) | success (12s warm) |
| `deploy:test` | deploy-test | `.keystone_deploy` (v3 carry-forward) | success (20s) |
| `smoke:test` | smoke | inline (alpine + curl) | success (7s) |
| `deploy:prod` | deploy-prod | `.keystone_deploy` (manual gate) | success (19s) |
| `pages` | pages | `.mkdocs_pages` (v5) | success (18s) |
The three live URLs probed by smoke:test all returned 200:

- https://rbo-test.wagen.io/ → FastAPI root (`{"app":"racket-book","status":"skeleton"}`)
- https://rbo-test.wagen.io/healthz → `ok`
- https://rbo-test.wagen.io/auth/health → 200 (per-tenant gotrue)
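For reproducing the smoke check by hand, the probes can be approximated in Python. This is a minimal sketch, not the inline alpine + curl job itself: the helper names (`smoke_urls`, `probe`, `failed_probes`) are hypothetical, and only the three paths come from this page.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Paths taken from the list above; the base URL is supplied by the caller.
SMOKE_PATHS = ("/", "/healthz", "/auth/health")


def smoke_urls(base_url: str) -> list[str]:
    """Join the base URL with each probed path, avoiding double slashes."""
    return [base_url.rstrip("/") + path for path in SMOKE_PATHS]


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar


def failed_probes(statuses: dict[str, int]) -> list[str]:
    """List the URLs whose status is anything other than 200."""
    return [url for url, status in statuses.items() if status != 200]
```

Calling `failed_probes({u: probe(u) for u in smoke_urls("https://rbo-test.wagen.io")})` returns an empty list when all three endpoints answer 200.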
The mkdocs site is published at https://racket-book-7cc372.gitlab.io (GitLab Pages unique-domain enabled, force_https=true).
Pipeline shape
CI consumes shared bricks from wagen/keystone at ref: v5 (.gitlab/ci-templates/deploy.yml).
| Stage | Job | Brick | Action |
|---|---|---|---|
| build | `build:test` | `.buildx_build` | kst1-shared buildkitd → registry |
| build | `docs:build` (MR only) | `.mkdocs_pages` | `mkdocs --strict` → `public/` |
| deploy-test | `deploy:test` | `.keystone_deploy` | ssh keystone@kst1 → `11-app-deploy.sh` |
| smoke | `smoke:test` | inline | curl `/`, `/healthz`, `/auth/health` |
| deploy-prod | `deploy:prod` (manual) | `.keystone_deploy` | ssh keystone@kst1 → `11-app-deploy.sh` |
| pages | `pages` | `.mkdocs_pages` | `public/` → GitLab Pages |
Stages list (canonical from v4, inherited via include:):
install → check → test → build → db-migrate → deploy-test → smoke → deploy-prod → pages.

RBO does not currently use install, check, test, or db-migrate (no test suite, no migrations yet; see #89 for the migration wiring).
Why deploy:test / deploy:prod extend .keystone_deploy directly (NOT the v4 .deploy_test / .deploy_prod wrappers): the v4 wrappers pin needs: [db:migrate:<env>, build:<env>] (and for prod also test-e2e). RBO has no db:migrate:* and no build:prod job today; extending the wrappers would fail at pipeline-create with unresolved-need errors. Switching to the wrappers is part of #89.
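The unresolved-need failure can be illustrated with a small check that compares each job's `needs:` entries against the jobs that exist. This is a sketch of the reasoning only, not how GitLab implements it. The job map is an assumption reconstructed from the paragraph above, including the guess that `test-e2e` is a job name.

```python
def unresolved_needs(jobs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each job to the needs entries that name no job in the pipeline."""
    defined = set(jobs)
    missing = {
        name: [need for need in needs if need not in defined]
        for name, needs in jobs.items()
    }
    # Keep only the jobs that actually have a gap.
    return {name: gaps for name, gaps in missing.items() if gaps}


# RBO's deploy jobs as they would look if they extended the v4 wrappers
# (illustrative; needs lists taken from the description above).
rbo_with_wrappers = {
    "build:test": [],
    "deploy:test": ["db:migrate:test", "build:test"],
    "smoke:test": [],
    "deploy:prod": ["db:migrate:prod", "build:prod", "test-e2e"],
}

print(unresolved_needs(rbo_with_wrappers))
# {'deploy:test': ['db:migrate:test'], 'deploy:prod': ['db:migrate:prod', 'build:prod', 'test-e2e']}
```

Any non-empty result corresponds to a pipeline that would be rejected at creation time, which is why the jobs extend `.keystone_deploy` directly until #89 lands.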
Image build path
- Builder: persistent `kst1-shared` buildkitd container on the kst1 host (created by `keystone/scripts/13-runner-buildx-init.sh`). Pruned weekly Sunday 04:00 UTC by `/etc/cron.d/keystone-buildx-prune`.
- Runner: self-hosted on kst1, tag `[kst1]`. Shared GitLab.com runners are disabled on this project (`shared_runners_enabled=false`).
- Image: `<container-registry>/<project>:${CI_COMMIT_SHORT_SHA}-${APP_ENV}` (project's own registry, tag per commit + env). RBO is env-agnostic at build time (FastAPI, no client-side bundling) so `deploy:prod` re-uses the `<sha>-test` artifact via an `IMAGE_TAG` override, which saves a redundant `build:prod`.
- Dockerfile final stage MUST be named `runner` (the `.buildx_build` brick hardcodes `--target=runner` per ADR-0010 Rule 2 #3; the v5 brick lets consumers override via `TARGET:` but RBO inherits the default).
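The tagging scheme and the prod re-use of the test image can be written out as a short sketch. The function names and the `BUILT_ENVS` constant are hypothetical; the actual override is a CI variable (`IMAGE_TAG`), not Python.

```python
# Environments that have their own build job today (only build:test exists).
BUILT_ENVS = frozenset({"test"})


def image_ref(registry: str, project: str, short_sha: str, app_env: str) -> str:
    """Full image reference: <registry>/<project>:<sha>-<env>."""
    return f"{registry}/{project}:{short_sha}-{app_env}"


def deploy_image_tag(short_sha: str, app_env: str) -> str:
    """Tag a deploy job should pull for app_env.

    An env without its own build falls back to the test image, which mirrors
    the IMAGE_TAG override that deploy:prod uses.
    """
    source_env = app_env if app_env in BUILT_ENVS else "test"
    return f"{short_sha}-{source_env}"
```

With the SHA from the verified pipeline, `deploy_image_tag("c3bce97", "prod")` yields `c3bce97-test`, the same tag `deploy:test` pulls.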
Deploy path (kst1 side)
The shared .keystone_deploy brick does:
chmod 0600 "$DEPLOY_SSH_KEY" # variable_type=file; runner sets it 0644 by default
printf '%s' "$CI_JOB_TOKEN" | ssh -i "$DEPLOY_SSH_KEY" keystone@${DEPLOY_HOST} \
"bash /opt/keystone/scripts/11-app-deploy.sh ${APP_SLUG} ${APP_ENV} ${IMAGE_TAG}"
On the host side:
- The `keystone` user's `~/.ssh/authorized_keys` carries a `command=`-restricted entry pinning the deploy key to `/opt/keystone/scripts/keystone-deploy-entrypoint.sh`.
- The wrapper validates `<app>` against `config/apps/<app>.env`, validates `<env>` ∈ {prod, test, dev, staging}, validates `<tag>` against `^[A-Za-z0-9._-]{1,128}$`, then exec's `11-app-deploy.sh`.
- `11-app-deploy.sh` does `git pull --ff-only` on `/opt/keystone` first (deploy script self-updates per call), then `docker login` against the registry using the piped `CI_JOB_TOKEN` (ephemeral DOCKER_CONFIG, per ADR-0006), then `docker compose pull` and `docker compose up -d` against `/srv/apps/rbo/${APP_ENV}/compose.yml`.
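The three wrapper checks can be mirrored in Python when debugging a rejected triplet. This is a sketch under stated assumptions: the real wrapper is a shell script whose code is not shown here, `validate_deploy_args` and `apps_dir` are hypothetical names, and the bare-slug guard is an addition that this page does not describe.

```python
import re
from pathlib import Path

ALLOWED_ENVS = frozenset({"prod", "test", "dev", "staging"})
# Same character class and length bounds as the documented pattern; used with
# fullmatch() so a trailing newline cannot slip past the end anchor.
TAG_PATTERN = re.compile(r"[A-Za-z0-9._-]{1,128}")


def validate_deploy_args(app: str, env: str, tag: str, apps_dir: Path) -> None:
    """Raise ValueError unless the (app, env, tag) triplet passes every check."""
    # Extra guard, not described in the source: refuse anything that is not a
    # bare file name, so "../x" cannot escape apps_dir.
    if not app or Path(app).name != app:
        raise ValueError("app must be a bare slug")
    if not (apps_dir / f"{app}.env").is_file():
        raise ValueError(f"unknown app: {app!r}")
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown env: {env!r}")
    if TAG_PATTERN.fullmatch(tag) is None:
        raise ValueError("tag fails the allowed pattern")
```

A call such as `validate_deploy_args("rbo", "test", "c3bce97-test", Path("config/apps"))` returns silently when `config/apps/rbo.env` exists and raises otherwise.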
Compose stacks live at /srv/apps/rbo/{test,prod}/; auto-generated by keystone/scripts/10-app-onboard.sh. Caddy snippets at /srv/platform/caddy/snippets/rbo-{test,prod}.caddy reverse-proxy https://rbo[-test].wagen.io/ to 127.0.0.1:${APP_PORT} and /auth/* to the per-tenant gotrue container on db_net.
DB connectivity
Per ADR-0006 §"App runtime contract — DATABASE_URL host", DATABASE_URL points at keystone-pgbouncer:6432 (the platform's shared PgBouncer container on db_net), NOT keystone-postgres:5432. PgBouncer runs in transaction-pool mode.
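A quick way to confirm that a given `DATABASE_URL` follows this contract is to parse it and compare host and port. The sketch below uses only the standard library. `database_url_problems` is a hypothetical helper, and accepting both `postgresql` and `postgres` as schemes is an assumption.

```python
from urllib.parse import urlsplit

EXPECTED_HOST = "keystone-pgbouncer"
EXPECTED_PORT = 6432


def database_url_problems(database_url: str) -> list[str]:
    """Describe every way the URL deviates from the PgBouncer contract."""
    parts = urlsplit(database_url)
    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append(f"unexpected scheme {parts.scheme!r}")
    if parts.hostname != EXPECTED_HOST:
        problems.append(f"host is {parts.hostname!r}, expected {EXPECTED_HOST!r}")
    if parts.port != EXPECTED_PORT:
        problems.append(f"port is {parts.port!r}, expected {EXPECTED_PORT}")
    # Credentials are never included in the messages.
    return problems
```

An empty list means the URL targets PgBouncer. A URL pointing at `keystone-postgres:5432` reports both a host and a port problem.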
/healthz/db (manual probe; not currently in smoke:test) runs SELECT 1 over a fresh psycopg async connection per request — verified working post-deploy on both envs at the time of the most recent prod deploy.
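Note for Pax (BE): the V2 design must avoid session-scoped Postgres features that PgBouncer's transaction-pool mode can't carry across transactions: server-side prepared statements (psycopg 3 does not detect PgBouncer; it prepares a query automatically once it has run more than `prepare_threshold` times on a connection, default 5, so pass `prepare_threshold=None` unless PgBouncer's own prepared-statement support, `max_prepared_statements`, is enabled), session-level `SET` (`SET LOCAL` inside a transaction is fine), session-level advisory locks (the transaction-scoped `pg_advisory_xact_lock` variants are fine), LISTEN/NOTIFY, and long idle transactions, which pin a pooled server connection. Single-transaction DDL (the Alembic default) IS supported by transaction-pool; see #89 for the migration-wiring decision record on this point.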
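One way to make the prepared-statement point explicit in code, rather than relying on driver defaults, is to pass `prepare_threshold=None` when connecting. The sketch below assumes psycopg 3 is the driver, as the `/healthz/db` description suggests. `pooled_connect_kwargs` and `select_one` are hypothetical names, not functions from the RBO codebase.

```python
from typing import Any


def pooled_connect_kwargs(database_url: str) -> dict[str, Any]:
    """Keyword arguments for psycopg's connect() behind a transaction-mode pooler."""
    return {
        "conninfo": database_url,
        # None disables server-side prepared statements on this connection.
        "prepare_threshold": None,
    }


async def select_one(database_url: str) -> int:
    """Open a fresh connection, run SELECT 1, and return the value."""
    import psycopg  # third-party; imported lazily so the helper above has no dependency

    kwargs = pooled_connect_kwargs(database_url)
    async with await psycopg.AsyncConnection.connect(**kwargs) as conn:
        async with conn.cursor() as cur:
            await cur.execute("SELECT 1")
            row = await cur.fetchone()
    return row[0]
```

Keeping the connection arguments in one helper means every code path that opens a connection inherits the same pooler-safe settings.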
CI variable surface
Provisioned by keystone/scripts/10-app-onboard.sh, env-scoped to test and prod:
| Variable | Type | Masked | Protected | Purpose |
|---|---|---|---|---|
| `DATABASE_URL` | env_var | yes | yes | `postgresql://<role>:<pwd>@keystone-pgbouncer:6432/rbo_<env>` |
| `GOTRUE_JWT_SECRET` | env_var | yes | yes | Per-tenant gotrue JWT signing/verifying secret |
| `GOTRUE_URL` | env_var | no | no | `https://rbo[-test].wagen.io/auth` |
| `IMAGE` | env_var | no | no | Registry path (consumed by `11-app-deploy.sh`) |
| `APP_PORT` | env_var | no | no | 3010 (loopback bind on kst1; Caddy fronts) |
| `APP_HOSTNAME` | env_var | no | no | `rbo[-test].wagen.io` |
| `DEPLOY_HOST` | env_var | no | no | `kst1.wagen.io` |
| `DEPLOY_SSH_KEY` | file | no | yes | Deploy private key. `masked=false` is intentional (multi-line PEM keys can't be masked); the protected flag + the host-side `command=`-restricted forced-command wrapper are the security boundaries. |
STORAGE_* and NEXT_PUBLIC_* keys appear in the project variable list (legacy from a partial onboard run) but RBO does not consume them today; the pipeline and the FastAPI app use only the eight variables in the table above. Safe to ignore.
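When a job fails on a missing variable, it can help to check the whole surface at once. The sketch below lists which of the eight documented names are unset or empty in a given environment mapping. `missing_vars` is a hypothetical helper, and it assumes all eight are expected in the same job, which may not hold for every job in the pipeline.

```python
import os
from typing import Mapping, Optional

# The eight variables documented in the table above.
REQUIRED_VARS = (
    "DATABASE_URL",
    "GOTRUE_JWT_SECRET",
    "GOTRUE_URL",
    "IMAGE",
    "APP_PORT",
    "APP_HOSTNAME",
    "DEPLOY_HOST",
    "DEPLOY_SSH_KEY",
)


def missing_vars(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the documented variables that are unset or empty, in table order."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Only names are reported, never values, so the output is safe to print in a job log.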
How to debug a red pipeline
- `build:test` failed? Check for buildkitd cache pressure first: `ssh kst1 'sudo docker buildx du --builder kst1-shared'`. Manual prune: `ssh kst1 'sudo docker buildx prune --builder kst1-shared --keep-storage 20GB --force'`.
- `deploy:test` SSH failure? The host-side wrapper logs to journald: `ssh kst1 'sudo journalctl -u ssh -t sshd | tail -50'`. Common causes: `DEPLOY_SSH_KEY` rotation drift (re-run `10-app-onboard.sh`), forced-command wrapper rejecting an unknown app/env/tag triplet, kst1 disk pressure preventing `docker compose pull`.
- `smoke:test` 502/503? App container failed to come up. SSH kst1: `cd /srv/apps/rbo/test && docker compose logs app --tail=100`. Most-likely cause: missing/stale env var (DATABASE_URL pointing at pre-fix `127.0.0.1:6432` instead of `keystone-pgbouncer:6432`; `10-app-onboard.sh` heals this drift on next run).
- `pages` failed `--strict`? Cross-doc anchor or broken link. mkdocs prints the offending file:line in the job log; fix locally with `mkdocs build --strict` before pushing.
- `deploy:prod` button greyed out? The job's rules require `$CI_COMMIT_BRANCH == "main"` OR `$CI_COMMIT_TAG =~ /^v/`. Check you're looking at the pipeline of a `main` push, not an MR pipeline.
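The `deploy:prod` rule from the last item can be evaluated by hand for any pipeline. The sketch below restates the rule in Python. `deploy_prod_offered` is a hypothetical name, and `None` stands for a variable GitLab does not set in that pipeline type (for example, `CI_COMMIT_BRANCH` is not set in merge request pipelines).

```python
import re
from typing import Optional


def deploy_prod_offered(commit_branch: Optional[str], commit_tag: Optional[str]) -> bool:
    """True when the rule `branch == "main" OR tag =~ /^v/` matches."""
    on_main = commit_branch == "main"
    on_v_tag = commit_tag is not None and re.search(r"^v", commit_tag) is not None
    return on_main or on_v_tag
```

For an MR pipeline both arguments are `None`, so the function returns `False`, which matches the job being unavailable there.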
For anything else: pipeline list at CI pipelines. Nora monitors feature-branch + main pipeline health continuously; ping her before paging Kit.
Known gaps / follow-ups
- Alembic migration step is not wired: Stefan flagged this at Phase-0 kickoff. Tracked in #89. The wiring is ready to land the moment Pax has a migration to apply (the job will be a safe no-op until then).
- `smoke:test` does not cover `/healthz/db`: adding it would couple smoke success to DB reachability through PgBouncer (currently only the app's healthcheck verifies the deploy completed; DB reachability is a separate manual probe). Worth adding once the app actually queries the DB on real routes; defer until then.
- No `e2e` job (Playwright/Cypress): RBO has no UI yet. Defer; file when the first UI route lands.
- Container-registry retention: pinned to "last 10 tags" intent at onboard time; not currently being tracked. Worth a future audit but not blocking.