Deploy pipeline verification — 2026-05-04

End-to-end verification of the RBO V2 deploy pipeline as it stands at Phase-0 kickoff. Filed under #87 so the next agent who needs to know "what runtime targets exist?" can find this without reading three weeks of CI commits.

Verified state

The most recent main-branch pipeline (#88, SHA c3bce97, 2026-05-03) ran end-to-end green:

Job                   Stage        Brick                                Outcome
docs:build (MR-only)  build        .mkdocs_pages (v5)                   n/a on main (rules: CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH)
build:test            build        .buildx_build (v5)                   success, 12s warm
deploy:test           deploy-test  .keystone_deploy (v3 carry-forward)  success, 20s
smoke:test            smoke        inline (alpine + curl)               success, 7s
deploy:prod           deploy-prod  .keystone_deploy (manual gate)       success, 19s
pages                 pages        .mkdocs_pages (v5)                   success, 18s

All three live URLs probed by smoke:test returned 200:

  • https://rbo-test.wagen.io/ → FastAPI root ({"app":"racket-book","status":"skeleton"})
  • https://rbo-test.wagen.io/healthz → ok
  • https://rbo-test.wagen.io/auth/health → 200 (per-tenant gotrue)
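
For reference, the probe that smoke:test performs can be approximated by the sketch below. It is not the job's actual script: the function names and the 10-second timeout are assumptions made for illustration, and only the three paths come from this page.

```shell
# Hypothetical sketch of a smoke probe; function names and timeout are assumptions.

# Print the three URLs described as probed by smoke:test, one per line.
smoke_urls() {
    base=${1%/}                      # drop a single trailing slash, if present
    printf '%s\n' "$base/" "$base/healthz" "$base/auth/health"
}

# Succeed only when the given HTTP status code is exactly 200.
is_ok_status() {
    [ "$1" = "200" ]
}

# Probe every URL and stop at the first non-200 response.
# Not invoked here; call as: smoke_probe https://rbo-test.wagen.io
smoke_probe() {
    for url in $(smoke_urls "$1"); do
        # -w prints only the status code; a transport failure is recorded as 000.
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code=000
        if ! is_ok_status "$code"; then
            echo "FAIL $url -> $code" >&2
            return 1
        fi
        echo "ok   $url -> $code"
    done
}
```

Keeping the URL list separate from the curl loop makes it easy to add a path later without touching the failure handling.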

The mkdocs site is published at https://racket-book-7cc372.gitlab.io (GitLab Pages unique-domain enabled, force_https=true).

Pipeline shape

CI consumes shared bricks from wagen/keystone at ref: v5 (.gitlab/ci-templates/deploy.yml).

build           build:test              .buildx_build           kst1-shared buildkitd → registry
                docs:build  (MR only)   .mkdocs_pages           mkdocs --strict → public/
deploy-test     deploy:test             .keystone_deploy        ssh keystone@kst1 → 11-app-deploy.sh
smoke           smoke:test              inline                  curl /, /healthz, /auth/health
deploy-prod     deploy:prod  (manual)   .keystone_deploy        ssh keystone@kst1 → 11-app-deploy.sh
pages           pages                   .mkdocs_pages           public/ → GitLab Pages

Stages list (canonical from v4, inherited via include:): install → check → test → build → db-migrate → deploy-test → smoke → deploy-prod → pages. RBO does not currently use install, check, test, or db-migrate (no test suite, no migrations yet — see #89 for the migration wiring).

Why deploy:test / deploy:prod extend .keystone_deploy directly (NOT the v4 .deploy_test / .deploy_prod wrappers): the v4 wrappers pin needs: [db:migrate:<env>, build:<env>] (and for prod also test-e2e). RBO has no db:migrate:* and no build:prod job today; extending the wrappers would fail at pipeline-create with unresolved-need errors. Switching to the wrappers is part of #89.

Image build path

  • Builder: persistent kst1-shared buildkitd container on the kst1 host (created by keystone/scripts/13-runner-buildx-init.sh). Pruned weekly Sunday 04:00 UTC by /etc/cron.d/keystone-buildx-prune.
  • Runner: self-hosted on kst1, tag [kst1]. Shared GitLab.com runners are disabled on this project (shared_runners_enabled=false).
  • Image: <container-registry>/<project>:${CI_COMMIT_SHORT_SHA}-${APP_ENV} (project's own registry, tag per commit + env). RBO is env-agnostic at build time (FastAPI, no client-side bundling) so deploy:prod re-uses the <sha>-test artifact via an IMAGE_TAG override — saves a redundant build:prod.
  • Dockerfile final stage MUST be named runner (the .buildx_build brick hardcodes --target=runner per ADR-0010 Rule 2 #3; the v5 brick lets consumers override via TARGET: but RBO inherits the default).
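
The tagging scheme above can be illustrated with a small sketch. The helper names are hypothetical and the registry path in the usage note is a placeholder; only the <sha>-<env> tag shape and the reuse of the test-built image by deploy:prod come from this page.

```shell
# Hypothetical helpers illustrating the tag scheme; not part of the keystone bricks.

# Compose <registry-path>:<short-sha>-<env>.
image_ref() {
    printf '%s:%s-%s\n' "$1" "$2" "$3"
}

# Tag that deploy:prod would pass as its IMAGE_TAG override
# so that it reuses the image built by build:test.
prod_image_tag() {
    printf '%s-test\n' "$1"
}
```

For example, `image_ref registry.example.com/group/rbo c3bce97 test` prints `registry.example.com/group/rbo:c3bce97-test`, and `prod_image_tag c3bce97` prints `c3bce97-test`.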

Deploy path (kst1 side)

The shared .keystone_deploy brick does:

chmod 0600 "$DEPLOY_SSH_KEY"   # variable_type=file; runner sets it 0644 by default
printf '%s' "$CI_JOB_TOKEN" | ssh -i "$DEPLOY_SSH_KEY" keystone@${DEPLOY_HOST} \
    "bash /opt/keystone/scripts/11-app-deploy.sh ${APP_SLUG} ${APP_ENV} ${IMAGE_TAG}"

On the host side:

  1. The keystone user's ~/.ssh/authorized_keys carries a command=-restricted entry pinning the deploy key to /opt/keystone/scripts/keystone-deploy-entrypoint.sh.
  2. The wrapper validates <app> against config/apps/<app>.env, validates <env> ∈ {prod, test, dev, staging}, validates <tag> against ^[A-Za-z0-9._-]{1,128}$, then exec's 11-app-deploy.sh.
  3. 11-app-deploy.sh does git pull --ff-only on /opt/keystone first (deploy script self-updates per call), then docker login against the registry using the piped CI_JOB_TOKEN (ephemeral DOCKER_CONFIG, per ADR-0006), then docker compose pull && up -d against /srv/apps/rbo/${APP_ENV}/compose.yml.
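
The three checks in step 2 could look roughly like the sketch below. This is a hypothetical re-creation, not the contents of keystone-deploy-entrypoint.sh: the APPS_DIR default and the character restriction on the app name are assumptions, while the env set and the tag pattern are taken from the description above.

```shell
# Hypothetical re-creation of the wrapper's validation; not the real script.
LC_ALL=C                                          # keep character ranges ASCII-only
APPS_DIR=${APPS_DIR:-/opt/keystone/config/apps}   # assumed location of <app>.env files

valid_env() {
    case $1 in
        prod|test|dev|staging) return 0 ;;
        *) return 1 ;;
    esac
}

valid_tag() {
    # Same effect as ^[A-Za-z0-9._-]{1,128}$: non-empty, allowed characters only,
    # and at most 128 characters long.
    case $1 in
        ''|*[!A-Za-z0-9._-]*) return 1 ;;
    esac
    [ "${#1}" -le 128 ]
}

valid_app() {
    # Assumption: restrict the name so it cannot escape APPS_DIR,
    # then require the matching config file to exist.
    case $1 in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
    esac
    [ -f "$APPS_DIR/$1.env" ]
}

# Succeed only when all three parts of the app/env/tag triplet pass.
validate_triplet() {
    valid_app "$1" && valid_env "$2" && valid_tag "$3"
}
```

With a command= forced command, OpenSSH exposes the command line the client requested in SSH_ORIGINAL_COMMAND; how the real entrypoint parses that value is not shown on this page, so the sketch takes the triplet as plain arguments.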

Compose stacks live at /srv/apps/rbo/{test,prod}/; auto-generated by keystone/scripts/10-app-onboard.sh. Caddy snippets at /srv/platform/caddy/snippets/rbo-{test,prod}.caddy reverse-proxy https://rbo[-test].wagen.io/ to 127.0.0.1:${APP_PORT} and /auth/* to the per-tenant gotrue container on db_net.

DB connectivity

Per ADR-0006 §"App runtime contract — DATABASE_URL host", DATABASE_URL points at keystone-pgbouncer:6432 (the platform's shared PgBouncer container on db_net), NOT keystone-postgres:5432. PgBouncer runs in transaction-pool mode.
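
A quick way to confirm that a given DATABASE_URL follows this contract is sketched below. The function names are hypothetical, and the parsing assumes the scheme://user:pwd@host:port/db shape used on this page.

```shell
# Hypothetical check; assumes a URL of the form scheme://user:pwd@host:port/db.

# Print the host:port portion of the URL.
db_hostport() {
    rest=${1#*://}                 # strip the scheme
    rest=${rest##*@}               # strip credentials (everything up to the last @)
    printf '%s\n' "${rest%%/*}"    # drop the database path and anything after it
}

# Succeed only when the URL targets the shared PgBouncer endpoint.
points_at_pgbouncer() {
    [ "$(db_hostport "$1")" = "keystone-pgbouncer:6432" ]
}
```

Running `points_at_pgbouncer "$DATABASE_URL"` inside the app container would flag a value that still points at 127.0.0.1:6432 or directly at keystone-postgres:5432.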

/healthz/db (manual probe; not currently in smoke:test) runs SELECT 1 over a fresh psycopg async connection per request — verified working post-deploy on both envs at the time of the most recent prod deploy.

Note for Pax (BE): the V2 design must avoid session-scoped Postgres features that PgBouncer transaction-pool can't carry across queries: server-side prepared statements (psycopg 3 does not detect PgBouncer on its own; set prepare_threshold=None on the connection unless PgBouncer's own prepared-statement support is enabled), SET outside transactions, session-level advisory locks, LISTEN/NOTIFY, long idle transactions. Single-transaction DDL (the Alembic default) IS supported by transaction-pool; see #89 for the migration-wiring decision record on this point.

CI variable surface

Provisioned by keystone/scripts/10-app-onboard.sh, env-scoped to test and prod:

Variable           Type     Masked  Protected  Purpose
DATABASE_URL       env_var  yes     yes        postgresql://<role>:<pwd>@keystone-pgbouncer:6432/rbo_<env>
GOTRUE_JWT_SECRET  env_var  yes     yes        Per-tenant gotrue JWT signing/verifying secret
GOTRUE_URL         env_var  no      no         https://rbo[-test].wagen.io/auth
IMAGE              env_var  no      no         Registry path (consumed by 11-app-deploy.sh)
APP_PORT           env_var  no      no         3010 (loopback bind on kst1; Caddy fronts)
APP_HOSTNAME       env_var  no      no         rbo[-test].wagen.io
DEPLOY_HOST        env_var  no      no         kst1.wagen.io
DEPLOY_SSH_KEY     file     no      yes        Deploy private key. masked=false is intentional (multi-line PEM keys can't be masked); the protected flag + the host-side command=-restricted forced-command wrapper are the security boundaries.
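
A job could check up front that all eight variables are present, along the lines of the sketch below. The REQUIRED_VARS list mirrors the table above; the function itself is a hypothetical helper, not something the onboard script is known to ship.

```shell
# Hypothetical pre-flight check; the variable names mirror the table above.
REQUIRED_VARS="DATABASE_URL GOTRUE_JWT_SECRET GOTRUE_URL IMAGE APP_PORT APP_HOSTNAME DEPLOY_HOST DEPLOY_SSH_KEY"

# Print the name of every required variable that is unset or empty, one per line.
missing_vars() {
    for name in $REQUIRED_VARS; do
        # Indirect lookup; names come only from the fixed list above, so eval is safe here.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}
```

An empty result from `missing_vars` means the surface is complete for the current environment scope; any printed name points at a variable to re-provision.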

STORAGE_* and NEXT_PUBLIC_* keys appear in the project variable list (legacy from a partial onboard run) but are not consumed by RBO today; the pipeline and the FastAPI app use only the eight variables listed above. Safe to ignore.

How to debug a red pipeline

  1. build:test failed? Check for buildkitd cache pressure first: ssh kst1 'sudo docker buildx du --builder kst1-shared'. Manual prune: ssh kst1 'sudo docker buildx prune --builder kst1-shared --keep-storage 20GB --force'.
  2. deploy:test SSH failure? The host-side wrapper logs to journald: ssh kst1 'sudo journalctl -u ssh -t sshd | tail -50'. Common causes: DEPLOY_SSH_KEY rotation drift (re-run 10-app-onboard.sh), forced-command wrapper rejecting an unknown app/env/tag triplet, kst1 disk pressure preventing docker compose pull.
  3. smoke:test 502/503? App container failed to come up. SSH kst1: cd /srv/apps/rbo/test && docker compose logs app --tail=100. Most-likely cause: missing/stale env var (DATABASE_URL pointing at pre-fix 127.0.0.1:6432 instead of keystone-pgbouncer:6432; 10-app-onboard.sh heals this drift on next run).
  4. pages failed --strict? Cross-doc anchor or broken link. mkdocs prints the offending file:line in the job log; fix locally with mkdocs build --strict before pushing.
  5. deploy:prod button greyed out? The job's rules require $CI_COMMIT_BRANCH == "main" OR $CI_COMMIT_TAG =~ /^v/. Check you're looking at the pipeline of a main push, not an MR pipeline.
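
The rule in step 5 reduces to a simple predicate, sketched below for checking a pipeline's branch and tag values by hand. The function name is hypothetical; the condition is the one quoted in step 5.

```shell
# Hypothetical predicate mirroring the deploy:prod rule quoted in step 5.
# Usage: prod_job_available "$CI_COMMIT_BRANCH" "$CI_COMMIT_TAG"
prod_job_available() {
    branch=$1
    tag=$2
    # Branch pipelines on main qualify.
    [ "$branch" = "main" ] && return 0
    # So do tag pipelines whose tag starts with "v".
    case $tag in
        v*) return 0 ;;
    esac
    return 1
}
```

In merge request pipelines GitLab does not set CI_COMMIT_BRANCH, so both arguments are empty there and the predicate fails, which matches the greyed-out button.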

For anything else, start from the project's pipeline list (CI pipelines). Nora monitors feature-branch + main pipeline health continuously; ping her before paging Kit.

Known gaps / follow-ups

  • Alembic migration step is not wired — Stefan flagged this at Phase-0 kickoff. Tracked in #89. The wiring is ready to land the moment Pax has a migration to apply (the job will be a safe no-op until then).
  • smoke:test does not cover /healthz/db. Adding it would couple smoke success to DB reachability through PgBouncer; today smoke only verifies that the app came up after the deploy, and DB reachability is a separate manual probe. Worth adding once the app actually queries the DB on real routes; defer until then.
  • No e2e job (Playwright/Cypress) — RBO has no UI yet. Defer; file when the first UI route lands.
  • Container-registry retention: the intent at onboard time was to keep the last 10 tags, but whether that policy is actually in effect is not currently tracked. Worth a future audit but not blocking.