ADR-0006: JWT/session contract with gotrue

  • Status: Accepted (amended 2026-05-04)
  • Date: 2026-05-02
  • Decider(s): Theo (SA), with defaults pre-confirmed by Stefan
  • Closes: #79
  • Amended by: #108 (HS256-reality amendment)

Amended 2026-05-04 — the active verify path in production is HS256 + shared GOTRUE_JWT_SECRET, not RS256+JWKS. Atlas field-verified in keystone#102 / !127 that the deployed gotrue is v2.189.0, HS256-only: curl https://rbo-test.wagen.io/auth/.well-known/jwks.json returns 404 (the JWKS endpoint does not exist on this image). RS256+JWKS becomes the aspirational verify path, contingent on a future gotrue upgrade landing a working /auth/.well-known/jwks.json endpoint; until then the RS256 code stays in the tree behind a feature flag (Settings.jwt_verification_mode, default "hs256"). Cookie attributes, claim shape, refresh/logout flow, and chokepoint integration are unchanged — they are gotrue-version-agnostic. Trigger to revisit: gotrue upgrade in keystone with a working JWKS endpoint → flip jwt_verification_mode to "rs256" and retire the HS256 path. See Change log below for the affected sections. Pax-B's parallel code rework lives in #106. Per project memory project_gotrue_hs256.md.

Context

ADR-0001 and auth-and-tenancy.md name the boundary — gotrue owns identity, RBO owns authorization — but never lock the wire-level contract between the two. Without that lock, Pax cannot write the auth middleware deterministically; Iris's auth-flow requirements assume a defaulted set; the Phase 0 gotrue smoke test cannot be written; and the V3 client-portal slot cannot be opened.

This ADR locks the contract: claim shape, key resolution, expiry, refresh, cookies, logout, row-resolution from gotrue user → RBO row, and validation failure modes. Everything below is normative for V2 and the V3 client-portal slot.

Stefan's pre-baked defaults (carried in here as locked, no further confirmation needed):

  • HttpOnly + Secure + SameSite=Lax cookie carries the access token.
  • 1h access-token TTL.
  • Refresh delegated to gotrue (refresh token in second cookie).

The remaining knobs are decided here with documented defaults; any that need Stefan-confirm are flagged in Open questions at the bottom.

Verify path (amended 2026-05-04): the production verify path is HS256 + shared GOTRUE_JWT_SECRET — gotrue v2.189.0 on the keystone platform is HS256-only and exposes no JWKS endpoint. The RS256+JWKS path described below in the original 2026-05-02 decision is the aspirational future state, gated by Settings.jwt_verification_mode ("hs256" default; "rs256" re-engages the JWKS code once a JWKS-capable gotrue is deployed). Everything else in this ADR — claim shape, cookie attributes, refresh / logout flow, chokepoint integration, V3 client-portal slot — is gotrue-version-agnostic and unchanged.

Options

Key resolution

  • (K-1) JWKS endpoint, fetched at boot + cached, refreshed on kid miss. Standard. Survives rotation cleanly. Costs one in-process HTTP call to gotrue at boot + on rotation. Aspirational; not viable on the deployed gotrue v2.189.0 image (no JWKS endpoint).
  • (K-2) Static pubkey from env var. One pubkey baked into the deployment env. No network coupling RBO ↔ gotrue at validation time. Rotation requires re-deploying RBO with a new env var.
  • (K-3) HS256 (shared secret). Pubkey-less; symmetric. Originally rejected — any RBO-side leak forges admin tokens for the whole gotrue tenant. Amended 2026-05-04: this is the current production verify path because gotrue v2.189.0 is HS256-only. The leak-radius concern is mitigated by Atlas's CI-variable surface (env-scoped, masked + protected) and the per-tenant aud validation introduced by keystone !127 (a leaked rbo-test secret cannot mint an rbo-prod token). To be re-evaluated in favour of (K-1) once gotrue gains JWKS support.

Claim shape — sub

  • (S-1) sub = gotrue user UUID (chosen). Stable. Survives email change. RBO resolves to either Stringer.gotrue_user_id or Person.gotrue_user_id at request time.
  • (S-2) sub = stringer_id directly. Couples RBO's PK to gotrue's claim — rejected; V3 needs the same sub to resolve to a Person, and Stringer.id ≠ Person.id.
  • (S-3) sub = email. Email is mutable; rejected.

Cookie domain

  • (C-1) Domain= unset (default — host-only cookie on rbo.wagen.io). Tightest scope. The platform-global cookie idea (cross-app SSO) is V3-or-later.
  • (C-2) Domain=.wagen.io (cross-subdomain SSO). Lets gotrue's session cookie be visible to every app. Tempting but premature; cross-app SSO needs its own ADR (keystone scope), not RBO's. Defer.

Logout

  • (L-1) Cookie clear only. RBO clears cookies; gotrue session technically still valid until access-token TTL expires. Cheap; arguably sloppy.
  • (L-2) Cookie clear + gotrue refresh-token revocation (chosen). RBO calls gotrue's revoke endpoint with the refresh token before clearing the cookie. The access token stays valid until natural expiry (1h ceiling) but cannot be refreshed — so a stolen access cookie is bounded by the 1h TTL even after a logout. This matches the access-token model's actual security posture.

Decision

Wire-level contract

Aspect | Decision
Signing algorithm | HS256 (symmetric) in production today, gated behind Settings.jwt_verification_mode = "hs256" (default). gotrue v2.189.0 signs with GOTRUE_JWT_SECRET; RBO validates with the same secret. RS256 (asymmetric) is the aspirational mode (jwt_verification_mode = "rs256"), unblocked when a future gotrue upgrade exposes a JWKS endpoint. (Amended 2026-05-04 per Atlas's keystone#102 / !127 field-verification; was previously RS256-only.)
Key resolution | HS256 mode (current): the shared secret in GOTRUE_JWT_SECRET is read from Settings at boot. No network coupling RBO ↔ gotrue at validation time. Rotation = Atlas re-runs 10-app-onboard.sh → RBO restart picks up the new value. RS256 mode (aspirational, Option K-1): RBO fetches ${GOTRUE_URL}/.well-known/jwks.json at boot, caches in-process, refreshes on a kid cache miss; refresh rate-limited (one call per minute per kid miss). The RS256 code stays in the tree (under app/auth/jwks.py) so the future flip is a one-env-var change. (Amended 2026-05-04.)
iss claim (issuer) | Validated against the env-var GOTRUE_ISSUER = https://<app>[-<env>].wagen.io/auth (e.g. https://rbo-test.wagen.io/auth) — the same value as API_EXTERNAL_URL on the gotrue container. Mismatch → reject. (Amended 2026-05-04: ISSUER format diverged from the originally-proposed https://auth.wagen.io/<app> because that hostname does not exist on the platform; see ADR-0009 §"ISSUER format deviation" for rationale.)
aud claim (audience) | Set to <app>-<env> (rbo-prod, rbo-test) per the keystone-platform contract. Validated. Mismatch → reject. This protects against a token minted for another keystone-platform app (or another env of the same app) being replayed at RBO. The per-tenant aud flips gotrue's hardcoded aud=authenticated default; it is provisioned by 10-app-onboard.sh per keystone !127.
sub claim | gotrue user UUID. RBO resolves to a Stringer row via Stringer.gotrue_user_id = sub (V2 path), or to a Person row via Person.gotrue_user_id = sub (V3 client-portal slot — not built in V2 but the resolution path exists from day one).
Resolution precedence | Stringer first, then Person. A single gotrue user is normally exactly one of: a Stringer or a Person. The "Stefan-also-becomes-a-client" edge case (Stringer + Person sharing a gotrue_user_id) is documented in data-model.md — RBO does not auto-link them in V2. The middleware resolves Stringer first; if found, binds current_stringer_id. Only if no Stringer matches does it fall through to Person resolution. (V3 will likely add an explicit role-switcher; flagged below.)
Resolution miss → behavior | If the gotrue user is verified but no Stringer/Person row exists in RBO: reject with 403 + structured-log "verified gotrue user without RBO row". This catches misconfigured invites; admin must create the Stringer row before the magic-link is consumed.
Access-token TTL | 1h (Stefan default). Configured on gotrue (GOTRUE_JWT_EXP=3600).
Refresh | Delegated to gotrue. Refresh token lives in a second HttpOnly+Secure cookie (rbo_refresh). RBO does not validate the refresh token directly — when the access token is expired or near-expired, the browser hits an RBO /auth/refresh endpoint that proxies to gotrue's /token?grant_type=refresh_token and rewrites both cookies on the response.
Refresh-token TTL | gotrue default (30d, configurable per env). Documented but owned by gotrue, not RBO.
Pre-emptive refresh | When access token has < 5 minutes left, the RBO middleware silently refreshes on the next request before serving the handler. This avoids mid-action 401s on slow hand-fills.
Cookie attributes (access token) | HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=3600. No Domain= — host-only on rbo.wagen.io (or rbo-test.wagen.io).
Cookie attributes (refresh token) | Same, but Path=/auth/refresh; Max-Age=2592000 (30d). The narrow path means the refresh cookie is sent only to the refresh endpoint, never to handler routes — limits exposure.
Cookie name (access) | rbo_session.
Cookie name (refresh) | rbo_refresh.
CSRF posture | SameSite=Lax + state-changing routes restricted to POST + a synchronizer-token (per-session, embedded in the page, validated on POST). HTMX requests carry the token in the X-CSRF-Token header. Per-form hidden inputs for non-HTMX paths.
Logout | RBO /auth/logout POST: (1) reads rbo_refresh, (2) calls gotrue's /logout revoking the refresh token, (3) clears both cookies (Max-Age=0), (4) redirects to /login. Idempotent (no error if no cookie).
Validation failure modes | Three loud rejection paths: (a) signature invalid → 401 + cookie clear; (b) expired access token + refresh fails → 401 + cookie clear + redirect to /login; (c) sub resolves to no RBO row → 403 + structured-log entry. No silent re-auth: the user always sees the login page.
current_stringer_id / current_person_id binding | Per auth-and-tenancy.md, the middleware binds the appropriate ContextVar after JWT validation + row resolution. This ADR fixes the resolution path; the chokepoint mechanics are unchanged.
Admin role | Determined entirely by Stringer.role = 'admin', not by a JWT claim. The JWT carries identity only; authorization (including admin) is RBO's domain. (See ADR-0001 §Tenancy.)
Bypass-tenant attribute | Set per-session in RBO based on a per-request opt-in by an admin user (e.g. via a ?admin_bypass=1 flag on catalogue moderation routes); never carried in the JWT.
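
The signing, iss, aud and exp rows above can be illustrated with a minimal, standard-library-only sketch of HS256 verification. It is not the code under app/auth; the names verify_hs256, sign_hs256 and TokenRejected are hypothetical, and a real implementation would more likely use a maintained JWT library. The sketch pins the algorithm to HS256 rather than trusting the token header, and treats any failure as a rejection the caller maps to 401 + cookie clear.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


class TokenRejected(Exception):
    """Any validation failure; the caller maps it to 401 + cookie clear."""


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are unpadded base64url; restore the padding before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _mac(secret: str, header_b64: str, payload_b64: str) -> bytes:
    signing_input = f"{header_b64}.{payload_b64}".encode()
    return hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: str) -> str:
    """Test helper only: in production gotrue mints the token, not RBO."""
    header_b64 = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload_b64 = _b64url_encode(json.dumps(claims).encode())
    signature = _mac(secret, header_b64, payload_b64)
    return f"{header_b64}.{payload_b64}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: str, issuer: str, audience: str,
                 now: Optional[float] = None) -> dict:
    """Return the claims if signature, iss, aud and exp all check out."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:
        raise TokenRejected("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise TokenRejected("malformed token")
    # Pin the algorithm: never let the token header choose it.
    if header.get("alg") != "HS256":
        raise TokenRejected("unexpected alg")
    if not hmac.compare_digest(_mac(secret, header_b64, payload_b64), signature):
        raise TokenRejected("bad signature")
    if claims.get("iss") != issuer:
        raise TokenRejected("iss mismatch")
    aud = claims.get("aud")
    if audience not in (aud if isinstance(aud, list) else [aud]):
        raise TokenRejected("aud mismatch")
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        raise TokenRejected("expired")
    return claims
```

With GOTRUE_ISSUER set to https://rbo-test.wagen.io/auth and an audience of rbo-test, a token carrying gotrue's default aud=authenticated would fail the aud check in this sketch, which is the behaviour required test 2 describes.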

Sequence: validated request path

Browser ──(rbo_session cookie)──▶ RBO middleware
                              JWT decode + signature check
                              (HS256 with GOTRUE_JWT_SECRET — current;
                               RS256 with JWKS-cache — aspirational, flag-gated)
                              Validate iss, aud, exp
                                    ├── invalid: 401 + cookie clear
                              Resolve sub → Stringer (then Person)
                                    ├── miss: 403 + log
                              Bind current_stringer_id (or current_person_id) ContextVar
                              Handler runs; chokepoint enforces tenancy
                              Response (cookie refreshed if < 5 min remaining)
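
The resolve-and-bind steps in the sequence above can be sketched as follows. The ContextVar names come from this ADR; everything else is an assumption for illustration: the two mappings stand in for lookups on Stringer.gotrue_user_id and Person.gotrue_user_id, integer row ids are assumed, and bind_principal and NoRboRow are hypothetical names.

```python
from contextvars import ContextVar
from typing import Mapping, Optional

current_stringer_id: ContextVar[Optional[int]] = ContextVar("current_stringer_id", default=None)
current_person_id: ContextVar[Optional[int]] = ContextVar("current_person_id", default=None)


class NoRboRow(Exception):
    """Verified gotrue user without an RBO row; the caller maps this to 403 + log."""


def bind_principal(sub: str, stringers: Mapping[str, int], persons: Mapping[str, int]) -> str:
    """Resolve sub to a Stringer first, then a Person, and bind the matching ContextVar."""
    stringer_id = stringers.get(sub)
    if stringer_id is not None:
        # Stringer wins even when a Person row shares the same gotrue user.
        current_stringer_id.set(stringer_id)
        return "stringer"
    person_id = persons.get(sub)
    if person_id is not None:
        current_person_id.set(person_id)
        return "person"
    raise NoRboRow(sub)
```

Raising on a miss, rather than returning None, keeps the 403 path loud: a handler cannot run with neither ContextVar bound.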

Sequence: refresh

Browser ──(401 from any route or pre-emptive)──▶ POST /auth/refresh (carries rbo_refresh)
                                                gotrue /token?grant_type=refresh_token
                                                new access + refresh tokens
                                                Set-Cookie both, return 204
Browser retries the original request transparently (HTMX wrapper or 401-retry handler)
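
The decision of when to refresh can be expressed as a small pure function. This is a sketch under the ADR's 5-minute default; refresh_action and its return labels are hypothetical names, not part of the contract.

```python
REFRESH_THRESHOLD_SECONDS = 5 * 60  # the ADR's default pre-emptive threshold


def refresh_action(exp: float, now: float, threshold: float = REFRESH_THRESHOLD_SECONDS) -> str:
    """Classify a validated access token by how much lifetime it has left."""
    remaining = exp - now
    if remaining <= 0:
        # Expired: the refresh must succeed, otherwise the request ends in 401.
        return "refresh_required"
    if remaining < threshold:
        # Still valid but close to expiry: refresh before serving the handler.
        return "refresh_preemptive"
    return "serve"
```

Keeping the threshold a parameter reflects that it is listed as an open question and may be changed by configuration.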

Sequence: logout

Browser ──(POST /auth/logout, rbo_refresh cookie)──▶ RBO
                                                gotrue /logout (revoke refresh)
                                                Set-Cookie rbo_session=; Max-Age=0
                                                Set-Cookie rbo_refresh=; Max-Age=0
                                                302 → /login
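
A framework-free sketch of the logout steps above, showing why the handler is idempotent: revocation is attempted only when a refresh token is present, and the cookie clears and redirect are emitted unconditionally. The revoke callable is a hypothetical wrapper around gotrue's /logout call, and logout and clear_cookie are illustrative names. The sketch takes whatever cookies the handler received; note that browsers send a cookie scoped to Path=/auth/refresh only to paths under /auth/refresh, so how rbo_refresh reaches /auth/logout is left to the implementation.

```python
from typing import Callable, List, Mapping, Tuple


def clear_cookie(name: str, path: str) -> str:
    # A cookie is only cleared if Path matches the Path it was set with.
    return f"{name}=; Path={path}; Max-Age=0; HttpOnly; Secure; SameSite=Lax"


def logout(cookies: Mapping[str, str],
           revoke: Callable[[str], None]) -> Tuple[int, List[Tuple[str, str]]]:
    """Return (status, headers) for POST /auth/logout."""
    refresh_token = cookies.get("rbo_refresh")
    if refresh_token:
        revoke(refresh_token)  # hypothetical call to gotrue's /logout
    headers = [
        ("Set-Cookie", clear_cookie("rbo_session", "/")),
        ("Set-Cookie", clear_cookie("rbo_refresh", "/auth/refresh")),
        ("Location", "/login"),
    ]
    return 302, headers
```

A second call with no cookies returns the same 302 without contacting gotrue, which is the behaviour required test 7 checks.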

Key rotation runbook (architectural slot, ops detail in keystone)

HS256 mode (current — amended 2026-05-04).

When gotrue's GOTRUE_JWT_SECRET is rotated:

  1. Atlas updates the secret on the gotrue container and re-runs 10-app-onboard.sh for RBO (which writes the new value into the env-scoped CI variable). Both must happen in the same maintenance window — gotrue and RBO must agree on the secret at any instant.
  2. Atlas redeploys RBO (docker compose up -d) so the container picks up the new env. In-flight tokens signed with the old secret start failing validation immediately on the new RBO instance (401 + cookie clear → user re-logs in).
  3. There is no overlap window for HS256 — RBO holds a single shared secret at a time, so tokens signed with the old secret and tokens signed with the new one cannot both validate. Rotation is therefore disruptive (every active session re-authenticates). Acceptable at our scale; the access-token TTL of 1h is the natural ceiling.

Failure of GOTRUE_JWT_SECRET at validation time: if the env var is missing, pydantic-settings raises at boot and the container exits non-zero (per ADR-0009 §7); Atlas's monitoring surfaces it. There is no JWKS endpoint to fall back on.
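
The fail-fast behaviour can be sketched without pydantic-settings, which is what the real Settings class uses per the paragraph above. This stand-in is an assumption-laden illustration: load_settings is a hypothetical helper, the JWT_VERIFICATION_MODE variable name is assumed, and the rule that the secret is only required in HS256 mode is a guess at sensible behaviour rather than something this ADR states.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    gotrue_issuer: str
    gotrue_audience: str
    gotrue_jwt_secret: str = ""
    jwt_verification_mode: str = "hs256"  # default per this ADR


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment mapping, raising on anything missing."""
    mode = env.get("JWT_VERIFICATION_MODE", "hs256")  # assumed variable name
    if mode not in ("hs256", "rs256"):
        raise RuntimeError(f"unsupported jwt_verification_mode: {mode!r}")
    required = ["GOTRUE_ISSUER", "GOTRUE_AUDIENCE"]
    if mode == "hs256":
        required.append("GOTRUE_JWT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Raising at boot makes the container exit non-zero instead of serving.
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return Settings(
        gotrue_issuer=env["GOTRUE_ISSUER"],
        gotrue_audience=env["GOTRUE_AUDIENCE"],
        gotrue_jwt_secret=env.get("GOTRUE_JWT_SECRET", ""),
        jwt_verification_mode=mode,
    )
```

At process start this would be called as load_settings(os.environ); passing a mapping keeps the function testable without touching the real environment.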

RS256 mode (aspirational, future).

When gotrue eventually exposes JWKS and jwt_verification_mode is flipped to "rs256":

  1. New kid appears at the JWKS endpoint alongside the old one (gotrue maintains overlap).
  2. RBO's in-process JWKS cache misses on the next token signed with the new kid → fetches fresh JWKS → validates.
  3. Old kid is purged from gotrue after the overlap window. Tokens still in flight signed with the old key fail validation gracefully (401 + cookie clear → user re-logs in).

Overlap window: owned by gotrue config, recommended ≥ 1h (≥ access-token TTL) so no in-flight session breaks. Documented as an operator commitment in keystone's auth runbook — RBO simply tolerates rotation by re-fetching JWKS on cache miss.

Failure of JWKS endpoint at validation time: RBO serves the cached JWKS; if no cached JWKS exists (cold start during a JWKS outage), every request 503s with structured log. This is correct behavior — RBO without JWKS cannot validate tokens, so it cannot serve. The cold-start-during-JWKS-outage window is small (boot is fast; JWKS fetch is one round-trip).
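
The miss-driven cache described for RS256 mode could look roughly like the sketch below. It is not the contents of app/auth/jwks.py. The fetch callable is a hypothetical function returning a kid-to-key mapping, JwksCache and JwksUnavailable are illustrative names, and the rate limit is implemented as one refresh per minute overall, which is one reading of "one call per minute per kid miss".

```python
import time
from typing import Callable, Dict, Optional


class JwksUnavailable(Exception):
    """No JWKS cached and none fetchable; the caller maps this to 503."""


class JwksCache:
    def __init__(self, fetch: Callable[[], Dict[str, dict]],
                 min_refresh_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._min_refresh_interval = min_refresh_interval
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._last_refresh: Optional[float] = None

    def _maybe_refresh(self) -> None:
        now = self._clock()
        if self._last_refresh is not None and now - self._last_refresh < self._min_refresh_interval:
            return  # rate-limited: keep using what is cached
        self._last_refresh = now
        try:
            self._keys = self._fetch()
        except Exception:
            # Endpoint down: keep serving the previously cached keys, if any.
            pass

    def get(self, kid: str) -> dict:
        """Return the key for kid, refreshing once on a cache miss."""
        if kid not in self._keys:
            self._maybe_refresh()
        if not self._keys:
            raise JwksUnavailable("no JWKS cached")
        if kid not in self._keys:
            raise KeyError(kid)  # unknown kid: the caller rejects the token
        return self._keys[kid]
```

Injecting the clock and the fetch function keeps the rotation scenario in required test 10 reproducible without a network or real time.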

V3 client-portal slot

The same JWT contract serves V3 clients with zero schema or middleware changes:

  • gotrue mints a magic-link for a Person; the resulting JWT has the Person's gotrue_user_id as sub.
  • The same middleware decode+validate path runs.
  • Resolution: no Stringer matches (Persons don't get Stringer rows); Person matches via Person.gotrue_user_id = sub; current_person_id ContextVar is bound (instead of current_stringer_id).
  • The chokepoint applies the Person-scoped predicate per auth-and-tenancy.md.

The Stefan-as-both case (Stefan is a Stringer AND has a Person row from another stringer's perspective): in V2, never auto-linked. In V3, explicit role-switcher in the UI (flagged as Open question below) chooses which ContextVar is bound for the session — implementation detail, not a wire-contract change.

What this ADR does NOT cover

  • Password complexity, rate limiting, MFA, account lockout — gotrue's domain. RBO inherits whatever gotrue is configured for.
  • Magic-link minting and validation — gotrue's domain. RBO consumes the resulting JWT.
  • Email delivery of magic-links / password resets — gotrue calls Resend SMTP per integrations.md.
  • Cross-app SSO via Domain=.wagen.io cookie — deferred; needs its own ADR in keystone scope.
  • Token introspection or revocation list — out of scope; the 1h TTL is the bound on a stolen access token.

Required tests (this ADR mandates them)

Amended 2026-05-04: tests 1–9 below are gotrue-version-agnostic and run in HS256 mode today. Test 10 (JWKS rotation) is gated behind jwt_verification_mode == "rs256" and stays in the tree as a regression for the future flip; it does not run in CI by default. A new HS256-specific test 11 is added.

  1. JWT signature validation. Forge a token with a wrong key → middleware rejects with 401 and clears cookie. (HS256: forge by signing with the wrong secret. RS256: forge by signing with the wrong RSA key.)
  2. iss / aud mismatch. Token signed by gotrue but with wrong iss or wrong aud → reject. Specifically, a token minted for another keystone app (different aud) is rejected by RBO. Likewise, a token with the gotrue v2.189.0 default aud=authenticated is rejected because it does not match <app>-<env>.
  3. Expired access token + valid refresh. Pre-emptive refresh path replaces both cookies; original request succeeds without surfacing a 401.
  4. Expired access + expired refresh. 401 + redirect to /login.
  5. sub resolves to no RBO row. 403 + structured log entry containing the gotrue user UUID. Cookie NOT cleared (the JWT is valid; the misconfiguration is server-side).
  6. Stringer-first resolution. A gotrue user UUID matching both a Stringer and a Person row resolves to Stringer. (V2 fixture; V3 will revisit.)
  7. Logout idempotency. Two consecutive POST /auth/logout calls both return 302 to /login; the second is a no-op.
  8. Logout revokes refresh. After logout, presenting the old refresh token at /auth/refresh returns 401 (gotrue refused).
  9. CSRF. State-changing POST without X-CSRF-Token (HTMX path) or hidden form input (non-HTMX) returns 403.
  10. JWKS rotation (RS256-mode regression — gated). Mock a JWKS endpoint serving two kids; validate tokens signed with each; rotate (drop old kid from JWKS); cached old kid continues to validate until next cache refresh; new tokens with new kid trigger refresh and validate. Skipped when jwt_verification_mode == "hs256"; runs only when the future RS256 flip lands.
  11. HS256 secret-rotation discipline (HS256-mode regression). A token minted with the previous GOTRUE_JWT_SECRET is rejected with 401 + cookie clear after RBO restarts with a new secret. (No HS256 overlap window — verifies the rotation behaviour described in §"Key rotation runbook".)

Consequences

Good

  • One normative wire contract. Pax has a single page to implement against; Iris's auth-flow requirements have a defaulted answer for every cookie / claim / TTL knob; the chokepoint stays decoupled from JWT internals.
  • V3 client portal lights up without schema or middleware changes. The contract is shape-symmetric for Stringer and Person resolution.
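  • Key rotation is operationally cheap in RS256 mode (aspirational). JWKS-on-cache-miss + gotrue's overlap window means no coordinated deploy across RBO and gotrue. In the current HS256 mode rotation is disruptive and needs a coordinated gotrue + RBO redeploy (see §"Key rotation runbook").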
  • Stolen access cookie is bounded at 1h. Logout revokes refresh; access-token TTL caps the damage. This is the textbook short-TTL access + revocable-refresh model.
  • Cross-app token confusion is structurally prevented by aud validation. A future CSD or ALJ JWT cannot be replayed at RBO.
  • JSON-stdout structured logs at every rejection path mean an auth incident is auditable end-to-end.

Costs we accept

  • Two cookies, two paths. Slightly more middleware code than a single-cookie design. Worth it: refresh cookie is path-scoped to /auth/refresh, never sent to handler routes.
  • Pre-emptive refresh adds latency on the request that triggers it (one extra round-trip to gotrue). The < 5 min threshold means most users never hit it. The alternative (let it 401, retry) is worse UX.
  • JWKS cache cold-start during a JWKS outage = 503 (RS256 mode only; not applicable in the current HS256 mode). Acceptable: RBO without JWKS cannot validate, so it cannot serve. Boot order: RBO waits for JWKS at startup before declaring itself ready.
  • CSRF synchronizer token is one more thing to forget. Mitigated by enforcing it at the middleware level for all state-changing routes — not per-handler.
  • Stefan-as-both edge case is documented but not auto-resolved in V2. Two rows, one gotrue user — V2 resolves to Stringer. The V3 role-switcher is a UX problem, not a contract problem.
  • No revocation list. Once issued, an access token is valid for its TTL — there is no "kill this specific token now." Acceptable at our threat model and TTL.
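
The middleware-level CSRF enforcement mentioned in the list above can be sketched as a single check applied to every request. The X-CSRF-Token header is from this ADR; the csrf_token form field name, the helper names, and the exact set of safe methods are assumptions for illustration. Header lookup here is case-sensitive, whereas a real framework would normalise header names.

```python
import hmac
import secrets
from typing import Mapping, Optional

SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def new_csrf_token() -> str:
    """Per-session synchronizer token, embedded in the rendered page."""
    return secrets.token_urlsafe(32)


def csrf_ok(method: str, session_token: Optional[str],
            headers: Mapping[str, str], form: Mapping[str, str]) -> bool:
    """True if the request may proceed; False maps to 403."""
    if method.upper() in SAFE_METHODS:
        return True
    # HTMX requests carry the header; plain forms carry the hidden input.
    presented = headers.get("X-CSRF-Token") or form.get("csrf_token")
    if not session_token or not presented:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(session_token.encode(), presented.encode())
```

Running the check before routing, rather than inside each handler, is what keeps a newly added POST route from silently shipping without protection.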

Open questions (Stefan-confirm)

  1. JWKS cache TTL — currently "no time-based TTL, refresh on kid miss only" (Option K-1 default). Alternative: 5-minute background refresh. The miss-driven model is simpler and avoids unnecessary network calls, but a poisoned-cache scenario (stale JWKS pinned in memory) is theoretically possible. Default proposed: miss-driven only. Stefan to confirm or flip. (Aspirational — RS256-mode-only; immaterial in current HS256 mode.)
  2. V3 role-switcher for the Stefan-as-both case — explicit UI affordance ("acting as Stringer / acting as Client") in V3, vs. URL-segment-driven (/admin vs. /me). Architectural slot named here; UX decision is V3.
  3. Pre-emptive refresh threshold — defaulted to 5 minutes. Could be 1 minute (more 401-retry exposure) or 10 minutes (more refresh chatter). 5 min is a balanced default.
  4. Refresh cookie Path=/auth/refresh scope — narrow scope is the default. Alternative is Path=/ (cookie sent on every request) for simpler debugging. Narrow default kept.

All four default to the values above if Stefan does not flip them. Each is a single-line config change.

Change log

Date | Change | Reason
2026-05-02 | Initial accepted version (RS256+JWKS verify path; HS256 rejected as K-3). |
2026-05-04 | Amendment — verify path flipped to HS256 + shared GOTRUE_JWT_SECRET as the current production reality; RS256+JWKS demoted to aspirational, gated behind Settings.jwt_verification_mode (default "hs256"). Updated: top-of-file amendment block, Stefan's pre-baked defaults (added "Verify path" paragraph), Option K-1 / K-3 (re-evaluated rejection rationale), Wire-level contract decision table (Signing algorithm + Key resolution + ISSUER format + AUDIENCE rows), validated-request sequence diagram (annotated for both modes), Key rotation runbook (split into HS256-mode + RS256-mode subsections), Required tests (gated test 10 behind RS256, added test 11 for HS256 secret-rotation discipline), Cross-references (added keystone#102 / !127 + #106 + project memory + ADR-0009). Cookie attributes, claim shape, refresh / logout flow, chokepoint integration, V3 client-portal slot are unchanged — they are gotrue-version-agnostic. Trigger to revisit: gotrue upgrade in keystone landing a working /auth/.well-known/jwks.json endpoint → flip jwt_verification_mode to "rs256" and retire HS256. | Atlas's field-verification of deployed gotrue v2.189.0 (HS256-only, no JWKS endpoint) in keystone#102 / !127. RS256+JWKS would 401 every authenticated request because the JWKS fetch 404s. Pax-B's parallel code rework lives in #106. Per project memory project_gotrue_hs256.md. Tracked in #108.

Cross-references

  • ADR-0001 — names the boundary; this ADR fills in the wire-level contract.
  • ADR-0004 — specifies the V3 client-portal Person-bound session slot; this ADR specifies the JWT path that lights it up.
  • ADR-0009 — env-var contract (GOTRUE_JWT_SECRET, GOTRUE_ISSUER, GOTRUE_AUDIENCE) co-amended 2026-05-04 with the same trigger.
  • docs/architecture/auth-and-tenancy.md — chokepoint mechanics; binds current_stringer_id / current_person_id set by this ADR's middleware.
  • keystone ADR-0005 — Resend SMTP that gotrue uses for magic-links.
  • keystone#102 / !127 — Atlas's field-verification + per-tenant gotrue env provisioning (the trigger for this amendment).
  • Iris's auth-flow requirements (docs/requirements/use-cases.md and the requirements log).
  • #106 — Pax-B's parallel HS256 code rework (companion MR to this amendment).
  • #108 — this amendment.