ADR-0006: JWT/session contract with gotrue
- Status: Accepted (amended 2026-05-04)
- Date: 2026-05-02
- Decider(s): Theo (SA), with defaults pre-confirmed by Stefan
- Closes: #79
- Amended by: #108 (HS256-reality amendment)
Amended 2026-05-04 — the active verify path in production is HS256 + shared GOTRUE_JWT_SECRET, not RS256+JWKS. Atlas field-verified in keystone#102 / !127 that the deployed gotrue is v2.189.0, HS256-only — curl https://rbo-test.wagen.io/auth/.well-known/jwks.json returns 404 (the JWKS endpoint does not exist on this image). RS256+JWKS becomes the aspirational verify path, contingent on a future gotrue upgrade landing a working /auth/.well-known/jwks.json endpoint; until then the RS256 code stays in the tree behind a feature flag (Settings.jwt_verification_mode, default "hs256"). Cookie attributes, claim shape, refresh/logout flow, and chokepoint integration are unchanged — they are gotrue-version-agnostic. Trigger to revisit: gotrue upgrade in keystone with a working JWKS endpoint → flip jwt_verification_mode to "rs256" and retire the HS256 path. See Change log below for the affected sections. Pax-B's parallel code rework lives in #106. Per project memory project_gotrue_hs256.md.
Context
ADR-0001 and auth-and-tenancy.md name the boundary — gotrue owns identity, RBO owns authorization — but never lock the wire-level contract between the two. Without that lock, Pax cannot write the auth middleware deterministically; Iris's auth-flow requirements assume a defaulted set; the Phase 0 gotrue smoke test cannot be written; and the V3 client-portal slot cannot be opened.
This ADR locks the contract: claim shape, key resolution, expiry, refresh, cookies, logout, row-resolution from gotrue user → RBO row, and validation failure modes. Everything below is normative for V2 and the V3 client-portal slot.
Stefan's pre-baked defaults (carried in here as locked, no further confirmation needed):
- HttpOnly + Secure + SameSite=Lax cookie carries the access token.
- 1h access-token TTL.
- Refresh delegated to gotrue (refresh token in second cookie).
The remaining knobs are decided here with documented defaults; any that need Stefan-confirm are flagged in Open questions at the bottom.
Verify path (amended 2026-05-04): the production verify path is HS256 + shared GOTRUE_JWT_SECRET — gotrue v2.189.0 on the keystone platform is HS256-only and exposes no JWKS endpoint. The RS256+JWKS path described below in the original 2026-05-02 decision is the aspirational future state, gated by Settings.jwt_verification_mode ("hs256" default; "rs256" re-engages the JWKS code once a JWKS-capable gotrue is deployed). Everything else in this ADR — claim shape, cookie attributes, refresh / logout flow, chokepoint integration, V3 client-portal slot — is gotrue-version-agnostic and unchanged.
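A minimal sketch of how the mode flag and the fail-fast secret check could look. The ADR names pydantic-settings as the real mechanism; this stand-in uses only the standard library so it runs anywhere. The class name AuthSettings, the function load_auth_settings, and the env-var name JWT_VERIFICATION_MODE are assumptions for illustration, not names taken from the codebase.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_MODES = ("hs256", "rs256")


@dataclass(frozen=True)
class AuthSettings:
    """Hypothetical stand-in for the auth-related part of Settings."""
    jwt_verification_mode: str
    gotrue_jwt_secret: Optional[str]
    gotrue_issuer: str
    gotrue_audience: str


def load_auth_settings(env: Mapping[str, str] = os.environ) -> AuthSettings:
    # Default is "hs256": the only mode the deployed gotrue can satisfy.
    # JWT_VERIFICATION_MODE is an assumed env-var name.
    mode = env.get("JWT_VERIFICATION_MODE", "hs256").lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown jwt_verification_mode: {mode!r}")

    secret = env.get("GOTRUE_JWT_SECRET")
    # Fail fast at boot: HS256 cannot validate anything without the secret.
    if mode == "hs256" and not secret:
        raise ValueError("GOTRUE_JWT_SECRET is required in hs256 mode")

    return AuthSettings(
        jwt_verification_mode=mode,
        gotrue_jwt_secret=secret,
        gotrue_issuer=env["GOTRUE_ISSUER"],
        gotrue_audience=env["GOTRUE_AUDIENCE"],
    )
```

With this shape, the future flip to RS256 is a change to one environment variable, and a missing secret stops the process before it serves a request.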
Options
Key resolution
- (K-1) JWKS endpoint, fetched at boot + cached, refreshed on kid miss. Standard. Survives rotation cleanly. Costs one in-process HTTP call to gotrue at boot + on rotation. Aspirational; not viable on the deployed gotrue v2.189.0 image (no JWKS endpoint).
- (K-2) Static pubkey from env var. One pubkey baked into the deployment env. No network coupling RBO ↔ gotrue at validation time. Rotation requires re-deploying RBO with a new env var.
- (K-3) HS256 (shared secret). Pubkey-less; symmetric. Originally rejected — any RBO-side leak forges admin tokens for the whole gotrue tenant. Amended 2026-05-04: this is the current production verify path because gotrue v2.189.0 is HS256-only. The leak-radius concern is mitigated by Atlas's CI-variable surface (env-scoped, masked + protected) and the per-tenant aud validation introduced by keystone !127 (a leaked rbo-test secret cannot mint an rbo-prod token). To be re-evaluated in favour of (K-1) when gotrue gains JWKS support.
Claim shape — sub
- (S-1) sub = gotrue user UUID (chosen). Stable. Survives email change. RBO resolves to either Stringer.gotrue_user_id or Person.gotrue_user_id at request time.
- (S-2) sub = stringer_id directly. Couples RBO's PK to gotrue's claim — rejected; V3 needs the same sub to resolve to a Person, and Stringer.id ≠ Person.id.
- (S-3) sub = email. Email is mutable; rejected.
Cookie scope
- (C-1) Domain= unset (default — host-only cookie on rbo.wagen.io). Tightest scope. The platform-global cookie idea (cross-app SSO) is V3-or-later.
- (C-2) Domain=.wagen.io (cross-subdomain SSO). Lets gotrue's session cookie be visible to every app. Tempting but premature; cross-app SSO needs its own ADR (keystone scope), not RBO's. Defer.
Logout
- (L-1) Cookie clear only. RBO clears cookies; gotrue session technically still valid until access-token TTL expires. Cheap; arguably sloppy.
- (L-2) Cookie clear + gotrue refresh-token revocation (chosen). RBO calls gotrue's revoke endpoint with the refresh token before clearing the cookie. The access token stays valid until natural expiry (1h ceiling) but cannot be refreshed — so a stolen access cookie is bounded by the 1h TTL even after a logout. This matches the access-token model's actual security posture.
Decision
Wire-level contract
| Aspect | Decision |
|---|---|
| Signing algorithm | HS256 (symmetric) in production today, gated behind Settings.jwt_verification_mode = "hs256" (default). gotrue v2.189.0 signs with GOTRUE_JWT_SECRET; RBO validates with the same secret. RS256 (asymmetric) is the aspirational mode (jwt_verification_mode = "rs256"), unblocked when a future gotrue upgrade exposes a JWKS endpoint. (Amended 2026-05-04 per Atlas's keystone#102 / !127 field-verification; was previously RS256-only.) |
| Key resolution | HS256 mode (current): the shared secret in GOTRUE_JWT_SECRET is read from Settings at boot. No network coupling RBO ↔ gotrue at validation time. Rotation = Atlas re-runs 10-app-onboard.sh → RBO restart picks up the new value. RS256 mode (aspirational, Option K-1): RBO fetches ${GOTRUE_URL}/.well-known/jwks.json at boot, caches in-process, refreshes on a kid cache miss; refresh rate-limited (one call per minute per kid miss). The RS256 code stays in the tree (under app/auth/jwks.py) so the future flip is a one-env-var change. (Amended 2026-05-04.) |
| iss claim (issuer) | Validated against the env-var GOTRUE_ISSUER = https://<app>[-<env>].wagen.io/auth (e.g. https://rbo-test.wagen.io/auth) — the same value as API_EXTERNAL_URL on the gotrue container. Mismatch → reject. (Amended 2026-05-04: ISSUER format diverged from the originally-proposed https://auth.wagen.io/<app> because that hostname does not exist on the platform; see ADR-0009 §"ISSUER format deviation" for rationale.) |
| aud claim (audience) | Set to <app>-<env> (rbo-prod, rbo-test) per the keystone-platform contract. Validated. Mismatch → reject. This protects against a token minted for another keystone-platform app (or another env of the same app) being replayed at RBO. The per-tenant aud flips gotrue's hardcoded aud=authenticated default; it is provisioned by 10-app-onboard.sh per keystone !127. |
| sub claim | gotrue user UUID. RBO resolves to a Stringer row via Stringer.gotrue_user_id = sub (V2 path), or to a Person row via Person.gotrue_user_id = sub (V3 client-portal slot — not built in V2 but the resolution path exists from day one). |
| Resolution precedence | Stringer first, then Person. A single gotrue user is normally exactly one of: a Stringer or a Person. The "Stefan-also-becomes-a-client" edge case (Stringer + Person sharing a gotrue_user_id) is documented in data-model.md — RBO does not auto-link them in V2. The middleware resolves Stringer first; if found, binds current_stringer_id. Only if no Stringer matches does it fall through to Person resolution. (V3 will likely add an explicit role-switcher; flagged below.) |
| Resolution miss → behavior | If the gotrue user is verified but no Stringer/Person row exists in RBO: reject with 403 + structured-log "verified gotrue user without RBO row". This catches misconfigured invites; admin must create the Stringer row before the magic-link is consumed. |
| Access-token TTL | 1h (Stefan default). Configured on gotrue (GOTRUE_JWT_EXP=3600). |
| Refresh | Delegated to gotrue. Refresh token lives in a second HttpOnly+Secure cookie (rbo_refresh). RBO does not validate the refresh token directly — when the access token is expired or near-expired, the browser hits an RBO /auth/refresh endpoint that proxies to gotrue's /token?grant_type=refresh_token and rewrites both cookies on the response. |
| Refresh-token TTL | gotrue default (30d, configurable per env). Documented but owned by gotrue, not RBO. |
| Pre-emptive refresh | When access token has < 5 minutes left, the RBO middleware silently refreshes on the next request before serving the handler. This avoids mid-action 401s on slow hand-fills. |
| Cookie attributes (access token) | HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=3600. No Domain= — host-only on rbo.wagen.io (or rbo-test.wagen.io). |
| Cookie attributes (refresh token) | Same, but Path=/auth/refresh; Max-Age=2592000 (30d). The narrow path means the refresh cookie is sent only to the refresh endpoint, never to handler routes — limits exposure. |
| Cookie name (access) | rbo_session. |
| Cookie name (refresh) | rbo_refresh. |
| CSRF posture | SameSite=Lax + state-changing routes restricted to POST + a synchronizer-token (per-session, embedded in the page, validated on POST). HTMX requests carry the token in the X-CSRF-Token header. Per-form hidden inputs for non-HTMX paths. |
| Logout | RBO /auth/logout POST: (1) reads rbo_refresh, (2) calls gotrue's /logout revoking the refresh token, (3) clears both cookies (Max-Age=0), (4) redirects to /login. Idempotent (no error if no cookie). |
| Validation failure modes | Three loud rejection paths: (a) signature invalid → 401 + cookie clear; (b) expired access token + refresh fails → 401 + cookie clear + redirect to /login; (c) sub resolves to no RBO row → 403 + structured-log entry. No silent re-auth: the user always sees the login page. |
| current_stringer_id / current_person_id binding | Per auth-and-tenancy.md, the middleware binds the appropriate ContextVar after JWT validation + row resolution. This ADR fixes the resolution path; the chokepoint mechanics are unchanged. |
| Admin role | Determined entirely by Stringer.role = 'admin', not by a JWT claim. The JWT carries identity only; authorization (including admin) is RBO's domain. (See ADR-0001 §Tenancy.) |
| Bypass-tenant attribute | Set per-session in RBO based on a per-request opt-in by an admin user (e.g. via a ?admin_bypass=1 flag on catalogue moderation routes); never carried in the JWT. |
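The cookie rows of the table can be illustrated with a short sketch that builds both Set-Cookie values using Python's standard http.cookies module. The function name build_auth_cookies is hypothetical; the cookie names, paths, and Max-Age values are the ones from the table above.

```python
from http.cookies import SimpleCookie


def build_auth_cookies(access_token: str, refresh_token: str) -> list:
    """Return the two Set-Cookie header values described in the contract table."""
    jar = SimpleCookie()

    jar["rbo_session"] = access_token
    jar["rbo_session"]["path"] = "/"
    jar["rbo_session"]["max-age"] = 3600          # 1h access-token TTL

    jar["rbo_refresh"] = refresh_token
    jar["rbo_refresh"]["path"] = "/auth/refresh"  # narrow path: refresh endpoint only
    jar["rbo_refresh"]["max-age"] = 2592000       # 30d

    for name in ("rbo_session", "rbo_refresh"):
        jar[name]["httponly"] = True
        jar[name]["secure"] = True
        jar[name]["samesite"] = "Lax"
        # No "domain" attribute is set, so both cookies stay host-only.

    return [jar[name].OutputString() for name in ("rbo_session", "rbo_refresh")]
```

Each returned string is the value of one Set-Cookie response header; a web framework's own cookie helper would normally be used instead, with the same attributes.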
Sequence: validated request path
Browser ──(rbo_session cookie)──▶ RBO middleware
│
▼
JWT decode + signature check
(HS256 with GOTRUE_JWT_SECRET — current;
RS256 with JWKS-cache — aspirational, flag-gated)
│
▼
Validate iss, aud, exp
│
├── invalid: 401 + cookie clear
│
▼
Resolve sub → Stringer (then Person)
│
├── miss: 403 + log
│
▼
Bind current_stringer_id (or current_person_id) ContextVar
│
▼
Handler runs; chokepoint enforces tenancy
│
▼
Response (cookie refreshed if < 5 min remaining)
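The decode, signature-check, and claim-validation steps of this sequence can be sketched for the current HS256 mode as follows. This is an illustrative standard-library version, not the code in the tree; a maintained JWT library would normally do this work. The names InvalidToken, sign_hs256, and verify_hs256 are hypothetical, and aud is treated as a single string, which is an assumption about the tokens gotrue mints.

```python
import base64
import hashlib
import hmac
import json
import time


class InvalidToken(Exception):
    """Maps to the 401 + cookie-clear rejection path."""


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _mac(secret: str, signing_input: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: str) -> str:
    """Test helper: mint a token the way an HS256 issuer would."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    return f"{header}.{payload}.{_b64url_encode(_mac(secret, f'{header}.{payload}'))}"


def verify_hs256(token: str, secret: str, issuer: str, audience: str, now=None) -> dict:
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError as exc:
        raise InvalidToken("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise InvalidToken("malformed token")
    # Pin the algorithm: the token header must not choose it.
    if header.get("alg") != "HS256":
        raise InvalidToken("unexpected alg")
    if not hmac.compare_digest(signature, _mac(secret, f"{header_b64}.{payload_b64}")):
        raise InvalidToken("bad signature")
    if claims.get("iss") != issuer:
        raise InvalidToken("iss mismatch")
    if claims.get("aud") != audience:
        raise InvalidToken("aud mismatch")
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        raise InvalidToken("expired")
    return claims
```

The same function also shows why HS256 rotation has no overlap window: a token signed with the previous secret fails the signature comparison as soon as the verifier holds the new one.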
Sequence: refresh
Browser ──(401 from any route or pre-emptive)──▶ POST /auth/refresh (carries rbo_refresh)
│
▼
gotrue /token?grant_type=refresh_token
│
▼
new access + refresh tokens
│
▼
Set-Cookie both, return 204
│
▼
Browser retries the original request transparently (HTMX wrapper or 401-retry handler)
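The decision between serving normally, refreshing pre-emptively, and treating the token as expired comes down to a comparison against the 5-minute threshold. A minimal sketch, assuming the middleware has the exp claim at hand; the function name and the three state labels are hypothetical.

```python
import time
from typing import Optional

REFRESH_THRESHOLD_SECONDS = 5 * 60  # "< 5 minutes left" from the contract table


def access_token_state(exp: float, now: Optional[float] = None) -> str:
    """Classify an access token by its remaining lifetime."""
    now = time.time() if now is None else now
    remaining = exp - now
    if remaining <= 0:
        return "expired"   # refresh is required before the handler can run
    if remaining < REFRESH_THRESHOLD_SECONDS:
        return "refresh"   # still valid: refresh pre-emptively, then serve
    return "fresh"         # serve without touching gotrue
```

Passing now explicitly keeps the check deterministic in tests; production code would rely on the default clock.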
Sequence: logout
Browser ──(POST /auth/logout, rbo_refresh cookie)──▶ RBO
│
▼
gotrue /logout (revoke refresh)
│
▼
Set-Cookie rbo_session=; Max-Age=0
Set-Cookie rbo_refresh=; Max-Age=0
│
▼
302 → /login
Key rotation runbook (architectural slot, ops detail in keystone)
HS256 mode (current — amended 2026-05-04).
When gotrue's GOTRUE_JWT_SECRET is rotated:
- Atlas updates the secret on the gotrue container and re-runs 10-app-onboard.sh for RBO (which writes the new value into the env-scoped CI variable). Both must happen in the same maintenance window — gotrue and RBO must agree on the secret at any instant.
- Atlas redeploys RBO (docker compose up -d) so the container picks up the new env. In-flight tokens signed with the old secret start failing validation immediately on the new RBO instance (401 + cookie clear → user re-logs in).
- There is no overlap window for HS256: RBO validates against exactly one shared secret, so old and new secrets cannot both be accepted at the same time. Rotation is therefore disruptive (every active session re-authenticates). Acceptable at our scale; the access-token TTL of 1h is the natural ceiling.
Failure of GOTRUE_JWT_SECRET at validation time: if the env var is missing, pydantic-settings raises at boot and the container exits non-zero (per ADR-0009 §7); Atlas's monitoring surfaces it. There is no JWKS endpoint to fall back on.
RS256 mode (aspirational, future).
When gotrue eventually exposes JWKS and jwt_verification_mode is flipped to "rs256":
- New kid appears at the JWKS endpoint alongside the old one (gotrue maintains overlap).
- RBO's in-process JWKS cache misses on the next token signed with the new kid → fetches fresh JWKS → validates.
- Old kid is purged from gotrue after the overlap window. Tokens still in flight signed with the old key fail validation gracefully (401 + cookie clear → user re-logs in).
Overlap window: owned by gotrue config, recommended ≥ 1h (≥ access-token TTL) so no in-flight session breaks. Documented as an operator commitment in keystone's auth runbook — RBO simply tolerates rotation by re-fetching JWKS on cache miss.
Failure of JWKS endpoint at validation time: RBO serves the cached JWKS; if no cached JWKS exists (cold start during a JWKS outage), every request 503s with structured log. This is correct behavior — RBO without JWKS cannot validate tokens, so it cannot serve. The cold-start-during-JWKS-outage window is small (boot is fast; JWKS fetch is one round-trip).
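The miss-driven cache behaviour described for RS256 mode could look roughly like the sketch below. It is not the contents of app/auth/jwks.py. The class and exception names are hypothetical, the fetch callable stands in for the HTTP call to the JWKS endpoint, and the rate limit is simplified to one refetch per minute overall, which is one possible reading of "one call per minute per kid miss".

```python
import time
from typing import Callable, Dict, Optional


class JwksUnavailable(Exception):
    """Maps to the 503 path: nothing cached and the endpoint cannot be reached."""


class JwksCache:
    def __init__(self, fetch: Callable[[], Dict[str, dict]],
                 min_refetch_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch                    # returns {kid: jwk}
        self._interval = min_refetch_interval
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._last_fetch: Optional[float] = None

    def _refresh(self) -> None:
        self._last_fetch = self._clock()
        try:
            self._keys = self._fetch()
        except Exception:
            # Endpoint down: keep serving whatever is already cached.
            pass

    def get(self, kid: str) -> dict:
        if kid not in self._keys:
            due = (self._last_fetch is None
                   or self._clock() - self._last_fetch >= self._interval)
            if due:
                self._refresh()
        if kid in self._keys:
            return self._keys[kid]
        if not self._keys:
            raise JwksUnavailable("no cached JWKS")
        raise KeyError(kid)  # unknown kid with a warm cache: treat as a 401
```

A call to get at boot warms the cache; after that, only a token carrying an unseen kid triggers a network call, and a JWKS outage degrades to serving the cached keys.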
V3 client-portal slot
The same JWT contract serves V3 clients with zero schema or middleware changes:
- gotrue mints a magic-link for a Person; the resulting JWT has the Person's gotrue_user_id as sub.
- The same middleware decode+validate path runs.
- Resolution: no Stringer matches (Persons don't get Stringer rows); Person matches via Person.gotrue_user_id = sub; current_person_id ContextVar is bound (instead of current_stringer_id).
- The chokepoint applies the Person-scoped predicate per auth-and-tenancy.md.
The Stefan-as-both case (Stefan is a Stringer AND has a Person row from another stringer's perspective): in V2, never auto-linked. In V3, explicit role-switcher in the UI (flagged as Open question below) chooses which ContextVar is bound for the session — implementation detail, not a wire-contract change.
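The Stringer-first resolution order, including the Stefan-as-both case and the no-row rejection, can be sketched as below. The two mappings stand in for database lookups on Stringer.gotrue_user_id and Person.gotrue_user_id; the Principal type, the NoRboRow exception, and the integer ids are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


class NoRboRow(Exception):
    """Maps to 403 + structured log: verified gotrue user without RBO row."""


@dataclass(frozen=True)
class Principal:
    stringer_id: Optional[int] = None
    person_id: Optional[int] = None


def resolve_principal(sub: str,
                      stringers_by_gotrue_id: Mapping[str, int],
                      persons_by_gotrue_id: Mapping[str, int]) -> Principal:
    # Stringer first: a user with both rows is bound as a Stringer in V2.
    stringer_id = stringers_by_gotrue_id.get(sub)
    if stringer_id is not None:
        return Principal(stringer_id=stringer_id)
    # Fall through to Person only when no Stringer matches (V3 slot).
    person_id = persons_by_gotrue_id.get(sub)
    if person_id is not None:
        return Principal(person_id=person_id)
    raise NoRboRow(sub)
```

The middleware would then bind current_stringer_id or current_person_id from whichever field of the result is set.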
What this ADR does NOT cover
- Password complexity, rate limiting, MFA, account lockout — gotrue's domain. RBO inherits whatever gotrue is configured for.
- Magic-link minting and validation — gotrue's domain. RBO consumes the resulting JWT.
- Email of magic-links / password resets — gotrue calls Resend SMTP per integrations.md.
- Cross-app SSO via Domain=.wagen.io cookie — deferred; needs its own ADR in keystone scope.
- Token introspection or revocation list — out of scope; the 1h TTL is the bound on a stolen access token.
Required tests (this ADR mandates them)
Amended 2026-05-04: tests 1–9 below are gotrue-version-agnostic and run in HS256 mode today. Test 10 (JWKS rotation) is gated behind jwt_verification_mode == "rs256" and stays in the tree as a regression for the future flip; it does not run in CI by default. A new HS256-specific test 11 is added.
1. JWT signature validation. Forge a token with a wrong key → middleware rejects with 401 and clears cookie. (HS256: forge by signing with the wrong secret. RS256: forge by signing with the wrong RSA key.)
2. iss/aud mismatch. Token signed by gotrue but with wrong iss or wrong aud → reject. Specifically, a token minted for another keystone app (different aud) is rejected by RBO. Specifically also: a token with the gotrue v2.189.0 default aud=authenticated is rejected because it does not match <app>-<env>.
3. Expired access token + valid refresh. Pre-emptive refresh path replaces both cookies; original request succeeds without surfacing a 401.
4. Expired access + expired refresh. 401 + redirect to /login.
5. sub resolves to no RBO row. 403 + structured log entry containing the gotrue user UUID. Cookie NOT cleared (the JWT is valid; the misconfiguration is server-side).
6. Stringer-first resolution. A gotrue user UUID matching both a Stringer and a Person row resolves to Stringer. (V2 fixture; V3 will revisit.)
7. Logout idempotency. Two consecutive POST /auth/logout calls both return 302 to /login; the second is a no-op.
8. Logout revokes refresh. After logout, presenting the old refresh token at /auth/refresh returns 401 (gotrue refused).
9. CSRF. State-changing POST without X-CSRF-Token (HTMX path) or hidden form input (non-HTMX) returns 403.
10. JWKS rotation (RS256-mode regression — gated). Mock a JWKS endpoint serving two kids; validate tokens signed with each; rotate (drop old kid from JWKS); cached old kid continues to validate until next cache refresh; new tokens with new kid trigger refresh and validate. Skipped when jwt_verification_mode == "hs256"; runs only when the future RS256 flip lands.
11. HS256 secret-rotation discipline (HS256-mode regression). A token minted with the previous GOTRUE_JWT_SECRET is rejected with 401 + cookie clear after RBO restarts with a new secret. (No HS256 overlap window — verifies the rotation behaviour described in §"Key rotation runbook".)
Consequences
Good
- One normative wire contract. Pax has a single page to implement against; Iris's auth-flow requirements have a defaulted answer for every cookie / claim / TTL knob; the chokepoint stays decoupled from JWT internals.
- V3 client portal lights up without schema or middleware changes. The contract is shape-symmetric for Stringer and Person resolution.
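- Key rotation is operationally cheap in RS256 mode. JWKS-on-cache-miss + gotrue's overlap window means no coordinated deploy across RBO and gotrue. This does not hold in the current HS256 mode, where rotation needs a coordinated secret change and forces every active session to re-authenticate (see Key rotation runbook).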
- Stolen access cookie is bounded at 1h. Logout revokes refresh; access-token TTL caps the damage. This is the textbook short-TTL access + revocable-refresh model.
- Cross-app token confusion is structurally prevented by aud validation. A future CSD or ALJ JWT cannot be replayed at RBO.
- JSON-stdout structured logs at every rejection path mean an auth incident is auditable end-to-end.
Costs we accept
- Two cookies, two paths. Slightly more middleware code than a single-cookie design. Worth it: refresh cookie is path-scoped to /auth/refresh, never sent to handler routes.
- Pre-emptive refresh adds latency on the request that triggers it (one extra round-trip to gotrue). The < 5 min threshold means most users never hit it. The alternative (let it 401, retry) is worse UX.
- JWKS cache cold-start during a JWKS outage = 503 (RS256 mode only; not applicable in the current HS256 mode). Acceptable: RBO without JWKS cannot validate, so it cannot serve. Boot order: RBO waits for JWKS at startup before declaring itself ready.
- CSRF synchronizer token is one more thing to forget. Mitigated by enforcing it at the middleware level for all state-changing routes — not per-handler.
- Stefan-as-both edge case is documented but not auto-resolved in V2. Two rows, one gotrue user — V2 resolves to Stringer. The V3 role-switcher is a UX problem, not a contract problem.
- No revocation list. Once issued, an access token is valid for its TTL — there is no "kill this specific token now." Acceptable at our threat model and TTL.
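Enforcing the CSRF synchronizer token once, at the middleware level, could be as small as the check below. This is a sketch under assumptions: the function names and the csrf_token form-field name are hypothetical, header names are taken to be already normalized, and the per-session token is assumed to be available to the middleware.

```python
import hmac
import secrets
from typing import Mapping

SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def new_csrf_token() -> str:
    """Generate the per-session token that is embedded in the page."""
    return secrets.token_urlsafe(32)


def csrf_ok(method: str, session_token: str,
            headers: Mapping[str, str], form: Mapping[str, str]) -> bool:
    """Return False when a state-changing request must be rejected with 403."""
    if method.upper() in SAFE_METHODS:
        return True
    # HTMX requests carry the token in a header; plain forms use a hidden input.
    presented = headers.get("X-CSRF-Token") or form.get("csrf_token")
    if not presented:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode(), session_token.encode())
```

Because the check keys on the HTTP method and not on the route, a new handler is covered without any per-handler code.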
Open questions (Stefan-confirm)
- JWKS cache TTL — currently "no time-based TTL, refresh on kid miss only" (Option K-1 default). Alternative: 5-minute background refresh. The miss-driven model is simpler and avoids unnecessary network calls, but a poisoned-cache scenario (stale JWKS pinned in memory) is theoretically possible. Default proposed: miss-driven only. Stefan to confirm or flip. (Aspirational — RS256-mode-only; immaterial in current HS256 mode.)
- V3 role-switcher for the Stefan-as-both case — explicit UI affordance ("acting as Stringer / acting as Client") in V3, vs. URL-segment-driven (/admin vs. /me). Architectural slot named here; UX decision is V3.
- Pre-emptive refresh threshold — defaulted to 5 minutes. Could be 1 minute (more 401-retry exposure) or 10 minutes (more refresh chatter). 5 min is a balanced default.
- Refresh cookie Path=/auth/refresh scope — narrow scope is the default. Alternative is Path=/ (cookie sent on every request) for simpler debugging. Narrow default kept.
All four default to the values above if Stefan does not flip them. Each is a single-line config change.
Change log
| Date | Change | Reason |
|---|---|---|
| 2026-05-02 | Initial accepted version (RS256+JWKS verify path; HS256 rejected as K-3). | — |
| 2026-05-04 | Amendment — verify path flipped to HS256 + shared GOTRUE_JWT_SECRET as the current production reality; RS256+JWKS demoted to aspirational, gated behind Settings.jwt_verification_mode (default "hs256"). Updated: top-of-file amendment block, Stefan's pre-baked defaults (added "Verify path" paragraph), Option K-1 / K-3 (re-evaluated rejection rationale), Wire-level contract decision table (Signing algorithm + Key resolution + ISSUER format + AUDIENCE rows), validated-request sequence diagram (annotated for both modes), Key rotation runbook (split into HS256-mode + RS256-mode subsections), Required tests (gated test 10 behind RS256, added test 11 for HS256 secret-rotation discipline), Cross-references (added keystone#102 / !127 + #106 + project memory + ADR-0009). Cookie attributes, claim shape, refresh / logout flow, chokepoint integration, V3 client-portal slot are unchanged — they are gotrue-version-agnostic. Trigger to revisit: gotrue upgrade in keystone landing a working /auth/.well-known/jwks.json endpoint → flip jwt_verification_mode to "rs256" and retire HS256. | Atlas's field-verification of deployed gotrue v2.189.0 (HS256-only, no JWKS endpoint) in keystone#102 / !127. RS256+JWKS would 401 every authenticated request because the JWKS fetch 404s. Pax-B's parallel code rework lives in #106. Per project memory project_gotrue_hs256.md. Tracked in #108. |
Cross-references
- ADR-0001 — names the boundary; this ADR fills in the wire-level contract.
- ADR-0004 — specifies the V3 client-portal Person-bound session slot; this ADR specifies the JWT path that lights it up.
- ADR-0009 — env-var contract (GOTRUE_JWT_SECRET, GOTRUE_ISSUER, GOTRUE_AUDIENCE) co-amended 2026-05-04 with the same trigger.
- docs/architecture/auth-and-tenancy.md — chokepoint mechanics; binds current_stringer_id / current_person_id set by this ADR's middleware.
- keystone ADR-0005 — Resend SMTP that gotrue uses for magic-links.
- keystone#102 / !127 — Atlas's field-verification + per-tenant gotrue env provisioning (the trigger for this amendment).
- Iris's auth-flow requirements (docs/requirements/use-cases.md and the requirements log).
- #106 — Pax-B's parallel HS256 code rework (companion MR to this amendment).
- #108 — this amendment.