ADR-0009: Observability and secrets boundary with keystone/Atlas
- Status: Proposed (amended 2026-05-04)
- Date: 2026-05-04
- Decider(s): Theo (SA), with platform-side defaults inherited from Atlas (keystone) and Stefan-confirmed contract points
- Closes: #99
- Amended by: #108 (HS256-reality amendment, co-amended with ADR-0006)
Amended 2026-05-04 — the env-contract table (§8) and Keystone-side gaps section are updated to reflect what Atlas actually shipped in keystone !127 (closing keystone#102):
GOTRUE_ISSUER format diverged from the originally-proposed https://auth.wagen.io/<app> to https://<app>[-<env>].wagen.io/auth (the existing per-tenant API_EXTERNAL_URL); GOTRUE_AUDIENCE=<app>-<env> matches the original proposal; GOTRUE_JWT_SECRET is kept (load-bearing, deprecated-in-place, because gotrue v2.189.0 is HS256-only and cannot do RS256), NOT dropped; GOTRUE_ADMIN_API_KEY is provisioned but unused until M21. All three "Keystone-side gaps" flagged in the original §"Keystone-side gaps" are now closed. Logging format, request_id propagation, metrics/tracing/error-reporting decisions, and the local-dev template are unchanged. Companion: ADR-0006 co-amended with the HS256 verify path. See Change log below for affected sections. Per project memory project_gotrue_hs256.md.
Context
Several earlier docs name observability + secrets concerns by reference without locking the wire-level contract:
- system-overview.md says "RBO emits JSON-to-stdout per Atlas's convention" but never specifies the field set.
- ADR-0004 share_audit.request_id is a UUID for "correlation with structured logs", but the loop (how does the request_id reach both the log line and the audit row?) is not closed.
- ADR-0006 binds current_stringer_id / current_person_id ContextVars; how those reach log records is unspecified.
- integrations.md lists SMTP_HOST/USER/PASS/FROM as "injected by Atlas's CI", but the boundary contract (which secret is whose) is implicit, not written.
- The local-dev story is undocumented: .env is gitignored but no .env.example template exists, and .gitignore would in fact silently swallow one.
Phase 1 is in flight (#97, Pax) — the schema + tenancy chokepoint land before any of the cross-cutting middleware. Pax's settings module (app/core/settings.py) and Kit's CI variable surface (deploy-verification §"CI variable surface") are the surfaces this ADR commits.
This ADR locks the contract so Pax can write the request_id middleware deterministically, Kit can verify CI variable plumbing matches, and future readers do not re-litigate "should we add Prometheus / Sentry / OpenTelemetry."
Options
Logging format
- (L-1) JSON-to-stdout (chosen). Atlas's documented convention. Container stdout is captured by the platform; no on-disk rotation; one structured event per line; jq-greppable. Costs one logger filter to inject context.
- (L-2) Plain text to stdout. What app/core/logging.py does today as a placeholder. Readable in docker logs but not machine-queryable; loses field structure once volume grows. Acceptable as a temporary scaffold; not the destination.
- (L-3) Structured logs to a log shipper (Loki, OpenSearch, etc.). Requires platform-side infrastructure RBO does not get to choose. Out of scope; if Atlas adds one, RBO consumes it transparently because stdout is the seam.
request_id source
- (R-1) Always-generate at RBO middleware (chosen, with optional inbound passthrough). Middleware mints a UUIDv4 unless a configured inbound header is present and well-formed. Bound to a ContextVar for the request lifetime. Works with or without an Atlas-side header injection.
- (R-2) Trust Caddy / Atlas to inject. Couples RBO to a header Atlas hasn't committed to. If the header is dropped or renamed, every request gets a NULL request_id. Rejected.
- (R-3) Use the inbound header verbatim, generate only on miss, but no validation. A malicious or misconfigured client could inject a 1MB string. The fix is to validate the format (the permissive pattern in §2) and cap the length at 64 bytes, falling back to a fresh UUID on malformed input. (This is the chosen behavior of (R-1), spelled out.)
Metrics surface (V2)
- (M-1) None: observability via structured logs only (chosen). At V2 scale (~50 jobs/year/stringer, single-digit stringer count) Prometheus / StatsD / etc. earn nothing. duration_ms in the structured log gives latency; counts come from jq | wc -l. Re-evaluate at the multi-stringer threshold (Atlas-defined).
- (M-2) Prometheus /metrics endpoint (prometheus-fastapi-instrumentator). ~50 LOC. Free at our scale. But Atlas does not run a Prometheus today; we'd ship metrics no one scrapes. Defer.
- (M-3) StatsD push. Requires a StatsD sidecar Atlas does not provide. Rejected for V2.
Tracing surface (V2)
- (T-1) None (chosen). OpenTelemetry adds a meaningful dependency (opentelemetry-api, opentelemetry-instrumentation-*, an exporter) and a collector Atlas does not run. At single-process FastAPI + Postgres scale, traces would tell us nothing structured logs don't.
- (T-2) OpenTelemetry stub (no collector). "Future-proof" instrumentation that exports nowhere. Pure cost, zero benefit. Rejected.
Error reporting surface (V2)
- (E-1) Structured logs only (chosen). A 5xx writes one structured log line with error_class, error_msg, traceback (truncated). Atlas's log aggregation is the surface; Stefan-as-operator greps. Free, no third-party dependency, no PII leakage to a third party.
- (E-2) Sentry. Useful at multi-stringer scale where noise is real and triage matters. At our scale, the noise floor is ~0; Sentry's free tier is more setup than the value returned. V3 candidate if multi-stringer churn produces actual triage volume.
- (E-3) Resend-emailed error digests to Stefan. Tempting one-liner, but at the first DB outage that 5xx-spams every request, the Resend free-tier 100/day cap is gone and so is every business-relevant email for the rest of the day. Rejected.
Secrets format / location
- (F-1) Plain env vars, provisioned by keystone/scripts/10-app-onboard.sh into GitLab CI variables (env-scoped, masked + protected) (chosen). Matches keystone's existing convention as recorded in the deploy-verification doc. Single source of truth (keystone repo); RBO's pyproject.toml-side pydantic-settings reads at boot.
- (F-2) Mozilla sops-encrypted secrets file in the RBO repo. Adds a key-management story (which key, where) RBO does not need at our scale. Not Atlas's convention; rejected.
- (F-3) HashiCorp Vault. Way out of scope for the platform.
Local dev secrets
- (D-1) .env.example template, checked in; .env per developer, gitignored, populated from the template (chosen). Standard. Cheap. The current .gitignore swallows .env.example via .env.*; fix with a !.env.example exception.
- (D-2) Magic ad-hoc shell rc files. Undocumented, non-portable, breaks for the next contributor. Rejected.
Decision
1. Logging format and field set (normative)
Logs are emitted as JSON to stdout, one event per line. Atlas's container runtime captures stdout; RBO writes nothing to disk. Replace the plain-text scaffold in app/core/logging.py with a JSON formatter (recommended: python-json-logger, a small formatter that plugs into stdlib logging; its dependency footprint is noted under "Costs we accept").
The mandatory field set for every log record from RBO request handlers is:
| Field | Type | Source | Notes |
|---|---|---|---|
| timestamp | ISO-8601 UTC string (2026-05-04T12:34:56.789Z) | logger | Microsecond precision recommended; second precision acceptable. Always UTC. |
| level | string | logger | DEBUG, INFO, WARNING, ERROR, CRITICAL. |
| logger | string | logger | Python logger name (app.api.routes_orders etc.). |
| message | string | handler | Free text; does not carry structured fields. Use named log args for those. |
| request_id | UUID string | ContextVar | UUIDv4. See §2. Absent (or null) for non-request logs (boot, shutdown, scheduled tasks). |
| app_env | string | settings | dev \| test \| prod. Tagged once on startup; included on every line so a multi-env log aggregator can filter. |
| path | string | middleware | URL path of the request (e.g. /orders/123). Request logs only. |
| method | string | middleware | HTTP method. Request logs only. |
| status | integer | middleware | HTTP response status. Request-completion logs only. |
| duration_ms | integer | middleware | Wall-clock from request start to response start. Request-completion logs only. |
The conditional field set (present when bound):
| Field | Type | When present | Notes |
|---|---|---|---|
| stringer_id | integer | After auth middleware binds current_stringer_id ContextVar | NULL for unauthenticated, Person-bound, or boot logs. |
| person_id | integer | After auth middleware binds current_person_id ContextVar (V3 client portal) | NULL for V2 stringer paths. |
The error field set (present when level >= ERROR):
| Field | Type | Notes |
|---|---|---|
| error_class | string | Exception class name (ValueError, IntegrityError, …). |
| error_msg | string | str(exc). PII-bearing strings (e.g. raw SQL params) are NOT logged here; the handler must scrub before raising or the middleware must use a generic message. |
| traceback | string | traceback.format_exc(), truncated to 8 KB. |
Forbidden fields: raw SQL query text with bound parameters; raw JWT contents; raw cookie values; raw email contents; raw passwords / API keys (this last one applies even at DEBUG — never log a value sourced from Settings.smtp_pass etc.).
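As an illustration of the error field set, here is a minimal sketch of a helper that builds the three fields from an exception. The function name `error_fields`, the `generic_msg` parameter, and the choice to keep the tail of an over-long traceback are assumptions of this sketch; the ADR only fixes the field names and the 8 KB cap.

```python
import traceback
from typing import Dict, Optional

MAX_TRACEBACK_BYTES = 8 * 1024  # the 8 KB cap from the error field table


def error_fields(exc: BaseException, generic_msg: Optional[str] = None) -> Dict[str, str]:
    """Build error_class / error_msg / traceback for an ERROR-level record.

    generic_msg (hypothetical parameter) lets the caller substitute a safe
    message when str(exc) might carry PII.
    """
    tb_text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    tb_bytes = tb_text.encode("utf-8")
    if len(tb_bytes) > MAX_TRACEBACK_BYTES:
        # Assumption: keep the tail, where the innermost frames and the
        # exception line live. errors="ignore" drops a multi-byte character
        # that the cut may have split.
        tb_text = tb_bytes[-MAX_TRACEBACK_BYTES:].decode("utf-8", errors="ignore")
    return {
        "error_class": type(exc).__name__,
        "error_msg": generic_msg if generic_msg is not None else str(exc),
        "traceback": tb_text,
    }
```

Note that the formatted traceback ends with the exception message, so substituting `generic_msg` does not by itself scrub the `traceback` field; scrubbing before raising remains the handler's job.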
Implementation shape (informative, for Pax):
- app/core/logging.py configures the root logger with a JSON formatter and a ContextFilter that injects request_id, stringer_id, person_id, app_env from ContextVars onto every record.
- Add fields to records via logging.LoggerAdapter or via the filter, not by passing extra={} at every call site (too easy to forget).
- One middleware (let's call it RequestContextMiddleware) is the only place that reads the inbound request_id header, mints UUIDs, sets the ContextVar, and emits the request-completion log line.
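The filter-plus-formatter shape described above might look roughly like the following. This is a stdlib-only sketch: `JsonLineFormatter` is a hypothetical stand-in for python-json-logger, and the ContextVar names and `configure_logging` function are assumptions, since the ADR does not say where they live.

```python
import json
import logging
import sys
from contextvars import ContextVar
from datetime import datetime, timezone
from typing import Optional

# Hypothetical module-level ContextVars; the ADR names the values, not their home.
request_id_var: ContextVar[Optional[str]] = ContextVar("request_id", default=None)
stringer_id_var: ContextVar[Optional[int]] = ContextVar("current_stringer_id", default=None)
person_id_var: ContextVar[Optional[int]] = ContextVar("current_person_id", default=None)


class ContextFilter(logging.Filter):
    """Copy request-scoped context onto every LogRecord."""

    def __init__(self, app_env: str) -> None:
        super().__init__()
        self.app_env = app_env

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        record.stringer_id = stringer_id_var.get()
        record.person_id = person_id_var.get()
        record.app_env = self.app_env
        return True  # annotate only; never drop a record


class JsonLineFormatter(logging.Formatter):
    """Stdlib stand-in for python-json-logger: one JSON object per line."""

    CONTEXT_FIELDS = ("request_id", "stringer_id", "person_id", "app_env")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            # Emit null instead of omitting the key, so the shape stays constant
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def configure_logging(app_env: str, level: str = "INFO") -> None:
    handler = logging.StreamHandler(sys.stdout)
    # Attach the filter to the handler: filters on the root *logger* are not
    # applied to records that propagate up from child loggers.
    handler.addFilter(ContextFilter(app_env))
    handler.setFormatter(JsonLineFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

This stand-in does not emit per-call `extra` fields or the request/error fields; a real formatter would need to. It is meant only to show where the ContextVars meet the log record.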
2. request_id propagation contract
- Generation: RequestContextMiddleware runs first in the FastAPI middleware stack. It reads the inbound header X-Request-ID (case-insensitive). If present and the value matches ^[A-Za-z0-9._-]{1,64}$, it is used verbatim; otherwise a fresh uuid.uuid4() is generated.
- Why a permissive validator instead of "must be UUID": Atlas's Caddy ingress may eventually inject any opaque token (Cloudflare-style hex, ULID, etc.). Validating "non-empty, no control chars, length-bounded" gives Atlas latitude without coupling RBO to a specific format. UUIDv4 is the default for RBO-minted IDs; inbound IDs are honored as-given.
- Binding: the ID is set on request_context_var: ContextVar[str] and on Response.headers["X-Request-ID"] (so a client / Caddy can correlate the response).
- Inbound header name: X-Request-ID (chosen). De-facto standard; what Caddy ships in its request_id placeholder under the same name. Atlas adopts; if Atlas's Caddy snippet does not yet inject this header, RBO's middleware mints one and the chain still works.
- Outbound header name: X-Request-ID on the response, same value as bound on the ContextVar.
- Lifetime: ContextVar is set at the start of the middleware and cleared by Python's normal context-var unwinding when the middleware returns (no manual reset() needed because FastAPI / Starlette runs the middleware chain in a fresh asyncio task).
- Scheduled-task / boot logs: request_id is null (the JSON formatter emits null rather than omitting the key; consistent shape eases log queries).
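The generation and binding rules above can be sketched as a pure-ASGI middleware with no framework imports. The names `resolve_request_id` and `request_id_var` are hypothetical, and the request-completion log line is left out for brevity. The sketch also calls `reset()` explicitly, which the contract says is not required; it is harmless and keeps the sketch independent of how the server schedules tasks.

```python
import re
import uuid
from contextvars import ContextVar
from typing import Optional

# Same character class and length bound as the contract's pattern; fullmatch()
# is used because "$" in Python also matches just before a trailing newline.
REQUEST_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{1,64}")

request_id_var: ContextVar[Optional[str]] = ContextVar("request_id", default=None)


def resolve_request_id(inbound: Optional[str]) -> str:
    """Return the inbound ID if well-formed, otherwise mint a fresh UUIDv4."""
    if inbound is not None and REQUEST_ID_PATTERN.fullmatch(inbound):
        return inbound
    return str(uuid.uuid4())


class RequestContextMiddleware:
    """Bind the request ID to a ContextVar and echo it on the response."""

    def __init__(self, app, header_name: str = "X-Request-ID") -> None:
        self.app = app
        self.header = header_name.lower().encode("latin-1")

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        # ASGI delivers header names lower-cased, so this lookup is case-insensitive
        raw = dict(scope["headers"]).get(self.header)
        request_id = resolve_request_id(raw.decode("latin-1") if raw else None)
        token = request_id_var.set(request_id)

        async def send_with_id(message):
            if message["type"] == "http.response.start":
                headers = [
                    (k, v) for k, v in message.get("headers", []) if k.lower() != self.header
                ]
                headers.append((self.header, request_id.encode("latin-1")))
                message = {**message, "headers": headers}
            await send(message)

        try:
            await self.app(scope, receive, send_with_id)
        finally:
            request_id_var.reset(token)
```

With Starlette or FastAPI, a class of this shape would typically be registered as the outermost middleware so that every later layer sees the bound ID.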
3. Closing the share_audit.request_id loop
ADR-0004 declared share_audit.request_id UUID, "for correlation with structured logs." This ADR specifies the wiring:
- The same request_context_var ContextVar is read by the share_audit writer (whether the shared_read row inserted at chokepoint admit-time, or the grant_created / grant_revoked row inserted by the share-management handler).
- One helper (get_request_id() -> str | None) lives in app/core/logging.py (or a sibling app/core/request_context.py); both the logger filter and the audit writer call it.
- A regression test asserts that for a request with X-Request-ID: <known-uuid>, both (a) the structured log line and (b) the share_audit row carry the same request_id.
This is the closure ADR-0004 deferred. With it, an audit investigation can move from a share_audit row to its causal log lines in one query.
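The single-helper wiring can be illustrated in a few lines. Only `get_request_id` and `request_context_var` come from the text above; `build_share_audit_row`, `build_log_payload`, and the `event` / `grant_id` keys are hypothetical placeholders, since the real audit columns are defined by ADR-0004.

```python
from contextvars import ContextVar
from typing import Any, Dict, Optional

request_context_var: ContextVar[Optional[str]] = ContextVar("request_context_var", default=None)


def get_request_id() -> Optional[str]:
    """Single read path shared by the logger filter and the audit writer."""
    return request_context_var.get()


def build_share_audit_row(event: str, grant_id: int) -> Dict[str, Any]:
    # Hypothetical row shape; only request_id is taken from this ADR
    return {"event": event, "grant_id": grant_id, "request_id": get_request_id()}


def build_log_payload(message: str) -> Dict[str, Any]:
    # Hypothetical stand-in for what the logger filter contributes
    return {"message": message, "request_id": get_request_id()}
```

Because both builders read through the same helper within the same request context, the audit row and the log line cannot disagree on the ID, which is the property the regression test pins down.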
4. Logger context binding (no manual passing)
current_stringer_id and current_person_id are bound by the auth middleware (per ADR-0006 and auth-and-tenancy.md).
- A logging.Filter reads them from the same ContextVars and copies them onto every LogRecord. The JSON formatter then emits them.
- Handlers and modules call logger.info("string parsed", extra={"order_id": 42}); they do NOT pass stringer_id=... manually. The filter is the single seam.
- A regression test asserts that a log line emitted from inside an authenticated request contains stringer_id, and a log line emitted from boot does not (the field is null).
5. Metrics
No metrics surface in V2. No Prometheus, no StatsD, no /metrics endpoint.
Reasoning: the structured-log duration_ms field is the latency surface; counts come from jq | wc -l; Atlas runs no scraper today.
Re-evaluation trigger: if Atlas adds a Prometheus to keystone, OR if RBO grows past three concurrent stringers AND Stefan asks for a dashboard. Until then, this is documented "no" so future readers do not reintroduce it.
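To make the "logs are the latency surface" claim concrete, here is a small sketch that derives counts and latency percentiles from captured log lines, roughly what a jq pipeline would do. The function name `summarize_request_logs` and the nearest-rank percentile method are choices made for this sketch, not anything the ADR prescribes.

```python
import json
import math
from typing import Any, Dict, Iterable, List


def summarize_request_logs(lines: Iterable[str]) -> Dict[str, Any]:
    """Count request-completion lines and report latency percentiles."""
    durations: List[int] = []
    by_status: Dict[Any, int] = {}
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output on stdout
        if not isinstance(event, dict) or "duration_ms" not in event:
            continue  # only request-completion lines carry duration_ms
        durations.append(event["duration_ms"])
        status = event.get("status")
        by_status[status] = by_status.get(status, 0) + 1
    durations.sort()

    def percentile(p: float):
        if not durations:
            return None
        # Nearest-rank: smallest sample with at least p% of samples at or below it
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    return {
        "count": len(durations),
        "by_status": by_status,
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
    }
```

Feeding it the output of `docker logs` for a container (one JSON event per line) would give a rough latency picture without any metrics endpoint.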
6. Tracing
Out of scope for V2. No OpenTelemetry, no Jaeger, no Honeycomb.
Reasoning: single-process FastAPI + a single Postgres dependency does not produce meaningful spans. The structured request_id correlates everything we need at this scale.
Re-evaluation trigger: if RBO grows a worker process (RQ/arq) AND a request fans out across the worker boundary. Until then, this is documented "no."
7. Error reporting
Plain structured logs in V2. A 5xx emits a single level=ERROR log line with the error fields above. Atlas's log aggregation surface (whatever it is at the time — docker logs today, possibly a Loki/OpenSearch later) is the operator surface.
No Sentry in V2. Sentry is a V3 candidate if multi-stringer noise produces real triage volume (Stefan can no longer eyeball errors in docker logs).
Boot-failure handling: the app must not silently exit on a missing required env var. pydantic-settings raises ValidationError at get_settings() time, which Pax surfaces as a structured log line on the import path before uvicorn starts; the container exits non-zero and the keystone-generated compose restart: unless-stopped policy will surface the failure to Atlas's monitoring. This is the platform contract working as designed.
8. Secrets / config: boundary table (load-bearing)
This is the contract Kit verifies against the env-scoped CI variables provisioned by keystone/scripts/10-app-onboard.sh.
| Env var | Owner (provider) | Format | RBO consumer | Rotation | Notes |
|---|---|---|---|---|---|
| APP_ENV | Atlas (per-env CI variable) | dev \| test \| prod | Settings.app_env | n/a | Tags log records and is read by health endpoints. |
| LOG_LEVEL | RBO (defaulted in Settings) | DEBUG \| INFO \| … | Settings.log_level | Per-deploy | RBO can override per-env (e.g. DEBUG on test); not an Atlas-provisioned var by default. |
| PORT | Atlas (APP_PORT=3010 in config/apps/rbo.env per keystone onboarding) | integer | Settings.port | n/a | The Compose service binds the container to 127.0.0.1:$APP_PORT; Caddy fronts. |
| DATABASE_URL | Atlas (10-app-onboard) | postgresql+psycopg://<role>:<pwd>@keystone-pgbouncer:6432/rbo_<env> | Settings.database_url | Atlas-driven (rotate password → re-run onboarding script → restart RBO) | Routes through PgBouncer; transaction-pool; per project_pgbouncer_constraint.md RBO must avoid session-scoped Postgres features. |
| DATABASE_URL_DIRECT | Atlas (10-app-onboard, optional today) | postgresql+psycopg://...:5432/rbo_<env> | Settings.database_url_direct (Alembic only) | Atlas-driven | Bypasses PgBouncer for DDL. Falls back to DATABASE_URL in alembic/env.py if unset. |
| GOTRUE_URL | Atlas (10-app-onboard) | https://<app>[-<env>].wagen.io/auth (e.g. https://rbo-test.wagen.io/auth) | Settings.gotrue_url | Atlas-driven | RBO uses this as the gotrue base URL for refresh / logout / admin-API calls. ADR-0006 originally also fetched ${GOTRUE_URL}/.well-known/jwks.json; that path is non-functional on gotrue v2.189.0 (404) and is gated behind Settings.jwt_verification_mode == "rs256" (aspirational, off by default). |
| GOTRUE_ISSUER | Atlas (10-app-onboard, provisioned by !127) | URL: https://<app>[-<env>].wagen.io/auth | Settings.gotrue_issuer | Atlas-driven | Validated by ADR-0006 §"Wire-level contract". (Amended 2026-05-04: format diverged from the originally-proposed https://auth.wagen.io/<app> because that hostname does not exist on the platform; the chosen format reuses the existing per-tenant API_EXTERNAL_URL. See §"ISSUER format deviation" below.) |
| GOTRUE_AUDIENCE | Atlas (10-app-onboard, provisioned by !127) | string (<app>-<env>, e.g. rbo-prod, rbo-test) | Settings.gotrue_audience | Atlas-driven | Validated by ADR-0006 §"Wire-level contract". The per-tenant gotrue config flips GOTRUE_JWT_AUD away from gotrue's hardcoded authenticated default so cross-tenant token replay is structurally prevented. |
| GOTRUE_JWT_SECRET | Atlas (10-app-onboard) | secret string | Settings.gotrue_jwt_secret | Atlas-driven | Load-bearing for HS256 verify (current production path). (Amended 2026-05-04: kept as load-bearing: gotrue v2.189.0 is HS256-only and cannot do RS256, so this secret is the verify key, not legacy. Originally flagged for removal as a "mismatch with ADR-0006 (RS256+JWKS)"; that flag is rescinded.) Re-evaluated for deprecation when a future gotrue upgrade lands a working JWKS endpoint and Settings.jwt_verification_mode flips to "rs256". See §"JWT_SECRET retention rationale" below. |
| GOTRUE_ADMIN_API_KEY | Atlas (10-app-onboard, provisioned by !127) | secret string | (new, when admin-API needed: invite, password reset proxy) | Atlas-driven | Used for the stringer-onboarding invite (RBO calls gotrue's invite endpoint per auth-and-tenancy.md §"First stringer sign-in"). Provisioned; consumed when M21 stringer auth lands. |
| SMTP_HOST | Atlas (keystone email runbook / 10-app-onboard) | host string (smtp.resend.com) | Settings.smtp_host | Atlas-driven | Per integrations.md. |
| SMTP_PORT | Atlas | int (587) | Settings.smtp_port | n/a | STARTTLS submission port. |
| SMTP_USER | Atlas | string (resend) | Settings.smtp_user | Atlas-driven | Resend's literal username for SMTP AUTH. |
| SMTP_PASS | Atlas | secret string (re_<api-key>) | Settings.smtp_pass | Atlas-driven (rotate via Resend dashboard → onboarding script) | Per keystone ADR-0005. |
| SMTP_FROM | Atlas | email (noreply@wagen.io) | Settings.smtp_from | Per-env | RBO renders this as the From-header on outbound mail. |
| IMAGE | Atlas (10-app-onboard) | string (registry path) | 11-app-deploy.sh | n/a | Consumed by deploy script, not by RBO. |
| APP_HOSTNAME | Atlas | hostname | 11-app-deploy.sh | n/a | Consumed by Caddy/deploy, not by RBO. |
| DEPLOY_HOST | Atlas | hostname (kst1.wagen.io) | CI deploy job | n/a | Consumed by .keystone_deploy brick, not by RBO. |
| DEPLOY_SSH_KEY | Atlas (CI, file-typed, protected) | PEM | CI deploy job | Atlas-driven | Per ADR-0006 §App runtime contract; never reaches RBO container. |
Boundary in one sentence: Atlas provisions every secret RBO consumes; RBO consumes only via Settings; rotation is Atlas's responsibility; RBO restarts on env-var change picked up by 11-app-deploy.sh's docker compose up -d. RBO owns no platform-grade secret of its own.
ISSUER format deviation (added 2026-05-04)
The originally-proposed GOTRUE_ISSUER format was https://auth.wagen.io/<app> (a hypothetical platform-wide auth domain). Atlas's keystone !127 instead provisioned https://<app>[-<env>].wagen.io/auth, the existing per-tenant API_EXTERNAL_URL on each gotrue container. Rationale:
- auth.wagen.io does not exist on the keystone platform. Inventing it would have forced a wildcard cert + Caddy reverse-proxy slot for a single use case (issuer-claim string-match), buying nothing operationally.
- The chosen format is what gotrue v2.189.0 naturally emits in the iss claim when API_EXTERNAL_URL is set to that value. Aligning RBO's GOTRUE_ISSUER with gotrue's own iss output means zero translation in the verify path.
- Per-env separation is preserved because the env suffix lives in the host (rbo.wagen.io vs rbo-test.wagen.io), not in a path segment. A token minted for rbo-test carries a different iss than a rbo-prod token; cross-env replay is structurally rejected.
The path-suffix /auth matches the Caddy route on each app host that fronts gotrue. This is consistent with the gotrue-base-URL pattern already used by Settings.gotrue_url.
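The issuer and audience formats can be written down as two small functions, which may be handy when checking provisioned CI variables against the contract. The function names are hypothetical, and the rule that prod drops the host suffix is an assumption inferred from the rbo.wagen.io vs rbo-test.wagen.io example; the ADR does not state it for other environments.

```python
def tenant_host(app: str, env: str) -> str:
    """Host per the <app>[-<env>].wagen.io pattern.

    Assumption: prod is the environment that omits the suffix.
    """
    suffix = "" if env == "prod" else f"-{env}"
    return f"{app}{suffix}.wagen.io"


def expected_issuer(app: str, env: str) -> str:
    # Same value as the gotrue base URL: the per-tenant host plus the /auth route
    return f"https://{tenant_host(app, env)}/auth"


def expected_audience(app: str, env: str) -> str:
    # The audience always carries the env, including prod (e.g. rbo-prod)
    return f"{app}-{env}"
```

For example, `expected_issuer("rbo", "test")` yields `https://rbo-test.wagen.io/auth`, matching the example in the boundary table.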
JWT_SECRET retention rationale (added 2026-05-04)
The original §"Keystone-side gaps" entry proposed dropping GOTRUE_JWT_SECRET once ADR-0006's RS256+JWKS path landed. That proposal is rescinded because Atlas's field-verification revealed:
- Deployed gotrue is v2.189.0, HS256-only. The image does not expose /auth/.well-known/jwks.json (404). The RS256+JWKS verify path RBO would use is non-functional on this image.
- GOTRUE_JWT_SECRET is therefore the verify key, not a legacy artefact. Dropping it would 401 every authenticated request.
- Re-evaluation trigger: when keystone upgrades gotrue to a JWKS-capable version (no fixed date), Settings.jwt_verification_mode flips from "hs256" to "rs256" and the secret can be retired. Until then the secret is load-bearing and stays in the onboarding-script variable set, marked deprecated-in-place rather than removed.
This is the project memory project_gotrue_hs256.md constraint applied to the env-contract surface; it is co-amended with ADR-0006 §"Wire-level contract".
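To show why the secret is load-bearing, here is a stdlib-only sketch of what an HS256 verify step involves: the shared secret recomputes the signature, and the iss and aud claims are then matched against the provisioned values. This is not ADR-0006's implementation; the function names are hypothetical, expiry checks are omitted, and production code would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    """Test helper: mint a token the way an HS256 issuer would."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret.encode(), f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    return f"{header}.{payload}.{_b64url_encode(sig.digest())}"


def verify_hs256(token: str, secret: str, issuer: str, audience: str) -> dict:
    """Return the claims if alg, signature, iss and aud all check out; else raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except (ValueError, UnicodeDecodeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    if header.get("alg") != "HS256":
        raise ValueError("unexpected alg")  # refuses "none" and other algorithms
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        raise ValueError("bad signature")
    if claims.get("iss") != issuer:
        raise ValueError("issuer mismatch")
    aud = claims.get("aud")
    if audience not in (aud if isinstance(aud, list) else [aud]):
        raise ValueError("audience mismatch")
    return claims
```

Without the secret there is nothing to recompute the signature against, which is why removing GOTRUE_JWT_SECRET would reject every authenticated request on an HS256-only gotrue.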
9. Local dev template
Two small changes:
(a) Add .env.example to the repo root with the field set above (NO real values; commented placeholders). The file is the documented starting point for cp .env.example .env.
(b) Patch .gitignore: the existing .env.* glob silently ignores .env.example. Insert !.env.example immediately after .env.* so the template tracks.
The template covers: APP_ENV, LOG_LEVEL, PORT, DATABASE_URL, DATABASE_URL_DIRECT, GOTRUE_URL, GOTRUE_ISSUER, GOTRUE_AUDIENCE, GOTRUE_JWT_SECRET, SMTP_HOST/PORT/USER/PASS/FROM. (Amended 2026-05-04: GOTRUE_JWT_SECRET is included in the template — it is the HS256 verify key, load-bearing under gotrue v2.189.0; previously listed as "going away" pre-Atlas-field-verification.) It does NOT include GOTRUE_ADMIN_API_KEY (V2 phase 2; safe to add when M21 lands) or any DEPLOY_* keys (CI-only).
Required tests (this ADR mandates them)
1. JSON shape test. A request hitting any handler emits at least one log line whose JSON parses, contains the mandatory field set, and has request_id matching the request's X-Request-ID response header.
2. request_id mint test. A request with no inbound X-Request-ID gets a fresh UUIDv4 in both the response header and the structured log.
3. request_id passthrough test. A request with X-Request-ID: 11111111-1111-1111-1111-111111111111 carries the same value end-to-end.
4. request_id validation test. A request with X-Request-ID: <2KB string> is rejected/replaced: the header is treated as malformed and a fresh UUIDv4 is minted; the response carries the minted value.
5. stringer_id binding test. An authenticated request's log line includes stringer_id; an unauthenticated /healthz request's log line has stringer_id: null.
6. share_audit.request_id correlation test. A request that triggers a share-grant write produces (a) a share_audit row and (b) a structured log line that share the same request_id value.
7. PII non-leak test. A request that fails JWT validation (signature mismatch) emits an ERROR log; the log JSON does NOT contain the raw cookie value, the raw token, or any header longer than 64 bytes.
8. Boot-failure visibility test. Starting the app with no DATABASE_URL and then requesting /healthz/db produces a structured log line at ERROR; /healthz (liveness) still returns 200.
Tests 5 and 6 ride with Pax's Phase 1 chokepoint test suite; tests 1–4, 7, 8 ride with Pax's middleware MR.
Consequences
Good
- One normative log shape. Pax has a single field-set spec; future routes inherit it for free via the filter; log queries can rely on request_id and stringer_id being present where expected.
- Audit-to-log traversal closes. ADR-0004's share_audit.request_id is no longer a dangling UUID; it correlates to a structured log line in one jq query.
- No platform infrastructure RBO doesn't already have. No Prometheus, no Sentry, no OTel collector. Atlas's stdout-capture is the only dependency.
- Secrets boundary is one table. Kit's CI-variable verification has a single artifact to diff against.
- Local dev template fixes a silent gap. .env.example tracks; .gitignore no longer eats it.
- Future re-evaluation has explicit triggers. "Add Prometheus" / "Add Sentry" are documented "no, until X"; future readers do not silently regress the decision.
Costs we accept
- ~~Two new env vars not yet in onboarding (GOTRUE_ISSUER, GOTRUE_AUDIENCE) and one ADR-0006-induced cleanup (GOTRUE_JWT_SECRET → JWKS), flagged for Atlas in §"Keystone-side gaps" below.~~ (Amended 2026-05-04: closed by Atlas's keystone !127. GOTRUE_ISSUER and GOTRUE_AUDIENCE are now provisioned per tenant; GOTRUE_JWT_SECRET is retained as the load-bearing HS256 verify key, NOT cleaned up, because the proposed RS256+JWKS migration was non-viable on gotrue v2.189.0.)
- python-json-logger (or equivalent) is a new dependency. ~10 KB, zero transitive deps, BSD-licensed. Acceptable.
- request_id filter is the only mandatory cross-cutting middleware. One more thing to import correctly in app_factory.py. Mitigated by tests 1–4 catching omissions.
- No metrics means no SLO dashboard. At V2 scale, Stefan-as-operator is the dashboard. Documented; revisited at the multi-stringer trigger.
- Truncation rules (traceback 8 KB, header 64 B) are arbitrary cutoffs. They are Stefan-tunable in Settings if the defaults bite; not worth a knob today.
Migration impact (for Pax + Kit)
- Pax (#97 + middleware MR): Settings gains gotrue_issuer: str | None, gotrue_audience: str | None, optional log_format: Literal["json", "text"] (default "json" in prod/test, "text" in dev for readability). gotrue_jwt_secret is load-bearing (HS256 verify key); see ADR-0006. Add request_id_header: str = "X-Request-ID" so the inbound header name is overridable for tests / Atlas coordination. (Amended 2026-05-04: gotrue_jwt_secret is no longer "deprecated until JWKS lands"; it is the active verify key under gotrue v2.189.0. Pax-B's HS256 rework lives in #106.)
- Kit (CI): verify Atlas's keystone !127 provisions GOTRUE_ISSUER, GOTRUE_AUDIENCE, GOTRUE_JWT_SECRET, and GOTRUE_ADMIN_API_KEY for each app-env tuple. (Amended 2026-05-04: all four are now provisioned per !127; the original "remove or repurpose" guidance for GOTRUE_JWT_SECRET is rescinded.)
- app/core/logging.py: swap the plain-text formatter for JSON; add the ContextFilter. This is the formatter swap app/core/logging.py explicitly anticipates ("Replacing it with a JSON formatter is a one-line swap").
- No schema migration. share_audit.request_id already exists in ADR-0004's spec; Pax wires the writer to read the ContextVar.
Keystone-side gaps (flagged for Atlas; do NOT open keystone MRs from this RBO MR)
Per the project memory entry "Atlas — propagate contract fixes," gaps in keystone artifacts are flagged with proposed fixes but the keystone MR is Atlas's to open. Three gaps surfaced while writing this ADR.
All three closed 2026-05-04 by Atlas's keystone !127 (closing keystone#102). The original gap descriptions and resolution notes are kept below as historical record.
- ~~GOTRUE_ISSUER and GOTRUE_AUDIENCE are missing from keystone/scripts/10-app-onboard.sh's RBO variable set.~~ Per ADR-0006 they are required for JWT validation. Original proposed fix: extend the onboarding script to write both for each app-env tuple, defaulted to https://auth.wagen.io/<app> and <app>-<env> respectively. Closed by !127: provisioned per app-env tuple. The ISSUER value diverged from the proposal (https://<app>[-<env>].wagen.io/auth instead of https://auth.wagen.io/<app>); see §"ISSUER format deviation" for rationale. AUDIENCE matches the original proposal.
- ~~GOTRUE_JWT_SECRET predates ADR-0006's RS256-via-JWKS decision.~~ Original proposed fix: drop from the onboarding script's variable set (or repurpose as a fallback HS256 key if Atlas's gotrue is configured for HS256; confirm with Atlas first). Closed by !127 with the alternative path: kept as load-bearing. Atlas field-verified that gotrue v2.189.0 is HS256-only and exposes no JWKS endpoint, so the secret is the verify key, not a legacy artefact. See §"JWT_SECRET retention rationale" for the full reasoning. Re-evaluation trigger: future gotrue upgrade landing JWKS support.
- ~~GOTRUE_ADMIN_API_KEY is not provisioned.~~ Original proposed fix: when M21 (#74) ships, add this as an env-scoped, masked, protected CI variable via 10-app-onboard.sh. Closed by !127 ahead of M21: provisioned now; consumed when M21 lands.
These three flags landed as a comment on keystone#102 and were resolved by Atlas in the same MR (!127). #99 (this ADR's closing issue) carries the back-link.
Open questions (Stefan-confirm)
- Inbound request_id header name: defaulted to X-Request-ID. Caddy emits this name natively. If Atlas's Caddy snippet uses a different header (e.g. X-Correlation-ID), one-line override via Settings.request_id_header. Default kept.
- JSON-formatter library: python-json-logger is the recommended pick (smallest, BSD, mature). structlog is a richer alternative but introduces a different logger API surface; not worth it at our complexity. Default: python-json-logger.
- Dev-environment log format: defaulted to plain text in dev, JSON in test/prod. Dev-text reads better in docker logs; prod-JSON parses with jq. Stefan to confirm or flip to "JSON everywhere" (slightly more uniform, slightly less readable locally).
- Truncation defaults: traceback 8 KB, error_msg uncapped (relies on the handler not raising 1MB strings), header values 64 B. Stefan to flip if a real example bites.
- Logger name pattern: app.api.routes_orders (Python module path) vs. rbo.orders (a curated namespace). Default: Python module path (free, automatic). Stefan to flip if a curated namespace would help log-routing.
All five default to the values above; each is a single-line change.
Change log
| Date | Change | Reason |
|---|---|---|
| 2026-05-04 | Initial Proposed version (env-contract table flagging three keystone-side gaps; GOTRUE_ISSUER proposed as https://auth.wagen.io/<app>; GOTRUE_JWT_SECRET proposed for removal once ADR-0006's JWKS path landed). | n/a |
| 2026-05-04 | Amendment: env-contract table updated to reflect Atlas's actual !127 provisioning. Updated: top-of-file amendment block, env-contract table rows for GOTRUE_URL/GOTRUE_ISSUER/GOTRUE_AUDIENCE/GOTRUE_JWT_SECRET/GOTRUE_ADMIN_API_KEY, two new sub-decision blocks (§"ISSUER format deviation", §"JWT_SECRET retention rationale"), §9 local-dev template (GOTRUE_JWT_SECRET added), §"Costs we accept" (gap row reframed as closed), §"Migration impact" (Pax + Kit guidance updated), §"Keystone-side gaps" (all three closed by !127 with historical record). Logging format, request_id propagation, metrics/tracing/error-reporting, local-dev template structure are unchanged. | Atlas's keystone !127 (closing keystone#102) field-verified gotrue v2.189.0 as HS256-only and provisioned per-tenant GOTRUE_ISSUER, GOTRUE_AUDIENCE, GOTRUE_ADMIN_API_KEY while keeping GOTRUE_JWT_SECRET load-bearing. ADR-0006 co-amended. Per project memory project_gotrue_hs256.md. Tracked in #108. |
Cross-references
- ADR-0001: names the platform boundary; this ADR fills in observability + secrets.
- ADR-0004: share_audit.request_id correlation closed here.
- ADR-0006: the JWT contract whose current_stringer_id / current_person_id ContextVars feed this ADR's logger filter; co-amended 2026-05-04 (HS256 verify path).
- auth-and-tenancy.md: chokepoint binds the ContextVars this ADR reads.
- integrations.md: SMTP_* env vars referenced in §8.
- system-overview.md: "JSON-to-stdout per Atlas's convention" claim that this ADR turns into a normative spec.
- process/deploy-verification-2026-05-04.md §"CI variable surface": current variable set this ADR contracts against.
- project_pgbouncer_constraint.md: DB connectivity rule referenced in §8.
- keystone#102 / !127: Atlas's per-tenant gotrue env provisioning (closes the three §"Keystone-side gaps" entries).
- #106: Pax-B's HS256 code rework (companion MR to this amendment).
- #108: this amendment.
- Pax's middleware MR (to follow): implements the spec.
- Kit's CI-verification follow-up: confirms the secrets boundary table matches 10-app-onboard.sh.