ADR-0009: Observability and secrets boundary with keystone/Atlas

  • Status: Proposed (amended 2026-05-04)
  • Date: 2026-05-04
  • Decider(s): Theo (SA), with platform-side defaults inherited from Atlas (keystone) and Stefan-confirmed contract points
  • Closes: #99
  • Amended by: #108 (HS256-reality amendment, co-amended with ADR-0006)

Amended 2026-05-04 — the env-contract table (§8) and Keystone-side gaps section are updated to reflect what Atlas actually shipped in keystone !127 (closing keystone#102): GOTRUE_ISSUER format diverged from the originally-proposed https://auth.wagen.io/<app> to https://<app>[-<env>].wagen.io/auth (the existing per-tenant API_EXTERNAL_URL); GOTRUE_AUDIENCE = <app>-<env> matches the original proposal; GOTRUE_JWT_SECRET is kept (load-bearing, deprecated-in-place — gotrue v2.189.0 is HS256-only and cannot do RS256), NOT dropped; GOTRUE_ADMIN_API_KEY is provisioned but unused until M21. All three "Keystone-side gaps" flagged in the original §"Keystone-side gaps" are now closed. Logging format, request_id propagation, metrics/tracing/error-reporting decisions, and the local-dev template are unchanged. Companion: ADR-0006 co-amended with the HS256 verify path. See Change log below for affected sections. Per project memory project_gotrue_hs256.md.

Context

Several earlier docs name observability + secrets concerns by reference without locking the wire-level contract:

  • system-overview.md says "RBO emits JSON-to-stdout per Atlas's convention" but never specifies the field set.
  • ADR-0004 defines share_audit.request_id as a UUID for "correlation with structured logs", but never says how the request_id reaches both the log line and the audit row, so the loop is not closed.
  • ADR-0006 binds current_stringer_id / current_person_id ContextVars; how those reach log records is unspecified.
  • integrations.md lists SMTP_HOST/USER/PASS/FROM as "injected by Atlas's CI" — but the boundary contract (which secret is whose) is implicit, not written.
  • The local-dev story is undocumented: .env is gitignored but no .env.example template exists, and .gitignore would in fact silently swallow one.

Phase 1 is in flight (#97, Pax): the schema + tenancy chokepoint land before any of the cross-cutting middleware. Pax's settings module (app/core/settings.py) and Kit's CI variable surface (deploy-verification §"CI variable surface") are the two surfaces this ADR pins down.

This ADR locks the contract so Pax can write the request_id middleware deterministically, Kit can verify CI variable plumbing matches, and future readers do not re-litigate "should we add Prometheus / Sentry / OpenTelemetry."

Options

Logging format

  • (L-1) JSON-to-stdout (chosen). Atlas's documented convention. Container stdout is captured by the platform; no on-disk rotation; one structured event per line; jq-greppable. Costs one logger filter to inject context.
  • (L-2) Plain text to stdout. What app/core/logging.py does today as a placeholder. Readable in docker logs but not machine-queryable; loses field structure once volume grows. Acceptable as a temporary scaffold; not the destination.
  • (L-3) Structured logs to a log shipper (Loki, OpenSearch, etc.). Requires platform-side infrastructure RBO does not get to choose. Out of scope; if Atlas adds one, RBO consumes it transparently because stdout is the seam.

request_id source

  • (R-1) Always-generate at RBO middleware (chosen, with optional inbound passthrough). Middleware mints a UUIDv4 unless a configured inbound header is present and well-formed. Bound to a ContextVar for the request lifetime. Works whether or not Atlas injects a header.
  • (R-2) Trust Caddy / Atlas to inject. Couples RBO to a header Atlas hasn't committed to. If the header is dropped or renamed, every request gets a NULL request_id. Rejected.
  • (R-3) Use the inbound header verbatim, generate only on miss, but no validation. A malicious or misconfigured client could inject a 1MB string. Rejected as stated; (R-1) keeps the passthrough but validates the value (allowed characters and a 64-character cap, see §2) and falls back to a fresh UUID on malformed input.

Metrics surface (V2)

  • (M-1) None — observability via structured logs only (chosen). At V2 scale (~50 jobs/year/stringer, single-digit stringer count) Prometheus / StatsD / etc. earn nothing. duration_ms in the structured log gives latency; counts come from jq | wc -l. Re-evaluate at the multi-stringer threshold (Atlas-defined).
  • (M-2) Prometheus /metrics endpoint (prometheus-fastapi-instrumentator). ~50 LOC. Free at our scale. But Atlas does not run a Prometheus today; we'd ship metrics no one scrapes. Defer.
  • (M-3) StatsD push. Requires a StatsD sidecar Atlas does not provide. Rejected for V2.

Tracing surface (V2)

  • (T-1) None (chosen). OpenTelemetry adds a meaningful dependency (opentelemetry-api, opentelemetry-instrumentation-*, an exporter) and a collector Atlas does not run. At single-process FastAPI + Postgres scale, traces would tell us nothing structured logs don't.
  • (T-2) OpenTelemetry stub (no collector). "Future-proof" instrumentation that exports nowhere. Pure cost, zero benefit. Rejected.

Error reporting surface (V2)

  • (E-1) Structured logs only (chosen). A 5xx writes one structured log line with error_class, error_msg, traceback (truncated). Atlas's log aggregation is the surface; Stefan-as-operator greps. Free, no third-party dependency, no PII leakage to a third party.
  • (E-2) Sentry. Useful at multi-stringer scale where noise is real and triage matters. At our scale, the noise floor is ~0; Sentry's free tier is more setup than the value returned. V3 candidate if multi-stringer churn produces actual triage volume.
  • (E-3) Resend-emailed error digests to Stefan. Tempting one-liner, but at the first DB outage that 5xx-spams every request, the Resend free-tier 100/day cap is gone and so is every business-relevant email for the rest of the day. Rejected.

Secrets format / location

  • (F-1) Plain env vars, provisioned by keystone/scripts/10-app-onboard.sh into GitLab CI variables (env-scoped, masked + protected) (chosen). Matches keystone's existing convention as recorded in the deploy-verification doc. Single source of truth (keystone repo); RBO reads them at boot through pydantic-settings (a dependency declared in pyproject.toml).
  • (F-2) Mozilla sops-encrypted secrets file in the RBO repo. Adds a key-management story (which key, where) RBO does not need at our scale. Not Atlas's convention; rejected.
  • (F-3) HashiCorp Vault. Way out of scope for the platform.

Local dev secrets

  • (D-1) .env.example template, checked in; .env per developer, gitignored, populated from the template (chosen). Standard. Cheap. The current .gitignore swallows .env.example via .env.* — fix with a !.env.example exception.
  • (D-2) Magic ad-hoc shell rc files. Undocumented, non-portable, breaks for the next contributor. Rejected.

Decision

1. Logging format and field set (normative)

Logs are emitted as JSON to stdout, one event per line. Atlas's container runtime captures stdout; RBO writes nothing to disk. Replace the plain-text scaffold in app/core/logging.py with a JSON formatter (recommended: python-json-logger; its dependency footprint is assessed under "Costs we accept").

The mandatory field set for every log record from RBO request handlers is:

| Field | Type | Source | Notes |
| --- | --- | --- | --- |
| timestamp | ISO-8601 UTC string (2026-05-04T12:34:56.789Z) | logger | Microsecond precision recommended; second precision acceptable. Always UTC. |
| level | string | logger | DEBUG, INFO, WARNING, ERROR, CRITICAL. |
| logger | string | logger | Python logger name (app.api.routes_orders etc.). |
| message | string | handler | Free text; does not carry structured fields — use named log args for those. |
| request_id | UUID string | ContextVar | UUIDv4. See §2. Absent (or null) for non-request logs (boot, shutdown, scheduled tasks). |
| app_env | string | settings | dev \| test \| prod. Tagged once on startup; included on every line so a multi-env log aggregator can filter. |
| path | string | middleware | URL path of the request (e.g. /orders/123). Request logs only. |
| method | string | middleware | HTTP method. Request logs only. |
| status | integer | middleware | HTTP response status. Request-completion logs only. |
| duration_ms | integer | middleware | Wall-clock from request start to response start. Request-completion logs only. |

The conditional field set (present when bound):

| Field | Type | When present | Notes |
| --- | --- | --- | --- |
| stringer_id | integer | After auth middleware binds current_stringer_id ContextVar | NULL for unauthenticated, Person-bound, or boot logs. |
| person_id | integer | After auth middleware binds current_person_id ContextVar (V3 client portal) | NULL for V2 stringer paths. |

The error field set (present when level >= ERROR):

| Field | Type | Notes |
| --- | --- | --- |
| error_class | string | Exception class name (ValueError, IntegrityError, …). |
| error_msg | string | str(exc). PII-bearing strings (e.g. raw SQL params) are NOT logged here; the handler must scrub before raising or the middleware must use a generic message. |
| traceback | string | traceback.format_exc(), truncated to 8 KB. |

Forbidden fields: raw SQL query text with bound parameters; raw JWT contents; raw cookie values; raw email contents; raw passwords / API keys (this last one applies even at DEBUG — never log a value sourced from Settings.smtp_pass etc.).

Implementation shape (informative, for Pax):

  • app/core/logging.py configures the root logger with a JSON formatter and a ContextFilter that injects request_id, stringer_id, person_id, app_env from ContextVars onto every record.
  • Add fields to records via logging.LoggerAdapter or via the filter — not by passing extra={} at every call site (too easy to forget).
  • One middleware (let's call it RequestContextMiddleware) is the only place that reads the inbound request_id header, mints UUIDs, sets the ContextVar, and emits the request-completion log line.
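As a rough illustration of the shape described above, the sketch below wires a filter and a JSON formatter using only the standard library. It is a stand-in, not the python-json-logger setup the ADR recommends; the names JsonLineFormatter, make_handler and the APP_ENV constant are hypothetical, and the request-only fields (path, method, status, duration_ms) are left to the middleware.

```python
import json
import logging
import traceback
from contextvars import ContextVar
from datetime import datetime, timezone
from typing import Optional

# Assumed names: the ADR fixes the field set, not these identifiers.
request_context_var: ContextVar[Optional[str]] = ContextVar("request_id", default=None)
APP_ENV = "dev"  # would be read from Settings.app_env in the real module
TRACEBACK_LIMIT = 8 * 1024  # characters, approximating the 8 KB cap


class ContextFilter(logging.Filter):
    """Copy request-scoped context onto every record; never drops a record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_context_var.get()
        record.app_env = APP_ENV
        return True


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Emitted as null when unbound, so the key is always present.
            "request_id": getattr(record, "request_id", None),
            "app_env": getattr(record, "app_env", None),
        }
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, tb = record.exc_info
            event["error_class"] = exc_type.__name__
            event["error_msg"] = str(exc)
            text = "".join(traceback.format_exception(exc_type, exc, tb))
            event["traceback"] = text[:TRACEBACK_LIMIT]
        return json.dumps(event)


def make_handler(stream) -> logging.Handler:
    """Build a stream handler with the filter and formatter attached."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    handler.addFilter(ContextFilter())
    return handler
```

Attaching the result of make_handler(sys.stdout) to the root logger would give every module the same line shape without any call site passing context by hand.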

2. request_id propagation contract

  • Generation: RequestContextMiddleware runs first in the FastAPI middleware stack. It reads the inbound header X-Request-ID (case-insensitive). If present and the value matches ^[A-Za-z0-9._-]{1,64}$, it is used verbatim; otherwise a fresh uuid.uuid4() is generated.
  • Why a permissive validator instead of "must be UUID": Atlas's Caddy ingress may eventually inject any opaque token (Cloudflare-style hex, ULID, etc.). Validating "non-empty, no control chars, length-bounded" gives Atlas latitude without coupling RBO to a specific format. UUIDv4 is the default for RBO-minted IDs; inbound IDs are honored as-given.
  • Binding: the ID is set on request_context_var: ContextVar[str] and on Response.headers["X-Request-ID"] (so a client / Caddy can correlate the response).
  • Inbound header name: X-Request-ID (chosen). De-facto standard. Caddy does not send it by default, but a Caddy snippet can set it from the per-request UUID placeholder ({http.request.uuid}). Atlas adopts the name; if Atlas's Caddy snippet does not yet inject this header, RBO's middleware mints one and the chain still works.
  • Outbound header name: X-Request-ID on the response, same value as bound on the ContextVar.
  • Lifetime: ContextVar is set at the start of the middleware and cleared by Python's normal context-var unwinding when the middleware returns (no manual reset() needed because FastAPI / Starlette runs the middleware chain in a fresh asyncio task).
  • Scheduled-task / boot logs: request_id is null (the JSON formatter emits null rather than omitting the key — consistent shape eases log queries).
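The contract above could be sketched as a pure-ASGI middleware, shown here without Starlette so it runs on the standard library alone. The helper name resolve_request_id is hypothetical, and the explicit reset() in the finally block is a defensive choice of this sketch rather than something the contract requires.

```python
import re
import uuid
from contextvars import ContextVar
from typing import Optional

REQUEST_ID_RE = re.compile(r"^[A-Za-z0-9._-]{1,64}$")
request_context_var: ContextVar[Optional[str]] = ContextVar("request_id", default=None)


def resolve_request_id(inbound: Optional[str]) -> str:
    """Keep a well-formed inbound ID verbatim; otherwise mint a UUIDv4."""
    if inbound is not None and REQUEST_ID_RE.fullmatch(inbound):
        return inbound
    return str(uuid.uuid4())


class RequestContextMiddleware:
    """Binds the request ID for the request lifetime and echoes it back."""

    def __init__(self, app, header_name: str = "X-Request-ID"):
        self.app = app
        # ASGI header names are lower-cased bytes.
        self.header = header_name.lower().encode("latin-1")

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        inbound = None
        for name, value in scope.get("headers", []):
            if name.lower() == self.header:
                inbound = value.decode("latin-1")
                break
        request_id = resolve_request_id(inbound)
        token = request_context_var.set(request_id)

        async def send_with_id(message):
            if message["type"] == "http.response.start":
                headers = [
                    (k, v) for k, v in message.get("headers", [])
                    if k.lower() != self.header
                ]
                headers.append((self.header, request_id.encode("latin-1")))
                message = {**message, "headers": headers}
            await send(message)

        try:
            await self.app(scope, receive, send_with_id)
        finally:
            # Defensive: restore the previous value even if the app raised.
            request_context_var.reset(token)
```

In a FastAPI app this class would be registered with app.add_middleware(RequestContextMiddleware); emitting the request-completion log line is omitted here to keep the sketch short.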

3. Closing the share_audit.request_id loop

ADR-0004 declared share_audit.request_id UUID, "for correlation with structured logs." This ADR specifies the wiring:

  • The same request_context_var ContextVar is read by the share_audit writer (whether the shared_read row inserted at chokepoint admit-time, or the grant_created / grant_revoked row inserted by the share-management handler).
  • One helper (get_request_id() -> str | None) lives in app/core/logging.py (or a sibling app/core/request_context.py) — both the logger filter and the audit writer call it.
  • A regression test asserts that for a request with X-Request-ID: <known-uuid>, both (a) the structured log line and (b) the share_audit row carry the same request_id.

This is the closure ADR-0004 deferred. With it, an audit investigation can move from a share_audit row to its causal log lines in one query.
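A minimal sketch of the single-accessor idea follows. The audit-row builder and its columns (event, grant_id) are hypothetical placeholders, since the real share_audit schema is owned by ADR-0004; the point is only that the log filter and the audit writer read the same ContextVar through one helper.

```python
import logging
from contextvars import ContextVar
from typing import Optional

request_context_var: ContextVar[Optional[str]] = ContextVar("request_id", default=None)


def get_request_id() -> Optional[str]:
    """Single accessor shared by the logger filter and the audit writer."""
    return request_context_var.get()


class RequestIdFilter(logging.Filter):
    """Stamps the current request ID onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = get_request_id()
        return True


def build_share_audit_row(event: str, grant_id: int) -> dict:
    """Hypothetical row builder; only the request_id wiring is the point."""
    return {"event": event, "grant_id": grant_id, "request_id": get_request_id()}
```

Because both consumers go through get_request_id(), the correlation test reduces to comparing two values read within the same request context.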

4. Logger context binding (no manual passing)

current_stringer_id and current_person_id are bound by the auth middleware (per ADR-0006 and auth-and-tenancy.md).

  • A logging.Filter reads them from the same ContextVars and copies them onto every LogRecord. The JSON formatter then emits them.
  • Handlers and modules call logger.info("string parsed", extra={"order_id": 42}) — they do NOT pass stringer_id=... manually. The filter is the single seam.
  • A regression test asserts that a log line emitted from inside an authenticated request carries a non-null stringer_id, and that a log line emitted at boot carries stringer_id: null.
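The binding rule can be illustrated with a small self-contained example. The two ContextVars below are stand-ins for the ones ADR-0006's auth middleware binds, and IdentityFilter and ListHandler are hypothetical names used only for this sketch.

```python
import logging
from contextvars import ContextVar, copy_context
from typing import Optional

# Stand-ins for the ContextVars bound by the auth middleware.
current_stringer_id: ContextVar[Optional[int]] = ContextVar("current_stringer_id", default=None)
current_person_id: ContextVar[Optional[int]] = ContextVar("current_person_id", default=None)


class IdentityFilter(logging.Filter):
    """The single seam: copies identity context onto every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.stringer_id = current_stringer_id.get()
        record.person_id = current_person_id.get()
        return True


class ListHandler(logging.Handler):
    """Collects records in memory so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("app.api.routes_orders")
logger.setLevel(logging.INFO)
logger.propagate = False
capture = ListHandler()
capture.addFilter(IdentityFilter())
logger.addHandler(capture)


def handle_authenticated_request(stringer_id: int) -> None:
    current_stringer_id.set(stringer_id)  # done by the auth middleware in RBO
    # The call site passes only its own structured field, never stringer_id.
    logger.info("string parsed", extra={"order_id": 42})


copy_context().run(handle_authenticated_request, 7)  # request-scoped context
logger.info("boot complete")  # emitted outside any request: identity is None
```

The first captured record carries the bound stringer_id alongside order_id, while the second has both identity fields set to None, which a JSON formatter would emit as null.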

5. Metrics

No metrics surface in V2. No Prometheus, no StatsD, no /metrics endpoint.

Reasoning: the structured-log duration_ms field is the latency surface; counts come from jq | wc -l; Atlas runs no scraper today.

Re-evaluation trigger: if Atlas adds a Prometheus to keystone, OR if RBO grows past three concurrent stringers AND Stefan asks for a dashboard. Until then, this is documented "no" so future readers do not reintroduce it.
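To make the "logs as the metrics surface" claim concrete, here is a small sketch that derives request counts, 5xx counts and a nearest-rank p95 latency from captured JSON log lines. The function name summarize is hypothetical; it assumes request-completion lines carry status and duration_ms as specified in §1 and skips everything else.

```python
import json
import math


def summarize(lines):
    """Summarize request-completion log lines: count, 5xx count, p95 latency."""
    durations = []
    errors = 0
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output
        if not isinstance(event, dict) or "duration_ms" not in event:
            continue  # boot / scheduled-task lines have no duration
        durations.append(event["duration_ms"])
        if event.get("status", 0) >= 500:
            errors += 1
    durations.sort()
    # Nearest-rank percentile: smallest value with at least 95% at or below it.
    p95 = durations[math.ceil(0.95 * len(durations)) - 1] if durations else None
    return {"requests": len(durations), "errors_5xx": errors, "p95_ms": p95}
```

Feeding it the lines from docker logs for a given day yields roughly what a first dashboard would show, which is the reasoning behind deferring a metrics endpoint.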

6. Tracing

Out of scope for V2. No OpenTelemetry, no Jaeger, no Honeycomb.

Reasoning: single-process FastAPI + a single Postgres dependency does not produce meaningful spans. The structured request_id correlates everything we need at this scale.

Re-evaluation trigger: if RBO grows a worker process (RQ/arq) AND a request fans out across the worker boundary. Until then, this is documented "no."

7. Error reporting

Plain structured logs in V2. A 5xx emits a single level=ERROR log line with the error fields above. Atlas's log aggregation surface (whatever it is at the time — docker logs today, possibly a Loki/OpenSearch later) is the operator surface.

No Sentry in V2. Sentry is a V3 candidate if multi-stringer noise produces real triage volume (Stefan can no longer eyeball errors in docker logs).

Boot-failure handling: the app must not silently exit on a missing required env var. pydantic-settings raises ValidationError at get_settings() time, which Pax surfaces as a structured log line on the import path before uvicorn starts; the container exits non-zero and the keystone-generated compose restart: unless-stopped policy will surface the failure to Atlas's monitoring. This is the platform contract working as designed.
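The boot-failure behavior might look like the sketch below. It uses a plain dataclass as a stand-in for pydantic-settings so that it runs without extra packages; MissingSettings, load_settings_or_exit and the three-variable REQUIRED subset are assumptions of this example, not the real Settings module.

```python
import json
import os
import sys
from dataclasses import dataclass
from typing import Mapping, Optional

# Illustrative subset only; which variables are required is Settings' call.
REQUIRED = ("APP_ENV", "DATABASE_URL", "GOTRUE_URL")


class MissingSettings(Exception):
    """Stand-in for the ValidationError pydantic-settings would raise."""

    def __init__(self, missing):
        super().__init__("missing required env vars: " + ", ".join(missing))
        self.missing = list(missing)


@dataclass(frozen=True)
class Settings:
    app_env: str
    database_url: str
    gotrue_url: str

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "Settings":
        missing = [name for name in REQUIRED if not env.get(name)]
        if missing:
            raise MissingSettings(missing)
        return cls(env["APP_ENV"], env["DATABASE_URL"], env["GOTRUE_URL"])


def load_settings_or_exit(env: Optional[Mapping[str, str]] = None, stream=sys.stdout) -> Settings:
    """Log one structured ERROR line (names only, never values), then exit 1."""
    try:
        return Settings.from_env(os.environ if env is None else env)
    except MissingSettings as exc:
        event = {
            "level": "ERROR",
            "logger": "app.core.settings",
            "message": "settings validation failed at boot",
            "error_class": type(exc).__name__,
            "error_msg": str(exc),
        }
        stream.write(json.dumps(event) + "\n")
        raise SystemExit(1) from exc
```

The log line names the missing variables but never echoes a value, which keeps the boot path consistent with the forbidden-fields rule in §1.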

8. Secrets / config — boundary table (load-bearing)

This is the contract Kit verifies against the env-scoped CI variables provisioned by keystone/scripts/10-app-onboard.sh.

| Env var | Owner (provider) | Format | RBO consumer | Rotation | Notes |
| --- | --- | --- | --- | --- | --- |
| APP_ENV | Atlas (per-env CI variable) | dev \| test \| prod | Settings.app_env | n/a | Tags log records and is read by health endpoints. |
| LOG_LEVEL | RBO (defaulted in Settings) | DEBUG \| INFO \| … | Settings.log_level | Per-deploy | RBO can override per-env (e.g. DEBUG on test); not an Atlas-provisioned var by default. |
| PORT | Atlas (APP_PORT=3010 in config/apps/rbo.env per keystone onboarding) | integer | Settings.port | n/a | The Compose service binds the container to 127.0.0.1:$APP_PORT; Caddy fronts. |
| DATABASE_URL | Atlas (10-app-onboard) | postgresql+psycopg://<role>:<pwd>@keystone-pgbouncer:6432/rbo_<env> | Settings.database_url | Atlas-driven (rotate password → re-run onboarding script → restart RBO) | Routes through PgBouncer; transaction-pool; per project_pgbouncer_constraint.md RBO must avoid session-scoped Postgres features. |
| DATABASE_URL_DIRECT | Atlas (10-app-onboard, optional today) | postgresql+psycopg://...:5432/rbo_<env> | Settings.database_url_direct (Alembic only) | Atlas-driven | Bypasses PgBouncer for DDL. Falls back to DATABASE_URL in alembic/env.py if unset. |
| GOTRUE_URL | Atlas (10-app-onboard) | https://<app>[-<env>].wagen.io/auth (e.g. https://rbo-test.wagen.io/auth) | Settings.gotrue_url | Atlas-driven | RBO uses this as the gotrue base URL for refresh / logout / admin-API calls. ADR-0006 originally also fetched ${GOTRUE_URL}/.well-known/jwks.json; that path is non-functional on gotrue v2.189.0 (404) and is gated behind Settings.jwt_verification_mode == "rs256" (aspirational, off by default). |
| GOTRUE_ISSUER | Atlas (10-app-onboard, provisioned by !127) | URL — https://<app>[-<env>].wagen.io/auth | Settings.gotrue_issuer | Atlas-driven | Validated by ADR-0006 §"Wire-level contract". (Amended 2026-05-04: format diverged from the originally-proposed https://auth.wagen.io/<app> — that hostname does not exist on the platform; the chosen format reuses the existing per-tenant API_EXTERNAL_URL. See §"ISSUER format deviation" below.) |
| GOTRUE_AUDIENCE | Atlas (10-app-onboard, provisioned by !127) | string (<app>-<env>, e.g. rbo-prod, rbo-test) | Settings.gotrue_audience | Atlas-driven | Validated by ADR-0006 §"Wire-level contract". The per-tenant gotrue config flips GOTRUE_JWT_AUD away from gotrue's hardcoded authenticated default so cross-tenant token replay is structurally prevented. |
| GOTRUE_JWT_SECRET | Atlas (10-app-onboard) | secret string | Settings.gotrue_jwt_secret | Atlas-driven | Load-bearing for HS256 verify (current production path). (Amended 2026-05-04: kept as load-bearing — gotrue v2.189.0 is HS256-only and cannot do RS256, so this secret is the verify key, not legacy. Originally flagged for removal as a "mismatch with ADR-0006 (RS256+JWKS)"; that flag is rescinded.) Re-evaluated for deprecation when a future gotrue upgrade lands a working JWKS endpoint and Settings.jwt_verification_mode flips to "rs256". See §"JWT_SECRET retention rationale" below. |
| GOTRUE_ADMIN_API_KEY | Atlas (10-app-onboard, provisioned by !127) | secret string | (new, when admin-API needed: invite, password reset proxy) | Atlas-driven | Used for the stringer-onboarding invite (RBO calls gotrue's invite endpoint per auth-and-tenancy.md §"First stringer sign-in"). Provisioned; consumed when M21 stringer auth lands. |
| SMTP_HOST | Atlas (keystone email runbook / 10-app-onboard) | host string (smtp.resend.com) | Settings.smtp_host | Atlas-driven | Per integrations.md. |
| SMTP_PORT | Atlas | int (587) | Settings.smtp_port | n/a | STARTTLS submission port. |
| SMTP_USER | Atlas | string (resend) | Settings.smtp_user | Atlas-driven | Resend's literal username for SMTP AUTH. |
| SMTP_PASS | Atlas | secret string (re_<api-key>) | Settings.smtp_pass | Atlas-driven (rotate via Resend dashboard → onboarding script) | Per keystone ADR-0005. |
| SMTP_FROM | Atlas | email (noreply@wagen.io) | Settings.smtp_from | Per-env | RBO renders this as the From-header on outbound mail. |
| IMAGE | Atlas (10-app-onboard) | string (registry path) | 11-app-deploy.sh | n/a | Consumed by deploy script, not by RBO. |
| APP_HOSTNAME | Atlas | hostname | 11-app-deploy.sh | n/a | Consumed by Caddy/deploy, not by RBO. |
| DEPLOY_HOST | Atlas | hostname (kst1.wagen.io) | CI deploy job | n/a | Consumed by .keystone_deploy brick, not by RBO. |
| DEPLOY_SSH_KEY | Atlas (CI, file-typed, protected) | PEM | CI deploy job | Atlas-driven | Per ADR-0006 §App runtime contract — never reaches RBO container. |

Boundary in one sentence: Atlas provisions every secret RBO consumes; RBO consumes only via Settings; rotation is Atlas's responsibility; RBO restarts on env-var change picked up by 11-app-deploy.sh's docker compose up -d. RBO owns no platform-grade secret of its own.

ISSUER format deviation (added 2026-05-04)

The originally-proposed GOTRUE_ISSUER format was https://auth.wagen.io/<app> (a hypothetical platform-wide auth domain). Atlas's keystone !127 instead provisioned https://<app>[-<env>].wagen.io/auth — the existing per-tenant API_EXTERNAL_URL on each gotrue container. Rationale:

  • auth.wagen.io does not exist on the keystone platform. Inventing it would have forced a wildcard cert + Caddy reverse-proxy slot for a single use case (issuer-claim string-match), buying nothing operationally.
  • The chosen format is what gotrue v2.189.0 naturally emits in the iss claim when API_EXTERNAL_URL is set to that value. Aligning RBO's GOTRUE_ISSUER with gotrue's own iss output means zero translation in the verify path.
  • Per-env separation is preserved because the env suffix lives in the host (rbo.wagen.io vs rbo-test.wagen.io), not in a path segment. A token minted for rbo-test carries a different iss than a rbo-prod token; cross-env replay is structurally rejected.

The path-suffix /auth matches the Caddy route on each app host that fronts gotrue. This is consistent with the gotrue-base-URL pattern already used by Settings.gotrue_url.

JWT_SECRET retention rationale (added 2026-05-04)

The original §"Keystone-side gaps" entry proposed dropping GOTRUE_JWT_SECRET once ADR-0006's RS256+JWKS path landed. That proposal is rescinded because Atlas's field-verification revealed:

  • Deployed gotrue is v2.189.0, HS256-only. The image does not expose /auth/.well-known/jwks.json (404). The RS256+JWKS verify path RBO would use is non-functional on this image.
  • GOTRUE_JWT_SECRET is therefore the verify key, not a legacy artefact. Dropping it would 401 every authenticated request.
  • Re-evaluation trigger: when keystone upgrades gotrue to a JWKS-capable version (no fixed date), Settings.jwt_verification_mode flips from "hs256" to "rs256" and the secret can be retired. Until then the secret is load-bearing and stays in the onboarding-script variable set, marked deprecated-in-place rather than removed.

This is the project memory project_gotrue_hs256.md constraint applied to the env-contract surface; it is co-amended with ADR-0006 §"Wire-level contract".
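To show how GOTRUE_JWT_SECRET, GOTRUE_ISSUER and GOTRUE_AUDIENCE fit together on the HS256 path, here is a standard-library-only sketch. It is illustrative: the real verify path is specified in ADR-0006 and would normally sit on a JWT library; sign_hs256 exists only so the example can mint a token, and both function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    """Mint a token for the example only; in production gotrue is the issuer."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: str, issuer: str, audience: str) -> dict:
    """Return the claims if alg, signature, iss, aud and exp all check out."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    if header.get("alg") != "HS256":
        raise ValueError("unexpected alg")
    expected = hmac.new(
        secret.encode(), f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time compare
        raise ValueError("bad signature")
    if claims.get("iss") != issuer:
        raise ValueError("issuer mismatch")
    aud = claims.get("aud")
    if audience not in (aud if isinstance(aud, list) else [aud]):
        raise ValueError("audience mismatch")
    if claims.get("exp", 0) <= time.time():
        raise ValueError("token expired")
    return claims
```

A token minted for rbo-test fails the audience check when verified with rbo-prod settings, which is the cross-tenant replay protection the table describes.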

9. Local dev template

Two small changes:

(a) Add .env.example to the repo root with the field set above (NO real values; commented placeholders). The file is the documented starting point for cp .env.example .env.

(b) Patch .gitignore: the existing .env.* glob silently ignores .env.example. Insert !.env.example immediately after .env.* so the template tracks.

The template covers: APP_ENV, LOG_LEVEL, PORT, DATABASE_URL, DATABASE_URL_DIRECT, GOTRUE_URL, GOTRUE_ISSUER, GOTRUE_AUDIENCE, GOTRUE_JWT_SECRET, SMTP_HOST/PORT/USER/PASS/FROM. (Amended 2026-05-04: GOTRUE_JWT_SECRET is included in the template — it is the HS256 verify key, load-bearing under gotrue v2.189.0; previously listed as "going away" pre-Atlas-field-verification.) It does NOT include GOTRUE_ADMIN_API_KEY (V2 phase 2; safe to add when M21 lands) or any DEPLOY_* keys (CI-only).
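A sketch of what the two changes could look like. Placeholder values are copied from the Format column of the §8 table; concrete local-dev values (for example a local database host) are not specified in this ADR and are left to each developer.

```text
# .env.example (placeholders only; copy to .env and fill in locally)
APP_ENV=dev
LOG_LEVEL=INFO
PORT=3010
DATABASE_URL=postgresql+psycopg://<role>:<pwd>@keystone-pgbouncer:6432/rbo_<env>
# Optional; Alembic falls back to DATABASE_URL if unset
DATABASE_URL_DIRECT=postgresql+psycopg://...:5432/rbo_<env>
GOTRUE_URL=https://<app>[-<env>].wagen.io/auth
GOTRUE_ISSUER=https://<app>[-<env>].wagen.io/auth
GOTRUE_AUDIENCE=<app>-<env>
# HS256 verify key; never commit a real value
GOTRUE_JWT_SECRET=<secret>
SMTP_HOST=smtp.resend.com
SMTP_PORT=587
SMTP_USER=resend
SMTP_PASS=re_<api-key>
SMTP_FROM=noreply@wagen.io
```

```text
# .gitignore (relevant lines only)
.env.*
!.env.example
```

The negation line has to come after the .env.* glob, because later patterns override earlier ones in .gitignore.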

Required tests (this ADR mandates them)

  1. JSON shape test. A request hitting any handler emits at least one log line whose JSON parses, contains the mandatory field set, and has request_id matching the request's X-Request-ID response header.
  2. request_id mint test. A request with no inbound X-Request-ID gets a fresh UUIDv4 in both the response header and the structured log.
  3. request_id passthrough test. A request with X-Request-ID: 11111111-1111-1111-1111-111111111111 carries the same value end-to-end.
  4. request_id validation test. A request with X-Request-ID: <2KB string> is rejected/replaced — the header is treated as malformed and a fresh UUIDv4 is minted; the response carries the minted value.
  5. stringer_id binding test. An authenticated request's log line includes stringer_id; an unauthenticated /healthz request's log line has stringer_id: null.
  6. share_audit.request_id correlation test. A request that triggers a share-grant write produces (a) a share_audit row and (b) a structured log line that share the same request_id value.
  7. PII non-leak test. A request that fails JWT validation (signature mismatch) emits an ERROR log; the log JSON does NOT contain the raw cookie value, the raw token, or any header longer than 64 bytes.
  8. Boot-failure visibility test. Starting the app with no DATABASE_URL and then requesting /healthz/db produces an ERROR-level log line in the structured format; /healthz (liveness) still returns 200.

Tests 5 and 6 ride with Pax's Phase 1 chokepoint test suite; tests 1–4, 7, 8 ride with Pax's middleware MR.

Consequences

Good

  • One normative log shape. Pax has a single field-set spec; future routes inherit it for free via the filter; log queries can rely on request_id and stringer_id being present where expected.
  • Audit-to-log traversal closes. ADR-0004's share_audit.request_id is no longer a dangling UUID — it correlates to a structured log line in one jq query.
  • No platform infrastructure RBO doesn't already have. No Prometheus, no Sentry, no OTel collector. Atlas's stdout-capture is the only dependency.
  • Secrets boundary is one table. Kit's CI-variable verification has a single artifact to diff against.
  • Local dev template fixes a silent gap. .env.example tracks; .gitignore no longer eats it.
  • Future re-evaluation has explicit triggers. "Add Prometheus" / "Add Sentry" are documented "no, until X" — future readers do not silently regress the decision.

Costs we accept

  • ~~Two new env vars not yet in onboarding (GOTRUE_ISSUER, GOTRUE_AUDIENCE) and one ADR-0006-induced cleanup (GOTRUE_JWT_SECRET → JWKS) — flagged for Atlas in §"Keystone-side gaps" below.~~ (Amended 2026-05-04: closed by Atlas's keystone !127. GOTRUE_ISSUER and GOTRUE_AUDIENCE are now provisioned per tenant; GOTRUE_JWT_SECRET is retained as the load-bearing HS256 verify key, NOT cleaned up — the proposed RS256+JWKS migration was non-viable on gotrue v2.189.0.)
  • python-json-logger (or equivalent) is a new dependency. ~10 KB, zero transitive deps, BSD-licensed. Acceptable.
  • request_id filter is the only mandatory cross-cutting middleware. One more thing to import correctly in app_factory.py. Mitigated by tests 1–4 catching omissions.
  • No metrics means no SLO dashboard. At V2 scale, Stefan-as-operator is the dashboard. Documented; revisited at the multi-stringer trigger.
  • Truncation rules (traceback 8 KB, header 64 B) are arbitrary cutoffs. They can be promoted to Settings fields if the defaults bite; not worth a knob today.

Migration impact (for Pax + Kit)

  • Pax (#97 + middleware MR): Settings gains gotrue_issuer: str | None, gotrue_audience: str | None, optional log_format: Literal["json", "text"] (default "json" in prod/test, "text" in dev for readability). gotrue_jwt_secret is load-bearing (HS256 verify key) — see ADR-0006. Add request_id_header: str = "X-Request-ID" so the inbound header name is overridable for tests / Atlas coordination. (Amended 2026-05-04: gotrue_jwt_secret is no longer "deprecated until JWKS lands" — it is the active verify key under gotrue v2.189.0. Pax-B's HS256 rework lives in #106.)
  • Kit (CI): verify Atlas's keystone !127 provisions GOTRUE_ISSUER, GOTRUE_AUDIENCE, GOTRUE_JWT_SECRET, and GOTRUE_ADMIN_API_KEY for each app-env tuple. (Amended 2026-05-04: all four are now provisioned per !127; the original "remove or repurpose" guidance for GOTRUE_JWT_SECRET is rescinded.)
  • app/core/logging.py: swap the plain-text formatter for JSON; add the ContextFilter. This is the formatter swap app/core/logging.py explicitly anticipates ("Replacing it with a JSON formatter is a one-line swap").
  • No schema migration. share_audit.request_id already exists in ADR-0004's spec; Pax wires the writer to read the ContextVar.

Keystone-side gaps (flagged for Atlas; do NOT open keystone MRs from this RBO MR)

Per the project memory entry "Atlas — propagate contract fixes," gaps in keystone artifacts are flagged with proposed fixes but the keystone MR is Atlas's to open. Three gaps surfaced while writing this ADR.

All three closed 2026-05-04 by Atlas's keystone !127 (closing keystone#102). The original gap descriptions and resolution notes are kept below as historical record.

  1. ~~GOTRUE_ISSUER and GOTRUE_AUDIENCE are missing from keystone/scripts/10-app-onboard.sh's RBO variable set.~~ Per ADR-0006 they are required for JWT validation. Original proposed fix: extend the onboarding script to write both for each app-env tuple, defaulted to https://auth.wagen.io/<app> and <app>-<env> respectively. Closed by !127 — provisioned per app-env tuple. The ISSUER value diverged from the proposal (https://<app>[-<env>].wagen.io/auth instead of https://auth.wagen.io/<app>); see §"ISSUER format deviation" for rationale. AUDIENCE matches the original proposal.
  2. ~~GOTRUE_JWT_SECRET predates ADR-0006's RS256-via-JWKS decision.~~ Original proposed fix: drop from the onboarding script's variable set (or repurpose as a fallback HS256 key if Atlas's gotrue is configured for HS256 — confirm with Atlas first). Closed by !127 with the alternative path: kept as load-bearing. Atlas field-verified that gotrue v2.189.0 is HS256-only and exposes no JWKS endpoint, so the secret is the verify key — not a legacy artefact. See §"JWT_SECRET retention rationale" for the full reasoning. Re-evaluation trigger: future gotrue upgrade landing JWKS support.
  3. ~~GOTRUE_ADMIN_API_KEY is not provisioned.~~ Original proposed fix: when M21 (#74) ships, add this as an env-scoped, masked, protected CI variable via 10-app-onboard.sh. Closed by !127 ahead of M21 — provisioned now; consumed when M21 lands.

These three flags landed as a comment on keystone#102 and were resolved by Atlas in the same MR (!127). #99 (this ADR's closing issue) carries the back-link.

Open questions (Stefan-confirm)

  1. Inbound request_id header name — defaulted to X-Request-ID. Caddy can be configured to send a header of this name. If Atlas's Caddy snippet uses a different header (e.g. X-Correlation-ID), one-line override via Settings.request_id_header. Default kept.
  2. JSON-formatter library — python-json-logger is the recommended pick (smallest, BSD, mature). structlog is a richer alternative but introduces a different logger API surface; not worth it at our complexity. Default: python-json-logger.
  3. Dev-environment log format — defaulted to plain text in dev, JSON in test/prod. Dev-text reads better in docker logs; prod-JSON parses with jq. Stefan to confirm or flip to "JSON everywhere" (slightly more uniform, slightly less readable locally).
  4. Truncation defaults — traceback 8 KB, error_msg uncapped (relies on the handler not raising 1MB strings), header values 64 B. Stefan to flip if a real example bites.
  5. Logger name pattern — app.api.routes_orders (Python module path) vs. rbo.orders (a curated namespace). Default: Python module path (free, automatic). Stefan to flip if a curated namespace would help log-routing.

All five default to the values above; each is a single-line change.

Change log

Date Change Reason
2026-05-04 Initial Proposed version (env-contract table flagging three keystone-side gaps; GOTRUE_ISSUER proposed as https://auth.wagen.io/<app>; GOTRUE_JWT_SECRET proposed for removal once ADR-0006's JWKS path landed).
2026-05-04 Amendment — env-contract table updated to reflect Atlas's actual !127 provisioning. Updated: top-of-file amendment block, env-contract table rows for GOTRUE_URL/GOTRUE_ISSUER/GOTRUE_AUDIENCE/GOTRUE_JWT_SECRET/GOTRUE_ADMIN_API_KEY, two new sub-decision blocks (§"ISSUER format deviation", §"JWT_SECRET retention rationale"), §9 local-dev template (GOTRUE_JWT_SECRET added), §"Costs we accept" (gap row reframed as closed), §"Migration impact" (Pax + Kit guidance updated), §"Keystone-side gaps" (all three closed by !127 with historical record). Logging format, request_id propagation, metrics/tracing/error-reporting, local-dev template structure are unchanged. Atlas's keystone !127 (closing keystone#102) field-verified gotrue v2.189.0 as HS256-only and provisioned per-tenant GOTRUE_ISSUER, GOTRUE_AUDIENCE, GOTRUE_ADMIN_API_KEY while keeping GOTRUE_JWT_SECRET load-bearing. ADR-0006 co-amended. Per project memory project_gotrue_hs256.md. Tracked in #108.

Cross-references

  • ADR-0001 — names the platform boundary; this ADR fills in observability + secrets.
  • ADR-0004 — share_audit.request_id correlation closed here.
  • ADR-0006 — the JWT contract whose current_stringer_id / current_person_id ContextVars feed this ADR's logger filter; co-amended 2026-05-04 (HS256 verify path).
  • auth-and-tenancy.md — chokepoint binds the ContextVars this ADR reads.
  • integrations.md — SMTP_* env vars referenced in §8.
  • system-overview.md — "JSON-to-stdout per Atlas's convention" claim that this ADR turns into a normative spec.
  • process/deploy-verification-2026-05-04.md §"CI variable surface" — current variable set this ADR contracts against.
  • project_pgbouncer_constraint.md — DB connectivity rule referenced in §8.
  • keystone#102 / !127 — Atlas's per-tenant gotrue env provisioning (closes the three §"Keystone-side gaps" entries).
  • #106 — Pax-B's HS256 code rework (companion MR to this amendment).
  • #108 — this amendment.
  • Pax's middleware MR (to follow) — implements the spec.
  • Kit's CI-verification follow-up — confirms the secrets boundary table matches 10-app-onboard.sh.