V1 → V2 Upload Spec¶

How Stefan migrates his V1 spreadsheet into RBO V2 at cutover. This document is the operator-facing companion to the M15 ETL (migration/etl.py) and reconciliation report (migration/report.py) shipped in MR !40 (#96) and MR !54 (#112).

Audience: Stefan (the operator who runs the upload). Engineers maintaining the parser / ETL look here for the contract; the ETL implementation lives in migration/.

The repo XLSX is the format spec, not production data¶

stringing_orders.xlsx at the repository root is the format specification, not the production data source. Stefan continues using V1 personally until V2 cutover, so:

The committed XLSX is a snapshot frozen at some past date — it is stale by design.
Tests use a small fixture XLSX built in-memory by tests/migration/conftest.py::miniature_xlsx. CI never imports the repo XLSX.
The production dataset is whatever XLSX file Stefan exports from his local V1 instance at upload time. It always has more rows than the committed copy, never fewer.

If a future engineer needs to refresh the spec snapshot for documentation purposes, that is fine — but understand that it does not affect the migration in any way.

The cutover flow¶

Each upload is the same operation; the difference between test and prod is which DATABASE_URL_DIRECT it points at.

Test upload (dress rehearsal).
```
DATABASE_URL_DIRECT=postgres://…test… \
  python -m migration.etl \
    --xlsx /path/to/stefans_current_xlsx.xlsx \
    --reconcile \
    --report-output /tmp/recon-test.md \
    --dry-run
```
--dry-run wraps the ETL in a transaction that is rolled back at exit — no rows are committed, but the parser runs end-to-end, the ETL exercises the full write path against the test schema, and the reconciliation report is produced.
Stefan reviews the reconciliation report.
The Summary line at the top says OK or FAIL.
When non-clean, a Differences section names exactly which counts and per-year revenue numbers diverge.
The Mean turnaround section lets Stefan sanity-check that the parsed Ordered → Strung distribution matches his memory of how V1 has looked over the years.
The V1 data-quality issues table lists every parser-side warning (date-causal violations, orphan racket-IDs in self-orders, out-of-range tensions) with Sheet:Row references.
Fix V1 issues (if any) and re-upload. The XLSX is Stefan's editable artefact; if a date-causal violation or a typo is worth fixing, fix it in V1 and re-run. The ETL is idempotent (see § Idempotency below) so a re-upload converges cleanly.
Test upload without --dry-run once Stefan is satisfied. This commits to the test database. He can now click around the V2 app pointed at the test DB and verify visually.
Production upload at switch-over. Same command, prod connection string, same XLSX (or a slightly newer one if he kept stringing in V1 between the dress rehearsal and the actual switch-over).
Final re-run at the cutover window if needed. If Stefan stringed any rackets between the prod upload and the actual switch-over moment, he uploads his then-latest XLSX again. The ETL skips every row already imported (V1MigrationLog hit) and writes only the new ones.

Idempotency¶

The ETL is idempotent by construction. Each row inserted is logged in v1_migration_log (v1_kind, v1_ref, v2_id) inside the same transaction as the row write. On a re-run:

Looking up (kind, ref) returns the previously assigned V2 PK → skip insert.
A re-upload produces an ETLStats summary with all-zero insert counters and a non-zero skipped_idempotent.
The receipt-number assignment (_assign_v1_receipt_numbers) only updates Orders whose receipt_number IS NULL, so already-numbered rows are left alone.
The receipt-counter seed (_seed_receipt_counters) uses INSERT ... ON CONFLICT DO UPDATE SET last_n = GREATEST(...) so the live mint never decreases.

The provenance table is the durable trail. Given any V2 row, joining <table>.id = v1_migration_log.v2_id reveals the V1 origin (kind + ref). After Stefan signs off the migration he can DROP the table — it is V1-only and the V2 schema does not depend on it.

XLSX format spec¶

The V1 spreadsheet has 4 sheets. The parser (migration/parse_v1.py) hard-codes column indices and asserts each header on parse — see § Safety net below. If Stefan renames a column or reorders cells in V1 ahead of cutover, the parser fails loudly rather than silently shifting downstream by one column.

Sheet `Client-Stringing-Orders`¶

35 columns, two-row header (group label on row 1, field name on row 2). The parser reads the first 27 columns (cols 28-35 are V1-derived statuses and year/month aggregates that the ETL recomputes from the raw data).

Group	Field	Purpose
ID	`ID`	V1 sequential, 1..N
Dates	`Ordered`, `Strung`, `Returned`, `Paid`	Lifecycle dates; all four optional in V1
Player	`Last Name`, `First Name`	The only client identifiers V1 has
Racquet	`Model`, `Head Size`, `String Pattern`	Free text per order; not normalized
Main-String	`Manufacturer`, `Model`, `Gauge`, `Tension`, `Price`, `BYO`	"Bring your own" Yes/No on price
Cross-String	(same six fields)
Price	`Labor`, `Strings`, `Total`	CHF; BYO sides excluded from `Strings`
DT	`Dynamic Tension`	Mostly empty in current data
Comments	free text

The parser treats a row as "real" iff Last Name is non-null. V1 pre-fills the ID column into the future (rows beyond the last data row carry an ID but blank everything else); the lightest discriminator that survives all the partial-state lifecycle cases is the player's last name.

Sheet `Self-Stringing-Orders`¶

20 columns, two-row header. Single date (Strung); no order/return/paid lifecycle. Each row references a stable racket ID (e.g. PA98_2023_25) that VLOOKUPs into Lists.My Racquets.

Group	Field	Purpose
Date	`Strung`	The only date V1 records for self-jobs
Racket	`ID`, `Model`, `String Pattern`	`ID` references `Lists.My Racquets`
Main	`Manufacturer`, `Model`, `Gauge`, `Color`, `Tension`	Self-side has `Color` (client-side does not)
Cross	(same five fields)
(single col)	`Method`	"Standard" or "Sergetti" in current data
Status	`Strung`, `Year`, `Month`	V1-derived; ETL recomputes
Dynamic Tension	`After Stringing`	mostly empty in current data
Comments	free text

If a self-order references a racket ID that is not in Lists.My Racquets (retired-pre-tracking case), the ETL synthesizes a Racket row owned by Stefan's self-ClientProfile with visibility=catalogue_private and the V1 racket ID stored on instance_id. Each synthesis emits a parser-side warning that surfaces in the reconciliation report.

Sheet `Lists`¶

Two lookup tables stacked in column A:

BYO: Yes, No.
My Racquets: 7 columns — ID, Racket, String Pattern, Head Size, Year, Current (yes/no, in rotation), # Stringjobs (V1 COUNTIF over Self-Stringing-Orders).

The # Stringjobs value is informational; the ETL recomputes it from the migrated Order rows.

Sheet `Stats`¶

V1's derived dashboard. The parser does not read it — it is V1-side scaffolding. The reconciliation report compares the ETL's per-year revenue against the parser's own re-aggregation of the raw client-order rows, not against the Stats sheet's pre-computed totals. (Why: see § The 328-vs-329 discrepancy below.)

Safety net: hard-fail on column rename¶

The parser refuses to run if any V1 header drifts. From migration/parse_v1.py::_assert_header:

_assert_header(ws, 2, _CLIENT_COLS["id"], "ID", "Client-Stringing-Orders")
_assert_header(ws, 2, _CLIENT_COLS["ordered"], "Ordered", "Client-Stringing-Orders")
# ... and so on for every column the parser reads

If Stefan renames Strung to Strung Date in V1 (or any other column), the parser raises a ValueError naming the sheet, the row, the expected header, and the value it found. The ETL never starts; no V2 rows are written from a half-mapped XLSX.

This is the only parser failure mode that aborts the run. Soft issues (date-causal violations, out-of-range tensions, unknown racket IDs in self-orders) are warnings: the row is kept and the warning is collected into the reconciliation report.

Reconciliation report¶

migration/report.py::render_markdown produces the markdown report Stefan signs off on. Sections, in render order:

Summary — V1 row counts, the revenue tolerance the report was scored against, and an OK / FAIL status line.
Differences — only when non-clean. Lists the specific count + per-year revenue mismatches that caused the failure. Sub-tolerance revenue diffs are NOT echoed here (they still appear in the per-year table for the audit trail) — the intent is "tell Stefan what is wrong, nothing else."
Counts — V1 vs V2 row counts per category (client orders, self orders).
Per-year revenue (CHF, client orders only) — V1 sum vs V2 sum vs delta, per year. Self-orders have no price column so they are excluded.
Mean turnaround (client orders, Ordered → Strung) — sample size, mean, median, min, max in days. Self-orders are excluded (no Ordered date). Negative turnarounds (Strung-before-Ordered, V1 date-causal violations) are kept in the sample so the aggregate is honest.
V2 entity totals — per-table row counts (orders, client_profiles, persons, rackets).
Per-client order counts (top 20) — display name → count, descending. Self-orders bucketed under the literal (self) key.
V1 data-quality issues — every parser-side warning with Sheet:Row references and the message text.

Stefan's at-a-glance signal is the Summary status line plus the Differences section (when present). The detailed tables are the audit trail.

Upload checklist¶

Before signing off either the test or the prod upload:

[ ] Status line is OK. When FAIL, the Differences section names the specific gap. Investigate before signing off.
[ ] Counts match. Client orders V1 vs V2 — the delta should be 0.
[ ] Revenue per year matches within tolerance. Default tolerance is 0.01 CHF (one cent); the Summary echoes the value used.
[ ] Warning count is the expected V1 baseline + new issues. The 18 known V1 client-order date-causal violations are warnings (relaxed at the ADR-0007 boundary; see migration/etl.py::_suppress_order_hooks). Anything new should be triaged.
[ ] Per-client top-20 looks right. Stefan knows his clients; a rename or de-dupe issue is fastest spotted here.
[ ] Mean turnaround is in the expected range. A wildly off mean signals a date-parsing regression.

The 328-vs-329 discrepancy (expected, not a bug)¶

V1's Stats sheet shows "Client orders: 328" but the parser counts 329. This is a known V1-side artefact — the V1 Stats SUMIF formula excludes one row for reasons buried in the V1 spreadsheet's own logic (likely a row whose Status flag is unset in a way the SUMIF treats as "not a real order"). It is not a parser bug.

The reconciliation report uses the parser's count (329), not the V1 Stats sheet, as the source of truth for the V1 side. The expectation is therefore:

Parser count: 329.
V2 count after ETL: 329.
Reconciliation delta: 0.

If on Stefan's real upload the parser count differs from V1 Stats by one (or a small number) on Client orders, that is the same kind of V1-side counting quirk and is not by itself cause for alarm. What matters is that the parser count and the V2 count agree.

Why no production XLSX in CI¶

The committed XLSX is stale-by-design (see § The repo XLSX is the format spec…). Importing it in CI would either:

Wedge CI to a frozen snapshot that diverges further from production every week, OR
Force Stefan to re-commit a fresh XLSX before every cutover dress-rehearsal.

Neither is desirable. The CI smoke job (test:migrate-smoke) covers the schema and the Alembic chain; the import is operator-driven at cutover and validated by Stefan from the reconciliation report.

migration/parse_v1.py — V1 XLSX parser (MR !40, #96).
migration/etl.py — V1 → V2 ETL driver, idempotent, with --reconcile + --dry-run (MR !54, #112).
migration/report.py — markdown reconciliation report renderer.
app/db/models/migration.py::V1MigrationLog — provenance / idempotency table.
docs/architecture/data-model.md § Migration shape — entity-level migration spec.
docs/requirements/v1-baseline.md — V1 system as the V2 migration target.
ADR-0007 § Order lifecycle state machine — date-causal invariant, relaxed for V1-migrated rows.
ADR-0008 § Migration from V1 — receipt-numbering migration path P-2.