Catalogue imports (Phase 1 of #164)
Status: Phase 1 shipped (racket-book#166, 2026-05-21).
Architecture proposal: racket-book#164 (Theo, Accepted 2026-05-21).
Parent epic: racket-book#146 (V3 catalogue auto-update).
Stefan's 7-question answers (2026-05-21): runner in racket-book CI; no brand priority; same-repo code; hybrid manual-edit-on-imported semantics; skip tension; capture image URL; weekly cadence.
What Phase 1 ships
A staging surface for external-feed catalogue rows + an admin moderation queue, mirroring the V1-upload stage->dry-run->approve UX pattern.
Components
| Surface | Module | Notes |
|---|---|---|
| Table | catalogue_imports (Alembic 0014_catalogue_imports) | One row per (source_name, source_external_id, content_hash). Dedup UNIQUE drives importer idempotency. Partial pending-queue index for the admin list. |
| Source enums | RacketSource / StringSource (shared catalogue_source Postgres ENUM) | Phase 1 adds manual_edited_from_import. The flip-on-stringer-edit automation lands in Phase 2; the enum value lands now so Phase 2 is a no-migration code change. |
| Audit enum | AdminAuditAction + AdminAuditTargetType (TEXT cols, app-layer enums) | Three actions: catalogue.import.promote / .reject / .match. One target type: catalogue_import. |
| Importer | app/catalogue_sources/kaggle_seed.py | Fetches the LukeBatten17/tennis-racquets-and-strings-dataset CC0 CSVs from GitHub (Path A). ~1300 rackets + ~780 strings; brand + model + (string-type) columns only -- no tension fields per Stefan Q5. |
| CLI | scripts/seed_catalogue.py | python -m scripts.seed_catalogue {rackets,strings,all} [--dry-run]. Runs the importer + writes via the sync engine pattern. |
| Service | app/services/catalogue_imports.py | stage_rows (sync, batch) + get_import / list_pending / count_pending / promote / reject / match / dry_run_promote (async; route surface). |
| Routes | app/api/routes_admin_catalogue_imports_pages.py | Six handlers under /admin/catalogue/imports/.... Admin-only via require_admin. Each decision emits one admin_audit_log row in the same transaction as the catalogue write (ADR-0011 composite case). |
| Templates | app/web/templates/admin/catalogue_imports_{queue,detail}.html + the promote-confirm modal partials | Dense data-testid attributes per the project convention (10+/page). Mirrors the M17 + V1-upload visual posture. |
| Dashboard | app/web/templates/dashboard.html (admin chip) | Surfaces the pending-imports count when > 0; amber accent + parcel icon. Distinct from the M17 catalogue-submissions chip. |
| CI job | .gitlab-ci.yml catalogue:sync | Schedule-only ($CI_PIPELINE_SOURCE == "schedule"). Phase 1 runs seed_catalogue all --dry-run -- fetches the upstream CSV and validates parse-ability, no DB write. |
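The dedup key and the pending-queue index from the table can be sketched with the standard-library sqlite3 module. This is an illustrative stand-in, not the Alembic 0014 migration: production runs on Postgres, and the status column, the index name, the source_name value and the sample row are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE catalogue_imports (
        id INTEGER PRIMARY KEY,
        source_name TEXT NOT NULL,
        source_external_id TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',  -- assumed column name
        UNIQUE (source_name, source_external_id, content_hash)
    );
    -- Partial index: only pending rows are indexed for the admin queue.
    CREATE INDEX ix_catalogue_imports_pending
        ON catalogue_imports (id) WHERE status = 'pending';
    """
)


def stage(row):
    """Insert one row; a repeat of the same dedup key is silently skipped."""
    cur = conn.execute(
        "INSERT INTO catalogue_imports "
        "(source_name, source_external_id, content_hash) "
        "VALUES (?, ?, ?) ON CONFLICT DO NOTHING",
        row,
    )
    return cur.rowcount  # rows actually inserted by this statement


# Hypothetical sample row, staged twice to simulate an importer re-run.
row = ("kaggle_seed", "wilson:pro-staff-97", "abc123")
first = stage(row)
second = stage(row)
total = conn.execute("SELECT COUNT(*) FROM catalogue_imports").fetchone()[0]
```

Because the UNIQUE constraint absorbs the second insert, re-running the importer against unchanged upstream data leaves the table as it was.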
Lifecycle

    +--------+    importer     +---------+     admin      +-----------+
    |  Feed  | ------------->  | pending | ------------>  | promoted  |
    | (CSV   |   ON CONFLICT   |         |   promote()    | (new Rkt) |
    |  / API)|   DO NOTHING    +---------+                +-----------+
    +--------+                      |
                                    |  admin reject()
                                    |     +-----------+
                                    |---->| rejected  |
                                    |     +-----------+
                                    |
                                    |  admin match()
                                    |     +-----------+
                                    +---->| matched   |  (links to existing
                                          +-----------+   Racket/String;
                                                          source unchanged)

Phase 2+: re-run of the same source_external_id with a DIFFERENT
content_hash -> new pending row; older row -> superseded.
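The lifecycle can be read as a small state machine. The sketch below is one possible encoding, not the code in app/services/catalogue_imports.py: the ImportStatus enum, the DECISIONS mapping and the decide helper are hypothetical names, and treating every non-pending status as terminal is an assumption drawn from the diagram. The audit action strings are the three listed in the Components table.

```python
from enum import Enum


class ImportStatus(str, Enum):
    PENDING = "pending"
    PROMOTED = "promoted"
    REJECTED = "rejected"
    MATCHED = "matched"
    SUPERSEDED = "superseded"  # Phase 2+ only


# Admin decision -> (resulting status, audit action name).
DECISIONS = {
    "promote": (ImportStatus.PROMOTED, "catalogue.import.promote"),
    "reject": (ImportStatus.REJECTED, "catalogue.import.reject"),
    "match": (ImportStatus.MATCHED, "catalogue.import.match"),
}


def decide(current: ImportStatus, decision: str) -> tuple[ImportStatus, str]:
    """Return (new status, audit action); only pending rows can be decided."""
    if current is not ImportStatus.PENDING:
        raise ValueError(f"cannot {decision} a row in status {current.value!r}")
    return DECISIONS[decision]
```

Returning the audit action together with the new status mirrors the rule that each decision writes exactly one audit row alongside the catalogue change.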
Identity / dedup
Phase 1 keys imports on (source_name, source_external_id, content_hash).
The Kaggle importer synthesises source_external_id from brand:model
(slugged). Phase 2 brand-importers will use real manufacturer SKUs.
Slug + alias-table fuzzy match is deferred to Phase 2 (Theo's #164 risk note: dedup-mistake-on-fuzzy is the architecturally load-bearing risk; Phase 1 sidesteps it by being seed-only + admin-click-required for every promote).
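A minimal sketch of how the dedup key could be derived for a Kaggle row. The slug rules, the SHA-256 digest and the canonical-JSON form are assumptions for illustration; this document does not specify which algorithm kaggle_seed.py uses, and the function names here are hypothetical.

```python
import hashlib
import json
import re


def slugify(text: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics to '-', trim the ends."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def external_id(brand: str, model: str) -> str:
    """Synthesise a brand:model identifier from slugged parts."""
    return f"{slugify(brand)}:{slugify(model)}"


def content_hash(payload: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With these helpers, external_id("Wilson", "Pro Staff 97 v14") gives "wilson:pro-staff-97-v14". Any change to the payload yields a different content_hash, which is what lets a later re-run of the same source_external_id land as a new pending row.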
Image handling
Per Stefan's Q6=yes: image_url lands on the import row. Templates
hot-link from the manufacturer URL (no caching, no proxying). Phase 1's
Kaggle dataset has no image URLs so the column is NULL on every Phase 1
row. Phase 2 brand-importers populate it from sitemap+JSON-LD.
Tension recommendations
Per Stefan's Q5=skip: tension data is NOT imported. The importer's
normaliser never populates a tension field, even if the upstream
provides one. This is enforced in kaggle_seed.fetch_* (the returned dict
has no tension key) and checked by the integration test
test_admin_catalogue_imports.test_fetch_rackets_normalises_correctly
(asserts "tension" not in payload).
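One way to get this guarantee is an allow-list normaliser, sketched below. The allow-list approach, the ALLOWED_FIELDS tuple, the string_type key name and the sample values are assumptions; the real kaggle_seed.fetch_* functions may achieve the same result differently.

```python
# Hypothetical allow-list: anything not named here is dropped.
ALLOWED_FIELDS = ("brand", "model", "string_type")


def normalise_row(raw: dict) -> dict:
    """Keep only allow-listed, non-empty fields with whitespace trimmed."""
    payload = {k: raw[k].strip() for k in ALLOWED_FIELDS if raw.get(k)}
    # Same check the integration test makes on the fetched payload.
    assert "tension" not in payload
    return payload


# Illustrative upstream row that carries a tension value.
cleaned = normalise_row(
    {"brand": " Babolat ", "model": "Pure Drive", "tension": "50-59 lbs"}
)
```

An allow-list fails closed: a new upstream column stays out of the payload until someone adds it deliberately.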
Hybrid manual-edit semantics (Stefan Q4)
Per Stefan's Q4=hybrid: manual_edited_from_import source value lands in
Phase 1. The automatic flip from imported -> manual_edited_from_import
on a stringer edit is a Phase 2 refinement (per-field edit detection is
its own design; the enum value lands now so Phase 2 is purely additive).
Cadence (Stefan Q7)
Weekly. Stefan wires the schedule via the GitLab Pipeline Schedules UI:
Build -> Pipeline schedules -> New schedule, cron 0 3 * * 0 (Sundays
03:00 UTC), branch main. The Phase 1 catalogue:sync job is a stub
that re-validates the upstream CSV parses cleanly; Phase 2 swaps it for
the real per-brand importers.
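The dry-run validation the stub performs could look roughly like the sketch below. It is an assumption-laden illustration: the validate_csv helper is hypothetical, the brand and model header names are guesses at the upstream columns, and the network fetch is left out so the example stays self-contained.

```python
import csv
import io


def validate_csv(text: str, required: tuple[str, ...]) -> int:
    """Parse CSV text, check required columns and values, return the row count."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    count = 0
    # Line numbers are approximate: quoted multi-line fields would shift them.
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        for col in required:
            if not (row[col] or "").strip():
                raise ValueError(f"line {line_no}: empty {col!r}")
        count += 1
    return count


# Illustrative sample standing in for the fetched upstream CSV.
sample = "brand,model\nWilson,Blade 98\nHead,Speed MP\n"
n = validate_csv(sample, ("brand", "model"))
```

Nothing is written to the database; a scheduled job built this way fails only when the upstream file stops parsing or loses a required column.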
What Phase 1 does NOT ship
- Real per-brand sitemap+JSON-LD importers (Wilson, Babolat, Head, ...). Those are Phases 2-4 of #164. Phase 1's Kaggle seed is the one-shot day-zero population.
- Auto-promotion for unchanged-hash refreshes (Phase 3+).
- Fuzzy match + slug-based identity (Phase 2; gated by dedup-risk mitigation -- always-queued, never auto-promote on fuzzy).
- Automatic source flip from imported to manual_edited_from_import on stringer edit (Phase 2).
- Caching / proxying of image URLs (Phase 5, only if hot-linking breaks).
See also
- Issue #164 -- the accepted architectural proposal (Theo).
- Issue #146 -- the V3 catalogue auto-update epic.
- Issue #166 -- this Phase 1 implementation.
- ADR-0011 § "Bypass-scope declaration per admin endpoint" -- the three new catalogue.import.* actions.
- app/services/v1_upload.py -- the sync-engine + stage/dry-run/approve pattern this MR mirrors.
- app/api/routes_admin_catalogue_pages.py -- the M17 catalogue moderation queue (stringer-initiated submissions; conceptually adjacent but distinct).