Catalogue imports (Phase 1 of #164)

Status: Phase 1 shipped (racket-book#166, 2026-05-21).

  • Architecture proposal: racket-book#164 (Theo, Accepted 2026-05-21).
  • Parent epic: racket-book#146 (V3 catalogue auto-update).
  • Stefan's 7-question answers (2026-05-21): runner in racket-book CI; no brand priority; same-repo code; hybrid manual-edit-on-imported semantics; skip tension; capture image URL; weekly cadence.

What Phase 1 ships

A staging surface for external-feed catalogue rows + an admin moderation queue, mirroring the V1-upload stage->dry-run->approve UX pattern.

Components

| Surface | Module | Notes |
|---|---|---|
| Table | catalogue_imports (Alembic 0014_catalogue_imports) | One row per (source_name, source_external_id, content_hash). Dedup UNIQUE drives importer idempotency. Partial pending-queue index for the admin list. |
| Source enums | RacketSource / StringSource (shared catalogue_source Postgres ENUM) | Phase 1 adds manual_edited_from_import. The flip-on-stringer-edit automation lands in Phase 2; the enum value lands now so Phase 2 is a no-migration code change. |
| Audit enum | AdminAuditAction + AdminAuditTargetType (TEXT cols, app-layer enums) | Three actions: catalogue.import.promote / .reject / .match. One target type: catalogue_import. |
| Importer | app/catalogue_sources/kaggle_seed.py | Fetches the LukeBatten17/tennis-racquets-and-strings-dataset CC0 CSVs from GitHub (Path A). ~1300 rackets + ~780 strings; brand + model + (string-type) columns only -- no tension fields per Stefan Q5. |
| CLI | scripts/seed_catalogue.py | python -m scripts.seed_catalogue {rackets,strings,all} [--dry-run]. Runs the importer + writes via the sync engine pattern. |
| Service | app/services/catalogue_imports.py | stage_rows (sync, batch) + get_import / list_pending / count_pending / promote / reject / match / dry_run_promote (async; route surface). |
| Routes | app/api/routes_admin_catalogue_imports_pages.py | Six handlers under /admin/catalogue/imports/.... Admin-only via require_admin. Each decision emits one admin_audit_log row in the same transaction as the catalogue write (ADR-0011 composite case). |
| Templates | app/web/templates/admin/catalogue_imports_{queue,detail}.html + the promote-confirm modal partials | Dense data-testid attributes per the project convention (10+/page). Mirrors the M17 + V1-upload visual posture. |
| Dashboard | app/web/templates/dashboard.html (admin chip) | Surfaces the pending-imports count when > 0; amber accent + parcel icon. Distinct from the M17 catalogue-submissions chip. |
| CI job | .gitlab-ci.yml catalogue:sync | Schedule-only ($CI_PIPELINE_SOURCE == "schedule"). Phase 1 runs seed_catalogue all --dry-run -- fetches the upstream CSV and validates parse-ability, no DB write. |
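
The Routes row above says each decision writes its admin_audit_log row in the same transaction as the catalogue write. A minimal sketch of that shape follows, using the standard-library sqlite3 module as a stand-in for the project's Postgres session. The column names (status, target_id, admin_id) and the function name promote_with_audit are assumptions made for illustration, not the actual schema or service API.

```python
import sqlite3


def promote_with_audit(conn, import_id, admin_id):
    """Promote one pending import and log the decision atomically (illustrative)."""
    # "with conn" commits on success and rolls back if anything raises,
    # so the status change and the audit row land together or not at all.
    with conn:
        cur = conn.execute(
            "UPDATE catalogue_imports SET status = 'promoted' "
            "WHERE id = ? AND status = 'pending'",
            (import_id,),
        )
        if cur.rowcount != 1:
            raise ValueError(f"import {import_id} is not pending")
        conn.execute(
            "INSERT INTO admin_audit_log (action, target_type, target_id, admin_id) "
            "VALUES ('catalogue.import.promote', 'catalogue_import', ?, ?)",
            (import_id, admin_id),
        )


# Hypothetical, heavily simplified tables for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE catalogue_imports (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE admin_audit_log (
        id INTEGER PRIMARY KEY,
        action TEXT NOT NULL,
        target_type TEXT NOT NULL,
        target_id INTEGER NOT NULL,
        admin_id INTEGER NOT NULL
    );
    INSERT INTO catalogue_imports (id, status) VALUES (1, 'pending');
    """
)
promote_with_audit(conn, 1, admin_id=7)
```

A second call for the same import raises before any audit row is written, which is the property the single-transaction rule is meant to protect.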

Lifecycle

+--------+    importer    +---------+    admin       +-----------+
| Feed   | -------------> | pending | ------------>  | promoted  |
| (CSV   |  ON CONFLICT   |         |  promote()     | (new Rkt) |
|  / API)|  DO NOTHING    +---------+                +-----------+
+--------+                     |
                               | admin reject()
                               |     +-----------+
                               |---->| rejected  |
                               |     +-----------+
                               |
                               | admin match()
                               |     +-----------+
                               +---->| matched   | (links to existing
                                     +-----------+  Racket/String;
                                                    source unchanged)

Phase 2+: re-run of the same source_external_id with a DIFFERENT
content_hash -> new pending row; older row -> superseded.
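
The ON CONFLICT DO NOTHING edge in the diagram is what makes a re-run of the importer a no-op. The sketch below shows the idea with sqlite3 (SQLite 3.24+ accepts the same clause); the table is a simplified, assumed version of catalogue_imports, and stage_rows_sketch is a hypothetical stand-in for the real stage_rows. Marking the older row as superseded is Phase 2+ behaviour and is not modelled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE catalogue_imports (
        id INTEGER PRIMARY KEY,
        source_name TEXT NOT NULL,
        source_external_id TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        UNIQUE (source_name, source_external_id, content_hash)
    )
    """
)


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM catalogue_imports").fetchone()[0]


def stage_rows_sketch(conn, rows):
    """Insert rows, silently skipping dedup-key collisions; return rows added."""
    before = count_rows(conn)
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO catalogue_imports "
            "(source_name, source_external_id, content_hash) "
            "VALUES (?, ?, ?) ON CONFLICT DO NOTHING",
            rows,
        )
    return count_rows(conn) - before


# Illustrative rows only; the hashes are placeholders.
batch = [
    ("kaggle_seed", "wilson:blade-98", "hash-a"),
    ("kaggle_seed", "head:speed-mp", "hash-c"),
]
first = stage_rows_sketch(conn, batch)   # 2 new pending rows
second = stage_rows_sketch(conn, batch)  # 0: identical keys are skipped
changed = stage_rows_sketch(conn, [("kaggle_seed", "wilson:blade-98", "hash-b")])  # 1
```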

Identity / dedup

Phase 1 keys imports on (source_name, source_external_id, content_hash). The Kaggle importer synthesises source_external_id from brand:model (slugged). Phase 2 brand-importers will use real manufacturer SKUs.
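
As a rough illustration of how such a key could be built, the sketch below slugs brand and model into a brand:model identifier and hashes a canonical JSON form of the payload. The slug rules, the choice of SHA-256, and all function names are assumptions; the actual kaggle_seed implementation may differ.

```python
import hashlib
import json
import re


def slugify(text):
    """Lowercase, collapse non-alphanumeric runs to '-', trim stray dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def synth_external_id(brand, model):
    """Assumed format: '<brand-slug>:<model-slug>'."""
    return f"{slugify(brand)}:{slugify(model)}"


def content_hash(payload):
    """Hash a canonical JSON dump so key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


external_id = synth_external_id("Wilson", "Pro Staff 97 v14")
# external_id == "wilson:pro-staff-97-v14"
```

Because the hash covers the payload rather than the identifier, a changed upstream row keeps its source_external_id but gets a new content_hash, which is what the dedup key relies on.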

Slug + alias-table fuzzy match is deferred to Phase 2 (Theo's #164 risk note: dedup-mistake-on-fuzzy is the architecturally load-bearing risk; Phase 1 sidesteps it by being seed-only + admin-click-required for every promote).

Image handling

Per Stefan's Q6=yes: image_url lands on the import row. Templates hot-link from the manufacturer URL (no caching, no proxying). Phase 1's Kaggle dataset has no image URLs so the column is NULL on every Phase 1 row. Phase 2 brand-importers populate it from sitemap+JSON-LD.

Tension recommendations

Per Stefan's Q5=skip: NOT imported. The importer normaliser would refuse to populate a tension field even if the upstream had one. This is enforced in two places: kaggle_seed.fetch_* returns dicts with no tension key, and the integration test test_admin_catalogue_imports.test_fetch_rackets_normalises_correctly asserts "tension" not in payload.
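
One way to get that guarantee is an allow-list normaliser: only named columns are copied, so an upstream tension column cannot leak through. This is a sketch under assumptions; the column names and the function normalise_racket_row are hypothetical and not taken from kaggle_seed.

```python
# Hypothetical allow-list; anything not named here is dropped.
ALLOWED_KEYS = ("brand", "model")


def normalise_racket_row(raw):
    """Copy only allow-listed columns from an upstream CSV row."""
    return {key: raw[key].strip() for key in ALLOWED_KEYS}


upstream = {"brand": " Babolat ", "model": "Pure Drive", "tension": "50-59 lbs"}
payload = normalise_racket_row(upstream)
# payload == {"brand": "Babolat", "model": "Pure Drive"}
```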

Hybrid manual-edit semantics (Stefan Q4)

Per Stefan's Q4=hybrid: the manual_edited_from_import source value lands in Phase 1. The automatic flip from imported -> manual_edited_from_import on a stringer edit is a Phase 2 refinement (per-field edit detection needs its own design; the enum value lands now so Phase 2 is purely additive).

Cadence (Stefan Q7)

Weekly. Stefan wires the schedule via the GitLab Pipeline Schedules UI: Build -> Pipeline schedules -> New schedule, cron 0 3 * * 0 (Sundays 03:00 UTC), branch main. The Phase 1 catalogue:sync job is a stub that re-validates the upstream CSV parses cleanly; Phase 2 swaps it for the real per-brand importers.
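
For a sense of what a parse-only dry run can check, here is a minimal sketch that validates CSV text without touching a database. The required column names and the function validate_csv_text are assumptions for illustration; the real job runs seed_catalogue all --dry-run against the fetched upstream files.

```python
import csv
import io

# Assumed minimum set of columns; the upstream header names may differ.
REQUIRED_COLUMNS = {"brand", "model"}


def validate_csv_text(text):
    """Parse CSV text and return the data-row count, or raise ValueError."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    count = 0
    for row in reader:
        # DictReader fills absent trailing fields with None, hence the "or".
        if not all((row.get(col) or "").strip() for col in REQUIRED_COLUMNS):
            raise ValueError(f"line {reader.line_num}: empty required field")
        count += 1
    return count


# Illustrative sample input, not the upstream data.
sample = "brand,model\nHead,Speed MP\nYonex,Ezone 98\n"
row_count = validate_csv_text(sample)  # 2
```

A non-zero exit on ValueError would be enough for the scheduled job to flag an upstream format change before Phase 2 starts writing rows.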

What Phase 1 does NOT ship

  • Real per-brand sitemap+JSON-LD importers (Wilson, Babolat, Head, ...). Those are Phases 2-4 of #164. Phase 1's Kaggle seed is the one-shot day-zero population.
  • Auto-promotion for unchanged-hash refreshes (Phase 3+).
  • Fuzzy match + slug-based identity (Phase 2; gated by dedup-risk mitigation -- always-queued, never auto-promote on fuzzy).
  • Automatic source flip from imported to manual_edited_from_import on stringer edit (Phase 2).
  • Caching / proxying of image URLs (Phase 5, only if hot-linking breaks).

See also

  • Issue #164 -- the accepted architectural proposal (Theo).
  • Issue #146 -- the V3 catalogue auto-update epic.
  • Issue #166 -- this Phase 1 implementation.
  • ADR-0011 § "Bypass-scope declaration per admin endpoint" -- the three new catalogue.import.* actions.
  • app/services/v1_upload.py -- the sync-engine + stage/dry-run/approve pattern this MR mirrors.
  • app/api/routes_admin_catalogue_pages.py -- the M17 catalogue moderation queue (stringer-initiated submissions; conceptually adjacent but distinct).