tga-ingest: On-demand RTO enrichment
Version: 1.0
Updated: 2026-04-13 (session 52)
Worker: scripts/workers/tga-ingest
Companion docs: worker-patterns.md, briefs/backfill-01.md
What it is
tga-ingest is a queue consumer, not a sync worker. It has no cron and no autonomous schedule. It wakes up only when a message lands on rtopacks-ingest-queue, and the producer is the public site (apps/site) — specifically the /api/enrich and /api/search-enrich endpoints.
Flow:
User visits rtopacks.com.au and searches or views an RTO
→ apps/site /api/search-enrich (or /api/enrich for secret-gated)
→ fetches TGA JSON for that rto_code
→ queues { rto_code, tga_json } to rtopacks-ingest-queue
→ tga-ingest consumes the message
→ writes rtos + 7 child tables (contacts, addresses, trading_names,
web_addresses, legal_names, registrations, classifications)
→ ACK or retry (max 3 → rtopacks-ingest-dlq)
It is not a bulk sync. It will never run unless a message is produced.
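The consume, write, ACK-or-retry steps above can be sketched as a queue handler. This is a minimal sketch under stated assumptions, not the worker's actual source: IngestMessage mirrors the { rto_code, tga_json } payload from the flow, QueueMessage and QueueBatch are trimmed local stand-ins for the Cloudflare Queues types, and upsertRto is a hypothetical placeholder for the write to rtos and its child tables.

```typescript
// Payload shape taken from the flow above.
interface IngestMessage {
  rto_code: string;
  tga_json: unknown;
}

// Trimmed local stand-ins for the Cloudflare Queues message and batch types.
interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}

interface QueueBatch<T> {
  messages: QueueMessage<T>[];
}

// Hypothetical writer: stands in for the upsert into rtos + 7 child tables.
type UpsertRto = (msg: IngestMessage) => Promise<void>;

// Handle each message on its own so one bad payload does not fail the whole batch.
async function handleBatch(
  batch: QueueBatch<IngestMessage>,
  upsertRto: UpsertRto,
): Promise<void> {
  for (const message of batch.messages) {
    try {
      await upsertRto(message.body);
      message.ack(); // success: the message is done
    } catch {
      message.retry(); // failure: ask the queue to redeliver this message
    }
  }
}
```

The handler itself does not count attempts. In this sketch the retry limit (3) and the hand-off to rtopacks-ingest-dlq are assumed to be queue configuration rather than worker code.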
Coverage as of 2026-04-13
72 of 12,515 RTOs have been enriched — 0.58% coverage. Every RTO touched was the result of a public site visitor looking it up between 2026-03-01 and 2026-03-22.
This is not a bug. The on-demand design is correct for the original goal (lazy enrichment on first lookup, avoiding a 12k-row prewarm). It just means we need one bulk backfill to populate the rest of the corpus once, then the on-demand path keeps things fresh from organic traffic going forward.
See docs/docs/briefs/backfill-01.md for the one-shot backfill script.
Why it's not a sync worker
A common confusion: isn't this what tga-sync does?
No. tga-sync writes the core TGA corpus — tga_organisations, tga_training_components, tga_qualification_units, and supersession/status metadata. It operates against the TGA search/list APIs and walks the whole corpus weekly.
tga-ingest writes the enrichment tables — rtos and 7 rto_* child tables — that hold rich per-RTO detail (contacts, addresses, trading names, ABNs, web addresses, legal names, registration history, classifications). The TGA API for this detail is the /api/organisation/{rto_code}?include=all endpoint, which returns a much richer payload per RTO than the search API exposes.
They write to different tables from different TGA endpoints on different schedules. Don't conflate them.
The rto_registrations metric trap
For months, MAX(rto_registrations.captured_at) was read as "TGA sync freshness". It isn't. That column is populated by tga-ingest, not tga-sync, so its maximum only reflects when a public site visitor last triggered an enrichment, not when the weekly TGA corpus sync last ran.
Correct freshness metrics:
| Metric | Table | Written by |
|---|---|---|
| TGA corpus freshness | tga_organisations.synced_at | tga-sync (weekly) |
| TGA training components freshness | tga_training_components.synced_at | tga-sync for new, never for existing (fix: briefs/tga-refresh-01.md) |
| RTO enrichment freshness | rtos.enriched_at + rto_registrations.captured_at | tga-ingest (on-demand) |
Any dashboard or health check that uses rto_registrations.captured_at as a tga-sync signal is wrong. Use tga_organisations.synced_at instead.
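A health check built on the right columns might look like the following. This is a sketch only: the SQL strings use the columns named in the table above, but the function names, the ISO 8601 timestamp format, and the 8-day threshold in the usage note are assumptions made for illustration.

```typescript
// Which column answers which freshness question (from the table above).
const FRESHNESS_SQL = {
  tgaCorpus: "SELECT MAX(synced_at) AS ts FROM tga_organisations",
  rtoEnrichment: "SELECT MAX(enriched_at) AS ts FROM rtos",
} as const;

const DAY_MS = 24 * 60 * 60 * 1000;

// Assumption: timestamps are stored as ISO 8601 strings that Date.parse accepts.
function ageInDays(ts: string, now: Date): number {
  return (now.getTime() - Date.parse(ts)) / DAY_MS;
}

// Treat a missing or unparseable timestamp as stale so the check fails loudly.
function isStale(ts: string | null, now: Date, maxAgeDays: number): boolean {
  if (ts === null) return true;
  const age = ageInDays(ts, now);
  if (isNaN(age)) return true;
  return age > maxAgeDays;
}
```

For the weekly tga-sync signal, a threshold slightly above seven days (for example isStale(ts, new Date(), 8)) leaves room for a late run. The on-demand enrichment signal has no natural threshold, because it depends on organic traffic.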
Queue + DLQ
- Producer: apps/site, which calls env.RTOPACKS_INGEST_QUEUE.send() from its route handlers
- Consumer: tga-ingest (this worker), batch size default, retries 3
- DLQ: rtopacks-ingest-dlq, where dead-lettered messages land after 3 failed retries
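The producer side can be sketched as follows. This is an illustration, not the route handler's real code: IngestQueue is a trimmed stand-in for the Cloudflare Queues producer binding, and fetchTgaJson is a hypothetical helper standing in for the call to the TGA organisation endpoint.

```typescript
// Trimmed stand-in for the Cloudflare Queues producer binding.
interface IngestQueue {
  send(body: { rto_code: string; tga_json: unknown }): Promise<void>;
}

interface Env {
  RTOPACKS_INGEST_QUEUE: IngestQueue;
}

// Hypothetical fetcher: stands in for the request to the TGA organisation endpoint.
type FetchTgaJson = (rtoCode: string) => Promise<unknown>;

async function enqueueEnrichment(
  env: Env,
  rtoCode: string,
  fetchTgaJson: FetchTgaJson,
): Promise<void> {
  const tga_json = await fetchTgaJson(rtoCode);
  // The message carries the fetched payload, so in this sketch the consumer only writes.
  await env.RTOPACKS_INGEST_QUEUE.send({ rto_code: rtoCode, tga_json });
}
```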
DLQ inspection:
Common DLQ causes:
- TGA JSON shape edge cases the upsert logic doesn't handle (rare)
- D1 constraint violations on the upsert (shouldn't happen — upserts use INSERT OR REPLACE where needed)
- rto_code in the message body doesn't match any row in rtos (shouldn't happen — producer creates the rtos row if missing)
A non-zero DLQ count after a BACKFILL-01 run is worth investigating. A non-zero count during normal operation is usually fine — a handful of edge cases accumulate slowly.
Secret gating: /api/enrich vs /api/search-enrich
Two producer endpoints on apps/site:
| Endpoint | Auth | Rate limit | Use case |
|---|---|---|---|
| /api/search-enrich | Public | 10/min per IP | Normal site traffic (search results, RTO pages) |
| /api/enrich | X-Enrich-Key header (ENRICH_SECRET) | None | Bulk backfill, admin tools, trusted scripts |
The secret lives in apps/site's Wrangler secret store (wrangler secret list shows it exists; can't be read back via CLI). Rotate it with wrangler secret put ENRICH_SECRET and update any script that uses it.
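The header check on /api/enrich might be sketched like this. How apps/site actually compares the key is not documented here, so the function names and the length-then-XOR comparison are assumptions chosen for illustration.

```typescript
// Compare two strings without returning early on the first mismatched character.
function safeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// headerValue is the X-Enrich-Key request header (null when absent);
// secret is the ENRICH_SECRET value (undefined if the secret was never set).
function isEnrichAuthorized(
  headerValue: string | null,
  secret: string | undefined,
): boolean {
  if (!headerValue || !secret) return false; // fail closed if either side is missing
  return safeEqual(headerValue, secret);
}
```

Failing closed when the secret is undefined matters after a rotation: a deploy that loses the secret should reject every request rather than accept an empty key.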
BACKFILL-01 uses /api/enrich to bypass the rate limit — 12k RTOs at 10/min would take 21 hours; at 67ms throttle via the secret endpoint it's ~14 minutes.
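The two estimates follow directly from the corpus size. A quick check, using the full 12,515 count from the coverage section and ignoring per-request latency:

```typescript
const RTO_COUNT = 12515; // corpus size from the coverage section

// Public endpoint: 10 requests per minute per IP.
const publicHours = RTO_COUNT / 10 / 60; // about 20.9 hours

// Secret endpoint: one request every 67 ms.
const secretMinutes = (RTO_COUNT * 67) / 1000 / 60; // about 14.0 minutes
```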
Schema notes
Each rtos row carries an enriched flag (0/1) and an enriched_at timestamp. The upsert sets both. BACKFILL-01's driver query is SELECT rto_code FROM rtos WHERE enriched = 0 OR enriched IS NULL.
Child tables are keyed by rto_code with 1:N rows per RTO. Upserts delete-then-insert per RTO to handle the N-row pattern cleanly — the tga-ingest consumer wraps the delete + insert batch in a single D1 transaction per RTO.
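The delete-then-insert pattern for one child table can be sketched with a D1 batch, which D1 executes as a single transaction. The Statement and Database interfaces below are trimmed stand-ins for the D1 binding types, and the rto_trading_names table name and its name column are assumptions for illustration; the real schema may differ.

```typescript
// Trimmed stand-ins for the D1 binding types used here.
interface Statement {
  bind(...values: unknown[]): Statement;
}

interface Database {
  prepare(sql: string): Statement;
  batch(statements: Statement[]): Promise<unknown>;
}

// Replace every trading-name row for one RTO in a single batch.
async function replaceTradingNames(
  db: Database,
  rtoCode: string,
  names: string[],
): Promise<void> {
  const statements: Statement[] = [
    // Clear the RTO's existing rows first so stale names do not linger.
    db.prepare("DELETE FROM rto_trading_names WHERE rto_code = ?").bind(rtoCode),
    ...names.map((name) =>
      db
        .prepare("INSERT INTO rto_trading_names (rto_code, name) VALUES (?, ?)")
        .bind(rtoCode, name),
    ),
  ];
  // The batch runs as one transaction: either every statement applies or none do.
  await db.batch(statements);
}
```

Because the delete and the inserts share one transaction, a failed insert leaves the previous rows in place instead of an empty child table.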
Related docs
- docs/docs/infrastructure/worker-patterns.md: queue pattern overview (tga-ingest uses the same queue infrastructure but NOT chain dispatch; it's a one-shot consumer)
- docs/docs/briefs/backfill-01.md: one-shot script to populate the 99.4% gap
- docs/docs/workers/inventory.md: queue + worker inventory
- docs/docs/ops/standing-rules.md: enrichment coverage standing note