
tga-ingest — On-demand RTO enrichment

Version: 1.0
Updated: 2026-04-13 (session 52)
Worker: scripts/workers/tga-ingest
Companion docs: worker-patterns.md, briefs/backfill-01.md


What it is

tga-ingest is a queue consumer, not a sync worker. It has no cron and no autonomous schedule. It wakes up only when a message lands on rtopacks-ingest-queue, and the producer is the public site (apps/site) — specifically the /api/enrich and /api/search-enrich endpoints.

Flow:

User visits rtopacks.com.au and searches or views an RTO
  → apps/site /api/search-enrich (or /api/enrich for secret-gated)
  → fetches TGA JSON for that rto_code
  → queues { rto_code, tga_json } to rtopacks-ingest-queue
  → tga-ingest consumes the message
  → writes rtos + 7 child tables (contacts, addresses, trading_names,
     web_addresses, legal_names, registrations, classifications)
  → ACK or retry (max 3 → rtopacks-ingest-dlq)

It is not a bulk sync. It will never run unless a message is produced.
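As an illustration of the consume, write, then ACK-or-retry step in the flow above, here is a minimal sketch. It is not the worker's actual source: the interfaces are local stand-ins for the Cloudflare Queues message and batch types, and handleBatch and writeRto are hypothetical names.

```typescript
// Local stand-ins for the Cloudflare Queues types so the sketch is
// self-contained. In a real Worker these come from the runtime types.
interface IngestBody {
  rto_code: string;
  tga_json: unknown;
}

interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}

interface QueueBatch<T> {
  messages: QueueMessage<T>[];
}

// Hypothetical writer: upserts the rtos row and its child tables.
type RtoWriter = (body: IngestBody) => Promise<void>;

// Handle each message on its own so one bad payload does not force the
// whole batch to be redelivered.
async function handleBatch(
  batch: QueueBatch<IngestBody>,
  writeRto: RtoWriter,
): Promise<void> {
  for (const msg of batch.messages) {
    try {
      await writeRto(msg.body);
      msg.ack();
    } catch {
      // Redelivered until the configured retry limit is reached, after
      // which the queue moves the message to the dead-letter queue.
      msg.retry();
    }
  }
}
```

Acknowledging per message, rather than per batch, means a single malformed TGA payload only sends that one RTO towards the DLQ.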


Coverage as of 2026-04-13

72 of 12,515 RTOs have been enriched — 0.58% coverage. Every RTO touched was the result of a public site visitor looking it up between 2026-03-01 and 2026-03-22.

This is not a bug. The on-demand design is correct for the original goal (lazy enrichment on first lookup, avoiding a 12k-row prewarm). It just means we need one bulk backfill to populate the rest of the corpus once, then the on-demand path keeps things fresh from organic traffic going forward.

See docs/docs/briefs/backfill-01.md for the one-shot backfill script.


Why it's not a sync worker

A common confusion: isn't this what tga-sync does?

No. tga-sync writes the core TGA corpus: tga_organisations, tga_training_components, tga_qualification_units, and supersession/status metadata. It operates against the TGA search/list APIs and walks the whole corpus weekly.

tga-ingest writes the enrichment tables, rtos and 7 rto_* child tables, which hold rich per-RTO detail (contacts, addresses, trading names, ABNs, web addresses, legal names, registration history, classifications). The TGA API for this detail is the /api/organisation/{rto_code}?include=all endpoint, which returns a much richer payload per RTO than the search API exposes.

They write to different tables from different TGA endpoints on different schedules. Don't conflate them.


The rto_registrations metric trap

For months, MAX(rto_registrations.captured_at) was interpreted as "TGA sync freshness". It isn't. That column is populated by tga-ingest, not tga-sync, so its maximum reflects when a public site visitor last triggered an enrichment, not when the weekly TGA corpus sync last ran.

Correct freshness metrics:

| Metric | Column | Written by |
|---|---|---|
| TGA corpus freshness | tga_organisations.synced_at | tga-sync (weekly) |
| TGA training components freshness | tga_training_components.synced_at | tga-sync for new, never for existing (fix: briefs/tga-refresh-01.md) |
| RTO enrichment freshness | rtos.enriched_at + rto_registrations.captured_at | tga-ingest (on-demand) |

Any dashboard or health check that uses rto_registrations.captured_at as a tga-sync signal is wrong. Use tga_organisations.synced_at instead.
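To make the distinction concrete, a health check could pick its query by signal, along these lines. This is a sketch only: freshnessQuery and isStale are hypothetical helpers, and it assumes the timestamp columns hold ISO 8601 strings, which this doc does not confirm.

```typescript
type FreshnessSignal = "tga-sync" | "tga-ingest";

// One query per signal, using the columns from the table above.
const FRESHNESS_SQL: Record<FreshnessSignal, string> = {
  "tga-sync": "SELECT MAX(synced_at) AS last_seen FROM tga_organisations",
  "tga-ingest": "SELECT MAX(enriched_at) AS last_seen FROM rtos",
};

function freshnessQuery(signal: FreshnessSignal): string {
  return FRESHNESS_SQL[signal];
}

// True when the newest timestamp is older than the allowed age.
// Assumption: lastSeenIso is an ISO 8601 string.
function isStale(
  lastSeenIso: string,
  maxAgeDays: number,
  now: Date = new Date(),
): boolean {
  const ageMs = now.getTime() - new Date(lastSeenIso).getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}
```

A staleness threshold only makes sense for the tga-sync signal. The tga-ingest timestamp depends on organic traffic, so an old value there is not by itself a fault.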


Queue + DLQ

  • Producer: apps/site — calls env.RTOPACKS_INGEST_QUEUE.send() from its route handlers
  • Consumer: tga-ingest (this worker), batch size default, retries 3
  • DLQ: rtopacks-ingest-dlq — dead-lettered messages land here after 3 failed retries

DLQ inspection:

cd scripts/workers/tga-ingest
npx wrangler queues consumer fetch rtopacks-ingest-dlq

Common DLQ causes:

  • TGA JSON shape edge cases the upsert logic doesn't handle (rare)
  • D1 constraint violations on the upsert (shouldn't happen: upserts use INSERT OR REPLACE where needed)
  • rto_code in the message body doesn't match any row in rtos (shouldn't happen: the producer creates the rtos row if missing)

A non-zero DLQ count after a BACKFILL-01 run is worth investigating. A non-zero count during normal operation is usually fine — a handful of edge cases accumulate slowly.


Secret gating — /api/enrich vs /api/search-enrich

Two producer endpoints on apps/site:

| Endpoint | Auth | Rate limit | Use case |
|---|---|---|---|
| /api/search-enrich | Public | 10/min per IP | Normal site traffic (search results, RTO pages) |
| /api/enrich | X-Enrich-Key header (ENRICH_SECRET) | None | Bulk backfill, admin tools, trusted scripts |

The secret lives in apps/site's Wrangler secret store (wrangler secret list shows it exists; can't be read back via CLI). Rotate it with wrangler secret put ENRICH_SECRET and update any script that uses it.
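For illustration, the header check on /api/enrich might look roughly like the following. This is an assumption-laden sketch, not the code in apps/site: isEnrichAuthorized and the HeaderReader interface are hypothetical, and the constant-time comparison is a common precaution rather than something this doc confirms the site does.

```typescript
// Minimal shape of a header bag; the standard Headers class satisfies it.
interface HeaderReader {
  get(name: string): string | null;
}

// Returns true only when the X-Enrich-Key header matches the secret.
// The comparison touches every character so timing does not reveal how
// much of a guess was correct.
function isEnrichAuthorized(headers: HeaderReader, secret: string): boolean {
  const supplied = headers.get("X-Enrich-Key");
  if (secret.length === 0 || supplied === null) return false;
  if (supplied.length !== secret.length) return false;
  let diff = 0;
  for (let i = 0; i < secret.length; i++) {
    diff |= supplied.charCodeAt(i) ^ secret.charCodeAt(i);
  }
  return diff === 0;
}
```

Rejecting an empty secret guards against a deployment where ENRICH_SECRET was never set.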

BACKFILL-01 uses /api/enrich to bypass the rate limit — 12k RTOs at 10/min would take 21 hours; at 67ms throttle via the secret endpoint it's ~14 minutes.
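The two estimates follow from simple arithmetic, reproduced here as a small sketch. The figures are the ones quoted in this doc; estimateMinutes is a hypothetical helper.

```typescript
// Minutes needed to issue `count` sequential requests spaced
// `msPerRequest` milliseconds apart.
function estimateMinutes(count: number, msPerRequest: number): number {
  return (count * msPerRequest) / 60000;
}

const TOTAL_RTOS = 12515;

// Public endpoint: 10 requests per minute is one request every 6000 ms.
const viaPublicHours = estimateMinutes(TOTAL_RTOS, 60000 / 10) / 60;

// Secret-gated endpoint: one request every 67 ms.
const viaSecretMinutes = estimateMinutes(TOTAL_RTOS, 67);

console.log(viaPublicHours.toFixed(1) + " h"); // prints "20.9 h"
console.log(viaSecretMinutes.toFixed(1) + " min"); // prints "14.0 min"
```

Both numbers count only request spacing. Time spent on the TGA fetch and on queue consumption comes on top.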


Schema notes

Each rtos row carries an enriched flag (0/1) and an enriched_at timestamp. The upsert sets both. BACKFILL-01's driver query is SELECT rto_code FROM rtos WHERE enriched = 0 OR enriched IS NULL.

Child tables are keyed by rto_code with 1:N rows per RTO. Upserts delete-then-insert per RTO to handle the N-row pattern cleanly — the tga-ingest consumer wraps the delete + insert batch in a single D1 transaction per RTO.
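The delete-then-insert pattern for one child table can be sketched as a list of statements built per RTO. Everything specific here is assumed for illustration: the rto_contacts table name, its name and email columns, and the contactReplaceStatements helper are hypothetical.

```typescript
// A SQL statement plus its bound parameters.
interface Stmt {
  sql: string;
  params: unknown[];
}

// Hypothetical column set for a contacts child table.
interface ContactRow {
  name: string;
  email: string | null;
}

// Build the statements that replace all contact rows for one RTO:
// one DELETE, then one INSERT per incoming row.
function contactReplaceStatements(rtoCode: string, rows: ContactRow[]): Stmt[] {
  return [
    { sql: "DELETE FROM rto_contacts WHERE rto_code = ?", params: [rtoCode] },
    ...rows.map((r) => ({
      sql: "INSERT INTO rto_contacts (rto_code, name, email) VALUES (?, ?, ?)",
      params: [rtoCode, r.name, r.email],
    })),
  ];
}
```

In a Worker, a list like this would typically be turned into prepared statements and passed to D1's batch call, which runs the statements as one transaction. Deleting first means rows that TGA has since dropped do not linger, and an empty incoming list still clears the old rows.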


Related docs

  • docs/docs/infrastructure/worker-patterns.md: queue pattern overview (tga-ingest uses the same queue infrastructure but NOT chain dispatch; it's a one-shot consumer)
  • docs/docs/briefs/backfill-01.md: one-shot script to populate the 99.4% gap
  • docs/docs/workers/inventory.md: queue + worker inventory
  • docs/docs/ops/standing-rules.md: enrichment coverage standing note