Skip to content

Observatory — Reading the sync feed

Version: 2.0 Updated: 2026-04-15 (session 58 — added OBS-ALIGN-01 query rules, watchdog reference, chain_failed handling, writer contract cross-link) UI: admin → Observatory Data source: ops-db.tga_sync_steps + ops-db.regulatory_events + ops-db.sentinel_log Companion docs: ../architecture/database.md (writer contract for tga_sync_steps), ../infrastructure/worker-patterns.md (chain dispatch + failure modes)


What it is

The Observatory is the admin UI for looking at what every data sync has been doing. It's the non-email counterpart to NOTIFY-01 — the email is a once-per-cycle notification, the Observatory is a live queryable view of everything the sync workers have ever done.

Three panels:

  1. Runs feed — one row per cron cycle across all workers (tga-sync, cricos-sync, qb-reconcile, ops-tender-sync, etc). Shows start time, trigger, aggregate status, duration.
  2. Steps drill-down — click a run to see per-phase step rows (writeStep output). Status, records_in/records_out, error column if any.
  3. Regulatory events — the "what changed" log. Populated by sync workers when they detect meaningful changes (status flips, supersessions, new components, deleted orgs). Shared across the Observatory feed and NOTIFY-01 emails.

Reading a run row

Status Meaning Action
running Cycle is in progress. Active chain dispatching. Wait. If stuck >30 min, check the phase pointer in D1.
complete Cycle finished cleanly, all phases complete. No action.
partial Cycle finished but one or more phases errored out. Drill into steps, find the failed phase, investigate error column.
failed Cycle died. Phase pointer may be stuck. Re-trigger manually via /trigger HTTP endpoint.
stale No heartbeat in the expected window. CF outage? Worker not deployed? Check wrangler deployments list.

Runs older than 7 days are visible but dimmed. There's no TTL — all history is retained.


Reading a step row

Each step row is one phase of one run. Columns:

  • run_id — the run this phase belongs to
  • sync_type — worker name (tga_sync, cricos_sync, enrich_sync, teqsa_sync)
  • step — phase name (see step taxonomy below)
  • statuspending / running / complete / failed / timed_out / error / not_implemented
  • triggered_bycron or manual
  • records_in — rows the phase read (corpus size queried, batch size received)
  • records_out — rows the phase wrote (inserts + updates), or for cumulative phases, the running total
  • error — free-text error on failure. Overwritten on retry (NOT COALESCE — fix from session 52, see writer contract)
  • started_at, completed_at — ISO 8601 timestamps with T separator and Z suffix

Step taxonomy

Steps fall into four categories. Knowing which category a step is in tells you what to expect when reading it:

Workflow steps — the actual work. One row per phase per run.

Step Sync types Meaning
init all Run started, seeded pending rows for subsequent steps
sweep_statuses tga_sync Status sweep across recently-changed RTOs
sync_training tga_sync TGA training components delta walk
sync_orgs tga_sync TGA organisations delta walk
training_refresh tga_sync Rotating refresh of stale tga_training_components (TGA-REFRESH-01)
sync_providers, sync_courses, sync_decisions teqsa_sync TEQSA provider/course/decision ingestion
ingest_institutions, ingest_courses, ingest_locations, ingest_course_locations cricos_sync CRICOS resource ingestion
enrich_batch enrich_sync RTO enrichment batch (cursor-based walk)
write_log, flag_restricted, sync_qual_statuses, refresh_stats, write_snapshot tga_sync Utility phase steps

Baseline steps — informational, fire instantly at cycle start. Filtered out by Observatory queries (see OBS-ALIGN-01 below).

Step Meaning
volume_check_orgs, volume_check_training SENTINEL-02 cycle-start volume check baseline. records_out is the live total in the target table (e.g. 12,500 for tga_organisations). 0ms duration.
volume_check_institutions, volume_check_courses Same, for cricos-sync.

Failure surfacing steps — written when the chain breaks visibly. Always status='failed'.

Step Meaning
chain_failed SYNC-CHAIN-01: queue-send drop. Written by selfChain() when SYNC_QUEUE.send() throws. The error column says chain_break: queue send failed advancing to phase=.... Visible in both card and log queries. See worker-patterns.md failure mode 2.
fatal Worker caught an unhandled error mid-runBatchLoop. Rare.

Watchdog-swept rows — not a step name, but a status mutation. The d1-warmer cron sweeps any running row whose started_at is older than 5 minutes and rewrites it to status='timed_out' with error='watchdog_swept: stuck running > 5 minutes' (or whatever the worker wrote when it died, plus the watchdog suffix). See "Watchdog and stuck rows" below.

The writer contract

Every worker that writes to tga_sync_steps must use the ON CONFLICT(run_id, sync_type, step) DO UPDATE pattern. Plain INSERT collides with the unique constraint added 2026-04-13 and the writer's catch swallows the error silently — leaving the DB at running while the worker thinks it succeeded.

The full writer contract — schema, required upsert pattern, field-by-field overwrite vs COALESCE rules, the "adding a new sync worker" checklist, and the TEQSA-FIX-03 incident receipt — lives in ../architecture/database.md Convention 2. Read that doc before adding a new sync worker or modifying any worker's writeStep function.

OBS-ALIGN-01 query rules

The Observatory has two queries that read tga_sync_steps:

  1. Per-card "last status" picker — for each sync_type, find the latest workflow step row and use its status as the card status. Lives in apps/admin/app/api/admin/ingest/route.ts lastRuns query.
  2. Ingestion log "last status" picker — for each init row in the last 100 runs, find the latest workflow step in that run and use its status as the log row status. Lives in the same file's logRows query.

Both queries must use the same filter so the card and the log can't disagree about the same run. As of OBS-ALIGN-01 (session 57) the filter is:

WHERE step != 'init' AND step NOT LIKE 'volume_check_%'
ORDER BY id DESC LIMIT 1

Why each filter is necessary:

  • step != 'init'init is always complete immediately after the run starts. Without this filter, every run would show as complete from the moment it started.
  • step NOT LIKE 'volume_check_%' — SENTINEL-02 baseline rows fire instantly with status='complete' at cycle start. Without this filter, runs would show as complete the instant the volume check fired, even when the actual workflow step (sync_orgs, sync_training) was still running or had failed. trf02 exhibited this in session 57: card showed "Running…" while the log showed "complete" because the two queries had different filters.

chain_failed rows are NOT excluded from the filter — they should appear as failed in both surfaces. That's the whole point of SYNC-CHAIN-01 surfacing them.

Incident receipts

  • Session 52 — UNIQUE constraint added on (run_id, sync_type, step). Migration deduped 973 rows to 217.
  • Session 52 — "partial — 0 records" stale message fix (error = excluded.error instead of COALESCE).
  • Session 57 (OBS-ALIGN-01) — card and log query alignment. Before this fix, the two queries had different filters, so trf02 showed "Running…" on the card AND "complete" in the log simultaneously.
  • Session 57 (TEQSA-FIX-03) — teqsa-sync's teqWriteStep used plain INSERT, missed the upsert pattern, silently swallowed every status update for four days. See ../architecture/database.md Convention 2 for full incident receipt.

Watchdog and stuck rows (OBS-WATCHDOG-01)

workers/d1-warmer/src/index.js runs a cron every 10 minutes that sweeps any running row whose started_at is older than 5 minutes and marks it timed_out. This is the safety net for the failure modes that selfChain() and SYNC-CHAIN-01 don't catch — specifically the case where a worker dies mid-phase before reaching selfChain() (CF wall-clock kills the isolate inside e.g. syncTrainingComponents). See ../infrastructure/worker-patterns.md failure mode 3 for the full picture.

Sweep semantics

The watchdog query (post-WATCHDOG-FIX-01):

UPDATE tga_sync_steps
SET status = 'timed_out',
    completed_at = datetime('now'),
    error = COALESCE(error, 'watchdog_swept: stuck running > 5 minutes')
WHERE status = 'running' AND started_at < ?  -- bound parameter: ISO 8601 cutoff built in JS

The cutoff is built in JS as an explicit ISO literal:

const cutoffIso = new Date(Date.now() - 5 * 60 * 1000).toISOString().slice(0, 19) + "Z";

Never rewrite this to use datetime('now', '-5 minutes'). The original code did, and the watchdog silently swept zero rows for 8 days because of the SQLite lex gotcha (see ../architecture/database.md Convention 1). WATCHDOG-FIX-01 (session 58) corrected it.

What the watchdog does NOT do

  • It does not retry the failed run. It just marks the row visible.
  • It does not page anyone. It just makes the failure show up in the Observatory on the next refresh.
  • It does not unblock the chain. If the chain was waiting on a queue message that dropped, the chain stays dead until someone manually re-triggers.
  • It does not catch runs that fail in <5 minutes (those should already be visible because the worker's own catch wrote a failed row).

When the watchdog fires, investigate

A timed_out row from the watchdog means the worker died mid-phase without writing a terminal status. The investigation order:

  1. Check tga_sync_cursor.cycle_phase — what phase did the worker think it was in when it died? (Don't bulk-delete during this — see database.md Convention 3.)
  2. Check the corresponding queue's CF metrics — was the next chain message ever delivered? If yes, it should have re-attempted.
  3. Check the worker logs (wrangler tail) — if the death was a code error, the stack trace is in the logs even if no DB row was written.
  4. Check whether SYNC-CHAIN-01 fired — is there a chain_failed row for the same run_id? If yes, the chain caught the failure cleanly. If no, the worker died mid-phase (failure mode 3 in worker-patterns.md).

Watchdog vs SYNC-CHAIN-01 vs the writer contract — which catches what

Failure Caught by
Plain INSERT collides with unique constraint The writer contract (database.md Convention 2). If a writer doesn't use UPSERT, nothing else catches the silent collision — TEQSA-FIX-03 case.
SYNC_QUEUE.send() throws after a successful phase SYNC-CHAIN-01 writes a chain_failed row immediately.
Worker isolate killed mid-phase Watchdog sweeps the running row 5-15 minutes later. The run looks dead until then.
Worker logs an exception that the catch handles The worker's own markStep('error', ...) writes a failed row. No watchdog needed.
Run completes successfully Worker writes complete rows. No watchdog needed.

Regulatory events feed

The "What changed" surface. Rows are written by sync workers whenever they detect:

  • An RTO status transition (Current → Non-current, etc.)
  • A qualification being superseded (old → new code)
  • A training component new release
  • A cricos institution or course disappearing
  • A cricos course newly appearing

Schema:

  • id, run_id, event_type, entity_type, entity_code, old_value, new_value, detail, created_at

The NOTIFY-01 email reads this table scoped to the run_id for its "What changed" section. The Observatory renders a cross-run feed sorted by date.


Finding a specific run

Observatory search (top of page) supports: - Run ID prefix (e.g. tga-sync-2026-04-13) - Worker name - Status filter (running / complete / failed / partial) - Date range

For SQL-level queries (faster for ad-hoc investigation):

-- Latest run per worker
SELECT sync_type, MAX(created_at) as latest
FROM sync_runs
GROUP BY sync_type;

-- Failed phases in the last 7 days
SELECT run_id, sync_type, step, error, created_at
FROM sync_steps
WHERE status = 'failed' AND created_at > datetime('now', '-7 days')
ORDER BY created_at DESC;

Run from apps/admin/ directory so wrangler picks up the correct account:

cd apps/admin
npx wrangler d1 execute ops-db --remote --command "SELECT ..."

(See standing rule: "D1 account context — always run from apps/admin/".)


Common investigations

"Why hasn't tga-sync run this week?"

  1. Observatory runs feed filtered to sync_type=tga_sync — is there a Sunday run?
  2. If no run row: worker cron isn't firing. Check wrangler deployments list --name tga-sync, check queue health, manually trigger /trigger.
  3. If run row exists but status is failed or partial: drill to steps, find the failed phase.
  4. Cross-check with email inbox — if NOTIFY-01 email is also missing, the cycle didn't reach write_snapshot.

"Something changed in the corpus but I don't remember what"

  1. Regulatory events feed filtered to a date range
  2. Each event has a detail column with before/after or the supersession chain
  3. Cross-reference with the NOTIFY-01 email for the relevant cycle — the email's "What changed" section is the same data

"Observatory shows running forever"

A phase is stuck or the chain died without writing a terminal row. Check:

  1. ops-db for the phase pointer row — what phase does it think it's in?
  2. Cloudflare queue metrics for tga-sync-queue or cricos-sync-queue — any pending messages?
  3. If pointer is stuck but queue is empty: dispatch was silently dropped. Re-trigger /trigger to restart the cycle from the stuck phase (phase pointer is honoured as a resume point).