Observatory — Reading the sync feed¶
Version: 2.0
Updated: 2026-04-15 (session 58 — added OBS-ALIGN-01 query rules, watchdog reference, chain_failed handling, writer contract cross-link)
UI: admin → Observatory
Data source: ops-db.tga_sync_steps + ops-db.regulatory_events + ops-db.sentinel_log
Companion docs: ../architecture/database.md (writer contract for tga_sync_steps), ../infrastructure/worker-patterns.md (chain dispatch + failure modes)
What it is¶
The Observatory is the admin UI for looking at what every data sync has been doing. It's the non-email counterpart to NOTIFY-01 — the email is a once-per-cycle notification, the Observatory is a live queryable view of everything the sync workers have ever done.
Three panels:
- Runs feed — one row per cron cycle across all workers (
tga-sync,cricos-sync,qb-reconcile,ops-tender-sync, etc). Shows start time, trigger, aggregate status, duration. - Steps drill-down — click a run to see per-phase step rows (
writeStepoutput). Status, records_in/records_out, error column if any. - Regulatory events — the "what changed" log. Populated by sync workers when they detect meaningful changes (status flips, supersessions, new components, deleted orgs). Shared across the Observatory feed and NOTIFY-01 emails.
Reading a run row¶
| Status | Meaning | Action |
|---|---|---|
running |
Cycle is in progress. Active chain dispatching. | Wait. If stuck >30 min, check the phase pointer in D1. |
complete |
Cycle finished cleanly, all phases complete. | No action. |
partial |
Cycle finished but one or more phases errored out. | Drill into steps, find the failed phase, investigate error column. |
failed |
Cycle died. Phase pointer may be stuck. | Re-trigger manually via /trigger HTTP endpoint. |
stale |
No heartbeat in the expected window. | CF outage? Worker not deployed? Check wrangler deployments list. |
Runs older than 7 days are visible but dimmed. There's no TTL — all history is retained.
Reading a step row¶
Each step row is one phase of one run. Columns:
run_id— the run this phase belongs tosync_type— worker name (tga_sync,cricos_sync,enrich_sync,teqsa_sync)step— phase name (see step taxonomy below)status—pending/running/complete/failed/timed_out/error/not_implementedtriggered_by—cronormanualrecords_in— rows the phase read (corpus size queried, batch size received)records_out— rows the phase wrote (inserts + updates), or for cumulative phases, the running totalerror— free-text error on failure. Overwritten on retry (NOT COALESCE — fix from session 52, see writer contract)started_at,completed_at— ISO 8601 timestamps withTseparator andZsuffix
Step taxonomy¶
Steps fall into four categories. Knowing which category a step is in tells you what to expect when reading it:
Workflow steps — the actual work. One row per phase per run.
| Step | Sync types | Meaning |
|---|---|---|
init |
all | Run started, seeded pending rows for subsequent steps |
sweep_statuses |
tga_sync |
Status sweep across recently-changed RTOs |
sync_training |
tga_sync |
TGA training components delta walk |
sync_orgs |
tga_sync |
TGA organisations delta walk |
training_refresh |
tga_sync |
Rotating refresh of stale tga_training_components (TGA-REFRESH-01) |
sync_providers, sync_courses, sync_decisions |
teqsa_sync |
TEQSA provider/course/decision ingestion |
ingest_institutions, ingest_courses, ingest_locations, ingest_course_locations |
cricos_sync |
CRICOS resource ingestion |
enrich_batch |
enrich_sync |
RTO enrichment batch (cursor-based walk) |
write_log, flag_restricted, sync_qual_statuses, refresh_stats, write_snapshot |
tga_sync |
Utility phase steps |
Baseline steps — informational, fire instantly at cycle start. Filtered out by Observatory queries (see OBS-ALIGN-01 below).
| Step | Meaning |
|---|---|
volume_check_orgs, volume_check_training |
SENTINEL-02 cycle-start volume check baseline. records_out is the live total in the target table (e.g. 12,500 for tga_organisations). 0ms duration. |
volume_check_institutions, volume_check_courses |
Same, for cricos-sync. |
Failure surfacing steps — written when the chain breaks visibly. Always status='failed'.
| Step | Meaning |
|---|---|
chain_failed |
SYNC-CHAIN-01: queue-send drop. Written by selfChain() when SYNC_QUEUE.send() throws. The error column says chain_break: queue send failed advancing to phase=.... Visible in both card and log queries. See worker-patterns.md failure mode 2. |
fatal |
Worker caught an unhandled error mid-runBatchLoop. Rare. |
Watchdog-swept rows — not a step name, but a status mutation. The d1-warmer cron sweeps any running row whose started_at is older than 5 minutes and rewrites it to status='timed_out' with error='watchdog_swept: stuck running > 5 minutes' (or whatever the worker wrote when it died, plus the watchdog suffix). See "Watchdog and stuck rows" below.
The writer contract¶
Every worker that writes to tga_sync_steps must use the ON CONFLICT(run_id, sync_type, step) DO UPDATE pattern. Plain INSERT collides with the unique constraint added 2026-04-13 and the writer's catch swallows the error silently — leaving the DB at running while the worker thinks it succeeded.
The full writer contract — schema, required upsert pattern, field-by-field overwrite vs COALESCE rules, the "adding a new sync worker" checklist, and the TEQSA-FIX-03 incident receipt — lives in ../architecture/database.md Convention 2. Read that doc before adding a new sync worker or modifying any worker's writeStep function.
OBS-ALIGN-01 query rules¶
The Observatory has two queries that read tga_sync_steps:
- Per-card "last status" picker — for each
sync_type, find the latest workflow step row and use its status as the card status. Lives inapps/admin/app/api/admin/ingest/route.tslastRunsquery. - Ingestion log "last status" picker — for each
initrow in the last 100 runs, find the latest workflow step in that run and use its status as the log row status. Lives in the same file'slogRowsquery.
Both queries must use the same filter so the card and the log can't disagree about the same run. As of OBS-ALIGN-01 (session 57) the filter is:
Why each filter is necessary:
step != 'init'—initis alwayscompleteimmediately after the run starts. Without this filter, every run would show ascompletefrom the moment it started.step NOT LIKE 'volume_check_%'— SENTINEL-02 baseline rows fire instantly withstatus='complete'at cycle start. Without this filter, runs would show ascompletethe instant the volume check fired, even when the actual workflow step (sync_orgs, sync_training) was stillrunningor had failed.trf02exhibited this in session 57: card showed "Running…" while the log showed "complete" because the two queries had different filters.
chain_failed rows are NOT excluded from the filter — they should appear as failed in both surfaces. That's the whole point of SYNC-CHAIN-01 surfacing them.
Incident receipts¶
- Session 52 — UNIQUE constraint added on
(run_id, sync_type, step). Migration deduped 973 rows to 217. - Session 52 — "partial — 0 records" stale message fix (
error = excluded.errorinstead ofCOALESCE). - Session 57 (OBS-ALIGN-01) — card and log query alignment. Before this fix, the two queries had different filters, so
trf02showed "Running…" on the card AND "complete" in the log simultaneously. - Session 57 (TEQSA-FIX-03) — teqsa-sync's
teqWriteStepused plainINSERT, missed the upsert pattern, silently swallowed every status update for four days. See../architecture/database.mdConvention 2 for full incident receipt.
Watchdog and stuck rows (OBS-WATCHDOG-01)¶
workers/d1-warmer/src/index.js runs a cron every 10 minutes that sweeps any running row whose started_at is older than 5 minutes and marks it timed_out. This is the safety net for the failure modes that selfChain() and SYNC-CHAIN-01 don't catch — specifically the case where a worker dies mid-phase before reaching selfChain() (CF wall-clock kills the isolate inside e.g. syncTrainingComponents). See ../infrastructure/worker-patterns.md failure mode 3 for the full picture.
Sweep semantics¶
The watchdog query (post-WATCHDOG-FIX-01):
UPDATE tga_sync_steps
SET status = 'timed_out',
completed_at = datetime('now'),
error = COALESCE(error, 'watchdog_swept: stuck running > 5 minutes')
WHERE status = 'running' AND started_at < ? -- bound parameter: ISO 8601 cutoff built in JS
The cutoff is built in JS as an explicit ISO literal:
Never rewrite this to use datetime('now', '-5 minutes'). The original code did, and the watchdog silently swept zero rows for 8 days because of the SQLite lex gotcha (see ../architecture/database.md Convention 1). WATCHDOG-FIX-01 (session 58) corrected it.
What the watchdog does NOT do¶
- It does not retry the failed run. It just marks the row visible.
- It does not page anyone. It just makes the failure show up in the Observatory on the next refresh.
- It does not unblock the chain. If the chain was waiting on a queue message that dropped, the chain stays dead until someone manually re-triggers.
- It does not catch runs that fail in <5 minutes (those should already be visible because the worker's own catch wrote a
failedrow).
When the watchdog fires, investigate¶
A timed_out row from the watchdog means the worker died mid-phase without writing a terminal status. The investigation order:
- Check
tga_sync_cursor.cycle_phase— what phase did the worker think it was in when it died? (Don't bulk-delete during this — seedatabase.mdConvention 3.) - Check the corresponding queue's CF metrics — was the next chain message ever delivered? If yes, it should have re-attempted.
- Check the worker logs (
wrangler tail) — if the death was a code error, the stack trace is in the logs even if no DB row was written. - Check whether SYNC-CHAIN-01 fired — is there a
chain_failedrow for the same run_id? If yes, the chain caught the failure cleanly. If no, the worker died mid-phase (failure mode 3 in worker-patterns.md).
Watchdog vs SYNC-CHAIN-01 vs the writer contract — which catches what¶
| Failure | Caught by |
|---|---|
Plain INSERT collides with unique constraint |
The writer contract (database.md Convention 2). If a writer doesn't use UPSERT, nothing else catches the silent collision — TEQSA-FIX-03 case. |
SYNC_QUEUE.send() throws after a successful phase |
SYNC-CHAIN-01 writes a chain_failed row immediately. |
| Worker isolate killed mid-phase | Watchdog sweeps the running row 5-15 minutes later. The run looks dead until then. |
| Worker logs an exception that the catch handles | The worker's own markStep('error', ...) writes a failed row. No watchdog needed. |
| Run completes successfully | Worker writes complete rows. No watchdog needed. |
Regulatory events feed¶
The "What changed" surface. Rows are written by sync workers whenever they detect:
- An RTO status transition (Current → Non-current, etc.)
- A qualification being superseded (old → new code)
- A training component new release
- A cricos institution or course disappearing
- A cricos course newly appearing
Schema:
id,run_id,event_type,entity_type,entity_code,old_value,new_value,detail,created_at
The NOTIFY-01 email reads this table scoped to the run_id for its "What changed" section. The Observatory renders a cross-run feed sorted by date.
Finding a specific run¶
Observatory search (top of page) supports:
- Run ID prefix (e.g. tga-sync-2026-04-13)
- Worker name
- Status filter (running / complete / failed / partial)
- Date range
For SQL-level queries (faster for ad-hoc investigation):
-- Latest run per worker
SELECT sync_type, MAX(created_at) as latest
FROM sync_runs
GROUP BY sync_type;
-- Failed phases in the last 7 days
SELECT run_id, sync_type, step, error, created_at
FROM sync_steps
WHERE status = 'failed' AND created_at > datetime('now', '-7 days')
ORDER BY created_at DESC;
Run from apps/admin/ directory so wrangler picks up the correct account:
(See standing rule: "D1 account context — always run from apps/admin/".)
Common investigations¶
"Why hasn't tga-sync run this week?"¶
- Observatory runs feed filtered to
sync_type=tga_sync— is there a Sunday run? - If no run row: worker cron isn't firing. Check
wrangler deployments list --name tga-sync, check queue health, manually trigger/trigger. - If run row exists but status is
failedorpartial: drill to steps, find the failed phase. - Cross-check with email inbox — if NOTIFY-01 email is also missing, the cycle didn't reach
write_snapshot.
"Something changed in the corpus but I don't remember what"¶
- Regulatory events feed filtered to a date range
- Each event has a
detailcolumn with before/after or the supersession chain - Cross-reference with the NOTIFY-01 email for the relevant cycle — the email's "What changed" section is the same data
"Observatory shows running forever"¶
A phase is stuck or the chain died without writing a terminal row. Check:
ops-dbfor the phase pointer row — what phase does it think it's in?- Cloudflare queue metrics for
tga-sync-queueorcricos-sync-queue— any pending messages? - If pointer is stuck but queue is empty: dispatch was silently dropped. Re-trigger
/triggerto restart the cycle from the stuck phase (phase pointer is honoured as a resume point).
Related docs¶
../architecture/database.md— the writer contract fortga_sync_steps, the lex gotcha that broke the watchdog, the bulk-delete safety rule../infrastructure/worker-patterns.md— the chain dispatch state machine the Observatory visualises, all 8 failure modes with defences../infrastructure/notifications.md— NOTIFY-01 email counterpart (reads the sametga_sync_stepsrows for the email body)../ops/standing-rules.md— D1 account context rule, NOTIFY-01 expectations../workers/inventory.md— canonical worker + queue + DB inventory