Observatory — Reading the sync feed¶

Version: 2.0 Updated: 2026-04-15 (session 58 — added OBS-ALIGN-01 query rules, watchdog reference, chain_failed handling, writer contract cross-link) UI: admin → Observatory Data source: ops-db.tga_sync_steps + ops-db.regulatory_events + ops-db.sentinel_log Companion docs: ../architecture/database.md (writer contract for tga_sync_steps), ../infrastructure/worker-patterns.md (chain dispatch + failure modes)

What it is¶

The Observatory is the admin UI for looking at what every data sync has been doing. It's the non-email counterpart to NOTIFY-01 — the email is a once-per-cycle notification, the Observatory is a live queryable view of everything the sync workers have ever done.

Three panels:

Runs feed — one row per cron cycle across all workers (tga-sync, cricos-sync, qb-reconcile, ops-tender-sync, etc). Shows start time, trigger, aggregate status, duration.
Steps drill-down — click a run to see per-phase step rows (writeStep output). Status, records_in/records_out, error column if any.
Regulatory events — the "what changed" log. Populated by sync workers when they detect meaningful changes (status flips, supersessions, new components, deleted orgs). Shared across the Observatory feed and NOTIFY-01 emails.

Reading a run row¶

Status	Meaning	Action
`running`	Cycle is in progress. Active chain dispatching.	Wait. If stuck >30 min, check the phase pointer in D1.
`complete`	Cycle finished cleanly, all phases complete.	No action.
`partial`	Cycle finished but one or more phases errored out.	Drill into steps, find the failed phase, investigate error column.
`failed`	Cycle died. Phase pointer may be stuck.	Re-trigger manually via `/trigger` HTTP endpoint.
`stale`	No heartbeat in the expected window.	CF outage? Worker not deployed? Check `wrangler deployments list`.

Runs older than 7 days are visible but dimmed. There's no TTL — all history is retained.

Reading a step row¶

Each step row is one phase of one run. Columns:

run_id — the run this phase belongs to
sync_type — worker name (tga_sync, cricos_sync, enrich_sync, teqsa_sync)
step — phase name (see step taxonomy below)
status — pending / running / complete / failed / timed_out / error / not_implemented
triggered_by — cron or manual
records_in — rows the phase read (corpus size queried, batch size received)
records_out — rows the phase wrote (inserts + updates), or for cumulative phases, the running total
error — free-text error on failure. Overwritten on retry (NOT COALESCE — fix from session 52, see writer contract)
started_at, completed_at — ISO 8601 timestamps with T separator and Z suffix

Step taxonomy¶

Steps fall into four categories. Knowing which category a step is in tells you what to expect when reading it:

Workflow steps — the actual work. One row per phase per run.

Step	Sync types	Meaning
`init`	all	Run started, seeded pending rows for subsequent steps
`sweep_statuses`	`tga_sync`	Status sweep across recently-changed RTOs
`sync_training`	`tga_sync`	TGA training components delta walk
`sync_orgs`	`tga_sync`	TGA organisations delta walk
`training_refresh`	`tga_sync`	Rotating refresh of stale `tga_training_components` (TGA-REFRESH-01)
`sync_providers`, `sync_courses`, `sync_decisions`	`teqsa_sync`	TEQSA provider/course/decision ingestion
`ingest_institutions`, `ingest_courses`, `ingest_locations`, `ingest_course_locations`	`cricos_sync`	CRICOS resource ingestion
`enrich_batch`	`enrich_sync`	RTO enrichment batch (cursor-based walk)
`write_log`, `flag_restricted`, `sync_qual_statuses`, `refresh_stats`, `write_snapshot`	`tga_sync`	Utility phase steps

Baseline steps — informational, fire instantly at cycle start. Filtered out by Observatory queries (see OBS-ALIGN-01 below).

Step	Meaning
`volume_check_orgs`, `volume_check_training`	SENTINEL-02 cycle-start volume check baseline. `records_out` is the live total in the target table (e.g. 12,500 for `tga_organisations`). 0ms duration.
`volume_check_institutions`, `volume_check_courses`	Same, for cricos-sync.

Failure surfacing steps — written when the chain breaks visibly. Always status='failed'.

Step	Meaning
`chain_failed`	SYNC-CHAIN-01: queue-send drop. Written by `selfChain()` when `SYNC_QUEUE.send()` throws. The error column says `chain_break: queue send failed advancing to phase=...`. Visible in both card and log queries. See `worker-patterns.md` failure mode 2.
`fatal`	Worker caught an unhandled error mid-runBatchLoop. Rare.

Watchdog-swept rows — not a step name, but a status mutation. The d1-warmer cron sweeps any running row whose started_at is older than 5 minutes and rewrites it to status='timed_out' with error='watchdog_swept: stuck running > 5 minutes' (or whatever the worker wrote when it died, plus the watchdog suffix). See "Watchdog and stuck rows" below.

The writer contract¶

Every worker that writes to tga_sync_steps must use the ON CONFLICT(run_id, sync_type, step) DO UPDATE pattern. Plain INSERT collides with the unique constraint added 2026-04-13 and the writer's catch swallows the error silently — leaving the DB at running while the worker thinks it succeeded.

The full writer contract — schema, required upsert pattern, field-by-field overwrite vs COALESCE rules, the "adding a new sync worker" checklist, and the TEQSA-FIX-03 incident receipt — lives in ../architecture/database.md Convention 2. Read that doc before adding a new sync worker or modifying any worker's writeStep function.

OBS-ALIGN-01 query rules¶

The Observatory has two queries that read tga_sync_steps:

Per-card "last status" picker — for each sync_type, find the latest workflow step row and use its status as the card status. Lives in apps/admin/app/api/admin/ingest/route.ts lastRuns query.
Ingestion log "last status" picker — for each init row in the last 100 runs, find the latest workflow step in that run and use its status as the log row status. Lives in the same file's logRows query.

Both queries must use the same filter so the card and the log can't disagree about the same run. As of OBS-ALIGN-01 (session 57) the filter is:

WHERE step != 'init' AND step NOT LIKE 'volume_check_%'
ORDER BY id DESC LIMIT 1

Why each filter is necessary:

step != 'init' — init is always complete immediately after the run starts. Without this filter, every run would show as complete from the moment it started.
step NOT LIKE 'volume_check_%' — SENTINEL-02 baseline rows fire instantly with status='complete' at cycle start. Without this filter, runs would show as complete the instant the volume check fired, even when the actual workflow step (sync_orgs, sync_training) was still running or had failed. trf02 exhibited this in session 57: card showed "Running…" while the log showed "complete" because the two queries had different filters.

chain_failed rows are NOT excluded from the filter — they should appear as failed in both surfaces. That's the whole point of SYNC-CHAIN-01 surfacing them.

Incident receipts¶

Session 52 — UNIQUE constraint added on (run_id, sync_type, step). Migration deduped 973 rows to 217.
Session 52 — "partial — 0 records" stale message fix (error = excluded.error instead of COALESCE).
Session 57 (OBS-ALIGN-01) — card and log query alignment. Before this fix, the two queries had different filters, so trf02 showed "Running…" on the card AND "complete" in the log simultaneously.
Session 57 (TEQSA-FIX-03) — teqsa-sync's teqWriteStep used plain INSERT, missed the upsert pattern, silently swallowed every status update for four days. See ../architecture/database.md Convention 2 for full incident receipt.

Watchdog and stuck rows (OBS-WATCHDOG-01)¶

workers/d1-warmer/src/index.js runs a cron every 10 minutes that sweeps any running row whose started_at is older than 5 minutes and marks it timed_out. This is the safety net for the failure modes that selfChain() and SYNC-CHAIN-01 don't catch — specifically the case where a worker dies mid-phase before reaching selfChain() (CF wall-clock kills the isolate inside e.g. syncTrainingComponents). See ../infrastructure/worker-patterns.md failure mode 3 for the full picture.

Sweep semantics¶

The watchdog query (post-WATCHDOG-FIX-01):

UPDATE tga_sync_steps
SET status = 'timed_out',
    completed_at = datetime('now'),
    error = COALESCE(error, 'watchdog_swept: stuck running > 5 minutes')
WHERE status = 'running' AND started_at < ?  -- bound parameter: ISO 8601 cutoff built in JS

The cutoff is built in JS as an explicit ISO literal:

const cutoffIso = new Date(Date.now() - 5 * 60 * 1000).toISOString().slice(0, 19) + "Z";

Never rewrite this to use datetime('now', '-5 minutes'). The original code did, and the watchdog silently swept zero rows for 8 days because of the SQLite lex gotcha (see ../architecture/database.md Convention 1). WATCHDOG-FIX-01 (session 58) corrected it.

What the watchdog does NOT do¶

It does not retry the failed run. It just marks the row visible.
It does not page anyone. It just makes the failure show up in the Observatory on the next refresh.
It does not unblock the chain. If the chain was waiting on a queue message that dropped, the chain stays dead until someone manually re-triggers.
It does not catch runs that fail in <5 minutes (those should already be visible because the worker's own catch wrote a failed row).

When the watchdog fires, investigate¶

A timed_out row from the watchdog means the worker died mid-phase without writing a terminal status. The investigation order:

Check tga_sync_cursor.cycle_phase — what phase did the worker think it was in when it died? (Don't bulk-delete during this — see database.md Convention 3.)
Check the corresponding queue's CF metrics — was the next chain message ever delivered? If yes, it should have re-attempted.
Check the worker logs (wrangler tail) — if the death was a code error, the stack trace is in the logs even if no DB row was written.
Check whether SYNC-CHAIN-01 fired — is there a chain_failed row for the same run_id? If yes, the chain caught the failure cleanly. If no, the worker died mid-phase (failure mode 3 in worker-patterns.md).

Watchdog vs SYNC-CHAIN-01 vs the writer contract — which catches what¶

Failure	Caught by
Plain `INSERT` collides with unique constraint	The writer contract (database.md Convention 2). If a writer doesn't use UPSERT, nothing else catches the silent collision — TEQSA-FIX-03 case.
`SYNC_QUEUE.send()` throws after a successful phase	SYNC-CHAIN-01 writes a `chain_failed` row immediately.
Worker isolate killed mid-phase	Watchdog sweeps the `running` row 5-15 minutes later. The run looks dead until then.
Worker logs an exception that the catch handles	The worker's own `markStep('error', ...)` writes a `failed` row. No watchdog needed.
Run completes successfully	Worker writes `complete` rows. No watchdog needed.

Regulatory events feed¶

The "What changed" surface. Rows are written by sync workers whenever they detect:

An RTO status transition (Current → Non-current, etc.)
A qualification being superseded (old → new code)
A training component new release
A cricos institution or course disappearing
A cricos course newly appearing

Schema:

id, run_id, event_type, entity_type, entity_code, old_value, new_value, detail, created_at

The NOTIFY-01 email reads this table scoped to the run_id for its "What changed" section. The Observatory renders a cross-run feed sorted by date.

Finding a specific run¶

Observatory search (top of page) supports: - Run ID prefix (e.g. tga-sync-2026-04-13) - Worker name - Status filter (running / complete / failed / partial) - Date range

For SQL-level queries (faster for ad-hoc investigation):

-- Latest run per worker
SELECT sync_type, MAX(created_at) as latest
FROM sync_runs
GROUP BY sync_type;

-- Failed phases in the last 7 days
SELECT run_id, sync_type, step, error, created_at
FROM sync_steps
WHERE status = 'failed' AND created_at > datetime('now', '-7 days')
ORDER BY created_at DESC;

Run from apps/admin/ directory so wrangler picks up the correct account:

cd apps/admin
npx wrangler d1 execute ops-db --remote --command "SELECT ..."

(See standing rule: "D1 account context — always run from apps/admin/".)

Common investigations¶

"Why hasn't tga-sync run this week?"¶

Observatory runs feed filtered to sync_type=tga_sync — is there a Sunday run?
If no run row: worker cron isn't firing. Check wrangler deployments list --name tga-sync, check queue health, manually trigger /trigger.
If run row exists but status is failed or partial: drill to steps, find the failed phase.
Cross-check with email inbox — if NOTIFY-01 email is also missing, the cycle didn't reach write_snapshot.

"Something changed in the corpus but I don't remember what"¶

Regulatory events feed filtered to a date range
Each event has a detail column with before/after or the supersession chain
Cross-reference with the NOTIFY-01 email for the relevant cycle — the email's "What changed" section is the same data

"Observatory shows running forever"¶

A phase is stuck or the chain died without writing a terminal row. Check:

ops-db for the phase pointer row — what phase does it think it's in?
Cloudflare queue metrics for tga-sync-queue or cricos-sync-queue — any pending messages?
If pointer is stuck but queue is empty: dispatch was silently dropped. Re-trigger /trigger to restart the cycle from the stuck phase (phase pointer is honoured as a resume point).

../architecture/database.md — the writer contract for tga_sync_steps, the lex gotcha that broke the watchdog, the bulk-delete safety rule
../infrastructure/worker-patterns.md — the chain dispatch state machine the Observatory visualises, all 8 failure modes with defences
../infrastructure/notifications.md — NOTIFY-01 email counterpart (reads the same tga_sync_steps rows for the email body)
../ops/standing-rules.md — D1 account context rule, NOTIFY-01 expectations
../workers/inventory.md — canonical worker + queue + DB inventory