Skip to content

Database Reference

Version: 1.0 Updated: 2026-04-15 (session 58) Companion docs: data-architecture.md (where data lives), docs/docs/operations/observatory-guide.md (reader side of tga_sync_steps), docs/docs/infrastructure/worker-patterns.md (worker chain dispatch).

The "how to write SQL that doesn't bite you" reference. This is the doc to brief yourself with before touching D1 in a Worker, before writing a query that filters on a timestamp, before doing a bulk update on tga_sync_steps, and before adding a new sync worker.

For which database holds which data, see data-architecture.md. This doc is purely conventions, gotchas, and table contracts.


TL;DR — the rules

  1. Never compare ISO 8601 timestamp columns to datetime('now', ...) output. SQLite is a string compare; the formats differ. Build an explicit ISO literal in JS and bind it.
  2. Every writer to tga_sync_steps must use ON CONFLICT(run_id, sync_type, step) DO UPDATE. Plain INSERT collides with the unique constraint and silently fails inside the catch.
  3. Before bulk-deleting from tga_sync_steps, check tga_sync_cursor.cycle_phase. If it's anything other than 'done', an active chain will re-seed your deleted rows on its next firing.
  4. No cross-database foreign keys. D1 doesn't enforce them. Maintain join state at the application layer.

The rest of this doc explains why each rule exists, with the incident receipts.


Convention 1 — ISO 8601 timestamps everywhere

Every timestamp column in every database stores ISO 8601 with a T separator and a Z suffix:

2026-04-15T09:42:41.000Z      ← what's in the column
2026-04-15T09:42:41Z          ← what `now()` writes (without millis)

Worker helpers consistently use:

function now(): string {
  return new Date().toISOString().slice(0, 19) + "Z";
}

This is correct and never needs to change. What needs to change is how you compare against these columns.

The SQLite lex gotcha (THE big one)

This bug has bitten the project at least four times as of session 58. Every time it looks fresh. Read this section before writing any timestamp filter.

The bug:

-- WRONG. Looks correct. Returns zero rows. Always.
WHERE started_at < datetime('now', '-5 minutes')

Why: SQLite compares strings character by character. started_at is stored as 2026-04-15T09:42:41Z. datetime('now', '-5 minutes') returns 2026-04-15 09:37:41 — note the space at position 11, and no Z. The two strings diverge at position 11: 'T' (ASCII 0x54) vs ' ' (ASCII 0x20). 'T' > ' ', so the entire ISO timestamp tests as lex-greater than the datetime-now value. The < filter is false for every row, every time. No syntax error, no warning. The query just silently returns nothing.

The fix:

// Build the cutoff in JS as an ISO literal that matches the column format.
const cutoffMs = Date.now() - 5 * 60 * 1000;
const cutoffIso = new Date(cutoffMs).toISOString().slice(0, 19) + "Z";

// Now compare like with like. SQLite string compare is correct because
// both sides are the same format.
await db.prepare(
  "UPDATE tga_sync_steps SET status = 'timed_out' WHERE status = 'running' AND started_at < ?"
).bind(cutoffIso).run();

The same applies to any column that holds ISO 8601 strings: started_at, completed_at, synced_at, enriched_at, created_at, updated_at, last_run. Never compare these against datetime('now', ...). Always build the cutoff as an ISO literal in JS first, then bind.

Incident receipts

  • Session 54enrich-sync filter WHERE enriched_at >= datetime('now', '-2 minutes') matched everything. ~20 minutes of confused debugging because the filter "looked fine."
  • Session 57 (TGA-REFRESH-01 design) — caught at code review before deploy. The freshness check originally had WHERE synced_at < datetime('now', '-7 days') and would have silently returned every record forever, making the rotating refresh phase loop infinitely.
  • Session 58 (WATCHDOG-FIX-01)workers/d1-warmer/src/index.js OBS-WATCHDOG-01 had been deployed since 2026-04-07 with WHERE status = 'running' AND started_at < datetime('now', '-5 minutes'). The watchdog had been silently sweeping zero rows the entire time it existed. Discovered when stuck tga-sync runs from yesterday were still showing as running 10 hours later.
  • Session 58 (this doc) — written precisely so this doesn't bite a fifth time.

How to spot it in a code review

Grep for any < or > comparison where the right-hand side is datetime(. If the column on the left is one of the timestamp columns above, it's probably the bug:

grep -rn "started_at\|completed_at\|synced_at\|enriched_at" --include="*.ts" --include="*.js" | grep "datetime("

If the result is non-empty, fix it.


Convention 2 — tga_sync_steps writer contract

tga_sync_steps (in ops-db) is the canonical "what each sync run did" log. It is written by every sync worker (tga-sync, cricos-sync, enrich-sync, teqsa-sync, tga-ingest indirectly) and read by the admin Observatory, the d1-warmer watchdog, the SENTINEL-02 sentinel queries, and the NOTIFY-01 completion emails.

Schema

CREATE TABLE tga_sync_steps (
  id           INTEGER PRIMARY KEY AUTOINCREMENT,
  run_id       TEXT NOT NULL,
  sync_type    TEXT NOT NULL,        -- tga_sync, cricos_sync, enrich_sync, teqsa_sync
  step         TEXT NOT NULL,        -- init, sweep_statuses, sync_training, ..., chain_failed, fatal
  status       TEXT NOT NULL,        -- pending, running, complete, failed, timed_out, error, not_implemented
  triggered_by TEXT,                 -- cron, manual
  records_in   INTEGER,
  records_out  INTEGER,
  error        TEXT,
  started_at   TEXT,                 -- ISO 8601 with T separator and Z suffix
  completed_at TEXT,                 -- same
  UNIQUE(run_id, sync_type, step)    -- added 2026-04-13
);

The UNIQUE constraint on (run_id, sync_type, step) is the contract every writer must respect. It was added on 2026-04-13 to deduplicate retry rows.

Required upsert pattern

Every writer must use ON CONFLICT DO UPDATE. Plain INSERT will collide and the catch will swallow the error silently:

async function writeStep(
  opsDb: D1Database,
  runId: string,
  syncType: string,
  step: string,
  status: string,
  triggeredBy: string,
  extra?: { records_in?: number; records_out?: number; error?: string; started_at?: string; completed_at?: string },
): Promise<void> {
  try {
    await opsDb.prepare(
      `INSERT INTO tga_sync_steps (run_id, sync_type, step, status, triggered_by, records_in, records_out, error, started_at, completed_at)
       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
       ON CONFLICT(run_id, sync_type, step) DO UPDATE SET
         status       = excluded.status,
         records_in   = COALESCE(excluded.records_in, records_in),
         records_out  = COALESCE(excluded.records_out, records_out),
         error        = excluded.error,
         started_at   = COALESCE(excluded.started_at, started_at),
         completed_at = COALESCE(excluded.completed_at, completed_at)`,
    ).bind(runId, syncType, step, status, triggeredBy, extra?.records_in ?? null, extra?.records_out ?? null, extra?.error ?? null, extra?.started_at ?? null, extra?.completed_at ?? null).run();
  } catch (e) {
    console.error(`writeStep failed for ${syncType}/${step}:`, e);
  }
}

Notes on the DO UPDATE clause:

  • status = excluded.status — overwrite, not coalesce. Every status update should reflect the latest reality, not preserve a stale earlier value.
  • error = excluded.error — overwrite. A successful retry should clear the prior error message; a COALESCE would keep the stale "partial — 0 records, more pages remain" forever.
  • records_in / records_out use COALESCE — many writers update only the status without touching counts; preserve the prior count in that case.
  • started_at / completed_at use COALESCEstarted_at is set once on the running write, completed_at is set once on the terminal write; a later write with NULL on either field shouldn't blank the prior value.

Adding a new sync worker — checklist

Before deploying any worker that writes to tga_sync_steps:

  • writeStep (or equivalent) uses INSERT ... ON CONFLICT(run_id, sync_type, step) DO UPDATE SET ...
  • All status updates write timestamps as ISO literals (new Date().toISOString().slice(0, 19) + "Z")
  • sync_type is added to the admin Observatory SYNC_TYPES array
  • sync_type schedule line + label is added to the Observatory component
  • triggered_by is one of cron or manual (no other values surface correctly in the Observatory)
  • If the worker chains via selfChain(), the chain_failed step pattern from SYNC-CHAIN-01 is wired so silent queue-send drops surface in the Observatory

Incident receipt for the writer contract

  • TEQSA-FIX-03 (session 57)workers/teqsa-sync/src/index.js teqWriteStep() used a plain INSERT that was missed when the unique constraint migration added the upsert pattern to the other workers. Every "complete" status update silently collided with the constraint and was swallowed by the catch. Result: TEQSA looked broken for four straight days (246eb50c, 84d50614, multiple stale runs) when the worker was actually completing its work — the DB just never reflected it. Compounded by the OBS-WATCHDOG-01 lex bug above, which would otherwise have surfaced the stuck rows after 5 minutes. Cost: ~half a session of investigation. Fix: 5-line change to use the upsert pattern.

Convention 3 — bulk-delete safety on tga_sync_steps

If you're cleaning up test rows, leftover artefacts, or anything else from tga_sync_steps, first check that no chain is currently mid-cycle on the run_ids you're about to delete:

cd apps/admin
npx wrangler d1 execute rtopacks-db --remote --command \
  "SELECT sync_type, last_page, status FROM tga_sync_cursor WHERE sync_type='cycle_phase'"

If cycle_phase.status is anything other than 'done', there is an active cycle whose run_id might match what you're about to delete. The chain consumer will fire its next queue message, find no existing rows for the run_id (because you just deleted them), and re-seed pending step rows because runBatchLoop checks existingRows.cnt === 0 and re-seeds on first invocation.

The "deleted" rows just reappear with fresh started_at timestamps a few seconds later. Worse: you might delete the actual production cycle state out from under the running chain, leaving the chain re-seeding new rows on every consumer firing while you keep deleting them.

Safe order

  1. setCyclePhase(env.DB, 'done') (or the SQL equivalent) first, so any in-flight queue message no-ops on its next firing.
  2. Wait ~30 seconds for any in-flight queue message to drain (no new step rows being written).
  3. Then delete.

Or just don't bulk-delete during an active cycle.

Incident receipt

  • Session 58 morning — bulk-deleted 86 yesterday's test rows from tga_sync_steps, then noticed trf01/trf02 rows had silently re-appeared within seconds. The chain had been quietly progressing through the orgs phase overnight on those run_ids and re-seeded them on its next firing. No production data was corrupted, but the cleanup looked like it had failed and led to a confused investigation. The fix was to set cycle_phase = 'done' first, then delete.

Convention 4 — no cross-database foreign keys

D1 does not support cross-database foreign keys. The schema lives in distinct databases (rtopacks-db, ops-db, engine-db-oc, etc — see data-architecture.md for the full topology) and the application maintains the joins in code.

The cross-DB joins that exist:

From To Maintained by
ops-db.rto_clients.rto_code rtopacks-db.rtos.rto_code internal-api (two queries, joined in JS)
engine-db-oc.orgs.rto_code rtopacks-db.rtos.rto_code Same pattern. Permanent org_rtopacks_ops ↔ rto_code 45329 mapping lives here.
ops-db.billing_customers.rto_code rtopacks-db.rtos.rto_code Populated during /billing/subscribe.

Never write a query that assumes referential integrity across databases. If you need to ensure both sides exist, query both and reconcile in application code.


D1 operational gotchas

A short list. Each has bitten us at least once.

wrangler d1 execute resolves to the wrong account from project root

The repo's top-level .wrangler state resolves to a different Cloudflare account (UCCO Foundation) than the one RTOpacks lives in (f95d453...). Running npx wrangler d1 execute rtopacks-db --remote ... from the project root will fail with Couldn't find DB with name 'rtopacks-db'.

Standing rule: always cd apps/admin/ before any wrangler d1 execute command. The admin app's local .wrangler state has the correct account.

D1 transient Internal error ... object to be reset (~1% rate)

D1 will occasionally return D1_ERROR: Internal error in D1 DB storage caused object to be reset. with no other context. Observed rate during heavy ingest runs is about 1% of writes. Cloudflare Queues retry handles this cleanly when the consumer uses msg.retry(). This is expected, not a bug in our code. Don't add custom retry logic — let the queue do it.

D1 SQL variable limit ≈ 100

D1's effective SQLITE_MAX_VARIABLE_NUMBER is around 100, much lower than SQLite's 999. IN (?, ?, ..., ?) queries with more than ~100 placeholders return D1_ERROR: too many SQL variables. Either chunk the input list into batches of ~50 or restructure the query to be per-row instead of bulk.

tga_sync_cursor lives in rtopacks-db, NOT ops-db

Easy mistake — tga_sync_steps is in ops-db (operational state), but tga_sync_cursor is in rtopacks-db (which is bound as env.DB in most workers). The phase pointer (cycle_phase), the per-delta cursors (org_delta, training_delta, training_refresh), and any new cursor state for a sync worker all go in rtopacks-db.

Never export rto-nrt-db

The rtopacks-db D1 (also referred to as rto-nrt-db in some older docs) is ~2.4GB and locks the database for hours on a wrangler d1 export. Never run an export on this DB. If you need a backup, take incremental snapshots through a different path.


KV operational gotchas

Never put per-invocation write state in KV

KV is fast for reads but slow and rate-limited on writes. Workers that try to advance a phase pointer or cursor by writing to KV per invocation hit transient 500s under rapid writes. Use D1 instead — the tga_sync_cursor table is the canonical place for phase pointers.

KV is fine for read-heavy state: cached search results, session tokens, OAuth refresh tokens, infrequent heartbeat writes (once-per-cron-fire is fine).


Dev databases

Five mutable D1s have dev twins under rto-*-db-staging names (D1 names retain -staging suffix per ADR-027 D1-name lag; alignment via D1-NAME-ENVIRONMENT-RENAME-01 queued brief). Read-only databases (rto-nrt-db, rto-abs-db, rto-licensing-db, rto-radar-db) are shared between production and dev — dev workers read directly from production.

Production Dev twin database_id Seeded with
rto-ops-db (ops-db) rto-ops-db-staging 6ee420d6-2151-4924-9a9b-589ff3dbf932 Sync state only: cricos_sync_cursor (6 rows) + tga_sync_steps (447 rows) copied from prod. No billing data, admin sessions, observatory logs. Billing rows fill naturally once Stage 4 webhook delivery is wired.
rto-workspace-db (engine-db-oc) rto-workspace-db-staging 25995e53-eb63-492d-b054-f46661ffb43a Three deliberately-fake tenants ("Test RTO Alpha", "Demo Training Pty Ltd", "Sample Vocational College"), one admin user per tenant (all emails client+ALIAS@rtopacks.dev → routes to client@rtopacks.dev), L4 admin roles, 7 fake rto_scope_entries referencing real TGA qual codes (BSB30120, etc. — joins to shared rto-nrt-db render real qual data), 4 fake people records (all phone +61 400 000 000 — Australian dummy non-routing number).
rto-micro-db (microcredentials-db) rto-micro-db-staging cc5b7d53-0a26-4c4d-8070-c3e67360927d Production contents (educational micro-credential catalog: domains, streams, modules, courses). No PII.
rto-landscape-db (landscape-db) rto-landscape-db-staging 4bb0d1d6-c1c9-41fc-b1de-65204229254d Production contents (vendor landscape: real Australian VET vendor names + public domains; landscape_contacts and landscape_interactions tables empty so no PII).
rto-calendar-db (calendar-db) rto-calendar-db-staging 6d5a5917-254f-419c-bfeb-ae501281e37b Production contents (public VET industry calendar events: AVETMISS deadlines, etc.). No PII.

Dev KV namespace: - SESSION_KV_STAGING (id 6c915ee0e0ba475b84999b0ea23fff57) — admin/workspace sessions, QB OAuth tokens, cron heartbeats. KV name retains _STAGING suffix per ADR-027 D1-name lag (applies equally to KV name immutability). Distinct from production SESSION_KV (21f59a0848764fb08c08c4c7eac17535) — dev session validity must not transfer to production. - Cache KVs (SEARCH_CACHE, STATS_CACHE, etc.) are shared between dev and production with a dev: key prefix on dev-side writes — dev and production never read each other's cached values. Implemented in the cache helper utility.

Dev dummy values: - All dev email recipients route to client@rtopacks.dev (via client+ALIAS@rtopacks.dev aliasing). Inbox no-ops until Tim sets it up; dev email is silent until then. - All dev phone numbers are +61 400 000 000 — Australian dummy number that does not route to any real subscriber. - All dev tenant/user/people IDs are prefixed ten_staging_, usr_staging_, p_*_* for grep-ability (ID prefixes retain staging_ per substrate immutability; alignment with D1-NAME-ENVIRONMENT-RENAME-01).

Standing rule: dev code never writes to a production database. Bindings are explicit per [env.dev] block in each worker's wrangler.jsonc. Stage 4 verification includes checking that every production binding is also present in the dev env block — non-inheritance applies (env block must redeclare every binding from the top-level config).