Worker Patterns — Chain dispatch, failure modes, pattern selection¶

Version: 2.0
Updated: 2026-04-15 (session 58, was pipeline-architecture.md)
Applies to: every Worker that runs longer than a single isolate's wall-clock budget
Companion docs: ../architecture/database.md (SQL conventions for the writer side), ../operations/observatory-guide.md (reader side and watchdog)

What this doc is¶

The canonical reference for how RTOpacks workers chain across invocations and what fails when they do. Three sections:

The queue pattern — the standard chain dispatch via Cloudflare Queues + D1 phase pointer (sessions 49-52, TGA-QUEUE-01). Read this when building any new long-running sync.
Failure modes catalogue — every way a chained worker has died on us, what triggers it, what catches it, and what doesn't. Read this when debugging a stuck chain or designing a new failure path.
Pattern selection guide — when to use the queue pattern vs admin-driven sequential HTTP loops vs neither. Read this when adding a new worker.

If you're building a new long-running sync, read all three before writing code. The patterns aren't intuitive and the failure modes don't surface in obvious places.

Why this doc exists¶

Sessions 49–51 rebuilt tga-sync and cricos-sync around a new dispatch pattern. The previous env.SELF.fetch() chain died silently after 3–4 minutes of wall-clock execution, leaving cursors stuck mid-cycle. The replacement pattern — Cloudflare Queues as the chain transport + D1 as the phase pointer — runs full cycles reliably with 40+ chained invocations per run.

Sessions 56–58 then added: SENTINEL-02 write-time integrity checks at chain boundaries; SYNC-CHAIN-01 visibility for queue-send failures (a chain_failed step row when SYNC_QUEUE.send() drops); WATCHDOG-FIX-01, which made stuck-row sweeping actually work; and TEQSA-FIX-02 cursor pagination for cases where even a single phase exceeds the wall-clock budget.

This doc absorbs all of that. The queue pattern is still the standard, but the failure modes section now covers every way the chain has died and every defence we've added.

The problem with `env.SELF.fetch()` chaining¶

The intuitive way to chain a worker across invocations is:

// Phase N finishes
await env.SELF.fetch("https://self/run?phase=next");

This works for a few minutes. Then it stops. Symptoms we observed:

Chain runs 20–40 invocations successfully
Around the 3–4 minute mark, a subsequent env.SELF.fetch() call returns 200 but the target handler never executes
No error, no log entry on the receiving side
Cursor left stuck at whatever phase was last written
Next cron cycle picks up and runs fine — until it hits the same wall again

Root cause: Cloudflare Workers enforce a per-isolate subrequest budget (~1000 subrequests) and a per-isolate CPU/wall clock budget. env.SELF.fetch() chain invocations all share the same originating isolate's budget — they're effectively nested subrequests, not independent executions. Eventually the budget runs out and new fetches get black-holed.

Service bindings between different workers avoid this (each target worker gets a fresh isolate). Self-dispatch does not.

The queue pattern¶

Replace env.SELF.fetch() chain dispatch with Cloudflare Queues:

// Phase N finishes
await env.TGA_SYNC_QUEUE.send({ phase: "next", runId, trigger });

Each queue message is consumed in a fresh isolate with a fresh budget. The chain can run indefinitely — 40, 60, 200 invocations per cycle — without hitting any per-invocation limit.

Minimum pieces¶

One queue named <worker>-sync-queue (e.g. tga-sync-queue)
The same worker is both producer (binds the queue) and consumer (implements the queue() handler)
Cron handler starts the cycle by sending the first message
queue() handler dispatches on the message body's phase and sends the next phase on the way out
Phase pointer lives in D1 (NOT KV — see below) so a crash is recoverable from the next cron
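A minimal sketch of how these pieces fit together in one worker. It assumes a hypothetical four-phase order and uses local stand-in types (QueueLike, MessageLike, BatchLike) in place of the real Cloudflare runtime types; the runId format and the phaseAfter helper are illustrative and not taken from tga-sync. The D1 phase pointer write is left out here to keep the shape visible.

```typescript
// Local stand-ins so the sketch type-checks on its own. In a real Worker
// these come from @cloudflare/workers-types (Queue, MessageBatch, Message).
interface ChainMessage { phase: string; runId: string; trigger: string }
interface QueueLike { send(body: ChainMessage): Promise<void> }
interface MessageLike { body: ChainMessage; ack(): void; retry(): void }
interface BatchLike { messages: readonly MessageLike[] }
interface Env { TGA_SYNC_QUEUE: QueueLike }

// Hypothetical fixed phase order; the real list lives in the worker source.
const PHASES: readonly string[] = ["init", "sweep_statuses", "sync_training", "done"];

function phaseAfter(phase: string): string | null {
  const i = PHASES.indexOf(phase);
  if (i < 0) return null;
  return PHASES[i + 1] ?? null; // null once the last phase has run
}

const worker = {
  // Cron entry point: start the cycle by sending the first message.
  async scheduled(_event: unknown, env: Env): Promise<void> {
    await env.TGA_SYNC_QUEUE.send({ phase: "init", runId: `run-${Date.now()}`, trigger: "cron" });
  },

  // Queue entry point: do one phase, then send the next phase on the way out.
  async queue(batch: BatchLike, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const { phase, runId, trigger } = msg.body;
      // ... this phase's slice of work would run here ...
      const next = phaseAfter(phase);
      if (next !== null) {
        await env.TGA_SYNC_QUEUE.send({ phase: next, runId, trigger });
      }
      msg.ack();
    }
  },
};
```

With max_batch_size set to 1, the loop body runs once per invocation, so each phase still gets its own isolate.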

wrangler.jsonc / wrangler.toml binding¶

{
  "queues": {
    "producers": [
      { "binding": "TGA_SYNC_QUEUE", "queue": "tga-sync-queue" }
    ],
    "consumers": [
      {
        "queue": "tga-sync-queue",
        "max_batch_size": 1,
        "max_batch_timeout": 1,
        "max_retries": 3,
        "dead_letter_queue": "tga-sync-dlq"
      }
    ]
  }
}

max_batch_size: 1 is critical — we want one message per invocation so each phase gets its own isolate. Batching would defeat the whole point.

Phase pointer in D1, not KV¶

The second gotcha: we originally stored the phase pointer in SESSION_KV. Under rapid writes (1–2 per second across chained invocations) Cloudflare KV returns intermittent 500 errors — not 429, not rate-limit, just silent 500s. When a phase pointer write fails, the next invocation reads the stale value and the chain loops or dies.

Moved to D1. The tga_sync_cursor table has a cycle_phase row whose status column holds the current phase string. D1 writes are transactional and the pattern is reliable under the dispatch load.

Rule of thumb: any per-invocation state in a hot path belongs in D1, not KV. KV is for warm reads, session tokens, and infrequent writes.
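A sketch of reading and writing the phase pointer through the D1 binding's prepare / bind / run / first calls. The key column name (name) and the exact schema are assumptions; only the tga_sync_cursor table, the cycle_phase row and the status column come from this doc. D1Like and StmtLike are local stand-ins for the real binding types.

```typescript
// Local stand-in for the parts of the D1 binding used here.
interface StmtLike {
  bind(...values: unknown[]): StmtLike;
  run(): Promise<unknown>;
  first<T>(): Promise<T | null>;
}
interface D1Like { prepare(sql: string): StmtLike }

// Assumed schema: tga_sync_cursor(name TEXT PRIMARY KEY, status TEXT).
async function setPhase(db: D1Like, phase: string): Promise<void> {
  await db
    .prepare(
      "INSERT INTO tga_sync_cursor (name, status) VALUES ('cycle_phase', ?) " +
        "ON CONFLICT(name) DO UPDATE SET status = excluded.status",
    )
    .bind(phase)
    .run();
}

// Returns null when no cycle is in flight (no pointer row yet).
async function getPhase(db: D1Like): Promise<string | null> {
  const row = await db
    .prepare("SELECT status FROM tga_sync_cursor WHERE name = 'cycle_phase'")
    .first<{ status: string }>();
  return row?.status ?? null;
}
```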

Reference implementation — tga-sync¶

Source: scripts/workers/tga-sync/src/index.ts

Phase sequence:

cron fires
  → queue("init")
    → write run_id to ops-db.sync_runs, queue("sweep_statuses")
  → queue("sweep_statuses")
    → sweepSuperseded() — N pages per invocation, up to SWEEP_PAGE_CAP=50
    → if not done, queue("sweep_statuses") again with page cursor
    → if done, queue("sync_training")
  → queue("sync_training")
    → syncTrainingComponents() — N pages per invocation, up to TRAINING_PAGE_CAP=200
    → same pattern: continue or advance phase
  → queue("sync_orgs")
    → syncOrganisations() — ORGS_PAGE_CAP=100
  → queue("utility")
    → flag_restricted, sync_qual_statuses, refresh_stats
  → queue("write_snapshot")
    → NOTIFY-01 email fires here
  → queue("done")
    → clear phase pointer, write final step row, exit

Every phase does three things:

1. Reads its own slice of work from TGA or D1
2. Writes progress to writeStep(runId, worker, phase, status, trigger, meta), which is the Observatory feed
3. Decides whether to self-chain (same phase, more work) or advance (next phase)
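The third step, continue or advance, can be sketched as a pure function. The message and result shapes here are hypothetical, and the sketch assumes the page cursor counts pages walked so far in the phase, so that comparing it with the page cap is meaningful.

```typescript
// Hypothetical shapes; the real message and result types live in the worker source.
interface ChainMessage { phase: string; runId: string; page?: number }
interface PhaseResult { nextPage: number | null } // null: upstream has no more pages

function nextMessage(
  current: ChainMessage,
  result: PhaseResult,
  pageCap: number,
  nextPhase: string,
): ChainMessage {
  const { nextPage } = result;
  if (nextPage !== null && nextPage < pageCap) {
    // More work and still under the cap: self-chain with the page cursor.
    return { phase: current.phase, runId: current.runId, page: nextPage };
  }
  // Either the upstream is exhausted or the cap forces us on: advance.
  return { phase: nextPhase, runId: current.runId };
}
```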

Safety valves¶

Page caps (SWEEP_PAGE_CAP, etc.) cap how many pages each phase can walk before forcibly advancing. Protects against TGA pagination loops.
Cycle cap on total chain invocations per cron cycle — hard upper bound so a runaway chain can't consume all cron budget for days.
writeStep UNIQUE constraint on (run_id, sync_type, step) with ON CONFLICT DO UPDATE: prevents duplicate step rows from retries or double-queues. The error column is set to excluded.error, which overwrites the old value; COALESCE would leave a stale "partial" message behind when a retry succeeds.
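To make the excluded.error versus COALESCE distinction concrete, here is a sketch of the upsert statement plus a tiny in-memory model of the two merge rules. The table name follows the tga_sync_steps usage elsewhere in this doc; the column list beyond the constraint columns, status and error is an assumption, and mergeOverwrite / mergeCoalesce are illustrative helpers, not worker code.

```typescript
// Assumed column list; see ../architecture/database.md for the canonical one.
const UPSERT_STEP_SQL = `
  INSERT INTO tga_sync_steps (run_id, sync_type, step, status, error)
  VALUES (?, ?, ?, ?, ?)
  ON CONFLICT(run_id, sync_type, step) DO UPDATE SET
    status = excluded.status,
    error  = excluded.error
`;

interface StepRow { status: string; error: string | null }

// Models "error = excluded.error": the incoming row wins outright.
function mergeOverwrite(_old: StepRow, incoming: StepRow): StepRow {
  return { status: incoming.status, error: incoming.error };
}

// Models "error = COALESCE(excluded.error, error)": a NULL incoming error
// keeps whatever message was already stored.
function mergeCoalesce(old: StepRow, incoming: StepRow): StepRow {
  return { status: incoming.status, error: incoming.error ?? old.error };
}
```

With the COALESCE rule, a retry that succeeds (error is NULL) would leave the row at status complete while still carrying the earlier failure message.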

CRICOS-sync follows the same shape¶

Source: scripts/workers/cricos-sync/src/index.ts

Phases: institutions → courses → locations → course_locations → refresh_stats → reset → write_snapshot → done.

Same queue (cricos-sync-queue), same D1 phase pointer row (cricos_sync_cursor.cycle_phase), same NOTIFY-01 email at write_snapshot.

Cron: 0 18 1 * * (18:00 UTC on the 1st of the month, which is 04:00 AEST on the 2nd). CRICOS refreshes monthly, not weekly.

What NOT to do¶

Don't use env.SELF.fetch() for chain dispatch. It works for 3–4 minutes, then dies silently.
Don't put the phase pointer in KV. Rapid-write 500s will break the chain.
Don't batch queue messages. Set max_batch_size: 1 so each message gets a fresh isolate.
Don't skip the D1 cursor. If a phase crashes mid-run, the next cron cycle reads the cursor and resumes from the right place. Without it, every crash restarts the whole chain from phase 1.
Don't fire-and-forget a promise without knowing why. An un-awaited call ending in .catch(() => {}) lets the worker shut down before the fetch lands (this bit us with NOTIFY-01 Resend calls). Use try { await ... } catch { ... } unless you genuinely want background execution after the response returns, in which case hand the promise to waitUntil().
Don't use wrangler cron trigger for manual testing — it was removed in wrangler 4.x. Each worker that needs manual invocation should expose a /trigger or /_test HTTP endpoint instead.

Observability¶

Every phase writes a row to ops-db.sync_steps via writeStep(). The Observatory page at admin → Observatory reads these rows to display a live feed of sync progress:

Run-level: one row per cron cycle with aggregate status
Step-level: one row per phase with status, records_in/out, optional error

The NOTIFY-01 email reads the same rows at write_snapshot time and renders them as a table, so the email and the Observatory stay in sync.

See docs/docs/infrastructure/notifications.md for the email pipeline and docs/docs/operations/observatory-guide.md for reading the feed.

Failure modes catalogue¶

Every way a chained worker has died, in order of how often they bite. Each has an incident receipt and a defence.

1. `env.SELF.fetch()` chain dies after 3-4 minutes¶

Cause: Self-fetch nests under the originating isolate's subrequest budget. Eventually the budget runs out and new fetches return 200 but the target handler never executes.

Symptoms: Chain runs 20-40 invocations cleanly, then a fetch returns 200 with no log on the receiving side, no error, cursor stuck mid-cycle.

Defence: Use Cloudflare Queues for chain dispatch. See "The queue pattern" above. Eliminated by TGA-QUEUE-01 (session 51). No longer in the codebase.

2. Queue-send drop (silent chain break)¶

Cause: env.SYNC_QUEUE.send() throws (transient CF Queue error, binding missing, rate limit, etc). The original selfChain() wrappers caught this with .catch(e => console.error(...)) and swallowed the failure with no DB trace. The chain stopped, the running step row sat at running forever, the Observatory card showed "Running…" indefinitely.

Symptoms: Stuck running row in tga_sync_steps, no chain_failed row, no error in any log, no progress for hours. trf02 died this way overnight in session 58.

Defence (SYNC-CHAIN-01, session 57): every selfChain() now writes a chain_failed step row to ops-db when SYNC_QUEUE.send() throws. The row has status='failed' and error='chain_break: queue send failed advancing to phase=...'. The Observatory immediately surfaces it on the next refresh because OBS-ALIGN-01 doesn't filter chain_failed from the latest-step picker (only volume_check_* baselines).

Coverage: tga-sync and cricos-sync have the SYNC-CHAIN-01 wrapper. enrich-sync doesn't need it because its queue sends are inside per-phase try/catch blocks that already escalate to markStep('error', ...).
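The shape of the SYNC-CHAIN-01 wrapper, as a sketch. The writeStep signature here is simplified and hypothetical (the real one takes worker, trigger and meta as well), QueueLike is a local stand-in, and appending the underlying error message to the chain_break string is an illustrative choice.

```typescript
interface QueueLike { send(body: unknown): Promise<void> }
// Simplified, hypothetical signature for illustration only.
type WriteStep = (runId: string, step: string, status: string, error: string | null) => Promise<void>;

// Returns true if the next phase was queued, false if the chain broke
// and a chain_failed row was written instead.
async function selfChain(
  queue: QueueLike,
  writeStep: WriteStep,
  runId: string,
  nextPhase: string,
): Promise<boolean> {
  try {
    await queue.send({ phase: nextPhase, runId });
    return true;
  } catch (e) {
    const detail = e instanceof Error ? e.message : String(e);
    // Leave a DB trace so the Observatory shows the break.
    await writeStep(
      runId,
      "chain_failed",
      "failed",
      `chain_break: queue send failed advancing to phase=${nextPhase} (${detail})`,
    );
    return false;
  }
}
```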

3. Worker isolate termination mid-phase (the gap SYNC-CHAIN-01 doesn't cover)¶

Cause: The phase function (e.g. syncTrainingComponents) takes longer than the CF Workers wall-clock budget. CF kills the isolate mid-execution. The running step row was already written at the start of the phase; selfChain() was never reached. No chain_failed row gets written because no JS code is alive to write it.

Symptoms: Stuck running row, no chain_failed row, no progress. Looks identical to a queue-send drop from outside the worker, but the cause is different.

Defence: the d1-warmer watchdog (OBS-WATCHDOG-01). Every 10 minutes, it sweeps any running row whose started_at is older than 5 minutes and marks it timed_out. The Observatory then shows the row as failed and an operator can investigate.

Defence-of-the-defence: the watchdog itself was broken from deploy until session 58. Its sweep filter used WHERE started_at < datetime('now', '-5 minutes'), which fails lexically on ISO-with-T columns (see ../architecture/database.md Convention 1): datetime() returns 'YYYY-MM-DD HH:MM:SS' with a space separator, and 'T' sorts after a space, so a same-day started_at never compares as older than the cutoff. For 8 days the watchdog swept zero rows, and nobody noticed because no chain had died from isolate termination in that window. Fixed in WATCHDOG-FIX-01 (session 58) by building the cutoff as an explicit ISO literal in JS.
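A small sketch of the lexical trap and the ISO-literal fix. It assumes started_at is stored in the Date.prototype.toISOString() format; staleCutoffIso is a hypothetical helper name, and the timestamps are made up for illustration.

```typescript
// Build the cutoff in the same format the column uses, so text comparison works.
function staleCutoffIso(nowMs: number, maxAgeMinutes: number): string {
  return new Date(nowMs - maxAgeMinutes * 60000).toISOString();
}

// A row that started 30 minutes before "now" (10:30 UTC).
const startedAt = "2026-04-15T10:00:00.000Z";

// What datetime('now', '-5 minutes') would produce at 10:30 UTC: space separator.
const sqliteStyleCutoff = "2026-04-15 10:25:00";

// The fixed cutoff, built as an ISO literal.
const isoCutoff = staleCutoffIso(Date.UTC(2026, 3, 15, 10, 30, 0), 5);

// 'T' (0x54) sorts after ' ' (0x20), so the stale row is never "older".
const sweptBySqliteCutoff = startedAt < sqliteStyleCutoff; // false

// Same format on both sides: the stale row is correctly picked up.
const sweptByIsoCutoff = startedAt < isoCutoff; // true
```

The ISO cutoff is then bound as a parameter in the sweep query rather than computed in SQL.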

The remaining gap: the watchdog catches the row 5-15 minutes late. If you need faster surfacing, the future SYNC-CHAIN-02 brief proposes a heartbeat-deadline pattern: write running with deadline=X and have the watchdog sweep based on the per-row deadline. Not built yet.

4. Phase function exceeds wall-clock even on a single page (oversized fetch)¶

Cause: A single phase tries to do too much in one invocation. Example: teqsa-sync stepDecisions originally called fetchAll() to walk all 5,982 records in one shot — ~60 sequential paged fetches at ~1.5s each = 90+ seconds, well past CF Workers' 30s wall-clock budget. CF killed the isolate every time.

Symptoms: A curl test against the phase endpoint fails with an HTTP/2 framing layer error after ~60 seconds. The step row is written as running at the start and never updated to complete. The watchdog sweeps it 5–15 minutes later.

Defence: Cursor-paginate the phase across multiple invocations. Each invocation processes a bounded chunk (e.g. 10 pages = 1000 records = ~20 seconds), updates the cursor, and either chains itself for the next chunk or advances to the next phase. TEQSA-FIX-02 (session 57) is the reference implementation: stepDecisions(env, runId, triggeredBy, startRecordOffset) plus a next-driven loop in the cron handler and the admin route.

When to apply: any phase whose work exceeds ~25 seconds on a single invocation. If you're walking >50 pages of an upstream API, or >5,000 D1 writes, or >1MB of upstream payload, plan for cursor pagination from the start.
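A sketch of one bounded chunk of a cursor-paginated phase. This is not the TEQSA-FIX-02 code: fetchPage, the return shape and the default sizes are assumptions, chosen to match the "10 pages = 1000 records" example above.

```typescript
interface ChunkResult { processed: number; next: number | null } // next: offset to resume from
type FetchPage = (offset: number, limit: number) => Promise<unknown[]>;

async function stepDecisionsChunk(
  fetchPage: FetchPage,
  startRecordOffset: number,
  pageSize = 100,
  maxPages = 10,
): Promise<ChunkResult> {
  let offset = startRecordOffset;
  let processed = 0;
  for (let page = 0; page < maxPages; page++) {
    const records = await fetchPage(offset, pageSize);
    // ... write this page's records to D1 here ...
    processed += records.length;
    offset += records.length;
    if (records.length < pageSize) {
      return { processed, next: null }; // short page: upstream exhausted
    }
  }
  // Page budget used up: hand the cursor back so the caller can re-invoke.
  return { processed, next: offset };
}
```

The caller persists next and either chains the same phase with it or advances once next is null.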

5. Queue messages exceed 128KB¶

Cause: Cloudflare Queues caps message size at 128KB. Embedding an external API response (TGA org payloads, large JSON blobs) directly in a queue message can exceed this ceiling, and the send fails.

Symptoms: env.SYNC_QUEUE.send() throws with a 413-equivalent error. With SYNC-CHAIN-01 in place, this now writes a chain_failed row. Without it, the chain just dies.

Defence: never embed external API responses in queue messages. Pass references (rto_code, course_id, page offset) and let the consumer fetch on its own. The tga-ingest consumer pattern is the reference: messages contain only {type, rto_code, provenance, timestamp}, the consumer fetches TGA detail itself.

Incident: session 53, the original ENRICH-SYNC-01 design embedded full TGA org responses in queue messages. Two of the 12,500 RTOs had payloads exceeding 128KB. Refactored to pass rto_code only.
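A sketch of the reference-only message shape with an optional pre-send size guard. The field types, the guard itself and the 128,000-byte threshold are assumptions (a conservative reading of the 128KB cap); the JSON byte length is only an approximation of what the queue counts.

```typescript
// Field names follow the tga-ingest message described above; types are assumed.
interface IngestMessage { type: string; rto_code: string; provenance: string; timestamp: string }

const MAX_MESSAGE_BYTES = 128 * 1000; // assumption: conservative threshold

// Approximate encoded size of a message body as UTF-8 JSON.
function encodedSize(body: unknown): number {
  return new TextEncoder().encode(JSON.stringify(body)).length;
}

// Throws before send() would, with a message that points at the fix.
function assertSendable(body: unknown): void {
  const size = encodedSize(body);
  if (size > MAX_MESSAGE_BYTES) {
    throw new Error(`queue message is ${size} bytes; pass a reference instead of the payload`);
  }
}
```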

6. D1 transient `Internal error ... object to be reset` (~1% rate)¶

Cause: Random D1 hiccup. Cloudflare Queues retry handles it cleanly when the consumer uses msg.retry(). Not a bug in our code.

Defence: Don't add custom retry logic — let the queue retry. See ../architecture/database.md "D1 operational gotchas" for full context.
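A sketch of a consumer that leaves retries to the queue: ack on success, msg.retry() on any failure, and no retry loop of its own. MessageLike is a local stand-in for the runtime's Message type, and consume is a hypothetical helper name.

```typescript
interface MessageLike<T> { body: T; ack(): void; retry(): void }

async function consume<T>(
  msg: MessageLike<T>,
  handle: (body: T) => Promise<void>,
): Promise<"acked" | "retried"> {
  try {
    await handle(msg.body);
    msg.ack();
    return "acked";
  } catch {
    // Transient D1 errors land here; the queue redelivers up to max_retries,
    // then routes the message to the dead letter queue.
    msg.retry();
    return "retried";
  }
}
```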

7. D1 SQL variable limit (~100 placeholders)¶

Cause: D1's effective SQLITE_MAX_VARIABLE_NUMBER is around 100, much lower than stock SQLite's default (999 before 3.32.0, 32766 since). IN (?, ?, ..., ?) queries with more than ~100 placeholders return D1_ERROR: too many SQL variables.

Defence: chunk the input list into batches of ~50, or restructure to per-row queries. Bit tga-ingest once on a sentinel-check IN clause; now restructured to per-record lookups.
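The chunking option can be sketched with two small helpers. chunk and inClause are hypothetical names; the column name passed to inClause must be a trusted identifier, since it is interpolated directly into the SQL text.

```typescript
// Split a list into batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build "col IN (?, ?, ...)" with one placeholder per bound value.
function inClause(column: string, count: number): string {
  const placeholders = Array.from({ length: count }, () => "?").join(", ");
  return `${column} IN (${placeholders})`;
}
```

Each batch then becomes one query, with inClause(column, batch.length) in the WHERE clause and the batch spread into bind().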

8. `tga_sync_steps` writer collides with the unique constraint (silently)¶

Cause: Plain INSERT INTO tga_sync_steps collides with the UNIQUE(run_id, sync_type, step) constraint added 2026-04-13. The writer's catch swallows the error. The status update silently fails. The DB never reflects the latest state.

Symptoms: Worker returns 200/complete to its caller, but the DB row stays at running until the watchdog sweeps it.

Defence: Every writer must use ON CONFLICT(run_id, sync_type, step) DO UPDATE. See ../architecture/database.md Convention 2 for the canonical pattern and the TEQSA-FIX-03 incident receipt.

Pattern selection guide¶

When adding a new worker, pick the chain dispatch pattern that matches the work shape:

Use the queue chain pattern when¶

The work is a multi-phase sync that needs to walk a paginated upstream API, write to D1, and persist a cursor between phases (tga-sync, cricos-sync, enrich-sync)
A single cycle takes longer than ~25 seconds end-to-end (i.e. exceeds a single isolate's wall-clock budget)
The phases have a fixed order (sweep → training → orgs → utility → done) but the cycle may take many invocations to complete
You need crash recovery — if the worker dies mid-cycle, the next cron should resume from the cursor, not restart from phase 1

Defences active by default: SYNC-CHAIN-01 (queue-send drop visibility), OBS-WATCHDOG-01 (stuck-row sweep), the writer contract (UPSERT pattern), Observatory query alignment (OBS-ALIGN-01).

Reference implementations: scripts/workers/tga-sync/src/index.ts, scripts/workers/cricos-sync/src/index.ts, workers/enrich-sync/src/index.ts.

Use the admin-driven sequential HTTP loop pattern when¶

The work has a fixed number of steps that don't fan out (teqsa-sync providers → courses → decisions → finalize)
Each step fits in a single isolate's wall-clock budget OR can be cursor-paginated across multiple invocations of the SAME step (TEQSA-FIX-02)
You want the admin route's POST handler to drive the chain (the operator clicks "Run now" and the chain progresses through admin-side await fetch calls)
The cron handler can do the same thing in a while (next) loop without a queue

Why not queues for this case: queue chains are heavier infrastructure than the work justifies when there's no fan-out and the steps are deterministic. The admin route is already the natural orchestrator and can pass cursor state in the request body.

Reference implementation: workers/teqsa-sync/src/index.js + the TEQSA branch of apps/admin/app/api/admin/ingest/route.ts. Both use the next-driven while loop pattern. Capped at 50 iterations for safety.
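A sketch of the next-driven loop with the iteration cap. The response shape ({ next }) and the runStep callback are assumptions standing in for the real HTTP call to the worker; only the while-loop shape and the cap of 50 come from the description above.

```typescript
interface StepResponse { next: string | null } // null: chain finished
type RunStep = (step: string) => Promise<StepResponse>;

async function driveChain(
  runStep: RunStep,
  firstStep: string,
  maxIterations = 50,
): Promise<{ iterations: number; completed: boolean }> {
  let next: string | null = firstStep;
  let iterations = 0;
  while (next !== null && iterations < maxIterations) {
    const res: StepResponse = await runStep(next);
    next = res.next;
    iterations++;
  }
  // completed is false when the cap stopped a chain that still had work.
  return { iterations, completed: next === null };
}
```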

Use neither when¶

The work fits in a single isolate's wall-clock budget (stats-cache, qb-reconcile daily runs, qual-enrichment)
It's a one-shot consumer of a queue produced elsewhere (tga-ingest — fires on each message from rtopacks-ingest-queue, no chaining of its own)

Service binding authentication¶

Workers that call internal-api via service binding (env.INTERNAL_API.fetch()) bypass CF Access entirely. No CF-Access-Jwt-Assertion header arrives — the request goes through Cloudflare's internal runtime, not the public internet.

Trusted-caller header convention: Workers identify themselves to internal-api via X-RTP-Internal-Source:

| Header value | Caller | `ucca_layer` | `is_super` | Use case |
| --- | --- | --- | --- | --- |
| `admin-worker` | `rtopacks-admin` | 3 (L3) | true | Full admin access |
| `site-worker` | `rtopacks-site` | 0 | false | Anonymous public search only |

Internal-api checks this header before falling through to authenticate(). If present and recognised, the request is admitted without JWT validation.

When adding a new worker-to-worker call: If the target endpoint sits behind internal-api's auth check (below the if (!caller) guard), the calling worker MUST send X-RTP-Internal-Source or the call will 401. Add the new source value to the if block in workers/internal-api/src/index.ts (line ~1240).

Migration gotcha: If a worker previously called internal-api via raw fetch() through CF Access (which provided the JWT), switching to a service binding removes the JWT silently. The call succeeds at the network layer but 401s at the auth layer. Always add the trusted-caller header when converting from raw fetch to service binding.
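A sketch of both sides of the trusted-caller convention. This is not the internal-api source: the helper names and the lookup-table structure are hypothetical; only the header name and the two source values with their ucca_layer and is_super settings come from the table above.

```typescript
interface TrustedCaller { ucca_layer: number; is_super: boolean }

const SOURCE_HEADER = "X-RTP-Internal-Source";

// Recognised source values and the identity each one is admitted with.
const TRUSTED_SOURCES = new Map<string, TrustedCaller>([
  ["admin-worker", { ucca_layer: 3, is_super: true }],
  ["site-worker", { ucca_layer: 0, is_super: false }],
]);

// Calling side: merge the source header into whatever headers the call needs.
function internalHeaders(source: string, extra: Record<string, string> = {}): Record<string, string> {
  return { ...extra, [SOURCE_HEADER]: source };
}

// Receiving side: null means "not a trusted caller, fall through to authenticate()".
function resolveTrustedCaller(headerValue: string | null): TrustedCaller | null {
  if (headerValue === null) return null;
  return TRUSTED_SOURCES.get(headerValue) ?? null;
}
```

A caller would pass internalHeaders("site-worker") as the headers of its env.INTERNAL_API.fetch() request; the receiver would feed request.headers.get("X-RTP-Internal-Source") into resolveTrustedCaller.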

KV default values and JSON serialization¶

Infinity does not survive JSON round-trip. JSON.stringify({x: Infinity}) produces {"x":null}. Reading it back gives null, not Infinity. In JavaScript, null < 200 evaluates to true.

This caused the ANON-SEC-01 threat system to flag every new visitor as RED on the RTOpacks account (fresh KV, all records new, all minInterval values initialised to Infinity → serialised as null → null < threshold → instant block). The system worked on the old UCCA account because the KV had warm data with real numeric values.

Rule: Never use Infinity, -Infinity, NaN, or undefined as default values in objects that will be stored in KV. Use a large number (999999999) or explicit sentinel values that survive JSON round-trip. Test cold-start behaviour (empty KV) when migrating any KV-backed rate-limiting, threat detection, or threshold system.
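The round-trip behaviour is easy to reproduce in isolation. In this sketch the record shape, the roundTrip helper and the 200 threshold are illustrative stand-ins for whatever the KV-backed system stores and compares.

```typescript
interface VisitorRecord { minInterval: number }

const THRESHOLD_MS = 200;

// Simulates a KV put followed by a get of a JSON value.
function roundTrip<T>(value: T): T {
  return JSON.parse(JSON.stringify(value)) as T;
}

// Infinity serialises as null, and null coerces to 0 in a numeric comparison.
const unsafe = roundTrip<VisitorRecord>({ minInterval: Infinity });
const unsafeFlagged = unsafe.minInterval < THRESHOLD_MS;

// A large finite sentinel survives the round trip unchanged.
const safe = roundTrip<VisitorRecord>({ minInterval: 999999999 });
const safeFlagged = safe.minInterval < THRESHOLD_MS;
```

Here unsafeFlagged is true for a brand-new record, which is the cold-start failure described above, while safeFlagged is false.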

Production and staging environments¶

The user-facing topology has a production environment on rtopacks.com.au and a staging environment on rtopacks.dev. Each of the four user-facing workers (rtopacks-internal-api, rtopacks-admin, rtopacks-workspace, rtopacks-site) has a -staging sibling that runs the same code with different bindings, secrets, and routes.

Env-block deploy pattern¶

A single wrangler.jsonc per worker holds both production and staging configurations. The top-level fields are production. The env.staging block overrides for staging.

{
  "name": "rtopacks-X",                        // production
  "routes": [{ "pattern": "X.rtopacks.com.au/*", "zone_name": "rtopacks.com.au" }],
  "kv_namespaces": [{ "binding": "Y", "id": "<prod-kv-id>" }],
  "d1_databases": [{ "binding": "Z", "database_id": "<prod-d1-id>" }],

  "env": {
    "staging": {
      "name": "rtopacks-X-staging",            // staging
      "routes": [{ "pattern": "X.rtopacks.dev/*", "zone_name": "rtopacks.dev" }],
      "vars": { "ENV": "staging" },
      "kv_namespaces": [{ "binding": "Y", "id": "<staging-twin-id>" }],
      "d1_databases": [{ "binding": "Z", "database_id": "<staging-twin-id>" }]
    }
  }
}

Deploy via npx wrangler deploy --env staging (or npx opennextjs-cloudflare build && npx wrangler deploy --env staging for OpenNext-wrapped Next.js workers). The npm run deploy:staging script in each worker's package.json wraps this.

Critical caveat — non-inheritance. When an env block is present, non-scalar fields (kv_namespaces, d1_databases, r2_buckets, services, queues, routes) do not inherit from the top-level config. The staging env block must explicitly redeclare every binding it needs. A binding added to production but not added to the staging env block silently does not exist in staging. Verification step before any staging deploy: diff the production top-level binding list against the staging env block; fix any missing entry before deploying.

Scalar fields (account_id, compatibility_date, main, etc.) DO inherit. The Stage 1.5 spike confirmed this on a no-op apps/site deploy 2026-05-06.
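The pre-deploy diff can be sketched as a function over an already-parsed config object (wrangler.jsonc contains comments, so parsing it is left out here). The sketch only covers sections whose entries carry a binding field; queues and routes have different shapes and would need their own comparison. missingInStaging is a hypothetical helper, not an existing script.

```typescript
const BINDING_KEYS = ["kv_namespaces", "d1_databases", "r2_buckets", "services"] as const;
type BindingKey = (typeof BINDING_KEYS)[number];
type Section = Partial<Record<BindingKey, { binding: string }[]>>;
interface WranglerConfig extends Section { env?: { staging?: Section } }

// Lists every production binding that the staging env block does not redeclare.
function missingInStaging(config: WranglerConfig): string[] {
  const staging: Section = config.env?.staging ?? {};
  const missing: string[] = [];
  for (const key of BINDING_KEYS) {
    const stagingNames = new Set((staging[key] ?? []).map((b) => b.binding));
    for (const b of config[key] ?? []) {
      if (!stagingNames.has(b.binding)) missing.push(`${key}.${b.binding}`);
    }
  }
  return missing;
}
```

An empty result means every top-level binding name also appears in env.staging; it does not check that the IDs point at the staging twins.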

Secret separation rule¶

Secrets are managed per-environment via wrangler secret put --env staging <NAME>. The staging credential model:

Sandbox / test-mode credentials graduate to staging permanently. Stripe sk_test_*, QuickBooks Sandbox B, Resend API key — all stay on staging.
Live / production credentials only ever go to production. When Stripe live mode is enabled (STRIPE-LIVE-01) or QB live (QB-LIVE-01), the live keys go on production-only.
Per-environment secrets that should never match between staging and prod: JWT_SECRET, SESSION_ATTRIBUTION_SECRET. Generate fresh staging-specific values via openssl rand -hex 32 to prevent any token signed by one environment from validating in the other. Same isolation principle applies to any token-signing or session-attribution material.

Stripe webhook split — second endpoint pattern¶

Stripe sandbox supports multiple webhook endpoints natively. Production has one webhook at internal-api.rtopacks.com.au/billing/webhook; staging registers a second endpoint at internal-api.rtopacks.dev/billing/webhook. Each endpoint has its own signing secret; staging uses a staging-specific STRIPE_WEBHOOK_SECRET. Both endpoints receive the same sandbox events (test charges, subscription updates, etc.); each environment's worker processes them into its own rto-ops-db[-staging]. There is no shared state.

When Stripe live mode is enabled, the live webhook is registered in production only. Staging continues to receive sandbox events.

QuickBooks two-sandbox-company pattern¶

QuickBooks sandbox webhook delivery is configured at the OAuth-app level, not per-connected-company. A single sandbox app has one webhook URL. So production and staging cannot share the same sandbox company without sharing the webhook URL.

Solution: production connects to Sandbox A, staging connects to Sandbox B. Both are companies under the same QB developer account. Each has its own data, its own OAuth tokens, its own webhook URL. Sandbox B requires ~30-minute one-time baseline setup (chart of accounts, sample customers, OAuth flow to obtain refresh token).

Staging environment variables: QB_CLIENT_ID, QB_CLIENT_SECRET, QB_COMPANY_ID, QB_REFRESH_TOKEN all hold Sandbox B values. QB_ENVIRONMENT is the literal string sandbox (set as a vars field, not a secret).

Public hostname with bypass paths — internal-api¶

internal-api (production and staging) is a public-hostname worker behind Cloudflare Access. The Stripe webhook path (/billing/webhook) and QB OAuth callback path (/billing/qb-callback) need to receive HTTPS calls from external IPs, so each is fronted by a per-path bypass Access policy that admits everyone. All other paths on the hostname enforce Access auth.

Production has CF Access apps named RTOpacks Internal API, RTOpacks Billing Webhooks, RTOpacks QB OAuth Callback. Staging has RTOpacks Staging — zone-wide (covering the four hostnames including internal-api), RTOpacks Staging — Billing Webhooks, RTOpacks Staging — QB OAuth Callback.

Cache key `staging:` prefix pattern¶

Cache KV namespaces (SEARCH_CACHE, STATS_CACHE) are shared infrastructure between staging and production — same KV namespace IDs in both env blocks. Isolation is by key prefix, not by namespace duplication: every cache write from a staging worker prepends staging: to the cache key; staging reads only staging:-prefixed keys; production never reads or writes prefixed keys.

const cachePrefix = env?.ENV === "staging" ? "staging:" : "";
const cacheKey = `${cachePrefix}geocode:${hash}`;

This makes the cache shared infrastructure but logically isolated data. No risk of staging cache values being served to production clients.

State KVs (SESSION_KV, and any KV holding mutable state like LEADS, MCP_API_KEYS, ANON_THREAT_KV) are twinned — separate namespace IDs in staging vs production. The brief default for KV strategy: shared+prefix where it's a cache, twinned where it's mutable state.

`INTERNAL_API` service-binding pattern in staging¶

Service bindings are environment-scoped. The production user-facing workers bind:

"services": [{ "binding": "INTERNAL_API", "service": "rtopacks-internal-api" }]

The staging env blocks bind:

"services": [{ "binding": "INTERNAL_API", "service": "rtopacks-internal-api-staging" }]

This guarantees that a staging worker never falls through to production internal-api — its env.INTERNAL_API.fetch() calls go to the staging sibling exclusively. Verification at deploy time: hit a staging worker endpoint that proxies to internal-api, observe a diagnostic header confirming the staging-side request handler ran (e.g. X-RTP-Internal-Source: site-worker-staging).

External-service no-op rule¶

Code paths that touch external services check env.ENV === "staging" and either substitute or no-op:

Email send: substitute the recipient with client@rtopacks.dev (Tim's staging-only inbox; no-op until inbox exists). Never send to a real recipient address from staging.
SMS send: no-op (log only). Twilio integration is not yet wired anyway; the staging no-op is belt-and-braces.
On-demand TGA enrichment endpoints (apps/site /api/enrich, /api/search-enrich): early-return 503 in staging. Disabled belt-and-braces; full retirement is ON-DEMAND-RETIRE-01.

The check pattern is inline at each call site:

to: env.ENV === "staging" ? ["client@rtopacks.dev"] : [originalRecipient],

This is deliberately inline rather than a shared helper. The brief preferred this over premature abstraction; if email gating ever needs to expand (different aliases, multi-recipient suppression, etc.), revisit then.

Standing rule: if staging ever sends a real email or SMS, or processes a live (non-sandbox) Stripe/QB event, that is a bug.

Cross-references¶

../architecture/database.md — SQL conventions, the tga_sync_steps writer contract, the lex gotcha that broke the watchdog. Also the staging-databases section.
../operations/observatory-guide.md — reader side: how the admin Observatory queries tga_sync_steps, the OBS-ALIGN-01 query rules, the watchdog behaviour
../workers/inventory.md — canonical worker + queue + DB + KV inventory with current IDs
tga-ingest.md — on-demand enrichment consumer (one-shot queue consumer pattern, not chained)
notifications.md — NOTIFY-01 email pipeline (fires from write_snapshot phase)
../ops/standing-rules.md — operational rules (QB token, D1 account context, never-export rules)