Packaging Pipeline¶

Canonical reference for the qualification structure extraction pipeline. First run: 2026-04-16 (session 62, STUDIO-PACKAGING-PIPELINE-01). Script: tools/packaging-pipeline.mjs. Storage: qualification_packaging_rules in rtopacks-db. Related docs: tga-unitgrid-endpoint.md, source-of-truth-connectivity.md, tga-api-field-inventory.md.

What this pipeline does¶

Takes TGA's raw data (unitgrid endpoint + packaging_rules HTML) and produces structured qualification packaging rules in the DB — core/elective splits, group membership, choose-N constraints. Studio reads the output to seed sessions with real unit pools.

The four stages¶

Stage 1 — Unitgrid fetch (TGA direct, $0)¶

Endpoint: GET /api/training/{code}/releases/{releaseNumber}/unitgrid
What it gives: Per-unit isEssential: true|false — authoritative core/elective classification.
Script: tools/parse-packaging-rules.mjs --unitgrid
Writes: groups (core + elective_pool), rules (fixed + from_group), unit_grid, stage_1_complete = 1, parser_version = 'unitgrid-v1', confidence = 'high'.
Handles 404s gracefully — deleted quals skipped and logged.
Scope: All quals with a release_number in qualifications.
Runtime: ~25 min at 5x parallel with 100ms stagger.
See: tga-unitgrid-endpoint.md for the full endpoint contract.

Stage 2 — Deterministic group extraction ($0)¶

Source: qualifications.packaging_rules HTML (content bundle 0116).
Method: Regex-based section splitting on <th> and  headings + <ntr-tcref data-nrt-code="..."> extraction. No DOM parser dependency (no Cheerio) — the two HTML patterns (table-based and  based) are both handled by position-aware matching.
Script: tools/packaging-pipeline.mjs --stage 2
Writes: groups (with real Group A/B/C structure when found), rules (deterministic when parseable), stage_2_complete = 1, parser_version = 'deterministic-v1'.
Validates: Unit codes must appear in the source HTML. Rule count sum must match total_units_required.
Failures → Stage 3: Quals where rule parsing is ambiguous get their groups written (valuable on their own) but are queued for Stage 3's LLM to refine the rules.
Runtime: ~15 min at 5x parallel.

Stage 3 — Haiku LLM fallback (~$1)¶

Source: Same packaging_rules HTML, stripped to text.
Model: claude-haiku-4-5-20251001, 2048 max_tokens.
Prompt: Strict JSON-only system prompt + user message with qual code, counts, and rule text. Few-shot examples from BSB40920, CHC50121, SIT40521 are in tools/parse-packaging-rules.mjs (the original parser script, separate from the pipeline).
Script: tools/packaging-pipeline.mjs --stage 3 (or runs automatically after Stage 2 when ANTHROPIC_API_KEY is set).
Writes: groups, rules, stage_3_complete = 1, parser_version = 'llm-v1', confidence = 'medium' (or 'low' if validation fails).
Runtime: Serial at 1 req/sec. Only processes Stage 2 failures, typically 10-15% of corpus.
Key: ANTHROPIC_API_KEY in env. Never committed. Pass --skip-stage-3 to skip when no key is available.

Stage 4 — Verification report¶

Script: tools/packaging-pipeline.mjs --stage 4
Queries D1 and prints:
Total rows, stage completion counts, confidence breakdown.
Parser version distribution.
Low-confidence review query (exact SQL, copy-paste ready).
Updates: docs/ops/packaging-pipeline-last-run.md with date + counts.

Running the pipeline¶

Full corpus (all stages)¶

export CF_API_TOKEN=...
export ANTHROPIC_API_KEY=...  # optional — omit to skip Stage 3
node tools/packaging-pipeline.mjs

Stage 1 only (separate script)¶

CF_API_TOKEN=... node tools/parse-packaging-rules.mjs --unitgrid

Stage 2 only (deterministic parsing)¶

CF_API_TOKEN=... node tools/packaging-pipeline.mjs --stage 2

Single qual (any stage)¶

CF_API_TOKEN=... node tools/packaging-pipeline.mjs --qual BSB40920 --force

Force reprocess¶

CF_API_TOKEN=... node tools/packaging-pipeline.mjs --force

Reprocesses all rows regardless of stage completion flags. Use when the parsing logic has changed.

Resumability¶

Every stage uses a stage_N_complete = 0 filter by default. If the pipeline crashes at qual 4000, a rerun picks up at 4001. The --force flag reprocesses everything.

Observatory integration¶

The admin dashboard (admin.rtopacks.com.au) has a Packaging Pipeline drawer accessible from the Data Sources section header. It shows:

Coverage: N of M qualifications with packaging data.
Stage breakdown: how many rows were processed by each stage.
Confidence distribution: high / medium / low with review hint.
Parser version distribution.
Last parsed timestamp.
CLI command examples for full and single-qual runs.

The drawer is read-only. Full corpus runs are triggered from the CLI; the drawer surfaces status. Single-qual reruns via the Observatory are a future extension.

When to re-run¶

Trigger	What to do
TGA releases a new training package version	`--qual {code} --force` for affected quals
`tga-sync` updates a qual's `packaging_rules`	Same — `--qual {code} --force`
New qual added to the corpus	Default filter picks it up automatically
Low-confidence row manually reviewed	`--qual {code} --force` to re-classify
Parsing logic changed	`--force` for the full corpus

HTML patterns the Stage 2 parser handles¶

Pattern A — `` headings (BSB-style)¶

<p><strong>Core units</strong></p>
<p><ntr-tcref data-nrt-code="BSBPMG420" data-nrt-title="...">BSBPMG420</ntr-tcref> Title</p>
...
<p><strong>Group A – Project Management</strong></p>
<p><ntr-tcref data-nrt-code="BSBPMG423" ...>BSBPMG423</ntr-tcref> Title</p>

Pattern B — `<table>` with `<th>` headings (CHC/SIT-style)¶

<thead><tr><th colspan="2">Core units</th></tr></thead>
<tbody>
  <tr><td><p><ntr-tcref data-nrt-code="BSBTWK502" ...>...</ntr-tcref></p></td><td><p>Title</p></td></tr>
  ...
</tbody>

Both patterns use <ntr-tcref data-nrt-code="..." data-nrt-title="..."> for unit references. The parser finds all headings (from <th> and ) and all ntr-tcref elements by their character positions, then assigns each code to the nearest preceding heading. Zero DOM parsing needed.

Standing rule¶

PACKAGING-PIPELINE — always trigger from Observatory for status checks, and from the CLI for actual runs. Update docs/ops/packaging-pipeline.md when TGA API behaviour changes. Never run ad-hoc SQL against qualification_packaging_rules for bulk modifications — use the pipeline script with --force.

Stage 3 rate limit lessons (session 62)¶

Captured during the first full corpus run so this never has to be rediscovered.

Haiku's rate limit is token-per-minute, not request-per-minute. Large packaging_rules HTML (some quals are 100KB+) burns through the token budget fast even at low request concurrency. The number of requests doesn't matter — the total input+output tokens per minute does.

What we tried and what worked:

Concurrency	Stagger	Result
5x parallel	200ms	~68% failure rate — 429s on most calls
3x parallel	500ms	~68% failure rate — same token budget issue
2x parallel	1s	~65-72% success per pass — workable with multi-pass
1x serial	2s	Zero 429s but ~6 hours for the full corpus

The winning pattern: 2x parallel + multi-pass convergence.

Run at 2x with 2s stagger. Accept that ~30% of calls will 429 on each pass. Each pass commits successful rows individually to D1 (stage_3_complete = 1), so the default filter automatically skips them on the next run. Repeat passes with 30-60s cooldowns between them. Each pass chews through more of the queue. After 5-7 passes, the queue is empty.

For small straggler sets (<50 quals), the script drops to 1x serial with 2s gap — avoids all 429s when the remaining set fits in the token window.

max_tokens must be 4096, not 2048. Quals with 10-16 groups (e.g. ICT50220 with 16, RII30220 with 16 rules, PMB30121 with 14) generate JSON output that exceeds 2048 tokens. At 2048, the JSON is truncated mid-string, producing parse errors (Unterminated string, Unexpected end of JSON). 4096 handles every qual in the corpus.

Cost: The full corpus run (918 LLM-processed rows across 7 passes) cost ~$13 in Haiku tokens. Average ~900 input + ~550 output tokens per call.

What the pipeline does NOT do¶

❌ Replace tga-sync (that's the weekly unit/qual ingestion).
❌ Touch rto_scope_v2.
❌ Generate KN translations (separate brief: STUDIO-KN-TRANSLATION-01).
❌ Wire the canvas to read the data (that's STUDIO-PACKAGING-CANVAS-01, already shipped).