Packaging Pipeline¶
Canonical reference for the qualification structure extraction pipeline.
First run: 2026-04-16 (session 62, STUDIO-PACKAGING-PIPELINE-01).
Script: tools/packaging-pipeline.mjs.
Storage: qualification_packaging_rules in rtopacks-db.
Related docs: tga-unitgrid-endpoint.md, source-of-truth-connectivity.md, tga-api-field-inventory.md.
What this pipeline does¶
Takes TGA's raw data (unitgrid endpoint + packaging_rules HTML) and produces structured qualification packaging rules in the DB — core/elective splits, group membership, choose-N constraints. Studio reads the output to seed sessions with real unit pools.
The four stages¶
Stage 1 — Unitgrid fetch (TGA direct, $0)¶
- Endpoint:
GET /api/training/{code}/releases/{releaseNumber}/unitgrid - What it gives: Per-unit
isEssential: true|false— authoritative core/elective classification. - Script:
tools/parse-packaging-rules.mjs --unitgrid - Writes:
groups(core + elective_pool),rules(fixed + from_group),unit_grid,stage_1_complete = 1,parser_version = 'unitgrid-v1',confidence = 'high'. - Handles 404s gracefully — deleted quals skipped and logged.
- Scope: All quals with a
release_numberinqualifications. - Runtime: ~25 min at 5x parallel with 100ms stagger.
- See:
tga-unitgrid-endpoint.mdfor the full endpoint contract.
Stage 2 — Deterministic group extraction ($0)¶
- Source:
qualifications.packaging_rulesHTML (content bundle 0116). - Method: Regex-based section splitting on
<th>and<strong>headings +<ntr-tcref data-nrt-code="...">extraction. No DOM parser dependency (no Cheerio) — the two HTML patterns (table-based and<p><strong>based) are both handled by position-aware matching. - Script:
tools/packaging-pipeline.mjs --stage 2 - Writes:
groups(with real Group A/B/C structure when found),rules(deterministic when parseable),stage_2_complete = 1,parser_version = 'deterministic-v1'. - Validates: Unit codes must appear in the source HTML. Rule count sum must match
total_units_required. - Failures → Stage 3: Quals where rule parsing is ambiguous get their groups written (valuable on their own) but are queued for Stage 3's LLM to refine the rules.
- Runtime: ~15 min at 5x parallel.
Stage 3 — Haiku LLM fallback (~$1)¶
- Source: Same packaging_rules HTML, stripped to text.
- Model:
claude-haiku-4-5-20251001, 2048 max_tokens. - Prompt: Strict JSON-only system prompt + user message with qual code, counts, and rule text. Few-shot examples from BSB40920, CHC50121, SIT40521 are in
tools/parse-packaging-rules.mjs(the original parser script, separate from the pipeline). - Script:
tools/packaging-pipeline.mjs --stage 3(or runs automatically after Stage 2 whenANTHROPIC_API_KEYis set). - Writes:
groups,rules,stage_3_complete = 1,parser_version = 'llm-v1',confidence = 'medium'(or'low'if validation fails). - Runtime: Serial at 1 req/sec. Only processes Stage 2 failures, typically 10-15% of corpus.
- Key:
ANTHROPIC_API_KEYin env. Never committed. Pass--skip-stage-3to skip when no key is available.
Stage 4 — Verification report¶
- Script:
tools/packaging-pipeline.mjs --stage 4 - Queries D1 and prints:
- Total rows, stage completion counts, confidence breakdown.
- Parser version distribution.
- Low-confidence review query (exact SQL, copy-paste ready).
- Updates:
docs/ops/packaging-pipeline-last-run.mdwith date + counts.
Running the pipeline¶
Full corpus (all stages)¶
export CF_API_TOKEN=...
export ANTHROPIC_API_KEY=... # optional — omit to skip Stage 3
node tools/packaging-pipeline.mjs
Stage 1 only (separate script)¶
Stage 2 only (deterministic parsing)¶
Single qual (any stage)¶
Force reprocess¶
Reprocesses all rows regardless of stage completion flags. Use when the parsing logic has changed.
Resumability¶
Every stage uses a stage_N_complete = 0 filter by default. If the pipeline crashes at qual 4000, a rerun picks up at 4001. The --force flag reprocesses everything.
Observatory integration¶
The admin dashboard (admin.rtopacks.com.au) has a Packaging Pipeline drawer accessible from the Data Sources section header. It shows:
- Coverage: N of M qualifications with packaging data.
- Stage breakdown: how many rows were processed by each stage.
- Confidence distribution: high / medium / low with review hint.
- Parser version distribution.
- Last parsed timestamp.
- CLI command examples for full and single-qual runs.
The drawer is read-only. Full corpus runs are triggered from the CLI; the drawer surfaces status. Single-qual reruns via the Observatory are a future extension.
When to re-run¶
| Trigger | What to do |
|---|---|
| TGA releases a new training package version | --qual {code} --force for affected quals |
tga-sync updates a qual's packaging_rules |
Same — --qual {code} --force |
| New qual added to the corpus | Default filter picks it up automatically |
| Low-confidence row manually reviewed | --qual {code} --force to re-classify |
| Parsing logic changed | --force for the full corpus |
HTML patterns the Stage 2 parser handles¶
Pattern A — <p><strong> headings (BSB-style)¶
<p><strong>Core units</strong></p>
<p><ntr-tcref data-nrt-code="BSBPMG420" data-nrt-title="...">BSBPMG420</ntr-tcref> Title</p>
...
<p><strong>Group A – Project Management</strong></p>
<p><ntr-tcref data-nrt-code="BSBPMG423" ...>BSBPMG423</ntr-tcref> Title</p>
Pattern B — <table> with <th> headings (CHC/SIT-style)¶
<thead><tr><th colspan="2">Core units</th></tr></thead>
<tbody>
<tr><td><p><ntr-tcref data-nrt-code="BSBTWK502" ...>...</ntr-tcref></p></td><td><p>Title</p></td></tr>
...
</tbody>
Both patterns use <ntr-tcref data-nrt-code="..." data-nrt-title="..."> for unit references. The parser finds all headings (from <th> and <strong>) and all ntr-tcref elements by their character positions, then assigns each code to the nearest preceding heading. Zero DOM parsing needed.
Standing rule¶
PACKAGING-PIPELINE — always trigger from Observatory for status checks, and from the CLI for actual runs. Update docs/ops/packaging-pipeline.md when TGA API behaviour changes. Never run ad-hoc SQL against qualification_packaging_rules for bulk modifications — use the pipeline script with --force.
Stage 3 rate limit lessons (session 62)¶
Captured during the first full corpus run so this never has to be rediscovered.
Haiku's rate limit is token-per-minute, not request-per-minute. Large packaging_rules HTML (some quals are 100KB+) burns through the token budget fast even at low request concurrency. The number of requests doesn't matter — the total input+output tokens per minute does.
What we tried and what worked:
| Concurrency | Stagger | Result |
|---|---|---|
| 5x parallel | 200ms | ~68% failure rate — 429s on most calls |
| 3x parallel | 500ms | ~68% failure rate — same token budget issue |
| 2x parallel | 1s | ~65-72% success per pass — workable with multi-pass |
| 1x serial | 2s | Zero 429s but ~6 hours for the full corpus |
The winning pattern: 2x parallel + multi-pass convergence.
Run at 2x with 2s stagger. Accept that ~30% of calls will 429 on each pass. Each pass commits successful rows individually to D1 (stage_3_complete = 1), so the default filter automatically skips them on the next run. Repeat passes with 30-60s cooldowns between them. Each pass chews through more of the queue. After 5-7 passes, the queue is empty.
For small straggler sets (<50 quals), the script drops to 1x serial with 2s gap — avoids all 429s when the remaining set fits in the token window.
max_tokens must be 4096, not 2048. Quals with 10-16 groups (e.g. ICT50220 with 16, RII30220 with 16 rules, PMB30121 with 14) generate JSON output that exceeds 2048 tokens. At 2048, the JSON is truncated mid-string, producing parse errors (Unterminated string, Unexpected end of JSON). 4096 handles every qual in the corpus.
Cost: The full corpus run (918 LLM-processed rows across 7 passes) cost ~$13 in Haiku tokens. Average ~900 input + ~550 output tokens per call.
What the pipeline does NOT do¶
- ❌ Replace
tga-sync(that's the weekly unit/qual ingestion). - ❌ Touch
rto_scope_v2. - ❌ Generate KN translations (separate brief: STUDIO-KN-TRANSLATION-01).
- ❌ Wire the canvas to read the data (that's STUDIO-PACKAGING-CANVAS-01, already shipped).