Skip to content

Packaging Pipeline

Canonical reference for the qualification structure extraction pipeline. First run: 2026-04-16 (session 62, STUDIO-PACKAGING-PIPELINE-01). Script: tools/packaging-pipeline.mjs. Storage: qualification_packaging_rules in rtopacks-db. Related docs: tga-unitgrid-endpoint.md, source-of-truth-connectivity.md, tga-api-field-inventory.md.


What this pipeline does

Takes TGA's raw data (unitgrid endpoint + packaging_rules HTML) and produces structured qualification packaging rules in the DB — core/elective splits, group membership, choose-N constraints. Studio reads the output to seed sessions with real unit pools.


The four stages

Stage 1 — Unitgrid fetch (TGA direct, $0)

  • Endpoint: GET /api/training/{code}/releases/{releaseNumber}/unitgrid
  • What it gives: Per-unit isEssential: true|false — authoritative core/elective classification.
  • Script: tools/parse-packaging-rules.mjs --unitgrid
  • Writes: groups (core + elective_pool), rules (fixed + from_group), unit_grid, stage_1_complete = 1, parser_version = 'unitgrid-v1', confidence = 'high'.
  • Handles 404s gracefully — deleted quals skipped and logged.
  • Scope: All quals with a release_number in qualifications.
  • Runtime: ~25 min at 5x parallel with 100ms stagger.
  • See: tga-unitgrid-endpoint.md for the full endpoint contract.

Stage 2 — Deterministic group extraction ($0)

  • Source: qualifications.packaging_rules HTML (content bundle 0116).
  • Method: Regex-based section splitting on <th> and <strong> headings + <ntr-tcref data-nrt-code="..."> extraction. No DOM parser dependency (no Cheerio) — the two HTML patterns (table-based and <p><strong> based) are both handled by position-aware matching.
  • Script: tools/packaging-pipeline.mjs --stage 2
  • Writes: groups (with real Group A/B/C structure when found), rules (deterministic when parseable), stage_2_complete = 1, parser_version = 'deterministic-v1'.
  • Validates: Unit codes must appear in the source HTML. Rule count sum must match total_units_required.
  • Failures → Stage 3: Quals where rule parsing is ambiguous get their groups written (valuable on their own) but are queued for Stage 3's LLM to refine the rules.
  • Runtime: ~15 min at 5x parallel.

Stage 3 — Haiku LLM fallback (~$1)

  • Source: Same packaging_rules HTML, stripped to text.
  • Model: claude-haiku-4-5-20251001, 2048 max_tokens.
  • Prompt: Strict JSON-only system prompt + user message with qual code, counts, and rule text. Few-shot examples from BSB40920, CHC50121, SIT40521 are in tools/parse-packaging-rules.mjs (the original parser script, separate from the pipeline).
  • Script: tools/packaging-pipeline.mjs --stage 3 (or runs automatically after Stage 2 when ANTHROPIC_API_KEY is set).
  • Writes: groups, rules, stage_3_complete = 1, parser_version = 'llm-v1', confidence = 'medium' (or 'low' if validation fails).
  • Runtime: Serial at 1 req/sec. Only processes Stage 2 failures, typically 10-15% of corpus.
  • Key: ANTHROPIC_API_KEY in env. Never committed. Pass --skip-stage-3 to skip when no key is available.

Stage 4 — Verification report

  • Script: tools/packaging-pipeline.mjs --stage 4
  • Queries D1 and prints:
  • Total rows, stage completion counts, confidence breakdown.
  • Parser version distribution.
  • Low-confidence review query (exact SQL, copy-paste ready).
  • Updates: docs/ops/packaging-pipeline-last-run.md with date + counts.

Running the pipeline

Full corpus (all stages)

export CF_API_TOKEN=...
export ANTHROPIC_API_KEY=...  # optional — omit to skip Stage 3
node tools/packaging-pipeline.mjs

Stage 1 only (separate script)

CF_API_TOKEN=... node tools/parse-packaging-rules.mjs --unitgrid

Stage 2 only (deterministic parsing)

CF_API_TOKEN=... node tools/packaging-pipeline.mjs --stage 2

Single qual (any stage)

CF_API_TOKEN=... node tools/packaging-pipeline.mjs --qual BSB40920 --force

Force reprocess

CF_API_TOKEN=... node tools/packaging-pipeline.mjs --force

Reprocesses all rows regardless of stage completion flags. Use when the parsing logic has changed.


Resumability

Every stage uses a stage_N_complete = 0 filter by default. If the pipeline crashes at qual 4000, a rerun picks up at 4001. The --force flag reprocesses everything.


Observatory integration

The admin dashboard (admin.rtopacks.com.au) has a Packaging Pipeline drawer accessible from the Data Sources section header. It shows:

  • Coverage: N of M qualifications with packaging data.
  • Stage breakdown: how many rows were processed by each stage.
  • Confidence distribution: high / medium / low with review hint.
  • Parser version distribution.
  • Last parsed timestamp.
  • CLI command examples for full and single-qual runs.

The drawer is read-only. Full corpus runs are triggered from the CLI; the drawer surfaces status. Single-qual reruns via the Observatory are a future extension.


When to re-run

Trigger What to do
TGA releases a new training package version --qual {code} --force for affected quals
tga-sync updates a qual's packaging_rules Same — --qual {code} --force
New qual added to the corpus Default filter picks it up automatically
Low-confidence row manually reviewed --qual {code} --force to re-classify
Parsing logic changed --force for the full corpus

HTML patterns the Stage 2 parser handles

Pattern A — <p><strong> headings (BSB-style)

<p><strong>Core units</strong></p>
<p><ntr-tcref data-nrt-code="BSBPMG420" data-nrt-title="...">BSBPMG420</ntr-tcref> Title</p>
...
<p><strong>Group A – Project Management</strong></p>
<p><ntr-tcref data-nrt-code="BSBPMG423" ...>BSBPMG423</ntr-tcref> Title</p>

Pattern B — <table> with <th> headings (CHC/SIT-style)

<thead><tr><th colspan="2">Core units</th></tr></thead>
<tbody>
  <tr><td><p><ntr-tcref data-nrt-code="BSBTWK502" ...>...</ntr-tcref></p></td><td><p>Title</p></td></tr>
  ...
</tbody>

Both patterns use <ntr-tcref data-nrt-code="..." data-nrt-title="..."> for unit references. The parser finds all headings (from <th> and <strong>) and all ntr-tcref elements by their character positions, then assigns each code to the nearest preceding heading. Zero DOM parsing needed.


Standing rule

PACKAGING-PIPELINE — always trigger from Observatory for status checks, and from the CLI for actual runs. Update docs/ops/packaging-pipeline.md when TGA API behaviour changes. Never run ad-hoc SQL against qualification_packaging_rules for bulk modifications — use the pipeline script with --force.


Stage 3 rate limit lessons (session 62)

Captured during the first full corpus run so this never has to be rediscovered.

Haiku's rate limit is token-per-minute, not request-per-minute. Large packaging_rules HTML (some quals are 100KB+) burns through the token budget fast even at low request concurrency. The number of requests doesn't matter — the total input+output tokens per minute does.

What we tried and what worked:

Concurrency Stagger Result
5x parallel 200ms ~68% failure rate — 429s on most calls
3x parallel 500ms ~68% failure rate — same token budget issue
2x parallel 1s ~65-72% success per pass — workable with multi-pass
1x serial 2s Zero 429s but ~6 hours for the full corpus

The winning pattern: 2x parallel + multi-pass convergence.

Run at 2x with 2s stagger. Accept that ~30% of calls will 429 on each pass. Each pass commits successful rows individually to D1 (stage_3_complete = 1), so the default filter automatically skips them on the next run. Repeat passes with 30-60s cooldowns between them. Each pass chews through more of the queue. After 5-7 passes, the queue is empty.

For small straggler sets (<50 quals), the script drops to 1x serial with 2s gap — avoids all 429s when the remaining set fits in the token window.

max_tokens must be 4096, not 2048. Quals with 10-16 groups (e.g. ICT50220 with 16, RII30220 with 16 rules, PMB30121 with 14) generate JSON output that exceeds 2048 tokens. At 2048, the JSON is truncated mid-string, producing parse errors (Unterminated string, Unexpected end of JSON). 4096 handles every qual in the corpus.

Cost: The full corpus run (918 LLM-processed rows across 7 passes) cost ~$13 in Haiku tokens. Average ~900 input + ~550 output tokens per call.


What the pipeline does NOT do

  • ❌ Replace tga-sync (that's the weekly unit/qual ingestion).
  • ❌ Touch rto_scope_v2.
  • ❌ Generate KN translations (separate brief: STUDIO-KN-TRANSLATION-01).
  • ❌ Wire the canvas to read the data (that's STUDIO-PACKAGING-CANVAS-01, already shipped).