
LLM Batch Processing — Performance Findings & Cost Lessons

Date: 1–2 April 2026
Context: KN v1.6 full corpus generation — 15,119 units
Status: Updated 2 April 2026 after cost incident. READ BEFORE ANY LLM RUN.


⚠ MANDATORY PRE-RUN CHECKLIST

Before running ANY LLM generation job — whether 10 units or 15,000 — complete every item below.

  1. Verify the model exists on the current pricing page. Go to https://docs.x.ai/developers/models (or the equivalent for the provider). If the model name is not listed, it is deprecated. Do not run it. Deprecated models may still accept API calls at legacy pricing — xAI does not warn you.

  2. Record the exact per-token pricing. Write down input price/M and output price/M from the pricing page. Do not use hardcoded constants from scripts, briefs, or memory. Pricing changes. The pricing page is the only source of truth.

  3. Calculate the estimated cost BEFORE starting. Use the formula:

    Cost = (total_input_tokens / 1M × input_price) + (total_output_tokens / 1M × output_price)
    
    Use averages from previous runs: ~2,000 input tokens/unit, ~2,900 output tokens/unit.

  4. Check the xAI credit balance. Go to console.x.ai. Confirm prepaid balance exceeds estimated cost + 15% buffer. At high concurrency, the billing system overshoots by $3-5 before cutting off.

  5. Run 10 test units first. Verify output quality AND check the actual billed amount against your estimate. If they don't match within 20%, stop and investigate before running the corpus.

  6. Update this document with the actual model, pricing, and cost after every run.

If any of these steps are skipped, the run is unauthorised.
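The formula in step 3 can be sketched as a small helper. This is a sketch with illustrative names, not the production script; the prices below are the 2 April 2026 grok-4-1-fast-reasoning figures and must be re-read from the pricing page before every run.

```javascript
// Step 3's formula as a helper. Prices are parameters on purpose:
// read them from the pricing page at run time, never hardcode them.
function estimateRunCost(units, opts) {
  const inputCost  = (units * opts.inTokensPerUnit  / 1e6) * opts.inPricePerM;
  const outputCost = (units * opts.outTokensPerUnit / 1e6) * opts.outPricePerM;
  return inputCost + outputCost;
}

// Step 5's 10-unit test run, priced for grok-4-1-fast-reasoning
// (prices as read from the pricing page on 2 Apr 2026):
const testRunEstimate = estimateRunCost(10, {
  inTokensPerUnit: 2000,   // average from previous runs
  outTokensPerUnit: 2900,  // average from previous runs
  inPricePerM: 0.20,
  outPricePerM: 0.50,
});
console.log(`10-unit test run: ~$${testRunEstimate.toFixed(4)}`); // ~$0.0185
```

Compare this figure (plus the 15% buffer from step 4) against the actual billed amount before committing to the full corpus.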


The Cost Incident — 1 April 2026

What happened

A KN generation brief specified grok-3-fast as the model. This model is deprecated and no longer listed on xAI's pricing page. However, xAI's API still accepted requests to it — at legacy pricing of approximately $3.96 per million tokens (blended).

The current equivalent model — grok-4-1-fast-reasoning — costs $0.20/M input + $0.50/M output (~$0.38/M blended). This is 10x cheaper.

The script ran 1,292 units on the deprecated model before exhausting the $25 prepaid credit. It then received 429 responses (credit exhaustion, not rate limiting) for the remaining 13,841 units, which all failed.

What it cost

Item                                         Amount
Prepaid credit topped up                     $25.00
Units successfully processed                 1,292
Actual cost per unit                         ~$0.019
Same units on grok-4-1-fast-reasoning        ~$1.78
Wasted on deprecated-model pricing           ~$23.22

What should have happened

The entire 15,119-unit corpus should have cost $25.51 on grok-4-1-fast-reasoning standard, or $12.75 on batch. Instead, $25 bought 1,292 units.

Root causes

  1. The brief referenced a deprecated model. Claude specified grok-3-fast without checking the current xAI pricing page. The model worked in the previous session but was superseded between sessions.

  2. The script used hardcoded pricing constants. INPUT_COST_PER_TOKEN = 0.0000006 was wrong by 6.6x. The script reported $3.79 cost while xAI actually charged ~$25. Nobody caught the discrepancy until the credit ran out.

  3. xAI accepts deprecated model names silently. No deprecation warning in the API response. No pricing mismatch flag. The API just processes the request and charges at whatever the legacy rate is.

  4. No pre-run cost verification. The script did not query the xAI billing endpoint or compare estimated vs actual cost after the first batch. A simple check after 100 units would have caught the 6.6x discrepancy.

  5. 429 misinterpreted as rate limiting. The script treated all 429 responses as temporary rate limits (back off and retry). Credit exhaustion also returns 429. After 3 retries per unit, 13,841 units failed one by one instead of the script stopping early.

Fixes implemented

  • Circuit breaker: If 50 consecutive units receive 429, stop the run and report "likely credit exhaustion"
  • Model validation: Before any run, confirm the model is listed on the current pricing page
  • Actual cost tracking: After every 100 units, compare script-tracked cost against expected cost. Alert if >20% discrepancy
  • This checklist: Mandatory before every run, no exceptions
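The model-validation fix amounts to a hard gate at startup. The check itself is pure; where the model list comes from is an assumption here (an OpenAI-style GET /v1/models call, which xAI's API is believed to expose), and it must be a live response, never a cached or remembered list.

```javascript
// Hard gate at startup: refuse to run a model that is not on the
// provider's current model list. The list should come from a live call
// (an OpenAI-style GET /v1/models is assumed here), never from memory,
// because deprecated models still accept requests at legacy pricing.
function assertModelIsCurrent(model, currentModels) {
  if (!currentModels.includes(model)) {
    throw new Error(
      `Model "${model}" is not on the current model list, likely deprecated. Aborting run.`
    );
  }
}

// Illustrative list; in the real script this comes from the API response.
const listed = ["grok-4-1-fast-reasoning", "grok-4-1-fast-non-reasoning"];
assertModelIsCurrent("grok-4-1-fast-reasoning", listed); // passes
// assertModelIsCurrent("grok-3-fast", listed);          // throws: run blocked
```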

Current Model Pricing (as of 2 April 2026)

Source: https://docs.x.ai/developers/models

xAI Grok Models

Model                        Input/M  Output/M  Cached/M  Batch Input/M  Batch Output/M
grok-4.20-0309-reasoning     $2.00    $6.00     $0.20     $1.00          $3.00
grok-4-1-fast-reasoning      $0.20    $0.50     $0.05     $0.10          $0.25
grok-4-1-fast-non-reasoning  $0.20    $0.50     $0.05     $0.10          $0.25

Anthropic Claude Models (for comparison)

Model       Input/M  Output/M  Batch Input/M  Batch Output/M
Haiku 4.5   $1.00    $5.00     $0.50          $2.50
Sonnet 4.6  $3.00    $15.00    $1.50          $7.50
Opus 4.6    $15.00   $75.00    $7.50          $37.50

Cost Comparison — 15,119 unit corpus

Based on actual token averages: ~2,000 input + ~2,900 output per unit.

Option                  Cost     Notes
grok-4-1-fast BATCH     $12.75   Best value. 50% off standard. 24 hr turnaround.
grok-4-1-fast standard  $25.51   Real-time. Use for urgent re-runs.
Haiku 4.5 BATCH         $113.65  Different voice — test before committing.
grok-4.20 BATCH         $147.49  Flagship quality, 10x the fast-model price.
Sonnet 4.6 BATCH        $340.95  Premium. Only if voice quality demands it.
grok-3-fast             $268.00  DEPRECATED. Do not use.

The Pipeline

Node.js script (Mac Mini)
  → 50 concurrent HTTP requests to xAI API
  → Parse JSON response
  → Write to Cloudflare D1 via REST API
  → Next unit immediately (worker pool, no wave batching)

Each unit = 1 API call to Grok + 1 API call to D1. Average ~4,900 tokens per unit (2,000 input + 2,900 output). Average response time from Grok: 3-8 seconds depending on unit complexity.


What We Tested — Concurrency

Concurrency         Rate (units/min)  429s from Grok  Notes
10 (wave batching)  ~10               0               Initial approach. Waves wait for the slowest unit.
10 (worker pool)    ~18               0               Same concurrency; the pool pattern eliminates the wave wait.
25 (worker pool)    ~30               0               Grok comfortable; D1 starting to be the limiter.
50 (worker pool)    ~38               0               Diminishing returns — D1 write latency is the ceiling.

Grok was never the bottleneck. The ceiling is Cloudflare D1's REST API write latency (~100-200ms per write round-trip from Australia to US-based API endpoint).


Key Findings

1. Worker pool beats wave batching

Wave batching (Promise.allSettled on fixed batches of N):

  • Fires N requests, waits for ALL N to complete, then fires the next N
  • Throughput is limited by the slowest unit in each wave
  • A single 15-second response blocks 24 other workers that finished in 3 seconds

Worker pool (N workers each pulling from a shared queue):

  • Each worker grabs the next unit immediately after finishing
  • No waiting on slow units — the pool stays fully saturated
  • ~80% throughput improvement over wave batching at the same concurrency
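The pool pattern fits in a few lines. A sketch, not the production script: `processUnit` is an illustrative stand-in for the real Grok-call-plus-D1-write.

```javascript
// Worker pool: N workers pull from a shared queue. A slow unit ties up
// only one worker while the others keep pulling, so the pool stays
// saturated. Safe without locks: JS is single-threaded, and the
// check-and-increment of `next` happens synchronously between awaits.
async function runPool(units, concurrency, processUnit) {
  let next = 0;
  const results = new Array(units.length);
  async function worker() {
    while (next < units.length) {
      const i = next++;              // claim the next unit immediately
      results[i] = await processUnit(units[i]);
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```

Contrast with wave batching, where `Promise.allSettled` on a fixed batch forces every worker to idle until the slowest unit in the wave finishes.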

2. D1 REST API is the write bottleneck

Each unit writes ~6 fields of text (1-5KB total) to D1 via the Cloudflare REST API. The round-trip from Brisbane to the CF API endpoint adds ~100-200ms per write. At 50 concurrent workers, that's ~25 D1 writes per second — close to the practical limit for sequential REST writes.

3. Resume-safe WHERE clause is essential

WHERE status = 'Current'
AND restricted_access = 0
AND (kn_prompt_version != '1.6' OR kn_prompt_version IS NULL)

If the script dies mid-run (credit exhaustion, network error, Mac Mini sleep), restart and it picks up exactly where it left off. No duplicated work, no lost work.

4. Credit exhaustion returns 429, not a specific error

xAI returns HTTP 429 for both rate limiting AND credit exhaustion. The script must distinguish between the two. A circuit breaker (stop after N consecutive 429s) prevents burning through the entire queue when the account is empty.
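The breaker is a few lines of state. A sketch with illustrative names; the threshold matches the CONSECUTIVE_429_LIMIT setting used by the script.

```javascript
// Circuit breaker: a long unbroken run of 429s means credit exhaustion,
// not rate limiting. Stop the whole run instead of retrying the queue.
class Http429Breaker {
  constructor(limit) {
    this.limit = limit;       // e.g. CONSECUTIVE_429_LIMIT = 50
    this.consecutive = 0;
  }
  // Record every response status; returns true when the run should stop.
  record(status) {
    this.consecutive = status === 429 ? this.consecutive + 1 : 0;
    return this.consecutive >= this.limit;
  }
}

const breaker = new Http429Breaker(50);
// In the worker loop: if (breaker.record(res.status)) abort the run and
// report "likely credit exhaustion".
```

Any non-429 response resets the counter, so ordinary rate-limit retries scattered through a healthy run never trip the breaker.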

5. High concurrency causes billing overshoot

At 50 concurrent requests, the billing system takes several seconds to register usage. Expect $3-5 overshoot past the prepaid balance before 429s start. Budget for this.

6. JSON parse failures are rare and non-fatal

~2% of units produce malformed JSON. max_tokens: 8000 is sufficient for 98% of units. The script logs failures and continues. Retry failed units individually after the run.
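The log-and-continue behaviour can be sketched as follows (function and field names are illustrative, not the production script's):

```javascript
// Parse a model response; log and skip on malformed JSON instead of
// aborting the run. Failed units are collected for individual retry later.
function parseUnitResponse(unitId, raw, failedUnits) {
  try {
    return JSON.parse(raw);
  } catch (err) {
    console.error(`Unit ${unitId}: malformed JSON, queued for retry`);
    failedUnits.push(unitId);
    return null; // caller skips the D1 write and moves to the next unit
  }
}

const failed = [];
parseUnitResponse(1017, '{"summary": "ok"}', failed);       // parses fine
parseUnitResponse(1018, "Sure! Here is the JSON:", failed); // logged, queued
// failed now holds [1018]; retry these individually after the run
```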


Optimal Settings

const MODEL = "grok-4-1-fast-reasoning";  // NOT grok-3-fast — that's deprecated
const CONCURRENCY = 50;           // worker pool, not wave batching
const MAX_TOKENS = 8000;          // generous — only charges actual output
const TEMPERATURE = 0.3;          // consistent voice
const TIMEOUT = 120000;           // 2 min per request
const CONSECUTIVE_429_LIMIT = 50; // circuit breaker — stop run on credit exhaustion

Infrastructure

  • Run from: Mac Mini (Brisbane) — Node.js 25.x
  • API: xAI REST API (https://api.x.ai/v1/chat/completions)
  • DB: Cloudflare D1 via REST API (not Worker binding — script runs locally)
  • Auth: GROK_API_KEY env var (xai- prefix token), stored in .credentials/cloudflare.env
  • Pricing page: https://docs.x.ai/developers/models — CHECK BEFORE EVERY RUN

Monitoring During a Run

Progress:

SELECT COUNT(*) as processed,
  ROUND(SUM(kn_cost_usd), 4) as total_spent,
  MAX(kn_processed_at) as last_processed
FROM units WHERE kn_prompt_version = '1.6'

Remaining:

SELECT COUNT(*) as remaining
FROM units WHERE status = 'Current'
AND restricted_access = 0
AND (kn_prompt_version != '1.6' OR kn_prompt_version IS NULL)

Live dashboard: admin.rtopacks.com.au/compute


What NOT to Do

  • Don't trust hardcoded pricing constants. Check the pricing page before every run.
  • Don't run a model without confirming it's on the current pricing page. Deprecated models still work but cost more.
  • Don't assume 429 means rate limiting. It also means credit exhaustion. Add a circuit breaker.
  • Don't skip the 10-unit test run. Compare estimated vs actual cost before committing to the corpus.
  • Don't use wave batching — use a worker pool.
  • Don't set concurrency below 25 — you're just wasting time.
  • Don't run from a CF Worker — the 30-second CPU limit kills long-running generation.
  • Don't mix models within a run — one model, one prompt version, full corpus.

Document History

Date        Change
1 Apr 2026  Initial document — concurrency findings, wave vs pool, D1 bottleneck.
2 Apr 2026  Major revision. Cost incident documented. Deprecated-model pricing discovered ($3.96/M vs $0.38/M). Mandatory pre-run checklist added. Pricing tables updated. Circuit-breaker requirement added.

This document is required reading before any LLM generation run. If Claude or Alex references a model or pricing that contradicts this document, this document wins. Check the pricing page. Every time. No exceptions.