RTOpacks Radar — Product Specification
Document ID: RADAR-SPEC-01
Version: 0.2
Status: Living document — captures deployed system state + product intent
Location: docs/docs/workspace/apps/radar.md
Last updated: 17 April 2026
Changelog
| Version | Date | Changes |
|---|---|---|
| 0.1 | 17 Apr 2026 | Initial — full signal taxonomy, data model, crawl pipeline, UI surfaces, product positioning |
| 0.2 | 17 Apr 2026 | Enrichment pipeline, 24 promoted columns, compliance signals, link destination resolution, third-party/infrastructure domain split, provenance layer, dual screenshots, enrichment_data JSON schema |
How to Use This Document
This spec is the full-picture reference for Radar, also known as Field Observer when surfaced as a standalone product. It documents what the system is, what it observes, how it observes it, and what it does not do. Every build brief that touches Radar references this document as its source of truth.
This document does not constitute a brief. Briefs are scoped, sequenced, and issued separately. The reason for this is the same as People: the full system described here is too large to have been built in one pass, and some parts are still planned. What should never happen is a brief that contradicts this document, or a build that forecloses something described here.
Alex: read this document before touching any Radar code. The briefs tell you what to build. This document tells you why, and what "Radar" actually is.
What Radar Is
Radar is RTOpacks' intelligence layer. It assembles a structured dossier on every registered training organisation in Australia — their digital presence, their technology stack, their government registration status, and their public reputation — using only publicly observable data.
It does not access anything private. It does not authenticate to any system. It does not use AI to draw conclusions. It reads what is already published — sitemaps, DNS records, SSL certificates, government registers, social media profiles — and assembles it faster than any human team could, into a format where it can actually be seen and compared.
The product's core claim, stated plainly: Everything you see here was already public. We just connected the dots.
Who uses it
- RTOpacks operators (L3) use Radar to understand the RTO landscape — which organisations are active, which are dormant, what technology they run, how mature their digital presence is.
- RTO managers (L4) will eventually see their own Radar card — a mirror showing how their organisation appears from the outside. The slight vertigo of seeing your own business reflected back is the product's disarming move: the explanation is reassuring precisely because the mechanism is mundane.
- Regulatory analysts use the signal layers to identify anomalies — RTOs with expired SSL, no web presence, mismatched entity types, or dormant digital footprints despite active registrations.
What it is not
Radar is not surveillance. It does not monitor private communications, access authenticated systems, or track individual behaviour. Every signal it captures is available to anyone with a web browser and the patience to look. Radar's value is not in what it sees — it's in the fact that it sees everything, at the same time, and puts it somewhere useful.
Say this, not that:
| Say this | Not that |
|---|---|
| Public data, assembled | Data harvested |
| What's on the label | What we discovered |
| We read your sitemap | We analysed your site |
| The same information anyone could find | Proprietary intelligence |
| A mechanical process | An algorithm |
| We connected the dots | We drew conclusions |
| No AI | Powered by AI |
The Dossier Model
Every RTO has a dossier — a single record that anchors all intelligence collected about that organisation. The dossier is created automatically when any Radar surface first references an RTO. It persists indefinitely.
Dossier fields
| Field | Type | Description |
|---|---|---|
| id | TEXT PK | UUID |
| rto_code | TEXT UNIQUE | The RTO's national code from TGA |
| last_crawled_at | TEXT | ISO timestamp of most recent crawl completion |
| crawl_status | TEXT | Lifecycle: pending → queued → crawling → complete / failed |
| notes | TEXT | Free-text analyst notes (operator-entered) |
| created_at | TEXT | Auto |
| updated_at | TEXT | Auto |
The dossier is a shell. Intelligence lives in satellite tables keyed to rto_code, plus the radar_crawl_results table, which holds both the sitemap crawl data and the enrichment data from the Puppeteer pass.
Signal Layers
Radar organises intelligence into five layers, ordered by confidence — from government-reported facts at the top, to social signals at the bottom. Each layer is independently populated, independently timestamped, and independently refreshable.
Layer 1 — Government Reality
Confidence: Authoritative
Sources: Australian Business Register (ABR), ASIC, ASQA, ACNC
What it tells you: Is this entity real? Is it active? What kind of entity is it?
| Category | Signal Key | Description |
|---|---|---|
| abn | abn_value | The ABN itself |
| abn | abn_status | Active / Cancelled / other |
| abn | abn_status_effective | Date the status took effect |
| abn | abr_entity_name | Legal name as registered with ABR |
| abn | abr_last_updated | When ABR last modified this record |
| abn | entity_type_code | ABR entity type (company, sole trader, trust, etc.) |
| abn | entity_description | Human-readable entity description |
| abn | gst_status | GST registration status |
| abn | gst_registered_date | When GST was registered |
| abn | asic_number | ASIC company number (ACN) if applicable |
| abn | acnc_registered | Whether registered with ACNC (charities) |
| abn | charity_type | ACNC charity subtype |
| abn | dgr_endorsed | Deductible Gift Recipient status |
| abn | address_state | Registered business state |
| abn | address_postcode | Registered business postcode |
| classification | entity_type | Derived entity classification (school, TAFE, university, commercial, NFP, etc.) |
| classification | non_commercial_flag | Whether this is a non-commercial entity |
| classifier | entity_type_abn | Entity type derived from ABN data (highest precedence) |
| classifier | risk_flag | Risk indicators (multiple allowed) |
| registration | edu_au_anomaly | Anomaly flags on .edu.au domain usage |
Entity type classification precedence: ABN entity type > domain pattern > sweep inference > unknown. The entity badge in the UI follows this precedence chain.
Entity type values: school, tafe, university, government, association, sole_trader, trust, nfp, charity, commercial_rto, unknown_edu, unknown.
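The precedence chain above can be sketched in a few lines. This is a minimal Python illustration, not the deployed classifier: the function name `resolve_entity_type` and the rule that `None`, an empty string, or `"unknown"` all count as "no answer from this source" are assumptions.

```python
from typing import Optional


def resolve_entity_type(
    abn_type: Optional[str],
    domain_type: Optional[str],
    sweep_type: Optional[str],
) -> str:
    """Return the first usable classification in the spec's precedence order."""
    # Order mirrors the spec: ABN entity type > domain pattern > sweep inference.
    for candidate in (abn_type, domain_type, sweep_type):
        # Assumption: None, "" and "unknown" all mean "this source has no answer".
        if candidate and candidate != "unknown":
            return candidate
    return "unknown"
```

For example, an RTO whose ABN data says `charity` keeps that badge even if its domain pattern suggests `school`; the domain pattern is only consulted when the ABN source has nothing to say.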
Layer 2 — Public Presence
Confidence: Observed
Sources: DNS, WHOIS, SSL certificates, HTTP responses, Wayback Machine, TLS handshake (enrichment pass)
What it tells you: Does this organisation have a functioning web presence? How established is it? Is their infrastructure secure?
| Category | Signal Key | Source | Description |
|---|---|---|---|
| domain | has_edu_domain | DNS | Whether the RTO uses a .edu.au domain |
| domain | edu_website_live | HTTP | Whether the .edu.au site responds |
| domain | edu_mx_live | DNS | Whether .edu.au MX records are active |
| domain | domain_registered | WHOIS | Domain registration date |
| domain | domain_expiry | WHOIS | Domain expiry date |
| domain | registrar | WHOIS | Domain registrar |
| domain | registrant_privacy | WHOIS | Whether WHOIS privacy is enabled |
| domain | nameserver_provider | DNS | NS provider (Cloudflare, AWS Route53, etc.) |
| domain | namespace_active | DNS | Whether the domain actively resolves |
| domain | dnssec | DNS | DNSSEC enabled |
| domain | axfr_permitted | DNS | Zone transfer permitted (security flag) |
| email | contact_email | HTML | Publicly listed contact email |
| email | email_domain | HTML | Email domain used |
| email | email_domain_resolves | DNS | Whether the email domain resolves |
| ssl | ssl_issuer | TLS handshake | Certificate issuer (Let's Encrypt, DigiCert, Sectigo, etc.) |
| ssl | ssl_valid_to | TLS handshake | Certificate expiry date (ISO) |
| ssl | ssl_days_remaining | TLS handshake | Days until cert expiry (negative = expired) |
| ssl | ssl_cert_error | TLS handshake | Certificate error if any (expired, name mismatch, etc.) |
| ssl | hsts | HTTP header | HTTP Strict Transport Security enabled |
| redirect | final_url | HTTP | URL after all redirects (stale domain detection) |
| redirect | had_https_redirect | HTTP | Whether initial HTTP URL redirected to HTTPS |
| history | wayback_first_snapshot | Wayback API | Earliest Wayback Machine capture |
| history | wayback_latest_snapshot | Wayback API | Most recent Wayback Machine capture |
| classifier | web_age_years | Derived | Years since first Wayback snapshot |
| classifier | ssl_health | Derived | good / adequate / expiring_soon / critical |
Enrichment adds: ssl_issuer, ssl_valid_to, ssl_days_remaining, ssl_cert_error, final_url, and had_https_redirect as promoted columns on radar_crawl_results, populated by the local Puppeteer enrichment pass. The four ssl_* columns come from the TLS handshake; final_url and had_https_redirect come from the HTTP redirect chain. Full SSL details (subject, valid_from, protocol, SAN list note) are in the enrichment_data JSON blob under $.ssl.
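As an illustration of how ssl_days_remaining relates to ssl_valid_to, here is a minimal Python sketch. The function name and the choice to pass `now` explicitly are assumptions made for testability; the thresholds behind ssl_health are not stated in this spec, so they are deliberately left out.

```python
from datetime import datetime, timezone


def ssl_days_remaining(valid_to_iso: str, now: datetime) -> int:
    """Whole days between `now` and certificate expiry; negative once expired."""
    # The spec's examples use a trailing "Z"; normalise it to an explicit offset.
    expiry = datetime.fromisoformat(valid_to_iso.replace("Z", "+00:00"))
    # timedelta.days floors toward negative infinity, so a certificate that
    # expired one hour ago already reports -1.
    return (expiry - now).days
```

Using the certificate from the enrichment_data example (valid to 23 June 2026) and this document's date (17 April 2026) gives 67 days remaining.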
Layer 3 — Technical Fingerprint
Confidence: Observed / Inferred
Sources: HTTP response headers, HTML source analysis, rendered DOM, network request interception, known platform patterns
What it tells you: What technology runs this site? How sophisticated is it? What compliance signals are visible?
Fingerprint signals
| Category | Signal Key | Source | Description |
|---|---|---|---|
| fingerprint | page_title | DOM | HTML <title> (promoted column) |
| fingerprint | meta_description | DOM | Meta description |
| fingerprint | meta_keywords | DOM | Meta keywords (legacy SEO) |
| fingerprint | og_title | DOM | OpenGraph title |
| fingerprint | og_site_name | DOM | OpenGraph site name |
| fingerprint | og_description | DOM | OpenGraph description |
| fingerprint | og_image_url | DOM | OpenGraph image URL |
| fingerprint | twitter_card | DOM | Twitter card type |
| fingerprint | twitter_site | DOM | Twitter @handle |
| fingerprint | generator | DOM | CMS generator tag (WordPress, Squarespace, etc.) (promoted column) |
| fingerprint | server_header | HTTP | HTTP Server header |
| fingerprint | x_powered_by | HTTP | HTTP X-Powered-By header |
| fingerprint | robots_meta | DOM | Robots meta directive |
| fingerprint | copyright_year | DOM | Copyright year found in footer |
| fingerprint | mobile_responsive | DOM | Viewport meta tag exists and is not fixed-width (promoted column) |
| fingerprint | canonical_url | DOM | <link rel="canonical"> URL |
| fingerprint | viewport_meta | DOM | Raw viewport meta content string |
| fingerprint | wp_rest_api_exposed | HTTP | WordPress REST API publicly accessible |
Analytics and pixel detection
Detected via dual strategy: HTML source regex AND observed network requests during page load.
| Signal Key | Detection | Description |
|---|---|---|
| gtm + gtm_id | HTML regex (GTM-*) + network (googletagmanager.com) | Google Tag Manager |
| ga4 + ga4_id | HTML regex (G-*) + network (google-analytics.com) | Google Analytics 4 |
| google_ads + google_ads_id | HTML regex (AW-*) + network (googleadservices.com) | Google Ads conversion |
| meta_pixel + meta_pixel_id | HTML regex (fbq('init') + network (connect.facebook.net) | Meta/Facebook pixel |
| tiktok_pixel + tiktok_pixel_id | HTML regex (ttq.load) + network (analytics.tiktok.com) | TikTok pixel |
| linkedin_insight + linkedin_insight_id | HTML regex (_linkedin_partner_id) + network (snap.licdn.com) | LinkedIn Insight |
| hubspot_tracking | HTML regex (hs-script-loader) + network (js.hs-scripts.com) | HubSpot tracking |
| hotjar | HTML regex (hotjar.com) + network (hotjar.com) | Hotjar |
| intercom | HTML regex (intercomSettings) + network (intercom.io) | Intercom |
| zendesk_chat | HTML regex (zE() + network (zendesk.com) | Zendesk Chat |
| tawk_chat | HTML regex (tawk.to) + network (embed.tawk.to) | Tawk.to Chat |
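The dual strategy can be illustrated with a short Python sketch (the production script is JavaScript; this is only a model of the logic). The two ID regexes and the function name `detect_tag` are assumptions: the spec names the ID prefixes but not the exact patterns. The `observed_via` value follows the provenance strings used elsewhere in this document.

```python
import re

# Hypothetical ID patterns; the spec gives only the prefixes (GTM-*, G-*).
GTM_ID = re.compile(r"\bGTM-[A-Z0-9]+\b")
GA4_ID = re.compile(r"\bG-[A-Z0-9]{6,}\b")


def detect_tag(html, request_hosts, id_pattern, network_domain):
    """Combine an HTML-source regex hit with observed network hostnames."""
    match = id_pattern.search(html)
    on_network = any(
        host == network_domain or host.endswith("." + network_domain)
        for host in request_hosts
    )
    sources = []
    if match:
        sources.append("html_regex")
    if on_network:
        sources.append("network_request")
    return {
        "detected": bool(sources),  # either strategy alone is enough
        "id": match.group(0) if match else None,
        "observed_via": "+".join(sources) if sources else None,
    }
```

A tag injected by a tag manager at runtime never appears in the HTML source, so the network check still catches it; in that case the ID stays `None` because only the regex can recover it.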
Compliance signals
Mechanical observations from the rendered homepage. No inference, no model calls. All regex-based or link-presence checks.
| Signal Key | Type | Source | Description |
|---|---|---|---|
| rto_code_on_homepage | BOOLEAN | Text regex | RTO's own national code visible in page text (promoted column) |
| cricos_code_on_homepage | BOOLEAN | Text regex | CRICOS provider code visible (promoted column) |
| cricos_code_value | TEXT | Text regex | The matched CRICOS code (e.g. 00001K) (promoted column) |
| usi_mentioned | BOOLEAN | Text regex | "USI" or "Unique Student Identifier" in page text |
| privacy_policy_link_present | BOOLEAN | Link check | <a> with text containing "privacy" exists (promoted column) |
| privacy_policy_url | TEXT | Link href | URL of the privacy policy link |
| complaints_link_present | BOOLEAN | Link check | <a> with text containing "complaint", "grievance", or "appeals" (promoted column) |
| complaints_link_url | TEXT | Link href | URL of the complaints link |
| accessibility_statement_present | BOOLEAN | Link check | <a> with text containing "accessibility" or "wcag" (promoted column) |
| accessibility_url | TEXT | Link href | URL of the accessibility statement |
| government_program_mentions | JSON array | Text regex | Detected mentions of: Smart and Skilled, Free TAFE, Fee-Free TAFE, VET Student Loans, Australian Apprenticeships, JobTrainer, User Choice, Construction Blueprint, Skills First, Higher Level Skills |
| third_party_enrolment_system | BOOLEAN | Link resolution | "Enrol" or "Apply" links resolve to known external platforms (promoted column) |
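A minimal Python sketch of the text-regex checks in the table above. The patterns are assumptions: the CRICOS shape (five digits plus a capital letter) is inferred from the spec's example 00001K, and the function name `parse_compliance` is hypothetical. The production patterns may be stricter.

```python
import re

# Assumed shape, based on the example code 00001K in the table above.
CRICOS_CODE = re.compile(r"\b\d{5}[A-Z]\b")
# "USI" is matched case-sensitively as a whole word; the long form is not.
USI = re.compile(r"\bUSI\b|(?i:unique student identifier)")


def parse_compliance(page_text: str, rto_code: str) -> dict:
    """Purely mechanical checks on rendered page text; no inference."""
    cricos = CRICOS_CODE.search(page_text)
    own_code = re.search(rf"\b{re.escape(rto_code)}\b", page_text)
    return {
        "rto_code_on_homepage": own_code is not None,
        "cricos_code_on_homepage": cricos is not None,
        "cricos_code_value": cricos.group(0) if cricos else None,
        "usi_mentioned": USI.search(page_text) is not None,
    }
```

A bare whole-word match on a short numeric RTO code can also hit unrelated numbers on the page, which is one reason these remain observations rather than conclusions.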
Layer 4 — Integration Signals
Confidence: Observed / Inferred
Sources: MX records, DNS TXT records, HTML source, subdomain fingerprinting, network request interception, link destination resolution
What it tells you: What third-party services does this organisation depend on?
| Category | Signal Key | Description |
|---|---|---|
| mail | mail_platform | Email platform (Google Workspace, Microsoft 365, etc.) |
| mail | email_domain_mail_platform | Mail platform on the email domain specifically |
| mail | spf_present | SPF record exists |
| mail | dkim_present | DKIM record exists |
| mail | dmarc_present | DMARC record exists |
| mail | dmarc_policy | DMARC policy (none, quarantine, reject) |
| lms | lms_platform | LMS platform detected (from signals table) |
| lms | lms_url | URL of detected LMS |
| lms | lms_hosting_same_as_main | Whether LMS is on same infrastructure |
| lms | lms_platform_detected | LMS detected via link resolution (promoted column on crawl_results) |
| sms | sms_platform_detected | Student management system detected via link resolution (promoted column) |
| integration | third_party_service | Named third-party service detected |
| classifier | email_maturity | Derived: mature / moderate / basic / none |
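To show how a signal such as dmarc_policy can be read mechanically from a DNS TXT record, here is a minimal Python sketch. The function name is hypothetical, and fetching the TXT record (published at `_dmarc.<domain>`) is left to the caller. The rule for deriving email_maturity is not specified in this document, so it is not modelled here.

```python
from typing import Optional


def parse_dmarc_policy(txt_record: str) -> Optional[str]:
    """Return 'none', 'quarantine' or 'reject', or None if not a DMARC record."""
    tags = {}
    # DMARC records are semicolon-separated tag=value pairs.
    for part in txt_record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    # A DMARC record must identify itself with v=DMARC1.
    if tags.get("v", "").upper() != "DMARC1":
        return None
    policy = tags.get("p", "").lower()
    return policy if policy in {"none", "quarantine", "reject"} else None
```

In this sketch, dmarc_present would correspond to a non-`None` result, and dmarc_policy to the returned value.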
Layer 5 — Human Signals
Confidence: Inferred (lowest)
Sources: Social media platforms, Seek job listings, Wikipedia
What it tells you: Is this organisation active in the public sphere? Is it hiring?
| Category | Signal Key | Description |
|---|---|---|
| social | social_facebook | Facebook URL |
| social | social_linkedin | LinkedIn URL |
| social | social_instagram | Instagram URL |
| social | social_twitter | Twitter/X URL |
| social | social_youtube | YouTube URL |
| social | social_tiktok | TikTok URL |
| social | facebook_blocked | Whether Facebook profile was blocked/private |
| social | linkedin_blocked | Whether LinkedIn profile was blocked/private |
| seek | seek_active_listings | Number of active Seek job listings |
| seek | seek_locations | Locations in Seek listings |
| reputation | wikipedia_present | Whether a Wikipedia article exists |
| reputation | wikipedia_title | Wikipedia article title |
| reputation | wikipedia_description | Wikipedia article summary |
| classifier | hiring_signal | Derived: active_hiring / dormant |
Confidence Tiers
Every signal carries a confidence level. Three tiers:
| Tier | Meaning | Example |
|---|---|---|
| Authoritative | Government-reported, citable, legally binding | ABN status from ABR |
| Observed | Directly detected by automated sweep, timestamped, repeatable | SSL certificate issuer from TLS handshake |
| Inferred | Derived from authoritative or observed data — a reasonable conclusion, not a direct observation | Hosting tier derived from IP/ASN lookup |
The UI distinguishes these with colour-coded confidence pills. Authoritative signals are presented as facts. Inferred signals are presented as inferences. This distinction is not cosmetic — it determines whether the signal is citable in a regulatory context.
Stack Profile
Separate from the signal layers, each RTO has a stack profile — a table of detected technologies organised by surface area. This is the "what are they running?" view.
Surfaces
| Surface | What it covers |
|---|---|
| frontend | CMS, static site generator, JavaScript framework |
| lms | Learning management system (Canvas, Moodle, Blackboard, etc.) |
| email | Email platform (Google Workspace, Microsoft 365, Zoho, etc.) |
| hosting | Cloud provider, CDN, shared hosting |
| cdn | Content delivery network |
| analytics | Analytics and tracking tools |
| payment | Payment processing (Stripe, PayPal, etc.) |
| student_portal | Student management system / portal |
| other | Anything that doesn't fit above |
Each entry records: surface, technology name, description, cost range estimate, vendor URL, confidence level, and observation timestamp.
Crawl Pipeline
Data flow
The crawl pipeline is a Cloudflare Worker (radar-crawl) that processes RTOs in batches. For each RTO:
- Resolve web address — query rto_web_addresses in rto-nrt-db
- Fetch robots.txt — extract Sitemap: directives
- Fetch and parse sitemap — handle both sitemap.xml and sitemap_index.xml; cap at 500 URLs per RTO
- Classify URLs — pattern-match against known categories
- Detect external platforms — hostname matching
- Write to D1 — structured JSON in radar_crawl_results
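The robots.txt and sitemap steps can be sketched as below. This is an illustrative Python model, not the Worker's code: the function names are hypothetical, fetching is left to the caller, and applying the 500-URL cap only to page listings (not to a sitemap index's list of child sitemaps) is an assumption.

```python
import re
import xml.etree.ElementTree as ET

MAX_URLS_PER_RTO = 500  # cap stated in the spec


def sitemap_urls_from_robots(robots_txt: str) -> list:
    """Collect every 'Sitemap:' directive, case-insensitively."""
    pattern = r"(?im)^\s*sitemap:\s*(\S+)"
    return [m.group(1) for m in re.finditer(pattern, robots_txt)]


def _local(tag: str) -> str:
    # Strip an XML namespace such as {http://www.sitemaps.org/...}urlset.
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple:
    """Return (kind, locs) where kind is 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text)
    kind = _local(root.tag)
    locs = [
        child.text.strip()
        for entry in root            # <url> or <sitemap> elements
        for child in entry           # only direct <loc> children
        if _local(child.tag) == "loc" and child.text
    ]
    if kind == "urlset":
        locs = locs[:MAX_URLS_PER_RTO]
    return kind, locs
```

When `kind` is `sitemapindex`, the returned locations are child sitemaps that would each be fetched and parsed in turn, with the cap applied to the combined page list.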
URL classification categories
| Category | Pattern matches |
|---|---|
| courses | /courses/, /qualifications/, /training/, /cert-, /diploma-, /certificate- |
| enrol | /enrol, /apply, /register, /enquire, /enquiry |
| student | /student, /portal, /my-, /login, /lms |
| contact | /contact, /locations, /find-us |
| policies | /usi, /fees, /refund, /complaints, /privacy, /terms |
| about | /about, /team, /staff, /governance |
| news | /news, /blog, /events |
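A minimal Python sketch of classification against the table above. The patterns are copied from the table; treating them as case-insensitive substrings of the URL path, letting the first matching category win in table order, and returning `None` for unmatched URLs are all assumptions.

```python
from urllib.parse import urlparse

# Patterns copied from the table above; dict order is the assumed precedence.
URL_CATEGORIES = {
    "courses": ["/courses/", "/qualifications/", "/training/", "/cert-", "/diploma-", "/certificate-"],
    "enrol": ["/enrol", "/apply", "/register", "/enquire", "/enquiry"],
    "student": ["/student", "/portal", "/my-", "/login", "/lms"],
    "contact": ["/contact", "/locations", "/find-us"],
    "policies": ["/usi", "/fees", "/refund", "/complaints", "/privacy", "/terms"],
    "about": ["/about", "/team", "/staff", "/governance"],
    "news": ["/news", "/blog", "/events"],
}


def classify_url(url: str):
    """Return the first category whose pattern appears in the URL path."""
    path = urlparse(url).path.lower()
    for category, patterns in URL_CATEGORIES.items():
        if any(pattern in path for pattern in patterns):
            return category
    return None  # unclassified
```

With first-match-wins, a path such as `/student/login` is counted once under `student` rather than twice.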
External platform detection
Detected via both sitemap hostname matching (crawl pipeline) and link destination resolution (enrichment pipeline).
| Platform | Hostname pattern |
|---|---|
| Canvas LMS | *.instructure.com |
| Canvas Catalog | *.catalog.instructure.com |
| Moodle | moodle.*, *.moodlecloud.com, *.moodlesites.com, *.mdl2.com |
| Blackboard | *.blackboard.com |
| Blackboard Collaborate | *.bbcollab.com |
| D2L Brightspace | *.brightspace.com |
| Teachable | *.teachable.com |
| Thinkific | *.thinkific.com |
| LinkedIn Learning | learning.linkedin.com |
| Wisenet SMS | *.wisenet.co |
| aXcelerate | *.axcelerate.com(.au)? |
| VETtrak | *.vettrak.com(.au)? |
| JobReady | *.jobready.com(.au)? |
| RTO Manager | *.rtomanager.(com\|com.au\|net) |
| Cliniko | *.cliniko.com |
| Google Forms | docs.google.com/forms, forms.gle |
| Microsoft Forms | forms.office.com, forms.microsoft.com |
| SurveyMonkey | *.surveymonkey.com |
| JotForm | *.jotform.com |
| Typeform | *.typeform.com |
| Eventbrite | *.eventbrite.com(.au)? |
| SharePoint | *.sharepoint.com |
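Hostname matching against this table can be sketched as follows. This Python excerpt covers only five of the platforms, translated by hand into regular expressions; the translation, the function name, and the choice to test the more specific Canvas Catalog pattern before Canvas LMS are assumptions.

```python
import re

# Hand-translated excerpt of the table above (not the full production list).
# More specific patterns come first so Canvas Catalog is not reported as Canvas LMS.
PLATFORM_PATTERNS = [
    ("Canvas Catalog", re.compile(r"\.catalog\.instructure\.com$")),
    ("Canvas LMS", re.compile(r"\.instructure\.com$")),
    ("Moodle", re.compile(r"^moodle\.|\.(moodlecloud|moodlesites|mdl2)\.com$")),
    ("aXcelerate", re.compile(r"\.axcelerate\.com(\.au)?$")),
    ("VETtrak", re.compile(r"\.vettrak\.com(\.au)?$")),
]


def detect_platform(hostname: str):
    """Return the first platform whose pattern matches the hostname, else None."""
    host = hostname.lower().rstrip(".")
    for name, pattern in PLATFORM_PATTERNS:
        if pattern.search(host):
            return name
    return None
```

The same function would serve both detection paths described above: sitemap hostnames in the crawl pipeline and resolved final URLs in the enrichment pipeline.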
Enrichment Pipeline
Overview
The enrichment pipeline is a local Puppeteer script (tools/radar-enrich-local.mjs) that visits every RTO's homepage with a full Chromium browser instance and captures every signal extractable from that single visit. It runs on Tim's Mac Mini M2 Pro, not on Cloudflare Workers.
Why local, not CF Workers
- CF Browser Rendering has a 2-concurrent-session limit — bulk work is infeasible
- Local Puppeteer runs 10+ parallel Chromium instances with no platform-imposed concurrency ceiling (the practical limit is local CPU and memory)
- Local execution keeps the CF Browser Rendering budget free for ad-hoc per-RTO refreshes
- The previous corpus screenshot ingest used this exact pattern successfully
Three lifecycle modes
| Mode | Tool | Trigger | Cadence | Volume |
|---|---|---|---|---|
| Bulk ingest | Local Puppeteer (tools/radar-enrich-local.mjs) | Manual | On demand | All ~12,500 RTOs |
| Monthly refresh | Same local script | Manual, 1st of month | Monthly | All RTOs |
| Ad-hoc single | CF Browser Rendering via radar-crawl worker /single endpoint | "Trigger crawl" button in RadarTab UI | On demand | 1 RTO |
What the enrichment captures per RTO (single visit)
Screenshots (two per RTO):
| Viewport | Size | R2 key |
|---|---|---|
| Desktop | 1280×800, JPEG q75 | radar-screenshots/{rto_code}/homepage.jpg |
| Mobile | 375×812, isMobile: true, deviceScaleFactor: 2, JPEG q75 | radar-screenshots/{rto_code}/homepage-mobile.jpg |
Both captured on the same page load — desktop first, then resize to mobile with 500ms reflow wait.
SSL cert details — from the TLS handshake Chromium performs. Captured even on cert errors (--ignore-certificate-errors flag). An expired cert is itself a high-value signal.
HTTP response headers — full headers stored as JSON in enrichment_data.response_headers. Server, X-Powered-By, HSTS promoted to queryable fields.
Layer 3 fingerprints — all meta tags, OG data, generator, viewport, canonical URL, copyright year from the rendered DOM via page.evaluate().
Analytics/pixel detection — dual HTML regex + network request interception. IDs captured where possible (GTM-*, G-*, AW-*).
Compliance signals — RTO code visibility, CRICOS code, USI mentions, privacy/complaints/accessibility links, government program mentions. All regex-based, mechanical, no inference.
Third-party domain classification — every outbound network request logged, deduplicated by hostname, classified as vendor or infrastructure (see below).
Link destination resolution — homepage links in categories enrol, student, contact, plus all external links, resolved via HEAD request (GET fallback on 405). Max 20 per RTO, 10s timeout. Platform detection on final URL.
HTML archive — full rendered DOM, gzipped, stored in R2 at radar-html/{rto_code}/{timestamp}.html.gz. Enables future re-mining without re-crawling.
Favicon — fetched from detected <link rel="icon"> or fallback /favicon.ico. Stored at radar-favicons/{rto_code}/favicon.{ext}.
Performance — load time to domcontentloaded (ms), request count, page weight estimate.
Per-field partial failure
Each capture operation is wrapped in try/catch. If one fails (e.g. favicon 404), the rest of the RTO's enrichment continues. Success is tracked per-field in enrichment_success_flags:
{
"screenshot_desktop": 1,
"screenshot_mobile": 1,
"html_archived": 1,
"ssl_captured": 1,
"headers_captured": 1,
"metadata_extracted": 1,
"analytics_detected": 1,
"compliance_parsed": 1,
"third_party_captured": 1,
"performance_captured": 1,
"link_destinations_resolved": 1,
"favicon_captured": 0
}
Analysts can query for RTOs where a given flag is 0 (for example, link_destinations_resolved) and re-run just that capture without repeating the full visit.
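The per-field pattern can be modelled in a few lines of Python. This is a sketch, not the production script: `run_captures` and `rtos_needing_rerun` are hypothetical names, and each capture is represented as a zero-argument callable.

```python
import json


def run_captures(captures: dict) -> tuple:
    """Run each capture independently; one failure never aborts the rest."""
    results, flags = {}, {}
    for name, capture in captures.items():
        try:
            results[name] = capture()
            flags[name] = 1
        except Exception:
            # Record the failure and carry on with the remaining captures.
            flags[name] = 0
    return results, flags


def rtos_needing_rerun(rows, flag: str) -> list:
    """rows: (rto_code, enrichment_success_flags JSON) pairs."""
    return [
        rto_code
        for rto_code, flags_json in rows
        if json.loads(flags_json).get(flag, 0) == 0
    ]
```

Treating a missing flag the same as 0 is an assumption; it means an RTO enriched before a new capture type was introduced is also picked up for that capture.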
Third-Party Domain Classification
Every outbound network request during a page load is logged. After navigation completes, each unique hostname is classified as vendor (real service/product relationship) or infrastructure (CDN/font/analytics plumbing).
Classification rules
- Check the hostname against the vendor patterns list (regex, takes precedence)
- Check the registrable domain against the infrastructure set
- Unknown domains default to vendor (conservative — better to over-surface than silently filter)
First-party filtering
Requests to the RTO's own domain or subdomains are excluded. Filtering uses the final URL's registrable domain (after redirects), not the stored web_address — handles domain migrations like www.cit.act.edu.au → cit.edu.au.
Vendor patterns (62 patterns)
Organised by category. Full list maintained in tools/radar-enrich-local.mjs. Key categories:
- LMS: Canvas, Moodle (Cloud/Sites/mdl2), Blackboard, Brightspace, Teachable, Thinkific, LinkedIn Learning
- SMS/RTO: Wisenet, aXcelerate, VETtrak, JobReady, RTO Manager, Cliniko
- Form/Survey: Typeform, JotForm, SurveyMonkey, Microsoft Forms
- Chat/Support: Intercom, Tawk.to, Zendesk, Freshdesk, Drift, Crisp
- Consent management: Cookiebot, OneTrust, CookieLaw, Osano
- Analytics: Hotjar, Cloudflare Web Analytics, HubSpot, Segment, Mixpanel, Amplitude
- Booking: Calendly, Acuity, TryBooking, Humanitix, Eventbrite
- Maps: Google Maps (maps.googleapis.com, maps.google.com)
- reCAPTCHA: recaptcha.net, google.com/recaptcha
- Video conferencing: Zoom
- Payment: Stripe, PayPal, Braintree, Square, Westpac PayWay, eWAY, SecurePay
- Email marketing: Mailchimp
- CRM: Salesforce, Pardot, Zoho, Pipedrive
- Microsoft: SharePoint
Infrastructure set (33 domains)
CDN/static assets (gstatic, googleapis, cloudflare, jsdelivr, bootstrapcdn, fontawesome, typekit), ad/tracking plumbing (googletagmanager, google-analytics, doubleclick, googlesyndication, facebook.net), social embed loaders (platform.twitter, platform.linkedin), WordPress infrastructure (wp.com, s.w.org), video embeds (youtube, vimeo).
Note: google.com is NOT in the infrastructure set — it's too broad. Specific Google services (Maps, reCAPTCHA) are handled by vendor patterns. Unclassified google.com requests fall through to vendor-unknown.
Output
Two arrays in enrichment_data:
- third_party_domains — vendor relationships: ["static.hotjar.com (Hotjar)", "js-ap1.hubspot.com (HubSpot)", "unknown-vendor.com"]
- infrastructure_domains — plumbing: ["www.googletagmanager.com", "connect.facebook.net", "fonts.gstatic.com"]
Named vendors include the detected name in parentheses. Unknown vendors show hostname only.
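The classification rules, first-party filtering, and output format can be sketched together. This Python model uses tiny illustrative excerpts of the vendor and infrastructure lists, and its `registrable_domain` helper is a deliberately naive stand-in: a real implementation would consult the Public Suffix List, which also covers deeper suffixes such as state-level .edu.au zones. All names here are hypothetical.

```python
import re

# Small illustrative excerpts, not the production lists.
VENDOR_PATTERNS = [
    (re.compile(r"(^|\.)hotjar\.com$"), "Hotjar"),
    (re.compile(r"(^|\.)hubspot\.com$"), "HubSpot"),
    (re.compile(r"^maps\.(googleapis|google)\.com$"), "Google Maps"),
]
INFRASTRUCTURE_DOMAINS = {"googletagmanager.com", "facebook.net", "gstatic.com", "googleapis.com"}
# Naive stand-in for the Public Suffix List.
TWO_LEVEL_SUFFIXES = {"com.au", "edu.au", "net.au", "org.au", "gov.au"}


def registrable_domain(hostname: str) -> str:
    labels = hostname.lower().rstrip(".").split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def classify_hosts(request_hosts, final_url_host: str) -> dict:
    first_party = registrable_domain(final_url_host)
    vendors, infrastructure = [], []
    for host in sorted({h.lower() for h in request_hosts}):  # dedupe by hostname
        if registrable_domain(host) == first_party:
            continue  # the RTO's own domain or subdomain
        label = next((name for pat, name in VENDOR_PATTERNS if pat.search(host)), None)
        if label:  # vendor patterns take precedence
            vendors.append(f"{host} ({label})")
        elif registrable_domain(host) in INFRASTRUCTURE_DOMAINS:
            infrastructure.append(host)
        else:  # unknown domains default to vendor
            vendors.append(host)
    return {"third_party_domains": vendors, "infrastructure_domains": infrastructure}
```

Note how `maps.googleapis.com` is reported as a Google Maps vendor relationship even though `googleapis.com` is in the infrastructure set, because the vendor check runs first.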
Link Destination Resolution
Homepage links matching specific categories are resolved to their final URL via HTTP HEAD request, with platform detection on the resolved hostname.
Categories resolved
- enrol — links containing "enrol", "apply", "register", "enquire"
- student — links containing "student", "portal", "login", "lms", "my-"
- contact — links containing "contact", "location", "find us"
- external — any link to a hostname different from the RTO's own domain
Resolution method
- Extract matching <a> tags from rendered DOM (max 30 pre-filtered)
- Take first 20 links (hard cap)
- For each: HTTP HEAD request, follow redirects, 10s timeout
- If HEAD returns 405: retry with GET
- Capture final URL after redirects
- Pattern-match final hostname against platform detection list
Output
{
"link_text": "Student Login",
"href": "/student-portal",
"final_url": "https://myschool.canvas.instructure.com/login",
"category": "student",
"platform_detected": "Canvas LMS"
}
Platform detections from link resolution promote to lms_platform_detected, sms_platform_detected, and third_party_enrolment_system columns on radar_crawl_results.
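The HEAD-then-GET resolution logic can be modelled without committing to a particular HTTP client. In this Python sketch the transport is injected as a hypothetical `request(method, url)` callable that returns `(status, final_url)` and is expected to follow redirects and enforce the 10s timeout itself; recording a failed resolution as `None` is also an assumption.

```python
from typing import Callable, Optional, Tuple

MAX_LINKS = 20  # hard cap from the spec

# Hypothetical transport signature: (method, url) -> (status_code, final_url)
Request = Callable[[str, str], Tuple[int, str]]


def resolve_link(href: str, request: Request) -> Optional[str]:
    """Resolve one link to its final URL, falling back to GET on 405."""
    try:
        status, final_url = request("HEAD", href)
        if status == 405:  # server rejects HEAD; retry with GET
            status, final_url = request("GET", href)
        return final_url
    except Exception:
        return None  # timeout or network error; leave unresolved


def resolve_links(links: list, request: Request) -> list:
    """Resolve at most MAX_LINKS links, preserving their other fields."""
    return [
        {**link, "final_url": resolve_link(link["href"], request)}
        for link in links[:MAX_LINKS]
    ]
```

Injecting the transport keeps the cap and fallback logic testable with a fake, independent of whichever HTTP library the real script uses.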
Provenance
Every enrichment run records an observed_via map in enrichment_data showing which detection strategy produced each signal:
{
"ssl_issuer": "tls_handshake",
"page_title": "dom_query",
"generator": "meta_tag",
"ga4": "html_regex+network_request",
"rto_code_on_homepage": "html_text_regex",
"link_destinations": "href_head_resolution",
"third_party_domains": "network_request_listener"
}
Purpose: when a signal returns surprising coverage (e.g. Meta Pixel at 5%), the provenance map tells you which detection strategy was used, enabling targeted improvement without re-crawling.
Crawl Results Schema (complete)
CREATE TABLE radar_crawl_results (
-- Core (from sitemap crawl)
rto_code TEXT PRIMARY KEY,
crawled_at TEXT NOT NULL,
web_address TEXT,
has_sitemap INTEGER DEFAULT 0,
sitemap_url TEXT,
page_count INTEGER DEFAULT 0,
last_modified TEXT,
screenshot_r2_key TEXT, -- desktop: {rto_code}/homepage.jpg
screenshot_failed INTEGER DEFAULT 0,
no_web_presence INTEGER DEFAULT 0,
classified_urls TEXT, -- JSON
subdomains TEXT, -- JSON array
external_platforms TEXT, -- JSON array
raw_sitemap_sample TEXT, -- JSON array, first 50 URLs
crawl_status TEXT DEFAULT 'pending',
crawl_error TEXT,
-- Enrichment: redirect (v0.2)
final_url TEXT,
had_https_redirect INTEGER DEFAULT 0,
-- Enrichment: SSL (v0.2)
ssl_issuer TEXT,
ssl_valid_to TEXT,
ssl_days_remaining INTEGER,
ssl_cert_error TEXT,
-- Enrichment: fingerprint (v0.2)
page_title TEXT,
generator TEXT,
mobile_responsive INTEGER DEFAULT 0,
-- Enrichment: compliance (v0.2)
rto_code_on_homepage INTEGER DEFAULT 0,
cricos_code_on_homepage INTEGER DEFAULT 0,
cricos_code_value TEXT,
complaints_link_present INTEGER DEFAULT 0,
privacy_policy_link_present INTEGER DEFAULT 0,
accessibility_statement_present INTEGER DEFAULT 0,
-- Enrichment: platform detection (v0.2)
lms_platform_detected TEXT,
sms_platform_detected TEXT,
third_party_enrolment_system INTEGER DEFAULT 0,
-- Enrichment: performance (v0.2)
load_time_ms INTEGER,
-- Enrichment: artifacts (v0.2)
screenshot_mobile_r2_key TEXT,
favicon_r2_key TEXT,
html_archive_r2_key TEXT,
-- Enrichment: tracking (v0.2)
enrichment_completed_at TEXT,
enrichment_success_flags TEXT, -- JSON per-field success map
enrichment_data TEXT -- JSON blob (see structure below)
);
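Promoted columns exist so that common questions can be answered with plain SQL, without unpacking the JSON blob. As a minimal sketch, the Python function below uses the standard-library sqlite3 module as a local stand-in for D1; the function name and the 30-day window are assumptions, not part of the spec.

```python
import sqlite3


def expiring_certificates(conn: sqlite3.Connection, within_days: int = 30) -> list:
    """RTOs whose certificate has expired or expires within `within_days`."""
    cursor = conn.execute(
        "SELECT rto_code, ssl_days_remaining "
        "FROM radar_crawl_results "
        "WHERE ssl_days_remaining IS NOT NULL "  # skip RTOs with no SSL capture
        "AND ssl_days_remaining <= ? "
        "ORDER BY ssl_days_remaining",           # most overdue first
        (within_days,),
    )
    return cursor.fetchall()
```

Because ssl_days_remaining is negative for expired certificates, a single `<=` comparison covers both "already expired" and "expiring soon".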
enrichment_data JSON structure
{
"ssl": {
"subject": "cit.edu.au",
"issuer": "E8",
"valid_from": "2026-03-25T01:34:10.000Z",
"valid_to": "2026-06-23T01:34:09.000Z",
"protocol": "TLS 1.3",
"san_list_note": "Not available via Puppeteer SecurityDetails API"
},
"response_headers": {
"server": "cloudflare",
"content-type": "text/html; charset=utf-8",
"strict-transport-security": "max-age=31536000"
},
"meta": {
"page_title": "Home : Canberra Institute of Technology",
"meta_description": "...",
"og_title": "...",
"og_site_name": "CIT",
"generator": null,
"viewport_meta": "width=device-width, initial-scale=1",
"mobile_responsive": true,
"copyright_year": "2026",
"canonical_url": "https://cit.edu.au/",
"favicon_url": "/favicon.ico"
},
"analytics": {
"gtm": true, "gtm_id": "GTM-XXXX",
"ga4": true, "ga4_id": "G-XXXX",
"meta_pixel": false,
"hotjar": true,
"intercom": false
},
"compliance": {
"rto_code_present": true,
"cricos_code_present": true,
"cricos_code_value": "00001K",
"usi_mentioned": true,
"privacy_policy_link_present": true,
"privacy_policy_url": "https://cit.edu.au/policies/privacy_policy",
"complaints_link_present": true,
"complaints_link_url": "https://cit.edu.au/about/student-and-community-member-complaints",
"accessibility_statement_present": false,
"accessibility_url": null,
"government_program_mentions": ["Free TAFE", "Fee-Free TAFE"]
},
"third_party_domains": [
"static.hotjar.com (Hotjar)",
"script.hotjar.com (Hotjar)",
"www.facebook.com"
],
"infrastructure_domains": [
"www.googletagmanager.com",
"connect.facebook.net",
"www.google-analytics.com"
],
"link_destinations": [
{
"link_text": "How to apply",
"href": "https://cit.edu.au/study/apply",
"final_url": "https://cit.edu.au/study/apply",
"category": "enrol",
"platform_detected": null
}
],
"performance": {
"page_weight_kb": 22413,
"request_count": 205
},
"observed_via": {
"ssl_issuer": "tls_handshake",
"page_title": "dom_query",
"ga4": "html_regex+network_request"
}
}
IP Observations
Each RTO's domain is resolved to IP addresses, and each IP is enriched:
| Field | Description |
|---|---|
| hostname | The domain resolved |
| ip_address | Resolved IP (A or AAAA record) |
| ip_version | 4 or 6 |
| ptr_record | Reverse DNS |
| asn / asn_org | Autonomous System Number and organisation |
| network_block | CIDR block |
| country_code / city | Geolocation |
| is_shared | Whether the IP hosts multiple domains |
| hosting_provider | Derived provider name |
UI Surfaces¶
Field Observer Map (Layer 0)¶
The default view when opening any RTO's Radar tab. Visible without clicking "Show detail."
Components: - Homepage screenshot — full-width, 240px tall, from R2 - React Flow topology map — horizontal flow graph showing the site structure: - Root node — homepage, domain name label - Section nodes — one per classified URL category with count badge (e.g. "Courses (31)") - Subdomain nodes — dashed blue border, separate column - External platform nodes — amber, rightmost column (e.g. "Canvas LMS") - Meta bar — domain, last crawled date, page count, sitemap status
Progressive disclosure:
- Layer 0 (always visible): screenshot + topology map + meta
- Layer 1 (tooltip on node hover): single most relevant signal
- Layer 2 (node click): opens URL in new tab
- Layer 3 ("Show detail" toggle): full intelligence panel
Graceful degradation:
- No web presence: "No web address recorded for this RTO" + government signals still shown
- No sitemap: screenshot only + "No sitemap detected" + simplified single-node map
- No screenshot: map and signals still render; screenshot area hidden
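The degradation rules can be expressed as a single decision function. The sketch below is one way to encode them, not the deployed implementation: `Layer0Input`, `Layer0Plan`, and `planLayer0` are hypothetical names, and the handling of combined cases (for example, no sitemap and no screenshot) is an assumption, since the spec lists each case separately.

```typescript
interface Layer0Input {
  webAddress: string | null;
  hasSitemap: boolean;
  hasScreenshot: boolean;
}

type MapMode = "full" | "single-node" | "none";

interface Layer0Plan {
  showScreenshot: boolean;
  mapMode: MapMode;
  notice: string | null; // message shown in place of missing content
}

// Government signals are not modelled here: per the spec they are still
// shown even when the RTO has no web presence.
function planLayer0(input: Layer0Input): Layer0Plan {
  if (input.webAddress === null || input.webAddress.trim() === "") {
    return {
      showScreenshot: false,
      mapMode: "none",
      notice: "No web address recorded for this RTO",
    };
  }
  if (!input.hasSitemap) {
    // Screenshot (if any) plus a simplified single-node map.
    return {
      showScreenshot: input.hasScreenshot,
      mapMode: "single-node",
      notice: "No sitemap detected",
    };
  }
  // Full map; the screenshot area is hidden when no screenshot exists.
  return { showScreenshot: input.hasScreenshot, mapMode: "full", notice: null };
}
```

Checking the web address first keeps the most severe case from being masked by the sitemap or screenshot flags.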
Intelligence Panel (Layer 3)¶
Behind the "Show detail" toggle. Contains the full existing Radar content:
- Dossier header — last crawled, crawl status, entity badge, risk flags, analyst notes, trigger crawl button
- Classifier bar — visual summary: hosting tier, email maturity, SSL health, entity type, hiring signal, web age, org size, Wikipedia presence
- Stack profile — technology table by surface
- Screenshots — historical screenshot grid
- Seek job listings — active postings (if detected)
- IP observations — resolved IP table
- Signal layers 1–5 — collapsible sections with per-signal management
Database Topology¶
| Database | Binding | Role |
|---|---|---|
| radar-db | RADAR_DB | All Radar-specific data: dossiers, signals, stack profiles, IP observations, screenshots, crawl results |
| rtopacks-db | NRT_DB / RTOPACKS_DB | Read-only — RTO web addresses, org identity |
| radar-screenshots (R2) | SCREENSHOTS / RADAR_SCREENSHOTS | Homepage JPEGs (desktop + mobile), HTML archives, favicons |
R2 key conventions:
- radar-screenshots/{rto_code}/homepage.jpg — desktop screenshot
- radar-screenshots/{rto_code}/homepage-mobile.jpg — mobile screenshot
- radar-html/{rto_code}/{timestamp}.html.gz — archived rendered HTML (timestamped, accumulates)
- radar-favicons/{rto_code}/favicon.{ext} — favicon (overwritten on refresh)
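The key conventions above can be captured in small builder functions so that every reader and writer constructs keys the same way. The function names are hypothetical, and the timestamp is taken as an opaque string because the spec does not define its format. The prefixes are reproduced exactly as listed; whether the leading segment is part of the object key or refers to the bucket is not stated here.

```typescript
// Hypothetical helpers that reproduce the documented key patterns verbatim.
function desktopScreenshotKey(rtoCode: string): string {
  return `radar-screenshots/${rtoCode}/homepage.jpg`;
}

function mobileScreenshotKey(rtoCode: string): string {
  return `radar-screenshots/${rtoCode}/homepage-mobile.jpg`;
}

// Archives accumulate: each capture gets its own timestamped key.
// The timestamp format is not specified, so it is passed through unchanged.
function htmlArchiveKey(rtoCode: string, timestamp: string): string {
  return `radar-html/${rtoCode}/${timestamp}.html.gz`;
}

// The favicon key is stable for a given extension, so a refresh overwrites it.
function faviconKey(rtoCode: string, ext: string): string {
  return `radar-favicons/${rtoCode}/favicon.${ext}`;
}
```

For a placeholder RTO code of `12345`, `desktopScreenshotKey("12345")` yields `radar-screenshots/12345/homepage.jpg`.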
Hard separation rule: rtopacks-db is never written to by Radar. It is the sacred NRT corpus. Radar reads web addresses from it. All Radar intelligence is stored in radar-db.
Field Observer — Standalone Product Positioning¶
Radar is an internal tool today. Field Observer is its external product name — the version that can be shown to clients and eventually extracted as a standalone product.
Tagline: "What's already visible — made useful."
The disarming move: When someone first sees their own RTO's Radar card, they feel a slight vertigo — "who gave you permission to know this about me?" The answer: nobody needed to. Everything here was already public. The value isn't in what we see — it's in the assembly.
What Field Observer shows a client:
- How their organisation appears from the outside
- What a prospective student, funding body, or regulator would see
- What technology their competitors are running
- Where their enrolment funnel actually goes
- How current their site is
What Field Observer does not do:
- No AI. No machine learning. No natural language processing.
- No access to private data, authenticated systems, or internal networks.
- No judgments, scores, or rankings.

It is a mechanical process: attention, applied at scale.
Future Work¶
Items described here but not yet built. Briefs will be scoped separately.
- RADAR-ENRICH-UI-01 — Surface the new enrichment fields in the admin UI: compliance signals bar, link destination panel, third-party vendor list, performance metrics, mobile screenshot toggle, HTML archive download.
- Client-facing Radar card — L4 users see their own RTO's Field Observer view in the workspace. Read-only, no analyst notes.
- Comparative view — side-by-side Radar cards for two RTOs. Competitive intelligence use case.
- Anomaly detection — automated flagging of signals that don't match expectations (active registration + expired SSL, recent ABN cancellation + live website, etc.)
- Historical signal tracking — versioned signals over time, not just latest observation. "This RTO's SSL expired on X date and was renewed on Y."
- RADAR-ENRICH-COMMERCIAL — WhoisXML API integration for comprehensive WHOIS/RDAP data at scale. Replaces the sparse 5% coverage on domain_registered/domain_expiry.
- Configurable crawl depth — per-RTO override for sitemap URL cap, screenshot frequency, etc.
- Element-level sitemap analysis — parse individual course pages to extract qualification codes, mapping sitemap URLs to TGA scope data.
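As an illustration of the anomaly detection item, the two mismatches named above can be written as plain boolean rules with no scoring or ranking. This is a sketch under stated assumptions, not a design for the eventual brief: `SignalSnapshot`, its field names, and `detectAnomalies` are all hypothetical.

```typescript
// Hypothetical snapshot of the signals the two example rules need.
interface SignalSnapshot {
  registrationActive: boolean;
  sslExpired: boolean;
  abnRecentlyCancelled: boolean;
  websiteLive: boolean;
}

interface AnomalyRule {
  id: string;
  description: string;
  matches: (s: SignalSnapshot) => boolean;
}

// The two example combinations from the list above, expressed mechanically.
const ANOMALY_RULES: AnomalyRule[] = [
  {
    id: "active-registration-expired-ssl",
    description: "Active registration but expired SSL certificate",
    matches: (s) => s.registrationActive && s.sslExpired,
  },
  {
    id: "abn-cancelled-live-website",
    description: "Recent ABN cancellation but website still live",
    matches: (s) => s.abnRecentlyCancelled && s.websiteLive,
  },
];

// Returns the ids of every rule the snapshot matches, in rule order.
function detectAnomalies(snapshot: SignalSnapshot): string[] {
  return ANOMALY_RULES.filter((rule) => rule.matches(snapshot)).map((rule) => rule.id);
}
```

Each rule only reports that two observed signals disagree; interpreting the mismatch stays with the analyst.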
RADAR-SPEC-01 v0.2 — RTOpacks internal specification — April 2026