RTOpacks Radar — Product Specification

Document ID: RADAR-SPEC-01
Version: 0.2
Status: Living document — captures deployed system state + product intent
Location: docs/docs/workspace/apps/radar.md
Last updated: 17 April 2026


Changelog

| Version | Date | Changes |
|---|---|---|
| 0.1 | 17 Apr 2026 | Initial: full signal taxonomy, data model, crawl pipeline, UI surfaces, product positioning |
| 0.2 | 17 Apr 2026 | Enrichment pipeline, 24 promoted columns, compliance signals, link destination resolution, third-party/infrastructure domain split, provenance layer, dual screenshots, enrichment_data JSON schema |

How to Use This Document

This spec is the full-picture reference for Radar, also known as Field Observer when surfaced as a standalone product. It documents what the system is, what it observes, how it observes it, and what it does not do. Every build brief that touches Radar references this document as its source of truth.

This document does not constitute a brief. Briefs are scoped, sequenced, and issued separately. The reason for this is the same as People: the full system described here is too large to have been built in one pass, and some parts are still planned. What should never happen is a brief that contradicts this document, or a build that forecloses something described here.

Alex: read this document before touching any Radar code. The briefs tell you what to build. This document tells you why, and what "Radar" actually is.


What Radar Is

Radar is RTOpacks' intelligence layer. It assembles a structured dossier on every registered training organisation in Australia — their digital presence, their technology stack, their government registration status, and their public reputation — using only publicly observable data.

It does not access anything private. It does not authenticate to any system. It does not use AI to draw conclusions. It reads what is already published — sitemaps, DNS records, SSL certificates, government registers, social media profiles — and assembles it faster than any human team could, into a format where it can actually be seen and compared.

The product's core claim, stated plainly: Everything you see here was already public. We just connected the dots.

Who uses it

  • RTOpacks operators (L3) use Radar to understand the RTO landscape — which organisations are active, which are dormant, what technology they run, how mature their digital presence is.
  • RTO managers (L4) will eventually see their own Radar card — a mirror showing how their organisation appears from the outside. The slight vertigo of seeing your own business reflected back is the product's disarming move: the explanation is reassuring precisely because the mechanism is mundane.
  • Regulatory analysts use the signal layers to identify anomalies — RTOs with expired SSL, no web presence, mismatched entity types, or dormant digital footprints despite active registrations.

What it is not

Radar is not surveillance. It does not monitor private communications, access authenticated systems, or track individual behaviour. Every signal it captures is available to anyone with a web browser and the patience to look. Radar's value is not in what it sees — it's in the fact that it sees everything, at the same time, and puts it somewhere useful.

Say this, not that:

| Say this | Not that |
|---|---|
| Public data, assembled | Data harvested |
| What's on the label | What we discovered |
| We read your sitemap | We analysed your site |
| The same information anyone could find | Proprietary intelligence |
| A mechanical process | An algorithm |
| We connected the dots | We drew conclusions |
| No AI | Powered by AI |

The Dossier Model

Every RTO has a dossier — a single record that anchors all intelligence collected about that organisation. The dossier is created automatically when any Radar surface first references an RTO. It persists indefinitely.

Dossier fields

| Field | Type | Description |
|---|---|---|
| id | TEXT PK | UUID |
| rto_code | TEXT UNIQUE | The RTO's national code from TGA |
| last_crawled_at | TEXT | ISO timestamp of most recent crawl completion |
| crawl_status | TEXT | Lifecycle: pending → queued → crawling → complete / failed |
| notes | TEXT | Free-text analyst notes (operator-entered) |
| created_at | TEXT | Auto |
| updated_at | TEXT | Auto |

The dossier is a shell. Intelligence lives in satellite tables keyed to rto_code, plus the radar_crawl_results table which holds both sitemap crawl data and enrichment data from the Puppeteer pass.


Signal Layers

Radar organises intelligence into five layers, ordered by confidence — from government-reported facts at the top, to social signals at the bottom. Each layer is independently populated, independently timestamped, and independently refreshable.

Layer 1 — Government Reality

Confidence: Authoritative
Sources: Australian Business Register (ABR), ASIC, ASQA, ACNC
What it tells you: Is this entity real? Is it active? What kind of entity is it?

| Category | Signal Key | Description |
|---|---|---|
| abn | abn_value | The ABN itself |
| abn | abn_status | Active / Cancelled / other |
| abn | abn_status_effective | Date the status took effect |
| abn | abr_entity_name | Legal name as registered with ABR |
| abn | abr_last_updated | When ABR last modified this record |
| abn | entity_type_code | ABR entity type (company, sole trader, trust, etc.) |
| abn | entity_description | Human-readable entity description |
| abn | gst_status | GST registration status |
| abn | gst_registered_date | When GST was registered |
| abn | asic_number | ASIC company number (ACN) if applicable |
| abn | acnc_registered | Whether registered with ACNC (charities) |
| abn | charity_type | ACNC charity subtype |
| abn | dgr_endorsed | Deductible Gift Recipient status |
| abn | address_state | Registered business state |
| abn | address_postcode | Registered business postcode |
| classification | entity_type | Derived entity classification (school, TAFE, university, commercial, NFP, etc.) |
| classification | non_commercial_flag | Whether this is a non-commercial entity |
| classifier | entity_type_abn | Entity type derived from ABN data (highest precedence) |
| classifier | risk_flag | Risk indicators (multiple allowed) |
| registration | edu_au_anomaly | Anomaly flags on .edu.au domain usage |

Entity type classification precedence: ABN entity type > domain pattern > sweep inference > unknown. The entity badge in the UI follows this precedence chain.

Entity type values: school, tafe, university, government, association, sole_trader, trust, nfp, charity, commercial_rto, unknown_edu, unknown.
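
The precedence chain can be sketched as a first-match lookup. This is a minimal illustration, not the deployed classifier: the input field names (fromAbn, fromDomainPattern, fromSweep) and the returned source labels are hypothetical, and only the ordering is taken from the spec.

```javascript
// Precedence from the spec: ABN entity type > domain pattern > sweep inference > unknown.
// Input field names and the `source` labels are hypothetical, for illustration only.
function resolveEntityType({ fromAbn, fromDomainPattern, fromSweep }) {
  const candidates = [
    { value: fromAbn, source: "abn" },
    { value: fromDomainPattern, source: "domain_pattern" },
    { value: fromSweep, source: "sweep_inference" },
  ];
  // The first candidate that carries a usable value wins.
  const hit = candidates.find((c) => c.value && c.value !== "unknown");
  return hit
    ? { entityType: hit.value, source: hit.source }
    : { entityType: "unknown", source: "none" };
}
```

Returning the source alongside the value would let the entity badge show which step of the chain produced it.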

Layer 2 — Public Presence

Confidence: Observed
Sources: DNS, WHOIS, SSL certificates, HTTP responses, Wayback Machine, TLS handshake (enrichment pass)
What it tells you: Does this organisation have a functioning web presence? How established is it? Is their infrastructure secure?

| Category | Signal Key | Source | Description |
|---|---|---|---|
| domain | has_edu_domain | DNS | Whether the RTO uses a .edu.au domain |
| domain | edu_website_live | HTTP | Whether the .edu.au site responds |
| domain | edu_mx_live | DNS | Whether .edu.au MX records are active |
| domain | domain_registered | WHOIS | Domain registration date |
| domain | domain_expiry | WHOIS | Domain expiry date |
| domain | registrar | WHOIS | Domain registrar |
| domain | registrant_privacy | WHOIS | Whether WHOIS privacy is enabled |
| domain | nameserver_provider | DNS | NS provider (Cloudflare, AWS Route53, etc.) |
| domain | namespace_active | DNS | Whether the domain actively resolves |
| domain | dnssec | DNS | DNSSEC enabled |
| domain | axfr_permitted | DNS | Zone transfer permitted (security flag) |
| email | contact_email | HTML | Publicly listed contact email |
| email | email_domain | HTML | Email domain used |
| email | email_domain_resolves | DNS | Whether the email domain resolves |
| ssl | ssl_issuer | TLS handshake | Certificate issuer (Let's Encrypt, DigiCert, Sectigo, etc.) |
| ssl | ssl_valid_to | TLS handshake | Certificate expiry date (ISO) |
| ssl | ssl_days_remaining | TLS handshake | Days until cert expiry (negative = expired) |
| ssl | ssl_cert_error | TLS handshake | Certificate error if any (expired, name mismatch, etc.) |
| ssl | hsts | HTTP header | HTTP Strict Transport Security enabled |
| redirect | final_url | HTTP | URL after all redirects (stale domain detection) |
| redirect | had_https_redirect | HTTP | Whether initial HTTP URL redirected to HTTPS |
| history | wayback_first_snapshot | Wayback API | Earliest Wayback Machine capture |
| history | wayback_latest_snapshot | Wayback API | Most recent Wayback Machine capture |
| classifier | web_age_years | Derived | Years since first Wayback snapshot |
| classifier | ssl_health | Derived | good / adequate / expiring_soon / critical |

Enrichment adds: ssl_issuer, ssl_valid_to, ssl_days_remaining, ssl_cert_error, final_url, and had_https_redirect are promoted columns on radar_crawl_results, populated by the local Puppeteer enrichment pass. The four SSL columns come from the TLS handshake; final_url and had_https_redirect come from the HTTP redirect chain. Full SSL details (subject, valid_from, protocol, SAN list note) are in the enrichment_data JSON blob under $.ssl.
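
As an illustration of how ssl_days_remaining relates to ssl_valid_to, the sketch below derives one from the other. The function name and the choice to round down are assumptions; the spec only states that a negative value means the certificate has expired. Thresholds for ssl_health are not given in this document, so they are left out.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Days until the certificate expires; negative once it has expired.
// Rounding down with Math.floor is an assumption, not something the spec states.
function sslDaysRemaining(validToIso, now = new Date()) {
  const validTo = new Date(validToIso);
  if (Number.isNaN(validTo.getTime())) return null; // unparseable date
  return Math.floor((validTo.getTime() - now.getTime()) / MS_PER_DAY);
}
```

With the valid_to value from the enrichment_data example in this document (2026-06-23T01:34:09.000Z) and a reference time of midnight UTC on 17 April 2026, this returns 67.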

Layer 3 — Technical Fingerprint

Confidence: Observed / Inferred
Sources: HTTP response headers, HTML source analysis, rendered DOM, network request interception, known platform patterns
What it tells you: What technology runs this site? How sophisticated is it? What compliance signals are visible?

Fingerprint signals

| Category | Signal Key | Source | Description |
|---|---|---|---|
| fingerprint | page_title | DOM | HTML <title> (promoted column) |
| fingerprint | meta_description | DOM | Meta description |
| fingerprint | meta_keywords | DOM | Meta keywords (legacy SEO) |
| fingerprint | og_title | DOM | OpenGraph title |
| fingerprint | og_site_name | DOM | OpenGraph site name |
| fingerprint | og_description | DOM | OpenGraph description |
| fingerprint | og_image_url | DOM | OpenGraph image URL |
| fingerprint | twitter_card | DOM | Twitter card type |
| fingerprint | twitter_site | DOM | Twitter @handle |
| fingerprint | generator | DOM | CMS generator tag (WordPress, Squarespace, etc.) (promoted column) |
| fingerprint | server_header | HTTP | HTTP Server header |
| fingerprint | x_powered_by | HTTP | HTTP X-Powered-By header |
| fingerprint | robots_meta | DOM | Robots meta directive |
| fingerprint | copyright_year | DOM | Copyright year found in footer |
| fingerprint | mobile_responsive | DOM | Viewport meta tag exists and is not fixed-width (promoted column) |
| fingerprint | canonical_url | DOM | <link rel="canonical"> URL |
| fingerprint | viewport_meta | DOM | Raw viewport meta content string |
| fingerprint | wp_rest_api_exposed | HTTP | WordPress REST API publicly accessible |
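
The mobile_responsive rule ("viewport meta tag exists and is not fixed-width") could be derived from viewport_meta along these lines. This is a sketch under one assumption: a purely numeric width such as width=1024 is what counts as "fixed-width". The function names are illustrative.

```javascript
// Parses a viewport content string such as "width=device-width, initial-scale=1".
function parseViewport(content) {
  const settings = {};
  for (const part of (content || "").split(/[,;]/)) {
    const [key, value] = part.split("=").map((s) => s.trim().toLowerCase());
    if (key) settings[key] = value ?? "";
  }
  return settings;
}

// Responsive = a viewport tag exists and its width is not a fixed pixel value.
// Treating a numeric width (e.g. "width=1024") as fixed-width is an assumption.
function isMobileResponsive(viewportContent) {
  if (!viewportContent) return false; // no viewport meta tag at all
  const { width } = parseViewport(viewportContent);
  const isFixedWidth = width !== undefined && /^\d+$/.test(width);
  return !isFixedWidth;
}
```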

Analytics and pixel detection

Detected via dual strategy: HTML source regex AND observed network requests during page load.

| Signal Key | Detection | Description |
|---|---|---|
| gtm + gtm_id | HTML regex (GTM-*) + network (googletagmanager.com) | Google Tag Manager |
| ga4 + ga4_id | HTML regex (G-*) + network (google-analytics.com) | Google Analytics 4 |
| google_ads + google_ads_id | HTML regex (AW-*) + network (googleadservices.com) | Google Ads conversion |
| meta_pixel + meta_pixel_id | HTML regex (fbq('init') + network (connect.facebook.net) | Meta/Facebook pixel |
| tiktok_pixel + tiktok_pixel_id | HTML regex (ttq.load) + network (analytics.tiktok.com) | TikTok pixel |
| linkedin_insight + linkedin_insight_id | HTML regex (_linkedin_partner_id) + network (snap.licdn.com) | LinkedIn Insight |
| hubspot_tracking | HTML regex (hs-script-loader) + network (js.hs-scripts.com) | HubSpot tracking |
| hotjar | HTML regex (hotjar.com) + network (hotjar.com) | Hotjar |
| intercom | HTML regex (intercomSettings) + network (intercom.io) | Intercom |
| zendesk_chat | HTML regex (zE() + network (zendesk.com) | Zendesk Chat |
| tawk_chat | HTML regex (tawk.to) + network (embed.tawk.to) | Tawk.to Chat |
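
A minimal sketch of the dual strategy, covering four of the rows above. The exact regexes are guesses at what GTM-*, G-* and AW-* expand to, and the sketch assumes a hit from either strategy is enough to set the flag; the `_via` field is a hypothetical name borrowed from the style of the observed_via map.

```javascript
// Each detector pairs an HTML pattern with a network hostname, following the table.
// The regexes are illustrative guesses, not the patterns used by the real script.
const ANALYTICS_DETECTORS = [
  { key: "gtm", html: /\bGTM-[A-Z0-9]+\b/, host: "googletagmanager.com" },
  { key: "ga4", html: /\bG-[A-Z0-9]{6,}\b/, host: "google-analytics.com" },
  { key: "google_ads", html: /\bAW-\d+\b/, host: "googleadservices.com" },
  { key: "hotjar", html: /hotjar\.com/, host: "hotjar.com" },
];

function hostMatches(hostname, domain) {
  return hostname === domain || hostname.endsWith("." + domain);
}

// html: page source string; requestHostnames: hostnames seen during page load.
function detectAnalytics(html, requestHostnames) {
  const result = {};
  for (const d of ANALYTICS_DETECTORS) {
    const htmlMatch = html.match(d.html);
    const networkHit = requestHostnames.some((h) => hostMatches(h, d.host));
    // Assumption: either strategy alone is sufficient to report the tool as present.
    result[d.key] = Boolean(htmlMatch) || networkHit;
    // Only ID-shaped matches (GTM-, G-, AW-) are stored as IDs.
    if (htmlMatch && /^(GTM|G|AW)-/.test(htmlMatch[0])) {
      result[d.key + "_id"] = htmlMatch[0];
    }
    // Hypothetical provenance field recording which strategy fired.
    result[d.key + "_via"] =
      [htmlMatch && "html_regex", networkHit && "network_request"]
        .filter(Boolean)
        .join("+") || null;
  }
  return result;
}
```

Recording which strategy fired per signal is what makes low-coverage results diagnosable later without another crawl.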

Compliance signals

Mechanical observations from the rendered homepage. No inference, no model calls. All regex-based or link-presence checks.

| Signal Key | Type | Source | Description |
|---|---|---|---|
| rto_code_on_homepage | BOOLEAN | Text regex | RTO's own national code visible in page text (promoted column) |
| cricos_code_on_homepage | BOOLEAN | Text regex | CRICOS provider code visible (promoted column) |
| cricos_code_value | TEXT | Text regex | The matched CRICOS code (e.g. 00001K) (promoted column) |
| usi_mentioned | BOOLEAN | Text regex | "USI" or "Unique Student Identifier" in page text |
| privacy_policy_link_present | BOOLEAN | Link check | <a> with text containing "privacy" exists (promoted column) |
| privacy_policy_url | TEXT | Link href | URL of the privacy policy link |
| complaints_link_present | BOOLEAN | Link check | <a> with text containing "complaint", "grievance", or "appeals" (promoted column) |
| complaints_link_url | TEXT | Link href | URL of the complaints link |
| accessibility_statement_present | BOOLEAN | Link check | <a> with text containing "accessibility" or "wcag" (promoted column) |
| accessibility_url | TEXT | Link href | URL of the accessibility statement |
| government_program_mentions | JSON array | Text regex | Detected mentions of: Smart and Skilled, Free TAFE, Fee-Free TAFE, VET Student Loans, Australian Apprenticeships, JobTrainer, User Choice, Construction Blueprint, Skills First, Higher Level Skills |
| third_party_enrolment_system | BOOLEAN | Link resolution | "Enrol" or "Apply" links resolve to known external platforms (promoted column) |
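
To show how mechanical these checks are, here is a sketch of the text-regex and link-presence signals. It assumes the page text and a list of { text, href } links have already been extracted from the rendered DOM. The CRICOS pattern (five digits plus a letter near the word "CRICOS", as in 00001K) and the function names are assumptions.

```javascript
// Assumed CRICOS shape: the word "CRICOS" followed shortly by five digits and a letter.
const CRICOS_RE = /\bCRICOS\b[^0-9]{0,40}(\d{5}[A-Z])\b/i;

// Returns the href of the first link whose text contains any keyword, else null.
function findLink(links, keywords) {
  const hit = links.find((l) =>
    keywords.some((k) => l.text.toLowerCase().includes(k))
  );
  return hit ? hit.href : null;
}

// pageText: visible text of the homepage; links: [{ text, href }];
// rtoCode: the RTO's national code (assumed to contain no regex metacharacters).
function parseCompliance(pageText, links, rtoCode) {
  const cricos = pageText.match(CRICOS_RE);
  const privacyUrl = findLink(links, ["privacy"]);
  const complaintsUrl = findLink(links, ["complaint", "grievance", "appeals"]);
  const accessibilityUrl = findLink(links, ["accessibility", "wcag"]);
  return {
    rto_code_on_homepage: new RegExp("\\b" + rtoCode + "\\b").test(pageText),
    cricos_code_on_homepage: Boolean(cricos),
    cricos_code_value: cricos ? cricos[1].toUpperCase() : null,
    usi_mentioned:
      /\bUSI\b/.test(pageText) || /unique student identifier/i.test(pageText),
    privacy_policy_link_present: privacyUrl !== null,
    privacy_policy_url: privacyUrl,
    complaints_link_present: complaintsUrl !== null,
    complaints_link_url: complaintsUrl,
    accessibility_statement_present: accessibilityUrl !== null,
    accessibility_url: accessibilityUrl,
  };
}
```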

Layer 4 — Integration Signals

Confidence: Observed / Inferred
Sources: MX records, DNS TXT records, HTML source, subdomain fingerprinting, network request interception, link destination resolution
What it tells you: What third-party services does this organisation depend on?

| Category | Signal Key | Description |
|---|---|---|
| mail | mail_platform | Email platform (Google Workspace, Microsoft 365, etc.) |
| mail | email_domain_mail_platform | Mail platform on the email domain specifically |
| mail | spf_present | SPF record exists |
| mail | dkim_present | DKIM record exists |
| mail | dmarc_present | DMARC record exists |
| mail | dmarc_policy | DMARC policy (none, quarantine, reject) |
| lms | lms_platform | LMS platform detected (from signals table) |
| lms | lms_url | URL of detected LMS |
| lms | lms_hosting_same_as_main | Whether LMS is on same infrastructure |
| lms | lms_platform_detected | LMS detected via link resolution (promoted column on crawl_results) |
| sms | sms_platform_detected | Student management system detected via link resolution (promoted column) |
| integration | third_party_service | Named third-party service detected |
| classifier | email_maturity | Derived: mature / moderate / basic / none |

Layer 5 — Human Signals

Confidence: Inferred (lowest)
Sources: Social media platforms, Seek job listings, Wikipedia
What it tells you: Is this organisation active in the public sphere? Is it hiring?

| Category | Signal Key | Description |
|---|---|---|
| social | social_facebook | Facebook URL |
| social | social_linkedin | LinkedIn URL |
| social | social_instagram | Instagram URL |
| social | social_twitter | Twitter/X URL |
| social | social_youtube | YouTube URL |
| social | social_tiktok | TikTok URL |
| social | facebook_blocked | Whether Facebook profile was blocked/private |
| social | linkedin_blocked | Whether LinkedIn profile was blocked/private |
| seek | seek_active_listings | Number of active Seek job listings |
| seek | seek_locations | Locations in Seek listings |
| reputation | wikipedia_present | Whether a Wikipedia article exists |
| reputation | wikipedia_title | Wikipedia article title |
| reputation | wikipedia_description | Wikipedia article summary |
| classifier | hiring_signal | Derived: active_hiring / dormant |

Confidence Tiers

Every signal carries a confidence level. Three tiers:

| Tier | Meaning | Example |
|---|---|---|
| Authoritative | Government-reported, citable, legally binding | ABN status from ABR |
| Observed | Directly detected by automated sweep, timestamped, repeatable | SSL certificate issuer from HTTPS handshake |
| Inferred | Derived from authoritative or observed data: a reasonable conclusion, not a direct observation | Hosting tier derived from IP/ASN lookup |

The UI distinguishes these with colour-coded confidence pills. Authoritative signals are presented as facts. Inferred signals are presented as inferences. This distinction is not cosmetic — it determines whether the signal is citable in a regulatory context.


Stack Profile

Separate from the signal layers, each RTO has a stack profile — a table of detected technologies organised by surface area. This is the "what are they running?" view.

Surfaces

| Surface | What it covers |
|---|---|
| frontend | CMS, static site generator, JavaScript framework |
| lms | Learning management system (Canvas, Moodle, Blackboard, etc.) |
| email | Email platform (Google Workspace, Microsoft 365, Zoho, etc.) |
| hosting | Cloud provider, CDN, shared hosting |
| cdn | Content delivery network |
| analytics | Analytics and tracking tools |
| payment | Payment processing (Stripe, PayPal, etc.) |
| student_portal | Student management system / portal |
| other | Anything that doesn't fit above |

Each entry records: surface, technology name, description, cost range estimate, vendor URL, confidence level, and observation timestamp.


Crawl Pipeline

Data flow

rto-nrt-db (read) → radar-crawl worker → radar-db (write) + R2 (screenshots)

The crawl pipeline is a Cloudflare Worker (radar-crawl) that processes RTOs in batches. For each RTO:

  1. Resolve web address — query rto_web_addresses in rto-nrt-db
  2. Fetch robots.txt — extract Sitemap: directives
  3. Fetch and parse sitemap — handle both sitemap.xml and sitemap_index.xml; cap at 500 URLs per RTO
  4. Classify URLs — pattern-match against known categories
  5. Detect external platforms — hostname matching
  6. Write to D1 — structured JSON in radar_crawl_results
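
Steps 2 and 3 can be sketched as two pure functions over the fetched text. This is an illustration rather than the worker's code: the function names are hypothetical, and a regex is used to pull <loc> entries for brevity where a real XML parser would be more robust.

```javascript
const MAX_URLS_PER_RTO = 500; // cap from step 3

// Step 2: pull "Sitemap:" directives out of a robots.txt body.
function extractSitemapDirectives(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.match(/^\s*sitemap:\s*(\S+)/i))
    .filter(Boolean)
    .map((m) => m[1]);
}

// Step 3 (partial): collect <loc> entries and tell a sitemap index from a plain sitemap.
// An index lists child sitemaps to fetch next; a plain sitemap lists page URLs.
function parseSitemap(xml) {
  const locs = [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/gi)].map((m) => m[1]);
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  return {
    isIndex,
    urls: isIndex ? [] : locs.slice(0, MAX_URLS_PER_RTO),
    childSitemaps: isIndex ? locs : [],
  };
}
```

When child sitemaps are followed, the 500-URL cap would need to apply to the combined total per RTO, not to each child file.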

URL classification categories

| Category | Pattern matches |
|---|---|
| courses | /courses/, /qualifications/, /training/, /cert-, /diploma-, /certificate- |
| enrol | /enrol, /apply, /register, /enquire, /enquiry |
| student | /student, /portal, /my-, /login, /lms |
| contact | /contact, /locations, /find-us |
| policies | /usi, /fees, /refund, /complaints, /privacy, /terms |
| about | /about, /team, /staff, /governance |
| news | /news, /blog, /events |
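
A sketch of step 4 using the patterns in the table. Two things here are assumptions: patterns are matched as substrings of the URL path, and categories are checked in table order with the first match winning. The function names are illustrative.

```javascript
// Patterns copied from the table; checked in table order (first match wins, by assumption).
const URL_CATEGORIES = [
  ["courses", ["/courses/", "/qualifications/", "/training/", "/cert-", "/diploma-", "/certificate-"]],
  ["enrol", ["/enrol", "/apply", "/register", "/enquire", "/enquiry"]],
  ["student", ["/student", "/portal", "/my-", "/login", "/lms"]],
  ["contact", ["/contact", "/locations", "/find-us"]],
  ["policies", ["/usi", "/fees", "/refund", "/complaints", "/privacy", "/terms"]],
  ["about", ["/about", "/team", "/staff", "/governance"]],
  ["news", ["/news", "/blog", "/events"]],
];

// Returns the category name for an absolute URL, or null if nothing matches.
function classifyUrl(url) {
  const path = new URL(url).pathname.toLowerCase();
  for (const [category, patterns] of URL_CATEGORIES) {
    if (patterns.some((p) => path.includes(p))) return category;
  }
  return null;
}

// Tallies classified URLs per category, e.g. for count badges such as "Courses (31)".
function countByCategory(urls) {
  const counts = {};
  for (const url of urls) {
    const category = classifyUrl(url);
    if (category) counts[category] = (counts[category] || 0) + 1;
  }
  return counts;
}
```

Because matching is first-match, a path such as /news/student-awards would land in student rather than news under this ordering; a production classifier may want a more specific rule.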

External platform detection

Detected via both sitemap hostname matching (crawl pipeline) and link destination resolution (enrichment pipeline).

| Platform | Hostname pattern |
|---|---|
| Canvas LMS | *.instructure.com |
| Canvas Catalog | *.catalog.instructure.com |
| Moodle | moodle.*, *.moodlecloud.com, *.moodlesites.com, *.mdl2.com |
| Blackboard | *.blackboard.com |
| Blackboard Collaborate | *.bbcollab.com |
| D2L Brightspace | *.brightspace.com |
| Teachable | *.teachable.com |
| Thinkific | *.thinkific.com |
| LinkedIn Learning | learning.linkedin.com |
| Wisenet SMS | *.wisenet.co |
| aXcelerate | *.axcelerate.com(.au)? |
| VETtrak | *.vettrak.com(.au)? |
| JobReady | *.jobready.com(.au)? |
| RTO Manager | *.rtomanager.(com\|com.au\|net) |
| Cliniko | *.cliniko.com |
| Google Forms | docs.google.com/forms, forms.gle |
| Microsoft Forms | forms.office.com, forms.microsoft.com |
| SurveyMonkey | *.surveymonkey.com |
| JotForm | *.jotform.com |
| Typeform | *.typeform.com |
| Eventbrite | *.eventbrite.com(.au)? |
| SharePoint | *.sharepoint.com |
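
Hostname matching against this table might look like the sketch below, which covers a subset of the rows. Translating the glob-style patterns into regexes, and checking the more specific Canvas Catalog pattern before Canvas LMS, are choices made for this example.

```javascript
// A subset of the table, expressed as hostname regexes.
// More specific patterns come first so Canvas Catalog is not reported as Canvas LMS.
const PLATFORM_PATTERNS = [
  ["Canvas Catalog", /\.catalog\.instructure\.com$/],
  ["Canvas LMS", /\.instructure\.com$/],
  ["Moodle", /^moodle\.|\.(moodlecloud|moodlesites|mdl2)\.com$/],
  ["aXcelerate", /\.axcelerate\.com(\.au)?$/],
  ["VETtrak", /\.vettrak\.com(\.au)?$/],
  ["RTO Manager", /\.rtomanager\.(com|com\.au|net)$/],
  ["Microsoft Forms", /^forms\.(office|microsoft)\.com$/],
];

// Accepts either a full URL or a bare hostname; returns the platform name or null.
function detectPlatform(urlOrHostname) {
  const hostname = (
    urlOrHostname.includes("://") ? new URL(urlOrHostname).hostname : urlOrHostname
  ).toLowerCase();
  const hit = PLATFORM_PATTERNS.find(([, re]) => re.test(hostname));
  return hit ? hit[0] : null;
}
```

Path-based entries such as docs.google.com/forms need the URL path as well as the hostname, so they are left out of this hostname-only sketch.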

Enrichment Pipeline

Overview

The enrichment pipeline is a local Puppeteer script (tools/radar-enrich-local.mjs) that visits every RTO's homepage with a full Chromium browser instance and captures every signal extractable from that single visit. It runs on Tim's Mac Mini M2 Pro, not on Cloudflare Workers.

Why local, not CF Workers

  • CF Browser Rendering has a 2-concurrent-session limit — bulk work is infeasible
  • Local Puppeteer runs 10+ parallel Chromium instances with no concurrency ceiling
  • Local execution keeps the CF Browser Rendering budget free for ad-hoc per-RTO refreshes
  • The previous corpus screenshot ingest used this exact pattern successfully

Three lifecycle modes

| Mode | Tool | Trigger | Cadence | Volume |
|---|---|---|---|---|
| Bulk ingest | Local Puppeteer (tools/radar-enrich-local.mjs) | Manual | On demand | All ~12,500 RTOs |
| Monthly refresh | Same local script | Manual, 1st of month | Monthly | All RTOs |
| Ad-hoc single | CF Browser Rendering via radar-crawl worker /single endpoint | "Trigger crawl" button in RadarTab UI | On demand | 1 RTO |

What the enrichment captures per RTO (single visit)

Screenshots (two per RTO):

| Viewport | Size | R2 key |
|---|---|---|
| Desktop | 1280×800, JPEG q75 | radar-screenshots/{rto_code}/homepage.jpg |
| Mobile | 375×812, isMobile: true, deviceScaleFactor: 2, JPEG q75 | radar-screenshots/{rto_code}/homepage-mobile.jpg |

Both captured on the same page load — desktop first, then resize to mobile with 500ms reflow wait.

SSL cert details — from the TLS handshake Chromium performs. Captured even on cert errors (--ignore-certificate-errors flag). An expired cert is itself a high-value signal.

HTTP response headers — full headers stored as JSON in enrichment_data.response_headers. Server, X-Powered-By, HSTS promoted to queryable fields.

Layer 3 fingerprints — all meta tags, OG data, generator, viewport, canonical URL, copyright year from the rendered DOM via page.evaluate().

Analytics/pixel detection — dual HTML regex + network request interception. IDs captured where possible (GTM-*, G-*, AW-*).

Compliance signals — RTO code visibility, CRICOS code, USI mentions, privacy/complaints/accessibility links, government program mentions. All regex-based, mechanical, no inference.

Third-party domain classification — every outbound network request logged, deduplicated by hostname, classified as vendor or infrastructure (see below).

Link destination resolution — homepage links in categories enrol, student, contact, plus all external links, resolved via HEAD request (GET fallback on 405). Max 20 per RTO, 10s timeout. Platform detection on final URL.

HTML archive — full rendered DOM, gzipped, stored in R2 at radar-html/{rto_code}/{timestamp}.html.gz. Enables future re-mining without re-crawling.

Favicon — fetched from detected <link rel="icon"> or fallback /favicon.ico. Stored at radar-favicons/{rto_code}/favicon.{ext}.

Performance — load time to domcontentloaded (ms), request count, page weight estimate.

Per-field partial failure

Each capture operation is wrapped in try/catch. If one fails (e.g. favicon 404), the rest of the RTO's enrichment continues. Success is tracked per-field in enrichment_success_flags:

{
  "screenshot_desktop": 1,
  "screenshot_mobile": 1,
  "html_archived": 1,
  "ssl_captured": 1,
  "headers_captured": 1,
  "metadata_extracted": 1,
  "analytics_detected": 1,
  "compliance_parsed": 1,
  "third_party_captured": 1,
  "performance_captured": 1,
  "link_destinations_resolved": 1,
  "favicon_captured": 0
}

Analysts can query "RTOs where link_destinations failed" and re-run just that capture without re-doing the full visit.
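
The per-field pattern can be sketched as a loop that isolates each capture and records 1 or 0 for it. The capture functions here are stand-ins; in a Puppeteer script they would be async and awaited, and they are shown synchronously only to keep the example short. The helper names are hypothetical.

```javascript
// Runs each capture independently and records 1/0 per field.
// `captures` maps a flag name to a function. Real capture steps would be async;
// they are synchronous stand-ins here for brevity.
function runCaptures(captures) {
  const flags = {};
  const data = {};
  const errors = {};
  for (const [name, capture] of Object.entries(captures)) {
    try {
      data[name] = capture();
      flags[name] = 1;
    } catch (err) {
      // One failed capture must not abort the rest of the RTO's enrichment.
      flags[name] = 0;
      errors[name] = String(err && err.message ? err.message : err);
    }
  }
  return { flags, data, errors };
}

// Given a stored enrichment_success_flags JSON string, list the captures to re-run.
function failedCaptures(flagsJson) {
  return Object.entries(JSON.parse(flagsJson))
    .filter(([, ok]) => ok !== 1)
    .map(([name]) => name);
}
```

For the example flags above, failedCaptures would return only favicon_captured, which is the list a targeted re-run would work from.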


Third-Party Domain Classification

Every outbound network request during a page load is logged. After navigation completes, each unique hostname is classified as vendor (real service/product relationship) or infrastructure (CDN/font/analytics plumbing).

Classification rules

  1. Check the hostname against the vendor patterns list (regex, takes precedence)
  2. Check the registrable domain against the infrastructure set
  3. Unknown domains default to vendor (conservative — better to over-surface than silently filter)

First-party filtering

Requests to the RTO's own domain or subdomains are excluded. Filtering uses the final URL's registrable domain (after redirects), not the stored web_address, which handles domain migrations such as www.cit.act.edu.au → cit.edu.au.

Vendor patterns (62 patterns)

Organised by category. Full list maintained in tools/radar-enrich-local.mjs. Key categories:

  • LMS: Canvas, Moodle (Cloud/Sites/mdl2), Blackboard, Brightspace, Teachable, Thinkific, LinkedIn Learning
  • SMS/RTO: Wisenet, aXcelerate, VETtrak, JobReady, RTO Manager, Cliniko
  • Form/Survey: Typeform, JotForm, SurveyMonkey, Microsoft Forms
  • Chat/Support: Intercom, Tawk.to, Zendesk, Freshdesk, Drift, Crisp
  • Consent management: Cookiebot, OneTrust, CookieLaw, Osano
  • Analytics: Hotjar, Cloudflare Web Analytics, HubSpot, Segment, Mixpanel, Amplitude
  • Booking: Calendly, Acuity, TryBooking, Humanitix, Eventbrite
  • Maps: Google Maps (maps.googleapis.com, maps.google.com)
  • reCAPTCHA: recaptcha.net, google.com/recaptcha
  • Video conferencing: Zoom
  • Payment: Stripe, PayPal, Braintree, Square, Westpac PayWay, eWAY, SecurePay
  • Email marketing: Mailchimp
  • CRM: Salesforce, Pardot, Zoho, Pipedrive
  • Microsoft: SharePoint

Infrastructure set (33 domains)

CDN/static assets (gstatic, googleapis, cloudflare, jsdelivr, bootstrapcdn, fontawesome, typekit), ad/tracking plumbing (googletagmanager, google-analytics, doubleclick, googlesyndication, facebook.net), social embed loaders (platform.twitter, platform.linkedin), WordPress infrastructure (wp.com, s.w.org), video embeds (youtube, vimeo).

Note: google.com is NOT in the infrastructure set — it's too broad. Specific Google services (Maps, reCAPTCHA) are handled by vendor patterns. Unclassified google.com requests fall through to vendor-unknown.

Output

Two arrays in enrichment_data:

  • third_party_domains — vendor relationships: ["static.hotjar.com (Hotjar)", "js-ap1.hubspot.com (HubSpot)", "unknown-vendor.com"]
  • infrastructure_domains — plumbing: ["www.googletagmanager.com", "connect.facebook.net", "fonts.gstatic.com"]

Named vendors include the detected name in parentheses. Unknown vendors show hostname only.
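
The classification rules, first-party filter and output format fit together roughly as follows. The vendor and infrastructure lists are small illustrative subsets, and the registrable-domain helper is a naive stand-in that only knows a handful of Australian two-part suffixes; a real implementation would consult the Public Suffix List.

```javascript
// Illustrative subsets; the full lists live in tools/radar-enrich-local.mjs.
const VENDOR_PATTERNS = [
  [/(^|\.)hotjar\.com$/, "Hotjar"],
  [/(^|\.)hubspot\.com$/, "HubSpot"],
  [/^maps\.(googleapis|google)\.com$/, "Google Maps"],
];
const INFRASTRUCTURE_DOMAINS = new Set([
  "gstatic.com", "googleapis.com", "googletagmanager.com",
  "google-analytics.com", "facebook.net",
]);

// Naive registrable-domain guess (assumption): handles a few two-part suffixes only.
const TWO_PART_SUFFIXES = new Set(["com.au", "edu.au", "net.au", "org.au", "gov.au"]);
function registrableDomain(hostname) {
  const labels = hostname.toLowerCase().split(".");
  const keep = TWO_PART_SUFFIXES.has(labels.slice(-2).join(".")) ? 3 : 2;
  return labels.slice(-keep).join(".");
}

// requestHostnames: every hostname seen during page load; finalUrl: URL after redirects.
function classifyRequests(requestHostnames, finalUrl) {
  const firstParty = registrableDomain(new URL(finalUrl).hostname);
  const thirdParty = [];
  const infrastructure = [];
  for (const host of new Set(requestHostnames.map((h) => h.toLowerCase()))) {
    if (registrableDomain(host) === firstParty) continue; // first-party filter
    const vendor = VENDOR_PATTERNS.find(([re]) => re.test(host));
    if (vendor) {
      thirdParty.push(`${host} (${vendor[1]})`); // rule 1: vendor patterns take precedence
    } else if (INFRASTRUCTURE_DOMAINS.has(registrableDomain(host))) {
      infrastructure.push(host); // rule 2: known plumbing
    } else {
      thirdParty.push(host); // rule 3: unknown defaults to vendor
    }
  }
  return { third_party_domains: thirdParty, infrastructure_domains: infrastructure };
}
```

Checking vendor patterns first is what lets maps.googleapis.com surface as Google Maps even though googleapis.com is otherwise treated as infrastructure.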


Link Destination Resolution

Homepage links matching specific categories are resolved to their final URL via HTTP HEAD request, with platform detection on the resolved hostname.

Categories resolved

  • enrol — links containing "enrol", "apply", "register", "enquire"
  • student — links containing "student", "portal", "login", "lms", "my-"
  • contact — links containing "contact", "location", "find us"
  • external — any link to a hostname different from the RTO's own domain

Resolution method

  1. Extract matching <a> tags from rendered DOM (max 30 pre-filtered)
  2. Take first 20 links (hard cap)
  3. For each: HTTP HEAD request, follow redirects, 10s timeout
  4. If HEAD returns 405: retry with GET
  5. Capture final URL after redirects
  6. Pattern-match final hostname against platform detection list
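
The selection and resolution steps could be sketched as below. Matching keywords against both link text and href, comparing exact hostnames to decide "external", and the function names are all assumptions. resolveFinalUrl relies on the standard fetch and AbortSignal.timeout APIs (available in recent Node versions) and accepts an injectable fetch for testing.

```javascript
// Keyword lists from "Categories resolved"; checked in that order.
const LINK_CATEGORIES = [
  ["enrol", ["enrol", "apply", "register", "enquire"]],
  ["student", ["student", "portal", "login", "lms", "my-"]],
  ["contact", ["contact", "location", "find us"]],
];

// Returns a category, or null when the link is neither categorised nor external.
function categoriseLink(link, pageUrl) {
  const target = new URL(link.href, pageUrl);
  if (!/^https?:$/.test(target.protocol)) return null; // skip mailto:, tel:, etc.
  const haystack = `${link.text} ${link.href}`.toLowerCase();
  for (const [category, keywords] of LINK_CATEGORIES) {
    if (keywords.some((k) => haystack.includes(k))) return category;
  }
  // Simplification: exact hostname comparison, so www vs. bare domain counts as external.
  return target.hostname !== new URL(pageUrl).hostname ? "external" : null;
}

// Steps 1-2: keep matching links, pre-filter to 30, then apply the hard cap of 20.
function selectLinks(links, pageUrl) {
  return links
    .map((l) => ({ ...l, category: categoriseLink(l, pageUrl) }))
    .filter((l) => l.category !== null)
    .slice(0, 30)
    .slice(0, 20);
}

// Steps 3-5: HEAD with redirects followed and a 10s timeout; GET fallback on 405.
async function resolveFinalUrl(href, pageUrl, fetchImpl = fetch) {
  const url = new URL(href, pageUrl).toString();
  const options = (method) => ({
    method,
    redirect: "follow",
    signal: AbortSignal.timeout(10000),
  });
  let res = await fetchImpl(url, options("HEAD"));
  if (res.status === 405) res = await fetchImpl(url, options("GET"));
  return res.url || url; // response.url is the URL after redirects
}
```

The final URL returned by resolveFinalUrl is what step 6 would pass to platform detection.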

Output

{
  "link_text": "Student Login",
  "href": "/student-portal",
  "final_url": "https://myschool.canvas.instructure.com/login",
  "category": "student",
  "platform_detected": "Canvas LMS"
}

Platform detections from link resolution promote to lms_platform_detected, sms_platform_detected, and third_party_enrolment_system columns on radar_crawl_results.


Provenance

Every enrichment run records an observed_via map in enrichment_data showing which detection strategy produced each signal:

{
  "ssl_issuer": "tls_handshake",
  "page_title": "dom_query",
  "generator": "meta_tag",
  "ga4": "html_regex+network_request",
  "rto_code_on_homepage": "html_text_regex",
  "link_destinations": "href_head_resolution",
  "third_party_domains": "network_request_listener"
}

Purpose: when a signal returns surprising coverage (e.g. Meta Pixel at 5%), the provenance map tells you which detection strategy was used, enabling targeted improvement without re-crawling.


Crawl Results Schema (complete)

CREATE TABLE radar_crawl_results (
  -- Core (from sitemap crawl)
  rto_code              TEXT PRIMARY KEY,
  crawled_at            TEXT NOT NULL,
  web_address           TEXT,
  has_sitemap           INTEGER DEFAULT 0,
  sitemap_url           TEXT,
  page_count            INTEGER DEFAULT 0,
  last_modified         TEXT,
  screenshot_r2_key     TEXT,           -- desktop: {rto_code}/homepage.jpg
  screenshot_failed     INTEGER DEFAULT 0,
  no_web_presence       INTEGER DEFAULT 0,
  classified_urls       TEXT,           -- JSON
  subdomains            TEXT,           -- JSON array
  external_platforms    TEXT,           -- JSON array
  raw_sitemap_sample    TEXT,           -- JSON array, first 50 URLs
  crawl_status          TEXT DEFAULT 'pending',
  crawl_error           TEXT,

  -- Enrichment: redirect (v0.2)
  final_url             TEXT,
  had_https_redirect    INTEGER DEFAULT 0,

  -- Enrichment: SSL (v0.2)
  ssl_issuer            TEXT,
  ssl_valid_to          TEXT,
  ssl_days_remaining    INTEGER,
  ssl_cert_error        TEXT,

  -- Enrichment: fingerprint (v0.2)
  page_title            TEXT,
  generator             TEXT,
  mobile_responsive     INTEGER DEFAULT 0,

  -- Enrichment: compliance (v0.2)
  rto_code_on_homepage          INTEGER DEFAULT 0,
  cricos_code_on_homepage       INTEGER DEFAULT 0,
  cricos_code_value             TEXT,
  complaints_link_present       INTEGER DEFAULT 0,
  privacy_policy_link_present   INTEGER DEFAULT 0,
  accessibility_statement_present INTEGER DEFAULT 0,

  -- Enrichment: platform detection (v0.2)
  lms_platform_detected         TEXT,
  sms_platform_detected         TEXT,
  third_party_enrolment_system  INTEGER DEFAULT 0,

  -- Enrichment: performance (v0.2)
  load_time_ms          INTEGER,

  -- Enrichment: artifacts (v0.2)
  screenshot_mobile_r2_key TEXT,
  favicon_r2_key        TEXT,
  html_archive_r2_key   TEXT,

  -- Enrichment: tracking (v0.2)
  enrichment_completed_at TEXT,
  enrichment_success_flags TEXT,  -- JSON per-field success map
  enrichment_data       TEXT     -- JSON blob (see structure below)
);

enrichment_data JSON structure

{
  "ssl": {
    "subject": "cit.edu.au",
    "issuer": "E8",
    "valid_from": "2026-03-25T01:34:10.000Z",
    "valid_to": "2026-06-23T01:34:09.000Z",
    "protocol": "TLS 1.3",
    "san_list_note": "Not available via Puppeteer SecurityDetails API"
  },
  "response_headers": {
    "server": "cloudflare",
    "content-type": "text/html; charset=utf-8",
    "strict-transport-security": "max-age=31536000"
  },
  "meta": {
    "page_title": "Home : Canberra Institute of Technology",
    "meta_description": "...",
    "og_title": "...",
    "og_site_name": "CIT",
    "generator": null,
    "viewport_meta": "width=device-width, initial-scale=1",
    "mobile_responsive": true,
    "copyright_year": "2026",
    "canonical_url": "https://cit.edu.au/",
    "favicon_url": "/favicon.ico"
  },
  "analytics": {
    "gtm": true, "gtm_id": "GTM-XXXX",
    "ga4": true, "ga4_id": "G-XXXX",
    "meta_pixel": false,
    "hotjar": true,
    "intercom": false
  },
  "compliance": {
    "rto_code_present": true,
    "cricos_code_present": true,
    "cricos_code_value": "00001K",
    "usi_mentioned": true,
    "privacy_policy_link_present": true,
    "privacy_policy_url": "https://cit.edu.au/policies/privacy_policy",
    "complaints_link_present": true,
    "complaints_link_url": "https://cit.edu.au/about/student-and-community-member-complaints",
    "accessibility_statement_present": false,
    "accessibility_url": null,
    "government_program_mentions": ["Free TAFE", "Fee-Free TAFE"]
  },
  "third_party_domains": [
    "static.hotjar.com (Hotjar)",
    "script.hotjar.com (Hotjar)",
    "www.facebook.com"
  ],
  "infrastructure_domains": [
    "www.googletagmanager.com",
    "connect.facebook.net",
    "www.google-analytics.com"
  ],
  "link_destinations": [
    {
      "link_text": "How to apply",
      "href": "https://cit.edu.au/study/apply",
      "final_url": "https://cit.edu.au/study/apply",
      "category": "enrol",
      "platform_detected": null
    }
  ],
  "performance": {
    "page_weight_kb": 22413,
    "request_count": 205
  },
  "observed_via": {
    "ssl_issuer": "tls_handshake",
    "page_title": "dom_query",
    "ga4": "html_regex+network_request"
  }
}

IP Observations

Each RTO's domain is resolved to IP addresses, and each IP is enriched:

| Field | Description |
|---|---|
| hostname | The domain resolved |
| ip_address | Resolved IP (A record) |
| ip_version | 4 or 6 |
| ptr_record | Reverse DNS |
| asn / asn_org | Autonomous System Number and organisation |
| network_block | CIDR block |
| country_code / city | Geolocation |
| is_shared | Whether the IP hosts multiple domains |
| hosting_provider | Derived provider name |

UI Surfaces

Field Observer Map (Layer 0)

The default view when opening any RTO's Radar tab. Visible without clicking "Show detail."

Components:

  • Homepage screenshot: full-width, 240px tall, from R2
  • React Flow topology map: horizontal flow graph showing the site structure
      • Root node: homepage, domain name label
      • Section nodes: one per classified URL category with count badge (e.g. "Courses (31)")
      • Subdomain nodes: dashed blue border, separate column
      • External platform nodes: amber, rightmost column (e.g. "Canvas LMS")
  • Meta bar: domain, last crawled date, page count, sitemap status

Progressive disclosure:

  • Layer 0 (always visible): screenshot + topology map + meta
  • Layer 1 (tooltip on node hover): single most relevant signal
  • Layer 2 (node click): opens URL in new tab
  • Layer 3 ("Show detail" toggle): full intelligence panel

Graceful degradation:

  • No web presence: "No web address recorded for this RTO" + government signals still shown
  • No sitemap: screenshot only + "No sitemap detected" + simplified single-node map
  • No screenshot: map and signals still render; screenshot area hidden

Intelligence Panel (Layer 3)

Behind the "Show detail" toggle. Contains the full existing Radar content:

  1. Dossier header — last crawled, crawl status, entity badge, risk flags, analyst notes, trigger crawl button
  2. Classifier bar — visual summary: hosting tier, email maturity, SSL health, entity type, hiring signal, web age, org size, Wikipedia presence
  3. Stack profile — technology table by surface
  4. Screenshots — historical screenshot grid
  5. Seek job listings — active postings (if detected)
  6. IP observations — resolved IP table
  7. Signal layers 1–5 — collapsible sections with per-signal management

Database Topology

| Database | Binding | Role |
|---|---|---|
| radar-db | RADAR_DB | All Radar-specific data: dossiers, signals, stack profiles, IP observations, screenshots, crawl results |
| rtopacks-db | NRT_DB / RTOPACKS_DB | Read-only: RTO web addresses, org identity |
| radar-screenshots (R2) | SCREENSHOTS / RADAR_SCREENSHOTS | Homepage JPEGs (desktop + mobile), HTML archives, favicons |

R2 key conventions:

  • radar-screenshots/{rto_code}/homepage.jpg: desktop screenshot
  • radar-screenshots/{rto_code}/homepage-mobile.jpg: mobile screenshot
  • radar-html/{rto_code}/{timestamp}.html.gz: archived rendered HTML (timestamped, accumulates)
  • radar-favicons/{rto_code}/favicon.{ext}: favicon (overwritten on refresh)

Hard separation rule: rtopacks-db is never written to by Radar. It is the sacred NRT corpus. Radar reads web addresses from it. All Radar intelligence is stored in radar-db.


Field Observer — Standalone Product Positioning

Radar is an internal tool today. Field Observer is its external product name — the version that can be shown to clients and eventually extracted as a standalone product.

Tagline: "What's already visible — made useful."

The disarming move: When someone first sees their own RTO's Radar card, they feel a slight vertigo — "who gave you permission to know this about me?" The answer: nobody needed to. Everything here was already public. The value isn't in what we see — it's in the assembly.

What Field Observer shows a client:

  • How their organisation appears from the outside
  • What a prospective student, funding body, or regulator would see
  • What technology their competitors are running
  • Where their enrolment funnel actually goes
  • How current their site is

What Field Observer does not do:

  • No AI. No machine learning. No natural language processing.
  • No access to private data, authenticated systems, or internal networks.
  • No judgments, scores, or rankings.
  • A mechanical process. Attention, applied at scale.


Future Work

Items described here but not yet built. Briefs will be scoped separately.

  1. RADAR-ENRICH-UI-01 — Surface the new enrichment fields in the admin UI: compliance signals bar, link destination panel, third-party vendor list, performance metrics, mobile screenshot toggle, HTML archive download.
  2. Client-facing Radar card — L4 users see their own RTO's Field Observer view in the workspace. Read-only, no analyst notes.
  3. Comparative view — side-by-side Radar cards for two RTOs. Competitive intelligence use case.
  4. Anomaly detection — automated flagging of signals that don't match expectations (active registration + expired SSL, recent ABN cancellation + live website, etc.)
  5. Historical signal tracking — versioned signals over time, not just latest observation. "This RTO's SSL expired on X date and was renewed on Y."
  6. RADAR-ENRICH-COMMERCIAL — WhoisXML API integration for comprehensive WHOIS/RDAP data at scale. Replaces the sparse 5% coverage on domain_registered/domain_expiry.
  7. Configurable crawl depth — per-RTO override for sitemap URL cap, screenshot frequency, etc.
  8. Element-level sitemap analysis — parse individual course pages to extract qualification codes, mapping sitemap URLs to TGA scope data.

RADAR-SPEC-01 v0.2 — RTOpacks internal specification — April 2026