
PHOENIX sub-pixel detection methodology

Meteosat Second Generation SEVIRI delivers a 3 km native pixel over Sicily every 15 minutes in 2 thermal channels (MIR 3.9 µm, TIR 10.8 µm). A wildfire smaller than 3 km warms only part of one pixel: the sub-pixel problem. The Dozier 1981 retrieval already solves it physically. PHOENIX layers a learned, ensembled detector on top that holds AUC between 0.988 and 0.997 on three independent held-out splits. This page documents how the current sub-pixel detector was built: every round, every gain, every honest negative.
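To make the sub-pixel problem concrete, the sketch below evaluates the two-component mixed-pixel forward model that a Dozier-style retrieval inverts: pixel radiance is treated as an area-weighted mix of a fire and a cooler background. The function names, the 800 K fire, the 300 K background and the 0.1% fire fraction are illustrative assumptions, not PHOENIX values, and the model ignores atmosphere, emissivity and the sensor's spectral response.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
K = 1.380649e-23     # Boltzmann constant, J/K
C1 = 2.0 * H * C ** 2
C2 = H * C / K

MIR = 3.9e-6    # SEVIRI MIR channel centre, metres
TIR = 10.8e-6   # SEVIRI TIR channel centre, metres


def planck(wavelength_m, temp_k):
    """Blackbody spectral radiance at one wavelength."""
    return C1 / (wavelength_m ** 5 * (math.exp(C2 / (wavelength_m * temp_k)) - 1.0))


def brightness_temp(wavelength_m, radiance):
    """Invert the Planck function: radiance -> brightness temperature (K)."""
    return C2 / (wavelength_m * math.log(1.0 + C1 / (wavelength_m ** 5 * radiance)))


def mixed_pixel_bt(wavelength_m, fire_fraction, t_fire, t_bg):
    """Brightness temperature of a pixel partly covered by fire.

    Assumed model: radiance = p * B(T_fire) + (1 - p) * B(T_bg).
    """
    radiance = (fire_fraction * planck(wavelength_m, t_fire)
                + (1.0 - fire_fraction) * planck(wavelength_m, t_bg))
    return brightness_temp(wavelength_m, radiance)


# Hypothetical scene: 0.1% of the pixel burning at 800 K over a 300 K background.
mir_bt = mixed_pixel_bt(MIR, 0.001, 800.0, 300.0)
tir_bt = mixed_pixel_bt(TIR, 0.001, 800.0, 300.0)
print(f"MIR rise: {mir_bt - 300.0:.1f} K, TIR rise: {tir_bt - 300.0:.1f} K")
```

The MIR channel responds far more strongly than the TIR channel to the same small fire fraction, which is why the MIR and TIR pair carries sub-pixel information at all.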

What we had to beat

Baseline	AUC time-held-out	P @ recall 0.90
PHOENIX per-pixel BT transformer (2026-05-27 prod)	0.834	0.473

Anything shipped had to materially exceed both numbers and stay below 5% false-positive rate inside Sicily's known FP zones (Etna disk, Stromboli plume, Priolo / Gela / Milazzo industrial polygons).

Data inventory

Source	Volume	Window	Used for
SEVIRI BT_MIR + BT_TIR (.npz)	1,650 frames	Apr 23 – May 28 2026	model input
Confirmed events (`t72h_outcome LIKE 'confirmed_*'`)	450	Mar 31 – May 28	positive labels
Negative / unverifiable events	1,894	Mar 31 – May 28	negative labels
Hard negatives mined from FP zones	7,183	Apr 23 – May 28	adversarial training negatives
External comparator hits (VIIRS, MODIS, SAR, VVF)	9,630	Apr 23 – May 28	excluded from honest models (leakage)
Detection crops	17,536	Apr 23 – May 28	verification-head experiment (not shipped)

The binding constraint on every architecture is the label count: only 450 positive events. Sentinel-2 burn-scar tiles and FCI L1c raw data were considered and parked because neither is persisted on disk.

Iteration log

Six bake-off rounds were run on RunPod (NVIDIA L4 / A4000 / A5000, mandatory pod DELETE in every code path's finally). Identical seed, batch size 128–256, Adam 1e-4, 25–40 epochs, BCEWithLogits unless otherwise noted. Every model reports its best validation AUC and precision at recall = 0.90.
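For readers unfamiliar with the two headline metrics, here is a minimal pure-Python sketch of AUC and precision at recall 0.90. The function names are illustrative, and the tie handling and the choice of "best precision among thresholds reaching the target recall" are assumptions; the bake-off code may compute them differently.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Quadratic in the sample count, so only suitable for small validation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def precision_at_recall(labels, scores, target_recall=0.90):
    """Best precision over all score cut-offs whose recall reaches the target."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    best = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        if true_pos / total_pos >= target_recall:
            best = max(best, true_pos / k)
    return best


# Toy example with hypothetical scores.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
print(roc_auc(toy_labels, toy_scores), precision_at_recall(toy_labels, toy_scores))
```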

Round 1 — four-way bake-off with broken time-split

Model	AUC	P@R90
Per-pixel transformer (138k params, control)	0.835	0.473
1-D TempCNN (Pelletier 2019, 71k)	0.819	0.441
Spatial-temporal U-Net 5×5×24 (128k)	0.829	0.468
Crop classifier on detection crops (250k)	0.985 (random split)	0.715

The crop result looked exciting but used a random 80/20 split and train loss collapsed below 0.01 — overfit warning. The three temporal models hovered near 0.83 AUC because the time-split put 99.7% of positives in val (only 34 in train; pos_weight=1066). Round 1's finding was structural: the time-split was broken, not the architectures.

Round 2 — event-stratified split + tabular fusion (label leakage)

Switched to event-stratified split (each confirmed event's pixels go to one side only; pos_weight=4.00, honest). Added 17 tabular features per pixel including nearest external-comparator hit.
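A minimal sketch of what an event-stratified split looks like, assuming each pixel row carries an `event_id` and a binary `label` (both field names are hypothetical). Whole events are assigned to one side, and `pos_weight` is taken as the negative-to-positive ratio on the training side.

```python
import random


def event_stratified_split(pixels, val_fraction=0.2, seed=0):
    """Split rows so that every event's pixels land on exactly one side."""
    event_ids = sorted({p["event_id"] for p in pixels})
    random.Random(seed).shuffle(event_ids)          # deterministic shuffle
    n_val = max(1, round(len(event_ids) * val_fraction))
    val_ids = set(event_ids[:n_val])
    train = [p for p in pixels if p["event_id"] not in val_ids]
    val = [p for p in pixels if p["event_id"] in val_ids]
    return train, val


def pos_weight(rows):
    """Negative-to-positive ratio, as passed to a weighted BCE loss."""
    n_pos = sum(1 for r in rows if r["label"] == 1)
    n_neg = len(rows) - n_pos
    return n_neg / n_pos


# Toy data: 10 hypothetical events with 3 pixels each.
toy = [{"event_id": e, "label": e % 2} for e in range(10) for _ in range(3)]
toy_train, toy_val = event_stratified_split(toy)
print(len(toy_train), len(toy_val), pos_weight(toy_train))
```

Splitting on events rather than pixels keeps near-duplicate pixels of the same fire from appearing on both sides, which is the leak a random pixel-level split would introduce.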

Model	AUC	P@R90
Tabular-only MLP (no temporal input, 5k)	0.981	0.809
TempCNN + 17-feature fusion (87k)	0.990	0.913
Stunet + 17-feature fusion (139k)	0.989	0.927

Honesty check that defined Round 3. A 5,377-parameter MLP with no SEVIRI input hitting AUC 0.98 was the smoking gun: the "nearest comparator hit" features are essentially what the reconciler already uses to assign confirmed_* outcomes. Label leakage. Round 3 stripped those features.

Round 3 — leakage audit + honest baselines

Same event-stratified split, comparator features removed. 13 honest features only: lat, lng, FP-zone distances ×5, hour-of-day cyclic ×2, day-of-year cyclic ×2, solar zenith, distance-to-centroid.
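The "cyclic ×2" features are presumably the usual sine and cosine pair; the sketch below shows that encoding under that assumption. The function name is illustrative.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic quantity to (sin, cos) so the wrap-around is continuous."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


hour_features = cyclic_encode(15.5, 24.0)   # 15:30 local time
doy_features = cyclic_encode(148, 365.25)   # day 148 of the year
print(hour_features, doy_features)
```

With this encoding 23:00 and 01:00 sit close together in feature space, which a raw hour value of 23 versus 1 would not express.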

Model	AUC	P@R90	Params
Transformer (honest, no tabular)	0.948	0.686	138k
TempCNN (honest, no tabular)	0.960	0.696	71k
Stunet (honest, no tabular)	0.966	0.730	128k
Tabular MLP (13 honest features only)	0.972	0.730	5k
Stunet + 13-honest-feature fusion	0.982	0.868	139k

Event-stratified split alone lifted the transformer 0.835 → 0.948. Spatial 5×5 stunet beats per-pixel by ~1.8 pp. Tabular alone at 0.972 is real structural signal (fires cluster in biomes and summer afternoons) — not comparator leakage. Fusion delivers the best honest model.

Round 4 — hard-negative mining + focal loss + augmentation

7,183 high-BT pixels inside known FP zones were mined and added to the training pool as adversarial negatives. Variants stacked on the Round 3 winner:

Variant	AUC	P@R90
Round 3 winner replay	0.982	0.868
+ hard negatives	0.986	0.845
+ focal loss (γ=2, α=0.8)	0.985	0.868
+ temporal-jitter / channel-drop / spatial-flip	0.984	0.843
all three combined	0.986	0.819

Diminishing returns. Focal loss recovered the P@R90 dip from hard negatives; augmentation did not help at this dataset size. Net gain: +0.3 pp AUC.
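For reference, a per-sample sketch of binary focal loss with the γ=2, α=0.8 setting from the table. It follows the common Lin et al. formulation in which α weights the positive class; whether the Round 4 training code applies α the same way is an assumption.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def binary_focal_loss(logit, label, gamma=2.0, alpha=0.8):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)                       # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct positive contributes far less than a missed one.
print(binary_focal_loss(3.0, 1), binary_focal_loss(-3.0, 1))
```

The `(1 - p_t)**gamma` factor shrinks the loss on easy examples, so the gradient concentrates on the hard negatives mined from the FP zones.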

Round 5 — derived physics features + larger backbone

The 2-channel input was expanded to 8 physics-informed channels: BT_MIR, BT_TIR, MIR−TIR, MIR/TIR ratio, Δ-15 min MIR, Δ-60 min MIR, MIR z-score, TIR z-score. A 1.09 M-parameter stunet variant tested capacity vs data bound.
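The eight channels can be sketched for a single pixel's time series as follows. This assumes 15-minute frames (so the 15-minute and 60-minute deltas are lags of 1 and 4 steps), zero-fills deltas that lack history, and takes z-scores over the pixel's own window; the dictionary keys and all three choices are assumptions, since the page does not specify them.

```python
from statistics import mean, pstdev


def zscore(series):
    """Standardise a series against its own mean and population std."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]


def lagged_delta(series, lag):
    """x[t] - x[t - lag]; steps without enough history get 0.0."""
    return [series[t] - series[t - lag] if t >= lag else 0.0
            for t in range(len(series))]


def derive_channels(bt_mir, bt_tir):
    """Build 8 physics-informed channels from two brightness-temperature series (K)."""
    return {
        "bt_mir": list(bt_mir),
        "bt_tir": list(bt_tir),
        "mir_minus_tir": [m - t for m, t in zip(bt_mir, bt_tir)],
        "mir_over_tir": [m / t for m, t in zip(bt_mir, bt_tir)],
        "d15_mir": lagged_delta(bt_mir, 1),   # one 15-minute step back
        "d60_mir": lagged_delta(bt_mir, 4),   # four 15-minute steps back
        "mir_z": zscore(bt_mir),
        "tir_z": zscore(bt_tir),
    }


# Hypothetical series: a 20 K MIR jump at step 4 over a flat TIR background.
channels = derive_channels([300, 300, 300, 300, 320, 310], [295] * 6)
print(channels["mir_minus_tir"])
```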

Model	AUC	P@R90	Params
v5_derived (8-channel)	0.988	0.912	141k
v5_big (6-layer / 1.09M params)	0.986	0.901	1.09M
v5_crop_ev (event-stratified crop classifier)	broken (val degenerate)	—	250k

v5_derived beat Round-4 by +0.3 pp AUC and +4.4 pp P@R90. The bigger model showed zero lift — confirming we're data-bound, not parameter-bound.

Round 6 — ensemble + isotonic calibration + three held-out audits

The Round 3, Round 4, and Round 5 winners were ensembled by simple mean of sigmoid scores, then isotonic-calibrated against event-stratified training scores.

Split	val pos / neg	Ensemble AUC	P@R90	FP-zone % top-10%
Event-stratified	2,716 / 10,877	0.988	0.915	0.3%
Time-held-out	74 / 7,081	0.997	0.406 *	4.3%
Geo-held-out (East Sicily, lng ≥ 14.5°)	7,249 / 16,243	0.992	0.982	0.3%

* The time-held-out window contains only 74 positives against 7,081 negatives (pos_weight=96). P@R90 has a wide confidence interval at this imbalance; AUC 0.997 says the ranking is near-perfect. Operationally we set the recall threshold to 50–80% in such low-positive windows.

Calibration ECE (expected calibration error) before and after isotonic:

Split	raw ECE	calibrated ECE
Event-stratified	0.020	0.036
Time-held-out	0.058	0.009
Geo-held-out	0.045	0.014

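Calibration helps most where it matters: under time and geographic distribution shift. On the in-distribution event-stratified split the raw scores were already well calibrated, and isotonic calibration slightly worsened ECE there (0.020 → 0.036).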
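ECE in the table above can be read as the sample-weighted gap between predicted probability and observed positive rate. A minimal sketch, assuming 10 equal-width bins (the bin count and binning scheme used for the reported numbers are not stated):

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Sample-weighted mean |observed rate - mean predicted prob| over bins."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]   # count, sum(prob), sum(label)
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)        # p == 1.0 goes in the last bin
        bins[b][0] += 1
        bins[b][1] += p
        bins[b][2] += y
    n = len(probs)
    # Per bin: (count / n) * |sum_y / count - sum_p / count| = |sum_y - sum_p| / n
    return sum(abs(sum_y - sum_p) / n for count, sum_p, sum_y in bins if count)


# Over-confident toy scores: predicted 0.9, observed positive rate 0.5.
print(expected_calibration_error([0.9] * 4, [1, 1, 0, 0]))
```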

Update 2026-05-30 — Universal fusion supersedes R6 ensemble

The R6 ensemble (Round 6 above, detailed under "What shipped") was shipped to shadow on 2026-05-29. A day later, the pipeline was extended to consume every data source on the DGX as fused training input, not just SEVIRI BT_MIR/BT_TIR. The new model, referred to as universal_fusion, is now live in shadow mode alongside R6, writing universal_score to event_grades on a 30-minute poll cadence.

Model	Inputs	AUC (event-strat)	P@R90
R6 ensemble (2026-05-29)	SEVIRI 2-channel + 13 honest tabular	0.988	0.915
Universal fusion (2026-05-30)	SEVIRI 2-ch + MTG LST + WorldCover landcover (5 classes) + time-gated comparator features (21 sources × 3 stats) + Hawkes ignition prior + SEVIRI RSS (11 channels, when available) + MTG FCI L1c (16 channels, when available) + CAP alerts + 14 tabular	0.9924	0.928

RSS and FCI coverage were 2.9% and 0% respectively at training time — both caches retain only ~24 hours and the labeled training events all predate the cache windows. Those branches receive NaN-imputed zeros today and will start contributing real signal as the caches fill (tracked at /api/data_coverage). Even without RSS and FCI, the addition of LST + WorldCover + time-gated comparator features lifted AUC +0.4 pp and P@R90 +1.3 pp over R6.

Stacking (R6 + universal as base learners + a learned meta-classifier) was evaluated. The meta-classifier assigned a near-zero weight to R6 — the universal model's signal subsumes R6's. Stacking added no gain.

Honest backtest results — universal vs R6 on the same held-out splits (2026-05-30)

Split	R6 ensemble	Universal fusion
Event-stratified (val)	AUC 0.988 / P@R90 0.915	AUC 0.9924 / P@R90 0.928
Time-held-out (cutoff 2026-05-22)	AUC 0.997 / P@R90 0.406 *	AUC 0.49 — split degenerate, 34 train positives
Geo-held-out (East Sicily, lng ≥ 14.5°)	AUC 0.992 / P@R90 0.982	AUC 0.863 / P@R90 0.663

Honest finding: universal fusion under-performs R6 on the geo-held-out split. Training only on West Sicily and testing on East Sicily, universal drops from event-stratified AUC 0.9924 to AUC 0.863. R6 does not drop at all: it scores AUC 0.992 on the same split, against 0.988 event-stratified. The likely cause is that universal's additional features (LST, WorldCover, comparator distances) are spatially correlated with the training distribution, and the model has learned location-specific patterns it cannot transfer. Event-stratified validation overstates universal's real-world generalization. Live observation through 2026-06-29 will be the deciding test; if universal's live FP-zone behaviour or novel-catch rate disappoints, R6 is the fallback. Both are evaluated at the June 29 gate.

* R6's time-held-out P@R90 of 0.406 has a wide confidence interval because that window contained only 74 positives: small-sample noise, not a quality signal.
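The two distribution-shift splits in the table above can be sketched as simple filters. The row fields `lng`, `observed_at` and `label`, the UTC interpretation of the 2026-05-22 cutoff, the inclusive test side, and the `min_positives` threshold are all assumptions made for illustration.

```python
from datetime import datetime, timezone

GEO_CUTOFF_LNG = 14.5                                    # East Sicily: lng >= 14.5
TIME_CUTOFF = datetime(2026, 5, 22, tzinfo=timezone.utc)


def geo_held_out(rows):
    """Train on the west, test on the east."""
    train = [r for r in rows if r["lng"] < GEO_CUTOFF_LNG]
    test = [r for r in rows if r["lng"] >= GEO_CUTOFF_LNG]
    return train, test


def time_held_out(rows):
    """Train before the cutoff, test on or after it."""
    train = [r for r in rows if r["observed_at"] < TIME_CUTOFF]
    test = [r for r in rows if r["observed_at"] >= TIME_CUTOFF]
    return train, test


def is_degenerate(train_rows, min_positives=100):
    """Flag splits whose training side has too few positives to learn from.

    The threshold of 100 is a hypothetical guard, not a PHOENIX setting.
    """
    return sum(r["label"] for r in train_rows) < min_positives
```

A guard of this kind would have flagged the universal time-held-out run, where only 34 positives fell on the training side, before any AUC was reported for it.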

Universal fusion enters its own 30-day shadow observation window. The existing 2026-06-29 promotion gate now compares the production transformer against both R6 and universal candidates. The promotion-eval trigger has not yet been updated to consider universal — that is the next change to ship.

What shipped

The shipped sub-pixel detector is the 3-model ensemble with isotonic calibration, scoring every event in the 96 h rolling window via the existing 6 h laptop shadow pipeline:


ensemble_score = (v5_derived + v4_focal + stunet_fused_honest) / 3

ensemble_calibrated_score = isotonic(ensemble_score)  // refit weekly
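The two formulas above can be sketched in plain Python. The isotonic fit below is a small pool-adjacent-violators implementation that predicts with a step function; it is an illustration of the technique, not the shipped calibrator, and it does not treat tied scores specially. The weekly refit is not shown.

```python
from bisect import bisect_left


def ensemble_mean(*member_scores):
    """Element-wise mean of the members' sigmoid scores."""
    return [sum(vals) / len(vals) for vals in zip(*member_scores)]


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function."""
    blocks = []  # each block: [sum of labels, count, highest score covered]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def apply_isotonic(model, score):
    """Look up the calibrated probability for one raw ensemble score."""
    uppers, values = model
    idx = bisect_left(uppers, score)
    return values[min(idx, len(values) - 1)]


# Hypothetical member scores for two events.
raw = ensemble_mean([0.9, 0.2], [0.6, 0.4], [0.3, 0.6])
calibrator = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print([apply_isotonic(calibrator, s) for s in raw])
```

Isotonic calibration only reorders nothing: it maps scores through a monotone function, so AUC is unchanged apart from new ties while the probabilities move closer to observed frequencies.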

Three new shadow columns on event_grades:

ensemble_score — raw mean of sigmoids
ensemble_calibrated_score — isotonic-corrected probability
ensemble_score_computed_at — UTC ISO timestamp

The detector runs in shadow mode alongside the existing transformer score for an initial 30-day observation window. No live alerts, broadcasts, or tier decisions are influenced yet; the columns are for backtest comparison only. Promotion gate, evaluated 2026-06-29: ≥30 days of out-of-sample observations, ≤5% drift in measured FP-zone rate, ≥3 truth-confirmed events caught by the ensemble but missed by current production.
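The three gate criteria can be expressed as a single predicate. This is a sketch only: the `ShadowStats` fields are hypothetical, and "≤5% drift" is read here as an absolute difference of at most 0.05 in FP-zone rate, which is one of two plausible interpretations (the other being a relative change).

```python
from dataclasses import dataclass


@dataclass
class ShadowStats:
    days_observed: int              # out-of-sample days in shadow mode
    baseline_fp_zone_rate: float    # FP-zone rate measured at ship time
    live_fp_zone_rate: float        # FP-zone rate measured during shadow
    novel_confirmed_catches: int    # confirmed events production missed


def passes_promotion_gate(stats):
    """Return True only if all three promotion criteria hold."""
    drift = abs(stats.live_fp_zone_rate - stats.baseline_fp_zone_rate)
    return (stats.days_observed >= 30
            and drift <= 0.05
            and stats.novel_confirmed_catches >= 3)


# Hypothetical numbers for illustration.
print(passes_promotion_gate(ShadowStats(31, 0.003, 0.012, 4)))
```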

Honest limitations

Data scarcity. 450 labelled positives, 1 region, 5 weeks. Geographic generalisation beyond Sicily is unknown and would require a new training set.
Time-held-out positive count is fragile. 74 positives in the time-test window is borderline-low. AUC 0.997 has a non-trivial standard error.
No multi-class head yet. Wildfire / industrial flare / volcanic / agricultural burn / persistent source are not separated — the detector is currently binary fire / no-fire. Multi-class work is queued behind the 30-day shadow observation.
No VIS / SWIR. The system uses only MIR + TIR. Adding VIS or SWIR bands would require new ingest pipelines and is not in current scope.
Comparator-leakage families known. Any feature derived from "what external sensors already confirmed" inflates AUC by ~3–4 pp through reverse-engineering the reconciler. Honest models exclude these.
FCI subpixel_v1: bimodal residual, ~7 km uncertainty zone (2026-06-10, supersedes 2026-06-09 systematic-bias claim). An internal audit of 9 PHOENIX/FIRMS matched pairs in the Sicily 14-day window ending 2026-06-10 initially suggested a single systematic ~7 km NNE georectification offset. The G8 audit (2026-06-09) showed those 9 samples reduce to 2 independent fires: 5 PHOENIX pixels cluster around one FIRMS hit at (37.09, 14.37) with NE/ENE bearings 24–68°, and 3 PHOENIX pixels cluster around another at (37.66, 12.74) with NNW bearings 313–344°. The original "16.7° NNE" mean is the circular mean of two opposing clusters, and the Rayleigh p = 0.002 is inflated by pixel-cluster pseudoreplication. A static Δlat / Δlon calibration can therefore only help one cluster while hurting the other; the previously-proposed env-flag drop-in PHOENIX_FCI_SUBPIXEL_GEO_OFFSET=1 (Δlat = +0.06° / Δlon = +0.02°) has been rolled back (file renamed to .disabled-pre-g8-bimodal-finding-20260610). Five consecutive root-cause hypotheses (G5 accumulator-projection-port, G6 chunk-widening, G6 parallax, G7 wind-drift, G8 satpy upstream + FIRMS-side geoloc) have now been falsified. We are pivoting from root-cause hunt to mitigation: publishing a per-detection ~7 km circular uncertainty radius on /event/, /accuracy/, and the public map, and no longer claiming pixel-level localization for subpixel_v1. The G8 H2 hypothesis (L2 FCI-AF and L1c-derived PHOENIX detect different physical fires in this window) is unresolved and needs a 30-day backfill to retest. The lead-time signal PHOENIX shows over FIRMS (+1,800 s median) remains real; what changes is how we communicate the spatial certainty. The additive map endpoint /api/detections_phoenix_anchored and credit-extras scorer (commit 9e70a24) are unaffected by the rollback.
PHOENIX does not yet replace FIRMS, VIIRS, MODIS, SLSTR, OroraTech, or Vigili del Fuoco for operational dispatch — these continue to be authoritative. Audit reports: phoenix_mir_wooster_audit/G5_ACCUMULATOR_PROJECTION_PORT_2026_06_09.md, G6_CHUNK_WIDEN_PLUS_PARALLAX_2026_06_09.md, G7_WIND_DRIFT_HYPOTHESIS_2026_06_09.md, G8_SATPY_NAV_AND_FIRMS_ERROR_2026_06_09.md (internal). See also the change-log entry.
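The circular-mean artefact in the FCI subpixel_v1 item can be illustrated with a short sketch. The bearings below are made-up values chosen inside the two reported ranges (24–68° and 313–344°); they are not the audit's measurements, and the function name is illustrative.

```python
import math


def circular_mean_deg(bearings_deg):
    """Circular mean bearing (degrees) and mean resultant length (0 to 1)."""
    s = sum(math.sin(math.radians(b)) for b in bearings_deg)
    c = sum(math.cos(math.radians(b)) for b in bearings_deg)
    mean = math.degrees(math.atan2(s, c)) % 360.0
    resultant = math.hypot(s, c) / len(bearings_deg)
    return mean, resultant


# Hypothetical bearings, NOT the audit data.
ne_cluster = [24, 40, 50, 60, 68]
nnw_cluster = [313, 330, 344]

pooled_mean, pooled_r = circular_mean_deg(ne_cluster + nnw_cluster)
ne_mean, _ = circular_mean_deg(ne_cluster)
nnw_mean, _ = circular_mean_deg(nnw_cluster)
print(f"pooled {pooled_mean:.1f} deg, NE {ne_mean:.1f} deg, NNW {nnw_mean:.1f} deg")
```

Pooling the two groups yields a single north-north-east bearing that sits between the clusters and matches neither cluster's own mean, which is how two separate fires can masquerade as one systematic offset.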

Reproducibility

All training code, weights, and per-round metrics live in the laptop-local phoenix_bakeoff/ directory. Inference daemon at phoenix_shadow_laptop/ensemble_inference.py, wired into shadow_pipeline.py at the 6 h cadence. Six rounds of training cost under $1 total on RunPod (cheapest available GPU in fallback chain, mandatory pod DELETE in every finally block).

PHOENIX is a two-person grassroots non-commercial wildfire-detection project. External point of contact: Gaetano Zambito, [email protected]. Group inbox: [email protected].