PHOENIX sub-pixel detection methodology

Meteosat Second Generation SEVIRI delivers a 3 km native pixel over Sicily every 15 minutes in two thermal channels (MIR 3.9 µm, TIR 10.8 µm). A wildfire smaller than 3 km warms only part of one pixel: the sub-pixel problem. The Dozier (1981) retrieval already solves it physically. PHOENIX layers a learned, ensembled detector on top that holds AUC ≥ 0.988 on three independent held-out splits (0.988 event-stratified, 0.997 time-held-out, 0.992 geo-held-out). This page documents how the current sub-pixel detector was built: every round, every gain, every honest negative.

What we had to beat

Baseline | AUC (time-held-out) | P @ recall 0.90
PHOENIX per-pixel BT transformer (2026-05-27 prod) | 0.834 | 0.473

Anything shipped had to materially exceed both numbers and stay below 5% false-positive rate inside Sicily's known FP zones (Etna disk, Stromboli plume, Priolo / Gela / Milazzo industrial polygons).

Data inventory

Source | Volume | Window | Used for
SEVIRI BT_MIR + BT_TIR (.npz) | 1,650 frames | Apr 23 – May 28 2026 | model input
Confirmed events (t72h_outcome LIKE 'confirmed_*') | 450 | Mar 31 – May 28 | positive labels
Negative / unverifiable events | 1,894 | Mar 31 – May 28 | negative labels
Hard negatives mined from FP zones | 7,183 | Apr 23 – May 28 | adversarial training negatives
External comparator hits (VIIRS, MODIS, SAR, VVF) | 9,630 | Apr 23 – May 28 | excluded from honest models (leakage)
Detection crops | 17,536 | Apr 23 – May 28 | verification-head experiment (not shipped)

Having only 450 positive event labels is the binding constraint on every architecture. Sentinel-2 burn-scar tiles and FCI L1c raw data were considered and left parked; neither is persisted on disk.

Iteration log

Six bake-off rounds were run on RunPod (NVIDIA L4 / A4000 / A5000, with a mandatory pod DELETE in the finally block of every code path). All runs used an identical seed, batch size 128–256, Adam at 1e-4, 25–40 epochs, and BCEWithLogits loss unless otherwise noted. Every model reports its best validation AUC and its precision at recall = 0.90 (P@R90).
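
Both headline metrics can be computed without any library, which makes them easy to audit. The sketch below is an illustration, not the bake-off code: roc_auc and precision_at_recall are hypothetical helper names, and P@R90 is assumed to mean the precision at the highest score threshold whose recall reaches 0.90.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties count as half a win (Mann-Whitney formulation); O(n_pos * n_neg).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_recall(scores, labels, target_recall=0.90):
    """Precision at the highest threshold whose recall reaches target_recall.

    ASSUMPTION: this is one common reading of P@R90; the source does not
    state which convention the bake-off used.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score so the threshold is well defined.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / total_pos >= target_recall:
            return tp / (tp + fp)
    raise ValueError("target_recall must be in (0, 1]")
```

The pairwise AUC is quadratic in sample count, which is acceptable for validation sets of the sizes reported here but would be replaced by a rank-based formula on larger data.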

Round 1 — four-way bake-off with broken time-split

Model | AUC | P@R90
Per-pixel transformer (138k params, control) | 0.835 | 0.473
1-D TempCNN (Pelletier 2019, 71k) | 0.819 | 0.441
Spatial-temporal U-Net 5×5×24 (128k) | 0.829 | 0.468
Crop classifier on detection crops (250k) | 0.985 (random split) | 0.715

The crop result looked exciting, but it used a random 80/20 split and its train loss collapsed below 0.01, an overfitting warning. The three temporal models hovered near 0.83 AUC because the time-split put 99.7% of positives in val (only 34 in train; pos_weight=1066). Round 1's finding was structural: the time-split was broken, not the architectures.
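
A split this lopsided can be caught before any training by counting positives on each side. The following is a minimal sketch, assuming binary labels per split and that pos_weight is the negative-to-positive ratio of the training side; split_report is a hypothetical helper, and the toy numbers are illustrative rather than the Round 1 counts.

```python
def split_report(train_labels, val_labels):
    """Summarise how positives are distributed across a train/val split."""
    train_pos = sum(train_labels)
    val_pos = sum(val_labels)
    train_neg = len(train_labels) - train_pos
    total_pos = train_pos + val_pos
    return {
        "train_pos": train_pos,
        "val_pos": val_pos,
        # Share of all positives that ended up in validation.
        "val_pos_share": val_pos / total_pos if total_pos else float("nan"),
        # Negative-to-positive ratio, the usual pos_weight for a BCE-with-logits loss.
        "pos_weight": train_neg / train_pos if train_pos else float("inf"),
    }


# Toy example: 9 of 10 positives land in validation, so pos_weight explodes.
report = split_report(train_labels=[1] + [0] * 99, val_labels=[1] * 9 + [0] * 11)
print(report)
```

A very high val_pos_share combined with a three- or four-digit pos_weight is the signature described above: the model sees almost no positives during training.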

Round 2 — event-stratified split + tabular fusion (label leakage)

Round 2 switched to an event-stratified split (each confirmed event's pixels go to one side only; pos_weight=4.00, honest) and added 17 tabular features per pixel, including the nearest external-comparator hit.

Model | AUC | P@R90
Tabular-only MLP (no temporal input, 5k) | 0.981 | 0.809
TempCNN + 17-feature fusion (87k) | 0.990 | 0.913
Stunet + 17-feature fusion (139k) | 0.989 | 0.927

Honesty check that defined Round 3. A 5,377-parameter MLP with no SEVIRI input hitting AUC 0.98 was the smoking gun: the "nearest comparator hit" features are essentially what the reconciler already uses to assign confirmed_* outcomes. Label leakage. Round 3 stripped those features.
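
The event-stratified split introduced in this round amounts to a grouped split keyed on event. A minimal sketch, assuming every sample carries an event_id field (a hypothetical name) and that events are assigned to validation at random; it only guarantees that no event straddles the two sides and does not balance classes.

```python
import random


def event_stratified_split(samples, val_fraction=0.2, seed=0):
    """Split samples so that every event_id lands entirely in train or in val."""
    event_ids = sorted({s["event_id"] for s in samples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(event_ids)
    n_val = max(1, round(len(event_ids) * val_fraction))
    val_events = set(event_ids[:n_val])
    train = [s for s in samples if s["event_id"] not in val_events]
    val = [s for s in samples if s["event_id"] in val_events]
    return train, val


# Toy data: 10 events with 5 pixels each.
samples = [{"event_id": e, "pixel": p} for e in range(10) for p in range(5)]
train, val = event_stratified_split(samples)
```

Because the unit of assignment is the event rather than the pixel, neighbouring pixels of the same fire can never appear on both sides, which is what makes the resulting validation numbers honest.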

Round 3 — leakage audit + honest baselines

Same event-stratified split, comparator features removed. 13 honest features only: lat, lng, FP-zone distances ×5, hour-of-day cyclic ×2, day-of-year cyclic ×2, solar zenith, distance-to-centroid.

Model | AUC | P@R90 | Params
Transformer (honest, no tabular) | 0.948 | 0.686 | 138k
TempCNN (honest, no tabular) | 0.960 | 0.696 | 71k
Stunet (honest, no tabular) | 0.966 | 0.730 | 128k
Tabular MLP (13 honest features only) | 0.972 | 0.730 | 5k
Stunet + 13-honest-feature fusion | 0.982 | 0.868 | 139k

Event-stratified split alone lifted the transformer 0.835 → 0.948. Spatial 5×5 stunet beats per-pixel by ~1.8 pp. Tabular alone at 0.972 is real structural signal (fires cluster in biomes and summer afternoons) — not comparator leakage. Fusion delivers the best honest model.
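
The "summer afternoons" signal reaches the model through the four cyclic time features. A minimal sketch of such an encoding, assuming the two cyclic features per quantity are a sine/cosine pair and that day-of-year uses a 365.25-day period; neither detail is stated in the feature list, and the function names are hypothetical.

```python
import math


def cyclic(value, period):
    """Map a periodic quantity onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def time_features(hour_of_day, day_of_year):
    """Four cyclic features: hour sin/cos and day-of-year sin/cos."""
    hour_sin, hour_cos = cyclic(hour_of_day, 24.0)
    doy_sin, doy_cos = cyclic(day_of_year, 365.25)  # ASSUMPTION: mean year length
    return [hour_sin, hour_cos, doy_sin, doy_cos]
```

The point of the pair is continuity across the wrap-around: 23:00 and 01:00 end up close together in feature space, which a raw hour value of 23 versus 1 would not achieve.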

Round 4 — hard-negative mining + focal loss + augmentation

7,183 high-BT pixels inside known FP zones were mined and added to the training pool as adversarial negatives. Variants stacked on the Round 3 winner:

Variant | AUC | P@R90
Round 3 winner replay | 0.982 | 0.868
+ hard negatives | 0.986 | 0.845
+ focal loss (γ=2, α=0.8) | 0.985 | 0.868
+ temporal-jitter / channel-drop / spatial-flip | 0.984 | 0.843
all three combined | 0.986 | 0.819

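Diminishing returns. Focal loss recovered the P@R90 dip from hard negatives (0.845 → 0.868); augmentation did not help at this dataset size. Net gain of the focal-loss variant over the Round 3 winner: +0.3 pp AUC (0.982 → 0.985), with P@R90 unchanged at 0.868.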
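
For reference, the focal loss with γ=2 and α=0.8 can be written per example as below. This is a sketch of the standard binary focal loss, assuming α weights the positive class and 1 − α the negative class; the source does not say which convention the training code used, and focal_loss is a hypothetical name.

```python
import math


def focal_loss(logit, label, gamma=2.0, alpha=0.8):
    """Binary focal loss for a single example, computed from a raw logit."""
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if logit >= 0:
        p = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p = e / (1.0 + e)
    p_t = p if label == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # ASSUMPTION: alpha weights positives
    p_t = max(p_t, 1e-12)                           # guard against log(0)
    # The (1 - p_t) ** gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ=0 and α=0.5 the expression reduces to half the ordinary binary cross-entropy, which is a convenient sanity check when swapping losses.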

Round 5 — derived physics features + larger backbone

The 2-channel input was expanded to 8 physics-informed channels: BT_MIR, BT_TIR, MIR−TIR, MIR/TIR ratio, Δ-15 min MIR, Δ-60 min MIR, MIR z-score, TIR z-score. A 1.09 M-parameter stunet variant tested whether the model is capacity-bound or data-bound.

Model | AUC | P@R90 | Params
v5_derived (8-channel) | 0.988 | 0.912 | 141k
v5_big (6-layer / 1.09M params) | 0.986 | 0.901 | 1.09M
v5_crop_ev (event-stratified crop classifier) | broken (val degenerate) | n/a | 250k

v5_derived beat the Round 4 focal-loss variant by +0.3 pp AUC and +4.4 pp P@R90. The bigger model showed no lift (AUC 0.986, P@R90 0.901), which is consistent with being data-bound, not parameter-bound.
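
The eight derived channels can be sketched for a single pixel's time series as follows. This is an illustration under stated assumptions: frames are 15 minutes apart (so Δ-60 min is a lag of 4 frames), deltas are set to 0.0 while history is too short, and z-scores are taken over the pixel's own window. The source does not specify the padding or the z-score reference population, and derive_channels is a hypothetical name.

```python
from statistics import mean, pstdev


def derive_channels(bt_mir, bt_tir, steps_15=1, steps_60=4):
    """Expand two brightness-temperature series (kelvin) into 8 channels per frame."""
    if len(bt_mir) != len(bt_tir):
        raise ValueError("series must have equal length")

    def zscores(series):
        # ASSUMPTION: statistics come from this pixel's own time window.
        mu, sigma = mean(series), pstdev(series)
        return [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    def delta(series, t, lag):
        # Change versus `lag` frames earlier; 0.0 while history is too short.
        return series[t] - series[t - lag] if t >= lag else 0.0

    mir_z, tir_z = zscores(bt_mir), zscores(bt_tir)
    frames = []
    for t, (mir, tir) in enumerate(zip(bt_mir, bt_tir)):
        frames.append([
            mir, tir,                    # raw BT_MIR, BT_TIR
            mir - tir, mir / tir,        # difference and ratio
            delta(bt_mir, t, steps_15),  # delta over 15 min
            delta(bt_mir, t, steps_60),  # delta over 60 min
            mir_z[t], tir_z[t],          # per-channel z-scores
        ])
    return frames


# Toy series: a flat background with a 20 K MIR jump in the last frame.
frames = derive_channels([300.0, 300.0, 300.0, 300.0, 320.0], [290.0] * 5)
```

A sub-pixel fire typically shows up as a MIR jump with little TIR change, so the difference, ratio, and delta channels hand the network that contrast directly instead of asking it to learn the subtraction.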

Round 6 — ensemble + isotonic calibration + three held-out audits

The Round 3, Round 4, and Round 5 winners were ensembled by simple mean of sigmoid scores, then isotonic-calibrated against event-stratified training scores.

Split | Val pos / neg | Ensemble AUC | P@R90 | FP-zone % (top 10%)
Event-stratified | 2,716 / 10,877 | 0.988 | 0.915 | 0.3%
Time-held-out | 74 / 7,081 | 0.997 | 0.406 * | 4.3%
Geo-held-out (East Sicily, lng ≥ 14.5°) | 7,249 / 16,243 | 0.992 | 0.982 | 0.3%

* The time-held-out window contains only 74 positives against 7,081 negatives (pos_weight=96). P@R90 has a wide confidence interval at this imbalance; AUC 0.997 says the ranking is near-perfect. Operationally we set the recall threshold to 50–80% in such low-positive windows.

Calibration ECE (expected calibration error) before and after isotonic:

Split | Raw ECE | Calibrated ECE
Event-stratified | 0.020 | 0.036
Time-held-out | 0.058 | 0.009
Geo-held-out | 0.045 | 0.014

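Calibration helps most where it matters: under time and geographic distribution shift. On the in-distribution event-stratified split the raw scores were already well calibrated, and isotonic calibration slightly worsens ECE there (0.020 → 0.036).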
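
ECE figures like those in the table depend on the binning scheme. The sketch below shows the common equal-width variant, assuming 10 bins; the source does not state how many bins or which binning the audit used, and expected_calibration_error is a hypothetical name.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: sample-weighted mean of |observed frequency - mean confidence|."""
    if not probs:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        confidence = sum(p for p, _ in bucket) / len(bucket)
        frequency = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(frequency - confidence)
    return ece
```

Running the same function on raw and on calibrated scores for each split gives a like-for-like before/after comparison, provided the bin count is held fixed.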

Update 2026-05-30 — Universal fusion supersedes R6 ensemble

The R6 ensemble described above was shipped to shadow on 2026-05-29. A day later, the pipeline was extended to consume every data source on the DGX as fused training input, not just SEVIRI BT_MIR/BT_TIR. The new model, referred to as universal_fusion, is now live in shadow mode alongside R6, writing universal_score to event_grades on a 30-minute poll cadence.

Model | Inputs | AUC (event-strat) | P@R90
R6 ensemble (2026-05-29) | SEVIRI 2-channel + 13 honest tabular | 0.988 | 0.915
Universal fusion (2026-05-30) | SEVIRI 2-ch + MTG LST + WorldCover landcover (5 classes) + time-gated comparator features (21 sources × 3 stats) + Hawkes ignition prior + SEVIRI RSS (11 channels, when available) + MTG FCI L1c (16 channels, when available) + CAP alerts + 14 tabular | 0.9924 | 0.928

RSS and FCI coverage were 2.9% and 0% respectively at training time — both caches retain only ~24 hours and the labeled training events all predate the cache windows. Those branches receive NaN-imputed zeros today and will start contributing real signal as the caches fill (tracked at /api/data_coverage). Even without RSS and FCI, the addition of LST + WorldCover + time-gated comparator features lifted AUC +0.4 pp and P@R90 +1.3 pp over R6.
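
The coverage figures and the zero imputation can be illustrated as follows. This is a sketch, not the pipeline's code: coverage and impute_zero are hypothetical helpers, and the availability mask returned alongside the filled values is an optional addition of this example that the source does not mention.

```python
import math


def coverage(values):
    """Fraction of entries that carry a real (non-NaN) observation."""
    if not values:
        return 0.0
    present = sum(1 for v in values if not math.isnan(v))
    return present / len(values)


def impute_zero(values):
    """Replace NaN with 0.0 and return a parallel 0/1 availability mask."""
    filled = [0.0 if math.isnan(v) else v for v in values]
    # ASSUMPTION: the mask is illustrative; the source only describes the zeros.
    mask = [0.0 if math.isnan(v) else 1.0 for v in values]
    return filled, mask


# Toy branch input: one observed value out of four events.
branch = [float("nan"), 1.5, float("nan"), float("nan")]
filled, mask = impute_zero(branch)
```

A mask of this kind lets a model distinguish "channel missing" from "channel observed as zero", which matters once the caches fill and real values start arriving.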

Stacking (R6 + universal as base learners + a learned meta-classifier) was evaluated. The meta-classifier assigned a near-zero weight to R6 — the universal model's signal subsumes R6's. Stacking added no gain.

Honest backtest results — universal vs R6 on the same held-out splits (2026-05-30)

Split | R6 ensemble | Universal fusion
Event-stratified (val) | AUC 0.988 / P@R90 0.915 | AUC 0.9924 / P@R90 0.928
Time-held-out (cutoff 2026-05-22) | AUC 0.997 / P@R90 0.406 * | AUC 0.49 (split degenerate, 34 train positives)
Geo-held-out (East Sicily, lng ≥ 14.5°) | AUC 0.992 / P@R90 0.982 | AUC 0.863 / P@R90 0.663

Honest finding: universal fusion under-performs R6 on the geo-held-out split. Training only on West Sicily and testing on East Sicily, universal drops from event-stratified AUC 0.9924 to AUC 0.863, whereas R6 does not drop at all (0.988 event-stratified, 0.992 geo-held-out). The likely cause is that universal's additional features (LST, WorldCover, comparator distances) are spatially correlated with the training distribution, and the model has learned location-specific patterns it cannot transfer. Event-stratified validation therefore overstates universal's real-world generalization. Live observation through 2026-06-29 will be the deciding test; if universal's live FP-zone behaviour or novel-catch rate disappoints, R6 is the fallback. Both are evaluated at the June 29 gate.
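
The geo-held-out split itself is a one-line rule on longitude. A minimal sketch, assuming each sample carries a lng field in decimal degrees (a hypothetical name); the 14.5° cutoff is the one stated above.

```python
def geo_held_out_split(samples, lng_cutoff=14.5):
    """Train on samples west of the cutoff, validate on samples at or east of it."""
    train = [s for s in samples if s["lng"] < lng_cutoff]
    val = [s for s in samples if s["lng"] >= lng_cutoff]
    return train, val


# Toy samples on either side of the cutoff.
samples = [
    {"id": "west-1", "lng": 12.9},
    {"id": "west-2", "lng": 14.49},
    {"id": "east-1", "lng": 14.5},
    {"id": "east-2", "lng": 15.2},
]
train, val = geo_held_out_split(samples)
```

Because no training sample lies east of the cutoff, any feature that mostly encodes location (rather than fire physics) loses its value on the validation side, which is the failure mode the comparison above exposes.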

* R6's time-held-out P@R90 of 0.406 is a noisy estimate because that window contained only 74 positives: small-sample noise, not a quality signal.

Universal fusion enters its own 30-day shadow observation window. The existing 2026-06-29 promotion gate now compares the production transformer against both R6 and universal candidates. The promotion-eval trigger has not yet been updated to consider universal — that is the next change to ship.

What shipped

The shipped sub-pixel detector is the 3-model ensemble with isotonic calibration, scoring every event in the 96 h rolling window via the existing 6 h laptop shadow pipeline:


ensemble_score = (v5_derived + v4_focal + stunet_fused_honest) / 3

ensemble_calibrated_score = isotonic(ensemble_score)  // refit weekly
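
The two formulas above can be made concrete with a small pure-Python sketch. It averages the three member scores and fits a pool-adjacent-violators (PAV) isotonic map as a step function. This is an illustration only: fit_isotonic and apply_isotonic are hypothetical names, the shipped daemon may rely on a library implementation, and library versions often interpolate between knots rather than stepping.

```python
from bisect import bisect_left


def ensemble_mean(v5_derived, v4_focal, stunet_fused_honest):
    """Unweighted mean of the three members' sigmoid scores for one event."""
    return (v5_derived + v4_focal + stunet_fused_honest) / 3.0


def fit_isotonic(scores, labels):
    """PAV fit; returns (xs, ys) where ys[i] applies to scores up to xs[i]."""
    blocks = []  # each block: [sum_of_labels, count, largest_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            blocks[-1][0] += y  # tied scores must share one calibrated value
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right
    xs = [b[2] for b in blocks]
    ys = [b[0] / b[1] for b in blocks]
    return xs, ys


def apply_isotonic(model, score):
    """Look up the calibrated value; scores beyond the last knot use the last value."""
    xs, ys = model
    idx = min(bisect_left(xs, score), len(xs) - 1)
    return ys[idx]


model = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
calibrated = apply_isotonic(model, ensemble_mean(0.25, 0.2, 0.15))
```

A weekly refit would simply call fit_isotonic again on the latest labelled scores and swap the stored model.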

Three new shadow columns on event_grades:

The detector runs in shadow mode alongside the existing transformer score for an initial 30-day observation window. No live alerts, broadcasts, or tier decisions are influenced yet — the column is for backtest comparison only. Promotion gate evaluated 2026-06-29: ≥30 days of out-of-sample observations, ≤5% drift in measured FP-zone rate, ≥3 truth-confirmed events caught by ensemble but missed by current production.
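
The three promotion criteria can be expressed as a single predicate. A minimal sketch, assuming "≤5% drift" means an absolute difference of at most 0.05 between the FP-zone rate measured in validation and the one measured live; the source does not say whether drift is absolute or relative, and ShadowStats and passes_promotion_gate are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ShadowStats:
    days_observed: int             # days of out-of-sample shadow observation
    baseline_fp_zone_rate: float   # FP-zone rate measured at validation time
    live_fp_zone_rate: float       # FP-zone rate measured during shadow
    novel_confirmed_catches: int   # confirmed events caught by candidate, missed by production


def passes_promotion_gate(stats, min_days=30, max_drift=0.05, min_catches=3):
    """Return True only if all three gate criteria hold."""
    # ASSUMPTION: drift is the absolute difference between the two rates.
    drift = abs(stats.live_fp_zone_rate - stats.baseline_fp_zone_rate)
    return (
        stats.days_observed >= min_days
        and drift <= max_drift
        and stats.novel_confirmed_catches >= min_catches
    )
```

Keeping the thresholds as keyword arguments makes it straightforward to evaluate more than one candidate against the same gate.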

Honest limitations

Reproducibility

All training code, weights, and per-round metrics live in the laptop-local phoenix_bakeoff/ directory. The inference daemon lives at phoenix_shadow_laptop/ensemble_inference.py and is wired into shadow_pipeline.py at the 6 h cadence. Six rounds of training cost under $1 total on RunPod (cheapest available GPU in fallback chain, mandatory pod DELETE in every finally block).
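
The cost discipline described here (cheapest GPU first, pod always deleted) follows a simple try/finally pattern. The sketch below is generic and does not use the RunPod API: create_pod, train, and delete_pod are caller-supplied callables, and PodUnavailable is a hypothetical exception standing in for whatever the real client raises when a GPU type has no capacity.

```python
class PodUnavailable(Exception):
    """Hypothetical error raised when a GPU type has no capacity."""


def run_with_fallback(gpu_chain, create_pod, train, delete_pod):
    """Try GPU types in order; delete any pod that was created, no matter what."""
    for gpu_type in gpu_chain:
        try:
            pod_id = create_pod(gpu_type)
        except PodUnavailable:
            continue  # nothing was created, so move on to the next GPU type
        try:
            return train(pod_id)
        finally:
            # Runs on success, on training errors, and on KeyboardInterrupt.
            delete_pod(pod_id)
    raise RuntimeError("no GPU type in the fallback chain was available")
```

Putting the delete in a finally block that wraps only the post-creation work means a failed training run still releases the pod, while an unavailable GPU type never triggers a delete for a pod that does not exist.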