Meteosat Second Generation SEVIRI delivers a 3 km native pixel over Sicily every 15 minutes in 2 thermal channels (MIR 3.9 µm, TIR 10.8 µm). A wildfire smaller than 3 km warms only part of one pixel: the sub-pixel problem. The Dozier 1981 retrieval already solves it physically. PHOENIX layers a learned, ensembled detector on top that holds at AUC ≥ 0.988 on three independent held-out splits. This page documents how the current sub-pixel detector was built: every round, every gain, every honest negative.
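To make the sub-pixel problem concrete, here is a minimal sketch of the two-channel forward model that a Dozier-style retrieval inverts. The pixel radiance is an area-weighted mix of a fire blackbody and a background blackbody. The function names, the 800 K fire and the 0.1% fire fraction are illustrative assumptions. Atmosphere, emissivity and the real SEVIRI spectral response are ignored.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
K = 1.380649e-23     # Boltzmann constant, J/K

def planck(wavelength_m, temp_k):
    """Blackbody spectral radiance B(lambda, T), W m^-2 sr^-1 m^-1."""
    a = 2.0 * H * C**2 / wavelength_m**5
    return a / math.expm1(H * C / (wavelength_m * K * temp_k))

def brightness_temp(wavelength_m, radiance):
    """Inverse of planck(): temperature of a blackbody with this radiance."""
    a = 2.0 * H * C**2 / wavelength_m**5
    return H * C / (wavelength_m * K * math.log1p(a / radiance))

def mixed_pixel_bt(wavelength_m, fire_frac, t_fire, t_bg):
    """Brightness temperature of a pixel partly covered by fire."""
    radiance = (fire_frac * planck(wavelength_m, t_fire)
                + (1.0 - fire_frac) * planck(wavelength_m, t_bg))
    return brightness_temp(wavelength_m, radiance)

MIR, TIR = 3.9e-6, 10.8e-6  # channel centre wavelengths in metres

# Hypothetical scene: 0.1% of the pixel burning at 800 K over a 300 K background.
bt_mir = mixed_pixel_bt(MIR, 0.001, 800.0, 300.0)
bt_tir = mixed_pixel_bt(TIR, 0.001, 800.0, 300.0)
```

Under these assumptions the MIR brightness temperature rises by tens of kelvin while the TIR channel moves by about one kelvin. That asymmetry is why the MIR−TIR difference carries the fire signal even when the fire fills a tiny fraction of the pixel.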
| Baseline | AUC time-held-out | P @ recall 0.90 |
|---|---|---|
| PHOENIX per-pixel BT transformer (2026-05-27 prod) | 0.834 | 0.473 |
Anything shipped had to materially exceed both numbers and stay below 5% false-positive rate inside Sicily's known FP zones (Etna disk, Stromboli plume, Priolo / Gela / Milazzo industrial polygons).
| Source | Volume | Window | Used for |
|---|---|---|---|
| SEVIRI BT_MIR + BT_TIR (.npz) | 1,650 frames | Apr 23 – May 28 2026 | model input |
| Confirmed events (t72h_outcome LIKE 'confirmed_*') | 450 | Mar 31 – May 28 | positive labels |
| Negative / unverifiable events | 1,894 | Mar 31 – May 28 | negative labels |
| Hard negatives mined from FP zones | 7,183 | Apr 23 – May 28 | adversarial training negatives |
| External comparator hits (VIIRS, MODIS, SAR, VVF) | 9,630 | Apr 23 – May 28 | excluded from honest models (leakage) |
| Detection crops | 17,536 | Apr 23 – May 28 | verification-head experiment (not shipped) |
The 450 positive event labels are the binding constraint on every architecture. Sentinel-2 burn-scar tiles and FCI L1c raw data were considered and parked because neither is persisted on disk.
Six bake-off rounds were run on RunPod (NVIDIA L4 / A4000 / A5000, mandatory
pod DELETE in every code path's finally). Identical seed, batch size
128–256, Adam 1e-4, 25–40 epochs, BCEWithLogits unless otherwise noted.
Every model reports its best validation AUC and precision at recall = 0.90.
| Model | AUC | P@R90 |
|---|---|---|
| Per-pixel transformer (138k params, control) | 0.835 | 0.473 |
| 1-D TempCNN (Pelletier 2019, 71k) | 0.819 | 0.441 |
| Spatial-temporal U-Net 5×5×24 (128k) | 0.829 | 0.468 |
| Crop classifier on detection crops (250k) | 0.985 (random split) | 0.715 |
The crop result looked exciting but used a random 80/20 split and train loss
collapsed below 0.01 — overfit warning. The three temporal models hovered near
0.83 AUC because the time-split put 99.7% of positives in val (only 34 in
train; pos_weight=1066). Round 1's finding was structural: the
time-split was broken, not the architectures.
Switched to event-stratified split (each confirmed event's pixels go to one
side only; pos_weight=4.00, honest). Added 17 tabular features per
pixel including nearest external-comparator hit.
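The event-stratified split can be sketched as follows: shuffle event IDs, not pixels, and assign each event wholesale to one side. The `event_id` key, the 20% validation fraction and the helper names are assumptions for illustration, not the project's actual schema.

```python
import random

def event_stratified_split(samples, val_frac=0.2, seed=0):
    """Split pixel samples so that all pixels of one event land on one side.

    samples: list of dicts carrying an 'event_id' key (hypothetical schema).
    """
    event_ids = sorted({s["event_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(event_ids)
    n_val = max(1, round(len(event_ids) * val_frac))
    val_ids = set(event_ids[:n_val])
    train = [s for s in samples if s["event_id"] not in val_ids]
    val = [s for s in samples if s["event_id"] in val_ids]
    return train, val

def pos_weight(n_pos, n_neg):
    """Class weight for the positive term of BCEWithLogits: negatives per positive."""
    return n_neg / n_pos

# Toy data: 10 events with 5 pixels each.
samples = [{"event_id": e, "pixel": p} for e in range(10) for p in range(5)]
train, val = event_stratified_split(samples)
```

The same ratio explains the weights quoted on this page: 7,081 negatives against 74 positives gives a pos_weight of about 96.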
| Model | AUC | P@R90 |
|---|---|---|
| Tabular-only MLP (no temporal input, 5k) | 0.981 | 0.809 |
| TempCNN + 17-feature fusion (87k) | 0.990 | 0.913 |
| Stunet + 17-feature fusion (139k) | 0.989 | 0.927 |
These scores were inflated: the comparator features overlap with the evidence behind the confirmed_* outcomes. Label leakage.
Round 3 stripped those features.
Same event-stratified split, comparator features removed. 13 honest features only: lat, lng, FP-zone distances ×5, hour-of-day cyclic ×2, day-of-year cyclic ×2, solar zenith, distance-to-centroid.
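The "cyclic ×2" features are presumably the usual sine/cosine pair, which keeps 23:00 and 00:00 adjacent in feature space. A minimal sketch, with the function name and the 365.25-day period as assumptions:

```python
import math

def cyclic(value, period):
    """Encode a periodic quantity as a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

hour_sin, hour_cos = cyclic(14.5, 24.0)     # 14:30 UTC
doy_sin, doy_cos = cyclic(196, 365.25)      # mid-July
```

With this encoding the distance between hour 23 and hour 0 equals the distance between hour 0 and hour 1, which a raw 0–23 integer cannot express.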
| Model | AUC | P@R90 | Params |
|---|---|---|---|
| Transformer (honest, no tabular) | 0.948 | 0.686 | 138k |
| TempCNN (honest, no tabular) | 0.960 | 0.696 | 71k |
| Stunet (honest, no tabular) | 0.966 | 0.730 | 128k |
| Tabular MLP (13 honest features only) | 0.972 | 0.730 | 5k |
| Stunet + 13-honest-feature fusion | 0.982 | 0.868 | 139k |
Event-stratified split alone lifted the transformer 0.835 → 0.948. Spatial 5×5 stunet beats per-pixel by ~1.8 pp. Tabular alone at 0.972 is real structural signal (fires cluster in biomes and summer afternoons) — not comparator leakage. Fusion delivers the best honest model.
7,183 high-BT pixels inside known FP zones were mined and added to the training pool as adversarial negatives. Variants stacked on the Round 3 winner:
| Variant | AUC | P@R90 |
|---|---|---|
| Round 3 winner replay | 0.982 | 0.868 |
| + hard negatives | 0.986 | 0.845 |
| + focal loss (γ=2, α=0.8) | 0.985 | 0.868 |
| + temporal-jitter / channel-drop / spatial-flip | 0.984 | 0.843 |
| all three combined | 0.986 | 0.819 |
Diminishing returns. Focal loss recovered the P@R90 dip from hard negatives; augmentation did not help at this dataset size. Net gain of the focal-loss variant over the Round 3 winner: +0.3 pp AUC at unchanged P@R90.
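For reference, a per-sample binary focal loss with the γ and α quoted in the table can be sketched as below. It assumes the common convention in which α weights the positive class and 1 − α the negative class; the source does not state which convention was used.

```python
import math

def focal_loss(logit, target, gamma=2.0, alpha=0.8):
    """Binary focal loss for one sample: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    # Numerically stable log(sigmoid(logit)).
    if logit >= 0:
        log_p = -math.log1p(math.exp(-logit))
    else:
        log_p = logit - math.log1p(math.exp(logit))
    log_not_p = log_p - logit  # log(1 - sigmoid(logit))
    if target == 1:
        log_pt, alpha_t = log_p, alpha
    else:
        log_pt, alpha_t = log_not_p, 1.0 - alpha
    p_t = math.exp(log_pt)
    return -alpha_t * (1.0 - p_t) ** gamma * log_pt

easy = focal_loss(4.0, 1)    # confident, correct positive
hard = focal_loss(-4.0, 1)   # confident, wrong positive
```

The (1 − p_t)^γ factor shrinks the contribution of easy samples, so the gradient is dominated by the hard ones, such as hot pixels inside the FP zones.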
The 2-channel input was expanded to 8 physics-informed channels: BT_MIR, BT_TIR, MIR−TIR, MIR/TIR ratio, Δ-15 min MIR, Δ-60 min MIR, MIR z-score, TIR z-score. A 1.09 M-parameter stunet variant was trained alongside to test whether the model is capacity-bound or data-bound.
| Model | AUC | P@R90 | Params |
|---|---|---|---|
| v5_derived (8-channel) | 0.988 | 0.912 | 141k |
| v5_big (6-layer / 1.09M params) | 0.986 | 0.901 | 1.09M |
| v5_crop_ev (event-stratified crop classifier) | broken (val degenerate) | — | 250k |
v5_derived beat Round-4 by +0.3 pp AUC and +4.4 pp P@R90. The bigger model showed zero lift — confirming we're data-bound, not parameter-bound.
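The eight channels of v5_derived could be built from the two raw brightness-temperature series roughly as follows. This sketch assumes one pixel's series at the 15-minute SEVIRI cadence, z-scores taken over that pixel's own window, and zero-filled deltas where no earlier frame exists; the source specifies none of these details.

```python
import statistics

def derive_channels(mir, tir):
    """Return one 8-tuple per time step from per-pixel BT series (kelvin).

    Order: BT_MIR, BT_TIR, MIR-TIR, MIR/TIR, d15 MIR, d60 MIR, MIR z, TIR z.
    """
    def zscores(xs):
        mu = statistics.fmean(xs)
        sd = statistics.pstdev(xs)
        return [(x - mu) / sd if sd > 0 else 0.0 for x in xs]

    mir_z, tir_z = zscores(mir), zscores(tir)
    rows = []
    for t in range(len(mir)):
        d15 = mir[t] - mir[t - 1] if t >= 1 else 0.0  # one frame back
        d60 = mir[t] - mir[t - 4] if t >= 4 else 0.0  # four frames back
        rows.append((mir[t], tir[t], mir[t] - tir[t], mir[t] / tir[t],
                     d15, d60, mir_z[t], tir_z[t]))
    return rows

rows = derive_channels([300.0, 301.0, 302.0, 303.0, 310.0], [295.0] * 5)
```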
The Round 3, Round 4, and Round 5 winners were ensembled by simple mean of sigmoid scores, then isotonic-calibrated against event-stratified training scores.
| Split | val pos / neg | Ensemble AUC | P@R90 | FP-zone % top-10% |
|---|---|---|---|---|
| Event-stratified | 2,716 / 10,877 | 0.988 | 0.915 | 0.3% |
| Time-held-out | 74 / 7,081 | 0.997 | 0.406 * | 4.3% |
| Geo-held-out (East Sicily, lng ≥ 14.5°) | 7,249 / 16,243 | 0.992 | 0.982 | 0.3% |
* The time-held-out window contains only 74 positives
against 7,081 negatives (pos_weight=96). P@R90 has a wide
confidence interval at this imbalance; AUC 0.997 says the ranking is
near-perfect. Operationally we set the recall threshold to 50–80% in such
low-positive windows.
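Both metrics in the table can be computed directly from scores and labels. The sketch below uses the pairwise (Mann-Whitney) definition of AUC and a simple top-down scan for precision at a target recall. The function names are illustrative, and tied scores are not treated specially in the precision scan.

```python
import math

def auc(scores, labels):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_recall(scores, labels, recall=0.90):
    """Precision at the highest threshold that reaches the target recall."""
    needed = math.ceil(recall * sum(labels))
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        tp += y
        fp += 1 - y
        if tp >= needed:
            return tp / (tp + fp)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(scores, labels)                    # 8 of 9 pairs ordered correctly
toy_p_at_r90 = precision_at_recall(scores, labels)
```

The same arithmetic shows why a near-perfect AUC and a P@R90 of 0.406 can coexist. Recall 0.90 on 74 positives means 67 true positives. If precision there is 0.406, roughly 165 pixels are flagged, so about 98 are false positives, which is only about 1.4% of the 7,081 negatives.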
Calibration ECE (expected calibration error) before and after isotonic:
| Split | raw ECE | calibrated ECE |
|---|---|---|
| Event-stratified | 0.020 | 0.036 |
| Time-held-out | 0.058 | 0.009 |
| Geo-held-out | 0.045 | 0.014 |
Calibration helps most where it matters: under time and geographic distribution shift. On the in-distribution event-stratified split the raw ECE is already low (0.020), and isotonic calibration slightly worsens it (0.036).
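ECE is the sample-weighted gap between mean predicted probability and observed positive rate across probability bins. A minimal sketch with 10 equal-width bins follows; the bin count is an assumption, since the source does not state it.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |mean confidence - positive rate| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(confidence - positive_rate)
    return ece

# Four well-calibrated scores plus two overconfident ones.
probs = [0.25, 0.25, 0.25, 0.25, 0.95, 0.95]
labels = [1, 0, 0, 0, 0, 0]
ece = expected_calibration_error(probs, labels)
```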
The R6 ensemble described below was shipped to shadow on 2026-05-29.
A day later, the pipeline was extended to consume every data source on
the DGX as fused training input, not just SEVIRI BT_MIR/BT_TIR.
The new model — referred to as universal_fusion — is now
live in shadow mode alongside R6, writing universal_score
to event_grades on a 30-minute poll cadence.
| Model | Inputs | AUC (event-strat) | P@R90 |
|---|---|---|---|
| R6 ensemble (2026-05-29) | SEVIRI 2-channel + 13 honest tabular | 0.988 | 0.915 |
| Universal fusion (2026-05-30) | SEVIRI 2-ch + MTG LST + WorldCover landcover (5 classes) + time-gated comparator features (21 sources × 3 stats) + Hawkes ignition prior + SEVIRI RSS (11 channels, when available) + MTG FCI L1c (16 channels, when available) + CAP alerts + 14 tabular | 0.9924 | 0.928 |
RSS and FCI coverage were 2.9% and 0% respectively at training time — both
caches retain only ~24 hours and the labeled training events all predate the
cache windows. Those branches receive NaN-imputed zeros today and will start
contributing real signal as the caches fill (tracked at
/api/data_coverage). Even
without RSS and FCI, the addition of LST + WorldCover + time-gated comparator
features lifted AUC +0.4 pp and P@R90 +1.3 pp over R6.
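The zero-filling of missing branches can be sketched as below. The availability flag and the coverage helper are illustrative additions, not confirmed parts of the pipeline; they show how a figure such as the 2.9% RSS coverage could be derived from the same NaN pattern.

```python
import math

def impute_branch(values):
    """Zero-fill NaNs in one optional input branch and flag whether data existed."""
    available = any(not math.isnan(v) for v in values)
    filled = [0.0 if math.isnan(v) else v for v in values]
    return filled, available

def branch_coverage(samples):
    """Fraction of samples whose branch carried at least one real value."""
    flags = [impute_branch(s)[1] for s in samples]
    return sum(flags) / len(flags)

nan = float("nan")
filled, available = impute_branch([nan, 301.4, nan])
coverage = branch_coverage([[nan], [1.0], [nan], [nan]])
```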
Stacking (R6 + universal as base learners + a learned meta-classifier) was evaluated. The meta-classifier assigned a near-zero weight to R6 — the universal model's signal subsumes R6's. Stacking added no gain.
| Split | R6 ensemble | Universal fusion |
|---|---|---|
| Event-stratified (val) | AUC 0.988 / P@R90 0.915 | AUC 0.9924 / P@R90 0.928 |
| Time-held-out (cutoff 2026-05-22) | AUC 0.997 / P@R90 0.406 * | AUC 0.49 — split degenerate, 34 train positives |
| Geo-held-out (East Sicily, lng ≥ 14.5°) | AUC 0.992 / P@R90 0.982 | AUC 0.863 / P@R90 0.663 |
Honest finding: universal fusion under-performs R6 on the geo-held-out split. Trained only on West Sicily and tested on East Sicily, universal drops from its event-stratified AUC of 0.9924 to 0.863 (P@R90 0.928 to 0.663), while R6 does not degrade (AUC 0.988 event-stratified, 0.992 geo-held-out). The likely cause is that universal's additional features (LST, WorldCover, comparator distances) are spatially correlated with the training distribution, and the model has learned location-specific patterns it cannot transfer. Event-stratified validation therefore overstates universal's real-world generalization. Live observation through 2026-06-29 will be the deciding test; if universal's live FP-zone behaviour or novel-catch rate disappoints, R6 is the fallback. Both are evaluated at the June 29 gate.
* R6's time-held-out P@R90 of 0.406 carries a wide confidence interval because that window contained only 74 positives. It reflects small-sample noise, not model quality.
Universal fusion enters its own 30-day shadow observation window. The existing 2026-06-29 promotion gate is meant to compare the production transformer against both the R6 and universal candidates. The promotion-eval trigger has not yet been updated to consider universal; that is the next change to ship.
The shipped sub-pixel detector is the 3-model ensemble with isotonic calibration, scoring every event in the 96 h rolling window via the existing 6 h laptop shadow pipeline:
ensemble_score = (v5_derived + v4_focal + stunet_fused_honest) / 3
ensemble_calibrated_score = isotonic(ensemble_score) // refit weekly
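The two formulas above can be sketched in plain Python. The isotonic fit below is a basic pool-adjacent-violators implementation that returns a step function. It is an illustration only: the production code may well use a library calibrator with interpolation, and tied scores are not pooled specially here.

```python
import bisect

def ensemble_score(v5_derived, v4_focal, stunet_fused_honest):
    """Simple mean of the three models' sigmoid scores."""
    return (v5_derived + v4_focal + stunet_fused_honest) / 3.0

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: non-decreasing step fit of labels on scores."""
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count, top = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = top
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values

def apply_isotonic(model, score):
    """Look up the calibrated probability for a raw ensemble score."""
    uppers, values = model
    idx = min(bisect.bisect_left(uppers, score), len(values) - 1)
    return values[idx]

# Toy calibration set: raw scores 0.2 and 0.3 are out of order and get pooled.
model = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
calibrated = apply_isotonic(model, ensemble_score(0.3, 0.2, 0.1))
```

Refitting weekly would amount to calling `fit_isotonic` again on the latest labelled scores and swapping in the new `model`.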
Three new shadow columns on event_grades:
ensemble_score: raw mean of sigmoids
ensemble_calibrated_score: isotonic-corrected probability
ensemble_score_computed_at: UTC ISO timestamp
The detector runs in shadow mode alongside the existing transformer score for an initial 30-day observation window. No live alerts, broadcasts, or tier decisions are influenced yet; the columns are for backtest comparison only. Promotion gate evaluated 2026-06-29: ≥30 days of out-of-sample observations, ≤5% drift in measured FP-zone rate, ≥3 truth-confirmed events caught by ensemble but missed by current production.
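The three promotion criteria can be expressed as a single check. In this sketch the function and argument names are hypothetical, and "≤5% drift" is interpreted as an absolute difference of at most 0.05 between a reference FP-zone rate and the live one; the source does not say whether the drift is absolute or relative.

```python
def promotion_gate(days_observed, fp_rate_reference, fp_rate_live,
                   novel_confirmed_catches):
    """Evaluate the three promotion criteria; returns (passed, per-check detail)."""
    checks = {
        "enough_days": days_observed >= 30,
        # Assumption: drift measured as an absolute difference of rates.
        "fp_drift_ok": abs(fp_rate_live - fp_rate_reference) <= 0.05,
        "novel_catches": novel_confirmed_catches >= 3,
    }
    return all(checks.values()), checks

passed, detail = promotion_gate(
    days_observed=31, fp_rate_reference=0.003, fp_rate_live=0.02,
    novel_confirmed_catches=4,
)
```

Returning the per-check detail alongside the verdict makes it clear which criterion blocked a promotion.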
subpixel_v1: bimodal residual, ~7 km uncertainty
zone (2026-06-10, supersedes the systematic-bias claim of
2026-06-09).
Updated 2026-06-10 after the bimodal G8 result.
An internal audit of 9 matched PHOENIX/FIRMS pairs in the 14-day Sicily
window that closed on 2026-06-10 initially suggested a single systematic
georectification offset of ~7 km to the NNE. The G8 audit (2026-06-09) showed
that those 9 samples reduce to 2 independent fires: 5 PHOENIX pixels
clustered around a FIRMS hit at (37.09, 14.37) with NE/ENE bearings of
24–68°; 3 PHOENIX pixels clustered around another at (37.66, 12.74)
with NNW bearings of 313–344°. The original "16.7° NNE" mean
is the circular mean of two opposing clusters, and the Rayleigh
p = 0.002 is inflated by pseudoreplication across pixel clusters.
A static Δlat / Δlon calibration can therefore only help one
cluster while making the other worse; the previously proposed drop-in behind the env flag
PHOENIX_FCI_SUBPIXEL_GEO_OFFSET=1
(Δlat = +0.06° / Δlon = +0.02°) has been rolled back
(file renamed .disabled-pre-g8-bimodal-finding-20260610). Five
consecutive root-cause hypotheses (G5 accumulator projection port, G6 chunk-widen,
G6 parallax, G7 wind drift, G8 upstream satpy + FIRMS-side geolocation) are now
falsified. We are moving from cause-hunting to mitigation: publish
a circular uncertainty radius of ~7 km per detection on
/event/, /accuracy/ and the public map, and no longer
claim pixel-level localization for subpixel_v1.
G8's hypothesis H2 (L2 FCI-AF and the L1c-derived PHOENIX detect physically
different fires in this window) is unresolved and should be retested with a
30-day backfill. The lead-time signal that PHOENIX carries over FIRMS (+1,800 s median)
remains real; what changes is how we communicate spatial certainty. The additive endpoint
/api/detections_phoenix_anchored and the credit-extras scorer (commit
9e70a24) are not affected by the rollback. PHOENIX does not yet
replace FIRMS, VIIRS, MODIS, SLSTR, OroraTech or the Vigili del Fuoco for
operational dispatch; these remain authoritative. Audit reports:
phoenix_mir_wooster_audit/G5_ACCUMULATOR_PROJECTION_PORT_2026_06_09.md,
G6_CHUNK_WIDEN_PLUS_PARALLAX_2026_06_09.md,
G7_WIND_DRIFT_HYPOTHESIS_2026_06_09.md,
G8_SATPY_NAV_AND_FIRMS_ERROR_2026_06_09.md (internal). See also the
change-log entry.
All training code, weights, and per-round metrics live in the laptop-local
phoenix_bakeoff/ directory. Inference daemon at
phoenix_shadow_laptop/ensemble_inference.py, wired into
shadow_pipeline.py at the 6 h cadence. Six rounds of training cost
under $1 total on RunPod (cheapest available GPU in fallback chain,
mandatory pod DELETE in every finally block).