Meteosat Second Generation SEVIRI delivers a 3 km native pixel over Sicily every 15 minutes in 2 thermal channels (MIR 3.9 µm, TIR 10.8 µm). A wildfire smaller than 3 km warms only part of one pixel — the sub-pixel problem. The Dozier 1981 retrieval already solves it physically. PHOENIX layers a learned, ensembled detector on top that holds at AUC ≥ 0.99 on three independent held-out splits. This page documents how the current sub-pixel detector was built — every round, every gain, every honest negative.
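To make the sub-pixel problem concrete, the sketch below mixes a small hot fire into a cool background pixel with the Planck function and converts the result back to brightness temperature in both channels. It is a monochromatic forward model only, not the Dozier retrieval itself and not PHOENIX code; the fire fraction, fire temperature, and background temperature are hypothetical values, and emissivity, atmosphere, and the sensor's spectral response are ignored.

```python
import math

C1 = 1.191042e8   # 2*h*c^2, in W um^4 m^-2 sr^-1
C2 = 1.4387752e4  # h*c/k, in um K


def planck_radiance(wavelength_um, temp_k):
    """Blackbody spectral radiance in W m^-2 sr^-1 um^-1."""
    return C1 / (wavelength_um ** 5 * (math.exp(C2 / (wavelength_um * temp_k)) - 1.0))


def brightness_temp(wavelength_um, radiance):
    """Invert the Planck function: radiance -> brightness temperature (K)."""
    return C2 / (wavelength_um * math.log(C1 / (wavelength_um ** 5 * radiance) + 1.0))


def mixed_pixel_bt(wavelength_um, fire_fraction, fire_temp_k, background_temp_k):
    """Brightness temperature of a pixel that is part fire, part background."""
    radiance = (
        fire_fraction * planck_radiance(wavelength_um, fire_temp_k)
        + (1.0 - fire_fraction) * planck_radiance(wavelength_um, background_temp_k)
    )
    return brightness_temp(wavelength_um, radiance)


# Hypothetical scene: 0.1% of the pixel burning at 800 K over a 300 K background.
bt_mir = mixed_pixel_bt(3.9, 0.001, 800.0, 300.0)
bt_tir = mixed_pixel_bt(10.8, 0.001, 800.0, 300.0)
print(f"MIR BT = {bt_mir:.1f} K, TIR BT = {bt_tir:.1f} K")
```

Under these assumptions the MIR channel warms by roughly 30 K while the TIR channel warms by about 1 K, which is why the MIR−TIR contrast carries most of the sub-pixel signal.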
| Baseline | AUC time-held-out | P @ recall 0.90 |
|---|---|---|
| PHOENIX per-pixel BT transformer (2026-05-27 prod) | 0.834 | 0.473 |
Anything shipped had to materially exceed both numbers and stay below 5% false-positive rate inside Sicily's known FP zones (Etna disk, Stromboli plume, Priolo / Gela / Milazzo industrial polygons).
| Source | Volume | Window | Used for |
|---|---|---|---|
| SEVIRI BT_MIR + BT_TIR (.npz) | 1,650 frames | Apr 23 – May 28 2026 | model input |
| Confirmed events (t72h_outcome LIKE 'confirmed_*') | 450 | Mar 31 – May 28 | positive labels |
| Negative / unverifiable events | 1,894 | Mar 31 – May 28 | negative labels |
| Hard negatives mined from FP zones | 7,183 | Apr 23 – May 28 | adversarial training negatives |
| External comparator hits (VIIRS, MODIS, SAR, VVF) | 9,630 | Apr 23 – May 28 | excluded from honest models (leakage) |
| Detection crops | 17,536 | Apr 23 – May 28 | verification-head experiment (not shipped) |
The binding constraint on every architecture is the 450 positive event labels. Sentinel-2 burn-scar tiles and FCI L1c raw data were considered but parked, because neither is persisted on disk.
Six bake-off rounds were run on RunPod (NVIDIA L4 / A4000 / A5000, mandatory
pod DELETE in every code path's finally). Identical seed, batch size
128–256, Adam 1e-4, 25–40 epochs, BCEWithLogits unless otherwise noted.
Every model reports its best validation AUC and precision at recall = 0.90.
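For readers who want to reproduce the two headline metrics, here is a minimal sketch in plain Python. The exact evaluation code is not shown on this page, so the tie handling and the choice of "highest threshold whose recall reaches 0.90" are assumptions; the function names are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def precision_at_recall(labels, scores, target_recall=0.90):
    """Precision at the highest score threshold whose recall reaches the target.

    Labels are 0/1 integers. Tied scores are ranked arbitrarily (assumption).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no positives")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        if true_pos / n_pos >= target_recall:
            return true_pos / k
    return 0.0  # not reached: recall is 1.0 once every sample is included


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
print(roc_auc(labels, scores), precision_at_recall(labels, scores))
```

The pairwise AUC loop is quadratic and only meant for small checks; a rank-based implementation is preferable at the validation-set sizes reported here.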
| Model | AUC | P@R90 |
|---|---|---|
| Per-pixel transformer (138k params, control) | 0.835 | 0.473 |
| 1-D TempCNN (Pelletier 2019, 71k) | 0.819 | 0.441 |
| Spatial-temporal U-Net 5×5×24 (128k) | 0.829 | 0.468 |
| Crop classifier on detection crops (250k) | 0.985 (random split) | 0.715 |
The crop result looked exciting but used a random 80/20 split and train loss
collapsed below 0.01 — overfit warning. The three temporal models hovered near
0.83 AUC because the time-split put 99.7% of positives in val (only 34 in
train; pos_weight=1066). Round 1's finding was structural: the
time-split was broken, not the architectures.
Switched to event-stratified split (each confirmed event's pixels go to one
side only; pos_weight=4.00, honest). Added 17 tabular features per
pixel including nearest external-comparator hit.
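The event-stratified split can be sketched as a grouped split keyed on the event identifier. This is an illustration under assumptions, not the bake-off code: the `event_id` and `label` field names, the 20% validation fraction, and the `pos_weight = negatives / positives` convention are all assumed (the last one is at least consistent with the 74 / 7,081 split reported later with pos_weight=96).

```python
import random


def event_stratified_split(samples, val_fraction=0.2, seed=0):
    """Split samples so that every event's pixels land on exactly one side.

    samples: list of dicts with hypothetical keys 'event_id' and 'label'.
    """
    event_ids = sorted({s["event_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(event_ids)
    n_val = max(1, round(len(event_ids) * val_fraction))
    val_ids = set(event_ids[:n_val])
    train = [s for s in samples if s["event_id"] not in val_ids]
    val = [s for s in samples if s["event_id"] in val_ids]
    return train, val


def pos_weight(samples):
    """Negative-to-positive ratio, as commonly passed to BCEWithLogitsLoss."""
    n_pos = sum(1 for s in samples if s["label"] == 1)
    if n_pos == 0:
        raise ValueError("no positives in this split")
    return (len(samples) - n_pos) / n_pos


# Toy data: 10 events with 5 pixels each; events 0 and 1 are positive.
samples = [{"event_id": e, "label": int(e < 2)} for e in range(10) for _ in range(5)]
train, val = event_stratified_split(samples, seed=1)
overlap = {s["event_id"] for s in train} & {s["event_id"] for s in val}
print(len(train), len(val), overlap)
```

Checking `pos_weight` on both sides after splitting is a cheap guard against the degenerate case seen in Round 1, where nearly all positives ended up in validation.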
| Model | AUC | P@R90 |
|---|---|---|
| Tabular-only MLP (no temporal input, 5k) | 0.981 | 0.809 |
| TempCNN + 17-feature fusion (87k) | 0.990 | 0.913 |
| Stunet + 17-feature fusion (139k) | 0.989 | 0.927 |
The nearest-comparator feature appears to encode the same external hits that sit behind the confirmed_* outcomes. Label leakage.
Round 3 stripped those features.
Same event-stratified split, comparator features removed. 13 honest features only: lat, lng, FP-zone distances ×5, hour-of-day cyclic ×2, day-of-year cyclic ×2, solar zenith, distance-to-centroid.
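The four cyclic time features in that list are typically sine/cosine pairs, so that 23:45 and 00:00 sit next to each other in feature space. A minimal sketch follows; the function name, the use of UTC rather than local solar time, and the 365.25-day period are assumptions, since the page does not specify them.

```python
import math
from datetime import datetime, timezone


def cyclic_time_features(ts):
    """Return (hour_sin, hour_cos, doy_sin, doy_cos) for a datetime."""
    hour = ts.hour + ts.minute / 60.0
    day_of_year = ts.timetuple().tm_yday
    hour_angle = 2.0 * math.pi * hour / 24.0
    doy_angle = 2.0 * math.pi * (day_of_year - 1) / 365.25
    return (
        math.sin(hour_angle),
        math.cos(hour_angle),
        math.sin(doy_angle),
        math.cos(doy_angle),
    )


late = cyclic_time_features(datetime(2026, 5, 1, 23, 45, tzinfo=timezone.utc))
midnight = cyclic_time_features(datetime(2026, 5, 2, 0, 0, tzinfo=timezone.utc))
noon = cyclic_time_features(datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc))

# Distance in the hour-of-day plane: small across midnight, maximal at noon.
print(math.dist(late[:2], midnight[:2]), math.dist(noon[:2], midnight[:2]))
```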
| Model | AUC | P@R90 | Params |
|---|---|---|---|
| Transformer (honest, no tabular) | 0.948 | 0.686 | 138k |
| TempCNN (honest, no tabular) | 0.960 | 0.696 | 71k |
| Stunet (honest, no tabular) | 0.966 | 0.730 | 128k |
| Tabular MLP (13 honest features only) | 0.972 | 0.730 | 5k |
| Stunet + 13-honest-feature fusion | 0.982 | 0.868 | 139k |
Event-stratified split alone lifted the transformer 0.835 → 0.948. Spatial 5×5 stunet beats per-pixel by ~1.8 pp. Tabular alone at 0.972 is real structural signal (fires cluster in biomes and summer afternoons) — not comparator leakage. Fusion delivers the best honest model.
7,183 high-BT pixels inside known FP zones were mined and added to the training pool as adversarial negatives. Variants stacked on the Round 3 winner:
| Variant | AUC | P@R90 |
|---|---|---|
| Round 3 winner replay | 0.982 | 0.868 |
| + hard negatives | 0.986 | 0.845 |
| + focal loss (γ=2, α=0.8) | 0.985 | 0.868 |
| + temporal-jitter / channel-drop / spatial-flip | 0.984 | 0.843 |
| all three combined | 0.986 | 0.819 |
Diminishing returns. Focal loss recovered the P@R90 dip from hard negatives; augmentation did not help at this dataset size. Net gain: +0.3 pp AUC.
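For reference, a per-sample binary focal loss with the γ and α quoted in the table can be written as below. This follows the common convention in which α weights the positive class and 1 − α the negative class; whether the bake-off used exactly this form is an assumption.

```python
import math


def binary_focal_loss(logit, label, gamma=2.0, alpha=0.8):
    """Focal loss for one sample: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    # Numerically stable sigmoid.
    if logit >= 0:
        p = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p = e / (1.0 + e)
    p_t = p if label == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)                        # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(4.0, 1)    # confident and correct: heavily down-weighted
hard = binary_focal_loss(-4.0, 1)   # confident and wrong: close to weighted BCE
print(easy, hard)
```

With γ = 0 and α = 0.5 the expression reduces to half the usual binary cross-entropy, which is a convenient sanity check.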
The 2-channel input was expanded to 8 physics-informed channels: BT_MIR, BT_TIR, MIR−TIR, MIR/TIR ratio, Δ-15 min MIR, Δ-60 min MIR, MIR z-score, TIR z-score. A 1.09 M-parameter stunet variant tested capacity vs data bound.
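The eight channels can be derived from a single pixel's MIR and TIR time series roughly as follows. This is a sketch under assumptions: the z-scores are taken over the pixel's own temporal window, lags that reach before the start of the window are filled with 0.0, and the function name is illustrative. None of these details are stated on this page.

```python
import math


def derived_channels(mir, tir, frame_minutes=15):
    """Build 8 channels per time step from per-pixel BT series (kelvin, oldest first).

    Channel order: BT_MIR, BT_TIR, MIR-TIR, MIR/TIR, d15 MIR, d60 MIR, MIR z, TIR z.
    """
    if len(mir) != len(tir) or not mir:
        raise ValueError("mir and tir must be non-empty and the same length")

    def zscores(series):
        mean = sum(series) / len(series)
        std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
        return [(v - mean) / std if std > 0 else 0.0 for v in series]

    def delta(series, lag):
        # Difference against the frame `lag` steps earlier; 0.0 where unavailable.
        return [series[i] - series[i - lag] if i >= lag else 0.0
                for i in range(len(series))]

    d15 = delta(mir, 15 // frame_minutes)
    d60 = delta(mir, 60 // frame_minutes)
    mir_z, tir_z = zscores(mir), zscores(tir)
    return [
        [mir[i], tir[i], mir[i] - tir[i], mir[i] / tir[i],
         d15[i], d60[i], mir_z[i], tir_z[i]]
        for i in range(len(mir))
    ]


# Toy series: a 20 K MIR jump in the last frame over a flat TIR background.
channels = derived_channels([300.0, 300.0, 300.0, 300.0, 320.0], [295.0] * 5)
print(channels[-1])
```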
| Model | AUC | P@R90 | Params |
|---|---|---|---|
| v5_derived (8-channel) | 0.988 | 0.912 | 141k |
| v5_big (6-layer / 1.09M params) | 0.986 | 0.901 | 1.09M |
| v5_crop_ev (event-stratified crop classifier) | broken (val degenerate) | — | 250k |
v5_derived beat Round-4 by +0.3 pp AUC and +4.4 pp P@R90. The bigger model showed zero lift — confirming we're data-bound, not parameter-bound.
The Round 3, Round 4, and Round 5 winners were ensembled by simple mean of sigmoid scores, then isotonic-calibrated against event-stratified training scores.
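The ensembling and calibration step can be illustrated with a small pool-adjacent-violators (PAV) fit in plain Python. This is a sketch, not the shipped inference code: it returns a step function without interpolation between blocks, and the helper names are hypothetical. A library implementation such as scikit-learn's `IsotonicRegression` would be the more usual choice in practice.

```python
from bisect import bisect_left


def ensemble_score(member_scores):
    """Simple mean of the members' sigmoid outputs."""
    return sum(member_scores) / len(member_scores)


def fit_isotonic(scores, labels):
    """PAV fit of a non-decreasing step function mapping score -> probability."""
    blocks = []  # each block: [label_sum, count, max_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            blocks[-1][0] += y          # tied scores must share one block
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while the block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count, right_edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right_edge
    edges = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return edges, values


def apply_isotonic(model, score):
    """Look up the calibrated probability for a raw ensemble score."""
    edges, values = model
    return values[min(bisect_left(edges, score), len(values) - 1)]


model = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
raw = ensemble_score([0.35, 0.20, 0.20])   # three hypothetical member outputs
print(raw, apply_isotonic(model, raw))
```

Because the fit is monotone, calibration changes the probability scale but not the ranking, so AUC is unaffected apart from newly created ties.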
| Split | val pos / neg | Ensemble AUC | P@R90 | FP-zone % top-10% |
|---|---|---|---|---|
| Event-stratified | 2,716 / 10,877 | 0.988 | 0.915 | 0.3% |
| Time-held-out | 74 / 7,081 | 0.997 | 0.406 * | 4.3% |
| Geo-held-out (East Sicily, lng ≥ 14.5°) | 7,249 / 16,243 | 0.992 | 0.982 | 0.3% |
* The time-held-out window contains only 74 positives
against 7,081 negatives (pos_weight=96). P@R90 has a wide
confidence interval at this imbalance; AUC 0.997 says the ranking is
near-perfect. Operationally we set the recall threshold to 50–80% in such
low-positive windows.
Calibration ECE (expected calibration error) before and after isotonic:
| Split | raw ECE | calibrated ECE |
|---|---|---|
| Event-stratified | 0.020 | 0.036 |
| Time-held-out | 0.058 | 0.009 |
| Geo-held-out | 0.045 | 0.014 |
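Calibration helps most where it matters, under time and geographic distribution shift. On the in-distribution event-stratified split the raw scores are already well calibrated, and the isotonic fit slightly worsens ECE there (0.020 → 0.036).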
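ECE figures like those in the table can be computed with a short binning routine. The sketch below uses 10 equal-width bins, which is a common default but an assumption here, since the binning scheme behind the reported numbers is not stated.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed frequency."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]  # prob_sum, label_sum, count
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)     # p == 1.0 goes in the last bin
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1
    n = len(probs)
    return sum(abs(ps - ys) / n for ps, ys, count in bins if count > 0)


# Four predictions of 0.9, of which only half are positive: a gap of 0.4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0]))
```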
The R6 ensemble described below was shipped to shadow on 2026-05-29.
A day later, the pipeline was extended to consume every data source on
the DGX as fused training input, not just SEVIRI BT_MIR/BT_TIR.
The new model — referred to as universal_fusion — is now
live in shadow mode alongside R6, writing universal_score
to event_grades on a 30-minute poll cadence.
| Model | Inputs | AUC (event-strat) | P@R90 |
|---|---|---|---|
| R6 ensemble (2026-05-29) | SEVIRI 2-channel + 13 honest tabular | 0.988 | 0.915 |
| Universal fusion (2026-05-30) | SEVIRI 2-ch + MTG LST + WorldCover landcover (5 classes) + time-gated comparator features (21 sources × 3 stats) + Hawkes ignition prior + SEVIRI RSS (11 channels, when available) + MTG FCI L1c (16 channels, when available) + CAP alerts + 14 tabular | 0.9924 | 0.928 |
RSS and FCI coverage were 2.9% and 0% respectively at training time — both
caches retain only ~24 hours and the labeled training events all predate the
cache windows. Those branches receive NaN-imputed zeros today and will start
contributing real signal as the caches fill (tracked at
/api/data_coverage). Even
without RSS and FCI, the addition of LST + WorldCover + time-gated comparator
features lifted AUC +0.4 pp and P@R90 +1.3 pp over R6.
Stacking (R6 + universal as base learners + a learned meta-classifier) was evaluated. The meta-classifier assigned a near-zero weight to R6 — the universal model's signal subsumes R6's. Stacking added no gain.
| Split | R6 ensemble | Universal fusion |
|---|---|---|
| Event-stratified (val) | AUC 0.988 / P@R90 0.915 | AUC 0.9924 / P@R90 0.928 |
| Time-held-out (cutoff 2026-05-22) | AUC 0.997 / P@R90 0.406 * | AUC 0.49 — split degenerate, 34 train positives |
| Geo-held-out (East Sicily, lng ≥ 14.5°) | AUC 0.992 / P@R90 0.982 | AUC 0.863 / P@R90 0.663 |
Honest finding: universal fusion under-performs R6 on the geo-held-out split. Trained only on West Sicily and tested on East Sicily, universal drops from an event-stratified AUC of 0.9924 to 0.863, while R6 holds at 0.992 (its event-stratified AUC is 0.988). The likely cause is that universal's additional features (LST, WorldCover, comparator distances) are spatially correlated with the training distribution, so the model has learned location-specific patterns it cannot transfer. Event-stratified validation therefore overstates universal's real-world generalization. Live observation through 2026-06-29 will be the deciding test; if universal's live FP-zone behaviour or novel-catch rate disappoints, R6 is the fallback. Both are evaluated at the June 29 gate.
* R6's time-held-out P@R90 of 0.406 is unstable because that window contained only 74 positives; treat it as small-sample noise, not a quality signal.
Universal fusion enters its own 30-day shadow observation window. The existing 2026-06-29 promotion gate now compares the production transformer against both R6 and universal candidates. The promotion-eval trigger has not yet been updated to consider universal — that is the next change to ship.
The shipped sub-pixel detector is the 3-model ensemble with isotonic calibration, scoring every event in the 96 h rolling window via the existing 6 h laptop shadow pipeline:
    ensemble_score = (v5_derived + v4_focal + stunet_fused_honest) / 3
    ensemble_calibrated_score = isotonic(ensemble_score)  // refit weekly
Three new shadow columns on event_grades:
- ensemble_score: raw mean of sigmoids
- ensemble_calibrated_score: isotonic-corrected probability
- ensemble_score_computed_at: UTC ISO timestamp

The detector runs in shadow mode alongside the existing transformer score for an initial 30-day observation window. No live alerts, broadcasts, or tier decisions are influenced yet; the column is for backtest comparison only. Promotion gate evaluated 2026-06-29: ≥30 days of out-of-sample observations, ≤5% drift in measured FP-zone rate, ≥3 truth-confirmed events caught by ensemble but missed by current production.
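The three promotion criteria can be expressed as a single check. The sketch below is illustrative only: the `GateInputs` fields and the function name are hypothetical, and "≤5% drift" is interpreted as an absolute difference of at most 0.05 between the measured FP-zone rate and its baseline, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    observation_days: int            # days of out-of-sample shadow observations
    baseline_fp_zone_rate: float     # FP-zone rate measured at validation time
    observed_fp_zone_rate: float     # FP-zone rate measured in shadow
    novel_confirmed_catches: int     # confirmed events the ensemble caught, prod missed


def promotion_gate(inputs, min_days=30, max_drift=0.05, min_catches=3):
    """Return (passed, per-criterion results) for the shadow promotion gate."""
    drift = abs(inputs.observed_fp_zone_rate - inputs.baseline_fp_zone_rate)
    checks = {
        "enough_observation": inputs.observation_days >= min_days,
        "fp_zone_drift_ok": drift <= max_drift,
        "novel_catches_ok": inputs.novel_confirmed_catches >= min_catches,
    }
    return all(checks.values()), checks


passed, detail = promotion_gate(GateInputs(31, 0.003, 0.020, 4))
print(passed, detail)
```

Returning the per-criterion dictionary alongside the boolean makes it easy to log which condition blocked a promotion.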
subpixel_v1 — bimodal residual, ~7 km uncertainty zone
(2026-06-10, supersedes 2026-06-09 systematic-bias claim).
Updated 2026-06-10 post-G8 bimodal finding.
An internal audit of 9 PHOENIX/FIRMS matched pairs in the Sicily 14-day window
ending 2026-06-10 initially suggested a single systematic ~7 km NNE georectification
offset. The G8 audit (2026-06-09) showed those 9 samples reduce to 2 independent
fires: 5 PHOENIX pixels cluster around one FIRMS hit at (37.09, 14.37) with
NE/ENE bearings 24–68°, 3 PHOENIX pixels cluster around another at
(37.66, 12.74) with NNW bearings 313–344°. The original
"16.7° NNE" mean is the circular mean of two opposing clusters, and
the Rayleigh p = 0.002 is inflated by pixel-cluster pseudoreplication.
A static Δlat / Δlon calibration can therefore only help one cluster
while hurting the other; the previously-proposed env-flag drop-in
PHOENIX_FCI_SUBPIXEL_GEO_OFFSET=1 (Δlat = +0.06° /
Δlon = +0.02°) has been rolled back (file renamed to
.disabled-pre-g8-bimodal-finding-20260610). Five consecutive root-cause
hypotheses (G5 accumulator-projection-port, G6 chunk-widening, G6 parallax, G7 wind-drift,
G8 satpy upstream + FIRMS-side geoloc) have now been falsified. We are pivoting
from root-cause hunt to mitigation: publishing a per-detection
~7 km circular uncertainty radius on /event/,
/accuracy/, and the public map, and no longer claiming pixel-level
localization for subpixel_v1. The G8 H2 hypothesis (L2 FCI-AF and
L1c-derived PHOENIX detect different physical fires in this window) is unresolved
and needs a 30-day backfill to retest. The lead-time signal PHOENIX shows over
FIRMS (+1,800 s median) remains real; what changes is how we communicate the
spatial certainty. The additive map endpoint
/api/detections_phoenix_anchored and credit-extras scorer (commit
9e70a24) are unaffected by the rollback. PHOENIX does not yet
replace FIRMS, VIIRS, MODIS, SLSTR, OroraTech, or Vigili del Fuoco for operational
dispatch — these continue to be authoritative. Audit reports:
phoenix_mir_wooster_audit/G5_ACCUMULATOR_PROJECTION_PORT_2026_06_09.md,
G6_CHUNK_WIDEN_PLUS_PARALLAX_2026_06_09.md,
G7_WIND_DRIFT_HYPOTHESIS_2026_06_09.md,
G8_SATPY_NAV_AND_FIRMS_ERROR_2026_06_09.md (internal). See also the
change-log entry.

All training code, weights, and per-round metrics live in the laptop-local
phoenix_bakeoff/ directory. Inference daemon at
phoenix_shadow_laptop/ensemble_inference.py, wired into
shadow_pipeline.py at the 6 h cadence. Six rounds of training cost
under $1 total on RunPod (cheapest available GPU in fallback chain,
mandatory pod DELETE in every finally block).
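As a footnote to the subpixel_v1 bearing audit above, the following sketch shows how pooling two directional clusters can produce a circular mean that belongs to neither. The individual bearings are hypothetical values chosen inside the two ranges quoted in the audit (24–68° and 313–344°); they are not the audited measurements.

```python
import math


def circular_mean_deg(bearings):
    """Circular mean of compass bearings, in degrees within [0, 360)."""
    s = sum(math.sin(math.radians(b)) for b in bearings)
    c = sum(math.cos(math.radians(b)) for b in bearings)
    return math.degrees(math.atan2(s, c)) % 360.0


def resultant_length(bearings):
    """Mean resultant length: 1.0 for identical bearings, near 0 for scattered ones."""
    s = sum(math.sin(math.radians(b)) for b in bearings)
    c = sum(math.cos(math.radians(b)) for b in bearings)
    return math.hypot(s, c) / len(bearings)


cluster_ne = [24.0, 35.0, 46.0, 57.0, 68.0]   # hypothetical, within 24-68 deg
cluster_nnw = [313.0, 328.0, 344.0]           # hypothetical, within 313-344 deg
pooled = cluster_ne + cluster_nnw

print(circular_mean_deg(cluster_ne), circular_mean_deg(cluster_nnw))
print(circular_mean_deg(pooled), resultant_length(pooled))
```

With these illustrative values the pooled mean lands between north and 24°, outside both clusters, which is the same pattern the audit describes for the original NNE estimate.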