## The Abstract
Predicting a Formula 1 race winner is one of the most complex problems in sports analytics. Unlike high-frequency sports, F1 offers a tiny sample size (22-24 races per year) with enormous confounding variables: atmospheric conditions, tire degradation, safety car probability, and mechanical reliability.
With the massive **2026 Regulatory Reset**, historical data became partially obsolete. This system was designed to bridge that gap using a triple-model ensemble that balances historical patterns with real-time physical simulations.
---
## 🏗️ Technical Architecture: The Triple-Model Ensemble
To handle the "known unknowns" of the 2026 regulations, I built an ensemble system that combines three fundamentally different worldviews:
### 1. XGBoost Classifier (The Pattern Matcher)
Gradient boosting on 12 years of hybrid-era data (2014-2025). It identifies historical signatures of dominance.
- **Handling Imbalance**: Uses `scale_pos_weight` to penalize misclassifying a winner 20x more heavily than a loser.
- **Key Feature**: `constructor_strength` — a rolling-window proxy for car performance.
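The imbalance handling above can be sketched as follows. XGBoost's `scale_pos_weight` up-weights the loss gradient of positive (winner) rows; this stand-in emulates the same effect with scikit-learn per-sample weights on synthetic data (all column names, labels, and values are illustrative, not the real training set):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 400  # roughly 20 drivers x 20 races
grid_position = rng.integers(1, 21, n)
constructor_strength = rng.random(n)  # rolling-window car-pace proxy
X = np.column_stack([grid_position, constructor_strength])
# Toy label: front-row starts in strong cars win
y = ((grid_position <= 2) & (constructor_strength > 0.6)).astype(int)

# scale_pos_weight ~ n_negative / n_positive: each winner row counts ~20x
sample_weight = np.where(y == 1, (y == 0).sum() / max(y.sum(), 1), 1.0)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
win_probs = clf.predict_proba(X)[:, 1]
```

Without the re-weighting, a model that predicts "no win" for every row is ~95% accurate and useless; the 20x penalty forces it to actually rank winners.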
### 2. Monte Carlo Simulator (The Physics Engine)
Models the race lap-by-lap, 10,000 times. It rolls "luck" for every lap.
- **Dynamic Variable**: Tire degradation curves calibrated during Friday practice.
- **Safety Car**: 1% per-lap trigger probability (tuned to circuit-specific history).
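A minimal sketch of that loop. The pace numbers, the 0.3s per-lap noise, and the 90% gap compression under safety car are all assumed for illustration, and `n_sims` is cut from the system's 10,000 for brevity:

```python
import numpy as np

def simulate_race(base_pace, laps=58, n_sims=2_000, sc_per_lap=0.01, seed=0):
    """Lap-by-lap Monte Carlo: roll per-lap 'luck'; lowest total time wins."""
    rng = np.random.default_rng(seed)
    drivers = list(base_pace)
    pace = np.array([base_pace[d] for d in drivers])
    wins = dict.fromkeys(drivers, 0)
    for _ in range(n_sims):
        cum = np.zeros(len(drivers))
        for _lap in range(laps):
            # per-lap noise: tires, traffic, small mistakes (sigma assumed)
            cum += pace + rng.normal(0.0, 0.3, len(drivers))
            if rng.random() < sc_per_lap:  # 1% per-lap safety-car trigger
                # field compression: crude model keeps only 10% of the gaps
                cum = cum.min() + 0.1 * (cum - cum.min())
        wins[drivers[int(cum.argmin())]] += 1
    return {d: w / n_sims for d, w in wins.items()}

win_prob = simulate_race({"Russell": 80.0, "Hadjar": 80.4, "Antonelli": 80.3})
```

Because the compression happens mid-race, a late safety car leaves fewer laps for the faster car to rebuild its gap, which is exactly how upset probability enters the simulation.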
### 3. Bayesian Inference (The Belief Updater)
Starts with "Priors" based on historical driver dominance and updates with "Evidence" from the current qualifying weekend.
- **Uncertainty Quantification**: Explicitly models the variance of new-regulation year transitions.
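The production model is built in PyMC, but the core belief update can be sketched with a closed-form conjugate-normal step. Every number below (prior width, observation noise, gaps) is an assumed illustration of how a wide regulation-reset prior lets a single qualifying session move the belief a long way:

```python
def update_pace_belief(prior_mean, prior_sd, observed_gap, obs_sd):
    """Conjugate-normal update of a driver's pace-deficit belief (s/lap)."""
    prior_prec, obs_prec = prior_sd ** -2, obs_sd ** -2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * observed_gap) / post_prec
    return post_mean, post_prec ** -0.5

# Regulation-reset year: inflate prior_sd so fresh qualifying evidence dominates
post_mean, post_sd = update_pace_belief(
    prior_mean=0.10,    # historical belief: 0.1 s/lap slower than benchmark
    prior_sd=0.50,      # wide prior = transition-year uncertainty
    observed_gap=-0.40, # this weekend: 0.4 s/lap FASTER in qualifying
    obs_sd=0.15,
)
```

With a stable-regulation prior (say `prior_sd=0.1`), the same qualifying result would barely move the posterior; the inflated variance is what makes the model responsive in 2026.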
---
## ⚙️ Feature Engineering: The Tiered Architecture
I engineered **26+ features** across four distinct tiers to capture the underlying dynamics of F1 performance.

*Visualizing the relative weight of historical vs. 2026-specific features.*
---
## 🏎️ Race Case Study 1: Australian GP (The Season Opener)
The first real-world test of the 2026 power units. The ensemble faced high reliability risk and unknown tire behavior.
### The Win Probability Heatmap

| Driver | Ensemble Win % | Note |
| :--- | :---: | :--- |
| **Russell** (Mercedes) | **38.9%** | Dominant pole; structural advantage. |
| **Hadjar** (Red Bull) | **15.8%** | Beneficiary of Red Bull's historical strength. |
| **Antonelli** (Mercedes) | **15.6%** | High "constructor carryover" effect. |
### Physics vs. Pattern
While XGBoost loved Hadjar (35.2%) based on previous Red Bull dominance, the Monte Carlo physics engine gave Russell a massive **75.8%** win probability because his 0.4s qualifying gap is historically insurmountable at Albert Park without a safety car.
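Mechanically, the ensemble resolves this disagreement with a weighted average over per-model win probabilities, renormalized over the drivers considered. Only XGBoost's 35.2% for Hadjar and Monte Carlo's 75.8% for Russell come from the run above; the remaining probabilities and the blend weights are placeholders:

```python
per_model = {  # per-model win probabilities (partly hypothetical)
    "xgboost":     {"Russell": 0.300, "Hadjar": 0.352, "Antonelli": 0.120},
    "monte_carlo": {"Russell": 0.758, "Hadjar": 0.050, "Antonelli": 0.100},
    "bayesian":    {"Russell": 0.400, "Hadjar": 0.100, "Antonelli": 0.200},
}
weights = {"xgboost": 0.3, "monte_carlo": 0.4, "bayesian": 0.3}  # assumed

drivers = list(per_model["xgboost"])
blend = {d: sum(weights[m] * per_model[m][d] for m in per_model)
         for d in drivers}
total = sum(blend.values())
ensemble = {d: p / total for d, p in blend.items()}  # renormalize to sum to 1
```

The blend deliberately tempers XGBoost's historical enthusiasm with the physics engine's lap-time arithmetic, which is why the headline numbers sit between the two extremes.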
---
## 🏎️ Race Case Study 2: Japanese GP (The Masterclass at Suzuka)
Suzuka is a "driver's track" — its high-speed flow punishes inconsistent lap times. The 2026 Japanese GP prediction focused on **Verstappen's recovery** from a shock P11 start and the **Mercedes dual-threat**.
### Win Probability Distribution

### The Verstappen Recovery Analysis
Max Verstappen qualified P11 due to an ERS mapping error. Could he win from the midfield?

- **Probability**: 0.3% (Model consensus was nearly unanimous: winning from P11 at Suzuka requires multiple safety cars).
- **SC Sensitivity**: Late-race safety cars (Laps 41-53) increased his win probability 12-fold, but even then it capped at ~2%.
---
## 🧪 Simulation Insights
### Tire Degradation Curves
The 2026 narrower tires showed significant "cliff" behavior after 18 laps on the Soft compound.
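A piecewise sketch of that cliff behavior. The coefficients here are illustrative stand-ins; the real curves are fit to Friday long-run data:

```python
def soft_lap_time(tire_age, base=80.0, wear=0.05, cliff_lap=18, cliff=0.4):
    """Lap time vs. tire age: linear wear plus a steep post-cliff penalty."""
    t = base + wear * tire_age          # gentle linear degradation
    if tire_age > cliff_lap:
        t += cliff * (tire_age - cliff_lap)  # the "cliff" after lap 18
    return t

stint = [soft_lap_time(age) for age in range(1, 26)]
```

The discontinuity in slope, not the absolute lap time, is what drives strategy: once the per-lap loss jumps, the simulator's optimal stop window collapses to within a lap or two.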

### Model Agreement Hierarchy
Where do the worldviews align?

- **High Confidence**: Mercedes P1/P2 dominance was predicted by every model variation (Bayesian, Monte Carlo, XGBoost, tire-adjusted).
- **High Uncertainty**: The fight for P3 (McLaren vs. Ferrari) showed massive variance between XGBoost (historical) and Monte Carlo (physics lap-time based).
---
## 🔍 Engineering Insights & Result Analysis
### 1. Safety Car Wildcards
The 2026 season showed a **~60% SC rate**. The Monte Carlo engine revealed that an early safety car (Lap 1-15) disproportionately favors drivers starting P10-P20, as it allows for a "cheap" pit stop and field compression.
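The "cheap stop" arithmetic, with assumed numbers: a green-flag stop costs roughly 22 seconds against cars at racing speed, but behind the safety car the field circulates much slower, so the same stop surrenders proportionally less track position:

```python
GREEN_PIT_LOSS = 22.0    # seconds lost to cars at racing speed (assumed)
SC_FIELD_SLOWDOWN = 0.4  # fraction of racing pace given up behind the SC (assumed)

# Under SC the field covers ~40% less distance while you pit,
# shrinking the effective loss by the same fraction
sc_pit_loss = GREEN_PIT_LOSS * (1 - SC_FIELD_SLOWDOWN)
saving = GREEN_PIT_LOSS - sc_pit_loss
print(f"SC-window stop costs {sc_pit_loss:.1f}s, saving {saving:.1f}s")
```

Midfield starters gain twice over: the discounted stop, plus the field compression that erases the gap to the cars ahead.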
### 2. Bayesian Belief Updating
The Bayesian model was the most cautious, hedging bets on Hamilton (P7) due to his historical performance in regulation-change years (2014, 2017).

### 3. The 2026 Transition
The 50/50 ICE-to-Electric split introduced a new failure mode: **MGU-K Harvesting Fatigue**. Models tuned to "Reliability Risk" successfully predicted the high DNF rate for teams running 1st-generation 2026 batteries.
---
### Tech Stack
- **Language**: Python (NumPy, SciPy, Pandas)
- **ML Engine**: XGBoost, Scikit-learn, PyMC
- **Data Pipelines**: FastF1, OpenF1 API
- **Visualization**: Matplotlib, Seaborn
---
## 📂 Technical Resources
Download the full technical documentation for this predictor:
- [**2026 Japanese GP Prediction Report (PDF)**](/docs/2026_Japanese_GP_Prediction_Report.pdf)
- [**2026 Australian GP Prediction Report (PDF)**](/docs/F1_2026_AusGP_Prediction_Report.pdf)
*Tags: XGBoost · Monte Carlo · Bayesian · Python*

**F1 Race Winner Prediction System**
*Predicting the unpredictable: an ensemble ML approach to Formula 1 race outcomes in the 2026 ground-effect era.*