# F1 Race Winner Prediction System

Predicting the unpredictable: An ensemble ML approach to Formula 1 race outcomes in the 2026 ground-effect era.

## The Abstract

Predicting a Formula 1 race winner is one of the most complex problems in sports analytics. Unlike high-frequency sports, F1 offers a tiny sample size (22-24 races per year) with enormous confounding variables: atmospheric conditions, tire degradation, safety car probability, and mechanical reliability. With the massive **2026 Regulatory Reset**, historical data became partially obsolete. This system was designed to bridge that gap using a triple-model ensemble that balances historical patterns with real-time physical simulations.

---

## 🏗️ Technical Architecture: The Triple-Model Ensemble

To handle the "known unknowns" of the 2026 regulations, I built an ensemble system that combines three fundamentally different worldviews:

### 1. XGBoost Classifier (The Pattern Matcher)

Gradient boosting on 12 years of hybrid-era data (2014-2025). It identifies historical signatures of dominance.

- **Handling Imbalance**: Uses `scale_pos_weight` to penalize misclassifying a winner 20x more heavily than a loser.
- **Key Feature**: `constructor_strength` — a rolling-window proxy for car performance.

### 2. Monte Carlo Simulator (The Physics Engine)

Models the race lap-by-lap, 10,000 times. It rolls "luck" for every lap.

- **Dynamic Variable**: Tire degradation curves calibrated during Friday practice.
- **Safety Car**: 1% per-lap trigger probability (tuned to circuit-specific history).

### 3. Bayesian Inference (The Belief Updater)

Starts with priors based on historical driver dominance and updates them with evidence from the current qualifying weekend.

- **Uncertainty Quantification**: Explicitly models the variance of new-regulation-year transitions.

---

## 🏗️ Feature Engineering: The Tiered Architecture

I engineered **26+ features** across four distinct tiers to capture the underlying dynamics of F1 performance.

![Feature Importance](/images/projects/f1/feature_importance.png)
*Visualizing the relative weight of historical vs. 2026-specific features.*

---

## 🏎️ Race Case Study 1: Australian GP (The Season Opener)

The first real-world test of the 2026 power units. The ensemble faced high reliability risk and unknown tire behavior.

### The Win Probability Heatmap

![Australian GP Win Probabilities](/images/projects/f1/win_probabilities.png)

| Driver | Ensemble Win % | Note |
| :--- | :---: | :--- |
| **Russell** (Mercedes) | **38.9%** | Dominant pole; structural advantage. |
| **Hadjar** (Red Bull) | **15.8%** | Beneficiary of Red Bull's historical strength. |
| **Antonelli** (Mercedes) | **15.6%** | High "constructor carryover" effect. |

### Physics vs. Pattern

While XGBoost loved Hadjar (35.2%) based on previous Red Bull dominance, the Monte Carlo physics engine gave Russell a massive **75.8%** win probability, because his 0.4s qualifying gap is historically insurmountable at Albert Park without a safety car.

---

## 🏎️ Race Case Study 2: Japanese GP (The Masterclass at Suzuka)

Suzuka is a "driver's track" — its high-speed flow punishes inconsistent lap times. The 2026 Japanese GP prediction focused on **Verstappen's recovery** from a shock P11 start and the **Mercedes dual-threat**.

### Win Probability Distribution

![Japanese GP Win Probability](/images/projects/f1/plot_01_win_probability.png)

### The Verstappen Recovery Analysis

Max Verstappen qualified P11 due to an ERS mapping error. Could he win from the midfield?

![Verstappen Recovery](/images/projects/f1/plot_09_verstappen_recovery.png)

- **Probability**: 0.3% (model consensus was nearly unanimous: winning from P11 at Suzuka requires multiple safety cars).
- **SC Sensitivity**: Late-race safety cars (Laps 41-53) increased his win probability 12x, but even then it capped at ~2%.

---

## 🧪 Simulation Insights

### Tire Degradation Curves

The 2026 narrower tires showed significant "cliff" behavior after 18 laps on the Soft compound.
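The safety-car sensitivity analysis above can be reproduced in miniature. Below is a minimal sketch of the lap-by-lap Monte Carlo loop: each lap rolls a safety-car trigger and adds per-lap pace noise, and a safety car compresses the field's accumulated gaps. All pace offsets, the noise level, and the compression factor are illustrative assumptions, not the calibrated values used in the real system.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-lap pace deficits (seconds) vs. the pole-sitter,
# keyed by grid slot. Illustrative numbers only.
BASE_PACE_GAP = {1: 0.0, 2: 0.4, 11: 0.9}
SC_PER_LAP_P = 0.01   # 1% safety-car trigger probability per lap
LAPS = 53             # Suzuka race distance

def simulate_race(n_sims=5_000):
    """Lap-by-lap Monte Carlo: accumulate each driver's time offset
    (base pace deficit + random per-lap noise); a safety car compresses
    the field by shrinking accumulated offsets toward zero."""
    wins = {pos: 0 for pos in BASE_PACE_GAP}
    for _ in range(n_sims):
        offset = {pos: 0.0 for pos in BASE_PACE_GAP}
        for _lap in range(LAPS):
            if rng.random() < SC_PER_LAP_P:
                # Field compression: gaps largely erased behind the SC.
                offset = {pos: t * 0.1 for pos, t in offset.items()}
            for pos, base in BASE_PACE_GAP.items():
                offset[pos] += base + rng.normal(0.0, 0.3)
        winner = min(offset, key=offset.get)  # lowest total time offset wins
        wins[winner] += 1
    return {pos: w / n_sims for pos, w in wins.items()}

print(simulate_race())
```

Even this toy version reproduces the qualitative result from the case studies: a per-lap deficit compounds over a race distance, so only repeated field compressions give a midfield starter a realistic path to the win.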
![Tire Degradation](/images/projects/f1/plot_04_tyre_degradation_curves.png)

### Model Agreement Hierarchy

Where do the worldviews align?

![Model Agreement](/images/projects/f1/plot_07_model_agreement.png)

- **High Confidence**: Mercedes P1/P2 dominance was predicted by all 5 model variations (Bayesian, MC, XGBoost, Tyre-Adj).
- **High Uncertainty**: The fight for P3 (McLaren vs. Ferrari) showed massive variance between XGBoost (historical) and Monte Carlo (physics lap-time based).

---

## 🔍 Engineering Insights & Result Analysis

### 1. Safety Car Wildcards

The 2026 season showed a **~60% SC rate**. The Monte Carlo engine revealed that an early safety car (Laps 1-15) disproportionately favors drivers starting P10-P20, as it allows a "cheap" pit stop and field compression.

### 2. Bayesian Belief Updating

The Bayesian model was the most cautious, hedging its bets on Hamilton (P7) due to his historical performance in regulation-change years (2014, 2017).

![Bayesian Posteriors](/images/projects/f1/plot_08_bayesian_posteriors.png)

### 3. The 2026 Transition

The 50/50 ICE-to-electric power split introduced a new failure mode: **MGU-K harvesting fatigue**. Models tuned to "Reliability Risk" successfully predicted the high DNF rate for teams running first-generation 2026 batteries.

---

### Tech Stack

- **Languages**: Python (NumPy, SciPy, Pandas)
- **ML Engine**: XGBoost, Scikit-learn, PyMC
- **Data Pipelines**: FastF1, OpenF1 API
- **Visualization**: Matplotlib, Seaborn

---

## 📂 Technical Resources

Download the full technical documentation for this predictor:

- [**2026 Japanese GP Prediction Report (PDF)**](/docs/2026_Japanese_GP_Prediction_Report.pdf)
- [**2026 Australian GP Prediction Report (PDF)**](/docs/F1_2026_AusGP_Prediction_Report.pdf)
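As a closing sketch, the Bayesian "prior + evidence" loop described above can be illustrated with a conjugate Beta-Binomial update. The production system uses PyMC; this toy version only shows the shape of the idea, and the pseudo-counts and the regulation-reset discount factor are purely illustrative assumptions.

```python
def update_belief(alpha, beta, wins, races):
    """Conjugate Beta-Binomial update:
    posterior is Beta(alpha + wins, beta + losses)."""
    return alpha + wins, beta + (races - wins)

def widen_for_reg_change(alpha, beta, k=0.5):
    """Regulation-reset years inflate uncertainty: scale the pseudo-counts
    down, which keeps the prior mean but widens the variance."""
    return alpha * k, beta * k

# Illustrative prior: a driver who won ~30% of recent races (6 wins, 14 losses).
a, b = 6.0, 14.0
a, b = widen_for_reg_change(a, b)             # 2026 reset: halve the evidence weight
a, b = update_belief(a, b, wins=1, races=3)   # early-2026 evidence: 1 win in 3 races
print(round(a / (a + b), 3))                  # posterior mean win rate → 0.308
```

The discounting step is what makes the Bayesian model "the most cautious" of the three: shrinking historical pseudo-counts in a reset year lets fresh qualifying evidence move the posterior much faster than in a stable-regulation season.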