Why F1 Data Is the Best Playground for Learning ML
There's a moment in every data science student's journey where textbook datasets start feeling hollow. The Titanic. Iris flowers. MNIST digits. They serve their purpose — but they don't breathe.
Formula 1 breathes.
The Numbers First
A modern F1 car is fitted with over 300 sensors, which together produce roughly 1.5 million samples per second. That adds up to around 500 GB of telemetry per car, per race.
Here's what gets captured:
- Throttle position (sampled at 1 kHz)
- Brake pressure (at all four corners independently)
- Steering angle
- G-forces (lateral, longitudinal, vertical)
- Tyre surface temperatures (in zones across each tyre)
- Engine RPM, torque, fuel flow rate
- GPS positioning (to centimetre accuracy)
- Aerodynamic downforce estimates
- DRS status, ERS battery state
For a machine learning student, this is paradise.
The ML Problems Hidden in F1
Lap Time Prediction
Given sector-by-sector telemetry (throttle traces, braking points, corner speeds), can you predict the final lap time? This is a regression problem with rich multivariate time-series input. You're essentially learning the car's physics model from data.
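# A simplified sketch of the idea (real team models are far more elaborate)
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
# Assumption: laps.csv is a hypothetical file with one row per lap
df = pd.read_csv('laps.csv')
# Features: sector times, fuel load, tyre age, track temp
# Caveat: s1 + s2 + s3 sums to the lap time, so drop a sector (or use the
# previous lap's sectors) to make the prediction non-trivial
X = df[['s1_time', 's2_time', 's3_time', 'fuel_kg',
        'tyre_laps', 'track_temp', 'air_temp']]
y = df['lap_time']
# Hold out 20% of laps so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = GradientBoostingRegressor(n_estimators=300, max_depth=5)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out laps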
Gradient boosted trees are only the starting point. More elaborate approaches include LSTMs over rolling lap windows and physics-informed neural networks that bake in the Pacejka tyre model.
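To make the "rolling lap windows" idea concrete, here is a minimal sketch of how per-lap rows can be reshaped into fixed-length sequences, the input shape a sequence model such as an LSTM expects. The helper name make_lap_windows, the three feature columns, and all the numbers are made up for illustration; the stint is synthetic, not real F1 data.

```python
import numpy as np

def make_lap_windows(features, targets, window=5):
    """Turn per-lap rows into (window, n_features) sequences.

    Sample k holds the `window` laps before lap k + window;
    its label is the lap time of lap k + window.
    """
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    if len(features) <= window:
        raise ValueError("need more laps than the window length")
    X = np.stack([features[i - window:i] for i in range(window, len(features))])
    y = targets[window:]
    return X, y

# Synthetic 20-lap stint with three hypothetical features
rng = np.random.default_rng(0)
lap_no = np.arange(20)
features = np.column_stack([
    100.0 - 1.6 * lap_no,              # fuel_kg burning off
    lap_no,                            # tyre_laps
    35.0 + rng.normal(0.0, 0.5, 20),   # track_temp
])
lap_times = 92.0 + 0.04 * lap_no + rng.normal(0.0, 0.1, 20)

X_seq, y_seq = make_lap_windows(features, lap_times, window=5)
print(X_seq.shape, y_seq.shape)  # (15, 5, 3) (15,)
```

Each window deliberately excludes the lap being predicted, so the model only sees information available before that lap starts. Whether to also feed in the current lap's fuel load and tyre age is a design choice left open here.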
Pit Stop Timing Optimisation
When should you pit? This is a sequential decision problem under uncertainty — the perfect motivation to learn reinforcement learning. The state space includes track position, competitor strategies, tyre degradation, and weather probability. The action space: stay out, pit for softs, pit for hards, retire.
This maps almost perfectly to a Markov Decision Process — and if you can explain RL through a pit stop analogy, you've understood it deeply.
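As a rough illustration of the MDP framing, the sketch below solves a heavily simplified pit stop problem by backward induction (dynamic programming). It is a deterministic special case: one tyre compound, no competitors, no weather, and only two actions. The state is (laps left, tyre age), and every constant is an invented number, not a measured one.

```python
from functools import lru_cache

BASE_LAP = 90.0     # seconds per lap on fresh tyres (made-up)
DEG_PER_LAP = 0.1   # seconds lost per lap of tyre age (made-up)
PIT_LOSS = 22.0     # seconds lost driving through the pit lane (made-up)
ACTIONS = ("stay_out", "pit")

def step(tyre_age, action):
    """Return (time cost of this lap, tyre age at the start of the next lap)."""
    if action == "pit":
        # Pit lap is driven on fresh tyres but pays the pit lane loss
        return BASE_LAP + PIT_LOSS, 1
    return BASE_LAP + DEG_PER_LAP * tyre_age, tyre_age + 1

@lru_cache(maxsize=None)
def best(laps_left, tyre_age):
    """Minimum remaining race time and the action that achieves it."""
    if laps_left == 0:
        return 0.0, None
    options = []
    for action in ACTIONS:
        cost, next_age = step(tyre_age, action)
        future, _ = best(laps_left - 1, next_age)
        options.append((cost + future, action))
    return min(options)

def plan_race(total_laps):
    """Follow the optimal policy from lap 1 and record the pit laps."""
    age, pit_laps = 0, []
    for lap in range(1, total_laps + 1):
        _, action = best(total_laps - lap + 1, age)
        if action == "pit":
            pit_laps.append(lap)
        _, age = step(age, action)
    return pit_laps

total_time, _ = best(50, 0)
pit_laps = plan_race(50)
print(pit_laps, round(total_time, 1))
```

With these invented numbers, a single stop around mid-distance comes out cheapest. Reinforcement learning becomes necessary once the transitions are uncertain or unknown (safety cars, rain, rival strategies) and the table of states is too large to enumerate like this.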
Driver Classification from Telemetry
Can you identify the driver from their braking fingerprint alone? Different drivers have strikingly different throttle-brake overlap patterns at specific corners. This becomes a multi-class classification problem with time-series telemetry as input, which naturally motivates attention mechanisms and temporal convolutional networks (TCNs).
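Before reaching for attention or TCNs, it helps to see the problem shape with the simplest possible baseline. The sketch below builds synthetic brake traces for two invented drivers and classifies them with a nearest-centroid rule. The data, the driver labels "A" and "B", and the ramp parameters are all assumptions for illustration; real telemetry is far noisier and the classes far closer together.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 50)  # normalised time through one braking zone

def brake_trace(ramp, noise=0.05):
    """Synthetic brake pressure (0..1) that ramps up from t=0.2 over `ramp`."""
    clean = np.clip((t - 0.2) / ramp, 0.0, 1.0)
    return clean + rng.normal(0.0, noise, t.size)

# Two made-up drivers: "A" stamps on the brake, "B" squeezes it on gradually
ramps = {"A": 0.1, "B": 0.4}
X = np.array([brake_trace(ramps[d]) for d in "AB" for _ in range(100)])
y = np.array([d for d in "AB" for _ in range(100)])

# Shuffle, then split into 150 training and 50 held-out traces
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

# Nearest-centroid classifier: one mean trace per driver
centroids = {d: X_train[y_train == d].mean(axis=0) for d in ramps}

def predict(trace):
    """Return the driver whose mean trace is closest in Euclidean distance."""
    return min(centroids, key=lambda d: np.linalg.norm(trace - centroids[d]))

accuracy = np.mean([predict(tr) == lab for tr, lab in zip(X_test, y_test)])
print(f"held-out accuracy: {accuracy:.2f}")
```

A baseline like this gives you a number to beat. Sequence models earn their complexity only when the fingerprints differ in timing and shape in ways a mean trace cannot capture.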
The Data Infrastructure Lesson
Perhaps the most underrated ML lesson from F1: data engineering matters more than model choice.
Teams run real-time infrastructure that delivers telemetry from a car travelling at 340 km/h, then processes, validates, and visualises it on the pit wall within seconds. Lossy compression, rolling window aggregation, edge inference versus cloud inference: these are production ML engineering problems you rarely encounter in academia.
Building even a toy version of this (using the FastF1 Python library) teaches you more about data pipelines than any tutorial.
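Two of the pipeline steps mentioned above, validation and rolling window aggregation, can be sketched in a few lines of standard-library Python. This is a toy under stated assumptions: the class RollingMean, the function clean_stream, the plausibility bounds, and the sample values are all invented for illustration and say nothing about how any team's pipeline works.

```python
from collections import deque

class RollingMean:
    """Fixed-size rolling mean over a stream, updated in O(1) per sample."""

    def __init__(self, window):
        self.buf = deque(maxlen=window)
        self.total = 0.0

    def update(self, value):
        if len(self.buf) == self.buf.maxlen:
            self.total -= self.buf[0]  # oldest value is about to be evicted
        self.buf.append(value)
        self.total += value
        return self.total / len(self.buf)

def clean_stream(samples, lo=0.0, hi=400.0):
    """Yield only samples inside a plausible range (hypothetical km/h bounds)."""
    for s in samples:
        if lo <= s <= hi:
            yield s

speeds = [310.0, 312.0, 9999.0, 314.0, 316.0]  # one made-up sensor glitch
roller = RollingMean(window=3)
smoothed = [roller.update(s) for s in clean_stream(speeds)]
print(smoothed)  # [310.0, 311.0, 312.0, 314.0]
```

Processing one sample at a time mirrors how a streaming system sees data. For offline analysis of a full session already loaded into a DataFrame, pandas' rolling(...).mean() does the same aggregation in one call.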
How to Start
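Install the dependencies from a shell:
pip install fastf1 pandas matplotlib scikit-learn
Then, in Python:
import fastf1
# Load the 2023 Bahrain Grand Prix race session (downloads data on first run)
session = fastf1.get_session(2023, 'Bahrain', 'R')
session.load()
# Get all laps from Max Verstappen
# (pick_drivers replaces pick_driver, which newer FastF1 releases deprecate)
ver = session.laps.pick_drivers('VER')
print(ver[['LapTime', 'Sector1Time', 'TyreLife', 'Compound']])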
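You now have real F1 lap data. The same laps also expose car telemetry (speed, throttle, brake), for example through ver.pick_fastest().get_car_data(). Do something interesting with it.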
The best way to learn something is to apply it to something you love. If that thing happens to involve 20 cars going 340 km/h while generating terabytes of sensor data, even better.