FIFA World Cup 2026 Winner Prediction: An MLOps Guide

See how an end-to-end MLOps pipeline predicts World Cup 2026 results, from automated retraining and DVC to a 10,000-run Monte Carlo simulation of the bracket.

Jun 10, 2026 · 15 min read

Predicting football is hard. It is a low-scoring sport where one deflected shot can flip a result, and a fair share of any match comes down to luck. International football is harder still: national teams play only a handful of competitive games a year, so there is far less data to learn from than in club leagues.

And as if that were not enough, FIFA has made the task harder again for this year's World Cup. The expanded 48-team tournament introduces a new format in which the top two from each of the twelve groups advance, along with the eight best of the twelve third-placed teams, so a team's group-stage fate can hinge on results in other groups. Since I like a good challenge (and football), that is exactly what I set out to predict.

This is a follow-up to my EURO 2024 prediction project, rebuilt almost from the ground up. Last time I worked entirely in Jupyter notebooks and predicted a single most-likely scoreline per match. This time, I built an end-to-end MLOps pipeline that ingests fresh results, retrains itself, and runs a Monte Carlo simulation of the whole tournament 10,000 times, turning match-level predictions into probabilities for how far each team goes.

In this article, I will walk you through the project at a high level: the data and features, the MLOps practices that keep it reproducible, the pipeline architecture, and which model turns out to predict national-team football best. You can find the full code in the project repo. And of course, I will tell you who the model thinks will win. (Spoiler: it likes Spain and Argentina at around 16% each, but the interesting part is how it gets there.)

If this has you in the mood for the tournament, I recommend watching the recordings of our Data & AI World Cup sessions or taking part in our FIFA World Cup 2026 Prediction competition. The winner receives not only an official World Cup jersey, but also a 3-month subscription to Claude Enterprise. Stay up to date with the live leaderboard.

In a Nutshell

This is an end-to-end MLOps pipeline that predicts the 2026 FIFA World Cup, pulling fresh international results and retraining automatically on Google Cloud every two hours during the tournament.
Data from API-Football and Elo ratings is processed through a Bronze-Silver-Gold medallion architecture and versioned with DVC for full reproducibility.
Ten models from five families were compared on a 347-match holdout; XGBoost won narrowly, the top five were almost inseparable, and the Elo difference between teams does most of the predictive work.
A Monte Carlo simulation plays the full tournament 10,000 times, turning match-level goal predictions into each team's odds of advancing and winning.
As of June 10, 2026, the model's favorites are Spain and Argentina, at roughly 16% each. The live predictions can be followed on an accompanying Streamlit dashboard that refreshes every two hours.

The Data Behind the Predictions

A prediction is only ever as good as what goes into it, so it is worth starting with the raw materials. The model learns from two live data sources and turns them into a single, tidy table of features.

Where the data comes from

Everything is built from two places. API-Football supplies the fixtures and per-match statistics: who played whom, when, where, and how it ended. eloratings.net supplies Elo ratings for every national team.

An Elo rating is a single number that captures how strong a team is. Every team sits somewhere on the scale, and after each match, the rating updates: beat a stronger side, and you gain a lot; lose to a weaker one, and you drop sharply. The idea comes from chess and adapts neatly to football. If you want the full intuition, this earlier DataCamp piece walks through it in the context of the 2022 World Cup.
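To make the update rule concrete, here is a minimal sketch of the textbook Elo formula. The K-factor of 40 and the example ratings are illustrative assumptions; eloratings.net applies its own weights for match importance and goal margin, which are not reproduced here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected result for team A (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, result_a: float, k: float = 40.0):
    """Return both teams' new ratings after one match.

    result_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is an assumed constant; real rating systems vary it by competition.
    """
    delta = k * (result_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical ratings: an underdog (1700) meets a favorite (2000).
underdog_wins = update_elo(1700, 2000, 1.0)  # the upset
favorite_wins = update_elo(1700, 2000, 0.0)  # the expected result
```

In this sketch, the underdog's gain from the upset is several times larger than what the favorite earns from the expected win, which is the "beat a stronger side, and you gain a lot" behavior described above.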

Together, the two sources give a Gold dataset of roughly 6,900 international matches since 2018 to learn from.

What the model predicts

Here is the first important design choice. Instead of predicting the outcome directly as a win, draw, or loss, the model predicts something more granular: the number of goals each team scores in a match. Goal counts in football follow, to a good approximation, a Poisson distribution, the standard way to model how often a relatively rare event happens in a fixed window of time.

Predicting goals rather than results is what makes everything later possible. Once the model can produce a plausible scoreline for any matchup, the questions everyone actually cares about, who escapes the group and who lifts the trophy, can be answered by simulating those scorelines thousands of times.
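As a rough illustration of how goal rates become match probabilities, the sketch below sums over a grid of scorelines. It assumes the two goal counts are independent Poisson variables, which is a simplification (the bivariate Poisson discussed later allows them to be correlated), and the rates of 1.8 and 0.9 are made-up example values.

```python
from math import exp, factorial


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals given an expected goal rate."""
    return rate ** k * exp(-rate) / factorial(k)


def outcome_probabilities(rate_home: float, rate_away: float, max_goals: int = 10):
    """Return (home win, draw, away win) probabilities.

    Assumes independent Poisson goal counts, truncates the scoreline
    grid at max_goals per team, and renormalizes the small remainder.
    """
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, rate_home) * poisson_pmf(a, rate_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total


# Illustrative rates only, not model output.
probs = outcome_probabilities(1.8, 0.9)
```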

The features that matter

Each match is described by a small, carefully chosen set of features:

Elo difference: the gap in rating between the two teams. This is by far the single most important feature in the model, with an importance roughly two orders of magnitude above the next strongest. That fits intuition, as the strength gap between the two sides tells you more about the likely result than almost anything else.
Elo sum: the two ratings added together, a stand-in for the overall quality of the fixture. The difference alone cannot tell Argentina against Spain apart from San Marino against Andorra, two evenly matched games at completely different levels, and the sum restores that information.
Rolling Elo change (last 5 matches): how much each team's rating has shifted recently. This captures form while already accounting for the strength of the opponents faced.
Rolling goals for and against (last 5 matches): recent attacking and defensive output in absolute terms, calculated for each team.
Match context: the competition tier (a World Cup game carries different weight from a qualifier or a Nations League fixture), whether the match is a knockout, and whether it is played at a neutral venue.

Every feature is strictly leakage-safe, meaning each one uses only information that was available before kickoff. That sounds obvious, but it is one of the easiest ways to accidentally build a model that looks brilliant in testing and falls apart in the real world.
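The usual way to keep a rolling feature leakage-safe is to read it before the current match is added to the team's history. The sketch below shows that ordering for rolling goals scored; the dictionary keys and the function name are hypothetical, not the project's actual schema.

```python
from collections import defaultdict, deque


def add_rolling_goals(matches, window=5):
    """Attach each team's mean goals scored over its previous `window` matches.

    `matches` must be sorted by kickoff time. Each match is a dict with
    the hypothetical keys: home, away, home_goals, away_goals.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for match in matches:
        row = dict(match)
        for side in ("home", "away"):
            past = history[match[side]]
            # Read the feature BEFORE recording this match's goals,
            # so only pre-kickoff information is used.
            row[f"{side}_goals_for_last{window}"] = (
                sum(past) / len(past) if past else None
            )
        history[match["home"]].append(match["home_goals"])
        history[match["away"]].append(match["away_goals"])
        rows.append(row)
    return rows


matches = [
    {"home": "A", "away": "B", "home_goals": 2, "away_goals": 0},
    {"home": "B", "away": "A", "home_goals": 1, "away_goals": 3},
]
features = add_rolling_goals(matches)
```

The first match gets no rolling value at all, because neither team has any history yet; the second match sees only the goals from the first.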

One idea that did not make the cut: I had planned a set of "playing style" features built by clustering teams from their in-game statistics, an unsupervised learning step. In practice, the teams did not separate into meaningful groups, so rather than feed the model noise, I dropped it. Negative results are still results.

Keeping the data reproducible

With data arriving from two sources on a rolling basis, the path from raw files to model-ready features has to be identical every single time. That is what a medallion architecture provides. It organizes data into three layers:

Bronze: the raw data, exactly as it arrives, left untouched.
Silver: cleaned and standardized. Here I map team names between the two sources (they rarely agree on spellings), validate the schema, join the Elo ratings onto the match records, and deal with anything missing or malformed.
Gold: the modeling layer, one tidy row per match with every feature computed and ready to train on.

Each layer feeds the next, so when something looks off, I can trace it back one stage at a time instead of untangling everything at once. To make the whole path reproducible, I use DVC (Data Version Control). Whenever fresh results come in, a single dvc repro rebuilds Silver and Gold from Bronze, re-running a step only if its inputs changed, and versions the resulting datasets so any earlier state can be recovered exactly.
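The "re-run a step only if its inputs changed" behavior rests on content hashing: DVC records hashes of each stage's dependencies and outputs and compares them on the next run. The toy below illustrates only that idea in plain Python. It is not how DVC is implemented, and the stage name, cache layout, and sample data are invented for the example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to detect whether an input changed."""
    return hashlib.sha256(data).hexdigest()


def run_stage(name, input_bytes, transform, cache):
    """Re-run `transform` only when the input's hash differs from the cached one.

    Returns (output, ran), where `ran` is False if the stage was skipped.
    """
    key = fingerprint(input_bytes)
    cached = cache.get(name)
    if cached and cached["input_hash"] == key:
        return cached["output"], False  # inputs unchanged, stage skipped
    output = transform(input_bytes)
    cache[name] = {"input_hash": key, "output": output}
    return output, True


cache = {}
bronze = b"esp,arg,2,1\n"  # made-up raw record
silver, ran_first = run_stage("silver", bronze, bytes.upper, cache)
_, ran_again = run_stage("silver", bronze, bytes.upper, cache)
_, ran_new = run_stage("silver", bronze + b"fra,eng,0,0\n", bytes.upper, cache)
```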

Choosing the Best Model

Predicting goals is a well-studied problem, and there is no single obvious tool for it. So rather than commit to one approach up front, I built ten and let them compete.

The contenders

The ten models span five families plus a simple baseline. You do not need to know the internals of each one; the point is that they make very different assumptions about how goals come about.

Family	Models	The core idea
Baseline	Mean-rate Poisson	Assumes every team simply scores an overall long-run average, ignoring all features. A floor for the others to beat.
Statistical	Bivariate Poisson, Negative Binomial	Model the two goal counts directly with probability distributions built for counting events.
Bayesian	Bayesian Poisson (MCMC)	The same counting idea, but it returns a full range of uncertainty around each estimate. Far more demanding to compute: roughly 100 times slower to fit than the rest.
Time series	SARIMAX	Treats a team's results as a sequence over time and projects that sequence forward.
Machine learning	Ridge, Random Forest, XGBoost	Learn patterns straight from the features without committing to a fixed equation.
Deep learning	LSTM, 1D CNN	Neural networks that hunt for sequential and local patterns in the data.

How they were scored

With ten candidates, picking a winner by eye was never going to work. Instead, each model passes through three stages, and the code decides whether it moves forward. This is what is meant by code-based deployment: models are promoted from one environment to the next by automated checks rather than manual tuning, so the whole selection stays reproducible and easy to audit.

Experiment. Every model is trained only on international matches played before the 2022 World Cup. Not all of those matches count equally: more recent games and higher-stakes fixtures are given more weight (time-decay and match-importance weighting), so a recent competitive result shapes the model more than an old friendly. Each model's settings are then tuned to minimize Poisson negative log-likelihood (NLL) using cross-validation. NLL is just a score for how well the predicted goal rates match the goals teams ended up scoring, where lower is better. The result is the best-tuned version of each model.
Quality assurance. Those tuned models are then tested on matches they have never seen: the 2022 World Cup plus six major tournaments held since (the EURO, two Africa Cups of Nations, the Copa América, the Asian Cup, and the Gold Cup), 347 matches in all. Here, the metric switches to the ranked probability score (RPS), which measures how good a probabilistic forecast is when the outcomes have a natural order, like loss, draw, win, and which rewards being confident in roughly the right direction. Lower is better again. The strongest model here becomes the challenger. RPS is the right yardstick because the real aim is predicting how far teams go, not just goal totals.
Deploy. The challenger is compared against the reigning champion. If it wins, it is promoted and refitted on every available match, so it enters the tournament having learned from all the data.
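For readers who want to see the two metrics spelled out, here is a minimal sketch of both using their standard definitions. The optional weights argument stands in for the time-decay and match-importance weighting mentioned above; how the project actually computes those weights is not shown here, and the example probabilities are made up.

```python
from math import lgamma, log


def poisson_nll(goals, rates, weights=None):
    """Weighted mean Poisson negative log-likelihood (lower is better)."""
    weights = weights or [1.0] * len(goals)
    total = sum(
        w * (rate - k * log(rate) + lgamma(k + 1))
        for k, rate, w in zip(goals, rates, weights)
    )
    return total / sum(weights)


def rps(probs, outcome_index):
    """Ranked probability score for one match (lower is better).

    probs is ordered (home win, draw, away win); outcome_index marks
    which of the three actually happened.
    """
    cum_p = cum_o = score = 0.0
    for i in range(len(probs) - 1):
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome_index else 0.0
        score += (cum_p - cum_o) ** 2
    return score / (len(probs) - 1)


# The same forecast scored against each possible result.
forecast = [0.7, 0.2, 0.1]
confident_right = rps(forecast, 0)  # home win happened
near_miss = rps(forecast, 1)        # draw happened
far_miss = rps(forecast, 2)         # away win happened
```

Because RPS works on cumulative probabilities, a confident home-win forecast is penalized less when the match ends in a draw than when the away side wins, which is the ordering property that makes it a natural fit for win, draw, and loss.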

What won

So which approach came out on top? Here is the full holdout leaderboard, scored by RPS (lower is better):

Model	Holdout RPS
XGBoost	0.18289
Bayesian Poisson	0.18316
Negative Binomial	0.18373
Bivariate Poisson	0.18389
Random Forest	0.18392
SARIMAX	0.18583
Ridge	0.18813
LSTM	0.19299
1D CNN	0.20916
Mean-rate Poisson (baseline)	0.22872

Four things stand out from these results:

XGBoost won, but only just. The top five models (XGBoost, Bayesian Poisson, Negative Binomial, Bivariate Poisson, and Random Forest) finished within about 0.001 RPS of one another. When five very different approaches land this close, it usually means the ceiling is set by the data and features, not the model. Here, the Elo difference does so much of the work that the choice of model barely moves the needle.
One feature dominates. Elo difference was the most important predictor by a wide margin, roughly a hundred times more influential than the next feature. That is reassuring rather than surprising: in a single match, the gap in strength between two teams really is most of the story.
Deep learning finished last, the baseline aside. The 1D CNN and LSTM were the weakest models apart from the naive baseline. With only around 7,000 matches to learn from, there is simply not enough data to feed networks with so many parameters; classical methods cope far better with small, structured datasets.
No sign of overfitting in the classical models. Normally, a model does a little worse on unseen data than during training. Here, almost every model (the LSTM excepted) scored better on the held-out tournaments than in cross-validation. The likely reason is that tournament football is more predictable than the everyday international calendar: higher stakes, stronger and more familiar teams, and neutral venues all strip out some of the randomness.

For the live tournament, I do not run all ten. I keep a smaller roster: the mean-rate baseline as a reference point, plus the three best performers. XGBoost and Bayesian Poisson take the top two spots outright.

Third place is effectively a tie: the Negative Binomial and Bivariate Poisson finish within 0.0002 RPS of each other and swap places depending on the random seed, so between two statistically indistinguishable models, I went with the Bivariate Poisson, whose formulation has the stronger footing in the football-prediction literature (Karlis and Ntzoufras, 2004).

That leaves a roster of XGBoost (machine learning), Bivariate Poisson (classical statistics), and Bayesian Poisson (Bayesian inference). The next section covers how those models run, retrain, and turn single-match predictions into a full tournament forecast.

Putting It Into Production

A model that lives in a notebook is only useful while you are sitting in front of it. To predict matches across a month-long tournament, the whole thing has to run on its own: pull new results, retrain, re-simulate, and refresh the forecast without anyone touching it. That is the job of the pipeline.

The bi-hourly pipeline on GCP

The entire project runs as a single scheduled job on Google Cloud Run. Before the tournament, it wakes up once a day; from the opening match on June 11, it runs every second hour. Each run follows the same cycle:

Check for new data. If no matches have finished since the last run, there is nothing to do, and the job exits early.
Ingest and rebuild. When new results have come in, they are pulled from the data sources, and a single dvc repro rebuilds the Silver and Gold layers so the features are current.
Retrain, predict, simulate. The roster models are brought up to date (more on how in a moment), every upcoming matchup is predicted, and the full tournament is simulated.
Score. Once a match is settled, the predictions made for it are scored, which feeds the monitoring described below.

Because every step is triggered by code on a schedule, there is no manual button-pressing during the tournament. New result in, refreshed forecast out.
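The cycle above boils down to an early-exit check followed by a fixed sequence of steps. The sketch below captures only that control flow; the function name, the match IDs, and the step callables are hypothetical stand-ins for the project's real ingestion, rebuild, and simulation code.

```python
def run_cycle(fetch_finished_matches, seen_ids, steps):
    """One scheduled run: exit early when no new match has finished.

    fetch_finished_matches returns IDs of finished matches, seen_ids is
    the set already processed, and steps are callables run in order.
    All three are placeholders for illustration.
    """
    new_ids = [m for m in fetch_finished_matches() if m not in seen_ids]
    if not new_ids:
        return "skipped"
    for step in steps:
        step(new_ids)
    seen_ids.update(new_ids)
    return "refreshed"


log = []
steps = [
    lambda ids: log.append("ingest and rebuild"),
    lambda ids: log.append("retrain, predict, simulate"),
    lambda ids: log.append("score settled matches"),
]
seen = {101}
first = run_cycle(lambda: [101, 102], seen, steps)   # match 102 is new
second = run_cycle(lambda: [101, 102], seen, steps)  # nothing new this time
```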

Two modes: frozen vs. per-round

This is where the project doubles as an experiment. During the tournament, the roster runs in two parallel modes, and the difference between them is the question I hope to answer from the data: Does retraining as the tournament unfolds make the predictions better?

Frozen. The models are locked the moment the tournament kicks off and never retrained. They still respond to results, because each simulation starts from the updated bracket, but the model parameters themselves never change.
Per-round. The hyperparameters (the high-level settings) stay fixed, but the parameters the model learns are refitted on all available data after every completed group matchday and every knockout round, so the models keep learning from the tournament as it happens.

Running both side by side lets me compare them on two fronts once it is over: raw predictive accuracy, and how quickly each one's uncertainty resolves as the field narrows. If per-round wins, regular retraining earns its keep; if frozen holds its own, the extra machinery may not be worth it.

From predictions to a tournament: the Monte Carlo simulation

Predicting a single match is one thing. Turning that into "what is each team's chance of winning the tournament" is where the Monte Carlo simulation comes in.

First, inference. Rather than predicting only the fixtures we already know, the model predicts every possible matchup among the 48 teams. That sounds excessive, but in a tournament, any team could meet any other in the knockouts, so a prediction has to be ready for every pairing.

Next, the rules have to be encoded, and the 2026 format makes that especially awkward. Across the 12 groups, the top two advance automatically, but so do the eight best third-placed teams, and which knockout slot each of those eight lands in depends on which groups they came from.

There are 495 ways to choose eight qualifying groups out of twelve (twelve choose eight), and each one produces a different set of round-of-32 pairings. There is no clean formula for it; FIFA simply publishes a table. So I (or rather my very capable colleague Cursor) hardcoded all 495 combinations into a mapping, using the official table as the source.

"best_third_mappings": {
  "EFGHIJKL": {
    "74": "3F",
    "77": "3G",
    "79": "3E",
    "80": "3K",
    "81": "3I",
    "82": "3H",
    "85": "3J",
    "87": "3L"
  }, 
  "DFGHIJKL": ...

Each key, like EFGHIJKL, lists which eight groups supplied the advancing third-placed teams, and the values slot each of those teams (3E, 3F, and so on) into a specific round-of-32 match number. That is one entry; the full mapping repeats it 495 times, once per combination.
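A lookup against this mapping might look like the sketch below. It contains only the single entry shown above, and it assumes the keys are always the eight group letters in alphabetical order, which matches the excerpt but is not confirmed for the full file. The function name is hypothetical.

```python
from math import comb

# One entry copied from the excerpt above; the real mapping has 495.
BEST_THIRD_MAPPINGS = {
    "EFGHIJKL": {
        "74": "3F", "77": "3G", "79": "3E", "80": "3K",
        "81": "3I", "82": "3H", "85": "3J", "87": "3L",
    },
}


def third_place_slots(qualifying_groups):
    """Map round-of-32 match numbers to the third-placed teams' group slots."""
    key = "".join(sorted(qualifying_groups))
    if len(key) != 8:
        raise ValueError("exactly eight groups must supply a third-placed team")
    return BEST_THIRD_MAPPINGS[key]


n_combinations = comb(12, 8)  # twelve choose eight
slots = third_place_slots(["L", "K", "J", "I", "H", "G", "F", "E"])
```

Sorting the letters before the lookup means the order in which the best-third table is resolved does not matter, only which eight groups made the cut.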

The three host nations (the United States, Canada, and Mexico) get one extra piece of handling. When a host plays a match staged in its own country, the simulation applies a home-advantage adjustment for that fixture, while the rest of the tournament is treated as neutral ground.

With the predictions and the rules in place, the simulation runs the whole tournament 10,000 times. In each run, it follows this procedure:

Draw a scoreline for every match by sampling home and away goals from the model's predicted distributions.
Play out the group stage under the real points and tiebreak rules.
Resolve the best-third table.
Fill in the knockout bracket from the mappings above.
Play through to a single champion.

Across 10,000 simulated tournaments, the share of runs in which a team reaches the final, or lifts the trophy, becomes that team's probability. One run is a guess; ten thousand runs is a forecast.
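To show the counting logic at a small scale, here is a sketch that simulates a four-team knockout instead of the full 48-team tournament. Everything in it is illustrative: the team labels, the strength numbers, the goal rates derived from them, and the coin flip that settles level scores (a stand-in for extra time and penalties) are all assumptions, not the project's rules.

```python
import random
from collections import Counter
from math import exp


def sample_poisson(rate, rng):
    """Knuth's algorithm: draw one Poisson-distributed goal count."""
    threshold, k, p = exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def play_knockout(team_a, team_b, rates, rng):
    """Sample a scoreline; a level score is settled by a coin flip (assumption)."""
    goals_a = sample_poisson(rates[(team_a, team_b)], rng)
    goals_b = sample_poisson(rates[(team_b, team_a)], rng)
    if goals_a == goals_b:
        return rng.choice([team_a, team_b])
    return team_a if goals_a > goals_b else team_b


def simulate(semifinals, rates, n_runs=10_000, seed=2026):
    """Share of simulated tournaments won by each team."""
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_runs):
        finalists = [play_knockout(a, b, rates, rng) for a, b in semifinals]
        titles[play_knockout(finalists[0], finalists[1], rates, rng)] += 1
    return {team: wins / n_runs for team, wins in titles.items()}


# Made-up strengths; rates[(x, y)] is x's expected goals against y.
strength = {"A": 1.4, "B": 1.0, "C": 1.2, "D": 0.8}
rates = {
    (a, b): 1.3 * strength[a] / strength[b]
    for a in strength for b in strength if a != b
}
win_probs = simulate([("A", "D"), ("B", "C")], rates)
```

The real simulation adds the group stage, tiebreaks, and the best-third mapping on top, but the final step is the same: count how often each team wins and divide by the number of runs.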

Tracking it all with MLflow

Every run described so far, in both modes, is logged to MLflow (hosted on DagsHub). Experiment tracking means systematically recording the inputs, settings, results, and outputs of each run, so any of them can be compared against the others or reproduced exactly. A few of the things it captures are worth calling out:

Reproducibility. The simulation uses a fixed random seed derived from the tournament round, and the same seed is shared by the frozen and per-round modes. That means any difference between the two comes from the models themselves, not from the luck of the draw inside the simulation. Each run also logs the exact data snapshot it saw (the number of Gold rows and a timestamp), so results can always be traced back to their inputs.
The experiment. Each run is tagged with its mode (frozen or per-round) and its stage in the lifecycle, from experimental and QA through to the live inference and refit runs, mirroring the promotion flow from the previous section.
Comparison. Holdout RPS is logged as the selection metric, along with a reference to the current champion run for lineage. Fitting time is recorded as well, which is where the Bayesian model's roughly 100-times-slower training shows up in black and white.
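One simple way to derive a seed from the tournament round is to hash the round's label, as sketched below. The hashing scheme and the round labels are assumptions for illustration; the article does not specify how the project derives its seed.

```python
import hashlib
import random


def seed_for_round(round_name: str) -> int:
    """Derive a stable 32-bit seed from a round label (hypothetical scheme)."""
    digest = hashlib.sha256(round_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def simulation_rng(round_name: str) -> random.Random:
    """Random generator that is identical for every run of the same round."""
    return random.Random(seed_for_round(round_name))


# Both modes build their generator from the same round label,
# so they consume exactly the same random draws.
frozen_rng = simulation_rng("group_matchday_2")
per_round_rng = simulation_rng("group_matchday_2")
frozen_draws = [frozen_rng.random() for _ in range(3)]
per_round_draws = [per_round_rng.random() for _ in range(3)]
```

With the random draws held equal, any gap between the frozen and per-round forecasts can be attributed to the refitted parameters rather than to simulation noise.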

The trained models and the prediction files themselves (the tournament probabilities, group standings, and match forecasts) are stored as run artifacts, and those files are exactly what the live dashboard reads. That closes the loop: from raw results, through training and simulation, to the numbers you can see online.

Monitoring for drift

The last piece runs once matches are settled. As real results arrive, the predictions made for them are scored and compared against the simple mean-rate baseline. If the full models start losing ground to a model that knows nothing about the teams, that is a warning sign of drift: the patterns learned before the tournament may no longer match what is happening on the pitch.
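A check along these lines can be as simple as comparing the recent average score of the model with that of the baseline. The sketch below assumes per-match RPS values for both; the window of eight matches, the zero margin, and the sample scores are illustrative choices, not the project's settings.

```python
def drift_alert(model_rps, baseline_rps, window=8, margin=0.0):
    """Flag drift when the model's recent mean RPS is no better than the baseline's.

    Both inputs are per-match RPS values in chronological order.
    window and margin are illustrative assumptions.
    """
    if len(model_rps) < window or len(baseline_rps) < window:
        return False  # not enough settled matches yet
    recent_model = sum(model_rps[-window:]) / window
    recent_baseline = sum(baseline_rps[-window:]) / window
    return recent_model >= recent_baseline - margin


# Made-up scores: lower RPS is better.
healthy = drift_alert([0.18] * 8, [0.23] * 8)
drifting = drift_alert([0.25] * 8, [0.23] * 8)
too_early = drift_alert([0.25] * 3, [0.23] * 3)
```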

Watching for this is standard practice for any system making live predictions, and you can read more about how it is detected in this guide to data drift and model drift.

So, Who Wins the World Cup?

After all that machinery, here is what it is for.

The favorites

As of June 10, 2026, the day before the opening match, the model's verdict is clear at the very top and crowded just behind it. Spain and Argentina lead the field, each with roughly a 16% chance of lifting the trophy. That the reigning world champions (Argentina) and the reigning European champions (Spain) come out on top is a reassuring sanity check that the model is grounded in reality.

Behind them sits a tight chasing pack: France, England, Brazil, and Colombia round out the most likely winners. These are live figures, and they will move the moment real results start landing, so treat them as a June 10 snapshot rather than a fixed prophecy. The dashboard always shows the current numbers, with a maximum delay of two hours.

The live dashboard

Speaking of which, every number in this article comes from a live Streamlit app that updates automatically as the pipeline runs. You can open it at wc2026-predictions.streamlit.app and follow along throughout the tournament. It has four main views:

Tournament overview: how far each team is expected to go, at a glance.
Group standings: for every group, each team's probability of finishing first, second, third (split into third-and-through versus third-and-out, thanks to the best-third rule), or fourth.
Match predictions: for each group game, the chance of a home win, draw, or away win, along with the most likely knockout bracket.
Most common knockout matchups: the pairings the simulation produces most often.

One quirk worth flagging in the match view: a couple of teams appear in two possible round-of-32 slots at once. That is not a bug. It happens when a group is so evenly balanced that the model cannot confidently tell which qualifying position a team will take; combined with the best-third uncertainty, the two outcomes lead to different knockout slots. In Turkey's case, it even led to them appearing twice in the round of 16.

[Figure: The final rounds, from the quarterfinals to the final, as projected by the XGBoost model before tournament kickoff.]

The coin-flip team: United States

The fun of a model like this lies in the teams that defy the eye test, and the clearest example is the United States. If you go to the tournament overview on the dashboard, you will instantly notice that the US stands out in color.

As co-hosts playing in front of home crowds, the US might be expected to make a comfortable start, but the model is far more cautious: it gives them only about a 54.6% chance of escaping their group, the 13th-lowest in the entire field (remember that two-thirds of all teams advance from the group stage!), because their group with Australia, Paraguay, and Turkey is unusually even.

The interesting part is what comes next. Having scraped through, the US then hovers at roughly a coin flip in every round that follows. Stack those coin flips together and they land at about a 2% chance of winning the whole tournament, which is the 13th-highest of all 48 teams.
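The arithmetic behind that figure is easy to check. The snippet below assumes exactly 50% in each of the five knockout rounds, which simplifies "roughly a coin flip"; the group-stage probability is the 54.6% quoted above.

```python
p_escape_group = 0.546  # chance of getting out of the group, as quoted above
knockout_rounds = 5     # round of 32, round of 16, quarterfinal, semifinal, final

# Assumption: an exact 50% chance of winning each knockout round.
p_title = p_escape_group * 0.5 ** knockout_rounds
p_title_percent = round(p_title * 100, 1)
```

That works out to about 1.7%, which rounds to the 2% shown on the dashboard.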

A side that ranks 13th from the bottom to get out of its group and 13th from the top to win it all is just about the perfect definition of a coin-flip team: never the favorite, never out of it.

Final Thoughts

This project was a lot of work, and it covers far more ground than one article can hold. The repo has plenty that did not make it in here, including the full set of candidate models, the feature engineering, and the orchestration that keeps everything running.

For now, the model has made its picks, and the tournament will be the judge. Whether you came for the MLOps or the football, I hope you enjoy watching it unfold as much as I will. You can follow the live forecast as the matches roll in and see how well the predictions hold up.

If you want to take a closer look at some of the concepts I mentioned, I recommend taking our MLOps Concepts course.

FIFA World Cup 2026 Winner Prediction FAQs

Who will win the FIFA World Cup 2026?

How accurate can a machine learning model be at predicting football?

Why predict the number of goals instead of the match result?

What is a Monte Carlo simulation, and why run 10,000 of them?

A Monte Carlo simulation repeatedly plays out a random process to estimate probabilities that are hard to calculate directly. Here, each run draws a scoreline for every match from the model's predictions and plays the tournament through to a winner; doing this 10,000 times turns single-match predictions into stable percentages like "Spain wins about 16% of the time." One simulated tournament is just a single possible outcome, but ten thousand of them approximate the real spread of possibilities.

What tools do you need to build an MLOps pipeline like this?

The core pieces are data versioning (this project uses DVC), experiment tracking (MLflow), a way to run jobs on a schedule (Google Cloud Run with Cloud Scheduler), and a way to serve the results (a Streamlit dashboard).

The models themselves draw on a mix of Python libraries: scikit-learn (Ridge and random forest), XGBoost (the champion), statsmodels and SciPy (the Poisson, bivariate Poisson, and negative binomial regressions, plus SARIMAX), PyMC (the Bayesian model), and Keras (the LSTM and CNN), with pandas and NumPy handling the data.

None of these is strictly necessary for a one-off model, but together they make the pipeline reproducible and able to retrain and refresh itself with no manual work.

Author

Tom Farnschläder
