EURO Prediction Project
  • Welcome to the EURO 2024 Prediction Project!

    Thank you for visiting my DataLab project on predicting the outcomes of EURO 2024 matches. This project has been a labor of love, involving extensive data collection, feature engineering, and machine learning model building to forecast the results of one of the most exciting football tournaments in the world.

    Project Overview: The goal of this project is to predict the outcomes of EURO 2024 matches, leveraging various features such as tactical formations, team status (favorites/underdogs), offensive and defensive form, playstyle, and tournament stages. Additionally, I've developed a betting system that uses calculated probabilities and the Kelly criterion to identify the best betting opportunities. If you're interested in my Excel sheet for betting the next matchday, feel free to hit me up on LinkedIn.
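
    As a minimal sketch of the betting idea, the standard Kelly formula can be written in a few lines. The function name and the example numbers are illustrative assumptions; the source does not show how the Excel sheet actually computes the stake.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of the bankroll to stake on a single bet.

    p_win: model probability that the bet wins (between 0 and 1).
    decimal_odds: bookmaker decimal odds (total payout per unit staked, > 1).
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    b = decimal_odds - 1.0       # net winnings per unit staked
    q = 1.0 - p_win              # probability of losing the bet
    f = (b * p_win - q) / b      # standard Kelly formula
    return max(f, 0.0)           # stake nothing when the edge is negative


# Illustrative numbers: a 50% win probability at decimal odds of 2.5
stake = kelly_fraction(0.50, 2.5)
```

    A bet is only worth considering when the returned fraction is positive, that is, when the calculated probability is higher than the probability implied by the odds.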

    Project Structure

    1 - Data Scraping:

    This part is dedicated to scraping historical match data from Transfermarkt and other sources using BeautifulSoup. The output is a comprehensive dataset of past matches.

    1b - Adding K.O. Matches:

    This notebook adds the EURO knockout phase matches based on group table results.

    2 - Feature Engineering:

    Here, we process the raw data to create meaningful features for our models, such as tactical formations, team status (favorites/underdogs), offensive and defensive form, playstyle, and tournament stages.

    2b - Adding Cluster Features:

    This notebook uses match data from Footystats to create playstyle variables derived from the teams' average ball possession, offensive efficiency and defensive vulnerability.

    3 - Data Cleaning and Splitting:

    This notebook handles data cleaning, splitting the dataset into played and unplayed matches, and removing outliers and missing values. It prepares the data for model training and prediction.

    4 - Normalization and Encoding:

    This step involves normalizing the data, imputing missing values for formations, and applying one-hot encoding to categorical variables. The processed data is then ready for model input.
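
    The three operations named above could look roughly like this in pandas. The toy data, the min-max scaling and the most-frequent-value imputation are assumptions for illustration; the notebook may use a different scaler or imputation rule.

```python
import pandas as pd

# Hypothetical toy data; the column names only mirror the ones used in this project
df = pd.DataFrame({
    "level_diff": [-2.0, 0.0, 1.0, 3.0],
    "formation_team": ["4-3-3", None, "4-4-2", "4-3-3"],
    "stage": ["friendly", "group", "group", "ko"],
})

# Impute missing formations with the most frequent one (an assumed strategy)
most_common = df["formation_team"].mode().iloc[0]
df["formation_team"] = df["formation_team"].fillna(most_common)

# Min-max normalization of a numeric column to the range [0, 1]
col = df["level_diff"]
df["level_diff"] = (col - col.min()) / (col.max() - col.min())

# One-hot encode the categorical columns
encoded = pd.get_dummies(df, columns=["formation_team", "stage"])
```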

    5 - Model Building:

    This is where the magic happens. We instantiate various machine learning models, perform hyperparameter optimization using random search, and select the best features using recursive feature elimination. Cross-validation is used to compare models, and the best combinations are selected for predicting outcomes and goals.
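
    A compact scikit-learn sketch of how random search, recursive feature elimination and cross-validation can be combined. The random forest, the parameter grid and the synthetic data are placeholders chosen for illustration, not the models or settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 120 matches, 8 features, 3 outcome classes
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = rng.integers(0, 3, size=120)

# Feature selection and the final model live in one pipeline,
# so both are tuned together and refitted inside every fold
pipe = Pipeline([
    ("rfe", RFE(RandomForestClassifier(n_estimators=20, random_state=0))),
    ("model", RandomForestClassifier(random_state=0)),
])

param_distributions = {
    "rfe__n_features_to_select": [3, 5, 8],
    "model__n_estimators": [20, 50],
    "model__max_depth": [2, 4, None],
}

# Random search: try 4 random parameter combinations with 3-fold CV
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=4, cv=3, random_state=0
)
search.fit(X, y)

# Cross-validated accuracy of the best combination, for comparing models
scores = cross_val_score(search.best_estimator_, X, y, cv=3)
```

    Keeping the feature selection inside the pipeline avoids selecting features on data that is later used for validation.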

    6 - Model Application:

    In this notebook, we apply the trained models to predict match outcomes and scores for upcoming games. It also updates the dataset with predicted values to simulate the progression of the tournament.

    Tournament Workflow:

    If you want to simulate the whole tournament, the next step depends on the matchday that follows the one you just predicted. If it is a group matchday, continue from the results with script 2. If it is a K.O. match, start at script 1b and enter some group stage data; the top of script 1b explains how. Once you have predicted at least one matchday, I recommend skipping script 5 and reusing the same model for the following rounds, because model building can take a few hours.

    I am excited to share this journey with you and look forward to any feedback or insights you might have. Feel free to explore each notebook and see how the predictions unfold as the tournament progresses.

    Thank you for your interest and enjoy the EURO 2024!

    1) Gathering Data

    Data Scraping from Transfermarkt

    In this part of the code, we scrape the key data of all matches played by UEFA national teams from 2021 until now from transfermarkt.com. We use BeautifulSoup and requests for that.

    First, we define the scraping function. It is important to send the right header, since some websites are quite picky about who they allow to access and scrape their data. The one I am using here works for transfermarkt.com.

    Gathering the correct containers can be quite tricky and involved a lot of trial and error for me. First, we need to find the location of the elements of interest by inspecting the website. After defining the page, pageTree and pageSoup, we need to find the class of the table and loop through those tables. Finally, we can loop through the table rows and append the information of interest to a new row in a newly defined list, that then gets transformed into a DataFrame.
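
    The loop structure described above can be sketched as follows. The HTML snippet, the class name and the header value are simplified assumptions for illustration; the real page layout and the header used in the notebook differ and have to be found by inspecting the website.

```python
from bs4 import BeautifulSoup

# A browser-like header of the kind sent with the request (assumed value).
# Download step, not run here:
#   page = requests.get(url, headers=HEADERS)
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Hypothetical, simplified stand-in for the downloaded page content
SAMPLE_HTML = """
<div class="responsive-table">
  <table>
    <tr><th>Date</th><th>Opponent</th><th>Result</th></tr>
    <tr><td>2021-06-15</td><td>France</td><td>0:1</td></tr>
    <tr><td>2021-06-19</td><td>Portugal</td><td>4:2</td></tr>
  </table>
</div>
"""


def parse_matches(html: str, team: str) -> list[dict]:
    """Collect one dict per match row; the class name is an assumption."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Outer loop: every table container; inner loop: every table row
    for table in soup.find_all("div", class_="responsive-table"):
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) != 3:      # skip header rows and malformed rows
                continue
            date, opponent, result = cells
            rows.append({"team": team, "date": date,
                         "opponent": opponent, "result": result})
    return rows


matches = parse_matches(SAMPLE_HTML, "Germany")
```

    The resulting list of dicts can be turned into a DataFrame with pandas.DataFrame(matches).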

    Next, we will apply the function to all UEFA teams. For this, we will define the links of each of the UEFA national teams' match tables and the years to scrape (note that transfermarkt for some reason takes the previous year for the calendar year of interest), loop through them, and put the results into a dataframe.

    Every row now represents one match from the perspective of one of the teams. In the columns, we can see the team name, its formation and coach, along with information on matchday, date, time, opponent and, of course, the match outcome and result. Later, we want to include both teams' formations as variables in our model. To get the opponent's formation as well, we later have to merge the same df with itself on the 'opponent' column. This only works if the opponent's row exists in the data, so we also need to know which teams outside of Europe the European teams played against from 2021 to 2024, and scrape the match data of those teams too. After merging them with the original dataframe, the matches not involving any European team can be dropped.
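
    A small pandas sketch of the self-merge, assuming each match is uniquely identified by team, opponent and date. The rows, the 'formation' column name and the list of European teams are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: one row per match from one team's perspective
df = pd.DataFrame({
    "team_name": ["Germany", "France", "Spain", "Brazil", "Brazil", "Argentina"],
    "opponent": ["France", "Germany", "Brazil", "Spain", "Argentina", "Brazil"],
    "date": ["2021-06-15", "2021-06-15", "2022-03-01",
             "2022-03-01", "2022-06-01", "2022-06-01"],
    "formation": ["3-4-3", "4-3-3", "4-3-3", "4-2-3-1", "4-2-3-1", "4-4-2"],
})

# The opponent's view of the same match: its team_name is our opponent
opp = df[["team_name", "date", "formation"]].rename(
    columns={"team_name": "opponent", "formation": "formation_opponent"}
)

merged = df.rename(columns={"formation": "formation_team"}).merge(
    opp, on=["opponent", "date"], how="left"
)

# Drop matches that do not involve any European team (assumed team list)
european = {"Germany", "France", "Spain"}
keep = merged["team_name"].isin(european) | merged["opponent"].isin(european)
merged = merged[keep]
```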

    Finally, in some initial data cleaning, we will remove the matches with 'TBD' as 'team', because they seem to be taken from a graphics header on the website that the scraping function mistook for a match. Moreover, we will remove the matches taking place after the EURO ended.

    Before saving the dataset, we also add the unplayed K.O. matches of the EURO alongside a new ID. This will be important for applying the prediction workflow for the whole tournament later. To avoid having unplayed matches displayed as draws, we declare all outcomes after the day of data scraping as missing values. Finally, we can save the dataframe as a csv file, so we can continue with data cleaning in the following notebook.
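
    Declaring the outcomes of future matches as missing could look like this. The rows, the placeholder result and the scraping date are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; an unplayed match may be scraped with a placeholder result
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-14", "2024-06-29", "2024-07-14"]),
    "outcome": ["win", "draw", "draw"],
    "result": ["5:1", "-:-", "-:-"],
})

scrape_date = pd.Timestamp("2024-06-20")   # assumed day of data scraping

# Anything dated after the scrape cannot have been played yet,
# so its outcome and result are set to missing instead of "draw"
unplayed = df["date"] > scrape_date
df.loc[unplayed, ["outcome", "result"]] = pd.NA

played_df = df[~unplayed]
unplayed_df = df[unplayed]

# Saving step (file name is hypothetical):
#   df.to_csv("matches_raw.csv", index=False)
```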

    You can see the result of the freshly scraped and still quite raw data here:


    Apply tournament logic: K.O. Matches

    This part of the notebook is only run when predicting K.O. rounds in a simulation of the whole tournament, where each round is fed with the prediction results of the previous rounds. For this, the order of the group stage tables as well as the group letters of the best third-placed teams have to be entered manually. The tournament format used since 2016 makes this a little more complicated.

    Transfermarkt.com, my data source, does not show K.O. matches for which the teams are not yet decided. So after defining the tables and the names of the groups with the best third-placed teams, the code derives the team and opponent for every match in the round of 16. If the matches are already in the data, we just add the correct KO_id to each match according to its date and time.

    For every further round, we check whether the team and opponent are already filled in and, if not, derive them from the winners of the corresponding matches of the previous round.
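
    The idea of filling later rounds from earlier winners can be sketched with plain dictionaries. The match ids, the bracket mapping and the results below are hypothetical and do not reproduce the actual KO_id scheme of the notebook.

```python
# Hypothetical KO ids: matches 1 and 2 are semi-finals, match 3 is the final
ko_matches = {
    1: {"team": "Spain", "opponent": "France", "winner": "Spain"},
    2: {"team": "Netherlands", "opponent": "England", "winner": "England"},
    3: {"team": None, "opponent": None, "winner": None},
}

# Which earlier matches feed the two slots of a later match (assumed bracket)
feeds = {3: (1, 2)}


def fill_next_round(matches: dict, feeds: dict) -> None:
    """Fill empty team/opponent slots from the winners of the feeder matches."""
    for ko_id, (team_src, opp_src) in feeds.items():
        match = matches[ko_id]
        if match["team"] is None:          # only fill slots that are still empty
            match["team"] = matches[team_src]["winner"]
        if match["opponent"] is None:
            match["opponent"] = matches[opp_src]["winner"]


fill_next_round(ko_matches, feeds)
```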


    2) Feature Engineering

    Deriving features from scraped data

    In this part, we will engineer a bunch of features to be used for our predictive models. They include:

    • team and opponent level, which determines the favorite status in each match. As a simple and objective measure, it is derived from each team's UEFA Nations League level, with matching FIFA World Ranking ranges for non-European teams
    • both teams' tactical formations and number of strikers/defenders
    • the two target features of scored and conceded goals derived from the match result string
    • dummy features to show if any given match went to extra time and/or penalties
    • the feature stage, divided into the categories friendlies, group stage and K.O. matches
    • features for both the short-term form (average scored/conceded goals in the last 5 matches) and the long-term trend (number of days since 2021-01-01)
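
    A few of these features can be sketched in pandas as follows. The rows are made up, the result string is assumed to be in "own:opponent" order, and the short-term window is assumed to cover the previous five matches; real result strings with extra time or penalty markers would need additional parsing.

```python
import pandas as pd

# Hypothetical matches of one team, oldest first
df = pd.DataFrame({
    "team_name": ["Germany"] * 4,
    "date": pd.to_datetime(["2021-03-25", "2021-03-28",
                            "2021-03-31", "2021-06-02"]),
    "result": ["3:0", "1:0", "1:2", "1:1"],
})

# Target features: scored and conceded goals from the result string
goals = df["result"].str.split(":", expand=True).astype(int)
df["goals_team"] = goals[0]
df["goals_opp"] = goals[1]

# Short-term form: mean goals over up to 5 previous matches,
# shifted by one so a match never sees its own result
grouped = df.sort_values("date").groupby("team_name")
df["team_form_att"] = grouped["goals_team"].transform(
    lambda s: s.shift(1).rolling(5, min_periods=1).mean()
)

# Long-term trend: number of days since 2021-01-01
df["trend"] = (df["date"] - pd.Timestamp("2021-01-01")).dt.days
```

    The shift by one match matters: without it, the form feature would leak the result that the model is supposed to predict.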

    You can see a selection of all columns of the resulting dataframe here:

    df_form.loc[df_form['date'] > '2021-06-01', ['team_name', 'opponent', 'goals_team', 'goals_opp', 'outcome', 'stage', 'formation_team', 'formation_opponent', 'def_team', 'att_team', 'def_opponent', 'att_opponent', 'level_team', 'level_opponent', 'level_diff', 'team_form_att_log', 'team_form_def_log', 'opp_form_att_log', 'opp_form_def_log', 'trend']]

    Adding Playstyle Features

    To add some features representing each team's style of play, we use data from a different source (footystats.com). For every team in the database, we determine the mean ball possession, efficiency (defined as goals divided by own expected goals) and vulnerability (defined as conceded goals divided by opponent's expected goals).
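
    The three playstyle variables could be computed along these lines. The column names and numbers are invented, and taking the ratio of summed goals to summed expected goals (rather than averaging per-match ratios) is an assumption.

```python
import pandas as pd

# Hypothetical per-match rows in the spirit of the footystats data
matches = pd.DataFrame({
    "team": ["Spain", "Spain", "Italy", "Italy"],
    "possession": [68.0, 62.0, 55.0, 49.0],
    "goals": [3, 1, 1, 1],
    "xg": [2.5, 1.5, 1.0, 1.5],
    "goals_conceded": [0, 1, 1, 2],
    "xg_against": [0.5, 1.5, 1.0, 1.0],
})

# One row per team: mean possession, summed goals and expected goals
totals = matches.groupby("team").agg(
    possession=("possession", "mean"),
    goals=("goals", "sum"),
    xg=("xg", "sum"),
    goals_conceded=("goals_conceded", "sum"),
    xg_against=("xg_against", "sum"),
)

# Efficiency: goals per own expected goal
totals["efficiency"] = totals["goals"] / totals["xg"]
# Vulnerability: conceded goals per opponent expected goal
totals["vulnerability"] = totals["goals_conceded"] / totals["xg_against"]
```

    Values above 1 mean a team scores (or concedes) more than the expected goals suggest.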

    We create clusters based on two different variable combinations, which are both fitted to European teams only and then applied to all teams in the dataset:

    1. average ball possession vs. own expected goals