EURO Prediction Project
  • Scraping international match data (2021-2024)

    In this workbook we scrape the key data of the matches played by all UEFA national teams from 2021 until now from transfermarkt.com. First, we import the required packages, in particular BeautifulSoup and requests, both of which we need to scrape the data.

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import time
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime, timedelta
    import matplotlib.pyplot as plt
    %matplotlib inline

    Scraping function

    Now we define the scraping function.

    For this, it is important to send the right header, since some websites can be quite picky about who they allow to access or scrape their data. The User-Agent used here was suggested to me and worked when I tested it.

    Finding the correct containers can be quite tricky and involved a lot of trial and error for me. First, we need to locate the elements of interest by inspecting the website. After defining page, pageTree and pageSoup, we find the tables by the class of their container and loop through them. Finally, we loop through the table rows and append the information of interest as a new entry to a list, which is later transformed into a DataFrame.

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    
    def crawl_matches(team_url, year):
        # Uses the module-level headers defined above
    
        # Construct the URL for the specific team and year;
        # rstrip avoids a double slash when team_url already ends in "/"
        page = f"{team_url.rstrip('/')}/saison_id/{year}/plus/1"
        pageTree = requests.get(page, headers=headers)
        # Raise on HTTP errors (e.g. 403/404) so a failed request is not parsed as an empty page
        pageTree.raise_for_status()
        pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    
        # Locate the data and extract it
        all_matches_data = []
    
        # Find all matches containers
        matches_containers = pageSoup.find_all("div", {"class": "responsive-table"})
        for container in matches_containers:
            matches_table = container.find("table")
            if matches_table:
                match_rows = matches_table.find("tbody").find_all("tr") if matches_table.find("tbody") else []
    
                for row in match_rows:
                    cells = row.find_all("td")
    
                    # Rows with more than 10 cells carry a leading stage column;
                    # all other rows are labelled as friendlies
                    has_stage = len(cells) > 10
                    stage = cells[0].text.strip() if has_stage else "Friendly"
                    date = cells[1 if has_stage else 0].text.strip()
                    time = cells[2 if has_stage else 1].text.strip()
    
                    formation_index = 7 if has_stage else 6
                    formation = cells[formation_index].text.strip() or "TBD"
                    
                    # Find the element for the team
                    team_cell = row.find("td", {"class": "no-border-links hauptlink"})
                    if team_cell and team_cell.find("a", title=True):
                        team = team_cell.find("a").get("title")
                    else:
                        team = "TBD"
                    
                    # Find the element for the opponent
                    opponent_cell = row.find("td", {"class": "no-border-links 1"})
                    if opponent_cell and opponent_cell.find("a", title=True):
                        opponent = opponent_cell.find("a").get("title")
                    else:
                        opponent = "TBD" # if not found
    
                    coach_index = -3 
                    coach = cells[coach_index].text.strip() if cells[coach_index].text.strip() else "TBD"
        
                    result_cell = cells[-1] if cells else None
                    result_text = "N/A"
                    match_outcome = "Draw"
    
                    if result_cell:
                        result_link = result_cell.find('a', {'class': 'ergebnis-link'})
                        if result_link:
                            result_span = result_link.find('span')
                            if result_span:
                                result_text = result_span.text.strip()
                                # .get avoids a KeyError when the span has no class attribute
                                span_classes = result_span.get('class') or []
                                if "greentext" in span_classes:
                                    match_outcome = "Win"
                                elif "redtext" in span_classes:
                                    match_outcome = "Loss"
                            else:
                                result_text = "TBD"
                                match_outcome = "TBD"
    
                    all_matches_data.append({
                        "stage": stage,
                        "date": date,
                        "time": time,
                        "team": team,
                        "opponent": opponent,
                        "formation": formation,
                        "coach": coach,
                        "outcome": match_outcome,
                        "result": result_text
                    })
    
        return all_matches_data
    
    # Example usage
    team_url = "https://www.transfermarkt.com/poland/spielplan/verein/3442/"
    year = "2023"
    matches = crawl_matches(team_url, year)
    for match in matches:
        print(match)
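
    The container-and-row traversal used in crawl_matches can also be tried offline before touching the live site. The sketch below runs the same find_all / tbody / tr pattern on a small hand-written HTML fragment. The fragment, the team names in it and the helper name extract_rows are made up for illustration and are far simpler than the real transfermarkt markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for a transfermarkt match table
SAMPLE_HTML = """
<div class="responsive-table">
  <table>
    <thead><tr><th>Date</th><th>Opponent</th><th>Result</th></tr></thead>
    <tbody>
      <tr><td>Jan 1, 2024</td><td>Opponent A</td>
          <td><a class="ergebnis-link"><span class="greentext">5:1</span></a></td></tr>
      <tr><td>Jan 5, 2024</td><td>Opponent B</td>
          <td><a class="ergebnis-link"><span>0:0</span></a></td></tr>
    </tbody>
  </table>
</div>
"""

def extract_rows(html):
    """Return the stripped text of every cell, one list per table body row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for container in soup.find_all("div", {"class": "responsive-table"}):
        table = container.find("table")
        body = table.find("tbody") if table else None
        if body is None:
            continue  # container without a usable table
        for tr in body.find_all("tr"):
            rows.append([td.text.strip() for td in tr.find_all("td")])
    return rows

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

    Because only tbody rows are read, the header row in thead never ends up in the result.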

    Apply the function in a loop

    Next, we apply the function, which seems to be working. To do so, we define the links to each UEFA national team's match table and the years to scrape, and loop through them. Note that transfermarkt, for some reason, files a calendar year under the previous year, so the years "2020" to "2023" cover the matches of 2021 to 2024.

    After the data is collected, we will put it into a dataframe.
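
    The year offset is easy to get wrong, so it can help to derive the values for the years list from the calendar years instead of typing them by hand. A minimal sketch, assuming the offset is always exactly one year; the helper names saison_id_for and build_page_url are illustrative and not part of the original workbook.

```python
def saison_id_for(calendar_year):
    """Map a calendar year to the saison_id value, assuming a one-year offset."""
    return str(calendar_year - 1)

def build_page_url(team_url, saison_id):
    """Build the match-table URL following the pattern crawl_matches uses."""
    # rstrip guards against a double slash when the base URL ends in "/"
    return f"{team_url.rstrip('/')}/saison_id/{saison_id}/plus/1"

calendar_years = range(2021, 2025)  # 2021 to 2024 inclusive
derived_years = [saison_id_for(y) for y in calendar_years]
print(derived_years)

example_url = build_page_url(
    "https://www.transfermarkt.com/polen/spielplan/verein/3442", derived_years[-1]
)
print(example_url)
```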

    team_urls = ["https://www.transfermarkt.com/polen/spielplan/verein/3442",
                "https://www.transfermarkt.com/ukraine/spielplan/verein/3699",
                "https://www.transfermarkt.com/georgien/spielplan/verein/3669",
                "https://www.transfermarkt.com/germany/spielplan/verein/3262",
                "https://www.transfermarkt.com/spanien/spielplan/verein/3375/", 
                "https://www.transfermarkt.com/schottland/spielplan/verein/3380", 
                "https://www.transfermarkt.com/frankreich/spielplan/verein/3377",
                "https://www.transfermarkt.com/niederlande/spielplan/verein/3379",
                "https://www.transfermarkt.com/england/spielplan/verein/3299",
                "https://www.transfermarkt.com/italien/spielplan/verein/3376",
                "https://www.transfermarkt.com/turkei/spielplan/verein/3381",
                "https://www.transfermarkt.com/kroatien/spielplan/verein/3556",
                "https://www.transfermarkt.com/albanien/spielplan/verein/3561",
                "https://www.transfermarkt.com/tschechien/spielplan/verein/3445",
                "https://www.transfermarkt.com/belgien/spielplan/verein/3382",
                "https://www.transfermarkt.com/osterreich/spielplan/verein/3383",
                "https://www.transfermarkt.com/ungarn/spielplan/verein/3468",
                "https://www.transfermarkt.com/serbien/spielplan/verein/3438",
                "https://www.transfermarkt.com/danemark/spielplan/verein/3436",
                "https://www.transfermarkt.com/slowenien/spielplan/verein/3588",
                "https://www.transfermarkt.com/rumanien/spielplan/verein/3447",
                "https://www.transfermarkt.com/schweiz/spielplan/verein/3384",
                "https://www.transfermarkt.com/portugal/spielplan/verein/3300",
                "https://www.transfermarkt.com/slowakei/spielplan/verein/3503",
                "https://www.transfermarkt.com/wales/spielplan/verein/3864",
                "https://www.transfermarkt.com/island/spielplan/verein/3574",
                "https://www.transfermarkt.com/griechenland/spielplan/verein/3378",
                "https://www.transfermarkt.com/luxemburg/spielplan/verein/3580", 
                "https://www.transfermarkt.com/estland/spielplan/verein/6133",
                "https://www.transfermarkt.com/finnland/spielplan/verein/3443",
                "https://www.transfermarkt.com/bosnien-herzegowina/spielplan/verein/3446",
                "https://www.transfermarkt.com/israel/spielplan/verein/5547",
                "https://www.transfermarkt.com/kasachstan/spielplan/verein/9110",
                "https://www.transfermarkt.com/norwegen/spielplan/verein/3440",
                "https://www.transfermarkt.com/zypern/spielplan/verein/3668",
                "https://www.transfermarkt.com/irland/spielplan/verein/3509",
                "https://www.transfermarkt.com/gibraltar/spielplan/verein/37574",
                "https://www.transfermarkt.com/nordmazedonien/spielplan/verein/5148",
                "https://www.transfermarkt.com/malta/spielplan/verein/3587",
                "https://www.transfermarkt.com/armenien/spielplan/verein/6219",
                "https://www.transfermarkt.com/lettland/spielplan/verein/3555",
                "https://www.transfermarkt.com/belarus/spielplan/verein/3450",
                "https://www.transfermarkt.com/kosovo/spielplan/verein/53982",
                "https://www.transfermarkt.com/andorra/spielplan/verein/10533",
                "https://www.transfermarkt.com/liechtenstein/spielplan/verein/5673",
                "https://www.transfermarkt.com/moldawien/spielplan/verein/6090",
                "https://www.transfermarkt.com/faroer/spielplan/verein/9173",
                "https://www.transfermarkt.com/schweden/spielplan/verein/3557",
                "https://www.transfermarkt.com/aserbaidschan/spielplan/verein/8605",
                "https://www.transfermarkt.com/montenegro/spielplan/verein/11953",
                "https://www.transfermarkt.com/litauen/spielplan/verein/3851",
                "https://www.transfermarkt.com/bulgarien/spielplan/verein/3394",
                "https://www.transfermarkt.com/nordirland/spielplan/verein/5674",
                "https://www.transfermarkt.com/san-marino/spielplan/verein/10521",
                "https://www.transfermarkt.com/russland/spielplan/verein/3448/"
                ]
    
    years = ["2020", "2021", "2022", "2023"]
    # Initialize lists to keep track of successes, errors, and to store all match data
    
    successes = []
    errors = []
    all_matches_data = []  # List to store the results from all successful crawl_matches calls
    
    for url in team_urls:
        for year in years:
            try:
                # Attempt to crawl matches for the given URL and year
                matches_data = crawl_matches(url, year)  # Store returned matches data from the function call
                
                # If successful, append the data to all_matches_data
                all_matches_data.extend(matches_data)  # add elements of matches_data to all_matches_data
                
                # Record the success
                successes.append((url, year))
                print(f"Success: {url} for year {year}")
            except Exception as e:
                # If an error occurs, record the error with its message
                errors.append((url, year, str(e)))
                print(f"Error processing {url} for year {year}: {e}")
            
            # Sleep for 2 seconds between requests to respect crawl-delay
            time.sleep(2)
    
    # Print out the results
    print("\nSuccesses:")
    for success in successes:
        print(f"- {success[0]} in year {success[1]}")
    
    print("\nErrors:")
    for error in errors:
        print(f"- {error[0]} in year {error[1]}: {error[2]}")
    
    # Optionally, print or process all_matches_data as needed
    # For example, to see the count of all collected match data:
    print(f"\nTotal matches collected: {len(all_matches_data)}")
    # combine into dataframe  
    df = pd.DataFrame(all_matches_data) 
    df.set_index("team", inplace=True) 
    df

    Getting further data for transformations

    Every row now stands for one match from the perspective of one of the teams. The index holds the team name, and the columns show its formation and coach, along with the stage, date, time, opponent and, of course, the match outcome and result.

    Later, we want to include both teams' formations as variables in our model. To get the opponent's formation as well, we will merge df with itself on the 'opponent' column. This only works if every opponent has rows of its own, so we need to know which teams outside of Europe the European teams played against from 2021 to 2024, and scrape the match data of those teams too. After merging them with the original dataframe, the matches not involving any European team can be dropped.
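
    The self-merge could look roughly like the sketch below. It uses a tiny hand-made frame instead of the scraped data and assumes that a team plays at most one match per date, so that opponent and date together identify the matching row. The column name opponent_formation is also an assumption.

```python
import pandas as pd

# Toy data: each match appears twice, once from each team's perspective
toy = pd.DataFrame({
    "team":      ["A", "B", "A", "C"],
    "opponent":  ["B", "A", "C", "A"],
    "date":      ["Jan 1, 2024", "Jan 1, 2024", "Jan 5, 2024", "Jan 5, 2024"],
    "formation": ["4-3-3", "4-4-2", "3-5-2", "4-2-3-1"],
})

# The same data seen from the other side: their "team" is our "opponent"
opponent_view = toy[["team", "date", "formation"]].rename(
    columns={"team": "opponent", "formation": "opponent_formation"}
)

# A left merge keeps every original row, even if the opponent's row is missing
merged = toy.merge(opponent_view, on=["opponent", "date"], how="left")
print(merged)
```

    With the scraped df, where team is the index, a df.reset_index() would be needed first so that team is available as a column.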

    We define two lists: euro_teams and rest_europe, and then create a filtered list of opponents that are in neither list. After defining the links of the non-European teams, we can apply the same scraping function to their links.

    ### lists of EURO participants and other European teams
    
    euro_teams = ["Germany", "Spain", "Scotland", "France", "Netherlands", "England", "Italy", "Türkiye", "Croatia", "Albania", "Czech Republic", "Belgium", "Austria", "Hungary", "Serbia", "Denmark", "Slovenia", "Romania", "Switzerland", "Portugal", "Slovakia", "Georgia", "Ukraine", "Poland"]
    
    rest_europe = ["Wales", "Iceland", "Greece", "Luxembourg", "Estonia", "Finland", "Bosnia-Herzegovina", "Israel", "Kazakhstan", "Norway", "Cyprus", "Republic of Ireland",
    "Gibraltar", "North Macedonia", "Malta", "Armenia", "Latvia", "Belarus", "Kosovo", "Andorra", "Liechtenstein", "Moldova", "Faroe Islands",
    "Sweden", "Azerbaijan", "Montenegro", "Lithuania", "Bulgaria", "Northern Ireland", "San Marino", "Russia"]
    
    ### get list of opponents outside of Europe
    
    opponents = df['opponent'].unique().tolist()
    filtered_opponents = [team for team in opponents if team not in euro_teams and team not in rest_europe]
    filtered_opponents
    
    ### links to the match tables of the non-European opponents
    team_urls = ["https://www.transfermarkt.com/mexiko/spielplan/verein/6303",
                 "https://www.transfermarkt.com/saudi-arabien/spielplan/verein/3807",
                 "https://www.transfermarkt.com/argentinien/spielplan/verein/3437",
                 "https://www.transfermarkt.com/chile/spielplan/verein/3700",
                 "https://www.transfermarkt.com/bahrain/spielplan/verein/7214",
                 "https://www.transfermarkt.com/usbekistan/spielplan/verein/3563",
                 "https://www.transfermarkt.com/marokko/spielplan/verein/3575",
                 "https://www.transfermarkt.com/mongolei/spielplan/verein/15739",
                 "https://www.transfermarkt.com/thailand/spielplan/verein/5676",
                 "https://www.transfermarkt.com/japan/spielplan/verein/3435",
                 "https://www.transfermarkt.com/costa-rica/spielplan/verein/8497",
                 "https://www.transfermarkt.com/oman/spielplan/verein/14165",
                 "https://www.transfermarkt.com/peru/spielplan/verein/3584",
                 "https://www.transfermarkt.com/kolumbien/spielplan/verein/3816",
                 "https://www.transfermarkt.com/vereinigte-staaten/spielplan/verein/3505",
                 "https://www.transfermarkt.com/jordanien/spielplan/verein/15737",
                 "https://www.transfermarkt.com/brasilien/spielplan/verein/3439",
                 "https://www.transfermarkt.com/australien/spielplan/verein/3433",
                 "https://www.transfermarkt.com/tunesien/spielplan/verein/3670",
                 "https://www.transfermarkt.com/elfenbeinkuste/spielplan/verein/3591",
                 "https://www.transfermarkt.com/sudafrika/spielplan/verein/3806",
                 "https://www.transfermarkt.com/senegal/spielplan/verein/3499",
                 "https://www.transfermarkt.com/ecuador/spielplan/verein/5750",
                 "https://www.transfermarkt.com/katar/spielplan/verein/14162",
                 "https://www.transfermarkt.com/kanada/spielplan/verein/3510",
                 "https://www.transfermarkt.com/iran/spielplan/verein/3582",
                 "https://www.transfermarkt.com/venezuela/spielplan/verein/3504",
                 "https://www.transfermarkt.com/guinea/spielplan/verein/3856",
                 "https://www.transfermarkt.com/agypten/spielplan/verein/3672",
                 "https://www.transfermarkt.com/kuwait/spielplan/verein/3432",
                 "https://www.transfermarkt.com/burkina-faso/spielplan/verein/5872",
                 "https://www.transfermarkt.com/dominikanische-republik/spielplan/verein/15232",
                 "https://www.transfermarkt.com/panama/spielplan/verein/3577",
                 "https://www.transfermarkt.com/jamaika/spielplan/verein/3671",
                 "https://www.transfermarkt.com/kamerun/spielplan/verein/3434",
                 "https://www.transfermarkt.com/ghana/spielplan/verein/3441",
                 "https://www.transfermarkt.com/uruguay/spielplan/verein/3449",
                 "https://www.transfermarkt.com/sudkorea/spielplan/verein/3589",
                 "https://www.transfermarkt.com/nigeria/spielplan/verein/3444",
                 "https://www.transfermarkt.com/uganda/spielplan/verein/13497",
                 "https://www.transfermarkt.com/guatemala/spielplan/verein/13342",
                 "https://www.transfermarkt.com/honduras/spielplan/verein/3590",
                 "https://www.transfermarkt.com/neuseeland/spielplan/verein/9171",
                 "https://www.transfermarkt.com/sambia/spielplan/verein/3703",
                 "https://www.transfermarkt.com/tadschikistan/spielplan/verein/13975",
                 "https://www.transfermarkt.com/vereinigte-arabische-emirate/spielplan/verein/5147",
                 "https://www.transfermarkt.com/turkmenistan/spielplan/verein/14248",
                 "https://www.transfermarkt.com/grenada/spielplan/verein/14175",
                 "https://www.transfermarkt.com/indien/spielplan/verein/13957",
                 "https://www.transfermarkt.com/syrien/spielplan/verein/13674",
                 "https://www.transfermarkt.com/gambia/spielplan/verein/6186",
                 "https://www.transfermarkt.com/st-kitts-und-nevis/spielplan/verein/17760",
                 "https://www.transfermarkt.com/bolivien/spielplan/verein/5233",
                 "https://www.transfermarkt.com/kap-verde/spielplan/verein/4311",
                 "https://www.transfermarkt.com/cayman-inseln/spielplan/verein/17751",
                 "https://www.transfermarkt.com/algerien/spielplan/verein/3614",
                 "https://www.transfermarkt.com/libanon/spielplan/verein/3586",
                 "https://www.transfermarkt.com/tansania/spielplan/verein/14666",
                 "https://www.transfermarkt.com/italien-u20/spielplan/verein/21100",
                 "https://www.transfermarkt.com/seychellen/spielplan/verein/3562",
                 "https://www.transfermarkt.com/st-lucia/spielplan/verein/17761",
                 "https://www.transfermarkt.com/kirgisistan/spielplan/verein/3956",
                 "https://www.transfermarkt.com/irak/spielplan/verein/3560",
                 "https://www.transfermarkt.com/kenia/spielplan/verein/8987",
                 "https://www.transfermarkt.com/kuba/spielplan/verein/3808"]
    # Initialize lists to keep track of successes, errors, and to store all match data
    
    successes = []
    errors = []
    all_matches_data = []  # List to store the results from all successful crawl_matches calls
    
    for url in team_urls:
        for year in years:
            try:
                # Attempt to crawl matches for the given URL and year
                matches_data = crawl_matches(url, year)  # Store returned matches data from the function call
                
                # If successful, append the data to all_matches_data
                all_matches_data.extend(matches_data)  # add elements of matches_data to all_matches_data
                
                # Record the success
                successes.append((url, year))
                print(f"Success: {url} for year {year}")
            except Exception as e:
                # If an error occurs, record the error with its message
                errors.append((url, year, str(e)))
                print(f"Error processing {url} for year {year}: {e}")
            
            # Sleep for 2 seconds between requests to respect crawl-delay
            time.sleep(2)
    
    # Print out the results
    print("\nSuccesses:")
    for success in successes:
        print(f"- {success[0]} in year {success[1]}")
    
    print("\nErrors:")
    for error in errors:
        print(f"- {error[0]} in year {error[1]}: {error[2]}")
    
    # Optionally, print or process all_matches_data as needed
    # For example, to see the count of all collected match data:
    print(f"\nTotal matches collected: {len(all_matches_data)}")
    # combine into dataframe  
    df2 = pd.DataFrame(all_matches_data) 
    df2.set_index("team", inplace=True) 
    matches = pd.concat([df, df2])
    matches
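
    The second loop repeats the first one line for line, so both could share one helper. The sketch below is one way to do that. crawl_all and its parameters are hypothetical names, and the crawl function is passed in as an argument so that the loop can be tried with a stand-in instead of real requests.

```python
import time

def crawl_all(team_urls, years, crawl_fn, delay=2.0):
    """Call crawl_fn(url, year) for every combination and collect the results."""
    successes, errors, collected = [], [], []
    for url in team_urls:
        for year in years:
            try:
                collected.extend(crawl_fn(url, year))
                successes.append((url, year))
            except Exception as e:
                # Keep going after a failure, but remember what went wrong
                errors.append((url, year, str(e)))
            # Pause between requests, as in the loops above
            time.sleep(delay)
    return collected, successes, errors

# Stand-in for crawl_matches that needs no network access
def fake_crawl(url, year):
    if year == "2021":
        raise ValueError("simulated failure")
    return [{"team": url, "year": year}]

data, ok, failed = crawl_all(["team-a", "team-b"], ["2020", "2021"], fake_crawl, delay=0)
print(len(data), len(ok), len(failed))
```

    With the real scraper, the call would be crawl_all(team_urls, years, crawl_matches), once for each list of links.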

    Initial data cleaning

    We will remove the rows with 'TBD' as 'team', because they appear to come from a graphics header on the website that the scraping function mistook for a match. We will also remove the matches scheduled after the end of the EURO.