Introduction

Predicting football match outcomes has long intrigued statisticians and sports analysts. The inherently random nature of sports events, coupled with various influencing factors such as team strength, player performance, and home advantage, presents a challenging yet fascinating problem for statistical modeling.

In previous projects, I have explored goal-based models such as those proposed by Maher and by Dixon and Coles. Maher’s model describes the outcome of a match by modeling the goals of the two teams independently using Poisson distributions. Dixon and Coles enhanced this basic model by introducing correlation between the teams’ goal-scoring processes and accounting for the home advantage effect. While these models have been widely used due to their simplicity and effectiveness, they rely on the assumption that goals follow a Poisson distribution. This assumption, however, can be questionable in certain leagues, where overdispersion (the sample variance exceeding the sample mean) has been observed in the number of goals.

To address this limitation, Karlis and Ntzoufras proposed an alternative approach by focusing on the goal difference rather than the individual goal counts of each team. Their method utilizes the Skellam distribution, which eliminates the need to account for the correlation between teams’ scores directly and avoids the assumption of Poisson marginals. Although the Skellam-based model cannot predict exact match scores, it offers valuable insights into the likely goal difference, thereby simplifying the complexity associated with modeling individual goal counts. The application of Bayesian methods in this context allows for the incorporation of prior knowledge and continuous updating of model parameters as new data becomes available, enhancing the predictive power and adaptability of the model.


Project structure

📂 BayesianExam/
│
├── 📂 data/
│   └── 📂 season_2122/
│       ├── 📊 Bundesliga1_2122.csv
│       ├── 📊 Bundesliga2_2122.csv
│       ├── 📊 Championship_2122.csv
│       ├── 📊 LaLiga1_2122.csv
│       ├── 📊 LaLiga2_2122.csv
│       ├── 📊 Ligue1_2122.csv
│       ├── 📊 Ligue2_2122.csv
│       ├── 📊 PremierLeague_2122.csv
│       ├── 📊 SerieA_2122.csv
│       └── 📊 SerieB_2122.csv
│
├── 📂 estimated_models/
│   └── 📂 season_2122/
│       ├── 📂 offline_models/
│       │   ├── 📂 matchday19/
│       │   ├── ...
│       │   └── 📂 matchday38/
│       │
│       └── 📂 online_models/
│           ├── 📂 matchday19/
│           ├── ...
│           └── 📂 matchday38/
│
├── 📂 report/
│
├── 📂 stan/
│   ├── ⚙️ karlis-ntzoufras.stan
│   └── ⚙️ online.stan
│
├── 📂 utils/
│   ├── HDA_probabilities.R
│   ├── final_pts_distribution.R
│   ├── get_all_teams_data.R
│   ├── plot_confusionmatrix.R
│   └── plot_parameters_ts.R
│
├── 📂 web-scraping/
│   └── 🔍 scraper.py
│
├── BrierScore.R
├── ConfusionMatrix.R
├── DataAnalysis.R
├── DynamicEstimation.R
├── ExploratoryDataAnalysis.R
├── FinalRankingSimulation.R
├── ModelChecking.R
├── OnlineLearning.R
└── Predictions.R

This report presents the findings of a comprehensive analysis of Serie A football league data for the 2021-2022 season. Given the inherent complexity of the analysis at hand, my project is structured into several directories, each serving a specific purpose in the data collection, analysis, and modeling process.

  1. data/ is organized in season-specific subfolders containing the raw data collected from football-data.co.uk.

  2. web-scraping/ contains a Python web scraper (scraper.py) to automatically retrieve and download datasets for all major European leagues. By executing the scraper from the terminal and providing the desired season, users can download the datasets for analysis. Example of usage:

    python3 scraper.py <season> # e.g. 2122
  3. stan/ contains the Stan models used for statistical modeling. Two models are included:

    • karlis-ntzoufras.stan: the standard version proposed by the authors. From now on in the report, we will refer to it as the “offline” model;
    • online.stan: the online-learning framework version of the model.
  4. utils/ contains some custom helper functions used throughout the analysis process.

  5. report/ contains the final report file, where the results of the analysis are documented and presented.

  6. estimated_models/ contains the fitted models as .rds files. Like the data/ directory, it is organized in season-specific subfolders. Since I wanted to store the “offline” models separately from the “online” ones, I further increased the depth of the subfolders, additionally organizing the models by matchday (I hope the layout is intuitive).

In addition to these directories, various R scripts are available, each corresponding to different stages of the analysis pipeline. These scripts facilitate data cleaning, exploration, modeling, and visualization, contributing to a structured and reproducible workflow. The following sections detail the methodology employed, the results obtained from statistical modeling, and insights derived from the analysis of Serie A 2021-2022 data.


Exploratory Data Analysis

Before delving into model construction, it is crucial to gain a comprehensive understanding of the data. The exploratory data analysis (EDA) phase, through its visualizations, not only provides a glimpse into the underlying dynamics but also informs the modeling approach by highlighting factors that influence match outcomes. By analyzing various aspects of the data, such as goal differences, result distribution, and team performance, we lay the groundwork for developing our models.

Data import and cleaning:

# Import the raw data and keep only the columns of interest
SerieA_data<- read.csv(file= paste0(DATA_DIR,"season_",SEASON,"/SerieA_",SEASON,".csv"))
SerieA_data<- SerieA_data %>% select(c("HomeTeam","AwayTeam","FTHG","FTAG","FTR"))
# Goal difference: home goals minus away goals
SerieA_data<- SerieA_data %>% mutate(GD=FTHG-FTAG)
teams<- unique(SerieA_data$HomeTeam)
n_games<- nrow(SerieA_data)
n_teams<- length(teams)
n_matchdays= ceiling(n_games/((n_teams)/2))
# Numeric team indices for the home and away side of each game
ht= unlist(sapply(1:n_games,function (g) which(teams==SerieA_data$HomeTeam[g])))
at= unlist(sapply(1:n_games,function (g) which(teams==SerieA_data$AwayTeam[g])))
SerieA_data$ht=ht
SerieA_data$at=at
# Remove spaces from team names (used later to label parameters and files)
teams<- str_replace_all(teams, " ", "")

The first variable I inspected was, of course, the goal difference, as it represents our main phenomenon of interest. The plot below shows the empirical data, overlays it with theoretical Skellam distributions (both the standard and a zero-inflated version), and highlights how well the theoretical distributions align with the observed data.
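As a rough sketch of how such an overlay can be produced (this is not the report’s plotting code, and the Skellam rates here are simple method-of-moments values rather than model estimates), recall that \(E[GD]=\lambda_1-\lambda_2\) and \(Var[GD]=\lambda_1+\lambda_2\):

library(skellam)
# Method-of-moments Skellam fit to the observed goal differences
gd <- SerieA_data$GD
lambda1 <- (var(gd) + mean(gd)) / 2
lambda2 <- (var(gd) - mean(gd)) / 2
support <- min(gd):max(gd)
empirical <- prop.table(table(factor(gd, levels = support)))
plot(support, as.numeric(empirical), type = "h", lwd = 3,
     xlab = "Goal difference", ylab = "Relative frequency")
points(support, dskellam(support, lambda1, lambda2), pch = 19, col = "red")
legend("topright", legend = c("Empirical", "Skellam (MoM)"),
       pch = c(NA, 19), lty = c(1, NA), col = c("black", "red"))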

Next, we investigate the distribution of results (encoded as Home-win, Draw, and Away-win) across the entire season. As depicted in the plot below, HomeWin emerges as the majority class. This observation informs our modeling approach, where we will introduce a home effect parameter to account for the slight advantage of playing matches in one’s own stadium.

Another interesting aspect to explore is the goal statistics by team. In particular, by examining the goals scored and conceded by each team, we gain valuable insight into the likely ranking of the attack and defense coefficients in our upcoming model. Additionally, distinguishing between home and away matches underscores the importance of incorporating the home effect as a model parameter.

# Preliminary steps required to make the plots
home_scored <- aggregate(SerieA_data$FTHG, list(SerieA_data$HomeTeam), FUN = sum)
away_scored <- aggregate(SerieA_data$FTAG, list(SerieA_data$AwayTeam), FUN = sum)
home_conceded <- aggregate(SerieA_data$FTAG, list(SerieA_data$HomeTeam), FUN = sum)
away_conceded <- aggregate(SerieA_data$FTHG, list(SerieA_data$AwayTeam), FUN = sum)

colnames(home_scored) <- c("team", "home_scored")
colnames(away_scored) <- c("team", "away_scored")
colnames(home_conceded) <- c("team", "home_conceded")
colnames(away_conceded) <- c("team", "away_conceded")

goals_scored <- merge(home_scored, away_scored, by = "team")
goals_scored$tot_goals <- goals_scored$home_scored + goals_scored$away_scored

goals_conceded <- merge(home_conceded, away_conceded, by = "team")
goals_conceded$tot_goals <- goals_conceded$home_conceded + goals_conceded$away_conceded

Comment: only five teams (out of twenty) scored more goals away than at home. Except for Napoli, which scored the same number of goals at home and away, all other teams had better offensive performances when playing in their own stadium. Symmetrically, only six teams conceded more goals at home than away; all others showed better defensive performances when playing in their own stadium (Inter, Salernitana, and Udinese actually conceded the same number of goals at home and away).

The existence of a home effect is also suggested by the useful results (wins and draws) achieved by each team, split between home and away outcomes:

useful_results_df<- SerieA_data %>%
  group_by(HomeTeam) %>%
  summarize(UsefulResultsHome = sum(FTR %in% c("H", "D"))) %>%
  left_join(SerieA_data %>% 
              group_by(AwayTeam) %>%
              summarize(UsefulResultsAway = sum(FTR %in% c("D", "A"))),
            by = c("HomeTeam" = "AwayTeam")) %>%
  rename(Team=HomeTeam)

Finally, an interesting insight that can be extracted from our data lies in the Final Rankings of the league:

# Assign league points from the full-time result (FTR: H = home win, D = draw, A = away win)
SerieA_data <- SerieA_data %>%
  mutate(HomePoints = ifelse(FTR == "H", 3,
                             ifelse(FTR == "D", 1, 0)),
         AwayPoints = ifelse(FTR == "A", 3,
                             ifelse(FTR == "D", 1, 0)))

total_points <- SerieA_data %>%
  select(HomeTeam,AwayTeam,FTHG,FTAG,HomePoints,AwayPoints) %>%
  group_by(Team = HomeTeam) %>%
  summarise(TotalPoints = sum(HomePoints),
            GS = sum(FTHG),
            GC = sum(FTAG)) %>%
  bind_rows(
    SerieA_data %>%
      group_by(Team = AwayTeam) %>%
      summarise(TotalPoints = sum(AwayPoints),
                GS = sum(FTAG),
                GC = sum(FTHG))
  ) %>%
  group_by(Team) %>%
  summarise(TotalPoints=sum(TotalPoints),
            GS = sum(GS),
            GC = sum(GC)) %>%
  arrange(desc(TotalPoints)) %>%
  mutate(Position = row_number(),
         GD =GS-GC) %>%
  select(Position, Team, TotalPoints,GS,GC,GD) 
Serie A 2021-2022 final rankings

Position  Team         TotalPoints  GS  GC   GD
       1  Milan                 86  69  31   38
       2  Inter                 84  84  32   52
       3  Napoli                79  74  31   43
       4  Juventus              70  57  37   20
       5  Lazio                 64  77  58   19
       6  Roma                  63  59  43   16
       7  Fiorentina            62  59  51    8
       8  Atalanta              59  65  48   17
       9  Verona                53  65  59    6
      10  Sassuolo              50  64  66   -2
      11  Torino                50  46  41    5
      12  Udinese               47  61  58    3
      13  Bologna               46  44  55  -11
      14  Empoli                41  50  70  -20
      15  Sampdoria             36  46  63  -17
      16  Spezia                36  41  71  -30
      17  Salernitana           31  33  78  -45
      18  Cagliari              30  34  68  -34
      19  Genoa                 28  27  60  -33
      20  Venezia               27  34  69  -35

Model Formulation

As mentioned in the Introduction, we can use the Skellam distribution to model the goal difference of a football match. More precisely, the goal difference of the match between teams \(i\) and \(j\) (where we adopt the convention of \(i\) being the home team) is described by: \[\begin{equation} GD_{i,j} \sim Skellam\left(\theta^{(H)}_{(i,j)},\theta^{(A)}_{(i,j)}\right) \end{equation}\] with: \[\begin{align*} \theta^{(H)}_{(i,j)} &= \exp\{\mu + h_{eff} + att_{(i)} + def_{(j)}\}\\[0.2cm] \theta^{(A)}_{(i,j)} &= \exp\{\mu + att_{(j)} + def_{(i)}\} \end{align*}\]
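For reference, the Skellam pmf appearing above is \[f_{S}\left(x,\theta_1,\theta_2\right) = e^{-(\theta_1+\theta_2)}\left(\frac{\theta_1}{\theta_2}\right)^{x/2} I_{|x|}\!\left(2\sqrt{\theta_1\theta_2}\right), \qquad x \in \mathbb{Z},\] where \(I_{r}(\cdot)\) denotes the modified Bessel function of the first kind; this is exactly the quantity that the custom skellam_lpmf function computes (on the log scale) in the Stan code later on.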

Under the above parametrization, all parameters have a straightforward interpretation:

  • \(\mu\) is the overall (log-scale) baseline scoring rate;
  • \(h_{eff}\) is the home advantage effect, which enters only the home team’s rate;
  • \(att_{(i)}\) and \(def_{(i)}\) measure the attack and defense abilities of team \(i\) (note that, under this parametrization, larger values of \(def\) denote weaker defenses, since \(def_{(j)}\) inflates the opponent’s scoring rate); both sets of coefficients are subject to sum-to-zero constraints for identifiability, as enforced in the Stan code below.

Some notes:

In my previous experience with goal-based models, I noticed that the number of draws (and the corresponding probability) tends to be underestimated. To address this, I explored the idea, proposed by Karlis and Ntzoufras, of using a zero-inflated version of the Skellam distribution. This approach seemed promising to me, so the probability distribution I referred to is the following:

\[\begin{align*} f_{ZIS}\left(x,\theta_1,\theta_2,p\right) = \begin{cases} (1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x\neq0$}\\[0.3cm] p+(1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x=0$} \end{cases} \end{align*}\] with \(x \in \mathbb{Z},\theta_1,\theta_2 >0\) and \(p \in [0,1]\)
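As a side note, this pmf is a one-liner in R; here is a sketch (not code from the repository) using dskellam() from the skellam package, handy for quick numerical checks:

library(skellam)
# Zero-inflated Skellam pmf: an extra point mass p at x = 0
dzi_skellam <- function(x, theta1, theta2, p) {
  (x == 0) * p + (1 - p) * dskellam(x, theta1, theta2)
}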

To fully specify a Bayesian model, we need to define prior distributions for the model parameters. In my analysis, I chose not to incorporate any external information from previous seasons. In such cases, Karlis and Ntzoufras suggest using normal prior distributions with mean zero and a large variance (e.g. \(10^4\)) to express prior ignorance. However, since the Stan documentation discourages non-informative priors, I set my priors to be weakly informative (still assuming normality):

\[\begin{align*} \mu &\sim N(0,10)\\[0.2cm] h_{eff} &\sim N(0,10)\\[0.2cm] att_{(\bullet)} &\sim N(0,10)\\[0.2cm] def_{(\bullet)} &\sim N(0,10)\\[0.2cm] p &\sim U(0,1) \end{align*}\]

In addition to the standard approach, I considered an online-learning approach. In the Bayesian framework, we typically start from a prior distribution and update it with the observed data to obtain a posterior distribution. When new data becomes available, this posterior can serve as the prior for the next update, and so on. Despite concerns about specifying overly informative priors, I was curious to compare the online-learning approach with the traditional “offline” model, where the information that updates our knowledge derives (almost) only from the data.

Note: in principle we don’t know the functional form of the posterior distribution. However, in my analysis, I assumed my priors to be normal, updating only their location and scale according to the matchday-by-matchday estimates.

The online model has, of course, the same likelihood as the offline one and differs only in the prior specification. More precisely, at a certain matchday \(m\), the priors are the following:

\[\begin{align*} \mu &\sim N\big(MAP(\mu)^{(m-1)},\sigma_{\mu}^{(m-1)}\big)\\[0.2cm] h_{eff} &\sim N\big(MAP(h_{eff})^{(m-1)},\sigma_{h_{eff}}^{(m-1)}\big)\\[0.2cm] att_{(\bullet)} &\sim N\big(MAP(att_{(\bullet)}\;)^{(m-1)},\sigma_{att_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] def_{(\bullet)} &\sim N\big(MAP(def_{(\bullet)}\;)^{(m-1)},\sigma_{def_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] p &\sim U(0,1) \end{align*}\]
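Below is a hypothetical helper (it is not one of the utils/ functions) showing how the prev_*_MAP and prev_*_sd inputs of the online model could be extracted from the previous matchday’s fit, approximating the MAP by the mode of a kernel density estimate of each marginal posterior:

# Hypothetical sketch: location/scale summaries of the previous posterior,
# to be fed as prior inputs for the next matchday
get_prior_inputs <- function(fit, par) {
  draws <- as.matrix(fit, pars = par)
  MAP <- apply(draws, 2, function(d) {
    dens <- density(d)
    dens$x[which.max(dens$y)]  # mode of the KDE as MAP estimate
  })
  list(MAP = MAP, sd = apply(draws, 2, sd))
}
# e.g. get_prior_inputs(KN_model, "att_raw") -> prev_att_MAP / prev_att_sd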


Model implementation in Stan

The models have been fully implemented in the Stan ecosystem and managed from R through the rstan library. I decided to follow carefully the procedure illustrated by Karlis and Ntzoufras in their article, setting the following MCMC parameters: 4 chains, 11000 iterations per chain (of which 1000 of warmup), and no thinning (thin = 1).

In what follows the main blocks of the Stan files are illustrated.

Offline model

  • Functions: since Stan does not provide the log-pmf of the Skellam distribution, I had to define my own custom functions

    functions {
      real skellam_lpmf(int k, real lambda1, real lambda2) {
        real total;
        real log_prob;
        total = (- lambda1 - lambda2) + (log(lambda1) - log(lambda2)) * k / 2;
        log_prob = total + log(modified_bessel_first_kind(k, 2 * sqrt(lambda1*lambda2)));
        return log_prob;
      }
      real zero_inflated_skellam_lpmf(int k, real lambda1, real lambda2, real p) {
        real base_prob;
        real prob;
        real log_prob;
        base_prob = exp(skellam_lpmf(k|lambda1, lambda2));
        if (k == 0){
          prob = p + (1 - p) * base_prob;
        }
        else{
          prob = (1 - p) * base_prob;
        }
        log_prob = log(prob);
        return log_prob;
      }
    }
  • Data:

    data {
      int<lower=1> n_teams;
      int<lower=1> n_games;
      array[n_games] int<lower=1, upper=n_teams> home_team;
      array[n_games] int<lower=1, upper=n_teams> away_team;
      array[n_games] int goal_difference;
    }
  • Parameters and transformed parameters:

    parameters {
      real<lower=0, upper=1> p;
      real mu;
      real home_advantage;
      array[n_teams-1] real att_raw;
      array[n_teams-1] real def_raw;
    }
    
    transformed parameters {
      // Sum-to-zero constraint
      array[n_teams] real att;
      array[n_teams] real def;
    
      for (t in 1:(n_teams-1)) {
        att[t] = att_raw[t];
        def[t] = def_raw[t];
      }
    
      att[n_teams] = -sum(att_raw);
      def[n_teams] = -sum(def_raw);
    }
  • Model:

    model {
      array[n_games] real theta_H;
      array[n_games] real theta_A;
      // Priors
      p ~ uniform(0, 1);
      att_raw ~ normal(0, 10);
      def_raw ~ normal(0, 10);
      home_advantage ~ normal(0, 10);
      mu ~ normal(0, 10);
      // Likelihood
      for (g in 1:n_games) {
        theta_H[g] = exp(mu + home_advantage +att[home_team[g]] + def[away_team[g]]);
        theta_A[g] = exp(mu + att[away_team[g]] + def[home_team[g]]);
        goal_difference[g] ~ zero_inflated_skellam(theta_H[g],theta_A[g],p);
      }
    }
  • Generated quantities: I couldn’t exploit the generated quantities block, since Stan does not provide a random number generator for the Skellam distribution (a sort of skellam_rng()). Users often bypass this limitation by generating 2 values from a Poisson (with poisson_rng()) and taking their difference, but this would clash with our setup, in which we don’t assume Poisson marginals. Fortunately, the generated quantities can easily be computed in R after the model has been fitted! I opted for this latter approach, relying on the skellam library rather than re-inventing the wheel.

Technical Note: the skellam library provides an rskellam() function, but the documentation clearly specifies that it works internally by generating 2 random Poisson values (with rpois()), which is exactly the mechanism I wanted to avoid :( Nevertheless, I could easily implement a custom Skellam generator through inverse-transform sampling, exploiting the qskellam() function from the same library, as shown below:

library(skellam)
#-------------------------------------------------------------------------------
# 1) Random generator for standard skellam
my_rskellam<- function(n,lambda1,lambda2){
  q= runif(n,min=0,max=1)
  return(qskellam(q,lambda1,lambda2))
}
#-------------------------------------------------------------------------------
# 2) Random generator for zero-inflated skellam
my_rzeroinflatedskellam<- function(n,lambda1,lambda2,p){
  samples<- vector(mode="numeric",length = n)
  for(i in 1:n){
    # Generate 0 with probability p
    # Or a random number from a standard skellam with prob (1-p)
    if(runif(1)>p){
      samples[i]<- my_rskellam(1,lambda1,lambda2)
    }
    #else not needed, since there are already zeros!
  }
  return(samples)
}
#-------------------------------------------------------------------------------
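For instance, a quick sanity check with made-up rates:

set.seed(42)
# Five draws from a zero-inflated Skellam with hypothetical rates and p
my_rzeroinflatedskellam(5, lambda1 = 1.4, lambda2 = 1.1, p = 0.1)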

Online model

The Stan code for the online model is similar to the previous one, with the main differences in the data block and in the prior definitions (model block):

data {
  int<lower=1> n_teams;
  int<lower=1> n_games;
  array[n_games] int<lower=1, upper=n_teams> home_team;
  array[n_games] int<lower=1, upper=n_teams> away_team;
  array[n_games] int goal_difference;
  // Previous estimates and sd
  array[n_teams-1] real prev_att_MAP;
  array[n_teams-1] real prev_def_MAP;
  real prev_mu_MAP;
  real prev_home_advantage_MAP;
  array[n_teams-1] real<lower=0> prev_att_sd;
  array[n_teams-1] real<lower=0> prev_def_sd;
  real<lower=0> prev_mu_sd;
  real<lower=0> prev_home_advantage_sd;
}

model {
  ...
  // Priors
  p ~ uniform(0, 1);
  for(a in 1:(n_teams-1)){
    att_raw[a] ~ normal(prev_att_MAP[a], prev_att_sd[a]);
    def_raw[a] ~ normal(prev_def_MAP[a], prev_def_sd[a]);
  }
  home_advantage ~ normal(prev_home_advantage_MAP, prev_home_advantage_sd);
  mu ~ normal(prev_mu_MAP, prev_mu_sd);
  ...
}

Disclaimer:

  • Due to the extensive time required to fit the models, I am not including the full code here (it is not particularly interesting, though). However, I have fitted all the models and stored them in a dedicated folder, with the possibility of loading each of them when required.
  • In reality, the Bayesian procedure of fitting the model matchday by matchday would be carried out on a weekly basis. With access to the complete dataset for the analyzed season, however, I was able to fit the models in a for loop iterating over each matchday (a minimal sketch follows this list). The procedure for each iteration is the following:
    1. Select the rows of the dataframe corresponding to the available information up to the current matchday.
    2. Load the model estimated at the end of the previous matchday (from the models folder).
    3. Fit the model and store the updated model in the dedicated folder.
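For the offline models, the loop amounts to something like the following sketch (directory constants and file names as used elsewhere in this report; the online version additionally passes the previous MAP/sd estimates in the data list):

library(rstan)
for (m in 19:n_matchdays) {
  # (1) All matches up to matchday m (10 matches per matchday)
  train <- SerieA_data[1:(10 * m), ]
  stan_data <- list(n_teams = n_teams,
                    n_games = nrow(train),
                    home_team = train$ht,
                    away_team = train$at,
                    goal_difference = train$GD)
  # (2)-(3) Fit the model and store it in the dedicated folder
  KN_model <- stan(file = "stan/karlis-ntzoufras.stan", data = stan_data,
                   chains = 4, iter = 11000, warmup = 1000, thin = 1)
  save(KN_model, file = paste0(OFFLINE_MODELS_DIR, "matchday", m,
                               "/KN_matchday", m, ".rds"))
}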

Posterior distributions of the coefficients

After fitting my Bayesian models, I used the bayesplot package to generate informative plots of the posterior distributions of the model parameters. To provide a comprehensive overview, I will present the plots for the “final model”, which is the one fitted at the end of the season, with all the available data. These plots will be shown for both the offline and online models, with a clear distinction between the attack parameters and the defense parameters.

Plots for offline models

load(paste0(OFFLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<-  names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_offline<- mcmc_intervals(posterior,pars=att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC intervals for attack coefficients (offline)")

att_areas_offline<- mcmc_areas(posterior,pars= att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC areas for attack coefficients  (offline)")

att_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=att_params)+
  ggtitle("Marginal posteriors for attack coefficients by chain  (offline)")
att_dens_overlay_offline[[1]]$Parameter = rep(att_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-----------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_offline<- mcmc_intervals(posterior,pars=def_params)+
  scale_y_discrete(labels=def_labels)+
  ggtitle("MCMC intervals for defense coefficients (offline)")

def_areas_offline<- mcmc_areas(posterior,pars= def_params)+
  scale_y_discrete(labels=def_labels)+
  ggtitle("MCMC areas for defense coefficients (offline)")

def_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=def_params)+
  ggtitle("Marginal posteriors for defense coefficients by chain (offline)")
def_dens_overlay_offline[[1]]$Parameter = rep(def_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_offline<- mcmc_areas(posterior,pars="home_advantage")+
  ggtitle("MCMC areas for home advantage (offline)")
home_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars="home_advantage")+
  ggtitle("Marginal posteriors for home advantage by chain (offline)")

Plots for online models

load(paste0(ONLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<-  names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_online<- mcmc_intervals(posterior,pars=att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC intervals for attack coefficients (online)")

att_areas_online<- mcmc_areas(posterior,pars= att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC areas for attack coefficient (online)")

att_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=att_params)+
  ggtitle("Marginal posteriors for attack coefficients by chain (online)")
att_dens_overlay_online[[1]]$Parameter = rep(att_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_online<- mcmc_intervals(posterior,pars=def_params)+
  scale_y_discrete(labels=def_labels)+
  ggtitle("MCMC intervals for defense coefficients (online)")

def_areas_online<- mcmc_areas(posterior,pars= def_params)+
  scale_y_discrete(labels=def_labels)+
  ggtitle("MCMC areas for defense coefficients (online)")

def_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=def_params)+
  ggtitle("Marginal posteriors for defense coefficients by chain (online)")
def_dens_overlay_online[[1]]$Parameter = rep(def_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_online<- mcmc_areas(posterior,pars="home_advantage")+
  ggtitle("MCMC areas for home advantage (online)")
home_dens_overlay_online<- mcmc_dens_overlay(posterior,pars="home_advantage")+
  ggtitle("Marginal posteriors for home advantage by chain (online)")

MCMC Intervals plots

MCMC Areas plots

Marginal posteriors (by chain)

Home advantage

Teams comparison

Abilities over time

Teams’ abilities over time (offline model)

Teams’ abilities over time (online model)


A glimpse at convergence diagnostics

The training phase of the models seemed promising, as the MCMC procedure completed successfully without any warnings or errors. To assess the convergence of the chains, we can inspect the summary of the models and look at the Gelman-Rubin statistic \(\hat{R}\) (Rhat), which provides a quantitative measure of the chains’ convergence to the stationary distribution.

Offline model

## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1; 
## post-warmup draws per chain=10000, total post-warmup draws=40000.
## 
##                 mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu             -0.19       0 0.11 -0.40 -0.26 -0.19 -0.11  0.02 27901    1
## home_advantage  0.12       0 0.07 -0.01  0.08  0.12  0.17  0.26 47969    1
## att[1]          0.68       0 0.17  0.33  0.56  0.68  0.80  1.02 36874    1
## att[2]          0.06       0 0.23 -0.40 -0.09  0.07  0.22  0.51 40797    1
## att[3]         -0.25       0 0.29 -0.84 -0.44 -0.24 -0.05  0.29 40810    1
## att[4]         -0.10       0 0.24 -0.58 -0.25 -0.09  0.07  0.35 38297    1
## att[5]         -0.05       0 0.26 -0.58 -0.22 -0.04  0.13  0.44 42691    1
## att[6]          0.49       0 0.22  0.04  0.34  0.49  0.64  0.91 35893    1
## att[7]          0.50       0 0.18  0.15  0.39  0.51  0.62  0.85 37597    1
## att[8]          0.28       0 0.21 -0.14  0.14  0.29  0.43  0.69 40854    1
## att[9]         -0.73       0 0.40 -1.57 -0.98 -0.71 -0.46 -0.02 39762    1
## att[10]         0.29       0 0.24 -0.19  0.13  0.29  0.45  0.74 38666    1
## att[11]         0.30       0 0.21 -0.12  0.16  0.30  0.44  0.71 44720    1
## att[12]         0.67       0 0.19  0.29  0.54  0.67  0.80  1.04 35146    1
## att[13]         0.53       0 0.21  0.12  0.39  0.53  0.67  0.93 37462    1
## att[14]        -0.04       0 0.21 -0.47 -0.18 -0.04  0.10  0.36 42073    1
## att[15]        -1.28       0 0.52 -2.41 -1.61 -1.24 -0.92 -0.35 35030    1
## att[16]         0.29       0 0.23 -0.18  0.14  0.30  0.45  0.73 35889    1
## att[17]         0.29       0 0.19 -0.08  0.16  0.29  0.42  0.65 42743    1
## att[18]        -0.51       0 0.39 -1.33 -0.77 -0.50 -0.24  0.20 37348    1
## att[19]        -0.39       0 0.32 -1.05 -0.60 -0.38 -0.17  0.20 40775    1
## att[20]        -1.03       0 0.43 -1.93 -1.30 -1.01 -0.74 -0.26 37245    1
## def[1]         -0.54       0 0.41 -1.42 -0.80 -0.51 -0.25  0.20 35246    1
## def[2]         -0.13       0 0.26 -0.66 -0.30 -0.13  0.05  0.35 39088    1
## def[3]          0.22       0 0.20 -0.18  0.09  0.22  0.36  0.61 41929    1
## def[4]         -0.28       0 0.27 -0.82 -0.46 -0.28 -0.10  0.22 39696    1
## def[5]          0.18       0 0.22 -0.25  0.04  0.18  0.33  0.60 39619    1
## def[6]          0.41       0 0.23 -0.05  0.25  0.41  0.56  0.85 36563    1
## def[7]         -0.65       0 0.39 -1.48 -0.90 -0.63 -0.38  0.06 37040    1
## def[8]         -0.13       0 0.27 -0.69 -0.31 -0.12  0.06  0.39 40010    1
## def[9]          0.28       0 0.19 -0.09  0.16  0.29  0.41  0.66 39832    1
## def[10]         0.54       0 0.19  0.16  0.42  0.54  0.67  0.91 37093    1
## def[11]        -0.14       0 0.28 -0.71 -0.32 -0.13  0.05  0.39 44269    1
## def[12]         0.34       0 0.24 -0.14  0.19  0.35  0.51  0.80 36020    1
## def[13]         0.37       0 0.23 -0.08  0.22  0.37  0.53  0.80 39245    1
## def[14]        -0.85       0 0.35 -1.58 -1.07 -0.83 -0.61 -0.20 43395    1
## def[15]         0.11       0 0.20 -0.27 -0.02  0.11  0.25  0.50 45973    1
## def[16]         0.29       0 0.23 -0.16  0.14  0.29  0.44  0.72 35305    1
## def[17]        -1.11       0 0.46 -2.10 -1.40 -1.08 -0.79 -0.28 35375    1
## def[18]         0.54       0 0.18  0.19  0.42  0.54  0.66  0.89 37538    1
## def[19]         0.33       0 0.19 -0.04  0.21  0.34  0.46  0.70 39414    1
## def[20]         0.20       0 0.19 -0.17  0.07  0.20  0.32  0.56 37993    1
## 
## Samples were drawn using NUTS(diag_e) at Thu May 23 16:24:21 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Online model

## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1; 
## post-warmup draws per chain=10000, total post-warmup draws=40000.
## 
##                 mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu             -0.32       0 0.02 -0.35 -0.33 -0.32 -0.31 -0.29 57363    1
## home_advantage  0.15       0 0.02  0.12  0.14  0.15  0.16  0.18 55709    1
## att[1]          0.78       0 0.03  0.71  0.75  0.78  0.80  0.84 54325    1
## att[2]          0.11       0 0.05  0.02  0.08  0.11  0.15  0.21 57006    1
## att[3]          0.01       0 0.05 -0.09 -0.02  0.01  0.05  0.12 54203    1
## att[4]          0.03       0 0.05 -0.06  0.00  0.03  0.06  0.12 54559    1
## att[5]          0.21       0 0.05  0.11  0.17  0.21  0.24  0.30 51632    1
## att[6]          0.37       0 0.05  0.27  0.33  0.37  0.40  0.47 50910    1
## att[7]          0.49       0 0.04  0.42  0.46  0.49  0.51  0.56 57498    1
## att[8]          0.44       0 0.04  0.36  0.41  0.44  0.47  0.52 53710    1
## att[9]         -0.64       0 0.08 -0.79 -0.69 -0.64 -0.59 -0.49 51317    1
## att[10]         0.26       0 0.05  0.16  0.22  0.26  0.29  0.36 58589    1
## att[11]         0.40       0 0.04  0.32  0.37  0.40  0.43  0.48 58019    1
## att[12]         0.79       0 0.04  0.72  0.76  0.79  0.81  0.86 52849    1
## att[13]         0.57       0 0.04  0.49  0.54  0.57  0.60  0.65 57625    1
## att[14]         0.11       0 0.04  0.02  0.08  0.11  0.14  0.20 54968    1
## att[15]        -2.18       0 0.12 -2.41 -2.26 -2.18 -2.10 -1.95 50041    1
## att[16]         0.01       0 0.05 -0.09 -0.02  0.01  0.05  0.12 53968    1
## att[17]         0.40       0 0.04  0.32  0.37  0.40  0.43  0.48 49641    1
## att[18]        -0.69       0 0.09 -0.87 -0.75 -0.69 -0.63 -0.52 51054    1
## att[19]        -0.42       0 0.07 -0.56 -0.47 -0.42 -0.38 -0.29 51023    1
## att[20]        -1.03       0 0.22 -1.47 -1.18 -1.03 -0.88 -0.61 62394    1
## def[1]         -0.62       0 0.07 -0.75 -0.66 -0.62 -0.57 -0.49 56236    1
## def[2]         -0.16       0 0.05 -0.26 -0.20 -0.17 -0.13 -0.07 48733    1
## def[3]          0.41       0 0.04  0.33  0.38  0.41  0.43  0.48 51272    1
## def[4]         -0.25       0 0.05 -0.35 -0.28 -0.25 -0.21 -0.15 53831    1
## def[5]          0.41       0 0.04  0.33  0.38  0.41  0.43  0.48 53237    1
## def[6]          0.34       0 0.04  0.25  0.31  0.34  0.37  0.42 53148    1
## def[7]         -0.83       0 0.06 -0.95 -0.87 -0.83 -0.79 -0.71 50939    1
## def[8]          0.01       0 0.05 -0.08 -0.02  0.01  0.04  0.10 52022    1
## def[9]          0.32       0 0.04  0.25  0.30  0.32  0.35  0.39 56688    1
## def[10]         0.53       0 0.04  0.45  0.50  0.53  0.55  0.60 50807    1
## def[11]        -0.31       0 0.06 -0.42 -0.35 -0.31 -0.28 -0.21 54624    1
## def[12]         0.43       0 0.04  0.34  0.40  0.43  0.45  0.51 52231    1
## def[13]         0.30       0 0.04  0.22  0.27  0.30  0.33  0.39 57413    1
## def[14]        -0.98       0 0.06 -1.11 -1.02 -0.98 -0.94 -0.86 53674    1
## def[15]         0.06       0 0.04 -0.01  0.04  0.06  0.09  0.14 49967    1
## def[16]        -0.10       0 0.05 -0.19 -0.13 -0.10 -0.07  0.00 53460    1
## def[17]        -0.78       0 0.06 -0.90 -0.82 -0.78 -0.74 -0.66 50585    1
## def[18]         0.62       0 0.03  0.56  0.60  0.62  0.65  0.69 59261    1
## def[19]         0.36       0 0.04  0.29  0.33  0.36  0.38  0.43 53456    1
## def[20]         0.25       0 0.13 -0.01  0.16  0.25  0.34  0.50 64327    1
## 
## Samples were drawn using NUTS(diag_e) at Thu May 23 15:21:00 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

In the summaries, all the Rhat values equal 1, indicating convergence of the chains (remember the rule of thumb, \(\hat{R}< 1.1\)). Additionally, the effective sample size (n_eff) values are high, which is a positive indicator. Notably, the n_eff values for the online model are generally higher than those for the offline model, which I somehow expected. Interestingly, many n_eff values (actually all of them, for the online model) are greater than the total number of post-warmup draws. At first, this seemed paradoxical to me. However, after reading various discussions on the Stan developers forum, I found the following answer: “Hi! It means that the draws that Stan is producing are better than independent draws for those parameters. Sounds like a joke, but apparently Stan is actually that awesome. :)”. (I leave the link to the discussion here)
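These checks can also be performed programmatically; for instance, with rstan:

# Programmatic counterpart of the eyeball check on the summaries above
fit_summary <- summary(KN_model)$summary
any(fit_summary[, "Rhat"] > 1.1)   # FALSE: nothing violates the rule of thumb
min(fit_summary[, "n_eff"])        # smallest effective sample size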

Thanks to the bayesplot package, we can also draw the traceplots and visually inspect them for divergent transitions. In our case, no transition diverged, so the plots are not particularly interesting. However, had there been any divergences, they would be evident in the plots below. As an example, I’ll present only the traceplots for the attack and defense coefficients of the online model:

## No divergences to plot.
## No divergences to plot.

After examining the Gelman-Rubin statistic and the effective sample size, another important convergence diagnostic is the autocorrelation function (ACF) plot, which helps inspect the autocorrelation within the chain samples. Upon reviewing all the ACF plots, I did not observe any particularly concerning patterns. To provide a comprehensive overview, I will present ACF plots for just two parameters: one with greater variability (the defense coefficient for Venezia FC) and one with less variability (the attack coefficient for Inter). These examples illustrate that, regardless of the variability, there are no significant autocorrelation patterns, indicating good mixing properties of the chains.
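A sketch of the bayesplot calls behind the figures just described (the att/def indices matching Inter and Venezia depend on the team ordering in teams, so the ones below are placeholders):

# Traceplots, with divergent transitions (if any) marked via the NUTS info
mcmc_trace(posterior, pars = att_params, np = nuts_params(KN_model))
# ACF plots for two example parameters
mcmc_acf_bar(posterior, pars = c("att[5]", "def[20]"))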


Model checking and comparison

So far we have constructed a Bayesian model, computed the posterior distribution of all estimates, and conducted diagnostic tests (both quantitative and qualitative) regarding the convergence of the chains. Now our attention turns to evaluating the fit of the model to the data and to our substantive knowledge.

Posterior predictive checks (in-sample replications)

As illustrated in the Bayesian Data Analysis book, a key technique for assessing model fit involves drawing simulated values from the posterior predictive distribution and comparing them to the observed data. More specifically, we will now perform an in-sample replication analysis, in which we assess whether the model adequately captures the observed data by checking the plausibility of the observed data under the posterior predictive distribution. If the model is well fitted, replicated data generated under its framework should closely resemble the observed data, while any discrepancy between the replicated and observed data may indicate model misfit or chance variation. In short, the observed data should look plausible under the posterior predictive distribution.

Model checking can be effectively conducted in a graphical/qualitative manner. Fortunately, our friend bayesplot offers a suite of graphical tools for assessing the posterior predictive distribution in our in-sample analysis. Before delving into the graphical model checking plots, a preliminary step is required: due to the limitations of the generated quantities block in Stan discussed earlier, it was necessary to simulate these quantities in R (as shown in the chunks below).

Note: In conducting the in-sample replication procedure, we assess the plausibility of the observed data relative to replications drawn from the model trained on that same data. While it is possible to perform this check matchday by matchday, I opted to use our most comprehensive model, the one estimated after the final matchday of the season. By employing this approach, the replications are conditioned on the entirety of the season’s data, providing a robust assessment of model fit across the entire dataset.

PPC for offline models

#-------------------------------------------------------------------------------
# Dataframe in which to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(OFFLINE_MODELS_DIR,"matchday",n_matchdays,
            "/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
  cat(m,"\n")
  ht=SerieA_data$ht[m]
  at=SerieA_data$at[m]
  attH=posterior[,,paste0("att[",ht,"]")]
  defH=posterior[,,paste0("def[",ht,"]")]
  attA=posterior[,,paste0("att[",at,"]")]
  defA=posterior[,,paste0("def[",at,"]")]
  
  theta_H = exp(mu+home+attH+defA)
  theta_A = exp(mu+attA+defH)
  
  GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
  
  GD_df[,m]<- GD
  colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("teal")

ppc_dens_overlay_offline <- ppc_dens_overlay(y=SerieA_data$GD,
                                             yrep=as.matrix(GD_df[1:4000,]))

ppc_ecdf_overlay_offline <- ppc_ecdf_overlay(y=SerieA_data$GD,
                                             yrep=as.matrix(GD_df[1:10000,]))

ppc_stat_mean_offline<- ppc_stat(y = SerieA_data$GD,
                                 yrep = as.matrix(GD_df[1:40000,]),
                                 stat = "mean",bins = 20)

propzero <- function(x) mean(x == 0)
ppc_stat_propzero_offline <- ppc_stat(y = SerieA_data$GD,
                                      yrep=as.matrix(GD_df[1:40000,]),
                                      stat = "propzero",bins = 20)

ppc_stat_2d_offline <- ppc_stat_2d(y = SerieA_data$GD,
                                   yrep=as.matrix(GD_df[1:40000,]),
                                   stat = c("mean","sd"))

PPC for online models

#-------------------------------------------------------------------------------
# Dataframe in which to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(ONLINE_MODELS_DIR,"matchday",n_matchdays,
            "/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
  
  ht=SerieA_data$ht[m]
  at=SerieA_data$at[m]
  attH=posterior[,,paste0("att[",ht,"]")]
  defH=posterior[,,paste0("def[",ht,"]")]
  attA=posterior[,,paste0("att[",at,"]")]
  defA=posterior[,,paste0("def[",at,"]")]
  
  theta_H = exp(mu+home+attH+defA)
  theta_A = exp(mu+attA+defH)
  
  GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
  
  GD_df[,m]<- GD
  colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("purple")

ppc_dens_overlay_online <- ppc_dens_overlay(y=SerieA_data$GD,
                                            yrep=as.matrix(GD_df[1:4000,]))

ppc_ecdf_overlay_online<- ppc_ecdf_overlay(y=SerieA_data$GD,
                                           yrep=as.matrix(GD_df[1:10000,]))

ppc_stat_mean_online<- ppc_stat(y = SerieA_data$GD,
                                yrep = as.matrix(GD_df[1:40000,]),
                                stat = "mean",bins = 20)

propzero <- function(x) mean(x == 0)
ppc_stat_propzero_online <- ppc_stat(y = SerieA_data$GD,
                                     yrep=as.matrix(GD_df[1:40000,]),
                                     stat = "propzero",bins = 20)

ppc_stat_2d_online <- ppc_stat_2d(y = SerieA_data$GD,
                                  yrep=as.matrix(GD_df[1:40000,]),
                                  stat = c("mean","sd"))

Note: here, the plots on the left (right) side refer to the offline (online) model!


Posterior predictive checks (out-of-sample replications)

Having completed the in-sample posterior predictive checks, we now turn our attention to out-of-sample replication. This process involves testing the model on future observable values or vectors of new observable quantities. Specifically, for our football league analysis, I employed a sequential prediction strategy based on matchdays. Recall that we initially estimated models matchday by matchday, starting from the end of the first half of the league.

With the model fitted using data up to matchday \(m\), I used it to predict the outcomes of matchday \(m+1\) (which are future, unknown data!). The Bayesian approach, with its chain samples, offers a significant advantage over classical statistics: instead of estimating a single value (either as a point estimate or with confidence intervals), we can obtain a distribution for our quantities of interest. In this case, using the chain samples, I simulated the goal differences for each match, obtaining numerous samples from the distribution of goal differences for a given match.

This distribution then allowed me to assign probability values to the final outcomes, encoded as “H” (Home win), “D” (Draw) and “A” (Away win). I selected the most frequent class as the guess for each match’s final outcome and, by repeating this for all matches in the second half of the league, I constructed confusion matrices to compare the models’ accuracy. Additionally, I computed the Brier score matchday by matchday to further assess the predictive power of my models. A minimal sketch of the outcome-probability step is shown below.
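Here is that sketch for a single match (note that, as a cheaper shortcut to a point prediction, the scripts below instead round the mean simulated goal difference to obtain a predicted GD):

# From simulated goal differences to outcome probabilities and a predicted class
outcome_probs <- function(GD) {
  probs <- c(H = mean(GD > 0), D = mean(GD == 0), A = mean(GD < 0))
  list(probs = probs, predicted = names(which.max(probs)))
}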

Accuracy and Brier Score

Let’s start with the accuracy:

Offline model:

SerieA_data$PredictedGD=NA
for (m in 20:n_matchdays){
  cat(paste0("...Simulation of matchday n. ",m,"...\n"))
  #-------------------------------------------------
  # (1) Get the current matchday to predict
  test_set=SerieA_data[(10*m -9):(10*m),]
  #-------------------------------------------------
  # (2) Load the most recent model and some of its stuff
  cat("...Loading the model and retrieving useful info...\n")
  load(paste0(OFFLINE_MODELS_DIR,"matchday",m-1,"/KN_matchday",m-1,".rds"))
  posterior<- as.array(KN_model)
  mu = posterior[,,"mu"]
  home=posterior[,,"home_advantage"]
  p = posterior[,,"p"]
  #-------------------------------------------------
  # (3) Make predictions
  cat("...Computing predictions and updating rankings...\n")
  PredictedGD<- vector(mode="numeric",length = nrow(test_set))
  for(mm in 1:nrow(test_set)){
    ht=test_set$ht[mm]
    at=test_set$at[mm]
    attH=posterior[,,paste0("att[",ht,"]")]
    defH=posterior[,,paste0("def[",ht,"]")]
    attA=posterior[,,paste0("att[",at,"]")]
    defA=posterior[,,paste0("def[",at,"]")]
    theta_H = exp(mu+home+attH+defA)
    theta_A = exp(mu+attA+defH)
    GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
    
    PredictedGD[mm]<- GD %>% mean() %>% round()
  }
  # (4) Add predicted GD to the dataset
  SerieA_data$PredictedGD[(10*m -9):(10*m)]<- PredictedGD
  cat("-------------------------------------------------\n")
}

ConfusionMatrix_offline_df<- SerieA_data %>% slice(191:n_games) %>% select(GD,PredictedGD)
ConfusionMatrix_offline_df<- ConfusionMatrix_offline_df %>% mutate(
   Actual = ifelse(GD > 0, "H", ifelse(GD < 0, "A", "D")),
   Predicted = ifelse(PredictedGD > 0, "H", ifelse(PredictedGD < 0, "A", "D"))
 )

 CM_offline <- confusion_matrix(
   targets = factor(ConfusionMatrix_offline_df$Actual, levels = c("H", "D", "A")),
   predictions = factor(ConfusionMatrix_offline_df$Predicted, levels = c("H", "D", "A"))
 )

Online model

The same procedure can be performed for the online model (we just need to replace OFFLINE_MODELS_DIR with ONLINE_MODELS_DIR in the load() call), so I will not show the full code again. In the end, we obtain a confusion matrix for the online model as well:

ConfusionMatrix_online_df<- ConfusionMatrix_online_df %>% mutate(
   Actual = ifelse(GD > 0, "H", ifelse(GD < 0, "A", "D")),
   Predicted = ifelse(PredictedGD > 0, "H", ifelse(PredictedGD < 0, "A", "D"))
 )

 CM_online <- confusion_matrix(
   targets = factor(ConfusionMatrix_online_df$Actual, levels = c("H", "D", "A")),
   predictions = factor(ConfusionMatrix_online_df$Predicted, levels = c("H", "D", "A"))
 )

Comparison of the two Confusion matrices:

## >> Balanced accuracy of the offline model: 0.6412566
## >> Balanced accuracy of the online model : 0.6337596

The two confusion matrices highlight the similarity in predictions between the two models. Notably, the (balanced) accuracy of the offline model is slightly higher than that of the online model. However, keep in mind that while accuracy is an important measure for assessing a model, it is not the sole factor in determining its suitability.

Additionally, the similarity in performance between the models is further evidenced by their Brier scores, an evaluation metric defined as:

\[BS = \frac{1}{M} \sum_{m=1}^{M}{\sum_{i}{(p_{mi} - d_{mi})^2}} \hspace{0.2cm}\; \hspace{0.2cm}\text{with } i\in \{H,D,A\}\] where \(p_{mi}\) is the predicted probability of outcome \(i\) in match \(m\) and \(d_{mi}\) equals 1 if outcome \(i\) actually occurred and 0 otherwise. Note: for the Brier score, the rule is the lower, the better!
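A direct transcription of the formula (a sketch, assuming probs is an M x 3 matrix of predicted probabilities with columns “H”, “D”, “A”, and actual a character vector of the observed outcomes):

# Multiclass Brier score: mean squared distance between the predicted
# probabilities and the one-hot encoding of the observed outcomes
brier_score <- function(probs, actual) {
  d <- t(sapply(actual, function(r) as.numeric(colnames(probs) == r)))
  mean(rowSums((probs - d)^2))
}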

Although the BS is an overall measure of performance (like accuracy), I found it interesting to compute it (and to plot it) matchday by matchday:

BS_ts<- read.csv("BS_ts.csv")
BS_ts %>% 
  pivot_longer(cols=c(BS_offline,BS_online),
               names_to = "BS",
               values_to = "Value") %>%
  ggplot(aes(x=matchday,y=Value,color=BS))+
  geom_line(linewidth=1)+
  ggtitle("Brier score for predictions mathday by matchday")

From the Brier score plot, we observe that both approaches yield very similar results in terms of predictive performance. However, it is important to note that, in general, these quantitative measures do not provide the full picture. The ideal approach is to consider the entirety of our analysis, integrating various aspects, both quantitative and qualitative.

Final rankings simulations

In the previous section we highlighted how an ideal approach should integrate various aspects, both quantitative and qualitative. As an example, I did something (hopefully) interesting to evaluate and compare the models: I implemented a simulation of the final rankings. Starting from the situation at the halfway point of the season, I began simulating matches: with the model fitted after matchday \(m\), I made predictions for matchday \(m+1\). Thanks to the Bayesian approach, I obtained a distribution of the goal differences and, from this distribution, it was relatively straightforward to derive a distribution of the points for each team. Proceeding in this manner over all the matchdays, I obtained the distribution of the final points at the end of the season. Having a distribution allowed me to provide the expected number of points at the end of the season, along with an uncertainty band for the result. This approach offers a comprehensive view of the potential outcomes, capturing the inherent uncertainty and variability in our two models.

To perform this task, I implemented a custom function for retrieving the final points distribution:

final_pts_distribution<- function(teams,start=20,end=38,
                             n_chains=4,n_iters=11000,n_warmup=1000,
                             models_dir=NULL) {
  "
  Function to simulate the points at the end of the season basing on coef estimates
  Returns a df s.t.
    - each column is associated to a team
    - nrows: n_chains*(n_iters-n_warmup) to be friendly with the MCMC posteriors
  "
  if(is.null(models_dir) || !file.exists(models_dir)){
    stop("Error! Models directory path not provided or does not exist")
  }
  #-----------------------------------------------------------------------------
  # Rankings after the first half of the league
  n_teams=length(teams)
  teams_pts <- data.frame(matrix(NA,nrow = n_chains*(n_iters-n_warmup),ncol=n_teams))
  colnames(teams_pts)=teams
  for (t in 1:n_teams){
    n_wins= SerieA_data[1:190,] %>% 
      filter(((ht==t) &(FTHG>FTAG)) | ((at==t) & (FTHG<FTAG)))%>% nrow() 
    n_draws= SerieA_data[1:190,] %>% 
      filter(((ht==t) | (at==t)) & (FTHG == FTAG))%>% nrow() 
    teams_pts[,t]= 3*n_wins+n_draws
  }
  #-----------------------------------------------------------------------------
  # Iteration over the 2nd half
  for (m in start:end){
    cat(paste0("...Simulation of matchday n. ",m,"...\n"))
    #...........................................................................
    # (1) Get the current matchday to predict
    test_set=SerieA_data[(10*m -9):(10*m),]
    test_set=na.omit(test_set) #just to manage postponed matches in last batch
    #...........................................................................
    # (2) Load the most recent model and some of its stuff
    cat("...Loading the model and retrieving useful info...\n")
    load(paste0(models_dir,"matchday",m-1,"/KN_matchday",m-1,".rds"))
    posterior<- as.array(KN_model)
    n_iters=dim(posterior)[1]
    n_chains=dim(posterior)[2]
    mu = posterior[,,"mu"]
    home=posterior[,,"home_advantage"]
    p = posterior[,,"p"]
    #...........................................................................
    # (3) Make predictions
    cat("...Computing predictions and updating rankings...\n")
    for(mm in 1:nrow(test_set)){
      ht=test_set$ht[mm]
      at=test_set$at[mm]
      attH=posterior[,,paste0("att[",ht,"]")]
      defH=posterior[,,paste0("def[",ht,"]")]
      attA=posterior[,,paste0("att[",at,"]")]
      defA=posterior[,,paste0("def[",at,"]")]
      theta_H = exp(mu+home+attH+defA)
      theta_A = exp(mu+attA+defH)
      GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
      #.........................................................................
      # (4) Calculate points according to predicted goal diffs
      pts_ht = ifelse(GD > 0, 3, ifelse(GD < 0, 0, 1))
      pts_at = ifelse(GD < 0, 3, ifelse(GD > 0, 0, 1))
      #.........................................................................
      # (5) Sum them to the previous points
      teams_pts[,ht] = teams_pts[,ht]+pts_ht
      teams_pts[,at] = teams_pts[,at]+pts_at
      #.........................................................................
    }
    cat("-------------------------------------------------\n")
  }
  #-----------------------------------------------------------------------------
  # (6) Return the df
  return(teams_pts)
}
final_pts_offline<- final_pts_distribution(teams,start=20,end=38,
                                           n_chains=N_CHAINS,
                                           n_iters=N_ITERS,
                                           n_warmup=N_WARMUP,
                                           models_dir =OFFLINE_MODELS_DIR)

final_pts_online<- final_pts_distribution(teams,start=20,end=38,
                                           n_chains=N_CHAINS,
                                           n_iters=N_ITERS,
                                           n_warmup=N_WARMUP,
                                           models_dir =ONLINE_MODELS_DIR)
offline_rankings<- data.frame(Team= teams,
                         Pts_mean=NA,
                         Pts_mode=NA,
                         Pts_lower.95=NA,
                         Pts_lower.50=NA,
                         Pts_upper.50=NA,
                         Pts_upper.95=NA)

online_rankings<- data.frame(Team= teams,
                         Pts_mean=NA,
                         Pts_mode=NA,
                         Pts_lower.95=NA,
                         Pts_lower.50=NA,
                         Pts_upper.50=NA,
                         Pts_upper.95=NA)

for(t in 1:n_teams){
  
  # Update rankings for offline model
  offline_rankings$Pts_mean[t]<- final_pts_offline[,t] %>% mean() %>% round()
  offline_rankings$Pts_mode[t]<- final_pts_offline[,t] %>% DescTools::Mode() %>% mean() %>% round() #mean just to manage multimodal case
  offline_rankings$Pts_lower.95[t]<- final_pts_offline[,t] %>% quantile(probs = 0.025) %>% round()
  offline_rankings$Pts_lower.50[t]<- final_pts_offline[,t] %>% quantile(probs = 0.25) %>% round()
  offline_rankings$Pts_upper.50[t]<- final_pts_offline[,t] %>% quantile(probs = 0.75) %>% round()
  offline_rankings$Pts_upper.95[t]<- final_pts_offline[,t] %>% quantile(probs = 0.975) %>% round()
  
  # Update rankings for online model
  online_rankings$Pts_mean[t]<- final_pts_online[,t] %>% mean() %>% round()
  online_rankings$Pts_mode[t]<- final_pts_online[,t] %>% DescTools::Mode() %>% mean() %>% round() #mean just to manage multimodal case
  online_rankings$Pts_lower.95[t]<- final_pts_online[,t] %>% quantile(probs = 0.025) %>% round()
  online_rankings$Pts_lower.50[t]<- final_pts_online[,t] %>% quantile(probs = 0.25) %>% round()
  online_rankings$Pts_upper.50[t]<- final_pts_online[,t] %>% quantile(probs = 0.75) %>% round()
  online_rankings$Pts_upper.95[t]<- final_pts_online[,t] %>% quantile(probs = 0.975) %>% round()
}
table_offline<- offline_rankings %>% 
  arrange(desc(Pts_mode)) %>%
  mutate(Position = row_number()) %>%
  select(Position,Team,Pts_mean,Pts_mode,Pts_lower.95,Pts_lower.50,Pts_upper.50,Pts_upper.95) %>%
  kbl(caption = "Serie A 2021-2022 simulated rankings (offline models)", align = 'c') %>%
  kable_styling(font_size = 12, full_width = F) %>%
  row_spec(0, background = "#FFFFFF", color = "#000000",bold=T) %>%
  # Champions League
  row_spec(1:4, background = "#8AD1F3") %>% 
  # Europa League
  row_spec(5:6, background = "#EEA256") %>%
  # Conference League
  row_spec(7, background = "#E1D625") %>%  
  # Relegation
  row_spec((nrow(offline_rankings)-2):nrow(offline_rankings), background = "#FF9E94") %>%
  row_spec(1:nrow(offline_rankings), extra_css = "border-top: 1px solid black;")
table_online<- online_rankings %>% 
  arrange(desc(Pts_mode)) %>%
  mutate(Position = row_number()) %>%
  select(Position,Team,Pts_mean,Pts_mode,Pts_lower.95,Pts_lower.50,Pts_upper.50,Pts_upper.95) %>%
  kbl(caption = "Serie A 2021-2022 simulated rankings (online models)", align = 'c') %>%
  kable_styling(font_size = 12, full_width = F) %>%
  row_spec(0, background = "#FFFFFF", color = "#000000",bold=T) %>%
  # Champions League
  row_spec(1:4, background = "#8AD1F3") %>% 
  # Europa League
  row_spec(5:6, background = "#EEA256") %>%
  # Conference League
  row_spec(7, background = "#E1D625") %>%  
  # Relegation
  row_spec((nrow(online_rankings)-2):nrow(online_rankings), background = "#FF9E94") %>%
  row_spec(1:nrow(online_rankings), extra_css = "border-top: 1px solid black;")
Serie A 2021-2022 simulated rankings (offline models)

Position  Team         Pts_mean  Pts_mode  Pts_lower.95  Pts_lower.50  Pts_upper.50  Pts_upper.95
       1  Inter              88        88            79            85            91            97
       2  Milan              80        80            70            77            83            89
       3  Napoli             78        78            68            74            81            87
       4  Atalanta           72        73            62            68            75            82
       5  Juventus           69        68            59            65            72            78
       6  Roma               63        64            53            60            67            73
       7  Lazio              62        62            52            58            65            71
       8  Fiorentina         62        62            52            59            66            72
       9  Torino             53        53            43            49            56            63
      10  Verona             52        52            42            48            55            62
      11  Bologna            49        49            40            46            53            59
      12  Sassuolo           49        48            39            45            52            59
      13  Empoli             47        47            38            44            51            57
      14  Udinese            47        47            37            44            51            57
      15  Sampdoria          39        39            31            36            43            49
      16  Spezia             32        32            24            29            35            41
      17  Venezia            30        30            23            28            33            39
      18  Cagliari           26        25            19            23            29            35
      19  Genoa              22        21            16            20            24            29
      20  Salernitana        20        19            13            17            22            28
Serie A 2021-2022 simulated rankings (online models)

Position  Team         Pts_mean  Pts_mode  Pts_lower.95  Pts_lower.50  Pts_upper.50  Pts_upper.95
       1  Inter              90        90            81            87            93            98
       2  Milan              80        81            71            77            84            89
       3  Napoli             79        78            69            75            82            87
       4  Atalanta           73        73            64            70            77            83
       5  Juventus           69        69            59            65            72            78
       6  Roma               63        63            53            60            67            73
       7  Fiorentina         63        62            53            59            66            72
       8  Lazio              62        61            52            58            65            71
       9  Torino             53        54            44            50            57            63
      10  Verona             51        51            42            48            55            61
      11  Bologna            50        50            40            46            53            59
      12  Empoli             48        48            39            45            51            58
      13  Sassuolo           49        48            39            45            52            58
      14  Udinese            47        47            37            43            50            57
      15  Sampdoria          39        39            31            36            42            49
      16  Spezia             31        31            23            28            34            40
      17  Venezia            30        29            23            27            33            38
      18  Cagliari           26        25            19            23            28            34
      19  Genoa              20        20            15            18            22            27
      20  Salernitana        18        17            12            16            20            26

We can also plot the final points distributions and compare them with the actually observed points:

real_points= data.frame(
  Team=teams,
  Pts=NA
)
for (t in 1:n_teams){
  n_wins= SerieA_data %>% filter(((ht==t) &(FTHG>FTAG)) | ((at==t) & (FTHG<FTAG)))%>% nrow() 
  n_draws= SerieA_data %>% filter(((ht==t) | (at==t)) & (FTHG == FTAG))%>% nrow()
  real_points$Pts[t]= 3*n_wins+n_draws
}

Upon reviewing the simulated rankings generated by the two models against the actual Serie A 2021-2022 standings, several key observations can be made:

  • Overall Picture

    • Both models effectively captured the overall situation of the league. Specifically, they identified the correct trend for many teams, even though the exact positions were not always correct.
    • Lower Table: The points for lower-ranked teams were generally underestimated, as these teams secured more points than predicted.
    • Upper Table: The points for top-ranked teams were slightly overestimated but not significantly.
  • Qualifications to the European competitions

    The models’ predictions for the top 7 positions, which determine qualification for the European competitions (Champions League, UEFA Europa League, and Conference League), showed notable accuracy. In particular, both the offline model and the online model correctly identified 6 out of 7 teams. However, the exact rankings of these teams were not as accurate. For instance, both models predicted Inter to top the league, while Milan actually won the championship. This discrepancy can be attributed to the competitive nature of the season; additionally, Milan’s final matches were much more challenging than Inter’s, which could explain why the models predicted Inter to win.

  • Some notable examples

    • Inter: Predicted as the top team by both models but actually finished second. The season was highly competitive, and Milan’s more challenging end-of-season schedule was a significant factor in the outcome.
    • Atalanta: Predicted to qualify for the Champions League in both models. However, after a strong first half of the season, Atalanta experienced a significant drop in form, which the models did not account for.
    • Relegation Zone: both models accurately predicted Genoa and Cagliari as relegated teams. Venezia was predicted to avoid relegation, whereas in reality it did not. Additionally, both models predicted Salernitana to finish last. Given the team’s position at the 32nd matchday, when it sat at the bottom of the table and seemed destined for relegation, this prediction was reasonable (Salernitana’s miraculous escape under coach Davide Nicola defied everyone’s expectations).

Nevertheless, it is important to note that football is inherently unpredictable. Our models, despite their simplicity, managed to capture the broader trends effectively, which is admirable considering the inherent randomness and volatility of this beautiful sport.


Conclusions and further developments

In summary, this project demonstrated the utility of a Bayesian approach in sports analytics, providing a solid foundation for further exploration and development. The model’s simplicity and interpretative power make it a valuable tool for understanding football dynamics, while future enhancements could both reduce its computational cost and improve its predictive capabilities.