Introduction
Predicting football match outcomes has long intrigued statisticians and sports analysts. The inherently random nature of sporting events, coupled with various influencing factors such as team strength, player performance, and home advantage, makes this a challenging yet fascinating problem for statistical modeling.
In previous projects, I have explored goal-based models such as those proposed by Maher and by Dixon and Coles. Maher’s model describes the outcome of a match by modeling the goals of the two teams independently using Poisson distributions. Dixon and Coles enhanced this basic model by introducing correlation between the teams’ goal-scoring processes and accounting for the home-advantage effect. While these models have been widely used thanks to their simplicity and effectiveness, they rely on the assumption that goals follow a Poisson distribution. This assumption, however, can be questionable in certain leagues, where overdispersion (the sample variance exceeding the sample mean) has been observed in the number of goals.
To address this limitation, Karlis and Ntzoufras proposed an alternative approach by focusing on the goal difference rather than the individual goal counts of each team. Their method utilizes the Skellam distribution, which eliminates the need to account for the correlation between teams’ scores directly and avoids the assumption of Poisson marginals. Although the Skellam-based model cannot predict exact match scores, it offers valuable insights into the likely goal difference, thereby simplifying the complexity associated with modeling individual goal counts. The application of Bayesian methods in this context allows for the incorporation of prior knowledge and continuous updating of model parameters as new data becomes available, enhancing the predictive power and adaptability of the model.
Project structure
📂 BayesianExam/
│
├── 📂 data/
│ └── 📂 season_2122/
│ ├── 📊 Bundesliga1_2122.csv
│ ├── 📊 Bundesliga2_2122.csv
│ ├── 📊 Championship_2122.csv
│ ├── 📊 LaLiga1_2122.csv
│ ├── 📊 LaLiga2_2122.csv
│ ├── 📊 Ligue1_2122.csv
│ ├── 📊 Ligue2_2122.csv
│ ├── 📊 PremierLeague_2122.csv
│ ├── 📊 SerieA_2122.csv
│ └── 📊 SerieB_2122.csv
│
├── 📂 estimated_models/
│ └── 📂 season_2122/
│ ├── 📂 offline_models/
│ │ ├── 📂 matchday19/
| | ├── ...
│ │ └── 📂 matchday38/
│ │
│ └── 📂 online_models/
│ ├── 📂 matchday19/
| ├── ...
│ └── 📂 matchday38/
│
├── 📂 report/
│
├── 📂 stan/
| ├── ⚙️ karlis-ntzoufras.stan
│ └── ⚙️️️ online.stan
|
├── 📂 utils/
| ├── HDA_probabilities.R
| ├── final_pts_distribution.R
| ├── get_all_teams_data.R
| ├── plot_confusionmatrix.R
| └── plot_parameters_ts.R
|
├── 📂 web-scraping/
│ └── 🔍 scraper.py
|
├── BrierScore.R
├── ConfusionMatrix.R
├── DataAnalysis.R
├── DynamicEstimation.R
├── ExploratoryDataAnalysis.R
├── FinalRankingSimulation.R
├── ModelChecking.R
├── OnlineLearning.R
└── Predictions.R
This report presents the findings of a comprehensive analysis of Serie A football league data for the 2021-2022 season. Given the inherent complexity of the analysis at hand, my project is structured into several directories, each serving a specific purpose in the data collection, analysis, and modeling process.
- data/ is organised in season-specific subfolders, containing the raw data collected from football-data.co.uk.
- web-scraping/ contains a Python web scraper (scraper.py) to automatically retrieve and download the datasets for all major European leagues. By executing the scraper from the terminal and providing the desired season, users can download the datasets for analysis (an example of usage is sketched after this list).
- stan/ contains the Stan models used for statistical modeling. Two models are included:
  - karlis-ntzoufras.stan: the standard version proposed by the authors; from now on in the report, we will refer to it as the “offline” model;
  - online.stan: the online-learning framework version of the model.
- utils/ contains some custom helper functions utilized throughout the analysis process.
- report/ contains the final report file, where the results of the analysis are documented and presented.
- estimated_models/ contains the fitted models as .rds files. Like the data/ directory, it is organized in season-specific subfolders. Since I wanted to store the “offline” models separately from the “online” ones, I further increased the depth of the subfolders, and I also stored the models by matchday (I hope the layout is intuitive).
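A plausible invocation of the scraper, assuming the season identifier is passed as a plain command-line argument (the exact interface of scraper.py may differ):

python web-scraping/scraper.py 2122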
In addition to these directories, various R
scripts are
available, each corresponding to different stages of the analysis
pipeline. These scripts facilitate data cleaning, exploration, modeling,
and visualization, contributing to a structured and reproducible
workflow. The following sections detail the methodology employed, the
results obtained from statistical modeling, and insights derived from
the analysis of Serie A 2021-2022 data.
Exploratory Data Analysis
Before delving into model construction, it is crucial to gain a comprehensive understanding of the data. The exploratory data analysis (EDA) phase, through its visualizations, not only provides a glimpse into the underlying dynamics but also informs the modeling approach by highlighting factors that influence match outcomes. By analyzing various aspects of the data, such as goal differences, result distribution, and team performance, we lay the groundwork for developing our models.
Data import and cleaning:
# (dplyr and stringr are assumed to be loaded in the setup chunk)
# Import the raw data and keep only the relevant columns
SerieA_data <- read.csv(file = paste0(DATA_DIR, "season_", SEASON, "/SerieA_", SEASON, ".csv"))
SerieA_data <- SerieA_data %>% select(c("HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"))
# Goal difference: home goals minus away goals
SerieA_data <- SerieA_data %>% mutate(GD = FTHG - FTAG)
teams <- unique(SerieA_data$HomeTeam)
n_games <- nrow(SerieA_data)
n_teams <- length(teams)
n_matchdays <- ceiling(n_games / (n_teams / 2))
# Numeric indices of the home and away team of each game
ht <- match(SerieA_data$HomeTeam, teams)
at <- match(SerieA_data$AwayTeam, teams)
SerieA_data$ht <- ht
SerieA_data$at <- at
# Remove spaces from team names (handy for labels and column names)
teams <- str_replace_all(teams, " ", "")
The initial variable I decided to inspect was, of course, the goal difference, as it represents our main phenomenon of interest. The plot below shows the empirical data, overlays it with theoretical Skellam distributions (both the standard and a zero-inflated version), and highlights how well the theoretical distributions align with the observed data.
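As a rough sketch of how such an overlay can be produced (the method-of-moments estimates below are my own illustration, not the report’s original code):

library(skellam)
# Method of moments: E[GD] = lambda1 - lambda2, Var[GD] = lambda1 + lambda2
m <- mean(SerieA_data$GD)
v <- var(SerieA_data$GD)
lambda1 <- (v + m) / 2
lambda2 <- (v - m) / 2
# Empirical relative frequencies vs the fitted Skellam pmf
x <- min(SerieA_data$GD):max(SerieA_data$GD)
emp <- as.numeric(table(factor(SerieA_data$GD, levels = x))) / n_games
plot(x, emp, type = "h", xlab = "Goal difference", ylab = "Relative frequency")
points(x, dskellam(x, lambda1, lambda2), pch = 19)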
Next, we investigate the distribution of results (encoded as Home-win, Draw, and Away-win) across the entire season. As depicted in the plot below, HomeWin emerges as the majority class. This observation informs our modeling approach, where we will introduce a home effect parameter to account for the slight advantage of playing matches in one’s own stadium.
Another interesting aspect to explore is the goal statistics by team. In particular, by examining the goals scored and conceded by each team, we gain valuable insight into the likely ranking of the attack and defense coefficients in our upcoming model. Additionally, distinguishing between home and away matches underscores the importance of incorporating the home effect as a model parameter.
# Preliminary steps required to make plots
home_scored <- aggregate(SerieA_data$FTHG, list(SerieA_data$HomeTeam), FUN = sum)
away_scored <- aggregate(SerieA_data$FTAG, list(SerieA_data$AwayTeam), FUN = sum)
home_conceded <- aggregate(SerieA_data$FTAG, list(SerieA_data$HomeTeam), FUN = sum)
away_conceded <- aggregate(SerieA_data$FTHG, list(SerieA_data$AwayTeam), FUN = sum)
colnames(home_scored) <- c("team", "home_scored")
colnames(away_scored) <- c("team", "away_scored")
colnames(home_conceded) <- c("team", "home_conceded")
colnames(away_conceded) <- c("team", "away_conceded")
goals_scored <- merge(home_scored, away_scored, by = "team")
goals_scored$tot_goals <- goals_scored$home_scored + goals_scored$away_scored
goals_conceded <- merge(home_conceded, away_conceded, by = "team")
goals_conceded$tot_goals <- goals_conceded$home_conceded + goals_conceded$away_conceded
Comment: only five teams (out of twenty) scored more goals away than at home. Except for Napoli, which scored the same number of goals at home and away, all other teams had better offensive performances when playing in their own stadium. Symmetrically, only six teams conceded more goals at home than away; all other teams showed better defensive performances when playing in their own stadium (Inter, Salernitana, and Udinese actually conceded the same number of goals at home and away).
The existence of a home effect is also suggested by the useful results achieved by each team, split between home and away outcomes:
useful_results_df <- SerieA_data %>%
  group_by(HomeTeam) %>%
  summarize(UsefulResultsHome = sum(FTR %in% c("H", "D"))) %>%
  left_join(SerieA_data %>%
              group_by(AwayTeam) %>%
              summarize(UsefulResultsAway = sum(FTR %in% c("D", "A"))),
            by = c("HomeTeam" = "AwayTeam")) %>%
  rename(Team = HomeTeam)
Finally, an interesting insight that can be extracted from our data lies in the Final Rankings of the league:
SerieA_data <- SerieA_data %>%
  mutate(HomePoints = ifelse(FTR == "H", 3,
                             ifelse(FTR == "D", 1, 0)),
         AwayPoints = ifelse(FTR == "A", 3,
                             ifelse(FTR == "D", 1, 0)))
total_points <- SerieA_data %>%
  select(HomeTeam, AwayTeam, FTHG, FTAG, HomePoints, AwayPoints) %>%
  group_by(Team = HomeTeam) %>%
  summarise(TotalPoints = sum(HomePoints),
            GS = sum(FTHG),
            GC = sum(FTAG)) %>%
  bind_rows(
    SerieA_data %>%
      group_by(Team = AwayTeam) %>%
      summarise(TotalPoints = sum(AwayPoints),
                GS = sum(FTAG),
                GC = sum(FTHG))
  ) %>%
  group_by(Team) %>%
  summarise(TotalPoints = sum(TotalPoints),
            GS = sum(GS),
            GC = sum(GC)) %>%
  arrange(desc(TotalPoints)) %>%
  mutate(Position = row_number(),
         GD = GS - GC) %>%
  select(Position, Team, TotalPoints, GS, GC, GD)
Position | Team | TotalPoints | GS | GC | GD |
---|---|---|---|---|---|
1 | Milan | 86 | 69 | 31 | 38 |
2 | Inter | 84 | 84 | 32 | 52 |
3 | Napoli | 79 | 74 | 31 | 43 |
4 | Juventus | 70 | 57 | 37 | 20 |
5 | Lazio | 64 | 77 | 58 | 19 |
6 | Roma | 63 | 59 | 43 | 16 |
7 | Fiorentina | 62 | 59 | 51 | 8 |
8 | Atalanta | 59 | 65 | 48 | 17 |
9 | Verona | 53 | 65 | 59 | 6 |
10 | Sassuolo | 50 | 64 | 66 | -2 |
11 | Torino | 50 | 46 | 41 | 5 |
12 | Udinese | 47 | 61 | 58 | 3 |
13 | Bologna | 46 | 44 | 55 | -11 |
14 | Empoli | 41 | 50 | 70 | -20 |
15 | Sampdoria | 36 | 46 | 63 | -17 |
16 | Spezia | 36 | 41 | 71 | -30 |
17 | Salernitana | 31 | 33 | 78 | -45 |
18 | Cagliari | 30 | 34 | 68 | -34 |
19 | Genoa | 28 | 27 | 60 | -33 |
20 | Venezia | 27 | 34 | 69 | -35 |
Model Formulation
As mentioned in the Introduction, we can use the Skellam distribution to model goal difference of a football match. More precisely, the goal difference of the match between teams \(i\) and \(j\) (where we adopt the convention of \(i\) being the home team) is described by: \[\begin{equation} GD_{i,j} \sim Skellam\left(\theta^{(H)}_{(i,j)},\theta^{(A)}_{(i,j)}\right) \end{equation}\] with: \[\begin{align*} \theta^{(H)}_{(i,j)} &= \exp\{\mu + h_{eff} + att_{(i)} + def_{(j)}\}\\[0.2cm] \theta^{(A)}_{(i,j)} &= \exp\{\mu + att_{(j)} + def_{(i)}\} \end{align*}\]
Under the above parametrization, all parameters have a straightforward interpretation:
- \(h_{eff}\) is the home effect, which describes the advantage for a team of playing the match at its stadium;
- \(\mu\) is a constant, representing the average “ability” (combining both attack and defence at the same time) of a team in the analyzed league;
- \(att_{(t)}\) represents the attack strength of team \(t\);
- \(def_{(t)}\) represents the defence weakness of team \(t\).
Some notes:
- Since we consider a “teams’ population intercept” (\(\mu\)), \(att_{(t)}\) and \(def_{(t)}\) should be interpreted as deviations of a team’s net attacking and defensive abilities from those of a team of average performance;
- In order to ensure the identifiability of the model, some constraints are imposed over the attack and defence coefficients: \[\sum_{t=1}^{T} {att_{t}} = 0 \hspace{0.5cm};\hspace{0.5cm}\sum_{t=1}^{T} {def_{t}} = 0 \]
- The probability function of our Skellam distribution is: \[f_{S}\left(x,\theta_1,\theta_2\right)= e^{-(\theta_1+\theta_2)} \left(\frac{\theta_1}{\theta_2}\right)^{x/2} I_{|x|}\left(2\sqrt{\theta_1\theta_2}\right)\] with \(x \in \mathbb{Z}\), \(\theta_1,\theta_2 >0\), and where \(I_{|x|}(\bullet)\) is the modified Bessel function of the first kind of order \(|x|\).
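As a quick numerical sanity check of this formula, here is a sketch in R using base R’s besselI() and, as a reference, the dskellam() function from the skellam package:

library(skellam)
# Direct evaluation of the Skellam pmf via the modified Bessel function of the first kind
skellam_pmf <- function(x, theta1, theta2) {
  exp(-(theta1 + theta2)) * (theta1 / theta2)^(x / 2) *
    besselI(2 * sqrt(theta1 * theta2), nu = abs(x))
}
skellam_pmf(1, 1.5, 1.1)   # formula above
dskellam(1, 1.5, 1.1)      # reference value from the skellam package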
In my previous experience with goal-based models, I noticed that the number of draws (and the corresponding probability) tends to be underestimated. To address this, I explored the idea, proposed by Karlis and Ntzoufras, of using a zero-inflated version of the Skellam distribution. This approach seemed promising, so the probability distribution I referred to is the following:
\[\begin{align*} f_{ZIS}\left(x,\theta_1,\theta_2,p\right) = \begin{cases} (1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x\neq0$}\\[0.3cm] p+(1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x=0$} \end{cases} \end{align*}\] with \(x \in \mathbb{Z},\theta_1,\theta_2 >0\) and \(p \in [0,1]\)
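Reusing skellam_pmf() from the sketch above, the zero-inflated pmf can be written in R as:

# Zero-inflated Skellam pmf: extra point mass p at zero
zis_pmf <- function(x, theta1, theta2, p) {
  base <- skellam_pmf(x, theta1, theta2)
  ifelse(x == 0, p + (1 - p) * base, (1 - p) * base)
}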
To fully specify a Bayesian model, we need to define prior distributions for the model parameters. In my analysis, I chose not to incorporate any external information from previous seasons. In such cases, Karlis and Ntzoufras suggested using normal prior distributions with mean zero and a large variance (e.g. \(10^4\)) to express prior ignorance. However, since the Stan documentation discourages non-informative priors, I set my priors to be weakly informative (still assuming normality):
\[\begin{align*} \mu \sim N(0,10)\\[0.2cm] H_{eff} \sim N(0,10)\\[0.2cm] att_{(\bullet)} \sim N(0,10)\\[0.2cm] def_{(\bullet)} \sim N(0,10)\\[0.2cm] p \sim U(0,1) \end{align*}\]
In addition to the standard approach, I considered an online-learning approach. In the Bayesian framework, we typically start with a prior distribution and update it with observed data, in order to obtain a posterior distribution. When new data becomes available, this posterior can serve as the prior for the next update and so on. Despite concerns about specifying overly informative priors, I was curious to compare the online learning approach with the traditional “offline” model, where the information that updates our knowledge derives (almost) only from the data.
Note: in principle, we don’t know the functional form of the posterior distribution. However, in my analysis I assumed my priors to be normal, updating only their location and scale according to the matchday-by-matchday estimates.
The formulation of the online model has, of course, the same likelihood as the offline one; only the prior specification differs. More precisely, at a certain matchday \(m\), the priors are the following:
\[\begin{align*} \mu &\sim N\big(MAP(\mu)^{(m-1)},\sigma_{\mu}^{(m-1)}\big)\\[0.2cm] H_{eff} &\sim N\big(MAP(H_{eff})^{(m-1)},\sigma_{H_{eff}}^{(m-1)}\big)\\[0.2cm] att_{(\bullet)} &\sim N\big(MAP(att_{(\bullet)}\;)^{(m-1)},\sigma_{att_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] def_{(\bullet)} &\sim N\big(MAP(def_{(\bullet)}\;)^{(m-1)},\sigma_{def_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] p &\sim U(0,1) \end{align*}\]
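The report does not show how these hyperparameters are extracted from the previous fit; a minimal sketch, where the MAP is approximated by the peak of a kernel density estimate of the posterior draws (get_map() is my own helper, not part of the original code):

# Approximate the MAP as the mode of a kernel density estimate of the draws
get_map <- function(draws) {
  d <- density(draws)
  d$x[which.max(d$y)]
}
posterior   <- as.array(KN_model)            # fit from matchday m-1
prev_mu_MAP <- get_map(as.vector(posterior[, , "mu"]))
prev_mu_sd  <- sd(as.vector(posterior[, , "mu"]))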
Model implementation in Stan
The models have been fully implemented using the Stan ecosystem and managed in R through the rstan library. I decided to carefully follow the procedure illustrated by Karlis and Ntzoufras in their article, setting the following MCMC parameters:
- N_CHAINS = 4
- N_ITERS = 11000
- N_WARMUP = 1000
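For concreteness, a sketch of a single fitting call with these settings (STAN_DIR and the data-list construction are my assumptions; the report does not show this chunk):

library(rstan)
stan_data <- list(n_teams = n_teams, n_games = n_games,
                  home_team = SerieA_data$ht, away_team = SerieA_data$at,
                  goal_difference = SerieA_data$GD)
KN_model <- stan(file = paste0(STAN_DIR, "karlis-ntzoufras.stan"),
                 data = stan_data,
                 chains = N_CHAINS, iter = N_ITERS, warmup = N_WARMUP)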
In what follows, the main blocks of the Stan files are illustrated.
Offline model
Functions: since Stan does not provide the log-pmf of a Skellam, I had to define my own custom functions:

functions {
  real skellam_lpmf(int k, real lambda1, real lambda2) {
    real total;
    real log_prob;
    total = (- lambda1 - lambda2) + (log(lambda1) - log(lambda2)) * k / 2;
    // For integer orders, I_{-k} = I_k, so passing k directly is fine
    log_prob = total + log(modified_bessel_first_kind(k, 2 * sqrt(lambda1 * lambda2)));
    return log_prob;
  }
  real zero_inflated_skellam_lpmf(int k, real lambda1, real lambda2, real p) {
    real base_prob;
    real prob;
    real log_prob;
    base_prob = exp(skellam_lpmf(k | lambda1, lambda2));
    if (k == 0) {
      prob = p + (1 - p) * base_prob;   // extra point mass at zero
    } else {
      prob = (1 - p) * base_prob;
    }
    log_prob = log(prob);
    return log_prob;
  }
}
Data:
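The data block for the offline model presumably mirrors the online one shown later, minus the prior hyperparameters; a minimal sketch:

data {
  int<lower=1> n_teams;
  int<lower=1> n_games;
  array[n_games] int<lower=1, upper=n_teams> home_team;
  array[n_games] int<lower=1, upper=n_teams> away_team;
  array[n_games] int goal_difference;
}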
Parameters and transformed parameters:
parameters {
  real<lower=0, upper=1> p;
  real mu;
  real home_advantage;
  array[n_teams-1] real att_raw;
  array[n_teams-1] real def_raw;
}
transformed parameters {
  // Sum-to-zero constraint
  array[n_teams] real att;
  array[n_teams] real def;
  for (t in 1:(n_teams-1)) {
    att[t] = att_raw[t];
    def[t] = def_raw[t];
  }
  att[n_teams] = -sum(att_raw);
  def[n_teams] = -sum(def_raw);
}
Model:
model {
  array[n_games] real theta_H;
  array[n_games] real theta_A;
  // Priors
  p ~ uniform(0, 1);
  att_raw ~ normal(0, 10);
  def_raw ~ normal(0, 10);
  home_advantage ~ normal(0, 10);
  mu ~ normal(0, 10);
  // Likelihood
  for (g in 1:n_games) {
    theta_H[g] = exp(mu + home_advantage + att[home_team[g]] + def[away_team[g]]);
    theta_A[g] = exp(mu + att[away_team[g]] + def[home_team[g]]);
    goal_difference[g] ~ zero_inflated_skellam(theta_H[g], theta_A[g], p);
  }
}
Generated quantities: I couldn’t exploit the generated quantities block, since Stan does not provide a random number generator for the Skellam distribution (a sort of skellam_rng()). Users usually bypass this limitation by generating two values from a Poisson (with poisson_rng()) and taking their difference, but this would clash with our setup, in which we don’t assume Poisson marginals. Fortunately, the generated quantities can easily be computed in R after the model has been fitted! I opted for this latter approach, relying on the skellam library rather than reinventing the wheel.
Technical Note: the skellam library provides a rskellam() function, but the documentation clearly specifies that it works internally by generating two random Poisson values (with rpois()), which is exactly the mechanism I wanted to avoid :( Nevertheless, I could easily implement a custom Skellam generator through the inverse-sampling method, exploiting the qskellam() function from the same library, as shown below:
library(skellam)
#-------------------------------------------------------------------------------
# 1) Random generator for the standard Skellam (inverse-transform sampling)
my_rskellam <- function(n, lambda1, lambda2) {
  q <- runif(n, min = 0, max = 1)
  return(qskellam(q, lambda1, lambda2))
}
#-------------------------------------------------------------------------------
# 2) Random generator for the zero-inflated Skellam
my_rzeroinflatedskellam <- function(n, lambda1, lambda2, p) {
  samples <- vector(mode = "numeric", length = n)
  for (i in 1:n) {
    # With probability (1-p), draw from a standard Skellam;
    # otherwise keep the pre-filled zero (the inflated point mass)
    if (runif(1) > p) {
      samples[i] <- my_rskellam(1, lambda1, lambda2)
    }
    # else not needed, since the vector is already initialized with zeros!
  }
  return(samples)
}
#-------------------------------------------------------------------------------
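A quick sanity check of the custom generator (my own illustration): for a Skellam distribution, the mean is \(\theta_1-\theta_2\) and the variance is \(\theta_1+\theta_2\).

set.seed(42)
draws <- my_rskellam(10000, lambda1 = 1.4, lambda2 = 1.1)
mean(draws)   # should be close to 1.4 - 1.1 = 0.3
var(draws)    # should be close to 1.4 + 1.1 = 2.5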
Online model
The Stan code for the online model is similar to the previous one, with the major differences lying in the data block and in the prior definitions (model block):
data {
int<lower=1> n_teams;
int<lower=1> n_games;
array[n_games] int<lower=1, upper=n_teams> home_team;
array[n_games] int<lower=1, upper=n_teams> away_team;
array[n_games] int goal_difference;
// Previous estimates and sd
array[n_teams-1] real prev_att_MAP;
array[n_teams-1] real prev_def_MAP;
real prev_mu_MAP;
real prev_home_advantage_MAP;
array[n_teams-1] real<lower=0> prev_att_sd;
array[n_teams-1] real<lower=0> prev_def_sd;
real<lower=0> prev_mu_sd;
real<lower=0> prev_home_advantage_sd;
}
model {
...
// Priors
p ~ uniform(0, 1);
for(a in 1:(n_teams-1)){
att_raw[a] ~ normal(prev_att_MAP[a], prev_att_sd[a]);
def_raw[a] ~ normal(prev_def_MAP[a], prev_def_sd[a]);
}
home_advantage ~ normal(prev_home_advantage_MAP, prev_home_advantage_sd);
mu ~ normal(prev_mu_MAP, prev_mu_sd);
...
}
Disclaimer:
- Due to the extensive time required to fit the models, I am not including the full code here (it is not particularly interesting, though). However, I have fitted all the models and stored them in a dedicated folder, with the possibility of loading each of them when required.
- The Bayesian procedure of fitting the model matchday by matchday is, in practice, carried out on a weekly basis. With access to the complete dataset for the analyzed season, I was able to fit the models in a for loop, iterating over each matchday (a sketch follows the list below). The procedure for each iteration is the following:
- Select the rows of the dataframe corresponding to the available information up to the current matchday.
- Load the model estimated at the end of the previous matchday (from the models folder).
- Fit the model and store the updated model in the dedicated folder.
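A hedged sketch of that loop for the online models, reusing the get_map() helper sketched earlier (rstan is assumed to be loaded; the exact data-list construction in the original scripts may differ):

for (m in 20:n_matchdays) {
  # (1) games observed up to the current matchday (10 games per matchday)
  obs <- SerieA_data[1:(10 * m), ]
  # (2) previous fit: its posterior summaries become the new priors
  load(paste0(ONLINE_MODELS_DIR, "matchday", m - 1, "/KN_matchday", m - 1, ".rds"))
  posterior <- as.array(KN_model)
  att_draws <- sapply(1:(n_teams - 1), function(t) as.vector(posterior[, , paste0("att_raw[", t, "]")]))
  def_draws <- sapply(1:(n_teams - 1), function(t) as.vector(posterior[, , paste0("def_raw[", t, "]")]))
  stan_data <- list(
    n_teams = n_teams, n_games = nrow(obs),
    home_team = obs$ht, away_team = obs$at, goal_difference = obs$GD,
    prev_att_MAP = apply(att_draws, 2, get_map), prev_att_sd = apply(att_draws, 2, sd),
    prev_def_MAP = apply(def_draws, 2, get_map), prev_def_sd = apply(def_draws, 2, sd),
    prev_mu_MAP  = get_map(as.vector(posterior[, , "mu"])),
    prev_mu_sd   = sd(as.vector(posterior[, , "mu"])),
    prev_home_advantage_MAP = get_map(as.vector(posterior[, , "home_advantage"])),
    prev_home_advantage_sd  = sd(as.vector(posterior[, , "home_advantage"]))
  )
  # (3) refit and store the updated model
  KN_model <- stan(file = paste0(STAN_DIR, "online.stan"), data = stan_data,
                   chains = N_CHAINS, iter = N_ITERS, warmup = N_WARMUP)
  save(KN_model, file = paste0(ONLINE_MODELS_DIR, "matchday", m, "/KN_matchday", m, ".rds"))
}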
Posterior distributions of the coefficients
After fitting my Bayesian models, I used the bayesplot
package to generate informative plots of the posterior distributions of
the model parameters. To provide a comprehensive overview, I will
present the plots for the “final model”, which is the one fitted at the
end of the season, with all the available data. These plots will be
shown for both the offline and online models, with a clear distinction
between the attack parameters and the defense parameters.
Plots for offline models
load(paste0(OFFLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<- names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_offline<- mcmc_intervals(posterior,pars=att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC intervals for attack coefficients (offline)")
att_areas_offline<- mcmc_areas(posterior,pars= att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC areas for attack coefficients (offline)")
att_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=att_params)+
ggtitle("Marginal posteriors for attack coefficients by chain (offline)")
att_dens_overlay_offline[[1]]$Parameter = rep(att_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-----------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_offline<- mcmc_intervals(posterior,pars=def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC intervals for defense coefficients (offline)")
def_areas_offline<- mcmc_areas(posterior,pars= def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC areas for defense coefficients (offline)")
def_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=def_params)+
ggtitle("Marginal posteriors for defense coefficients by chain (offline)")
def_dens_overlay_offline[[1]]$Parameter = rep(def_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_offline<- mcmc_areas(posterior,pars="home_advantage")+
ggtitle("MCMC areas for home advantage (offline)")
home_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars="home_advantage")+
ggtitle("Marginal posteriors for home advantage by chain (offline)")
Plots for online models
load(paste0(ONLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<- names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_online<- mcmc_intervals(posterior,pars=att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC intervals for attack coefficients (online)")
att_areas_online<- mcmc_areas(posterior,pars= att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC areas for attack coefficient (online)")
att_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=att_params)+
ggtitle("Marginal posteriors for attack coefficients by chain (online)")
att_dens_overlay_online[[1]]$Parameter = rep(att_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_online<- mcmc_intervals(posterior,pars=def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC intervals for defense coefficients (online)")
def_areas_online<- mcmc_areas(posterior,pars= def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC areas for defense coefficients (online)")
def_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=def_params)+
ggtitle("Marginal posteriors for defense coefficients by chain (online)")
def_dens_overlay_online[[1]]$Parameter = rep(def_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_online<- mcmc_areas(posterior,pars="home_advantage")+
ggtitle("MCMC areas for home advantage (online)")
home_dens_overlay_online<- mcmc_dens_overlay(posterior,pars="home_advantage")+
ggtitle("Marginal posteriors for home advantage by chain (online)")
MCMC Intervals plots
MCMC Areas plots
Marginal posteriors (by chain)
Home advantage
Teams comparison
Abilities over time
Teams’ abilities over time (offline model)
Teams’ abilities over time (online model)
A glimpse at convergence diagnostics
The training phase of the models seemed promising, as the MCMC procedure completed successfully without any warnings or errors. To assess the convergence of the chains, we can inspect the model summaries and look at the Gelman-Rubin statistic \(\hat{R}\) (Rhat), which provides a quantitative measure of the chains’ convergence to the stationary distribution.
Offline model
## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1;
## post-warmup draws per chain=10000, total post-warmup draws=40000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## mu -0.19 0 0.11 -0.40 -0.26 -0.19 -0.11 0.02 27901 1
## home_advantage 0.12 0 0.07 -0.01 0.08 0.12 0.17 0.26 47969 1
## att[1] 0.68 0 0.17 0.33 0.56 0.68 0.80 1.02 36874 1
## att[2] 0.06 0 0.23 -0.40 -0.09 0.07 0.22 0.51 40797 1
## att[3] -0.25 0 0.29 -0.84 -0.44 -0.24 -0.05 0.29 40810 1
## att[4] -0.10 0 0.24 -0.58 -0.25 -0.09 0.07 0.35 38297 1
## att[5] -0.05 0 0.26 -0.58 -0.22 -0.04 0.13 0.44 42691 1
## att[6] 0.49 0 0.22 0.04 0.34 0.49 0.64 0.91 35893 1
## att[7] 0.50 0 0.18 0.15 0.39 0.51 0.62 0.85 37597 1
## att[8] 0.28 0 0.21 -0.14 0.14 0.29 0.43 0.69 40854 1
## att[9] -0.73 0 0.40 -1.57 -0.98 -0.71 -0.46 -0.02 39762 1
## att[10] 0.29 0 0.24 -0.19 0.13 0.29 0.45 0.74 38666 1
## att[11] 0.30 0 0.21 -0.12 0.16 0.30 0.44 0.71 44720 1
## att[12] 0.67 0 0.19 0.29 0.54 0.67 0.80 1.04 35146 1
## att[13] 0.53 0 0.21 0.12 0.39 0.53 0.67 0.93 37462 1
## att[14] -0.04 0 0.21 -0.47 -0.18 -0.04 0.10 0.36 42073 1
## att[15] -1.28 0 0.52 -2.41 -1.61 -1.24 -0.92 -0.35 35030 1
## att[16] 0.29 0 0.23 -0.18 0.14 0.30 0.45 0.73 35889 1
## att[17] 0.29 0 0.19 -0.08 0.16 0.29 0.42 0.65 42743 1
## att[18] -0.51 0 0.39 -1.33 -0.77 -0.50 -0.24 0.20 37348 1
## att[19] -0.39 0 0.32 -1.05 -0.60 -0.38 -0.17 0.20 40775 1
## att[20] -1.03 0 0.43 -1.93 -1.30 -1.01 -0.74 -0.26 37245 1
## def[1] -0.54 0 0.41 -1.42 -0.80 -0.51 -0.25 0.20 35246 1
## def[2] -0.13 0 0.26 -0.66 -0.30 -0.13 0.05 0.35 39088 1
## def[3] 0.22 0 0.20 -0.18 0.09 0.22 0.36 0.61 41929 1
## def[4] -0.28 0 0.27 -0.82 -0.46 -0.28 -0.10 0.22 39696 1
## def[5] 0.18 0 0.22 -0.25 0.04 0.18 0.33 0.60 39619 1
## def[6] 0.41 0 0.23 -0.05 0.25 0.41 0.56 0.85 36563 1
## def[7] -0.65 0 0.39 -1.48 -0.90 -0.63 -0.38 0.06 37040 1
## def[8] -0.13 0 0.27 -0.69 -0.31 -0.12 0.06 0.39 40010 1
## def[9] 0.28 0 0.19 -0.09 0.16 0.29 0.41 0.66 39832 1
## def[10] 0.54 0 0.19 0.16 0.42 0.54 0.67 0.91 37093 1
## def[11] -0.14 0 0.28 -0.71 -0.32 -0.13 0.05 0.39 44269 1
## def[12] 0.34 0 0.24 -0.14 0.19 0.35 0.51 0.80 36020 1
## def[13] 0.37 0 0.23 -0.08 0.22 0.37 0.53 0.80 39245 1
## def[14] -0.85 0 0.35 -1.58 -1.07 -0.83 -0.61 -0.20 43395 1
## def[15] 0.11 0 0.20 -0.27 -0.02 0.11 0.25 0.50 45973 1
## def[16] 0.29 0 0.23 -0.16 0.14 0.29 0.44 0.72 35305 1
## def[17] -1.11 0 0.46 -2.10 -1.40 -1.08 -0.79 -0.28 35375 1
## def[18] 0.54 0 0.18 0.19 0.42 0.54 0.66 0.89 37538 1
## def[19] 0.33 0 0.19 -0.04 0.21 0.34 0.46 0.70 39414 1
## def[20] 0.20 0 0.19 -0.17 0.07 0.20 0.32 0.56 37993 1
##
## Samples were drawn using NUTS(diag_e) at Thu May 23 16:24:21 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Online model
## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1;
## post-warmup draws per chain=10000, total post-warmup draws=40000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## mu -0.32 0 0.02 -0.35 -0.33 -0.32 -0.31 -0.29 57363 1
## home_advantage 0.15 0 0.02 0.12 0.14 0.15 0.16 0.18 55709 1
## att[1] 0.78 0 0.03 0.71 0.75 0.78 0.80 0.84 54325 1
## att[2] 0.11 0 0.05 0.02 0.08 0.11 0.15 0.21 57006 1
## att[3] 0.01 0 0.05 -0.09 -0.02 0.01 0.05 0.12 54203 1
## att[4] 0.03 0 0.05 -0.06 0.00 0.03 0.06 0.12 54559 1
## att[5] 0.21 0 0.05 0.11 0.17 0.21 0.24 0.30 51632 1
## att[6] 0.37 0 0.05 0.27 0.33 0.37 0.40 0.47 50910 1
## att[7] 0.49 0 0.04 0.42 0.46 0.49 0.51 0.56 57498 1
## att[8] 0.44 0 0.04 0.36 0.41 0.44 0.47 0.52 53710 1
## att[9] -0.64 0 0.08 -0.79 -0.69 -0.64 -0.59 -0.49 51317 1
## att[10] 0.26 0 0.05 0.16 0.22 0.26 0.29 0.36 58589 1
## att[11] 0.40 0 0.04 0.32 0.37 0.40 0.43 0.48 58019 1
## att[12] 0.79 0 0.04 0.72 0.76 0.79 0.81 0.86 52849 1
## att[13] 0.57 0 0.04 0.49 0.54 0.57 0.60 0.65 57625 1
## att[14] 0.11 0 0.04 0.02 0.08 0.11 0.14 0.20 54968 1
## att[15] -2.18 0 0.12 -2.41 -2.26 -2.18 -2.10 -1.95 50041 1
## att[16] 0.01 0 0.05 -0.09 -0.02 0.01 0.05 0.12 53968 1
## att[17] 0.40 0 0.04 0.32 0.37 0.40 0.43 0.48 49641 1
## att[18] -0.69 0 0.09 -0.87 -0.75 -0.69 -0.63 -0.52 51054 1
## att[19] -0.42 0 0.07 -0.56 -0.47 -0.42 -0.38 -0.29 51023 1
## att[20] -1.03 0 0.22 -1.47 -1.18 -1.03 -0.88 -0.61 62394 1
## def[1] -0.62 0 0.07 -0.75 -0.66 -0.62 -0.57 -0.49 56236 1
## def[2] -0.16 0 0.05 -0.26 -0.20 -0.17 -0.13 -0.07 48733 1
## def[3] 0.41 0 0.04 0.33 0.38 0.41 0.43 0.48 51272 1
## def[4] -0.25 0 0.05 -0.35 -0.28 -0.25 -0.21 -0.15 53831 1
## def[5] 0.41 0 0.04 0.33 0.38 0.41 0.43 0.48 53237 1
## def[6] 0.34 0 0.04 0.25 0.31 0.34 0.37 0.42 53148 1
## def[7] -0.83 0 0.06 -0.95 -0.87 -0.83 -0.79 -0.71 50939 1
## def[8] 0.01 0 0.05 -0.08 -0.02 0.01 0.04 0.10 52022 1
## def[9] 0.32 0 0.04 0.25 0.30 0.32 0.35 0.39 56688 1
## def[10] 0.53 0 0.04 0.45 0.50 0.53 0.55 0.60 50807 1
## def[11] -0.31 0 0.06 -0.42 -0.35 -0.31 -0.28 -0.21 54624 1
## def[12] 0.43 0 0.04 0.34 0.40 0.43 0.45 0.51 52231 1
## def[13] 0.30 0 0.04 0.22 0.27 0.30 0.33 0.39 57413 1
## def[14] -0.98 0 0.06 -1.11 -1.02 -0.98 -0.94 -0.86 53674 1
## def[15] 0.06 0 0.04 -0.01 0.04 0.06 0.09 0.14 49967 1
## def[16] -0.10 0 0.05 -0.19 -0.13 -0.10 -0.07 0.00 53460 1
## def[17] -0.78 0 0.06 -0.90 -0.82 -0.78 -0.74 -0.66 50585 1
## def[18] 0.62 0 0.03 0.56 0.60 0.62 0.65 0.69 59261 1
## def[19] 0.36 0 0.04 0.29 0.33 0.36 0.38 0.43 53456 1
## def[20] 0.25 0 0.13 -0.01 0.16 0.25 0.34 0.50 64327 1
##
## Samples were drawn using NUTS(diag_e) at Thu May 23 15:21:00 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
In the summaries, all the Rhat values are equal to 1, indicating convergence of the chains (remember the rule of thumb: \(\hat{R}< 1.1\)). Additionally, the effective sample size (n_eff) values are high, which is a positive indicator. Notably, the n_eff values for the online model are generally higher than those for the offline model, which I somehow expected. Interestingly, many n_eff values (all of them, actually, for the online model) are greater than the total number of post-warmup draws. At first, this seemed paradoxical to me. However, after reading various discussions on the Stan developers forum, I found the following answer: “Hi! It means that the draws that Stan is producing are better than independent draws for those parameters. Sounds like a joke, but apparently Stan is actually that awesome. :)”
Thanks to the bayesplot package, we can also draw the traceplots and visually inspect them for divergent transitions. In our case, no chain diverged, so the plots are not particularly interesting. However, if there had been any divergences, they would be evident in the plots below. As an example, I’ll present only the traceplots for the attack and defense coefficients of the online model:
## No divergences to plot.
## No divergences to plot.
After examining the Gelman-Rubin statistic and the effective sample size, another important convergence diagnostic is the autocorrelation function (ACF) plot, which helps to inspect the autocorrelation within the chain samples. Upon reviewing all the ACF plots, I did not observe any particularly concerning patterns. To provide a comprehensive overview, I will present ACF plots for just two parameters: one with greater variability (the defense coefficient of Venezia) and one with less variability (the attack coefficient of Inter). These examples illustrate that, regardless of the variability, there are no significant autocorrelation patterns, indicating good convergence properties of the model.
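These plots can be produced with bayesplot’s mcmc_acf(); the parameter indices below are placeholders, since the actual positions of Inter’s attack and Venezia’s defense depend on the ordering of teams:

# ACF up to lag 20 for two illustrative parameters
mcmc_acf(posterior, pars = c("att[1]", "def[20]"), lags = 20)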
Model checking and comparison
So far we have constructed a Bayesian model, computed the posterior distribution of all estimates, and conducted diagnostic tests —both quantitative and qualitative— regarding the convergence of the chains. Now, our attention turns towards evaluating the fit of the model to the data and our substantive knowledge.
Posterior predictive checks (in-sample replications)
As illustrated in the Bayesian Data Analysis book, a key technique for assessing model fit involves drawing simulated values from the posterior predictive distribution and comparing them to the observed data. More specifically, we’ll now perform an in-sample replication analysis, assessing whether the model adequately captures the observed data by checking the plausibility of the observed data under the posterior predictive distribution. If the model fits well, replicated data generated under its framework should closely resemble the observed data, while marked discrepancies between replicated and observed data may indicate model misfit. In short, the observed data should look plausible under the posterior predictive distribution.
The model checking analysis can be effectively conducted in a graphical/qualitative manner. Fortunately, our friend bayesplot offers a suite of graphical tools to assess the posterior predictive distribution for our in-sample analysis. Before delving into the graphical model checking plots, a preliminary step is required: due to the limitations of the generated quantities block in Stan discussed earlier, it was necessary to simulate these quantities in R (as shown in the chunks below).
Note: in conducting the in-sample replication procedure, we assess the plausibility of the observed data relative to replications drawn from the model trained on that same data. While it is possible to perform this check matchday by matchday, I opted to use our most comprehensive model, the one estimated after the final matchday of the season. With this approach, the replications draw on the entirety of the season’s data, providing a robust assessment of model fit across the entire dataset.
PPC for offline models
#-------------------------------------------------------------------------------
# Dataframe where to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(OFFLINE_MODELS_DIR,"matchday",n_matchdays,
"/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
cat(m,"\n")
ht=SerieA_data$ht[m]
at=SerieA_data$at[m]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
GD_df[,m]<- GD
colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("teal")
ppc_dens_overlay_offline <- ppc_dens_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:4000,]))
ppc_ecdf_overlay_offline <- ppc_ecdf_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:10000,]))
ppc_stat_mean_offline<- ppc_stat(y = SerieA_data$GD,
yrep = as.matrix(GD_df[1:40000,]),
stat = "mean",bins = 20)
propzero <- function(x) mean(x == 0)
ppc_stat_propzero_offline <- ppc_stat(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = "propzero",bins = 20)
ppc_stat_2d_offline <- ppc_stat_2d(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = c("mean","sd"))
PPC for online models
#-------------------------------------------------------------------------------
# Dataframe where to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(ONLINE_MODELS_DIR,"matchday",n_matchdays,
"/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
ht=SerieA_data$ht[m]
at=SerieA_data$at[m]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
GD_df[,m]<- GD
colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("purple")
ppc_dens_overlay_online <- ppc_dens_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:4000,]))
ppc_ecdf_overlay_online<- ppc_ecdf_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:10000,]))
ppc_stat_mean_online<- ppc_stat(y = SerieA_data$GD,
yrep = as.matrix(GD_df[1:40000,]),
stat = "mean",bins = 20)
propzero <- function(x) mean(x == 0)
ppc_stat_propzero_online <- ppc_stat(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = "propzero",bins = 20)
ppc_stat_2d_online <- ppc_stat_2d(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = c("mean","sd"))
Note: here, the plots on the left (right) side refer to the offline (online) model!
Posterior predictive checks (out-of-sample replications)
Having completed the in-sample posterior predictive checks, we now turn our attention to out-of-sample replication. This process involves testing the model on future observable values or vectors of new observable quantities. Specifically, for our football league analysis, I employed a sequential prediction strategy based on matchdays. Recall that we initially estimated models matchday by matchday, starting from the end of the first half of the league.
With the model fitted using data up to matchday \(m\), I used it to predict the outcomes of matchday \(m+1\) (which are future, unknown data!). The Bayesian approach, with its chain samples, offers a significant advantage over classical statistics: instead of estimating a single value (either as a point estimate or with confidence intervals), we can obtain a distribution for our quantities of interest. In this case, using the chain samples, I simulated the goal differences for each match, obtaining numerous samples from the distribution of goal differences for a given match.
Such a distribution then allowed me to assign probability values to the final outcomes, encoded as “H” (Home win), “D” (Draw), and “A” (Away win). I selected the most frequent class as the guess for each match’s final outcome and, by repeating this for all matches in the second half of the league, I constructed confusion matrices to compare the models’ accuracy. Additionally, I computed the Brier score matchday by matchday to further assess the predictive power of my models.
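For a single match, with GD holding the simulated goal differences, this boils down to (a sketch of my own):

outcome_probs <- c(H = mean(GD > 0), D = mean(GD == 0), A = mean(GD < 0))
predicted_outcome <- names(which.max(outcome_probs))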
Accuracy and Brier Score
Let’s start from the accuracy:
Offline model:
SerieA_data$PredictedGD=NA
for (m in 20:n_matchdays){
cat(paste0("...Simulation of matchday n. ",m,"...\n"))
#-------------------------------------------------
# (1) Get the current matchday to predict
test_set=SerieA_data[(10*m -9):(10*m),]
#-------------------------------------------------
# (2) Load the most recent model and some of its stuff
cat("...Loading the model and retrieving useful info...\n")
load(paste0(OFFLINE_MODELS_DIR,"matchday",m-1,"/KN_matchday",m-1,".rds"))
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------
# (3) Make predictions
cat("...Computing predictions and updating rankings...\n")
PredictedGD<- vector(mode="numeric",length = nrow(test_set))
for(mm in 1:nrow(test_set)){
ht=test_set$ht[mm]
at=test_set$at[mm]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
PredictedGD[mm]<- GD %>% mean() %>% round()
}
# (4) Add predicted GD to the dataset
SerieA_data$PredictedGD[(10*m -9):(10*m)]<- PredictedGD
cat("-------------------------------------------------\n")
}
ConfusionMatrix_offline_df<- SerieA_data %>% slice(191:n_games) %>% select(GD,PredictedGD)
ConfusionMatrix_offline_df<- ConfusionMatrix_offline_df %>% mutate(
Actual = ifelse(GD > 0, "H", ifelse(GD < 0, "A", "D")),
Predicted = ifelse(PredictedGD > 0, "H", ifelse(PredictedGD < 0, "A", "D"))
)
CM_offline <- confusion_matrix(
targets = factor(ConfusionMatrix_offline_df$Actual, levels = c("H", "D", "A")),
predictions = factor(ConfusionMatrix_offline_df$Predicted, levels = c("H", "D", "A"))
)
Online model
The same procedure can be performed for the online model (we just need to replace OFFLINE_MODELS_DIR with ONLINE_MODELS_DIR in the load() function), so I am not going to show the full code again. In the end, we get a confusion matrix for the online model as well:
ConfusionMatrix_online_df<- ConfusionMatrix_online_df %>% mutate(
Actual = ifelse(GD > 0, "H", ifelse(GD < 0, "A", "D")),
Predicted = ifelse(PredictedGD > 0, "H", ifelse(PredictedGD < 0, "A", "D"))
)
CM_online <- confusion_matrix(
targets = factor(ConfusionMatrix_online_df$Actual, levels = c("H", "D", "A")),
predictions = factor(ConfusionMatrix_online_df$Predicted, levels = c("H", "D", "A"))
)
Comparison of the two Confusion matrices:
## >> Balanced accuracy of the offline model: 0.6412566
## >> Balanced accuracy of the online model : 0.6337596
The two confusion matrices highlight the similarity in predictions between the models. Notably, the (balanced) accuracy of the offline model is slightly higher than that of the online model. However, keep in mind that while accuracy is an important measure for assessing a model, it is not the sole factor in determining its suitability.
Additionally, the similarity in performance between the models is further evidenced by their Brier scores, an evaluation metric defined as:
\[BS = \frac{1}{M} \sum_{m=1}^{M}{\sum_{i}{(p_{mi} - d_{mi})^2}} \quad \text{with } i\in \{H,D,A\}\] where \(p_{mi}\) is the predicted probability of outcome \(i\) in match \(m\), and \(d_{mi}\) equals 1 if outcome \(i\) actually occurred in match \(m\) and 0 otherwise. Note: for the Brier score, the rule is the lower, the better!
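A compact sketch of this computation (my own illustration; probs is assumed to be an \(M \times 3\) matrix of predicted probabilities whose columns follow the factor levels of outcome):

brier_score <- function(probs, outcome) {
  # One-hot indicator matrix d_mi built from the observed outcomes,
  # with columns in the order of the factor levels
  d <- model.matrix(~ outcome - 1)
  mean(rowSums((probs - d)^2))
}
# Example: two matches with observed outcomes H and D
probs   <- rbind(c(0.5, 0.3, 0.2), c(0.4, 0.35, 0.25))
outcome <- factor(c("H", "D"), levels = c("H", "D", "A"))
brier_score(probs, outcome)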
Despite the BS being an overall measure of performance (like accuracy), I found it interesting to compute it (and plot it) matchday by matchday:
BS_ts <- read.csv("BS_ts.csv")
BS_ts %>%
  pivot_longer(cols = c(BS_offline, BS_online),
               names_to = "BS",
               values_to = "Value") %>%
  ggplot(aes(x = matchday, y = Value, color = BS)) +
  geom_line(linewidth = 1) +
  ggtitle("Brier score for predictions matchday by matchday")
Following the Brier score plot, we observe that both approaches yield very similar results in terms of predictive performance. However, it is important to notice that, in general, these quantitative measures do not provide the full picture. The ideal approach is to consider the entirety of our analysis, integrating various aspects, both quantitative and qualitative.
Final rankings simulations
In the previous section we highlighted how an ideal approach should integrate various aspects, both quantitative and qualitative. As an example, I did something (hopefully) interesting to evaluate and compare the models: I implemented a simulation of the final rankings. Starting from the situation at the halfway point of the season, I began simulating matches. With the model fitted after the \(m\)-th matchday, I made predictions for matchday \(m+1\). Thanks to the Bayesian approach, I obtained a distribution of the goal differences and, from this distribution, it was relatively straightforward to derive a distribution of the points for each team. By proceeding in this manner over all the matchdays, I obtained the distribution of the final points at the end of the season. Having a distribution allowed me to provide the expected number of points at the end of the season, along with a credible band for the result. This approach offers a comprehensive view of the potential outcomes, capturing the inherent uncertainty and variability in our two models.
To perform this task, I implemented a custom function for retrieving the final points distribution:
final_pts_distribution<- function(teams,start=20,end=38,
n_chains=4,n_iters=11000,n_warmup=1000,
models_dir=NULL) {
"
Function to simulate the points at the end of the season basing on coef estimates
Returns a df s.t.
- each column is associated to a team
- nrows: n_chains*(n_iters-n_warmup) to be friendly with the MCMC posteriors
"
if(is.null(models_dir) || !file.exists(models_dir)){
stop("Error! Models directory path not provided or does not exist")
}
#-----------------------------------------------------------------------------
# Rankings after the first half of the league
n_teams=length(teams)
teams_pts <- data.frame(matrix(NA,nrow = n_chains*(n_iters-n_warmup),ncol=n_teams))
colnames(teams_pts)=teams
for (t in 1:n_teams){
n_wins= SerieA_data[1:190,] %>%
filter(((ht==t) &(FTHG>FTAG)) | ((at==t) & (FTHG<FTAG)))%>% nrow()
n_draws= SerieA_data[1:190,] %>%
filter(((ht==t) | (at==t)) & (FTHG == FTAG))%>% nrow()
teams_pts[,t]= 3*n_wins+n_draws
}
#-----------------------------------------------------------------------------
# Iteration over the 2nd half
for (m in start:end){
cat(paste0("...Simulation of matchday n. ",m,"...\n"))
#...........................................................................
# (1) Get the current matchday to predict
test_set=SerieA_data[(10*m -9):(10*m),]
test_set=na.omit(test_set) #just to manage postponed matches in last batch
#...........................................................................
# (2) Load the most recent model and some of its stuff
cat("...Loading the model and retrieving useful info...\n")
load(paste0(models_dir,"matchday",m-1,"/KN_matchday",m-1,".rds"))
posterior<- as.array(KN_model)
n_iters=dim(posterior)[1]
n_chains=dim(posterior)[2]
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#...........................................................................
# (3) Make predictions
cat("...Computing predictions and updating rankings...\n")
for(mm in 1:nrow(test_set)){
ht=test_set$ht[mm]
at=test_set$at[mm]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
#.........................................................................
# (4) Calculate points according to predicted goal diffs
pts_ht = ifelse(GD > 0, 3, ifelse(GD < 0, 0, 1))
pts_at = ifelse(GD < 0, 3, ifelse(GD > 0, 0, 1))
#.........................................................................
# (5) Sum them to the previous points
teams_pts[,ht] = teams_pts[,ht]+pts_ht
teams_pts[,at] = teams_pts[,at]+pts_at
#.........................................................................
}
cat("-------------------------------------------------\n")
}
#-----------------------------------------------------------------------------
# (6) Return the df
return(teams_pts)
}
final_pts_offline<- final_pts_distribution(teams,start=20,end=38,
n_chains=N_CHAINS,
n_iters=N_ITERS,
n_warmup=N_WARMUP,
models_dir =OFFLINE_MODELS_DIR)
final_pts_online<- final_pts_distribution(teams,start=20,end=38,
n_chains=N_CHAINS,
n_iters=N_ITERS,
n_warmup=N_WARMUP,
models_dir =ONLINE_MODELS_DIR)
offline_rankings<- data.frame(Team= teams,
Pts_mean=NA,
Pts_mode=NA,
Pts_lower.95=NA,
Pts_lower.50=NA,
Pts_upper.50=NA,
Pts_upper.95=NA)
online_rankings<- data.frame(Team= teams,
Pts_mean=NA,
Pts_mode=NA,
Pts_lower.95=NA,
Pts_lower.50=NA,
Pts_upper.50=NA,
Pts_upper.95=NA)
for(t in 1:n_teams){
# Update rankings for offline model
offline_rankings$Pts_mean[t]<- final_pts_offline[,t] %>% mean() %>% round()
offline_rankings$Pts_mode[t]<- final_pts_offline[,t] %>% DescTools::Mode() %>% mean() %>% round() #mean just to manage multimodal case
offline_rankings$Pts_lower.95[t]<- final_pts_offline[,t] %>% quantile(probs = 0.025) %>% round()
offline_rankings$Pts_lower.50[t]<- final_pts_offline[,t] %>% quantile(probs = 0.25) %>% round()
offline_rankings$Pts_upper.50[t]<- final_pts_offline[,t] %>% quantile(probs = 0.75) %>% round()
offline_rankings$Pts_upper.95[t]<- final_pts_offline[,t] %>% quantile(probs = 0.975) %>% round()
# Update rankings for online model
online_rankings$Pts_mean[t]<- final_pts_online[,t] %>% mean() %>% round()
online_rankings$Pts_mode[t]<- final_pts_online[,t] %>% DescTools::Mode() %>% mean() %>% round() #mean just to manage multimodal case
online_rankings$Pts_lower.95[t]<- final_pts_online[,t] %>% quantile(probs = 0.025) %>% round()
online_rankings$Pts_lower.50[t]<- final_pts_online[,t] %>% quantile(probs = 0.25) %>% round()
online_rankings$Pts_upper.50[t]<- final_pts_online[,t] %>% quantile(probs = 0.75) %>% round()
online_rankings$Pts_upper.95[t]<- final_pts_online[,t] %>% quantile(probs = 0.975) %>% round()
}
table_offline<- offline_rankings %>%
arrange(desc(Pts_mode)) %>%
mutate(Position = row_number()) %>%
select(Position,Team,Pts_mean,Pts_mode,Pts_lower.95,Pts_lower.50,Pts_upper.50,Pts_upper.95) %>%
kbl(caption = "Serie A 2021-2022 simulated rankings (offline models)", align = 'c') %>%
kable_styling(font_size = 12, full_width = F) %>%
row_spec(0, background = "#FFFFFF", color = "#000000",bold=T) %>%
# Champions League
row_spec(1:4, background = "#8AD1F3") %>%
# Europa League
row_spec(5:6, background = "#EEA256") %>%
# Conference League
row_spec(7, background = "#E1D625") %>%
# Relegation
row_spec((nrow(offline_rankings)-2):nrow(offline_rankings), background = "#FF9E94") %>%
row_spec(1:nrow(offline_rankings), extra_css = "border-top: 1px solid black;")
table_online<- online_rankings %>%
arrange(desc(Pts_mode)) %>%
mutate(Position = row_number()) %>%
select(Position,Team,Pts_mean,Pts_mode,Pts_lower.95,Pts_lower.50,Pts_upper.50,Pts_upper.95) %>%
kbl(caption = "Serie A 2021-2022 simulated rankings (online models)", align = 'c') %>%
kable_styling(font_size = 12, full_width = F) %>%
row_spec(0, background = "#FFFFFF", color = "#000000",bold=T) %>%
# Champions League
row_spec(1:4, background = "#8AD1F3") %>%
# Europa League
row_spec(5:6, background = "#EEA256") %>%
# Conference League
row_spec(7, background = "#E1D625") %>%
# Relegation
row_spec((nrow(online_rankings)-2):nrow(online_rankings), background = "#FF9E94") %>%
row_spec(1:nrow(online_rankings), extra_css = "border-top: 1px solid black;")
Position | Team | Pts_mean | Pts_mode | Pts_lower.95 | Pts_lower.50 | Pts_upper.50 | Pts_upper.95 |
---|---|---|---|---|---|---|---|
1 | Inter | 88 | 88 | 79 | 85 | 91 | 97 |
2 | Milan | 80 | 80 | 70 | 77 | 83 | 89 |
3 | Napoli | 78 | 78 | 68 | 74 | 81 | 87 |
4 | Atalanta | 72 | 73 | 62 | 68 | 75 | 82 |
5 | Juventus | 69 | 68 | 59 | 65 | 72 | 78 |
6 | Roma | 63 | 64 | 53 | 60 | 67 | 73 |
7 | Lazio | 62 | 62 | 52 | 58 | 65 | 71 |
8 | Fiorentina | 62 | 62 | 52 | 59 | 66 | 72 |
9 | Torino | 53 | 53 | 43 | 49 | 56 | 63 |
10 | Verona | 52 | 52 | 42 | 48 | 55 | 62 |
11 | Bologna | 49 | 49 | 40 | 46 | 53 | 59 |
12 | Sassuolo | 49 | 48 | 39 | 45 | 52 | 59 |
13 | Empoli | 47 | 47 | 38 | 44 | 51 | 57 |
14 | Udinese | 47 | 47 | 37 | 44 | 51 | 57 |
15 | Sampdoria | 39 | 39 | 31 | 36 | 43 | 49 |
16 | Spezia | 32 | 32 | 24 | 29 | 35 | 41 |
17 | Venezia | 30 | 30 | 23 | 28 | 33 | 39 |
18 | Cagliari | 26 | 25 | 19 | 23 | 29 | 35 |
19 | Genoa | 22 | 21 | 16 | 20 | 24 | 29 |
20 | Salernitana | 20 | 19 | 13 | 17 | 22 | 28 |
Position | Team | Pts_mean | Pts_mode | Pts_lower.95 | Pts_lower.50 | Pts_upper.50 | Pts_upper.95 |
---|---|---|---|---|---|---|---|
1 | Inter | 90 | 90 | 81 | 87 | 93 | 98 |
2 | Milan | 80 | 81 | 71 | 77 | 84 | 89 |
3 | Napoli | 79 | 78 | 69 | 75 | 82 | 87 |
4 | Atalanta | 73 | 73 | 64 | 70 | 77 | 83 |
5 | Juventus | 69 | 69 | 59 | 65 | 72 | 78 |
6 | Roma | 63 | 63 | 53 | 60 | 67 | 73 |
7 | Fiorentina | 63 | 62 | 53 | 59 | 66 | 72 |
8 | Lazio | 62 | 61 | 52 | 58 | 65 | 71 |
9 | Torino | 53 | 54 | 44 | 50 | 57 | 63 |
10 | Verona | 51 | 51 | 42 | 48 | 55 | 61 |
11 | Bologna | 50 | 50 | 40 | 46 | 53 | 59 |
12 | Empoli | 48 | 48 | 39 | 45 | 51 | 58 |
13 | Sassuolo | 49 | 48 | 39 | 45 | 52 | 58 |
14 | Udinese | 47 | 47 | 37 | 43 | 50 | 57 |
15 | Sampdoria | 39 | 39 | 31 | 36 | 42 | 49 |
16 | Spezia | 31 | 31 | 23 | 28 | 34 | 40 |
17 | Venezia | 30 | 29 | 23 | 27 | 33 | 38 |
18 | Cagliari | 26 | 25 | 19 | 23 | 28 | 34 |
19 | Genoa | 20 | 20 | 15 | 18 | 22 | 27 |
20 | Salernitana | 18 | 17 | 12 | 16 | 20 | 26 |
We can also plot the final rankings distributions and compare them with the actual observed points:
real_points= data.frame(
Team=teams,
Pts=NA
)
for (t in 1:n_teams){
n_wins= SerieA_data %>% filter(((ht==t) &(FTHG>FTAG)) | ((at==t) & (FTHG<FTAG)))%>% nrow()
n_draws= SerieA_data %>% filter(((ht==t) | (at==t)) & (FTHG == FTAG))%>% nrow()
real_points$Pts[t]= 3*n_wins+n_draws
}
Upon reviewing the simulated rankings generated by the two models against the actual Serie A 2021-2022 standings, several key observations can be made:
Overall Picture
- Both models effectively captured the overall situation of the league. Specifically, they identified the correct trend for many teams, even though the exact positions were not always correct.
- Lower Table: The points for lower-ranked teams were generally underestimated, as these teams secured more points than predicted.
- Upper Table: The points for top-ranked teams were slightly overestimated but not significantly.
Qualifications to the European competitions
The models’ predictions for the top 7 positions, which determine qualification for European competitions (Champions League, UEFA Europa League, and Conference League), showed notable accuracy. In particular, both the offline and the online model correctly identified 6 out of 7 teams. However, the exact rankings of these teams were not as accurate. For instance, both models predicted Inter to top the league, while Milan actually won the championship. This discrepancy can be attributed to the competitive nature of the season; additionally, Milan’s final matches were much more challenging than Inter’s, which could explain why the models predicted Inter to win.
Some notable examples
- Inter: Predicted as the top team by both models but actually finished second. The season was highly competitive, and Milan’s more challenging end-of-season schedule was a significant factor in the outcome.
- Atalanta: Predicted to qualify for the Champions League in both models. However, after a strong first half of the season, Atalanta experienced a significant drop in form, which the models did not account for.
- Relegation Zone: both models accurately predicted Genoa and Cagliari as relegated teams. Venezia was predicted to avoid relegation, whereas in reality they did not. Additionally, both models predicted Salernitana to finish last. Given the team’s position at the 32nd matchday, when they were at the bottom of the table and seemingly destined for relegation, this prediction was reasonable (Salernitana’s miraculous escape under coach Davide Nicola defied everyone’s expectations).
Nevertheless, it is important to note that football is inherently unpredictable. Our models, despite their simplicity, managed to capture the broader trends effectively, which is admirable considering the inherent randomness and volatility of this beautiful sport.
Conclusions and further developments
Model Insights: Despite the simplicity of the model, it provided numerous interesting insights into the Serie A season analyzed. The estimates and predictive checks were generally consistent with real data, demonstrating the model’s robustness.
Bayesian Approach Advantages: Compared to previous projects using goal-based models, the Bayesian approach offered significant advantages. Dealing with distributions of parameters instead of just point estimates (with or without confidence intervals) allowed for:
- Simulating matches with greater detail.
- Estimating probabilities through simulations.
- Checking the goodness of results via posterior predictive checks.
Comparison of Approaches: The comparison between the online and offline approaches revealed similar predictive performances for the analyzed season, in terms of both (balanced) accuracy and Brier score. However, it was particularly evident that the online learning approach provides narrower variability bands in coefficient estimates over time, suggesting more precise and stable estimates.
Future Comparisons: It would be beneficial to compare the two approaches using multiple seasons’ data. My personal guess is that the online approach might outperform the offline approach in terms of quantitative metrics, especially when dealing with evolving data over several seasons.
Interpretative Tool: While the model is simple and not suitable for betting or making precise predictions (mainly due to the inherent randomness of football itself), it serves as an excellent interpretative tool. It captures the dynamics over time and provides valuable insights into the factors influencing team performance. Even more complex models in the literature struggle to make accurate predictions in football, underscoring the sport’s unpredictability.
Further Developments:
- Variational Inference: one of the limitations encountered was the slow sampling process. Incorporating variational inference procedures could significantly speed up the computations. Due to time constraints, only a preliminary exploration of this method was possible, but it is certainly one of the directions I will explore in the coming months.
- Multiple Seasons Analysis: Extending the analysis to multiple seasons might provide a more comprehensive understanding of team performance and model accuracy. This would also test the robustness of the online learning approach in capturing long-term trends.
- Model Enhancements: Future work could also explore enhancing the model’s complexity, such as incorporating more advanced statistical techniques or additional variables that could impact team performance (e.g., player injuries, manager changes, etc.).
In summary, this project demonstrated the utility of a Bayesian approach in sports analytics, providing a solid foundation for further exploration and development. The model’s simplicity and interpretative power make it a valuable tool for understanding football dynamics, while future enhancements could improve both its computational costs and its predictive capabilities.