Introduction
Predicting football match outcomes has long intrigued statisticians and sports analysts. The inherently random nature of sporting events, coupled with various influencing factors such as team strength, player performance, and home advantage, makes this a challenging yet fascinating problem for statistical modeling.
In previous projects, I have explored goal-based models such as those proposed by Maher and by Dixon and Coles. Maher’s model describes the outcome of a match by modeling the goals of the two teams independently using Poisson distributions. Dixon and Coles enhanced this basic model by introducing correlation between the teams’ goal-scoring processes and accounting for the home-advantage effect. While these models have been widely used thanks to their simplicity and effectiveness, they rely on the assumption that goals follow a Poisson distribution. This assumption, however, can be questionable in certain leagues, where overdispersion (the sample variance exceeding the sample mean) has been observed in the number of goals.
To address this limitation, Karlis and Ntzoufras proposed an alternative approach by focusing on the goal difference rather than the individual goal counts of each team. Their method utilizes the Skellam distribution, which eliminates the need to account for the correlation between teams’ scores directly and avoids the assumption of Poisson marginals. Although the Skellam-based model cannot predict exact match scores, it offers valuable insights into the likely goal difference, thereby simplifying the complexity associated with modeling individual goal counts. The application of Bayesian methods in this context allows for the incorporation of prior knowledge and continuous updating of model parameters as new data becomes available, enhancing the predictive power and adaptability of the model.
Project structure
📂 BayesianExam/
│
├── 📂 data/
│ └── 📂 season_2122/
│ ├── 📊 Bundesliga1_2122.csv
│ ├── 📊 Bundesliga2_2122.csv
│ ├── 📊 Championship_2122.csv
│ ├── 📊 LaLiga1_2122.csv
│ ├── 📊 LaLiga2_2122.csv
│ ├── 📊 Ligue1_2122.csv
│ ├── 📊 Ligue2_2122.csv
│ ├── 📊 PremierLeague_2122.csv
│ ├── 📊 SerieA_2122.csv
│ └── 📊 SerieB_2122.csv
│
├── 📂 estimated_models/
│ └── 📂 season_2122/
│ ├── 📂 offline_models/
│ │ ├── 📂 matchday19/
| | ├── ...
│ │ └── 📂 matchday38/
│ │
│ └── 📂 online_models/
│ ├── 📂 matchday19/
| ├── ...
│ └── 📂 matchday38/
│
├── 📂 report/
│
├── 📂 stan/
| ├── ⚙️ karlis-ntzoufras.stan
│ └── ⚙️️️ online.stan
|
├── 📂 utils/
| ├── HDA_probabilities.R
| ├── final_pts_distribution.R
| ├── get_all_teams_data.R
| ├── plot_confusionmatrix.R
| └── plot_parameters_ts.R
|
├── 📂 web-scraping/
│ └── 🔍 scraper.py
|
├── BrierScore.R
├── ConfusionMatrix.R
├── DataAnalysis.R
├── DynamicEstimation.R
├── ExploratoryDataAnalysis.R
├── FinalRankingSimulation.R
├── ModelChecking.R
├── OnlineLearning.R
└── Predictions.R
This report presents the findings of a comprehensive analysis of Serie A football league data for the 2021-2022 season. Given the inherent complexity of the analysis at hand, my project is structured into several directories, each serving a specific purpose in the data collection, analysis, and modeling process.
- data/ is organised in season-specific subfolders, containing the raw data collected from football-data.co.uk.
- web-scraping/ contains a Python web scraper (scraper.py) to automatically retrieve and download the datasets for all major European leagues. By executing the scraper from the terminal and providing the desired season, users can download the datasets for analysis (an example of usage is sketched after this list).
- stan/ contains the Stan models used for statistical modeling. Two models are included:
  - karlis-ntzoufras.stan: the standard version proposed by the authors; from now on in the report, we will refer to it as the “offline” model;
  - online.stan: the online-learning framework version of the model.
- utils/ contains some custom helper functions utilized throughout the analysis process.
- report/ contains the final report file, where the results of the analysis are documented and presented.
- estimated_models/ contains the fitted models as .rds files. Like the data/ directory, it is organized in season-specific subfolders. Since I wanted to store the “offline” models separately from the “online” ones, I further increased the depth of the subfolders, and I also stored the models by matchday (I hope the layout is intuitive).
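A plausible invocation of the scraper, assuming the season identifier is passed as a plain command-line argument (the exact interface of scraper.py may differ):

python web-scraping/scraper.py 2122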
In addition to these directories, various R
scripts are
available, each corresponding to different stages of the analysis
pipeline. These scripts facilitate data cleaning, exploration, modeling,
and visualization, contributing to a structured and reproducible
workflow. The following sections detail the methodology employed, the
results obtained from statistical modeling, and insights derived from
the analysis of Serie A 2021-2022 data.
Exploratory Data Analysis
Before delving into model construction, it is crucial to gain a comprehensive understanding of the data. The exploratory data analysis (EDA) phase, through its visualizations, not only provides a glimpse into the underlying dynamics but also informs the modeling approach by highlighting factors that influence match outcomes. By analyzing various aspects of the data, such as goal differences, result distribution, and team performance, we lay the groundwork for developing our models.
Data import and cleaning:
# (dplyr and stringr are assumed to be loaded in the setup chunk)
# Import the raw data and keep only the relevant columns
SerieA_data <- read.csv(file = paste0(DATA_DIR, "season_", SEASON, "/SerieA_", SEASON, ".csv"))
SerieA_data <- SerieA_data %>% select(c("HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"))
# Goal difference: home goals minus away goals
SerieA_data <- SerieA_data %>% mutate(GD = FTHG - FTAG)
teams <- unique(SerieA_data$HomeTeam)
n_games <- nrow(SerieA_data)
n_teams <- length(teams)
n_matchdays <- ceiling(n_games / (n_teams / 2))
# Numeric indices of the home and away team of each game
ht <- match(SerieA_data$HomeTeam, teams)
at <- match(SerieA_data$AwayTeam, teams)
SerieA_data$ht <- ht
SerieA_data$at <- at
# Remove spaces from team names (handy for labels and column names)
teams <- str_replace_all(teams, " ", "")
The initial variable I decided to inspect was, of course, the goal difference, as it represents our main phenomenon of interest. The plot below shows the empirical data, overlays it with theoretical Skellam distributions (both the standard and a zero-inflated version), and highlights how well the theoretical distributions align with the observed data.
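As a rough sketch of how such an overlay can be produced (the method-of-moments estimates below are my own illustration, not the report’s original code):

library(skellam)
# Method of moments: E[GD] = lambda1 - lambda2, Var[GD] = lambda1 + lambda2
m <- mean(SerieA_data$GD)
v <- var(SerieA_data$GD)
lambda1 <- (v + m) / 2
lambda2 <- (v - m) / 2
# Empirical relative frequencies vs the fitted Skellam pmf
x <- min(SerieA_data$GD):max(SerieA_data$GD)
emp <- as.numeric(table(factor(SerieA_data$GD, levels = x))) / n_games
plot(x, emp, type = "h", xlab = "Goal difference", ylab = "Relative frequency")
points(x, dskellam(x, lambda1, lambda2), pch = 19)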
Next, we investigate the distribution of results (encoded as Home-win, Draw, and Away-win) across the entire season. As depicted in the plot below, HomeWin emerges as the majority class. This observation informs our modeling approach, where we will introduce a home effect parameter to account for the slight advantage of playing matches in one’s own stadium.
Another interesting aspect to explore is the goal statistics by team. In particular, by examining the goals scored and conceded by each team, we gain valuable insight into the likely ranking of the attack and defense coefficients in our upcoming model. Additionally, distinguishing between home and away matches underscores the importance of incorporating the home effect as a model parameter.
# Preliminary steps required to make plots
home_scored <- aggregate(SerieA_data$FTHG, list(SerieA_data$HomeTeam), FUN = sum)
away_scored <- aggregate(SerieA_data$FTAG, list(SerieA_data$AwayTeam), FUN = sum)
home_conceded <- aggregate(SerieA_data$FTAG, list(SerieA_data$HomeTeam), FUN = sum)
away_conceded <- aggregate(SerieA_data$FTHG, list(SerieA_data$AwayTeam), FUN = sum)
colnames(home_scored) <- c("team", "home_scored")
colnames(away_scored) <- c("team", "away_scored")
colnames(home_conceded) <- c("team", "home_conceded")
colnames(away_conceded) <- c("team", "away_conceded")
goals_scored <- merge(home_scored, away_scored, by = "team")
goals_scored$tot_goals <- goals_scored$home_scored + goals_scored$away_scored
goals_conceded <- merge(home_conceded, away_conceded, by = "team")
goals_conceded$tot_goals <- goals_conceded$home_conceded + goals_conceded$away_conceded
Comment: only five teams (out of twenty) scored more goals away than at home. Except for Napoli, which scored the same number of goals at home and away, all other teams had better offensive performances when playing in their own stadium. Symmetrically, only six teams conceded more goals at home than away; all other teams showed better defensive performances when playing in their own stadium (Inter, Salernitana, and Udinese actually conceded the same number of goals at home and away).
The existence of a home effect is also suggested by the useful results achieved by each team, split between home and away outcomes:
useful_results_df <- SerieA_data %>%
  group_by(HomeTeam) %>%
  summarize(UsefulResultsHome = sum(FTR %in% c("H", "D"))) %>%
  left_join(SerieA_data %>%
              group_by(AwayTeam) %>%
              summarize(UsefulResultsAway = sum(FTR %in% c("D", "A"))),
            by = c("HomeTeam" = "AwayTeam")) %>%
  rename(Team = HomeTeam)
Finally, an interesting insight that can be extracted from our data lies in the Final Rankings of the league:
SerieA_data <- SerieA_data %>%
  mutate(HomePoints = ifelse(FTR == "H", 3,
                             ifelse(FTR == "D", 1, 0)),
         AwayPoints = ifelse(FTR == "A", 3,
                             ifelse(FTR == "D", 1, 0)))
total_points <- SerieA_data %>%
  select(HomeTeam, AwayTeam, FTHG, FTAG, HomePoints, AwayPoints) %>%
  group_by(Team = HomeTeam) %>%
  summarise(TotalPoints = sum(HomePoints),
            GS = sum(FTHG),
            GC = sum(FTAG)) %>%
  bind_rows(
    SerieA_data %>%
      group_by(Team = AwayTeam) %>%
      summarise(TotalPoints = sum(AwayPoints),
                GS = sum(FTAG),
                GC = sum(FTHG))
  ) %>%
  group_by(Team) %>%
  summarise(TotalPoints = sum(TotalPoints),
            GS = sum(GS),
            GC = sum(GC)) %>%
  arrange(desc(TotalPoints)) %>%
  mutate(Position = row_number(),
         GD = GS - GC) %>%
  select(Position, Team, TotalPoints, GS, GC, GD)
Position | Team | TotalPoints | GS | GC | GD |
---|---|---|---|---|---|
1 | Milan | 86 | 69 | 31 | 38 |
2 | Inter | 84 | 84 | 32 | 52 |
3 | Napoli | 79 | 74 | 31 | 43 |
4 | Juventus | 70 | 57 | 37 | 20 |
5 | Lazio | 64 | 77 | 58 | 19 |
6 | Roma | 63 | 59 | 43 | 16 |
7 | Fiorentina | 62 | 59 | 51 | 8 |
8 | Atalanta | 59 | 65 | 48 | 17 |
9 | Verona | 53 | 65 | 59 | 6 |
10 | Sassuolo | 50 | 64 | 66 | -2 |
11 | Torino | 50 | 46 | 41 | 5 |
12 | Udinese | 47 | 61 | 58 | 3 |
13 | Bologna | 46 | 44 | 55 | -11 |
14 | Empoli | 41 | 50 | 70 | -20 |
15 | Sampdoria | 36 | 46 | 63 | -17 |
16 | Spezia | 36 | 41 | 71 | -30 |
17 | Salernitana | 31 | 33 | 78 | -45 |
18 | Cagliari | 30 | 34 | 68 | -34 |
19 | Genoa | 28 | 27 | 60 | -33 |
20 | Venezia | 27 | 34 | 69 | -35 |
Model Formulation
As mentioned in the Introduction, we can use the Skellam distribution to model goal difference of a football match. More precisely, the goal difference of the match between teams \(i\) and \(j\) (where we adopt the convention of \(i\) being the home team) is described by: \[\begin{equation} GD_{i,j} \sim Skellam\left(\theta^{(H)}_{(i,j)},\theta^{(A)}_{(i,j)}\right) \end{equation}\] with: \[\begin{align*} \theta^{(H)}_{(i,j)} &= \exp\{\mu + h_{eff} + att_{(i)} + def_{(j)}\}\\[0.2cm] \theta^{(A)}_{(i,j)} &= \exp\{\mu + att_{(j)} + def_{(i)}\} \end{align*}\]
Under the above parametrization, all parameters have a straightforward interpretation:
- \(h_{eff}\) is the home effect, which describes the advantage for a team of playing the match at its stadium;
- \(\mu\) is a constant, representing the average “ability” (combining both attack and defence at the same time) of a team in the analyzed league;
- \(att_{(t)}\) represents the attack strength of team \(t\);
- \(def_{(t)}\) represents the defence weakness of team \(t\).
Some notes:
- Since we consider a “teams’ population intercept” (\(\mu\)), \(att_{(t)}\) and \(def_{(t)}\) should be interpreted as deviations of a team’s net attacking and defensive abilities from those of a team of average performance;
- In order to ensure the identifiability of the model, some constraints are imposed over the attack and defence coefficients: \[\sum_{t=1}^{T} {att_{t}} = 0 \hspace{0.5cm};\hspace{0.5cm}\sum_{t=1}^{T} {def_{t}} = 0 \]
- The probability function of our Skellam distribution is: \[f_{S}\left(x,\theta_1,\theta_2\right)= e^{-(\theta_1+\theta_2)} \left(\frac{\theta_1}{\theta_2}\right)^{x/2} I_{|x|}\left(2\sqrt{\theta_1\theta_2}\right)\] with \(x \in \mathbb{Z}\), \(\theta_1,\theta_2 >0\), and where \(I_{|x|}(\bullet)\) is the modified Bessel function of the first kind of order \(|x|\).
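As a quick numerical sanity check of this formula, here is a sketch in R using base R’s besselI() and, as a reference, the dskellam() function from the skellam package:

library(skellam)
# Direct evaluation of the Skellam pmf via the modified Bessel function of the first kind
skellam_pmf <- function(x, theta1, theta2) {
  exp(-(theta1 + theta2)) * (theta1 / theta2)^(x / 2) *
    besselI(2 * sqrt(theta1 * theta2), nu = abs(x))
}
skellam_pmf(1, 1.5, 1.1)   # formula above
dskellam(1, 1.5, 1.1)      # reference value from the skellam package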
In my previous experience with goal-based models, I noticed that the number of draws (and the corresponding probability) tends to be underestimated. To address this, I explored the idea, proposed by Karlis and Ntzoufras, of using a zero-inflated version of the Skellam distribution. This approach seemed promising, so the probability distribution I referred to is the following:
\[\begin{align*} f_{ZIS}\left(x,\theta_1,\theta_2,p\right) = \begin{cases} (1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x\neq0$}\\[0.3cm] p+(1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x=0$} \end{cases} \end{align*}\] with \(x \in \mathbb{Z},\theta_1,\theta_2 >0\) and \(p \in [0,1]\)
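Reusing skellam_pmf() from the sketch above, the zero-inflated pmf can be written in R as:

# Zero-inflated Skellam pmf: extra point mass p at zero
zis_pmf <- function(x, theta1, theta2, p) {
  base <- skellam_pmf(x, theta1, theta2)
  ifelse(x == 0, p + (1 - p) * base, (1 - p) * base)
}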
To fully specify a Bayesian model, we need to define prior distributions for the model parameters. In my analysis, I chose not to incorporate any external information from previous seasons. In such cases, Karlis and Ntzoufras suggested using normal prior distributions with mean zero and a large variance (e.g. \(10^4\)) to express prior ignorance. However, since the Stan documentation discourages non-informative priors, I set my priors to be weakly informative (still assuming normality):
\[\begin{align*} \mu \sim N(0,10)\\[0.2cm] H_{eff} \sim N(0,10)\\[0.2cm] att_{(\bullet)} \sim N(0,10)\\[0.2cm] def_{(\bullet)} \sim N(0,10)\\[0.2cm] p \sim U(0,1) \end{align*}\]
In addition to the standard approach, I considered an online-learning approach. In the Bayesian framework, we typically start with a prior distribution and update it with observed data, in order to obtain a posterior distribution. When new data becomes available, this posterior can serve as the prior for the next update and so on. Despite concerns about specifying overly informative priors, I was curious to compare the online learning approach with the traditional “offline” model, where the information that updates our knowledge derives (almost) only from the data.
Note: in principle, we don’t know the functional form of the posterior distribution. However, in my analysis I assumed my priors to be normal, updating only their location and scale according to the matchday-by-matchday estimates.
The formulation of the online model has, of course, the same likelihood as the offline one; only the prior specification differs. More precisely, at a certain matchday \(m\), the priors are the following:
\[\begin{align*} \mu &\sim N\big(MAP(\mu)^{(m-1)},\sigma_{\mu}^{(m-1)}\big)\\[0.2cm] H_{eff} &\sim N\big(MAP(H_{eff})^{(m-1)},\sigma_{H_{eff}}^{(m-1)}\big)\\[0.2cm] att_{(\bullet)} &\sim N\big(MAP(att_{(\bullet)}\;)^{(m-1)},\sigma_{att_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] def_{(\bullet)} &\sim N\big(MAP(def_{(\bullet)}\;)^{(m-1)},\sigma_{def_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] p &\sim U(0,1) \end{align*}\]
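The report does not show how these hyperparameters are extracted from the previous fit; a minimal sketch, where the MAP is approximated by the peak of a kernel density estimate of the posterior draws (get_map() is my own helper, not part of the original code):

# Approximate the MAP as the mode of a kernel density estimate of the draws
get_map <- function(draws) {
  d <- density(draws)
  d$x[which.max(d$y)]
}
posterior   <- as.array(KN_model)            # fit from matchday m-1
prev_mu_MAP <- get_map(as.vector(posterior[, , "mu"]))
prev_mu_sd  <- sd(as.vector(posterior[, , "mu"]))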
Model implementation in Stan
The models have been fully implemented using the Stan ecosystem and managed in R through the rstan library. I decided to carefully follow the procedure illustrated by Karlis and Ntzoufras in their article, setting the following MCMC parameters:
- N_CHAINS = 4
- N_ITERS = 11000
- N_WARMUP = 1000
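For concreteness, a sketch of a single fitting call with these settings (STAN_DIR and the data-list construction are my assumptions; the report does not show this chunk):

library(rstan)
stan_data <- list(n_teams = n_teams, n_games = n_games,
                  home_team = SerieA_data$ht, away_team = SerieA_data$at,
                  goal_difference = SerieA_data$GD)
KN_model <- stan(file = paste0(STAN_DIR, "karlis-ntzoufras.stan"),
                 data = stan_data,
                 chains = N_CHAINS, iter = N_ITERS, warmup = N_WARMUP)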
In what follows, the main blocks of the Stan files are illustrated.
Offline model
Functions: since Stan does not provide the log-pmf of a Skellam, I had to define my own custom functions:

functions {
  real skellam_lpmf(int k, real lambda1, real lambda2) {
    real total;
    real log_prob;
    total = (- lambda1 - lambda2) + (log(lambda1) - log(lambda2)) * k / 2;
    // For integer orders, I_{-k} = I_k, so passing k directly is fine
    log_prob = total + log(modified_bessel_first_kind(k, 2 * sqrt(lambda1 * lambda2)));
    return log_prob;
  }
  real zero_inflated_skellam_lpmf(int k, real lambda1, real lambda2, real p) {
    real base_prob;
    real prob;
    real log_prob;
    base_prob = exp(skellam_lpmf(k | lambda1, lambda2));
    if (k == 0) {
      prob = p + (1 - p) * base_prob;   // extra point mass at zero
    } else {
      prob = (1 - p) * base_prob;
    }
    log_prob = log(prob);
    return log_prob;
  }
}
Data:
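The data block for the offline model presumably mirrors the online one shown later, minus the prior hyperparameters; a minimal sketch:

data {
  int<lower=1> n_teams;
  int<lower=1> n_games;
  array[n_games] int<lower=1, upper=n_teams> home_team;
  array[n_games] int<lower=1, upper=n_teams> away_team;
  array[n_games] int goal_difference;
}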
Parameters and transformed parameters:
parameters {
  real<lower=0, upper=1> p;
  real mu;
  real home_advantage;
  array[n_teams-1] real att_raw;
  array[n_teams-1] real def_raw;
}
transformed parameters {
  // Sum-to-zero constraint
  array[n_teams] real att;
  array[n_teams] real def;
  for (t in 1:(n_teams-1)) {
    att[t] = att_raw[t];
    def[t] = def_raw[t];
  }
  att[n_teams] = -sum(att_raw);
  def[n_teams] = -sum(def_raw);
}
Model:
model {
  array[n_games] real theta_H;
  array[n_games] real theta_A;
  // Priors
  p ~ uniform(0, 1);
  att_raw ~ normal(0, 10);
  def_raw ~ normal(0, 10);
  home_advantage ~ normal(0, 10);
  mu ~ normal(0, 10);
  // Likelihood
  for (g in 1:n_games) {
    theta_H[g] = exp(mu + home_advantage + att[home_team[g]] + def[away_team[g]]);
    theta_A[g] = exp(mu + att[away_team[g]] + def[home_team[g]]);
    goal_difference[g] ~ zero_inflated_skellam(theta_H[g], theta_A[g], p);
  }
}
Generated quantities: I couldn’t exploit the generated quantities block, since Stan does not provide a random number generator for the Skellam distribution (a sort of skellam_rng()). Users usually bypass this limitation by generating two values from a Poisson (with poisson_rng()) and taking their difference, but this would clash with our setup, in which we don’t assume Poisson marginals. Fortunately, the generated quantities can easily be computed in R after the model has been fitted! I opted for this latter approach, relying on the skellam library rather than reinventing the wheel.
Technical Note: the skellam library provides a rskellam() function, but the documentation clearly specifies that it works internally by generating two random Poisson values (with rpois()), which is exactly the mechanism I wanted to avoid :( Nevertheless, I could easily implement a custom Skellam generator through the inverse-sampling method, exploiting the qskellam() function from the same library, as shown below:
library(skellam)
#-------------------------------------------------------------------------------
# 1) Random generator for the standard Skellam (inverse-transform sampling)
my_rskellam <- function(n, lambda1, lambda2) {
  q <- runif(n, min = 0, max = 1)
  return(qskellam(q, lambda1, lambda2))
}
#-------------------------------------------------------------------------------
# 2) Random generator for the zero-inflated Skellam
my_rzeroinflatedskellam <- function(n, lambda1, lambda2, p) {
  samples <- vector(mode = "numeric", length = n)
  for (i in 1:n) {
    # With probability (1-p), draw from a standard Skellam;
    # otherwise keep the pre-filled zero (the inflated point mass)
    if (runif(1) > p) {
      samples[i] <- my_rskellam(1, lambda1, lambda2)
    }
    # else not needed, since the vector is already initialized with zeros!
  }
  return(samples)
}
#-------------------------------------------------------------------------------
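A quick sanity check of the custom generator (my own illustration): for a Skellam distribution, the mean is \(\theta_1-\theta_2\) and the variance is \(\theta_1+\theta_2\).

set.seed(42)
draws <- my_rskellam(10000, lambda1 = 1.4, lambda2 = 1.1)
mean(draws)   # should be close to 1.4 - 1.1 = 0.3
var(draws)    # should be close to 1.4 + 1.1 = 2.5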
Online model
The Stan code for the online model is similar to the previous one, with the major differences lying in the data block and in the prior definitions (model block):
data {
int<lower=1> n_teams;
int<lower=1> n_games;
array[n_games] int<lower=1, upper=n_teams> home_team;
array[n_games] int<lower=1, upper=n_teams> away_team;
array[n_games] int goal_difference;
// Previous estimates and sd
array[n_teams-1] real prev_att_MAP;
array[n_teams-1] real prev_def_MAP;
real prev_mu_MAP;
real prev_home_advantage_MAP;
array[n_teams-1] real<lower=0> prev_att_sd;
array[n_teams-1] real<lower=0> prev_def_sd;
real<lower=0> prev_mu_sd;
real<lower=0> prev_home_advantage_sd;
}
model {
...
// Priors
p ~ uniform(0, 1);
for(a in 1:(n_teams-1)){
att_raw[a] ~ normal(prev_att_MAP[a], prev_att_sd[a]);
def_raw[a] ~ normal(prev_def_MAP[a], prev_def_sd[a]);
}
home_advantage ~ normal(prev_home_advantage_MAP, prev_home_advantage_sd);
mu ~ normal(prev_mu_MAP, prev_mu_sd);
...
}
Disclaimer:
- Due to the extensive time required to fit the models, I am not including the full code here (it is not particularly interesting, though). However, I have fitted all the models and stored them in a dedicated folder, with the possibility of loading each of them when required.
- The Bayesian procedure of fitting the model matchday by matchday is, in practice, carried out on a weekly basis. With access to the complete dataset for the analyzed season, I was able to fit the models in a for loop, iterating over each matchday (a sketch follows the list below). The procedure for each iteration is the following:
- Select the rows of the dataframe corresponding to the available information up to the current matchday.
- Load the model estimated at the end of the previous matchday (from the models folder).
- Fit the model and store the updated model in the dedicated folder.
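A hedged sketch of that loop for the online models, reusing the get_map() helper sketched earlier (rstan is assumed to be loaded; the exact data-list construction in the original scripts may differ):

for (m in 20:n_matchdays) {
  # (1) games observed up to the current matchday (10 games per matchday)
  obs <- SerieA_data[1:(10 * m), ]
  # (2) previous fit: its posterior summaries become the new priors
  load(paste0(ONLINE_MODELS_DIR, "matchday", m - 1, "/KN_matchday", m - 1, ".rds"))
  posterior <- as.array(KN_model)
  att_draws <- sapply(1:(n_teams - 1), function(t) as.vector(posterior[, , paste0("att_raw[", t, "]")]))
  def_draws <- sapply(1:(n_teams - 1), function(t) as.vector(posterior[, , paste0("def_raw[", t, "]")]))
  stan_data <- list(
    n_teams = n_teams, n_games = nrow(obs),
    home_team = obs$ht, away_team = obs$at, goal_difference = obs$GD,
    prev_att_MAP = apply(att_draws, 2, get_map), prev_att_sd = apply(att_draws, 2, sd),
    prev_def_MAP = apply(def_draws, 2, get_map), prev_def_sd = apply(def_draws, 2, sd),
    prev_mu_MAP  = get_map(as.vector(posterior[, , "mu"])),
    prev_mu_sd   = sd(as.vector(posterior[, , "mu"])),
    prev_home_advantage_MAP = get_map(as.vector(posterior[, , "home_advantage"])),
    prev_home_advantage_sd  = sd(as.vector(posterior[, , "home_advantage"]))
  )
  # (3) refit and store the updated model
  KN_model <- stan(file = paste0(STAN_DIR, "online.stan"), data = stan_data,
                   chains = N_CHAINS, iter = N_ITERS, warmup = N_WARMUP)
  save(KN_model, file = paste0(ONLINE_MODELS_DIR, "matchday", m, "/KN_matchday", m, ".rds"))
}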
Posterior distributions of the coefficients
After fitting my Bayesian models, I used the bayesplot
package to generate informative plots of the posterior distributions of
the model parameters. To provide a comprehensive overview, I will
present the plots for the “final model”, which is the one fitted at the
end of the season, with all the available data. These plots will be
shown for both the offline and online models, with a clear distinction
between the attack parameters and the defense parameters.
Plots for offline models
load(paste0(OFFLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<- names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_offline<- mcmc_intervals(posterior,pars=att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC intervals for attack coefficients (offline)")
att_areas_offline<- mcmc_areas(posterior,pars= att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC areas for attack coefficients (offline)")
att_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=att_params)+
ggtitle("Marginal posteriors for attack coefficients by chain (offline)")
att_dens_overlay_offline[[1]]$Parameter = rep(att_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-----------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_offline<- mcmc_intervals(posterior,pars=def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC intervals for defense coefficients (offline)")
def_areas_offline<- mcmc_areas(posterior,pars= def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC areas for defense coefficients (offline)")
def_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=def_params)+
ggtitle("Marginal posteriors for defense coefficients by chain (offline)")
def_dens_overlay_offline[[1]]$Parameter = rep(def_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_offline<- mcmc_areas(posterior,pars="home_advantage")+
ggtitle("MCMC areas for home advantage (offline)")
home_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars="home_advantage")+
ggtitle("Marginal posteriors for home advantage by chain (offline)")
Plots for online models
load(paste0(ONLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<- names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_online<- mcmc_intervals(posterior,pars=att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC intervals for attack coefficients (online)")
att_areas_online<- mcmc_areas(posterior,pars= att_params)+
scale_y_discrete(labels=att_labels)+
ggtitle("MCMC areas for attack coefficient (online)")
att_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=att_params)+
ggtitle("Marginal posteriors for attack coefficients by chain (online)")
att_dens_overlay_online[[1]]$Parameter = rep(att_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_online<- mcmc_intervals(posterior,pars=def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC intervals for defense coefficients (online)")
def_areas_online<- mcmc_areas(posterior,pars= def_params)+
scale_y_discrete(labels=def_labels)+
ggtitle("MCMC areas for defense coefficients (online)")
def_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=def_params)+
ggtitle("Marginal posteriors for defense coefficients by chain (online)")
def_dens_overlay_online[[1]]$Parameter = rep(def_labels,
each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_online<- mcmc_areas(posterior,pars="home_advantage")+
ggtitle("MCMC areas for home advantage (online)")
home_dens_overlay_online<- mcmc_dens_overlay(posterior,pars="home_advantage")+
ggtitle("Marginal posteriors for home advantage by chain (online)")
MCMC Intervals plots
MCMC Areas plots
Marginal posteriors (by chain)
Home advantage
Teams comparison
Abilities over time
Teams’ abilities over time (offline model)
Teams’ abilities over time (online model)
A glimpse at convergence diagnostics
The training phase of the models seemed promising, as the MCMC procedure completed successfully without any warnings or errors. To assess the convergence of the chains, we can inspect the model summaries and look at the Gelman-Rubin statistic \(\hat{R}\) (Rhat), which provides a quantitative measure of the chains’ convergence to the stationary distribution.
Offline model
## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1;
## post-warmup draws per chain=10000, total post-warmup draws=40000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## mu -0.19 0 0.11 -0.40 -0.26 -0.19 -0.11 0.02 27901 1
## home_advantage 0.12 0 0.07 -0.01 0.08 0.12 0.17 0.26 47969 1
## att[1] 0.68 0 0.17 0.33 0.56 0.68 0.80 1.02 36874 1
## att[2] 0.06 0 0.23 -0.40 -0.09 0.07 0.22 0.51 40797 1
## att[3] -0.25 0 0.29 -0.84 -0.44 -0.24 -0.05 0.29 40810 1
## att[4] -0.10 0 0.24 -0.58 -0.25 -0.09 0.07 0.35 38297 1
## att[5] -0.05 0 0.26 -0.58 -0.22 -0.04 0.13 0.44 42691 1
## att[6] 0.49 0 0.22 0.04 0.34 0.49 0.64 0.91 35893 1
## att[7] 0.50 0 0.18 0.15 0.39 0.51 0.62 0.85 37597 1
## att[8] 0.28 0 0.21 -0.14 0.14 0.29 0.43 0.69 40854 1
## att[9] -0.73 0 0.40 -1.57 -0.98 -0.71 -0.46 -0.02 39762 1
## att[10] 0.29 0 0.24 -0.19 0.13 0.29 0.45 0.74 38666 1
## att[11] 0.30 0 0.21 -0.12 0.16 0.30 0.44 0.71 44720 1
## att[12] 0.67 0 0.19 0.29 0.54 0.67 0.80 1.04 35146 1
## att[13] 0.53 0 0.21 0.12 0.39 0.53 0.67 0.93 37462 1
## att[14] -0.04 0 0.21 -0.47 -0.18 -0.04 0.10 0.36 42073 1
## att[15] -1.28 0 0.52 -2.41 -1.61 -1.24 -0.92 -0.35 35030 1
## att[16] 0.29 0 0.23 -0.18 0.14 0.30 0.45 0.73 35889 1
## att[17] 0.29 0 0.19 -0.08 0.16 0.29 0.42 0.65 42743 1
## att[18] -0.51 0 0.39 -1.33 -0.77 -0.50 -0.24 0.20 37348 1
## att[19] -0.39 0 0.32 -1.05 -0.60 -0.38 -0.17 0.20 40775 1
## att[20] -1.03 0 0.43 -1.93 -1.30 -1.01 -0.74 -0.26 37245 1
## def[1] -0.54 0 0.41 -1.42 -0.80 -0.51 -0.25 0.20 35246 1
## def[2] -0.13 0 0.26 -0.66 -0.30 -0.13 0.05 0.35 39088 1
## def[3] 0.22 0 0.20 -0.18 0.09 0.22 0.36 0.61 41929 1
## def[4] -0.28 0 0.27 -0.82 -0.46 -0.28 -0.10 0.22 39696 1
## def[5] 0.18 0 0.22 -0.25 0.04 0.18 0.33 0.60 39619 1
## def[6] 0.41 0 0.23 -0.05 0.25 0.41 0.56 0.85 36563 1
## def[7] -0.65 0 0.39 -1.48 -0.90 -0.63 -0.38 0.06 37040 1
## def[8] -0.13 0 0.27 -0.69 -0.31 -0.12 0.06 0.39 40010 1
## def[9] 0.28 0 0.19 -0.09 0.16 0.29 0.41 0.66 39832 1
## def[10] 0.54 0 0.19 0.16 0.42 0.54 0.67 0.91 37093 1
## def[11] -0.14 0 0.28 -0.71 -0.32 -0.13 0.05 0.39 44269 1
## def[12] 0.34 0 0.24 -0.14 0.19 0.35 0.51 0.80 36020 1
## def[13] 0.37 0 0.23 -0.08 0.22 0.37 0.53 0.80 39245 1
## def[14] -0.85 0 0.35 -1.58 -1.07 -0.83 -0.61 -0.20 43395 1
## def[15] 0.11 0 0.20 -0.27 -0.02 0.11 0.25 0.50 45973 1
## def[16] 0.29 0 0.23 -0.16 0.14 0.29 0.44 0.72 35305 1
## def[17] -1.11 0 0.46 -2.10 -1.40 -1.08 -0.79 -0.28 35375 1
## def[18] 0.54 0 0.18 0.19 0.42 0.54 0.66 0.89 37538 1
## def[19] 0.33 0 0.19 -0.04 0.21 0.34 0.46 0.70 39414 1
## def[20] 0.20 0 0.19 -0.17 0.07 0.20 0.32 0.56 37993 1
##
## Samples were drawn using NUTS(diag_e) at Thu May 23 16:24:21 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Online model
## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1;
## post-warmup draws per chain=10000, total post-warmup draws=40000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## mu -0.32 0 0.02 -0.35 -0.33 -0.32 -0.31 -0.29 57363 1
## home_advantage 0.15 0 0.02 0.12 0.14 0.15 0.16 0.18 55709 1
## att[1] 0.78 0 0.03 0.71 0.75 0.78 0.80 0.84 54325 1
## att[2] 0.11 0 0.05 0.02 0.08 0.11 0.15 0.21 57006 1
## att[3] 0.01 0 0.05 -0.09 -0.02 0.01 0.05 0.12 54203 1
## att[4] 0.03 0 0.05 -0.06 0.00 0.03 0.06 0.12 54559 1
## att[5] 0.21 0 0.05 0.11 0.17 0.21 0.24 0.30 51632 1
## att[6] 0.37 0 0.05 0.27 0.33 0.37 0.40 0.47 50910 1
## att[7] 0.49 0 0.04 0.42 0.46 0.49 0.51 0.56 57498 1
## att[8] 0.44 0 0.04 0.36 0.41 0.44 0.47 0.52 53710 1
## att[9] -0.64 0 0.08 -0.79 -0.69 -0.64 -0.59 -0.49 51317 1
## att[10] 0.26 0 0.05 0.16 0.22 0.26 0.29 0.36 58589 1
## att[11] 0.40 0 0.04 0.32 0.37 0.40 0.43 0.48 58019 1
## att[12] 0.79 0 0.04 0.72 0.76 0.79 0.81 0.86 52849 1
## att[13] 0.57 0 0.04 0.49 0.54 0.57 0.60 0.65 57625 1
## att[14] 0.11 0 0.04 0.02 0.08 0.11 0.14 0.20 54968 1
## att[15] -2.18 0 0.12 -2.41 -2.26 -2.18 -2.10 -1.95 50041 1
## att[16] 0.01 0 0.05 -0.09 -0.02 0.01 0.05 0.12 53968 1
## att[17] 0.40 0 0.04 0.32 0.37 0.40 0.43 0.48 49641 1
## att[18] -0.69 0 0.09 -0.87 -0.75 -0.69 -0.63 -0.52 51054 1
## att[19] -0.42 0 0.07 -0.56 -0.47 -0.42 -0.38 -0.29 51023 1
## att[20] -1.03 0 0.22 -1.47 -1.18 -1.03 -0.88 -0.61 62394 1
## def[1] -0.62 0 0.07 -0.75 -0.66 -0.62 -0.57 -0.49 56236 1
## def[2] -0.16 0 0.05 -0.26 -0.20 -0.17 -0.13 -0.07 48733 1
## def[3] 0.41 0 0.04 0.33 0.38 0.41 0.43 0.48 51272 1
## def[4] -0.25 0 0.05 -0.35 -0.28 -0.25 -0.21 -0.15 53831 1
## def[5] 0.41 0 0.04 0.33 0.38 0.41 0.43 0.48 53237 1
## def[6] 0.34 0 0.04 0.25 0.31 0.34 0.37 0.42 53148 1
## def[7] -0.83 0 0.06 -0.95 -0.87 -0.83 -0.79 -0.71 50939 1
## def[8] 0.01 0 0.05 -0.08 -0.02 0.01 0.04 0.10 52022 1
## def[9] 0.32 0 0.04 0.25 0.30 0.32 0.35 0.39 56688 1
## def[10] 0.53 0 0.04 0.45 0.50 0.53 0.55 0.60 50807 1
## def[11] -0.31 0 0.06 -0.42 -0.35 -0.31 -0.28 -0.21 54624 1
## def[12] 0.43 0 0.04 0.34 0.40 0.43 0.45 0.51 52231 1
## def[13] 0.30 0 0.04 0.22 0.27 0.30 0.33 0.39 57413 1
## def[14] -0.98 0 0.06 -1.11 -1.02 -0.98 -0.94 -0.86 53674 1
## def[15] 0.06 0 0.04 -0.01 0.04 0.06 0.09 0.14 49967 1
## def[16] -0.10 0 0.05 -0.19 -0.13 -0.10 -0.07 0.00 53460 1
## def[17] -0.78 0 0.06 -0.90 -0.82 -0.78 -0.74 -0.66 50585 1
## def[18] 0.62 0 0.03 0.56 0.60 0.62 0.65 0.69 59261 1
## def[19] 0.36 0 0.04 0.29 0.33 0.36 0.38 0.43 53456 1
## def[20] 0.25 0 0.13 -0.01 0.16 0.25 0.34 0.50 64327 1
##
## Samples were drawn using NUTS(diag_e) at Thu May 23 15:21:00 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
In the summaries, all the Rhat values are equal to 1, indicating convergence of the chains (remember the rule of thumb: \(\hat{R}< 1.1\)). Additionally, the effective sample size (n_eff) values are high, which is a positive indicator. Notably, the n_eff values for the online model are generally higher than those for the offline model, which I somehow expected. Interestingly, many n_eff values (all of them, actually, for the online model) are greater than the total number of post-warmup draws. At first, this seemed paradoxical to me. However, after reading various discussions on the Stan developers forum, I found the following answer: “Hi! It means that the draws that Stan is producing are better than independent draws for those parameters. Sounds like a joke, but apparently Stan is actually that awesome. :)”
Thanks to the bayesplot package, we can also draw the traceplots and visually inspect them for divergent transitions. In our case, no chain diverged, so the plots are not particularly interesting. However, if there had been any divergences, they would be evident in the plots below. As an example, I’ll present only the traceplots for the attack and defense coefficients of the online model:
## No divergences to plot.
## No divergences to plot.
After examining the Gelman-Rubin statistic and the effective sample size, another important convergence diagnostic is the autocorrelation function (ACF) plot, which helps to inspect the autocorrelation within the chain samples. Upon reviewing all the ACF plots, I did not observe any particularly concerning patterns. To provide a comprehensive overview, I will present ACF plots for just two parameters: one with greater variability (the defense coefficient of Venezia) and one with less variability (the attack coefficient of Inter). These examples illustrate that, regardless of the variability, there are no significant autocorrelation patterns, indicating good convergence properties of the model.
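These plots can be produced with bayesplot’s mcmc_acf(); the parameter indices below are placeholders, since the actual positions of Inter’s attack and Venezia’s defense depend on the ordering of teams:

# ACF up to lag 20 for two illustrative parameters
mcmc_acf(posterior, pars = c("att[1]", "def[20]"), lags = 20)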
Model checking and comparison
So far we have constructed a Bayesian model, computed the posterior distribution of all estimates, and conducted diagnostic tests —both quantitative and qualitative— regarding the convergence of the chains. Now, our attention turns towards evaluating the fit of the model to the data and our substantive knowledge.
Posterior predictive checks (in-sample replications)
As illustrated in the Bayesian Data Analysis book, a key technique for assessing model fit involves drawing simulated values from the posterior predictive distribution and comparing them to the observed data. More specifically, we’ll now perform an in-sample replication analysis, assessing whether the model adequately captures the observed data by checking the plausibility of the observed data under the posterior predictive distribution. If the model fits well, replicated data generated under its framework should closely resemble the observed data, while marked discrepancies between replicated and observed data may indicate model misfit. In short, the observed data should look plausible under the posterior predictive distribution.
The model checking analysis can be effectively conducted in a graphical/qualitative manner. Fortunately, our friend bayesplot offers a suite of graphical tools to assess the posterior predictive distribution for our in-sample analysis. Before delving into the graphical model checking plots, a preliminary step is required: due to the limitations of the generated quantities block in Stan discussed earlier, it was necessary to simulate these quantities in R (as shown in the chunks below).
Note: in conducting the in-sample replication procedure, we assess the plausibility of the observed data relative to replications drawn from the model trained on that same data. While it is possible to perform this check matchday by matchday, I opted to use our most comprehensive model, the one estimated after the final matchday of the season. With this approach, the replications draw on the entirety of the season’s data, providing a robust assessment of model fit across the entire dataset.
PPC for offline models
#-------------------------------------------------------------------------------
# Dataframe where to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(OFFLINE_MODELS_DIR,"matchday",n_matchdays,
"/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
cat(m,"\n")
ht=SerieA_data$ht[m]
at=SerieA_data$at[m]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
GD_df[,m]<- GD
colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("teal")
ppc_dens_overlay_offline <- ppc_dens_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:4000,]))
ppc_ecdf_overlay_offline <- ppc_ecdf_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:10000,]))
ppc_stat_mean_offline<- ppc_stat(y = SerieA_data$GD,
yrep = as.matrix(GD_df[1:40000,]),
stat = "mean",bins = 20)
propzero <- function(x) mean(x == 0)
ppc_stat_propzero_offline <- ppc_stat(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = "propzero",bins = 20)
ppc_stat_2d_offline <- ppc_stat_2d(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = c("mean","sd"))
PPC for online models
#-------------------------------------------------------------------------------
# Dataframe where to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(ONLINE_MODELS_DIR,"matchday",n_matchdays,
"/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
ht=SerieA_data$ht[m]
at=SerieA_data$at[m]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
GD_df[,m]<- GD
colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("purple")
ppc_dens_overlay_online <- ppc_dens_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:4000,]))
ppc_ecdf_overlay_online<- ppc_ecdf_overlay(y=SerieA_data$GD,
yrep=as.matrix(GD_df[1:10000,]))
ppc_stat_mean_online<- ppc_stat(y = SerieA_data$GD,
yrep = as.matrix(GD_df[1:40000,]),
stat = "mean",bins = 20)
propzero <- function(x) mean(x == 0)
ppc_stat_propzero_online <- ppc_stat(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = "propzero",bins = 20)
ppc_stat_2d_online <- ppc_stat_2d(y = SerieA_data$GD,
yrep=as.matrix(GD_df[1:40000,]),
stat = c("mean","sd"))
Note: here, the plots on the left (right) side refer to the offline (online) model!
Posterior predictive checks (out-of-sample replications)
Having completed the in-sample posterior predictive checks, we now turn our attention to out-of-sample replication. This process involves testing the model on future observable values or vectors of new observable quantities. Specifically, for our football league analysis, I employed a sequential prediction strategy based on matchdays. Recall that we initially estimated models matchday by matchday, starting from the end of the first half of the league.
With the model fitted using data up to matchday \(m\), I used it to predict the outcomes of matchday \(m+1\) (which are future, unknown data!). The Bayesian approach, with its chain samples, offers a significant advantage over classical statistics: instead of estimating a single value (either as a point estimate or with confidence intervals), we can obtain a distribution for our quantities of interest. In this case, using the chain samples, I simulated the goal differences for each match, obtaining numerous samples from the distribution of goal differences for a given match.
Such a distribution then allowed me to assign probability values to the final outcomes, encoded as “H” (Home win), “D” (Draw), and “A” (Away win). I selected the most frequent class as the guess for each match’s final outcome and, by repeating this for all matches in the second half of the league, I constructed confusion matrices to compare the models’ accuracy. Additionally, I computed the Brier score matchday by matchday to further assess the predictive power of my models.
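For a single match, with GD holding the simulated goal differences, this boils down to (a sketch of my own):

outcome_probs <- c(H = mean(GD > 0), D = mean(GD == 0), A = mean(GD < 0))
predicted_outcome <- names(which.max(outcome_probs))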
Accuracy and Brier Score
Let’s start from the accuracy:
Offline model:
SerieA_data$PredictedGD=NA
for (m in 20:n_matchdays){
cat(paste0("...Simulation of matchday n. ",m,"...\n"))
#-------------------------------------------------
# (1) Get the current matchday to predict
test_set=SerieA_data[(10*m -9):(10*m),]
#-------------------------------------------------
# (2) Load the most recent model and some of its stuff
cat("...Loading the model and retrieving useful info...\n")
load(paste0(OFFLINE_MODELS_DIR,"matchday",m-1,"/KN_matchday",m-1,".rds"))
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------
# (3) Make predictions
cat("...Computing predictions and updating rankings...\n")
PredictedGD<- vector(mode="numeric",length = nrow(test_set))
for(mm in 1:nrow(test_set)){
ht=test_set$ht[mm]
at=test_set$at[mm]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
PredictedGD[mm]<- GD %>% mean() %>% round()
}
# (4) Add predicted GD to the dataset
SerieA_data$PredictedGD[(10*m -9):(10*m)]<- PredictedGD
cat("-------------------------------------------------\n")
}
ConfusionMatrix_offline_df<- SerieA_data %>% slice(191:n_games) %>% select(GD,PredictedGD)
ConfusionMatrix_offline_df<- ConfusionMatrix_offline_df %>% mutate(
Actual = ifelse(GD > 0, "H", ifelse(GD < 0, "A", "D")),
Predicted = ifelse(PredictedGD > 0, "H", ifelse(PredictedGD < 0, "A", "D"))
)
CM_offline <- confusion_matrix(
targets = factor(ConfusionMatrix_offline_df$Actual, levels = c("H", "D", "A")),
predictions = factor(ConfusionMatrix_offline_df$Predicted, levels = c("H", "D", "A"))
)
Online model
The same procedure can be performed for the online model (we just need to replace OFFLINE_MODELS_DIR with ONLINE_MODELS_DIR in the load() function), so I am not going to show the full code again. In the end, we get a confusion matrix for the online model as well:
ConfusionMatrix_online_df<- ConfusionMatrix_online_df %>% mutate(
Actual = ifelse(GD > 0, "H", ifelse(GD < 0, "A", "D")),
Predicted = ifelse(PredictedGD > 0, "H", ifelse(PredictedGD < 0, "A", "D"))
)
CM_online <- confusion_matrix(
targets = factor(ConfusionMatrix_online_df$Actual, levels = c("H", "D", "A")),
predictions = factor(ConfusionMatrix_online_df$Predicted, levels = c("H", "D", "A"))
)
Comparison of the two Confusion matrices:
## >> Balanced accuracy of the offline model: 0.6412566
## >> Balanced accuracy of the online model : 0.6337596
The two confusion matrices highlight the similarity in predictions between the models. Notably, the (balanced) accuracy of the offline model is slightly higher than that of the online model. However, keep in mind that while accuracy is an important measure for assessing a model, it is not the sole factor in determining its suitability.
Additionally, the similarity in performance between the models is further evidenced by their Brier scores, an evaluation metric defined as:
\[BS = \frac{1}{M} \sum_{m=1}^{M}{\sum_{i}{(p_{mi} - d_{mi})^2}} \quad \text{with } i\in \{H,D,A\}\] where \(p_{mi}\) is the predicted probability of outcome \(i\) in match \(m\), and \(d_{mi}\) equals 1 if outcome \(i\) actually occurred in match \(m\) and 0 otherwise. Note: for the Brier score, the rule is the lower, the better!
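A compact sketch of this computation (my own illustration; probs is assumed to be an \(M \times 3\) matrix of predicted probabilities whose columns follow the factor levels of outcome):

brier_score <- function(probs, outcome) {
  # One-hot indicator matrix d_mi built from the observed outcomes,
  # with columns in the order of the factor levels
  d <- model.matrix(~ outcome - 1)
  mean(rowSums((probs - d)^2))
}
# Example: two matches with observed outcomes H and D
probs   <- rbind(c(0.5, 0.3, 0.2), c(0.4, 0.35, 0.25))
outcome <- factor(c("H", "D"), levels = c("H", "D", "A"))
brier_score(probs, outcome)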
Despite the BS being an overall measure of performance (like accuracy), I found it interesting to compute it (and plot it) matchday by matchday:
BS_ts <- read.csv("BS_ts.csv")
BS_ts %>%
  pivot_longer(cols = c(BS_offline, BS_online),
               names_to = "BS",
               values_to = "Value") %>%
  ggplot(aes(x = matchday, y = Value, color = BS)) +
  geom_line(linewidth = 1) +
  ggtitle("Brier score for predictions matchday by matchday")
Following the Brier score plot, we observe that both approaches yield very similar results in terms of predictive performance. However, it is important to notice that, in general, these quantitative measures do not provide the full picture. The ideal approach is to consider the entirety of our analysis, integrating various aspects, both quantitative and qualitative.
Final rankings simulations
In the previous section we highlighted how an ideal approach should integrate various aspects, both quantitative and qualitative. As an example, I did something (hopefully) interesting to evaluate and compare the models: I implemented a simulation of the final rankings. Starting from the situation at the halfway point of the season, I began simulating matches. With the model fitted after the \(m\)-th matchday, I made predictions for matchday \(m+1\). Thanks to the Bayesian approach, I obtained a distribution of the goal differences and, from this distribution, it was relatively straightforward to derive a distribution of the points for each team. By proceeding in this manner over all the matchdays, I obtained the distribution of the final points at the end of the season. Having a distribution allowed me to provide the expected number of points at the end of the season, along with a credible band for the result. This approach offers a comprehensive view of the potential outcomes, capturing the inherent uncertainty and variability in our two models.
To perform this task, I implemented a custom function for retrieving the final points distribution:
final_pts_distribution<- function(teams,start=20,end=38,
n_chains=4,n_iters=11000,n_warmup=1000,
models_dir=NULL) {
"
Function to simulate the points at the end of the season basing on coef estimates
Returns a df s.t.
- each column is associated to a team
- nrows: n_chains*(n_iters-n_warmup) to be friendly with the MCMC posteriors
"
if(is.null(models_dir) || !file.exists(models_dir)){
stop("Error! Models directory path not provided or does not exist")
}
#-----------------------------------------------------------------------------
# Rankings after the first half of the league
n_teams=length(teams)
teams_pts <- data.frame(matrix(NA,nrow = n_chains*(n_iters-n_warmup),ncol=n_teams))
colnames(teams_pts)=teams
for (t in 1:n_teams){
n_wins= SerieA_data[1:190,] %>%
filter(((ht==t) &(FTHG>FTAG)) | ((at==t) & (FTHG<FTAG)))%>% nrow()
n_draws= SerieA_data[1:190,] %>%
filter(((ht==t) | (at==t)) & (FTHG == FTAG))%>% nrow()
teams_pts[,t]= 3*n_wins+n_draws
}
#-----------------------------------------------------------------------------
# Iteration over the 2nd half
for (m in start:end){
cat(paste0("...Simulation of matchday n. ",m,"...\n"))
#...........................................................................
# (1) Get the current matchday to predict
test_set=SerieA_data[(10*m -9):(10*m),]
test_set=na.omit(test_set) #just to manage postponed matches in last batch
#...........................................................................
# (2) Load the most recent model and some of its stuff
cat("...Loading the model and retrieving useful info...\n")
load(paste0(models_dir,"matchday",m-1,"/KN_matchday",m-1,".rds"))
posterior<- as.array(KN_model)
n_iters=dim(posterior)[1]
n_chains=dim(posterior)[2]
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#...........................................................................
# (3) Make predictions
cat("...Computing predictions and updating rankings...\n")
for(mm in 1:nrow(test_set)){
ht=test_set$ht[mm]
at=test_set$at[mm]
attH=posterior[,,paste0("att[",ht,"]")]
defH=posterior[,,paste0("def[",ht,"]")]
attA=posterior[,,paste0("att[",at,"]")]
defA=posterior[,,paste0("def[",at,"]")]
theta_H = exp(mu+home+attH+defA)
theta_A = exp(mu+attA+defH)
GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
#.........................................................................
# (4) Calculate points according to predicted goal diffs
pts_ht = ifelse(GD > 0, 3, ifelse(GD < 0, 0, 1))
pts_at = ifelse(GD < 0, 3, ifelse(GD > 0, 0, 1))
#.........................................................................
# (5) Sum them to the previous points
teams_pts[,ht] = teams_pts[,ht]+pts_ht
teams_pts[,at] = teams_pts[,at]+pts_at
#.........................................................................
}
cat("-------------------------------------------------\n")
}
#-----------------------------------------------------------------------------
# (6) Return the df
return(teams_pts)
}
final_pts_offline<- final_pts_distribution(teams,start=20,end=38,
n_chains=N_CHAINS,
n_iters=N_ITERS,
n_warmup=N_WARMUP,
models_dir =OFFLINE_MODELS_DIR)
final_pts_online<- final_pts_distribution(teams,start=20,end=38,
n_chains=N_CHAINS,
n_iters=N_ITERS,
n_warmup=N_WARMUP,
models_dir =ONLINE_MODELS_DIR)
offline_rankings<- data.frame(Team= teams,
Pts_mean=NA,
Pts_mode=NA,
Pts_lower.95=NA,
Pts_lower.50=NA,
Pts_upper.50=NA,
Pts_upper.95=NA)
online_rankings<- data.frame(Team= teams,
Pts_mean=NA,
Pts_mode=NA,
Pts_lower.95=NA,
Pts_lower.50=NA,
Pts_upper.50=NA,
Pts_upper.95=NA)
for(t in 1:n_teams){
# Update rankings for offline model
offline_rankings$Pts_mean[t]<- final_pts_offline[,t] %>% mean() %>% round()
offline_rankings$Pts_mode[t]<- final_pts_offline[,t] %>% DescTools::Mode() %>% mean() %>% round() #mean just to manage multimodal case
offline_rankings$Pts_lower.95[t]<- final_pts_offline[,t] %>% quantile(probs = 0.025) %>% round()
offline_rankings$Pts_lower.50[t]<- final_pts_offline[,t] %>% quantile(probs = 0.25) %>% round()
offline_rankings$Pts_upper.50[t]<- final_pts_offline[,t] %>% quantile(probs = 0.75) %>% round()
offline_rankings$Pts_upper.95[t]<- final_pts_offline[,t] %>% quantile(probs = 0.975) %>% round()
# Update rankings for online model
online_rankings$Pts_mean[t]<- final_pts_online[,t] %>% mean() %>% round()
online_rankings$Pts_mode[t]<- final_pts_online[,t] %>% DescTools::Mode() %>% mean() %>% round() #mean just to manage multimodal case
online_rankings$Pts_lower.95[t]<- final_pts_online[,t] %>% quantile(probs = 0.025) %>% round()
online_rankings$Pts_lower.50[t]<- final_pts_online[,t] %>% quantile(probs = 0.25) %>% round()
online_rankings$Pts_upper.50[t]<- final_pts_online[,t] %>% quantile(probs = 0.75) %>% round()
online_rankings$Pts_upper.95[t]<- final_pts_online[,t] %>% quantile(probs = 0.975) %>% round()
}
table_offline<- offline_rankings %>%
arrange(desc(Pts_mode)) %>%
mutate(Position = row_number()) %>%
select(Position,Team,Pts_mean,Pts_mode,Pts_lower.95,Pts_lower.50,Pts_upper.50,Pts_upper.95) %>%
kbl(caption = "Serie A 2021-2022 simulated rankings (offline models)", align = 'c') %>%
kable_styling(font_size = 12, full_width = F) %>%
row_spec(0, background = "#FFFFFF", color = "#000000",bold=T) %>%
# Champions League
row_spec(1:4, background = "#8AD1F3") %>%
# Europa League
row_spec(5:6, background = "#EEA256") %>%
# Conference League
row_spec(7, background = "#E1D625") %>%
# Relegation
row_spec((nrow(offline_rankings)-2):nrow(offline_rankings), background = "#FF9E94") %>%
row_spec(1:nrow(offline_rankings), extra_css = "border-top: 1px solid black;")
table_online<- online_rankings %>%
arrange(desc(Pts_mode)) %>%
mutate(Position = row_number()) %>%
select(Position,Team,Pts_mean,Pts_mode,Pts_lower.95,Pts_lower.50,Pts_upper.50,Pts_upper.95) %>%
kbl(caption = "Serie A 2021-2022 simulated rankings (online models)", align = 'c') %>%
kable_styling(font_size = 12, full_width = F) %>%
row_spec(0, background = "#FFFFFF", color = "#000000",bold=T) %>%
# Champions League
row_spec(1:4, background = "#8AD1F3") %>%
# Europa League
row_spec(5:6, background = "#EEA256") %>%
# Conference League
row_spec(7, background = "#E1D625") %>%
# Relegation
row_spec((nrow(online_rankings)-2):nrow(online_rankings), background = "#FF9E94") %>%
row_spec(1:nrow(online_rankings), extra_css = "border-top: 1px solid black;")
Position | Team | Pts_mean | Pts_mode | Pts_lower.95 | Pts_lower.50 | Pts_upper.50 | Pts_upper.95 |
---|---|---|---|---|---|---|---|
1 | Inter | 88 | 88 | 79 | 85 | 91 | 97 |
2 | Milan | 80 | 80 | 70 | 77 | 83 | 89 |
3 | Napoli | 78 | 78 | 68 | 74 | 81 | 87 |
4 | Atalanta | 72 | 73 | 62 | 68 | 75 | 82 |
5 | Juventus | 69 | 68 | 59 | 65 | 72 | 78 |
6 | Roma | 63 | 64 | 53 | 60 | 67 | 73 |
7 | Lazio | 62 | 62 | 52 | 58 | 65 | 71 |
8 | Fiorentina | 62 | 62 | 52 | 59 | 66 | 72 |
9 | Torino | 53 | 53 | 43 | 49 | 56 | 63 |
10 | Verona | 52 | 52 | 42 | 48 | 55 | 62 |
11 | Bologna | 49 | 49 | 40 | 46 | 53 | 59 |
12 | Sassuolo | 49 | 48 | 39 | 45 | 52 | 59 |
13 | Empoli | 47 | 47 | 38 | 44 | 51 | 57 |
14 | Udinese | 47 | 47 | 37 | 44 | 51 | 57 |
15 | Sampdoria | 39 | 39 | 31 | 36 | 43 | 49 |
16 | Spezia | 32 | 32 | 24 | 29 | 35 | 41 |
17 | Venezia | 30 | 30 | 23 | 28 | 33 | 39 |
18 | Cagliari | 26 | 25 | 19 | 23 | 29 | 35 |
19 | Genoa | 22 | 21 | 16 | 20 | 24 | 29 |
20 | Salernitana | 20 | 19 | 13 | 17 | 22 | 28 |
Position | Team | Pts_mean | Pts_mode | Pts_lower.95 | Pts_lower.50 | Pts_upper.50 | Pts_upper.95 |
---|---|---|---|---|---|---|---|
1 | Inter | 90 | 90 | 81 | 87 | 93 | 98 |
2 | Milan | 80 | 81 | 71 | 77 | 84 | 89 |
3 | Napoli | 79 | 78 | 69 | 75 | 82 | 87 |
4 | Atalanta | 73 | 73 | 64 | 70 | 77 | 83 |
5 | Juventus | 69 | 69 | 59 | 65 | 72 | 78 |
6 | Roma | 63 | 63 | 53 | 60 | 67 | 73 |
7 | Fiorentina | 63 | 62 | 53 | 59 | 66 | 72 |
8 | Lazio | 62 | 61 | 52 | 58 | 65 | 71 |
9 | Torino | 53 | 54 | 44 | 50 | 57 | 63 |
10 | Verona | 51 | 51 | 42 | 48 | 55 | 61 |
11 | Bologna | 50 | 50 | 40 | 46 | 53 | 59 |
12 | Empoli | 48 | 48 | 39 | 45 | 51 | 58 |
13 | Sassuolo | 49 | 48 | 39 | 45 | 52 | 58 |
14 | Udinese | 47 | 47 | 37 | 43 | 50 | 57 |
15 | Sampdoria | 39 | 39 | 31 | 36 | 42 | 49 |
16 | Spezia | 31 | 31 | 23 | 28 | 34 | 40 |
17 | Venezia | 30 | 29 | 23 | 27 | 33 | 38 |
18 | Cagliari | 26 | 25 | 19 | 23 | 28 | 34 |
19 | Genoa | 20 | 20 | 15 | 18 | 22 | 27 |
20 | Salernitana | 18 | 17 | 12 | 16 | 20 | 26 |
We can also plot the final rankings distributions and compare them with the actual observed points:
real_points= data.frame(
Team=teams,
Pts=NA
)
for (t in 1:n_teams){
n_wins= SerieA_data %>% filter(((ht==t) &(FTHG>FTAG)) | ((at==t) & (FTHG<FTAG)))%>% nrow()
n_draws= SerieA_data %>% filter(((ht==t) | (at==t)) & (FTHG == FTAG))%>% nrow()
real_points$Pts[t]= 3*n_wins+n_draws
}
Upon reviewing the simulated rankings generated by the two models against the actual Serie A 2021-2022 standings, several key observations can be made:
Overall Picture
- Both models effectively captured the overall situation of the league. Specifically, they identified the correct trend for many teams, even though the exact positions were not always correct.
- Lower Table: The points for lower-ranked teams were generally underestimated, as these teams secured more points than predicted.
- Upper Table: The points for top-ranked teams were slightly overestimated but not significantly.
Qualifications to the European competitions
The models’ predictions for the top 7 positions, which determine qualification for European competitions (Champions League, UEFA Europa League, and Conference League), showed notable accuracy. In particular, both the offline and the online model correctly identified 6 out of 7 teams. However, the exact rankings of these teams were not as accurate. For instance, both models predicted Inter to top the league, while Milan actually won the championship. This discrepancy can be attributed to the competitive nature of the season; additionally, Milan’s final matches were much more challenging than Inter’s, which could explain why the models predicted Inter to win.
Some notable examples
- Inter: Predicted as the top team by both models but actually finished second. The season was highly competitive, and Milan’s more challenging end-of-season schedule was a significant factor in the outcome.
- Atalanta: Predicted to qualify for the Champions League in both models. However, after a strong first half of the season, Atalanta experienced a significant drop in form, which the models did not account for.
- Relegation Zone: both models accurately predicted Genoa and Cagliari as relegated teams. Venezia was predicted to avoid relegation, whereas in reality they did not. Additionally, both models predicted Salernitana to finish last. Given the team’s position at the 32nd matchday, when they were at the bottom of the table and seemingly destined for relegation, this prediction was reasonable (Salernitana’s miraculous escape under coach Davide Nicola defied everyone’s expectations).
Nevertheless, it is important to note that football is inherently unpredictable. Our models, despite their simplicity, managed to capture the broader trends effectively, which is admirable considering the inherent randomness and volatility of this beautiful sport.
Conclusions and further developments
Model Insights: Despite the simplicity of the model, it provided numerous interesting insights into the Serie A season analyzed. The estimates and predictive checks were generally consistent with real data, demonstrating the model’s robustness.
Bayesian Approach Advantages: Compared to previous projects using goal-based models, the Bayesian approach offered significant advantages. Dealing with distributions of parameters instead of just point estimates (with or without confidence intervals) allowed for:
- Simulating matches with greater detail.
- Estimating probabilities through simulations.
- Checking the goodness of results via posterior predictive checks.
Comparison of Approaches: The comparison between the online and offline approaches revealed similar predictive performances for the analyzed season, in terms of both (balanced) accuracy and Brier score. However, it was particularly evident that the online learning approach provides narrower variability bands in coefficient estimates over time, suggesting more precise and stable estimates.
Future Comparisons: It would be beneficial to compare the two approaches using multiple seasons’ data. My personal guess is that the online approach might outperform the offline approach in terms of quantitative metrics, especially when dealing with evolving data over several seasons.
Interpretative Tool: While the model is simple and not suitable for betting or making precise predictions (mainly due to the inherent randomness of football itself), it serves as an excellent interpretative tool. It captures the dynamics over time and provides valuable insights into the factors influencing team performance. Even more complex models in the literature struggle to make accurate predictions in football, underscoring the sport’s unpredictability.
Further Developments:
- Variational Inference: one of the limitations encountered was the slow sampling process. Incorporating variational inference procedures could significantly speed up the computations. Due to time constraints, only a preliminary exploration of this method was possible, but it is certainly one of the directions I will explore in the coming months.
- Multiple Seasons Analysis: Extending the analysis to multiple seasons might provide a more comprehensive understanding of team performance and model accuracy. This would also test the robustness of the online learning approach in capturing long-term trends.
- Model Enhancements: Future work could also explore enhancing the model’s complexity, such as incorporating more advanced statistical techniques or additional variables that could impact team performance (e.g., player injuries, manager changes, etc.).
In summary, this project demonstrated the utility of a Bayesian approach in sports analytics, providing a solid foundation for further exploration and development. The model’s simplicity and interpretative power make it a valuable tool for understanding football dynamics, while future enhancements could improve both its computational costs and its predictive capabilities.