I used to do a lot of statistical analyses of sports data where there was a latent parameter for the player’s ability. You can see an example here.
It was a natural choice to use a Bayesian hierarchical model where the abilities have a location/scale type distribution with support on the whole real line, and in fact this worked pretty well! You would see the kind of players at the top and bottom of the ability scale that you would expect.
Still, there were a few problems. The one that I thought about the most was that the data are systematically unbalanced: better players tend to produce more data than worse players. As a result, my models would often inappropriately treat the bad players like the good players: they would not only tend to be too certain about the abilities of low-data players, but would also be biased, thinking that these players were probably a bit better than they actually were. I never came up with a good way to solve this problem, despite trying a lot of things!
Even though I don’t work with sports data very much any more, the problem still haunted me, so when I read this great case study about geomagnetic storms it gave me an idea for yet another potential solution.
The idea was this: just as data about intense storms that cause electricity problems tell us about a tail of the bigger solar magnetism distribution, maybe data about professional sportspeople are best thought of as coming from a tail of the general sports ability distribution. If so, maybe something like the generalised Pareto distribution would be better than the bog-standard normal distribution for describing the pros’ abilities.
I thought I’d test this out with some sports data, and luckily there is a really nice baseball example on the Stan case studies website, complete with data from the 2006 Major League Baseball season.
What follows is some code that fetches this data and then uses Stan via cmdstanpy to analyse it with a model that has a latent generalised Pareto distribution for abilities. Check out this analysis’s GitHub repository for all the details.
These are the Python dependencies we are going to need. They should go in the file requirements.txt.
arviz
pandas
cmdstanpy
matplotlib
These are some optional Python libraries that I like to install (the file is called requirements-tooling.txt):
ipython
jupyter
black
pandas-stubs
types-toml
flake8
flake8-bugbear
flake8-docstrings
mypy
python-lsp-server[all]
python-lsp-black
pylsp-mypy
pyls-isort
A .gitignore file:
# ignore binary files
*
!*.*
!*/
# ignore hpp files
*.hpp
# ignore arviz files
idata*
# ignore dotfiles
.*
This makefile lets us run the analysis with the command make analysis.
.PHONY: clean-arviz clean-plots clean-stan clean-all analysis
.ONESHELL:

ACTIVATE_VENV = .venv/bin/activate
REQUIREMENTS_FILE = requirements.txt
DATA_FILE = baseball-hits-2006.csv
ENV = env.txt

ifeq ($(OS),Windows_NT)
INSTALL_CMDSTAN_FLAGS = --compiler
ACTIVATE_VENV = .venv/Scripts/activate
else
INSTALL_CMDSTAN_FLAGS =
endif

$(ACTIVATE_VENV):
	python -m venv .venv --prompt=baseball

$(ENV): $(ACTIVATE_VENV) $(REQUIREMENTS_FILE)
	. $(ACTIVATE_VENV) && (\
	python -m pip install --upgrade pip; \
	python -m pip install -r $(REQUIREMENTS_FILE); \
	install_cmdstan $(INSTALL_CMDSTAN_FLAGS); \
	echo "environment updated" > $(ENV); \
	)

$(DATA_FILE): $(ENV)
	. $(ACTIVATE_VENV) && (\
	python fetch_data.py; \
	)

analysis: $(ENV) $(DATA_FILE)
	. $(ACTIVATE_VENV) && (\
	python sample.py; \
	python analyse.py; \
	)

clean-stan:
	$(RM) $(shell find . -perm +100 -type f) # remove binary files
	$(RM) *.hpp

clean-arviz:
	$(RM) idata*.json

clean-plots:
	$(RM) *.png

clean-all: clean-stan clean-arviz clean-plots
A script for fetching data called fetch_data.py:
import pandas as pd

URL = "https://raw.githubusercontent.com/stan-dev/example-models/master/knitr/pool-binary-trials/baseball-hits-2006.csv"
FILE_OUT = "baseball-hits-2006.csv"

if __name__ == "__main__":
    print(f"Fetching data from {URL}")
    data = pd.read_csv(URL, comment="#")
    print(f"Writing data to {FILE_OUT}")
    data.to_csv(FILE_OUT, index=False)  # index=False avoids writing a spurious index column
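After running it, a quick way to sanity-check the file is to load it and describe the two columns the models use (K is each player’s number of at-bats and y their number of hits). This little snippet is just for checking and isn’t part of the analysis:

import pandas as pd

df = pd.read_csv("baseball-hits-2006.csv")
print(df[["y", "K"]].describe())  # y = hits, K = at-bats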
Since Stan doesn’t implement the generalised Pareto distribution yet, we need to do so with a user-defined function. Luckily we can just copy the relevant code from the geomagnetic storms analysis: for this analysis we only need the function gpareto_lpdf, which goes in a file called gpareto.stan.
real gpareto_lpdf(vector y, real ymin, real k, real sigma) {
  // generalised Pareto log pdf
  int N = rows(y);
  real inv_k = inv(k);
  if (k < 0 && max(y - ymin) / sigma > -inv_k)
    reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, ", ", sigma);
  if (sigma <= 0)
    reject("sigma<=0; found sigma =", sigma);
  if (fabs(k) > 1e-15)
    return -(1 + inv_k) * sum(log1p((y - ymin) * (k / sigma))) - N * log(sigma);
  else
    return -sum(y - ymin) / sigma - N * log(sigma);  // limit k->0
}
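For reference, this is the log density the function computes, in the case study’s parameterisation with location $y_{min}$, shape $k$ and scale $\sigma$:

$$\log p(y \mid y_{min}, k, \sigma) = -\log \sigma - \left(1 + \frac{1}{k}\right) \log\left(1 + \frac{k\,(y - y_{min})}{\sigma}\right),$$

which tends to $-\log \sigma - (y - y_{min})/\sigma$ as $k \to 0$: that limit is what the fabs(k) > 1e-15 branch handles.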
For a comparison, I thought a good choice would be the best performing model from the original baseball case study. This model is a lot like the ones I used to use to analyse football (soccer) data and, as we will see, it exhibits some of the problems that I ran into. I copied the code from the case study into the file model-normal.stan, deleting a little bit in order to keep things simple.
data {
  int<lower=0> N;           // items
  array[N] int<lower=0> K;  // trials
  array[N] int<lower=0> y;  // successes
}
parameters {
  real mu;              // population mean of success log-odds
  real<lower=0> sigma;  // population sd of success log-odds
  vector[N] alpha_std;  // success log-odds (standardized)
}
model {
  mu ~ normal(-1, 1);        // hyperprior
  sigma ~ normal(0, 1);      // hyperprior
  alpha_std ~ normal(0, 1);  // prior (hierarchical)
  y ~ binomial_logit(K, mu + sigma * alpha_std);  // likelihood
}
generated quantities {
  vector[N] alpha = mu + sigma * alpha_std;
}
The new model is mostly the same as the normal model, but the prior distribution for the latent alpha parameters is generalised Pareto instead of normal. It goes in a file called model-gpareto.stan.
functions {
#include gpareto.stan
}
data {
  int<lower=0> N;           // items
  array[N] int<lower=0> K;  // trials
  array[N] int<lower=0> y;  // successes
  real min_alpha;           // no one worse than this would be in the dataset
  real max_alpha;
}
parameters {
  real<lower=0> sigma;  // scale parameter of the generalised Pareto distribution
  real<lower=-sigma/(max_alpha-min_alpha)> k;  // shape parameter of the generalised Pareto distribution
  vector<lower=min_alpha,upper=max_alpha>[N] alpha;  // success log-odds
}
model {
  sigma ~ normal(0, 1);                  // hyperprior
  alpha ~ gpareto(min_alpha, k, sigma);  // prior (hierarchical)
  y ~ binomial_logit(K, alpha);          // likelihood
  // note there is no explicit prior for k
}
This code goes in a Python script called sample.py. It runs sampling for both models against the 2006 data, puts the results into arviz InferenceData objects and saves them as json files.
Note that this script hard-codes minimum and maximum true batting averages that the generalised Pareto model requires. I think 0.5% and 99% are pretty reasonable choices (about -5.3 and +4.6 on the log-odds scale): from my limited understanding of baseball a true average of 99% is basically impossible, and even the worst pro could probably get a hit more often than one time in 200. A normal person, on the other hand, would just about never reach even this low threshold.
import arviz as az
import cmdstanpy
import pandas as pd
from scipy.special import logit

STAN_FILE_NORMAL = "model-normal.stan"
STAN_FILE_GPARETO = "model-gpareto.stan"
DATA_FILE = "baseball-hits-2006.csv"
SAMPLE_KWARGS = {
    "chains": 4,
    "iter_warmup": 1000,
    "iter_sampling": 1000,
    "show_progress": False,
}
SAMPLE_KWARGS_GPARETO = {
    "max_treedepth": 12,
    "adapt_delta": 0.99,
}
MIN_ALPHA = logit(0.005)  # you probably need a true average > 0.5% to get in the dataset
MAX_ALPHA = logit(0.99)  # no one has a true average of 99%


def get_summary(idata):
    summary_ss = az.summary(idata.sample_stats, var_names=["lp", "diverging"])
    summary_vars = az.summary(idata, var_names="~alpha", filter_vars="like")
    return pd.concat([summary_ss, summary_vars])


def fit_models(model_file_dict, data_df):
    model_dict = {
        name: cmdstanpy.CmdStanModel(stan_file=stan_file)
        for name, stan_file in model_file_dict.items()
    }
    data_dict = {
        "N": data_df.shape[0],
        "y": data_df["y"].tolist(),
        "K": data_df["K"].tolist(),
        "min_alpha": MIN_ALPHA,
        "max_alpha": MAX_ALPHA,
    }
    for name, model in model_dict.items():
        sample_kwargs = (
            SAMPLE_KWARGS
            if "gpareto" not in name
            else {**SAMPLE_KWARGS, **SAMPLE_KWARGS_GPARETO}
        )
        print(f"Fitting model {name}")
        mcmc = model.sample(data=data_dict, **sample_kwargs)
        idata = az.from_cmdstanpy(mcmc)
        print(get_summary(idata))
        idata_file = f"idata-{name}.json"
        print(f"Saving idata to {idata_file}")
        idata.to_json(idata_file)


if __name__ == "__main__":
    data_df = pd.read_csv(DATA_FILE)
    fit_models(
        {"normal": STAN_FILE_NORMAL, "gpareto": STAN_FILE_GPARETO},
        data_df,
    )
From the results of running this script we can see that both models survive cmdstanpy’s built-in diagnostic checks: now it’s time to analyse the results.
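Before that, if you want to double-check the saved fits by hand, a quick sketch like this (assuming the idata json files written by sample.py are in the working directory) prints the number of divergent transitions and the usual convergence statistics for the top-level parameters:

import arviz as az

idata = az.from_json("idata-gpareto.json")
# number of post-warmup divergent transitions: should be zero
print(int(idata.sample_stats["diverging"].sum()))
# r_hat and effective sample size for the hyperparameters
print(az.summary(idata, var_names=["sigma", "k"]))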
The next script, analyse.py, loads the results of the sampling using arviz and creates a plot of each model’s alpha parameters, transformed onto the more meaningful probability scale (via the inverse-logit function expit), where they represent what each model thinks about each player’s true batting average.
import arviz as az
import pandas as pd
from matplotlib import pyplot as plt
from scipy.special import expit

DATA_FILE = "baseball-hits-2006.csv"
ALPHA_PLOT_FILE = "alpha-plot.png"


def draw_alpha_plot(idata_gpareto, idata_normal, data):
    alpha_qs_gpareto, alpha_qs_normal = (
        idata.posterior["alpha"]
        .quantile([0.05, 0.95], dim=("chain", "draw"))
        .to_series()
        .pipe(expit)
        .unstack("quantile")
        .add_prefix(name + "_")
        for idata, name in zip([idata_gpareto, idata_normal], ["gpareto", "normal"])
    )
    data = data.copy().join(alpha_qs_gpareto).join(alpha_qs_normal)
    f, ax = plt.subplots(figsize=[12, 5])
    ax.scatter(data["K"], data["y"] / data["K"], label="Obs", color="black")
    for model, color in [("gpareto", "tab:blue"), ("normal", "tab:orange")]:
        ax.vlines(
            data["K"],
            data[f"{model}_0.05"],
            data[f"{model}_0.95"],
            label=model.capitalize() + " model 5%-95% posterior interval",
            color=color,
            zorder=0,
        )
    ax.set(
        title="Observed vs modelled batting averages",
        ylabel="Hit probability",
        xlabel="Number of at-bats",
    )
    ax.legend(frameon=False)
    return f, ax


if __name__ == "__main__":
    data = pd.read_csv(DATA_FILE)
    idata_gpareto = az.from_json("idata-gpareto.json")
    idata_normal = az.from_json("idata-normal.json")
    f, ax = draw_alpha_plot(idata_gpareto, idata_normal, data)
    f.savefig(ALPHA_PLOT_FILE, bbox_inches="tight")
From this plot we can see that the normal model is somewhat over-regularised: it thinks all the batters have a true average of about 0.3, which is unlikely. It also thinks the players with few at-bats tend to be a bit better than their results would suggest: in the 0 to 100 at-bats region there are more black dots below the orange band than above it.
The generalised Pareto model, on the other hand, shows very big differences in how certain it is about particular players. The key thing is that it is far more uncertain about the players with fewer at-bats: this is something I had never been able to achieve before with a hierarchical model, and it made me pretty happy with this experiment.
The generalised Pareto model perhaps has a bit of the opposite problem to the normal model, under-regularising to the point where it thinks that some players might have unrealistically high true averages in the 0.7+ range. However, in my opinion it is still closer than the normal model to how you might intuitively respond to the data. If desired, more regularisation could be achieved by adding a line like alpha ~ normal(logit(0.2), some-appropriate-sd); to the Stan program (this is essentially what the extra regularised model at the end of this post does).
As I mentioned, I was pretty happy with how the results of the new model look in the graph above. It’s also nice that the sigma and k parameters of the generalised Pareto distribution were estimated fairly precisely. However, it would be nice to understand a bit better why the two models behave so differently. Here are a few things that would be nice to look into:
- Compare against a simple non-hierarchical model (for example, keep the normal model but hardcode the parameter sigma to some large value): do the results come out about the same as the generalised Pareto model?
- Do some out-of-sample testing.
- Is the reason for the difference the fact that the generalised Pareto distribution is non-symmetrical? If so, could we achieve the same results using a different non-symmetrical distribution?
- Answer some extreme-value type questions, like “what is the probability of a batter having a true average greater than 0.6?” (the sketch after this list shows one way to start).
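As a sketch of how the last question might be approached with the posterior draws that sample.py saves (assuming idata-gpareto.json is present), the following computes the posterior expected share of players whose true average exceeds 0.6:

import arviz as az
from scipy.special import expit

idata = az.from_json("idata-gpareto.json")
# posterior draws of each player's true batting average, on the probability scale
true_avg = expit(idata.posterior["alpha"])
# fraction of player/draw combinations above 0.6
print(float((true_avg > 0.6).mean()))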
After I posted this analysis on the Stan discourse, some people suggested a) that the generalised Pareto model was under-regularised and b) that the normal model should have an effect for the number of at-bats. I thought I’d better try out both of those ideas. First, here is the normal model with an at-bats effect, which goes in a file called model-normal-at-bats.stan:
data {
  int<lower=0> N;           // items
  array[N] int<lower=0> K;  // trials
  array[N] int<lower=0> y;  // successes
}
transformed data {
  vector[N] log_K = log(to_vector(K));
  vector[N] log_K_std = (log_K - mean(log_K)) / sd(log_K);
}
parameters {
  real mu;              // population mean of success log-odds
  real<lower=0> sigma;  // population sd of success log-odds
  real a_K;             // effect of the standardised log number of at-bats
  vector[N] alpha_std;  // success log-odds (standardized)
}
model {
  a_K ~ normal(0, 0.1);      // prior for the at-bats effect
  mu ~ normal(-1, 1);        // hyperprior
  sigma ~ normal(0, 1);      // hyperprior
  alpha_std ~ normal(0, 1);  // prior (hierarchical)
  y ~ binomial_logit(K, mu + a_K * log_K_std + sigma * alpha_std);  // likelihood
}
generated quantities {
  vector[N] alpha = mu + a_K * log_K_std + sigma * alpha_std;
}
functions {
#include gpareto.stan
}
data {
  int<lower=0> N;           // items
  array[N] int<lower=0> K;  // trials
  array[N] int<lower=0> y;  // successes
  real min_alpha;           // no one worse than this would be in the dataset
  real max_alpha;
}
parameters {
  real<lower=0> sigma;  // scale parameter of the generalised Pareto distribution
  real<lower=-sigma/(max_alpha-min_alpha)> k;  // shape parameter of the generalised Pareto distribution
  vector<lower=min_alpha,upper=max_alpha>[N] alpha;  // success log-odds
}
model {
  sigma ~ normal(0, 1);                  // hyperprior
  alpha ~ gpareto(min_alpha, k, sigma);  // prior (hierarchical)
  alpha ~ normal(-1.38, 0.4);            // extra regularisation: roughly 0.1 to 0.4 on the probability scale
  y ~ binomial_logit(K, alpha);          // likelihood
  // note there is no explicit prior for k
}
import pandas as pd

from sample import fit_models, DATA_FILE

STAN_FILE_GPARETO_REG = "model-gpareto-reg.stan"
STAN_FILE_NORMAL_AT_BATS = "model-normal-at-bats.stan"

if __name__ == "__main__":
    data_df = pd.read_csv(DATA_FILE)
    fit_models(
        {
            "gpareto-reg": STAN_FILE_GPARETO_REG,
            "normal-at-bats": STAN_FILE_NORMAL_AT_BATS,
        },
        data_df,
    )
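Another short script reuses draw_alpha_plot from analyse.py to draw the same plot for the new models: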
import arviz as az
import pandas as pd

from analyse import DATA_FILE, draw_alpha_plot

PLOT_FILE = "alpha_plot_extra.png"

if __name__ == "__main__":
    data = pd.read_csv(DATA_FILE)
    idata_gpareto = az.from_json("idata-gpareto-reg.json")
    idata_normal = az.from_json("idata-normal-at-bats.json")
    f, ax = draw_alpha_plot(idata_gpareto, idata_normal, data)
    f.savefig(PLOT_FILE, bbox_inches="tight")
Both of the interventions seem to have resulted in improvements: there is no longer an obvious bias in the orange bands as a function of the number of at-bats, and the generalised Pareto model no longer thinks that some players are capable of batting averages close to 1.
I still don’t think the normal model (the orange bands) handles the low-data players correctly: I think it should be more uncertain about them.