Extreme value distributions for sports models

Introduction

I used to do a lot of statistical analyses of sports data where there was a latent parameter for each player’s ability. You can see an example here.

It was a natural choice to use a Bayesian hierarchical model where the abilities have a location/scale type distribution with support on the whole real line, and in fact this worked pretty well! You would see the kind of players at the top and bottom of the ability scale that you would expect.

Still, there were a few problems. The one that I thought about the most was that the data are systematically unbalanced, because better players tend to produce more data than worse players. As a result, my models would often inappropriately treat the bad players as if they were like the good players: they would not only tend to be too certain about the abilities of low-data players, but would also be biased, thinking that these players are probably a bit better than they actually are. I never came up with a good way to solve this problem, despite trying a lot of things!

Even though I don’t work with sports data very much any more, the problem still haunted me, so when I read this great case study about geomagnetic storms it gave me an idea for yet another potential solution.

The idea was this: just as data about intense storms that cause electricity problems tell us about the tail of the bigger solar magnetism distribution, maybe data about professional sportspeople are best thought of as coming from the tail of the general sports ability distribution. If so, maybe something like the generalised Pareto distribution might be better than the bog-standard normal distribution for describing the pros’ abilities.
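To make that concrete, here is a throwaway sketch (not part of the analysis, with made-up numbers) of how differently a normal distribution and a generalised Pareto distribution treat values far beyond a threshold, using scipy:

# Throwaway illustration with made-up numbers: how heavy is the tail beyond a
# threshold under a normal distribution vs a generalised Pareto distribution?
from scipy.stats import genpareto, norm

threshold = 2.0  # hypothetical ability needed to turn pro (in sd units)
extreme = 4.0    # a truly exceptional player

# Normal: probability of exceeding `extreme` given you already exceed `threshold`
p_normal = norm.sf(extreme) / norm.sf(threshold)
# Generalised Pareto for the tail above `threshold` (shape chosen arbitrarily)
p_gpareto = genpareto(c=0.3, loc=threshold, scale=1.0).sf(extreme)

print(f"normal tail:  {p_normal:.4f}")   # roughly 0.001
print(f"gpareto tail: {p_gpareto:.4f}")  # roughly 0.2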

I thought I’d test this out with some sports data, and luckily there is a really nice baseball example on the Stan case studies website, complete with data from the 2006 Major League Baseball season.

What follows is some code that fetches this data and then uses Stan via cmdstanpy to analyse it with a model that has a latent generalised Pareto distribution for abilities. Check out this analysis’s GitHub repository for all the details.

Set up environment

These are the Python dependencies we are going to need. They should go in the file requirements.txt.

arviz
pandas
cmdstanpy
matplotlib

These are some optional Python libraries that I like to install (the file is called requirements-tooling.txt):

ipython
jupyter
black
pandas-stubs
types-toml
flake8
flake8-bugbear
flake8-docstrings
mypy
python-lsp-server[all]
python-lsp-black
pylsp-mypy
pyls-isort

A .gitignore file:

# ignore binary files
*
!*.*
!*/

# ignore hpp files
*.hpp

# ignore arviz files
idata*

# ignore dotfiles
.*

Code for running the analysis

This makefile lets us run the analysis with the command make analysis.

.PHONY: clean-arviz clean-plots clean-stan clean-all analysis
.ONESHELL :

ACTIVATE_VENV = .venv/bin/activate
REQUIREMENTS_FILE = requirements.txt
DATA_FILE = baseball-hits-2006.csv
ENV = env.txt

ifeq ($(OS),Windows_NT)
        INSTALL_CMDSTAN_FLAGS = --compiler
        ACTIVATE_VENV = .venv/Scripts/activate
else
        INSTALL_CMDSTAN_FLAGS =
endif

$(ACTIVATE_VENV):
        python -m venv .venv --prompt=baseball

$(ENV): $(ACTIVATE_VENV) $(REQUIREMENTS_FILE)
        . $(ACTIVATE_VENV) && (\
          python -m pip install --upgrade pip; \
          python -m pip install -r $(REQUIREMENTS_FILE); \
          install_cmdstan $(INSTALL_CMDSTAN_FLAGS); \
          echo "environment updated" > $(ENV); \
        )

$(DATA_FILE): $(ENV)
        . $(ACTIVATE_VENV) && (\
          python fetch_data.py; \
        )

analysis: $(ENV) $(DATA_FILE)
        . $(ACTIVATE_VENV) && (\
          python sample.py; \
          python analyse.py; \
        )

clean-stan:
        $(RM) $(shell find . -perm +100 -type f) # remove binary files
        $(RM) *.hpp

clean-arviz:
        $(RM) idata*.json

clean-plots:
        $(RM) *.png

clean-all: clean-stan clean-arviz clean-plots

Fetch data

A script for fetching data called fetch_data.py:

import pandas as pd

URL = "https://raw.githubusercontent.com/stan-dev/example-models/master/knitr/pool-binary-trials/baseball-hits-2006.csv"
FILE_OUT = "baseball-hits-2006.csv"

if __name__ == "__main__":
    print(f"Fetching data from {URL}")
    data = pd.read_csv(URL, comment="#")
    print(f"Writing data to {FILE_OUT}")
    data.to_csv(FILE_OUT)

Define Stan generalised Pareto distribution functions

Since Stan doesn’t implement the generalised Pareto distribution yet, we need to define it with a user-defined function. Luckily we can copy the relevant code from the geomagnetic storms analysis.

For this analysis we only need the function gpareto_lpdf, which goes in a file called gpareto.stan.

real gpareto_lpdf(vector y, real ymin, real k, real sigma) {
  // generalised Pareto log pdf 
  int N = rows(y);
  real inv_k = inv(k);
  if (k<0 && max(y-ymin)/sigma > -inv_k)
    reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, ", ", sigma);
  if (sigma<=0)
    reject("sigma<=0; found sigma =", sigma);
  if (fabs(k) > 1e-15)
    return -(1+inv_k)*sum(log1p((y-ymin) * (k/sigma))) -N*log(sigma);
  else
    return -sum(y-ymin)/sigma -N*log(sigma); // limit k->0
}
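As a quick sanity check (not part of the analysis), the k ≠ 0 branch of this function can be compared against scipy’s generalised Pareto log density, where Stan’s ymin, k and sigma correspond to scipy’s loc, c and scale:

import numpy as np
from scipy.stats import genpareto

# Arbitrary test values
ymin, k, sigma = 0.5, 0.2, 1.3
y = np.array([0.7, 1.1, 2.4, 5.0])

# Same formula as the k != 0 branch of gpareto_lpdf above
stan_style = (
    -(1 + 1 / k) * np.sum(np.log1p((y - ymin) * (k / sigma)))
    - len(y) * np.log(sigma)
)
scipy_style = genpareto(c=k, loc=ymin, scale=sigma).logpdf(y).sum()

assert np.isclose(stan_style, scipy_style)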

Define model with normal distribution for abilities

For a comparison I thought a good choice would be the best-performing model from the original baseball case study. This model is a lot like the ones I used to use to analyse football (soccer) data and, as we will see, it exhibits some of the problems that I ran into.

I copied the code from the baseball case study into the file model-normal.stan, deleting a little bit in order to keep things simple.

data {
  int<lower=0> N; // items
  array[N] int<lower=0> K; // trials
  array[N] int<lower=0> y; // successes
}
parameters {
  real mu; // population mean of success log-odds
  real<lower=0> sigma; // population sd of success log-odds
  vector[N] alpha_std; // success log-odds (standardized)
}
model {
  mu ~ normal(-1, 1); // hyperprior
  sigma ~ normal(0, 1); // hyperprior
  alpha_std ~ normal(0, 1); // prior (hierarchical)
  y ~ binomial_logit(K, mu + sigma * alpha_std); // likelihood
}
generated quantities {
  vector[N] alpha = mu + sigma * alpha_std; // success log-odds
}

Define model with generalised Pareto distribution for abilities

The new model is mostly the same as the normal model, but the prior distribution for the latent alpha parameters is generalised Pareto instead of normal.

It goes in a file called model-gpareto.stan.

functions {
#include gpareto.stan
}
data {
  int<lower=0> N; // items
  array[N] int<lower=0> K; // trials
  array[N] int<lower=0> y; // successes
  real min_alpha; // no one worse than this would be in the dataset
  real max_alpha;
}
parameters {
  real<lower=0> sigma; // scale parameter of generalised pareto distribution
  real<lower=-sigma/(max_alpha-min_alpha)> k; // shape parameter of generalised pareto distribution
  vector<lower=min_alpha,upper=max_alpha>[N] alpha; // success log-odds
}
model {
  sigma ~ normal(0, 1); // hyperprior
  alpha ~ gpareto(min_alpha, k, sigma); // prior (hierarchical)
  y ~ binomial_logit(K, alpha); // likelihood
  // note no explicit prior for k
}
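The lower bound on k deserves a comment: when k is negative the generalised Pareto distribution has a finite upper endpoint at min_alpha - sigma / k, and the constraint k > -sigma / (max_alpha - min_alpha) ensures that this endpoint is at least max_alpha, so every allowed value of alpha stays inside the distribution’s support. A small Python sketch of that reasoning (the numbers are arbitrary):

# For k < 0 the generalised Pareto support is [min_alpha, min_alpha - sigma / k].
# The declared lower bound on k is exactly what keeps max_alpha inside it.
min_alpha, max_alpha = -5.29, 4.6  # roughly the values used below
sigma = 1.5
k_lower_bound = -sigma / (max_alpha - min_alpha)

for k in [k_lower_bound + 0.01, k_lower_bound - 0.01]:
    upper_endpoint = min_alpha - sigma / k
    print(f"k = {k:.3f}: upper endpoint = {upper_endpoint:.2f}, "
          f"covers max_alpha: {upper_endpoint >= max_alpha}")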

Do sampling

This code goes in a Python script called sample.py; it runs sampling for both models against the 2006 data, puts the results in arviz objects and saves them as json files.

Note that this script hard-codes some minimum and maximum true batting averages that are required by the generalised pareto model. I think 0.5% and 99% are pretty reasonable choices: from my limited understanding of baseball 99% is basically impossible, and even the worst pro could probably get on base more often than one time out of 200. A normal person, on the other hand, would just about never reach even this low threshold.
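For concreteness, here is what those thresholds come out to on the log-odds scale that the model actually works with:

from scipy.special import logit

print(logit(0.005))  # about -5.29
print(logit(0.99))   # about 4.60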

import arviz as az
import cmdstanpy
import pandas as pd
from scipy.special import logit

STAN_FILE_NORMAL = "model-normal.stan"
STAN_FILE_GPARETO = "model-gpareto.stan"
DATA_FILE = "baseball-hits-2006.csv"
SAMPLE_KWARGS = {
    "chains": 4,
    "iter_warmup": 1000,
    "iter_sampling": 1000,
    "show_progress": False
}
SAMPLE_KWARGS_GPARETO = {
    "max_treedepth": 12,
    "adapt_delta": 0.99,
}
MIN_ALPHA = logit(0.005)  # you probably need a true average >0.5% to get in the dataset
MAX_ALPHA = logit(0.99)   # no one has a true average of 99%

def get_summary(idata):
    summary_ss = az.summary(idata.sample_stats, var_names=["lp", "diverging"])
    summary_vars = az.summary(idata, var_names="~alpha", filter_vars="like")
    return pd.concat([summary_ss, summary_vars])


def fit_models(model_file_dict, data_df):
    model_dict = {
        name: cmdstanpy.CmdStanModel(stan_file=stan_file)
        for name, stan_file in model_file_dict.items()
    }
    data_dict = {
        "N": data_df.shape[0],
        "y": data_df["y"].tolist(),
        "K": data_df["K"].tolist(),
        "min_alpha": MIN_ALPHA,
        "max_alpha": MAX_ALPHA,
    }
    for name, model in model_dict.items():
        sample_kwargs = (
            SAMPLE_KWARGS
            if "gpareto" not in name
            else {**SAMPLE_KWARGS, **SAMPLE_KWARGS_GPARETO}
        )
        print(f"Fitting model {name}")
        mcmc = model.sample(data=data_dict, **sample_kwargs)
        idata = az.from_cmdstanpy(mcmc)
        print(get_summary(idata))
        idata_file = f"idata-{name}.json"
        print(f"Saving idata to {idata_file}")
        idata.to_json(idata_file)


if __name__ == "__main__":
    data_df = pd.read_csv(DATA_FILE)
    fit_models(
        {"normal": STAN_FILE_NORMAL, "gpareto": STAN_FILE_GPARETO},
        data_df
    )

Analyse

From the results of running this script we can see that both models survive cmdstanpy’s built-in diagnostic checks: now it’s time to analyse the results.
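If you want to double-check the diagnostics from the saved results rather than the sampling output, something like this sketch (using the idata json files written by sample.py) should do:

# Quick diagnostic check on the saved results: count divergent transitions and
# look at the worst r_hat for each model.
import arviz as az

for name in ["normal", "gpareto"]:
    idata = az.from_json(f"idata-{name}.json")
    n_divergent = int(idata.sample_stats["diverging"].sum())
    worst_rhat = float(az.rhat(idata).to_array().max())
    print(f"{name}: {n_divergent} divergences, worst r_hat {worst_rhat:.3f}")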

The next script, analyse.py, loads the results of the sampling using arviz and creates a plot of each model’s alpha parameters, transformed onto the more meaningful probability scale where they represent what each model thinks about each player’s true batting average.

import arviz as az
import pandas as pd
from matplotlib import pyplot as plt
from scipy.special import expit

DATA_FILE = "baseball-hits-2006.csv"
ALPHA_PLOT_FILE = "alpha-plot.png"


def draw_alpha_plot(idata_gpareto, idata_normal, data):
    alpha_qs_gpareto, alpha_qs_normal = (
        idata.posterior["alpha"]
        .quantile([0.05, 0.95], dim=("chain", "draw"))
        .to_series()
        .pipe(expit)
        .unstack("quantile")
        .add_prefix(name + "_")
        for idata, name in zip([idata_gpareto, idata_normal], ["gpareto", "normal"])
    )
    data = data.copy().join(alpha_qs_gpareto).join(alpha_qs_normal)
    f, ax = plt.subplots(figsize=[12, 5])
    ax.scatter(data["K"], data["y"] / data["K"], label="Obs", color="black")
    for model, color in [("gpareto", "tab:blue"), ("normal", "tab:orange")]:
        ax.vlines(
            data["K"],
            data[f"{model}_0.05"],
            data[f"{model}_0.95"],
            label=model.capitalize() + " model 5%-95% posterior interval",
            color=color,
            zorder=0,
        )
    ax.set(
        title="Observed vs modelled batting averages",
        ylabel="Hit probability",
        xlabel="Number of at-bats",
    )
    ax.legend(frameon=False)
    return f, ax


if __name__ == "__main__":
    data = pd.read_csv(DATA_FILE)
    idata_gpareto = az.from_json("idata-gpareto.json")
    idata_normal = az.from_json("idata-normal.json")
    f, ax = draw_alpha_plot(idata_gpareto, idata_normal, data)
    f.savefig(ALPHA_PLOT_FILE, bbox_inches="tight")

[Figure ./alpha-plot.png: observed vs modelled batting averages]

From this plot we can see that the normal model is somewhat over-regularised: it thinks all the batters have a true average of about 0.3, which is unlikely. It also thinks the players with few at-bats tend to be a bit better than their results would suggest: in the 0-to-100 at-bats region there are more black dots below the orange band than above it.

The generalised Pareto model, on the other hand, shows very big differences in how certain it is about particular players. The key thing is that it is far more uncertain about the players with fewer at-bats: this is something I had never been able to achieve before with a hierarchical model, and it made me pretty happy with this experiment.

The generalised Pareto model perhaps has a bit of the opposite problem to the normal model, under-regularising to the point where it thinks that some players might have unrealistically high true averages in the 0.7+ range. However, in my opinion it is still closer than the normal model to how you might intuitively respond to the data. If desired, more regularisation could be achieved by just adding a line like alpha ~ normal(logit(0.2), some-appropriate-sd); to the Stan program.

Conclusion

As I mentioned, I was pretty happy with how the results of the new model look in the graph above. It’s also nice that the sigma and k parameters of the generalised Pareto distribution were identified fairly precisely. However, it would be nice to understand a bit better why the two models behave so differently. Here are a few things that would be nice to look into:

  • Compare against a simple non-hierarchical model (for example keep the normal model but hardcode the parameter sigma to some large value): do the results come out about the same as the generalised Pareto model?
  • Do some out of sample testing.
  • Is the reason for the difference the fact that the generalised pareto distribution is non-symmetrical? If so, could we achieve the same results using a different non-symmetrical distribution?
  • Answer some extreme-value type questions like "what is the probability of a batter having a true average greater than 0.6?" (see the sketch below).
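Here is a sketch of how the last question might be answered from the saved posterior draws, reading it as a question about each individual player’s true average:

# Posterior probability, for each player, that their true average exceeds 0.6,
# according to the generalised Pareto model.
import arviz as az
from scipy.special import expit

idata = az.from_json("idata-gpareto.json")
p_above = (expit(idata.posterior["alpha"]) > 0.6).mean(dim=("chain", "draw"))
print(float(p_above.max()))  # the player the model is most optimistic about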

Postscript: some more models!

After I posted this analysis on the Stan discourse, some people suggested that a) the generalised Pareto model was under-regularised and b) the normal model should have an effect for number of at-bats. I thought I’d better try out both of those ideas.

Normal model with effect for number of at-bats

This goes in a file called model-normal-at-bats.stan; it adds an effect of the (standardised log) number of at-bats to the normal model’s linear predictor.

data {
  int<lower=0> N; // items
  array[N] int<lower=0> K; // trials
  array[N] int<lower=0> y; // successes  
}
transformed data {
  vector[N] log_K = log(to_vector(K));
  vector[N] log_K_std = (log_K - mean(log_K)) / sd(log_K);
}
parameters {
  real mu; // population mean of success log-odds
  real<lower=0> sigma; // population sd of success log-odds
  real a_K;
  vector[N] alpha_std; // success log-odds (standardized)
}
model {
  a_K ~ normal(0, 0.1);
  mu ~ normal(-1, 1); // hyperprior
  sigma ~ normal(0, 1); // hyperprior
  alpha_std ~ normal(0, 1); // prior (hierarchical)
  y ~ binomial_logit(K, mu + a_K * log_K_std + sigma * alpha_std); // likelihood
}
generated quantities {
  vector[N] alpha = mu + a_K * log_K_std + sigma * alpha_std;
}

Generalised Pareto model with more regularisation

This goes in a file called model-gpareto-reg.stan; it is the same as the generalised Pareto model above, plus an extra normal prior on alpha.

functions {
#include gpareto.stan
}
data {
  int<lower=0> N; // items
  array[N] int<lower=0> K; // trials
  array[N] int<lower=0> y; // successes
  real min_alpha; // no one worse than this would be in the dataset
  real max_alpha;
}
parameters {
  real<lower=0> sigma; // scale parameter of generalised pareto distribution
  real<lower=-sigma/(max_alpha-min_alpha)> k; // shape parameter of generalised pareto distribution
  vector<lower=min_alpha,upper=max_alpha>[N] alpha; // success log-odds
}
model {
  sigma ~ normal(0, 1); // hyperprior
  alpha ~ gpareto(min_alpha, k, sigma); // prior (hierarchical)
  alpha ~ normal(-1.38, 0.4);  // prob between 0.1 and 0.4 on prob scale
  y ~ binomial_logit(K, alpha); // likelihood
  // note no explicit prior for k
}
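As a quick check that the extra prior does what the comment claims, transform a roughly two-standard-deviation interval back to the probability scale:

from scipy.special import expit

# normal(-1.38, 0.4) on the log-odds scale...
print(expit(-1.38 - 2 * 0.4))  # about 0.10
print(expit(-1.38 + 2 * 0.4))  # about 0.36 on the probability scale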

Fit

import pandas as pd
from sample import fit_models, DATA_FILE

STAN_FILE_GPARETO_REG = "model-gpareto-reg.stan"
STAN_FILE_NORMAL_AT_BATS = "model-normal-at-bats.stan"

if __name__ == "__main__":
    data_df = pd.read_csv(DATA_FILE)
    fit_models(
        {
            "gpareto-reg": STAN_FILE_GPARETO_REG,
            "normal-at-bats": STAN_FILE_NORMAL_AT_BATS
        },
        data_df
    )

Analyse

import arviz as az
import pandas as pd
from analyse import DATA_FILE, draw_alpha_plot

PLOT_FILE = "alpha_plot_extra.png"

if __name__ == "__main__":
    data = pd.read_csv(DATA_FILE)
    idata_gpareto = az.from_json("idata-gpareto-reg.json")
    idata_normal = az.from_json("idata-normal-at-bats.json")
    f, ax = draw_alpha_plot(idata_gpareto, idata_normal, data)
    f.savefig(PLOT_FILE, bbox_inches="tight")    

[Figure ./alpha_plot_extra.png: observed vs modelled batting averages for the two new models]

Both of the interventions seem to have resulted in improvements: there is no longer an obvious bias in the orange bands as a function of number of at-bats, and the generalised Pareto model no longer thinks that some players are capable of batting averages close to 1.

I still don’t think the orange model handles the low-data players correctly, though: it should be more uncertain about them.
