Author: Christian Lorentzen

Update of Swiss Mortality
In 2021, I wrote a blog post about Swiss mortality and it turned out to be among the most read posts I have written so far. Four years later, I think it’s time for an update with the following improvements:
- 4 years more observations, in particular after Covid-19
- chart with weekly CDR for seasonality effects
- polars
- interactive altair/vega charts here in the wordpress post 🚀
We use the exact same data sources, all publicly available:
- Short-term Mortality Fluctuations from the Human Mortality Database
- deaths per month, BFS-Nummer px-x-0102020206_111, from the Federal Statistical Office of Switzerland
- population BFS-Nummer px-x-0102030000_101, from the Federal Statistical Office of Switzerland
As 4 years ago, I caution against any misinterpretation: I show only crude death rates (CDR) which do not take into account any demographic shift like changing distributions of age.

The first figure shows the CDR per year for several countries, Switzerland (CHE) among them. We fetch the data from the internet, pick some countries of interest, filter on combined gender only (pl.col("Sex") == pl.lit("b") with “b” for both), aggregate and plot. Thanks to this blog post , I was able to integrate the altair/vega-light charts created in Python directly into this wordpress text. The difference is that I exported the altair charts as html and directly copy&pasted it into this text as html block because the html also contains the data to be plotted (as opposed to the default json output).
```
from datetime import datetime
import polars as pl
import altair as alt

# https://altair-viz.github.io/user_guide/large_datasets.html
alt.data_transformers.enable("vegafusion")

df_original = pl.read_csv(
    "https://www.mortality.org/File/GetDocument/Public/STMF/Outputs/stmf.csv",
    skip_rows=2,
    # Help polars a bit:
    schema_overrides={
        "D65_74": pl.Float64,
        "D75_84": pl.Float64,
        "D85p": pl.Float64,
        "DTotal": pl.Float64,
    },
)

df_mortality = df_original.filter(
    # Select country of interest and only "both" sexes.
    # Note: Germany "DEUTNP" and "USA" have short time series.
    pl.col("CountryCode").is_in(["CAN", "CHE", "FRATNP", "GBRTENW", "SWE"]),
    pl.col("Sex") == pl.lit("b"),
).with_columns(
    # Change to ISO-3166-1 ALPHA-3 codes
    CountryCode=pl.col("CountryCode").replace(
        {"FRATNP": "FRA", "GBRTENW": "England & Wales"},
    ),
    # Create population pro rata temporis (exposure) to ease aggregation
    Population=pl.col("DTotal") / pl.col("RTotal"),
).with_columns(
    # We think that the data uses ISO 8601 week dates and we set the weekday
    # to 1, i.e., Monday.
    Date=(
        pl.col("Year").cast(pl.String)
        + "-W" + pl.col("Week").cast(pl.String).str.zfill(2)
        + "-1"
    ).str.to_date(format="%G-W%V-%u")
)

chart = (
    alt.Chart(
        df_mortality.filter(pl.col("Year") <= 2024)
        # The Covid-19 peaks in 2020 are better seen on weekly resolution.
        .group_by("Year", "CountryCode")
        .agg(pl.col("Population").sum(), pl.col("DTotal").sum())
        .with_columns(
            CDR=pl.col("DTotal") / pl.col("Population"),
        )
    )
    .mark_line(tooltip=True)
    .encode(
        x="Year:T",
        y=alt.Y("CDR:Q", scale=alt.Scale(zero=False)),
        color="CountryCode:N",
    )
    .properties(
        title="Crude Death Rate per Year",
        width=400,  # default 300
    )
    .interactive()
)
# chart.save("crude_death_rate.html")
chart
```
Crude death rate (CDR) for Canada (CAN), Switzerland (CHE), England & Wales, France (FRA) and Sweden (SWE). Data as of 05.07.2025.

Note that the y-axis does not start at zero. Nevertheless, we see values between 0.007 and 0.0105. The big spike that we observed in the beginning of 2021 in the old post is now flattened. In 2021 all those countries showed a CDR of over 0.01, now most are below 0.09 in 2020. This shows that the data as of February 2021 was incomplete as I mentioned. Now we have the complete picture and it looks better—fortunately!

This time, I also add a chart with weekly CDRs to demonstrate the seasonality effects.
```
chart = (
    alt.Chart(
        df_mortality.filter(
            pl.col("CountryCode") <= pl.lit("CHE"),
            # Last 12 years
            pl.col("Year") > pl.col("Year").max() - 12,
        ).with_columns(
            CDR=pl.col("DTotal") / pl.col("Population"),
        )
    )
    .mark_line(tooltip=True)
    .encode(
        x="Date:T",
        y=alt.Y("CDR:Q", scale=alt.Scale(zero=True)),
    )
    .properties(
        title="Crude Death Rate per Week for Switzerland",
        width=400,  # default 300
    )
    .interactive()
)
# chart.save("crude_death_rate_per_week.html")
chart
```
Weekly crude death rate (CDR) of Switzerland

This shows a very regular seasonal pattern with a peak in every winter.

As last time, we also collect data the deaths and population of Switzerland for more than the past 100 years. Thanks to the good Swiss government institutes that make that possible. Again, the CDR of both data sources agree within less than 1% relative error.

Have a look at the notebook linked below to see the code for this chart.

Crude death rate (CDR) for Switzerland from 1901 to 2023

Note again that the left y-axis does not start at zero, but the right y-axis does. One can see several interesting facts:
- The Swiss population is and always was growing for the last 120 years—with the only exception around 1976.
- The Spanish flu between 1918 and 1920 caused by far the largest peak in mortality in the last 120 years.
- Covid-19 caused a significant increase of mortality in 2020-2022 that seems now gone (should have added the year 2024 in the last chart, but have a look at the first one).
- The second world war is not visible in the mortality of Switzerland.
- Overall, the mortality is decreasing, but this decrease seems to have flattened in the last decade.
The Python notebook can be found at https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2025-07-06%20swiss_mortality.ipynb.
2025-07-06

Model Diagnostics: Statistics vs Machine Learning

In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.

As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications 2nd ed. (2021). This dataset contains monthy rents in EUR (rent) for about 3000 apartments in Munich, Germany, from 1999. The apartments have several features such as living area (area) in squared meters, year of construction (yearc), quality of location (location, 0: average, 1: good, 2: top), quality of bath rooms (bath, 0:standard, 1: premium), quality of kitchen (kitchen, 0: standard, 1: premium), indicator for central heating (cheating).

The target variable is $Y=\text{rent}$ and the goal of our model is to predict the mean rent, $E[Y]$ (we omit the conditioning on X for brevity).

Disclaimer: Before presenting the use cases, let me clearly state that I am not in the apartment rent business and everything here is merely for the purpose of demonstrating statistical good practice.

Inference

The first use case is about inference of the effect of the features. Imagine the point of view of an investor who wants to know whether the installation of a central heating is worth it (financially). To lay the ground on which to base a decision, a statistician must have answers to:

What is the effect of the variable cheating on the rent.
Is this effect statistically significant?

Prediction

The second use case is about prediction. This time, we take the point of view of someone looking out for a new apartment to rent. In order to know whether the proposed rent by the landlord is about right or improper (too high), a reference value would be very convenient. One can either ask the neighbors or ask a model to predict the rent of the apartment in question.

Model Fit

Before answering the above questions and doing some key diagnostics, we must load the data and fit a model. We choose a simple linear model and directly model rent.

Notes:

For rent indices as well as house prices, one often log-transforms the target variable before modelling or one uses a log-link and an appropriate loss function (e.g. Gamma deviance).
Our Python version uses GeneralizedLinearRegressor from the package glum. We could as well have chosen other implementations like statsmodels.regression.linear_model.OLS. This way, we have to implement the residual diagnostics ourselves which makes it clear what is plotted.

For brevity, we skip imports and data loading. Our model is then fit by:

Python

lm = glum.GeneralizedLinearRegressor(
    alpha=0,
    drop_first=True,  # this is very important if alpha=0
    formula="bs(area, degree=3, df=4) + yearc"
      	" + C(location) + C(bath) + C(kitchen) + C(cheating)"
)
lm.fit(X_train, y_train)

model = lm(
  formula = rent ~ bs(area, degree = 3, df = 4) + yearc + location + bath + kitchen + cheating,
  data = df_train
)

Diagnostics for Inference

The coefficient table will already tell us the effect of the cheating variable. For more involved models like gradient boosted trees or neural nets, one can use partial dependence and shap values to assess the effect of features.

Python

lm.coef_table(X_train, y_train)

summary(model)
confint(model)

Variable	coef	se	p_value	ci_lower	ci_upper
intercept	-3682.5	327.0	0.0	-4323	-3041
bs(area, ..)[1]	88.5	31.3	4.6e-03	27	150
bs(area,..)[2]	316.8	24.5	0.0	269	365
bs(area, ..)[3]	547.7	62.8	0.0	425	671
bs(area, ..)[4]	733.7	91.7	1.3e-15	554	913
yearc	1.9	0.2	0.0	1.6	2.3
C(location)[2]	48.2	5.9	4.4e-16	37	60
C(location)[3]	137.9	27.7	6.6e-07	84	192
C(bath)[1]	50.0	16.5	2.4e-03	18	82
C(kitchen)[1]	98.2	18.5	1.1e-07	62	134
C(cheating)[1]	107.8	10.6	0.0	87.0	128.6

We see that ceteris paribus, meaning all else equal, a central heating increases the monthly rent by about 108 EUR. Not the size of the effect of 108 EUR, but the fact that there is an effect of central heating on the rent seems statistically significant:
This is indicated by the very low probability, i.e. p-value, for the null-hypothesis of cheating having a coefficient of zero.
We also see that the confidence interval with the default confidence level of 95%: [ci_lower, ci_upper] = [87, 129].
This shows the uncertainty of the estimated effect.

For a building with 10 apartments and with an investment horizon of about 10 years, the estimated effect gives roughly a budget of 13000 EUR (range is roughly 10500 to 15500 with 95% confidence).

A good statistician should ask several further questions:

Is the dataset at hand a good representation of the population?
Are there confounders or interaction effects, in particular between cheating and other features?
Are the assumptions for the low p-value and the confidence interval of cheating valid?

Here, we will only address the last question, and even that one only partially. Which assumptions were made? The error term, $\epsilon = Y - E[Y]$ , should be homoscedastic and Normal distributed. As the error is not observable (because the true model for $E[Y]$ is unknown), one replaces $E[Y]$ by the model prediction $\hat{E}[Y]$ , this gives the residuals, $\hat{\epsilon} = Y - \hat{E}[Y] = y - \text{fitted values}$ , instead. For homoscedasticity, the residuals should look like white (random) noise. Normality, on the other hand, becomes less of a concern with larger data thanks to the central limit theorem. With about 3000 data points, we are far away from small data, but it might still be a good idea to check for normality.

The diagnostic tools to check that are residual and quantile-quatile (QQ) plots.

Python

# See notebook for a definition of residual_plot.
import seaborn as sns
fig, axes = plt.subplots(ncols=2, figsize=(4.8 * 2.1, 6.4))
ax = residual_plot(model=lm, X=X_train, y=y_train, ax=axes[0])
sns.kdeplot(
    x=lm.predict(X_train),
    y=residuals(lm, X_train, y_train, kind="studentized"),
    thresh=.02,
    fill=True,
    ax=axes[1],
).set(
    xlabel="fitted",
    ylabel="studentized residuals",
    title="Contour Plot of Residuals",
)

autoplot(model, which = c(1, 2))  # from library ggfortify
# density plot of residuals
ggplot(model, aes(x = .fitted, y = .resid)) + geom_point() +
  geom_density_2d() + geom_density_2d_filled(alpha = 0.5)

The more data points one has the less informative is a scatter plot. Therefore, we put a contour plot on the right.

Visual insights:

There seems to be a larger variability for larger fitted values. This is a hint that the homoscedasticity might be violated.
The residuals seem to be centered around 0. This is a hint that the model is well calibrated (adequate).

Python

# See notebook for a definition of qq_plot.
qq_plot(lm, X_train, y_train)

autoplot(model, which = 2)

The QQ-plot shows the quantiles of the theoretical assumed distribution of the residuals on the x-axis and the ordered values of the residuals on the y-axis. In the Python version, we decided to use the studentized residuals because normality of the error implies a student (t) distribution for these residuals.

Concluding remarks:

We might do similar plots on the test sample, but we don’t necessarily need a test sample to answer the inference questions.
It is good practice to plot the residuals vs each of the features as well.

Diagnostics for Prediction

If we are only interested in predictions of the mean rent, $\hat{E}[Y]$ , we don’t care much about the probability distribution of $Y$ . We just want to know if the predictions are close enough to the real mean of the rent $E[Y]$ . In a similar argument as for the error term and residuals, we have to accept that $E[Y]$ is not observable (it is the quantity that we want to predict). So we have to fall back to the observations of $Y$ in order to judge if our model is well calibrated, i.e., close the the ideal $E[Y]$ .

Very importantly, here we make use of the test sample in all of our diagnostics because we fear the in-sample bias.

We start simple by a look at the unconditional calibration, that is the average (negative) residual $\frac{1}{n}\sum(\hat{E}[Y_i]-Y_i)$ .

Python

compute_bias(
    y_obs=np.concatenate([y_train, y_test]),
    y_pred=lm.predict(pd.concat([X_train, X_test])),
    feature=np.array(["train"] * X_train.shape[0] + ["test"] * X_test.shape[0]),
)

print(paste("Train set mean residual:", mean(resid(model))))
print(paste("Test set mean residual: ", mean(df_test$rent - predict(model, df_test))))

set	mean bias	count	stderr	p-value
train	-3.2e-12	2465	2.8	1.0
test	2.1	617	5.8	0.72

It is no surprise that bias_mean in the train set is almost zero.
This is the balance property of (generalized) linear models (with intercept term). On the test set, however, we detect a small bias of about 2 EUR per apartment on average.

Next, we have a look a reliability diagrams which contain much more information about calibration and bias of a model than the unconditional calibration above. In fact, it assesses auto-calibration, i.e. how well the model uses its own information.
An ideal model would lie on the dotted diagonal line.

Python

fig, axes = plt.subplots(ncols=2, figsize=(4.8 * 2.1, 6.4))
plot_reliability_diagram(y_obs=y_train, y_pred=lm.predict(X_train), n_bootstrap=100, ax=axes[0])
axes[0].set_title(axes[0].get_title() + f" train set (n={X_train.shape[0]})")
plot_reliability_diagram(y_obs=y_test, y_pred=lm.predict(X_test), n_bootstrap=100, ax=axes[1])
axes[1].set_title(axes[1].get_title() + f" test set (n={X_test.shape[0]})")

iso_train = isoreg(x = model$fitted.values, y = df_train$rent)
iso_test = isoreg(x = predict(model, df_test), y = df_test$rent)
bind_rows(
  tibble(set = "train", x = iso_train$x[iso_train$ord], y = iso_train$yf),
  tibble(set = "test", x = iso_test$x[iso_test$ord], y = iso_test$yf),
) |>
  ggplot(aes(x=x, y=y, color=set)) + geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype="dashed") +
  ggtitle("Reliability Diagram")

Visual insights:

Graphs on train and test set look very similar.
The larger uncertainty intervals on the test set stem from the fact that is has a smaller sample size.
The model seems to lie around the diagonal indicating good auto-calibration for the largest part of the range.
Very high predicted values seem to be systematically too low, i.e. the graph is above the diagonal.

Finally, we assess conditional calibration, i.e. the calibration with respect to the features. Therefore, we plot one of our favorite graphs for each feature. It consists of:

average observed value of $Y$ for each (binned) value of the feature
average predicted value
partial dependence
histogram of the feature (grey, right y-axis)

Python

fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 5*4), sharey=True)
for i, col in enumerate(["area", "yearc", "bath", "kitchen", "cheating"]):
    plot_marginal(
        y_obs=y_train,
        y_pred=lm.predict(X_train),
        X=X_train,
        feature_name=col,
        predict_function=lm.predict,
        ax=axes[i][0],
    )
    plot_marginal(
        y_obs=y_test,
        y_pred=lm.predict(X_test),
        X=X_test,
        feature_name=col,
        predict_function=lm.predict,
        ax=axes[i][1],
    )
    axes[i][0].set_title("Train")
    axes[i][1].set_title("Test")
    if i != 0:
        axes[i][0].get_legend().remove()
    axes[i][1].get_legend().remove()
fig.tight_layout()

xvars = c("area", "yearc", "bath", "kitchen", "cheating")
m_train = feature_effects(model, v = xvars, data = df_train, y = df_train$rent)
m_test = feature_effects(model, v = xvars, data = df_test, y = df_test$rent)

c(m_train, m_test) |> 
  plot(
    share_y = "rows",
    ncol = 2,
    byrow = FALSE,
    stats = c("y_mean", "pred_mean", "pd"),
    subplot_titles = FALSE,
    # plotly = TRUE,
    title = "Left: Train - Right: Test",
  )

Visual insights:

On the train set, the categorical features seem to have perfect calibration as average observed equals average predicted. This is again a result of the balance property. On the test set, we see a deviation, especially for the categorical level with smaller sample size. This is a good demonstration why plotting on both train and test set is a good idea.
The numerical features area and year of construction seem fine, but a closer look can’t hurt.

We next perform a bias plot, which is plotting the average difference of predicted minus observed per feature value. The values should be around zero, so we can zoom in on the y-axis.
This is very similar to the residual plot, but the information is better condensed for its purpose.

Python

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 2*4), sharey=True)
axes[0,0].set_ylim(-150, 150)
for i, col in enumerate(["area", "yearc"]):
    plot_bias(
        y_obs=y_train,
        y_pred=lm.predict(X_train),
        feature=X_train[col],
        ax=axes[i][0],
    )
    plot_bias(
        y_obs=y_test,
        y_pred=lm.predict(X_test),
        feature=X_test[col],
        ax=axes[i][1],
    )
    axes[i][0].set_title("Train")
    axes[i][1].set_title("Test")
fig.tight_layout()

c(m_train[c("area", "yearc")], m_test[c("area", "yearc")]) |> 
  plot(
    ylim = c(-150, 150),
    ncol = 2,
    byrow = FALSE,
    stats = "resid_mean",
    subplot_titles = FALSE,
    title = "Left: Train - Right: Test",
    # plotly = TRUE,
    interval = "ci"
  )

Visual insights:

For large values of area and yearc in the 1940s and 1950s, there are only few observations available. Still, the model might be improved for those regions.
The bias of yearc shows a parabolic curve. The simple linear effect in our model seems too simplistic. A refined model could use splines instead, as for area.

Concluding remarks:

The predictions for area larger than around 120 square meters and for year of construction around the 2nd world war are less reliable.
For all the rest, the bias is smaller than 50 EUR on average.
This is therefore a rough estimation of the prediction uncertainty.
It should be enough to prevent improperly high (or low) rents (on average).

The full Python and R code is available under:

2025-05-01

Dictionary for Data Scientists and Statisticians

During my journey through machine learning (ML) and statistics, I was faced some many times with surprisingly different usage of terms. To improve the understanding of data scientists and statisticians, I present a dictionary and hope the humour does not get unnoticed.

data scientist	statistician	comment
sample	observation
(training) set	sample
feature	covariate, predictor	many more terms
label	categorical response
inference	prediction, forecast
statistics	inference
training	fitting
training error	in-sample error
test/validation set	hold-out sample
regression	regression
classification	regression (on categorical response) + decision making	thus the name logistic / multinomial regression!
supervised machine learning	regression
AI	AI for funding, else regression	see EU AI Act article 3
confidence score	predicted probability	confidence scores might not represent probabilities
(binary/multiclass) cross-entropy	(binomial/multinomial) log likelihood	a.k.a. log loss
unbalanced data problem	🤷‍♂️what problem?	if any, a small data problem
SMOTE	devil’s work

Statistics is about the honest interpretation of data, which is much less appealing than less honest interpretation.

by Prof. Simon Wood, a.k.a. Mr GAM/mgcv

2025-04-07

A Tweedie Trilogy — Part III: From Wrights Generalized Bessel Function to Tweedie’s Compound Poisson Distribution
TLDR: The scipy 1.7.0 release introduced Wright’s generalized Bessel function in the Python ecosystem. It is an important ingredient for the density and log-likelihood of Tweedie probabilty distributions. In this last part of the trilogy I’d like to point out why it was important to have this function and share the endeavor of implementing this inconspicuous but highly intractable special function. The fun part is exploiting a free parameter in an integral representation, which can be optimized by curve fitting to the minimal arc length.

This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

See part i and part ii.
Table of Contents
Tweedie Distributions
Compound Poisson Gamma
Wright’s Generalized Bessel Function
The Integral Representation
Arc Length Minimization
Conclusion
Tweedie Distributions

As pointed out in part I and part II, the family of Tweedie distributions is a very special one with outstanding properties. They are central for estimating expectations with GLMs. The probability distributions have mainly positive (non-negative) support and are skewed, e.g. Poisson, Gamma, Inverse Gaussian and compound Poisson-Gamma.

As members of the exponential dispersion family, a slight extension of the exponential family, the probability density can be written as
```
\begin{align*}
f(y; \theta, \phi) &= c(y, \phi) \exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right)
\\
\kappa(\theta) &= \kappa_p(\theta) = \frac{1}{2-p}((1-p)\theta)^{\frac{2-p}{1-p}}
\end{align*}
```
It is often more instructive to parametrise the distribution with $p$ , $\mu$ and $\phi$ , using
```
\begin{align*}
\theta &= \begin{cases}
\frac{\mu^{1-p}}{1-p}\,,\quad p\neq 1\\
\log(\mu)\,,\quad p=1
\end{cases}
\\
\kappa(\theta) &= \begin{cases}
\frac{\mu^{2-p}}{2-p}\,,\quad p\neq 2\\
\log(\mu)\,,\quad p=2
\end{cases}
\end{align*}
```
and write
```
\begin{align*}
Y &\sim \mathrm{Tw}_p(\mu, \phi)
\end{align*}
```
Probability density of several Tweedie distributions.

Compound Poisson Gamma

A very special domain for the power parameter is between Poisson and Gamma: $1<p<2$ . This range results in the Compound Poisson distribution which is suitable if you have a random count process and if each count itself has a random amount. A well know example is insurance claims. Typically, there is a random number of insurance claims, and each and every claim has a random amount of claim costs.
```
\begin{align*}
N &\sim \mathrm{Poisson}(\lambda)\\
X_i &\sim \mathrm{Gamma}(a, b)\\
Y &= \sum_{i=0}^N X_i \sim \mathrm{CompPois}(\lambda, a, b)
\end{align*}
```
For Poisson count we have $\operatorname{E}[N]=\lambda$ and $\operatorname{Var}[N]=\lambda=\operatorname{E}[N]$ , for the Gamma amount $\operatorname{E}[X]=\frac{a}{b}$ and $\operatorname{Var}[X]=\frac{a}{b^2}=\frac{1}{a}\operatorname{E}[X]^2$ . For the compound Poisson-Gamma variable, we obtain
```
\begin{align*}
\operatorname{E}[Y] &= \operatorname{E}[N] \operatorname{E}[X] = \lambda\frac{a}{b}=\mu\\
\operatorname{Var}[Y] &=  \operatorname{Var}[N] \operatorname{E}[X]^2 +  \operatorname{E}[N] \operatorname{Var}[X] =  \phi \mu^p\\
p &= \frac{a + 2}{a+1} \in (1, 2)\\
\phi &= \frac{(\lambda a)^{1-p}}{(p-1)b^{2-p}}
\end{align*}
```
What’s so special here is that there is a point mass at zero, i.e., $P(Y=0)=\exp(-\frac{\mu^{2-p}}{\phi(2-p)}) > 0$ . Hence, it is a suitable distribution for non-negative quantities with some exact zeros.

Probability density for compound Poisson Gamma, point masses at zero are marked as points.
Code

import matplotlib.pyplot as plt import numpy as np from scipy.special import wright_bessel def cpg_pmf(mu, phi, p): """Compound Poisson Gamma point mass at zero.""" return np.exp(-np.power(mu, 2 - p) / (phi * (2 - p))) def cpg_pdf(x, mu, phi, p): """Compound Poisson Gamma pdf.""" if not (1 < p < 2): raise ValueError("1 < p < 2 required") theta = np.power(mu, 1 - p) / (1 - p) kappa = np.power(mu, 2 - p) / (2 - p) alpha = (2 - p) / (1 - p) t = ((p - 1) * phi / x)**alpha t /= (2 - p) * phi a = 1 / x * wright_bessel(-alpha, 0, t) return a * np.exp((x * theta - kappa) / phi) fig, axes = plt.subplots(ncols=2, figsize=[6.4 * 1.25, 4.8]) x = np.linspace(1e-9, 10, 200) mu = 2 for p in [1.2, 1.5, 1.8]: for i, phi in enumerate([0.5, 2]): axes[i].plot(x, cpg_pdf(x=x, mu=mu, phi=phi, p=p), label=f"{p=}") axes[i].scatter(0, cpg_pmf(mu=mu, phi=phi, p=p)) axes[i].set_ylim(0, 0.5) axes[i].set_title(f"{phi=}") if i > 0: axes[i].legend() else: axes[i].set_ylabel("pdf(x)") axes[i].set_xlabel("x") fig.suptitle("Tweedie Distributions mu=2")
The rest of this post is about how to compute the density for this parameter range. The easy part is $\exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right)$ which can be directly implemented. The real obstacle is the term $c(y, \phi)$ which is given by
```
\begin{align*}
c(y, \phi) &= \frac{\Phi(-\alpha, 0, t)}{y}
\\
\alpha &= \frac{2 - p}{1 - p}
\\
t &= \frac{\left(\frac{(p - 1)\phi}{y}\right)^{\alpha}}{(2-p)\phi}
\end{align*}
```
This depends on Wright’s (generalized Bessel) function $\Phi(a, b, z)$ as introduced in a 1933 paper by E. Wright.

Wright’s Generalized Bessel Function

According to DLMF 10.46, the function is defined as
```
\begin{equation*}
\Phi(a, b, z) = \sum_{k=0}^{\infty} \frac{z^k}{k!\Gamma(ak+b)}, \quad a > -1, b \in R, z \in C
\end{equation*}
```
which converges everywhere because it is an entire function. We will focus on the positive real axis $z=x\geq 0$ and the range $a\geq 0, b\geq 0$ (note that $a=-\alpha \in (0,\infty)$ for $1<p<2$ ). For the compound Poisson-Gamma, we even have $b=0$ .

Implementation of such a function as done in scipy.stats.wright_bessel, even for the restricted parameter range, poses tremendous challenges. The first one is that it has three parameters which is quite a lot. Then the series representation above, for instance, can always be used, but depending on the parameters, it will require a huge amount of terms, particularly for large $x$ . As each term involves the Gamma function, this becomes expensive very fast. One ends up using different representations and strategies for different parameter regions:
- Small $x$ : Taylor series according to definition
- Small $a$ : Taylor series in $a=0$
- Large $x$ : Asymptotic series due to Wright (1935)
- Large $a$ : Taylor series according to definition for a few terms around the approximate maximum term $k_{max}$ due to Dunn & Smyth (2005)
- General: Integral represantation due to Luchko (2008)
Dunn & Smyth investigated several evaluation strategies for the simpler Tweedie density which amounts to Wright’s functions with $b=0$ , see Dunn & Smyth (2005). Luchko (2008) lists most of the above strategies for the full Wright’s function.

Note that Dunn & Smyth (2008) provide another strategy to evaluate the Tweedie distribution function by means of the inverse Fourier transform. This does not involve Wright’s function, but also encounters complicated numerical integration of oscillatory functions.

The Integral Representation

This brings us deep into complex analysis: We start with Hankel’s contour integral representation of the reciprocal Gamma function.
```
\begin{equation*}
\frac{1}{\Gamma(z)} = \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-z} e^\zeta \; d\zeta
\end{equation*}
```
with the Hankel path $Ha^-$ from negative infinity (A) just below the real axis, counter-clockwise with radius $\epsilon>0$ around the origin and just above the real axis back to minus infinity (D).

Hankel contour Ha^– in the complex plane.

In principle, one is free to choose any such path with the same start (A) and end point (D) as long as one does not cross the negative real axis. One usually lets the AB and CD be infinitesimal close to the negative real line. Very importantly, the radius $\epsilon>0$ is a free parameter! That is real magic🪄

By interchanging sum and integral and using the series of the exponential, Wright’s function becomes
```
\begin{align*}
\Phi(a, b, z) &= \sum_{k=0}^{\infty} \frac{z^k}{k!} \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-(ak+b)} e^\zeta \; d\zeta
\\
&= \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-b} e^{\zeta + z\zeta^{-a}} \; d\zeta
\end{align*}
```
Now, one needs to do the tedious work and split the integral into the 3 path sections AB, BC, CD. Putting AB and CD together gives an integral over K, the circle BC gives an integral over P:
```
\begin{align*}
\Phi(a, b, x) &= \frac{1}{\pi} \int_{\epsilon}^\infty K(a, b, x, r) \; dr
\\
 &+ \frac{\epsilon^{1-b}}{\pi} \int_0^\pi P(\epsilon, a, b, x, \varphi) \; d\varphi
\\
K(a, b, x, r) &= r^{-b}\exp(-r + x  r^{-a} \cos(\pi a)) 
\\
&\quad \sin(x \cdot r^{-a} \sin(\pi a) + \pi b)
\\
P(\epsilon, a, b, x, \varphi) &= \exp(\epsilon \cos(\varphi) + x  \epsilon^{-a}\cos(a \varphi))
\\
&\quad \cos(\epsilon \sin(\varphi) - x \cdot \epsilon^{-a} \sin(a \varphi) + (1-b) \varphi)
\end{align*}
```
What remains is to carry out the numerical integration, also known as quadrature. While this is an interesting topic in its own, let’s move to the magic part.

Arc Length Minimization

If you have come so far and say, wow, puh, uff, crazy, 🤯😱 Just keep on a little bit because here comes the real fun part🤞

It turns out that most of the time, the integral over P is the most difficult. The worst behaviour an integrand can have is widely oscillatory. Here is one of my favorite examples:

Integrands for a=5, b=1, x=100 and two choices of epsilon.

With the naive choice of $\epsilon=1$ , both integrands (blue) are—well—crazy. There is basically no chance the most sophisticated quadrature rule will work. And then look at the other choice of $\epsilon\approx 4$ . Both curves seem well behaved (for P, we would need a closer look).

So the idea is to find a good choice of $\epsilon$ to make P well behaved. Well behaved here means most boring, if possible a straight line. What makes a straight line unique? In flat space, it is the shortest path between two points. Therefore, well behaved integrands have minimal arc length. That is what we want to minimize.

The arc length $S$ from $x=a$ to $x=b$ of a 1-dimensional function $f$ is given by
```
\begin{equation*}
S = \int_a^b \sqrt{1 + f^\prime(x)^2} \; dx
\end{equation*}
```
Instead of $f=P$ , we only take the oscillatory part of P and approximate the arc length as $f(\varphi)=f(\varphi) = \epsilon \sin(\varphi) - x \epsilon^{-\rho} \sin(\rho \varphi) + (1-\beta) \varphi$ . For a single parameter point $a, b, z$ this looks like

Arc length and integrand P for different epsilon, given a=0.1, b=5, x=100.

Note the logarithmic y-scale for the right plot of P. The optimal $\epsilon=10$ is plotted in red and behaves clearly better than smaller values of $\epsilon$ .

What remains to be done for an actual implementation is
- Calculate minimal $\epsilon$ for a large grid of values $a, b, x$ .
- Choose a function with some parameters.
- Curve fitting (so again optimisation): Fit this function to the minimal $\epsilon$ of the grid via minimising least squares.
- Implement some quadrature rules and use this choice of $\epsilon$ in the hope that it intra- and extrapolates well.
This strategy turns out to work well in practice and is implemented in scipy. As the parameter space of 3 variables is huge, the integral representation breaks down in certain areas, e.g. huge values of $\epsilon$ where the integrands just overflow numerically (in 64-bit floating point precision). But we have our other evaluation strategies for that.

Conclusion

An extensive notebook for Wright’s function, with all implementation strategies can be found here.

After an adventurous journey, we arrived at one implementation strategy of Wright’s generalised Bessel function, namely the integral representation. The path went deep into complex analysis and contour integration, then further to the arc length of a function and finally curve fitting via optimisation. I am really astonished how connected all those different areas of mathematics can be.

Wright’s function is the missing piece to compute full likelihoods and probability functions of the Tweedie distribution family and is now available in the Python ecosystem via scipy.

We are at the very end of this Tweedie trilogy. I hope it has been entertaining and it has become clear why Tweedie deserves to be celebrated.

Further references:
- Delong, Ł., Lindholm, M. & Wüthrich, M.V. “Making Tweedie’s compound Poisson model more accessible”. Eur. Actuar. J. 11, 185–226 (2021). https://doi.org/10.1007/s13385-021-00264-3
- Wright E.M. 1933. “On the coefficients of power series having essential singularities”. J. London Math. Soc. 8: 71–79. https://doi.org/10.1112/jlms/s1-8.1.71
- Wright, E. M. 1935, “The asymptotic expansion of the generalized Bessel”, function. Proc. London Math. Soc. (2) 38, pp. 257–270. https://doi.org/10.1112/plms/s2-38.1.257
- Dunn, P.K., Smyth, G.K. “Series evaluation of Tweedie exponential dispersion model densities”. Stat Comput 15, 267–280 (2005). https://doi.org/10.1007/s11222-005-4070-y
- Dunn, P.K., Smyth, G.K. “Evaluation of Tweedie exponential dispersion model densities by Fourier inversion”. Stat Comput 18, 73–86 (2008). https://doi.org/10.1007/s11222-007-9039-6
- Luchko, Y. F. (2008), “Algorithms for Evaluation of the Wright Function for the Real Arguments’ Values”, Fractional Calculus and Applied Analysis 11(1). https://eudml.org/doc/11309
  Note a slight misprint in the integrand P.
2024-06-17
A Tweedie Trilogy — Part II: Offsets
TLDR: This second part of the trilogy will have a deeper look at offsets and sample weights of a GLM. Their non-equivalence stems from the mean-variance relationship. This time, we not only have a Poisson frequency but also a Gamma severity model.

This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

See part I.
Table of Contents
From Mean-Variance Relation to Score Equations
Offsets and Sample Weights
Poisson GLM
Tweedie GLM
Example
Outlook
From Mean-Variance Relation to Score Equations

In the part I, we have already introduced the mean-variance relation of a Tweedie random variable $Y\sim Tw_p(\mu, \phi)$ with Tweedie power $p$ , mean $\mu$ and dispersion parameter $\phi$ :
```
\begin{align*}
\operatorname{E}[Y] &= \mu
\\
\operatorname{Var}[Y] &= \phi \mu^p = \phi v(\mu)
\end{align*}
```
with variance function $v(\mu)$ .

This variance function directly impacts the estimation of GLMs. Assume the task is to estimate the expectation of a random variable $Y_i\sim Tw_p(\mu_i, \phi/w_i)$ , given observations of the target $y_i$ and of explanatories variables, aka features, $x_i\in R^k$ . A GLM then assumes a link function $g(\mu_i) = \sum_{j=1}^k x_{ij}\beta_j$ with coefficients $\beta$ to be estimated via an optimization procedure, of which the first order condition, also called score equation, reads
```
\begin{equation*}
\sum_i w_i \frac{y_i - \mu_i}{v(\mu_i)g'(\mu_i)} x_{ij}= 0 \quad \forall j =1, \ldots, k
\end{equation*}
```
This shows that the higher the Tweedie power $p$ , entering via $v(\mu)$ only, the less weight is given to deviations of large values. In other words, higher Tweedie powers result in GLMs that are less and less sensitive to what happens at large (expected) values.

This is also reflected in the deviance loss function. They can be derived from the negative log-likelihood and are given by
```
\begin{equation*}
d_p(y, \mu) =
	2 \cdot
	\begin{cases}
		\frac{\max(0, y^{2-p})}{(1-p)(2-p)}-\frac{y\mu^{1-p}}{1-p}+\frac{\mu^{2-p}}{2-p} & p \in \mathrm{R}\setminus (0,1] \cup \{2\} \\
		y\log\frac{y}{\mu} - y + \mu & p=0 \\
		\frac{y}{\mu} - \log\frac{y}{\mu} - 1 & p=2
	\end{cases}
\end{equation*}
```
These are the only strictly consistent scoring functions for the expectation (up to one multiplicative and one additive constant) that are homogeneous functions (of degree $2-p$ ), see, e.g., Fissler et al (2022). The Poisson deviance ( $p=1$ ), for example, has a degree of homogeneity of 1 and the same unit as the target variable. The Gamma deviance ( $p=2$ ), on the other side, is zero-homogeneous and is completely agnostic to the scale of its arguments. This is another way of stating the above: the higher the Tweedie power the less it cares about large values.

It is also connected to the fact that Tweedie distributions are the only distributions from the exponential dispersion family that are closed under scale transformations:
```
\begin{align*}
Y &\sim Tw_p(\mu, \phi) \\
cY &\sim Tw_p(c\mu, c^{2-p}\phi) \quad \forall c>0
\end{align*}
```
Offsets and Sample Weights

Poisson GLM

When estimating counts with a Poisson GLM, there is often an exposure measure like time under consideration or underlying number of things (insurance policies, trees in a forest, radioactive atoms). One then often finds two different, but equivalent formulations of a Poisson GLM with log-link.
- Sample weights: Model frequency $y=\frac{N}{w}$ and fit with sample weights $w$ to estimate $\operatorname{E}[y] = \mu_y = \exp(x \beta)$ .
- Offsets: Model counts $N$ , but account for the exposure $w$ via an offset as $\operatorname{E}[N]=\mu_N = \exp(x \beta + \log(w)) = w \mu_y$ .
Note that each way models a different target, so we had to use subscripts to distinguish the mean parameters $\mu$ .

In this special case of a Poisson GLM with (canonical) log-link, both models are equivalent and will result in the exact same parameters $\beta$ . You can plug it into the score equation to convince yourself.

Tweedie GLM

Very importantly, this simple equivalence of GLM fomulations with offsets and with sample weights does only hold for the Poisson GLM with log-link. It does not hold for any other Tweedie parameter or even other distributions from the exponential dispersion family.

One can show that a Tweedie GLM with log-link and offset (additive in link space) $\log(u)$ on target $y$ with weights $w$ is equivalent to the same Tweedie GLM but with target $\frac{y}{u}$ and weights $w u^{2-p}$ .

So one can construct an equivalence between unweighted with offsets and weighted without offsets by setting $u = \sqrt[2-p]{w}$ . But note that this does not work for a Gamma GLM which as $p=2$ .

Example

We continue with the same dataset and model as in part I and show this (non-) equivalence with the offsets.
```
from glum import GeneralizedLinearRegressor
import pandas as pd

# ... quite some code ... here we abbreviate.
# Model frequency with weights (but without offsets)
y_freq = df["ClaimNb"] / df["Exposure"]
w_freq = df["Exposure"]
X = df[x_vars]
glm_params = {
    "alpha": 0,
    "drop_first": True,
    "gradient_tol": 1e-8,
}
glm_freq = GeneralizedLinearRegressor(
    family="poisson", **glm_params
).fit(X, y_freq, sample_weight=w_freq)

# Model counts N = w * freq with offsets (but without weights)
N = w_freq * y_freq
glm_offset_freq = GeneralizedLinearRegressor(
    family="poisson", **glm_params
).fit(X, N, offset=np.log(w_freq))

print(
    f"intercept freq{'':<8}= {glm_freq.intercept_}\n"
    f"intercept freq offset = {glm_offset_freq.intercept_}"
)
# intercept freq        = -3.756437676421677
# intercept freq offset = -3.7564376764216725

np.max(np.abs(glm_freq.coef_ - glm_offset_freq.coef_)) < 1e-13
# True
```
As next, we model the severity $Y = \frac{loss}{N}$ with claim counts $N$ as weights. As is standard, we use a Gamma GLM with log-link (which is not canonical this time).
```
# Model severity with weights (but without offsets)
y_sev = (df["ClaimAmount"] / df["ClaimNb"])
w_sev = df["ClaimNb"].fillna(0)
X = df[x_vars]
# Filter out zero count (w_sev==0) rows
w_gt_0 = w_sev > 0
y_sev = y_sev[w_gt_0]
X_sev = X[w_gt_0]
w_sev = w_sev[w_gt_0]

glm_sev = GeneralizedLinearRegressor(
    family="gamma", **glm_params
).fit(X_sev, y_sev, sample_weight=w_sev)

# Note that the target is claim amount = w * sev.
claim_amount = w_sev * y_sev
glm_offset_sev = GeneralizedLinearRegressor(
    family="gamma", **glm_params
).fit(X_sev, claim_amount, offset=np.log(w_sev))

print(
    f"intercept sev{'':<8}= {glm_sev.intercept_}\n"
    f"intercept sev offset = {glm_offset_sev.intercept_}"
)
# intercept sev        = 7.287909799461992
# intercept sev offset = 7.236827150674156

np.max(np.abs(glm_sev.coef_ - glm_offset_sev.coef_))
# 0.2119162919285421
```
The deviations might seem small, but they are there and add up:
```
print(
    "Total predicted claim amounts with weights "
    f"{np.sum(w_sev * glm_sev.predict(X_sev)):_.2f}"
)
print(
    "Total predicted claim amounts offset       "
    f"{np.sum(glm_offset_sev.predict(X_sev, offset=np.log(w_sev))):_.2f}"
)
# Total predicted claim amounts with weights 49_309_687.30
# Total predicted claim amounts offset       48_769_342.47
```
Here, it becomes evident that the two models are quite different.

Outlook

The full notebook can be found here.

The final part III of the Tweedie trilogy will follow in one week and go into details of the probability density function.
2024-06-10

A Tweedie Trilogy — Part I: Frequency and Aggregration Invariance

TLDR: In this first part of the Tweedie Trilogy, we will take a look at what happens to a GLM if we aggregate the data by a group-by operation. A frequency model for insurance pricing will serve as an example.

This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

Table of Contents

Intro

Tweedie distributions and Generalised Linear Models (GLM) have an intertwined relationship. While GLMs are, in my view, one of the best reference models for estimating expectations, Tweedie distributions lie at the heart of expectation estimation. In fact, basically all applied GLMs in practice use Tweedie distributions with three notable exceptions: the binomial, the multinomial and the negative binomial distribution.

Mean-Variance Relation

“An index which distinguishes between some important exponential families” is the original publication title of Maurice Charles Kenneth Tweedie in 1984—but note that Shaul K. Bar-Lev and Peter Enis published around the same time; as their 1986 paper was received November 1983, the distribution could also be named Bar-Lev & Enis distribution.¹ This index is meanwhile called the Tweedie power parameter $p$ . Recall that distributions of the exponential dispersion family always fulfil a mean-variance relationship. Its even a way to define them. For the Tweedie distribution, denoted $Tw_p(\mu, \phi)$ , the relation reads

\begin{align*} \operatorname{E}[Y] &= \mu \\ \operatorname{Var}[Y] &= \phi \mu^p \end{align*}

with dispersion parameter $\phi$ . Some very common members are given in the following table.

$p$	distribution	domain $Y$	domain $\mu$
0	Normal / Gaussian	$\mathrm{R}$	$\mathrm{R}$
1	Poisson	$0, 1, 2, \ldots$	$\mathrm{R}_+$
$(1,2)$	Compound Poisson-Gamma	$\mathrm{R}_+ \cup \{0\}$	$\mathrm{R}_+$
2	Gamma	$\mathrm{R}_+$	$\mathrm{R}_+$
3	inverse Gaussian	$\mathrm{R}_+$	$\mathrm{R}_+$

Insurance Pricing Models

In non-life insurance pricing, most claims happen somewhat randomly, typically the occurrence as well as the size. Take the theft of your bike or a water damage of your basement due to flooding as an example. Pricing actuaries usually want to predict the expected loss $E[Y|X]$ given some features $X$ of a policy. The set of features could contain the purchasing price of your bike or the proximity of your house to a river.

Instead of directly modelling the expected loss per exposure $w$ , e.g. the time duration of the insurance contract, the most used approach is the famous frequency-severity split:

\begin{align*} \operatorname{E}\left[\frac{Y}{w}\right] = \underbrace{\operatorname{E}\left[\frac{N}{w}\right]}_{frequency} \cdot \underbrace{\operatorname{E}\left[\left. \frac{Y}{n}\right| N=n\right]}_{severity} \end{align*}

For simplicity, the conditioning on $X$ is suppressed, it would occur in every expectation. The first part $\operatorname{E}\left[\frac{N}{w}\right]$ is the (expected) frequency, i.e. the number of claims per exposure (time). The second term $\operatorname{E}\left[\left.\frac{Y}{N}\right| N\right]$ is the (expected) severity, i.e. the average claim size (per claim) given a fixed number of claims. Here, we focus on the frequency part.

Convolution and Aggregation Invariance

This property might first seem very theoretical, but it may be one of the most important properties for the estimation of expectations $E[Y|X]$ with GLMs. It is in fact a property valid for the whole exponential dispersion family: The weighted mean of i.i.d. random variables has (almost) the same distribution!

\begin{align*} Y_i &\overset{i.i.d}{\sim} \mathrm{Tw}_p(\mu, \phi/w_i) \,, \\ w_+ &= \sum_i w_i \quad\text{with } w_i >0 \,, \end{align*}

then

\begin{align*} Y &=\sum_i^n \frac{w_i Y_i}{w_+} \sim \mathrm{Tw}_p(\mu, \phi/w_+) \,. \end{align*}

It is obvious that the mean of $Y$ is again $\mu$ . But is is remarkable that it has the same distribution with the same power parameter, only the 2nd argument with the dispersion parameter differs. But the dispersion parameter cancels out in GLM estimations. In fact, we will show that two GLMs, one on aggregated data, give identical results. Another way of saying the same in statistical terms is that (weighted) averages are the sufficient statistic for the expectation within the exponential dispersion family.

This is quite an essential property for data aggregation. It means that one can aggregate rows with identical features and still do an analysis (of the conditional expectation) without loss of information.

The weighted average above can be written a bit more intuitive. For instance, a frequency $Y_i=\frac{N_i}{w_i}$ has weighted average $Y=\frac{\sum_i N_i}{\sum_i w_i}$ .

Poisson Distribution

When modelling counts, the Poisson distribution is by far the easiest distribution one can think of. It only has a single parameter, is a member of the Tweedie family, and fulfils the mean-variance relation

\begin{equation*} \operatorname{E}[N] = \mu = \operatorname{Var}[N] \,.\end{equation*}

In particular, $p=1$ . While the distribution is strictly speaking only for counts, i.e. $N$ takes on non-negative integer values, Poisson regression also works for any non-negative response variable like $N/w \in \mathrm{R}$ .

Frequency Example

For demonstration, we fit a Poisson GLM on the french motor third-party liability claims dataset, cf. the corresponding scikit-learn example and the case study 1 of the Swiss Association of Actuaries on the same dataset.

from glum import GeneralizedLinearRegressor
import pandas as pd

# ... quite some code ... here we abbreviate.
y_freq = df["ClaimNb"] / df["Exposure"]
w_freq = df["Exposure"]
X = df[x_vars]
glm_params = {
    "alpha": 0,
    "drop_first": True,
    "gradient_tol": 1e-8,
}
glm_freq = GeneralizedLinearRegressor(
    family="poisson", **glm_params
).fit(X, y_freq, sample_weight=w_freq)
print(
  f"Total predicted number of claims = "
  f"{(w_freq * glm_freq.predict(X)).sum():_.2f}"
)
# Total predicted number of claims = 26_444.00

# Now aggregated
df_agg = df.groupby(x_vars, observed=True).sum().reset_index()
print(
    f"Aggregation reduced number of rows from {df.shape[0]:_}"
    f"to {df_agg.shape[0]:_}."
)
# Aggregation reduced number of rows from 678_013 to 133_413.
y_agg_freq = df_agg["ClaimNb"] / df_agg["Exposure"]
w_agg_freq = df_agg["Exposure"]
X_agg = df_agg[x_vars]
glm_agg_freq = GeneralizedLinearRegressor(
    family="poisson", **glm_params
).fit(X_agg, y_agg_freq, sample_weight=w_agg_freq)
print(
    f"Total predicted number of claims = "
    f"{(w_agg_freq * glm_agg_freq.predict(X_agg)).sum():_.2f}"
)
# Total predicted number of claims = 26_444.00

In fact, both models have the same intercept term and same coefficients, they are really identical models (up to numerical precision):

print(
    f"intercept freq{'':<18}= {glm_freq.intercept_}\n"
    f"intercept freq aggregated model = {glm_agg_freq.intercept_}"
)
# intercept freq                  = -3.7564376764216747
# intercept freq aggregated model = -3.7564376764216747

np.max(np.abs(glm_freq.coef_ - glm_agg_freq.coef_)) < 1e-13
# True

Outlook

The full notebook can be found here.

In the next week, part II of this trilogy will follow. There, we will meet some more of its quite remarkable properties.

Further references:

Tweedie M.C.K. 1984. “An index which distinguishes between some important exponential families”. Statistics: Applications and New Directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference, Indian Statistical Institute, Cal- cutta, pp. 579–604.
Bar-Lev, S.K., Enis, P. Reproducibility in the one-parameter exponential family. Metrika 32, 391–394 (1985). https://doi.org/10.1007/BF01897827
Shaul K. Bar-Lev. Peter Enis. “Reproducibility and Natural Exponential Families with Power Variance Functions.” Ann. Statist. 14 (4) 1507 – 1522, December, 1986. https://doi.org/10.1214/aos/1176350173

A great thanks to Prof. Mario Wüthrich for pointing out the references of Bar-Lev and Enis. ↩︎

2024-06-03

Interactive Plotting Backend for Model-Diagnostics
With its newest release 1.1.0, the Python package model-diagnostics got the concept of a plotting backend. Before this release, all plots were constructed with matplotlib. This is still the default. But additionally, the user can now select plotly, if it is installed.

There are 2 ways to specify the plotting backend explicitly
Setting the plotting backend via global configuation

from model_diagnostics import set_config set_config(plot_backend="plotly")
Setting the plotting backend via a context manager

from model_diagnostics import config_context from model_diagnostics.calibration import plot_bias with config_context(plot_backend="plotly"): plot_bias(...)
The context manager has precedence over the global setting. Here is an example of an interactive reliability diagram backed by plotly:
```
import numpy as np
from model_diagnostics.calibration import plot_reliability_diagram
from model_diagnostics import set_config

set_config(plot_backend="plotly")

x = np.linspace(0, 2 * np.pi, 100)
y_obs = np.cos(x)
# A poor linear interpolation through extrema as predictions. 
y_pred = np.interp(x, [0, np.pi, 2 * np.pi], [1, -1, 1])
ax = plot_reliability_diagram(y_obs=y_obs, y_pred=y_pred)
```
If you wonder why this graph is not interactive despite promised to be, here is why: While this graph renders nicely in a juper lab notebook, for instance, this website is built with wordpress and I was simply unable to figure out a way to get the html of the graph rendered properly—without wordpress crashing. I you know a way, please contact me.

Code above as notebook: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2024-04-15%20reliability_plot_plotly.ipynb
2024-04-15
An Open Source Journey with Scikit-Learn
In this post, I’d like to tell the story of my journey into the open source world of Python with a focus on scikit-learn. My hope is that it encourages others to start or to keep contributing and have endurance for bigger picture changes.
Table of Content
How it all started
Becoming a scikit-learn core developer
Summary as core developer
How it all started

Back in 2015/2016, I was working as a non-life pricing actuary. The standard vendor desktop applications we used for generalized linear models (GLM) had problems of system discontinuities, manual error prone steps and the lack of modern machine learning capabilities (not even out-of-sample model comparison).

Python was then on the rise for data science. Numpy, scipy and pandas had laid the foundations, then came deep learning alias neural net frameworks leading to tensorflow and pytorch. XGBoost was also a game changer visible in Kaggle competition leaderboards. All those projects came as open source with thriving communities and possibilities to contribute.

While the R base package always comes with splendid dataframes (I guess they invented it) and battle proven GLMs out of the box, the Python site for GLMs was not that well developed. So I started with GLMs in statsmodels and generalized linear mixed models (a.k.a. hierarchical or multilevel models) in pymc (then called pymc3). My first open source contributions in the Python world were small issues in statsmodels and a little later the bug report pymc#2640 about memory alignment issues which was caused by joblib#563.

To my great surprise the famous machine learning library scikit-learn did not have GLMs, only penalized linear models and logistic regression, but no Poisson or Gamma GLMs which are essential in non-life insurance pricing. Fortunately, I was not the first one to notice this lack. There was already an open issue scikit-learn#5975 with many people asking for this feature. Just nobody had contributed a pull request (PR) yet.

That’s when I said to myself: It should not fail just because no one implements it. I really like open source and gained some programming experience during my PhD in particle physics, mainly C++. Eventually, I boldly (because I was still a newbie) opened the PR scikit-learn#9405 in summer 2017.

Becoming a scikit-learn core developer

This PR turned out to be essential for the development of GLMs and for becoming a scikit-learn core developer. I dare say that I almost got crazy trying to convince the core developers that GLMs are really that useful for supervised machine learning and that GLMs should land in scikit-learn. In retrospective, this was the hardest part and it took me almost 2 years of patience and repeating my arguments, some examples comments are given below:
comment example 1

“I can only repeat myself: I’d prefer to have this functionality in scikit-learn for several reasons (your review, opinion and ideas, very official/trustworthy library, more efficient maintainance, effort to release this pr as its own library, …).
To be more explicit for the moment: If it takes longer than the end of 2019 (+-), I’ll consider to release it as separate library.” link

comment example 2

“I see it a bit different. Scikit-Learn like R glm and glmnet is trusted world-wide and can be used in many companies, whereas it might be difficult to get any of the existing GLM libraries on pypi (h2o excluded) into production (no offense intended). That being said, I’d like to return the question and ask you: What exactly has to be fulfilled in order for a GLM PR to be merged into scikit-learn? Once that is clarified, I’ll think about starting a collaboration for this.” link
comment example 3

…

guidance – maintenance
As a GLM user on a fairly regular basis, I’d be happy to help as good as I can. Feel free to reach out to me. As to maintenance, I think a unified framework would even lower the burden. I can also imagine to give some support for maintenance.

miscellaneous

…

For GBMs to rely on the same loss and link functions would make sense …

…

further steps
Besides further commits to this PR, let me know how I can help you best.

[link]
As I wanted to demonstrate the full utility of GLMs, this PR had become much too large for review and inclusion: +4000 lines of code with several solvers, penalty matrices, 3 examples, a lot of documentation and good test coverage (and a lot of things I would do differently today).

The conclusion was to carve out a minimal GLM implementation using the L-BFGS solver of scipy. This way, I met Roman Yurchak with whom it was a pleasure to work with. It took a little 🇨🇭Swiss chocolate🍫 incentive to finally get scikit-learn#14300 (still +2900 loc) reviewed and merged in spring 2020. Almost 3 years after opening my original PR, it was released in scikit-learn version 0.23!

I guess it was mainly this work and perseverance around GLMs that catched the attention of the core developers and that motivated them to vote for me: In summer 2020, I was invited to become a scikit-learn core developer and gladly accepted.

Summary as core developer

Further directions

My work on GLMs was easily extensible to other estimators in the form of loss functions. Again, to my surprise, loss functions, a core element for supervised learning, were re-implemented again and again within scikit-learn. So, based on Roman’s idea in #15123, I started a project to unify them, and by unifying also extending several tree estimator classes with poisson and gamma losses (and making existing ones more stable and faster).

As loss functions are such important core components, they have basically 2 major requirements: be numerically stable and fast. That’s why I went with Cython (preferred way for fast code in scikit-learn) in scikit-learn#20567 and guess which loop it closed? Again, I met segfault errors caused by joblib#563. This time, it motivated another core developer to quite an investment in fixing it in joblib#1254.

Another story branch is the dedicated GLM Python library glum. The authors took my original way too long GLM PR as a starting point and developed one of the most feature rich and fastest GLM implementations out there. This is almost like a dream come true.

A summary of my contributions over those 3 intensive years as scikit-learn core developer are best given in several categories.

Pull requests

A summary of my contributions in terms of code may be:
- Unified loss module, unified naming of losses, poisson and gamma losses for GLMs and decision tree based models
- LinearModelLoss and NewtonSolver (newton-cholesky) for GLMs like LogisticRegression and PoissonRegressor as well as further solver improvements
- QuantileRegressor (linear quantile regression) and quantile/pinball loss for HistGradientBoostingRegressor (HGBT). BTW, linear quantile regression is much harder than GLM solvers!
- SplineTransformer
- Interaction constraints and feature subsampling for HGBT
From the release notes and the github PRs (where one would miss a few) a more details list of important PRs

Sample weights for ElasticNet v0.23 (Major Feature)

Minimal Generalized linear models implementation v0.23 (Major Feature)

ENH Poisson loss for HistGradientBoostingRegressor v0.23

ENH add Poisson splitting criterion for single trees v0.24

Common Private Loss Module with tempita v1.1

RFC Consistent options/names for loss and criterion v1.0 and v1.1

ENH Replace loss module HGBT v1.1

FEA add quantile HGBT v1.1 (Major Feature)

ENH Loss module LogisticRegression v1.1

ENH migrate GLMs / TweedieRegressor to linear loss v1.1

FEA Add Gamma deviance as loss function to HGBT v1.3

ENH replace loss module Gradient boosting future v1.4

Add quantile regression together with David Dale v1.0 (MajorFeature)

FEA Add SplineTransformer v1.0

ENH FEA add interaction constraints to HGBT v.1.2 (Major Feature)

FEA add (single) Cholesky Newton solver to GLMs v.1.2

ENH add newton-cholesky solver to LogisticRegression v.1.2

TST tight tests for GLMs v.1.2

ENH scaling of LogisticRegression loss as 1/n * LinearModelLoss future v1.4

ENH add feature subsampling per split for HGBT future v1.4
Reviewing and steering

Among the biggest changes in newer scikit-learn history are two scikit-learn enhancement proposals (SLEP)
- SLEP018: Pandas Output for Transformers with set_output
  championed by Thomas Fan, implemented in PR#23734 v1.2, further developments like PR#27315 for polars in future v1.4
- SLEP006: Metadata Routing
  championed by Adrin Jalali, base implementation PR#22083
For both, I did one of the 2 obligatory reviews. Then, maybe the technically most challenging review I can remember was on:
- ENH Add Categorical support for HistGradientBoosting from Thomas Fan, v0.24 (MajorFeature)
Keep in mind that review(er)s are by far the scarcest resource of scikit-learn.

I also like to mention PR#25753 which changed to government to be more inclusive, in particular with voting rights.

Lessons learned

Just before the end, a few critical words must be allowed.
- Scikit-learn is focused a lot on stability. For some items of my wish list to land in scikit-learn, it would have again taken years. This time, I decided to release my own library model-diagnostics and I enjoy the freedom to use cutting edge components like polars.
- As part-time statistician, I consider certain design choices like classifiers’ predict implicitly using a 50% threshold instead of returning a predicted probability (what predict_proba does) a bit poor. Hard to change!!! At least, PR#26120 might improve that to some extent.
- I ponder a lot on the pipeline concept. At first, it was like an eye-opener for me to think of feature preprocessing as part of the estimator. The scikit-learn API is build around the pipeline design with fit, transform and predict. But the current trend of modern model classes like gradient boosted trees (XGBoost, LightGBM, HGBT) don’t need a preprocessing pipeline anymore, e.g., they can natively deal with categorical features and missing values. But it is hard to pass the information which feature to treat as categorical through a pipeline, see scikit-learn#18894.
- It is still a very painful experience to specify design matrices of linear models, in particular interaction terms, see scikit-learn#15263, #19533 and #25412. Doing that in a pipline with a ColumnTransformer is just very complicated and prohibits a lot of optimizations (mostly for categoricals)—which is one of the reasons glum is faster.
One of the greatest rewards of this journey was that I learned a lot, about Python, machine learning, rigorous reviews, CI/CD, open source communities, endurance. But even more so, I had the pleasure to meet and work with some kind, brilliant and gifted people like Roman Yurchak, Alexandre Gramfort, Olivier Grisel, Thomas Fan, Nicolas Hug, Adrin Jalali and many more. I am really grateful to be a part of something bigger than its parts.
2023-11-24
Model Diagnostics in Python
🚀Version 1.0.0 of the new Python package for model-diagnostics was just released on PyPI. If you use (machine learning or statistical or other) models to predict a mean, median, quantile or expectile, this library offers tools to assess the calibration of your models and to compare and decompose predictive model performance scores.🚀
```
pip install model-diagnostics
```
After having finished our paper (or better: user guide) “Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice” last year, I realised that there is no Python package that supports the proposed diagnostic tools (which are not completely new). Most of the required building blocks are there, but putting them together to get a result amounts quickly to a large amount of code. Therefore, I decided to publish a new package.

By the way, I really never wanted to write a plotting library. But it turned out that arranging results until they are ready to be visualised amounts to quite a large part of the source code. I hope this was worth the effort. Your feedback is very welcome, either here in the comments or as feature request or bug report under https://github.com/lorentzenchr/model-diagnostics/issues.

For a jump start, I recommend to go directly to the two examples:
- Regression on Workers’ Compensation Dataset
- Quantile Regression on Synthetic Data
To give a glimpse of the functionality, here are some short code snippets.
```
from model_diagnostics.calibration import compute_bias
from model_diagnostics.calibration import plot_reliability_diagram


y_obs = list(range(10))
y_pred = [2, 1, 3, 3, 6, 8, 5, 5, 8, 9.]
plot_reliability_diagram(
    y_obs=y_obs,
    y_pred=y_pred,
    n_bootstrap=1000,
    confidence_level=0.9,
)
```
```
compute_bias(y_obs=y_obs, y_pred=y_pred)
```
bias_mean bias_count bias_weights bias_stderr p_value
f64 u32 f64 f64 f64
0.5 10 10.0 0.477261 0.322121
```
from model_diagnostics.scoring import SquaredError, decompose


decompose(
    y_obs=y_obs,
    y_pred=y_pred,
    scoring_function=SquaredError(),
)
```
miscalibration discrimination uncertainty score
f64 f64 f64 f64
1.283333 7.233333 8.25 2.3

This score decomposition is additive (and unique):
```
\begin{equation*}
\mathrm{score} = \mathrm{miscalibration} - \mathrm{discrimination} + \mathrm{uncertainty}
\end{equation*}
```
As usual, the code snippets are collected in a notebook: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2023-07-16%20model-diagnostics.ipynb.
2023-07-16

Quantiles And Their Estimation

Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will give my best to provide a good mix of theory and practical examples.

While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box-plots, in childrens’ weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide solvency II framework for insurance companies, in quality control and in many more fields.

Definitions

Often, one talks about quantiles, but rarely defines them. In what fallows, I borrow from Gneiting (2011).

Definition 1: Quantile

Given a cumulative probability distribution (CDF) $F(x)=\mathbb{P}(X\leq x)$ , the quantile at level $\alpha \in (0,1)$ (ɑ-quantile for short), $q_\alpha(F)$ , is defined as

\begin{equation*} q_\alpha(F) = \{x: \lim_{y\uparrow x} F(y)\leq \alpha \leq F(x)\} \,. \end{equation*}

The inequalities of this definition are called coverage conditions. It is very important to note that quantiles are potentially set valued. Another way to write this set is as an interval:

\begin{align*} q_\alpha(F) &\in [q_\alpha^-(F), q_\alpha^+(F)] \\ q_\alpha^-(F) &= \sup\{x:F(x) < \alpha\} = \inf\{x:F(x) \geq \alpha\} \\ q_\alpha^+(F) &= \inf\{x:F(x) > \alpha\} = \sup\{x:F(x) \leq \alpha\} \end{align*}

For $q_\alpha^-$ , we recover the usual quantile definition as the generalized inverse of $F$ . But this is only one possible value. I will discuss examples of quantile intervals later on.

To get acquainted a bit more, let’s plot the cumulative distribution function and the quantile function for some continuous distributions: Normal, Gamma and Pareto distribution. The parametrisation is such that all have mean 2, Normal and Gamma have variance 4, and Pareto has an infinite variance. For those continuous and strictly increasing distributions, all quantiles are unique, and therefore simplify to the inverse CDF $q_\alpha^-$ in the above equations. Note that those three distributions have very different tail behaviour: The density of the Normal distribution has the famously fast decrease $\propto e^{-x^2}$ , the Gamma density has an exponentially decreasing tail $\propto e^{-x}$ and the Pareto density has a fat tail, i.e. an inverse power $\propto \frac{1}{x^\alpha}$ .

CDF (top) and quantile function (bottom) of several distributions: Normal $N(\mu=2, \sigma^2=2)$ (left), Gamma $Ga(\alpha=2, \beta=\frac{1}{2})$ (middle) and Pareto $Pa(\alpha=2)$ (right).

There are at least two more equivalent ways to define quantiles. They will help us later to get a better visualisations.

Definition 2: Quantile as minimiser

Given a probability distribution $F$ , the ɑ-quantile $q_\alpha(F)$ is defined as any minimiser of the expected scoring function $S$

\begin{align*} q_\alpha(F) &\in \argmin_x \mathbb{E}(S_\alpha(x, Y)) \\ S_\alpha(x, y) &= (\mathbf{1}_{x\geq y} - \alpha)(x - y) \end{align*}

where the expectation is taken over $Y \sim F$ .

The scoring function or loss function $S$ can be generalized to $S_\alpha(x, y) = (\mathbb{1}_{x\geq y} – \alpha)(g(x) – g(y))$ for any increasing function $g$ , but the above version in definition 2 is by far the simplest one and coincides with the pinball loss used in quantile regression.

This definition is super useful because it provides a tool to assess whether a given value really is a quantile. A plot will suffice.

Having a definition in terms of a convex optimisation problem, there is another definition in terms of the first order condition of optimality. For continuous, strictly increasing distributions, this would be equivalent to setting the first derivative to zero. For our non-smooth scoring function with potentially set-valued solution, this gets more complicated, e.g. subdifferential or subgradients replacing derivatives. In the end, it amounts to a sign change of the expectation of the so called identification function $V(x, y)=\mathbf{1}_{x\geq y}-\alpha$ .

Empirical Distribution

The empirical distribution provides an excellent example. Given a sample of n observations $y_1, \ldots, y_n$ , the empirical distribution is given by $F_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{x\geq y_i}$ . Let us start simple and take two observations $y_1=1$ and $y_2=2$ . Plugging in this distribution in the definition 1 of quantiles gives the exact quantiles of this 2-point empirical CDF:

\begin{equation} q_\alpha(F_2)= \begin{cases} 1 &\text{if } \alpha<\frac{1}{2} \\ [1, 2] &\text{if } \alpha=\frac{1}{2} \\ 2 &\text{else} \end{cases} \end{equation}

Here we encounter the interval [1, 2] for $\alpha=\frac{1}{2}$ . Again, I plot both the (empirical) CDF $F_n$ and the quantiles.

Empirical distribution function and exact quantiles of observations y=1 and y=2.

In the left plot, the big dots unambiguously mark the values at x=1 and x=2. For the quantiles in the right plot, the vertical line at probability 0.5 really means that all those values between 1 and 2 are possible 0.5-quantiles, also known as median.

If you wonder about the value 1 for quantiles of level smaller than 50%, the minimisation formulation helps. The following plot shows $\mathbb{E}(S_\alpha(x, Y))$ for $\alpha=0.2$ with a clear unique minimum at x=1.

Expected scoring function (pinball loss) for $\alpha=0.2$ for the empirical CDF with observations 1 and 2.

A note for the interested reader: The above empirical distribution is the same as the distribution of a Bernoulli random variable, except that the x-values are shifted, i.e. the Bernoulli random variables are canonically set to 0 and 1 instead of 1 and 2. Furthermore, there is a direct connection between quantiles and classification via the cost-weighted misclassification error, see Fissler, Lorentzen & Mayer (2022).

Empirical Quantiles

From the empirical CDF, it is only a small step to empirical quantiles. But what’s the difference anyway? While we saw the exact quantile of the empirical distribution, $q_\alpha(F_n)$ , an empirical or sample quantile estimate the true (population) quantile given a data sample, i.e. $\hat{q}_\alpha(\{y_i\}) \approx q_\alpha(F)$ .

As an empirical CDF estimates the CDF of the true underlying (population) distribution, $F_n=\hat{F} \approx F$ , one immediate way to estimate a quantile is:

Estimate the CDF via the empirical CDF $F_n$ .
Use the exact quantile in analogy to Eq.(1) as an estimate.

Very importantly, this is just one way to estimate quantiles from a sample. There are many, many more. Here is the outcome of the 20%-quantile of our tiny data sample $y_1=1$ and $y_2=2$ .

import numpy as np

methods = [
    'inverted_cdf',
    'averaged_inverted_cdf',
    'closest_observation',
    'interpolated_inverted_cdf',
    'hazen',
    'weibull',
    'linear',
    'median_unbiased',
    'normal_unbiased',
    'nearest',
    'lower',
    'higher',
    'midpoint',
]
alpha = 0.2
for m in methods:
    estimate = np.quantile([1, 2], 0.2, method=m)
    print(f"{m:<25} {alpha}-quantile estimate = {estimate}")

inverted_cdf              0.2-quantile estimate = 1
averaged_inverted_cdf     0.2-quantile estimate = 1.0
closest_observation       0.2-quantile estimate = 1
interpolated_inverted_cdf 0.2-quantile estimate = 1.0
hazen                     0.2-quantile estimate = 1.0
weibull                   0.2-quantile estimate = 1.0
linear                    0.2-quantile estimate = 1.2
median_unbiased           0.2-quantile estimate = 1.0
normal_unbiased           0.2-quantile estimate = 1.0
nearest                   0.2-quantile estimate = 1
lower                     0.2-quantile estimate = 1
higher                    0.2-quantile estimate = 2
midpoint                  0.2-quantile estimate = 1.5

Note that the first 9 methods are the ones discussed in a famous paper of Hyndman & Fan (1996). The default method of both Python’s numpy.quantile and R’s quantile is linear, i.e. number 7 in Hyndman & Fan. Somewhat surprisingly, we observe that this default method is clearly biased in this case and overestimates the true quantile.

For large sample sizes, the differences will get tiny and all methods converge finally to a true quantile, at least for continuous distributions. In order to assess the bias with small sample sizes for each method, I do a simulation. This is where the fun starts😄

For all three selected distributions and for quantile levels 15%, 50% and 85%, I simulate 10000 random samples, each of sample size 10 and calculate the sample quantile. Then I take the mean over all 10000 simulations as well as the 5% and the 95% quantiles as a measure of uncertainty, i.e. 90% confidence intervals. After some coding, this results in the following plot. (I spare you the code at this point. You can find it in the linked notebook at the bottom of this post).

Small sample bias (n=10) of different empirical quantile estimation methods (x-axis and color) based on 10000 simulations. Dots are the mean values, error bars cover a 90% confidence interval. The dotted horizontal line is the theoretical quantile value.
left: Normal distribution; mid: Gamma distribution; right: Pareto distribution
top: 15%-quantile; mid: 50%-quantile; bottom: 85%-quantile

For the 15%-quantile, the default linear method always overestimates, but it does surprisingly well for the 85%-quantile of the Pareto distribution. Overall, I personally would prefer the median unbiased or the Hazen method. Interestingly, the Hazen method is one of the oldest, namely from Hazen (1914), and is the only one that fulfills all proposed properties of Hyndman & Fan, who propose the median unbiased method as default.

Quantile Regression

So far, the interest was in the quantile of a sample or distribution. To go one step further, one might ask for the conditional quantile of a response variable Y given some features or covariates X, $q_\alpha(Y|X)=q_\alpha(F_{Y|X})$ . This is the realm of quantile regression as invented by Koenker & Bassett (1978). To stay within linear models, one simply replaces the squared error of ordinary least squares (OLS) by the pinball loss.

Without going into any more details, I chose the Palmer’s penguin dataset and the body mass in gram as response variable, all the other variables as feature variables. And again, I model the 15%, 50% and 85% quantiles.

import seaborn as sns

df = sns.load_dataset("penguins").dropna()
df

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	Male
…	…	…	…	…	…	…	…
338	Gentoo	Biscoe	47.2	13.7	214.0	4925.0	Female
340	Gentoo	Biscoe	46.8	14.3	215.0	4850.0	Female
341	Gentoo	Biscoe	50.4	15.7	222.0	5750.0	Male
342	Gentoo	Biscoe	45.2	14.8	212.0	5200.0	Female
343	Gentoo	Biscoe	49.9	16.1	213.0	5400.0	Male

from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import QuantileRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

y = df["body_mass_g"]
X = df.drop(columns="body_mass_g")
qr50 = Pipeline([
    ("column_transformer",
     ColumnTransformer([
         ("ohe", OneHotEncoder(drop="first"), ["species", "island", "sex"]),
         ("spline", SplineTransformer(n_knots=3, degree=2), ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]),
         
     ])
    ),
    ("quantile_regressor",
     QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
    )
])
qr15 = clone(qr50)
qr15.set_params(quantile_regressor__quantile=0.15)
qr85 = clone(qr50)
qr85.set_params(quantile_regressor__quantile=0.85)

qr15.fit(X, y)
qr50.fit(X, y)
qr85.fit(X, y)

This code snippet gives the three fitted linear quantile models. That was the easy part. I find it much harder to produce good visualisations.

import polars as pl  # imported and used much earlier in the notebook

df_obs = df.copy()
df_obs["type"] = "observed"
dfs = [pl.from_pandas(df_obs)]
for m, name in [(qr15, "15%-q"), (qr50, "median"), (qr85, "85%-q")]:
    df_pred = df.copy()
    df_pred["type"] = "predicted_" + name
    df_pred["body_mass_g"] = m.predict(X)
    dfs.append(pl.from_pandas(df_pred))
df_pred = pl.concat(dfs).to_pandas()

sns.catplot(df_pred, x="species", y="body_mass_g", hue="type", alpha=0.5)

Quantile regression estimates and observed values for species (x-axis).

This plot per species seems suspicious. We would expect the 85%-quantile prediction in red to mostly be larger than the median (green), instead we detect several clusters. The reason behind it is the sex of the penguins which also enters the model. To demonstrate this fact, I plot the sex separately and make the species visible by the plot style. This time, I put the flipper length on the x-axis.

sns.relplot(
    df_pred,
    x="flipper_length_mm",
    y="body_mass_g",
    hue="type",
    col="sex",
    style="species",
    alpha=0.5,
)

Quantile regression estimates and observed values vs flipper length (x-axis). Left: male; right: female

This is a really nice plot:

One immediately observes the differences in sex on the left and right subplot.
The two clusters in each subplot can be well explained by the penguin species.
The 3 quantiles are now vertically in the expected order, 15% quantile in yellow the lowest, 85%-quantile in red the highest, the median in the middle, and some observations beyond those predicted quantiles.
The models are linear in flipper length with a positive, but slightly different slope per quantile level.

As a final check, let us compute the coverage of each quantile model (in-sample).

[
    np.mean(y <= qr15.predict(X)),
    np.mean(y <= qr50.predict(X)),
    np.mean(y <= qr85.predict(X)),
]

[0.15015015015015015, 0.5225225225225225, 0.8618618618618619]

Those numbers are close enough to 15%, 50% and 85%.

As always, the full Python code is available as notebook: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2023-02-11%20empirical_quantiles.ipynb.

2023-02-11

Histograms, Gradient Boosted Trees, Group-By Queries and One-Hot Encoding

This post shows how filling histograms can be done in very different ways thereby connecting very different areas: from gradient boosted trees to SQL queries to one-hot encoding. Let’s jump into it!

Modern gradient boosted trees (GBT) like LightGBM, XGBoost and the HistGradientBoostingRegressor of scikit-learn all use two techniques on top of standard gradient boosting:

2nd order Taylor expansion of the loss which amounts to using gradients and hessians.
One histogram per feature: bin the feature and fill the histogram with the gradients and hessians.

The filling of the histograms is often the bottleneck when fitting GBTs. While filling a single histogram is very fast, this operation is executed many times: for each boosting round, for each tree split and for each feature. This is the reason why GBT implementations have dedicated routines for it. We look into this operation from different angles.

For the coming (I)Python code snippets to work (# %% indicates a new notebook cell), we need the following imports.

import duckdb                    # v0.5.1
import matplotlib.pyplot as plt  # v.3.6.1
from matplotlib.ticker import MultipleLocator
import numpy as np               # v1.23.4
import pandas as pd              # v1.5.0
import pyarrow as pa             # v9.0.0
import tabmat                    # v3.1.2

from sklearn.ensemble._hist_gradient_boosting.histogram import (
    _build_histogram_root,
)                                # v1.1.2
from sklearn.ensemble._hist_gradient_boosting.common import (
  HISTOGRAM_DTYPE
)

Naive Histogram Visualisation

As a starter, we create a small table with two columns: bin index and value of the hessian.

def highlight(df):
    if df["bin"] == 0:
        return ["background-color: rgb(255, 128, 128)"] * len(df)
    elif df["bin"] == 1:
        return ["background-color: rgb(128, 255, 128)"] * len(df)
    else:
        return ['background-color: rgb(128, 128, 255)'] * len(df)

df = pd.DataFrame({"bin": [0, 2, 1, 0, 1], "hessian": [1.5, 1, 2, 2.5, 3]})
df.style.apply(highlight, axis=1)

	bin	hessian
0	0	1.500000
1	2	1.000000
2	1	2.000000
3	0	2.500000
4	1	3.000000

A histogram then sums up all the hessian values belonging to the same bin. The result looks like the following.

Dedicated Method

We simulate filling the histogram of a single feature. Therefore, we draw 1,000,000 random variables for gradients and hessians as well as the bin indices.

import duckdb
import pyarrow as pa
import numpy as np
import tabmat

from sklearn.ensemble._hist_gradient_boosting.histogram import (
    _build_histogram_root,
)
from sklearn.ensemble._hist_gradient_boosting.common import HISTOGRAM_DTYPE


rng = np.random.default_rng(42)
n_obs = 1000_000
n_bins = 256
binned_feature = rng.integers(0, n_bins, size=n_obs, dtype=np.uint8)
gradients = rng.normal(size=n_obs).astype(np.float32)
hessians = rng.lognormal(size=n_obs).astype(np.float32)

Now we use the dedicated (and private!) and single-threaded method _build_histogram_root from sckit-learn to fill a histogram.

hist_root = np.zeros((1, n_bins), dtype=HISTOGRAM_DTYPE)
%time _build_histogram_root(0, binned_feature, gradients, hessians, hist_root)
# Wall time: 1.38 ms

This executes in around 1.4 ms. This is quite fast. But again, imagine 100 boosting rounds with 10 tree splits on average and 100 features. This means this is done around 100,000 times and would therefore take roughly 2 minutes.

Let’s have a look at the first 5 bins:

hist_root[:, 0:5]

array([[(-79.72386998, 6508.89500265, 3894),
        ( 37.98393589, 6460.63222205, 3998),
        ( 53.54256977, 6492.22722797, 3805),
        ( 21.19542398, 6797.34159299, 3928),
        ( 16.24716742, 6327.03757573, 3875)]],
      dtype=[('sum_gradients', '<f8'), ('sum_hessians', '<f8'), ('count', '<u4')])

SQL Group-By Query

Someone familiar with SQL and database queries might immediately see how this task can be formulated as SQL group-by-aggregate query. To demonstrate it on our simulated data, we use DuckDB as well as Apache Arrow (the file format as well as the Python library pyarrow). You can read more about DuckDB in our post DuckDB: Quacking SQL.

# %%
con = duckdb.connect()
arrow_table = pa.Table.from_pydict(
    {
        "bin": binned_feature,
        "gradients": gradients,
        "hessians": hessians,
})
# Read data once to make timing fairer
arrow_result = con.execute("SELECT * FROM arrow_table")

# %%
%%time
arrow_result = con.execute("""
SELECT
    bin as bin,
    SUM(gradients) as sum_gradients,
    SUM(hessians) as sum_hessians,
    COUNT() as count
FROM arrow_table
GROUP BY bin
""").arrow()
# Wall time: 6.52 ms

On my laptop, this takes about 6.5 ms and, upon sorting, gives the same results:

arrow_result.sort_by("bin").slice(length=5)

pyarrow.Table
bin: uint8
sum_gradients: double
sum_hessians: double
count: int64
----
bin: [[0,1,2,3,4]]
sum_gradients: [[-79.72386997545254,37.98393589106854,53.54256977112527,21.195423980039777,16.247167424764484]]
sum_hessians: [[6508.895002648234,6460.632222048938,6492.227227974683,6797.341592986137,6327.037575732917]]
count: [[3894,3998,3805,3928,3875]]

As we have the table as an Arrow table, we can stay within pyarrow:

%%time
arrow_result = arrow_table.group_by("bin").aggregate(
    [
        ("gradients", "sum"),
        ("hessians", "sum"),
        ("bin", "count"),
    ]
)
# Wall time: 10.8 ms

The fact that DuckDB is faster than Arrow on this task might have to do with the large invested effort on parallelised group-by operations, see their post Parallel Grouped Aggregation in DuckDB for more infos.

One-Hot encoded Matrix Multiplication

I think it is very interesting that filling histograms can be written as a matrix multiplication! The trick is to view the feature as a categorical feature and use its one-hot encoded matrix representation. This blows up memory, of course. Note that one-hot encoding is usually met with generalized linear models (GLM) in order to incorporate nominal categorical feature variables with no internal ordering in the design matrix.

For our demonstration, we use a numpy index trick to construct the one-hot encoded matrix employing the fact that the binned feature already contains the right indices.

# %%
%%time
m_OHE = np.eye(n_bins)[binned_feature].T
vec = np.column_stack((gradients, hessians, np.ones_like(gradients)))
# Wall time: 770 ms

# %%
%time result_ohe = m_OHE @ vec
# Wall time: 199 ms

# %%
result_ohe[:5]

array([[ -79.72386998, 6508.89500265, 3894.        ],
       [  37.98393589, 6460.63222205, 3998.        ],
       [  53.54256977, 6492.22722797, 3805.        ],
       [  21.19542398, 6797.34159299, 3928.        ],
       [  16.24716742, 6327.03757573, 3875.        ]])

This is way slower, but, somehow surprisingly, produces the same result.

The one-hot encoded matrix is very sparse, with only one non-zero value per column, i.e. only one out of 256 (number of bins) values is non-zero. This structure can be exploited to reduce both CPU time as well as memory consumption, with the help of the package tabmat that was built to accelerate GLMs. Unfortunately, tabmat only provides a matrix-vector multiplication (and the sandwich product, of course), but no matrix-matrix multiplication. So we have to do a little extra work.

# %%
%time m_categorical = tabmat.CategoricalMatrix(cat_vec=binned_feature)
# Wall time: 21.5 ms

# %%
# tabmat needs contigous arrays with dtype = Python float = float64
vec = np.asfortranarray(vec, dtype=float)

# %%
%%time
tabmat_result = np.column_stack(
    (
        vec[:, 0] @ m_categorical,
        vec[:, 1] @ m_categorical,
        vec[:, 2] @ m_categorical,
    )
)
# Wall time: 4.82 ms

# %%
tabmat_result[0:5]

array([[ -79.72386998, 6508.89500265, 3894.        ],
       [  37.98393589, 6460.63222205, 3998.        ],
       [  53.54256977, 6492.22722797, 3805.        ],
       [  21.19542398, 6797.34159299, 3928.        ],
       [  16.24716742, 6327.03757573, 3875.        ]])

While the timing of this approach is quite good, the construction of a CategoricalMatrix requires more time than the matrix-vector multiplication.

Conclusion

In the end, the special (Cython) routine of scikit-learn ist the fastest of our tested methods for filling histograms. The other GBT libraries have their own even more specialised routines which might be a reason for even faster fit times. What we learned in this post is that this seemingly simple task plays a very crucial part in modern GBTs and can be accomplished by very different approaches. These different approaches uncover connections of algorithms of quite different domains.

The full code as ipython notebook can be found at https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2022-10-31%20histogram-GBT-GroupBy-OHE.ipynb.

2022-10-31

The Unfairness of AI Fairness
Fairness in Artificial Intelligence (AI) and Machine Learning (ML) is a recent and hot topic. As ML models are used in insurance pricing, the fairness topic also applies there. Just last month, Lindholm, Richman, Tsanakas and Wüthrich published a discussion paper on this subject that sheds new light on established AI fairness criteria. This post provides a short summary of this discussion paper with a few comments of my own. I recommend the interested reader to jump to the original: A Discussion of Discrimination and Fairness in Insurance Pricing.

First of all, I’d like to state that fairness in the form of solidarity and risk sharing was always at the heart of insurance and, as such, is very very old. The recent discussions regarding fairness has a different focus. It comes with the rise of successful ML models that can easily make use of the information contained in large amounts of data (many feature variables). A statistician might just call that multivariate statistical models. Insurance pricing is a domain where ML models (including GLMs) are successfully applied for quite some time (at least since the 1990s), and where at the same time protected information like gender and ethnicity might be available in the data. This led the European Council to forbid gender in insurance pricing.

The important point is—and here speaks the statistician again—that not using a certain features does in no way guarantee that this protected information is not used by a model. A car model or type, for instance, is correlated with the gender of the owner. This is called proxy discrimination.

The brilliant idea of Lindholm et al. was to construct an example where a protected feature does not influence the actuarial best price. So, everyone would agree that this is a fair model. But it turns out that the most common (statistical) definitions of AI fairness all fail. All of them judge this best price model as unfair. To be explicit, the following three group fairness axioms were analysed:
- Independence axiom / Statistical parity / Demographic parity
- Separation axiom / Equalized odds / Disparate mistreatment
- Sufficiency axiom / Predictive parity
On top of that, these 3 fairness criteria may force different insurance companies to exclude different non-protected variables from their pricing models.

How to conclude? It turns out that fairness is a complicated matter. It has many sociological, cultural and moral aspects. Apart from this broad spectrum, one particular challenge is to give precise mathematical definitions. This topic seems to be, as the paper suggests, open for discussion.
2022-10-14
From Least Squares Benchmarks to the Marchenko–Pastur Distribution
In this blog post, I tell the story how I learned about a theorem for random matrices of the two Ukrainian🇺🇦 mathematicians Vladimir Marchenko and Leonid Pastur. It all started with benchmarking least squares solvers in scipy.

Setting the Stage for Least Squares Solvers

Least squares starts with a matrix $A \in \mathbb{R}^{n,p}$ and a vector $b \in \mathbb{R}^{n}$ and one is interested in solution vectors $x \in \mathbb{R}^{p}$ fulfilling
```
x^\star = \argmin_x ||Ax-b||_2^2 \,.
```
You can read more about least squares in our earlier post Least Squares Minimal Norm Solution. There are many possible ways to tackle this problem and many algorithms are available. One standard solver is LSQR with the following properties:
- Iterative solver, which terminates when some stopping criteria are smaller than a user specified tolerance.
- It only uses matrix-vector products. Therefore, it is suitable for large and sparse matrices $A$ .
- It effectively solves the normal equations $A^T A = A^Tb$ based on a bidiagonalization procedure of Golub and Kahan (so never really calculating $A^T A$ ).
- It is, unfortunately, susceptible to ill-conditioned matrices $A$ , which we demonstrated in our earlier post.
Wait, what is an ill-conditioned matrix? This is most easily explained with the help of the singular value decomposition (SVD). Any real valued matrix permits a decomposition into three parts:
```
A = U S V^T \,.
```
$U$ and $V$ are orthogonal matrices, but not of further interest to us. The matrix $S$ on the other side is (rectangular) diagonal with only non-negative entries $s_i = S_{ii} \geq 0$ . A matrix $A$ is said to be ill-conditioned if it has a large condition number, which can be defined as the ratio of largest and smallest singular value, $\mathrm{cond}(A) = \frac{\max_i s_i}{\min_i s_i} = \frac{s_{\mathrm{max}}}{s_{\mathrm{min}}}$ . Very often, large condition numbers make numerical computations difficult or less precise.

Benchmarking LSQR

One day, I decided to benchmark the computation time of least squares solvers provided by scipy, in particular LSQR. I wanted results for different sizes $n, p$ of the matrix dimensions. So I needed to somehow generate different matrices $A$ . There are a myriad ways to do that. Naive as I was, I did the most simple thing and used standard Normal (Gaussian) distributed random matrices $A_{ij} \sim \mathcal{N}(0, 1)$ and ran benchmarks on those. Let’s see how that looks in Python.
```
from collections import OrderedDict
from functools import partial

import matplotlib.pyplot as plt
import numpy as np
from scipy.linalg import lstsq
from scipy.sparse.linalg import lsqr
import seaborn as sns
from neurtu import Benchmark, delayed

plt.ion()

p_list = [100, 500]
rng = np.random.default_rng(42)
X = rng.standard_normal(max(p_list) ** 2 * 2)
y = rng.standard_normal(max(p_list) * 2)

def bench_cases():    
    for p in p_list:
        for n in np.arange(0.1, 2.1, 0.1):
            n = int(p*n)
            A = X[:n*p].reshape(n, p)
            b = y[:n]
            for solver in ['lstsq', 'lsqr']:
                tags = OrderedDict(n=n, p=p, solver=solver)
                if solver == 'lstsq':
                    solve = delayed(lstsq, tags=tags)
                elif solver == 'lsqr':
                    solve = delayed(
                        partial(
                            lsqr, atol=1e-10, btol=1e-10, iter_lim=1e6
                        ),
                        tags=tags)

                yield solve(A, b)

bench = Benchmark(wall_time=True)
df = bench(bench_cases())

g = sns.relplot(x='n', y='wall_time', hue='solver', col='p',
            kind='line', facet_kws={'sharex': False, 'sharey': False},
            data=df.reset_index(), marker='o')
g.set_titles("p = {col_name}")
g.set_axis_labels("n", "Wall time (s)")
g.set(xscale="linear", yscale="log")

plt.subplots_adjust(top=0.9)
g.fig.suptitle('Benchmark LSQR vs LSTSQ')
for ax in g.axes.flatten():
    ax.tick_params(labelbottom=True)
```
Timing benchmark of LSQR and standard LAPACK LSTSQ solvers for different sizes of n on the x-axis and p=100 left, p=500 right.

The left plot already looks a bit suspicious around $n=p$ . But what is happening on the right side? Where does this spike of LSQR come from? And why does the standard least squares solver, SVD-based lstsq, not show this spike?

When I saw these results, I thought something might be wrong with LSQR and opened an issue on the scipy github repository, see https://github.com/scipy/scipy/issues/11204. The community there is really fantastic. Brett Naul pointed me to ….

The Marchenko–Pastur Distribution

The Marchenko–Pastur distribution is the distribution of the eigenvalues (singular values of square matrices) of certain random matrices in the large sample limit. Given a random matrix $A \in \mathbb{R}^{n,p}$ with i.i.d. entries $A_{ij}$ having zero mean, $\mathbb{E}[A_{ij}] = 0$ , and finite variance, $\mathrm{Var}[A_{ij}] = \sigma^2 < \infty$ , we define the matrix $Y_n = \frac{1}{n}A^T A \in \mathbb{R}^{p,p}$ . As square and even symmetric matrix, $Y_n$ has a simpler SVD, namely $Y_n = V \Sigma V^T$ . One can in fact show that $V$ is the same as in the SVD of $A$ and that the diagonal matrix $\Sigma = \frac{1}{n}S^TS$ contains the squared singular values of $A$ and $\min(0, p-n)$ extra zero values. The (diagonal) values of $\Sigma$ are called eigenvalues $\lambda_1, \ldots, \lambda_p$ of $Y_n$ . Note/ that the eigenvalues are themselves random variables, and we are interested in their probability distribution or probability measure. We define the (random) measure $\mu_p(B) = \frac{1}{p} \#\{\lambda_j \in B\}$ for all intervals $B \subset \mathbb{R}$ .

The theorem of Marchenko and Pastur then states that for $n, p \rightarrow \infty$ with $\frac{p}{n} \rightarrow \rho$ , we have $\mu_p \rightarrow \mu$ , where
```
\mu(B) = \begin{cases} (1-\frac{1}{\rho})\mathbb{1}_{0\in B} + \nu(B),\quad &\rho > 1 \\ \nu(B), \quad & 0\leq\rho\leq 1 \end{cases} \,,
```
```
\nu(x) = \frac{1}{2\pi\sigma^2} \frac{\sqrt{(\rho_+ - x)(x - \rho_-)}}{\rho x} \mathbb{1}_{x \in [\rho_-, \rho_+]} dx\,,
```
```
\rho_{\pm} = \sigma^2 (1\pm\sqrt{\rho})^2 \,.
```
We can at least derive the point mass at zero for $\rho>1 \Leftrightarrow p>n$ : We said above that $\Sigma$ contains $p-n$ extra zeros and those correspond to a density of $\frac{p-n}{p}=1 – \frac{1}{\rho}$ at zero.

A lot of math so far. Just note that the assumptions on $A$ are exactly met by the one in our benchmark above. Also note that the normal equations can be expressed in terms of $Y_n$ as $n Y_n x = A^Tb$ .

Empirical Confirmation of the Marchenko–Pastur Distribution

Before we come back to the spikes in our benchmark, let us have a look and see how good the Marchenko–Pastur distribution is approximated for finite sample size. We choose n=1000, p=500 which gives $\rho=\frac{1}{2}$ . We plot a histrogram of the eigenvalues next to the Marchenko–Pastur distribution.
```
def marchenko_pastur_mu(x, rho, sigma2=1):
    x = np.atleast_1d(x).astype(float)
    rho_p = sigma2 * (1 + np.sqrt(rho)) ** 2
    rho_m = sigma2 * (1 - np.sqrt(rho)) ** 2
    mu = np.zeros_like(x)
    is_nonzero = (rho_m < x) & (x < rho_p)
    x_valid = x[is_nonzero]
    factor = 1 / (2 * np.pi * sigma2 * rho)
    mu[is_nonzero] = factor / x_valid
    mu[is_nonzero] *= np.sqrt((rho_p - x_valid) * (x_valid - rho_m))
    if rho > 1:
        mu[x == 0] = 1 - 1 / rho
    return mu

fig, ax = plt.subplots()

n, p = 1000, 500
A = X.reshape(n, p)
Y = 1/n * A.T @ A
eigenvals, _ = np.linalg.eig(Y)
ax.hist(eigenvals.real, bins=50, density=True, label="histogram")
x = np.linspace(0, np.max(eigenvals.real), 100)
ax.plot(x, marchenko_pastur_mu(x, rho=p/n), label="MP distribution")
ax.legend()
ax.set_xlabel("eigenvalue")
ax.set_ylabel("probability")
ax.set_title("Empirical evidence for n=1000, p=500, rho=0.5")
```
Hitogram of eigenvalues for n=100, p=500 versus Marchenko–Pastur distribution

I have to say, I am very impressed by this good agreement for n=1000, which is far from being huge.

Conclusion

Let’s visualize the Marchenko–Pastur distribution $Y_n$ for several ratios $\rho$ and fix $\sigma=1$ :
```
fig, ax = plt.subplots()

rho_list = [0.5, 1, 1.5]
x = np.linspace(0, 5, 1000)[1:] # exclude 0
for rho in rho_list:
    y = marchenko_pastur_mu(x, rho)
    line, = ax.plot(x, y, label=f"rho={rho}")
    # plot zero point mass
    if rho > 1:
        ax.scatter(0, marchenko_pastur_mu(0, rho), color = line.get_color())

ax.set_ylim(None, 1.2)
ax.legend()
ax.set_title("Marchenko-Pastur Distribution")
ax.set_xlabel("x")
ax.set_ylabel("dmu/dx")
```
Marchenko-Pastur distribution for several ratios rho. The green dot is the point mass at zero.

From this figure it becomes obvious that the closer the ratio $\rho = 1$ , the higher the probability for very tiny eigenvalues. This results in a high probability for an ill-conditioned matrix $A^TA$ coming from an ill-conditioned matrix $A$ . Let’s confirm that:
```
p = 500
n_vec = []
c_vec = []
for n in np.arange(0.1, 2.05, 0.05):
    n = int(p*n)
    A = X[:n*p].reshape(n, p)
    n_vec.append(n)
    c_vec.append(np.linalg.cond(A))

fig, ax = plt.subplots()
ax.plot(n_vec, c_vec)
ax.set_xlabel("n")
ax.set_ylabel("condition number of A")
ax.set_title("Condition Number of A for p=500")
```
Condition number of the random matrix A

As a result of the ill-conditioned $A$ , the LSQR solver has problems to achieve its tolerance criterion, needs more iterations, and takes longer time. This is exactly what we observed in the benchmark plots: the peak occurred around $n=p$ . The SVD-based lstsq solver, on the other hand, does not use an iterative scheme and does not need more time for ill-conditioned matrices.

You find the accompanying notebook here: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2022-05-22%20from%20least%20squares%20benchmarks%20to%20the%20marchenko-pastur%20distribution.ipynb
2022-05-22
Personal Highlights of Scikit-Learn 1.0
Yes! After more than 10 years, scikit-learn released its 1.0 version on 24 September 2021. In this post, I’d like to point out some personal highlights apart from the release highlights.

1. Feature Names

This one is listed in the release highlights, but deserves to be mentioned again.
```
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pandas as pd

df = pd.DataFrame({
    "pet": ["dog", "cat", "fish"],
    "age": [3, 7, 1],
    "noise": [-99, pd.NA, 1e-10],
    "target": [1, 0, 1],
})
y = df.pop("target")
X = df

preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), ["age"]),
        ("categorical", OneHotEncoder(), ["pet"]),
    ],
    verbose_feature_names_out=False,
    remainder="drop",
)

pipe = make_pipeline(preprocessor, LogisticRegression())
pipe.fit(X, y)
pipe[:-1].get_feature_names_out()
```
```
array(['age', 'pet_cat', 'pet_dog', 'pet_fish'], dtype=object)
```
This is not yet available for all estimators and transformers, but it is a big step towards SLEP007.

2. ColumnTransformer allows changed order of columns

Before this release, ColumnTransformer recorded the order of the columns of a dataframe during the fit method and required that a dataframe X passed to transform had the exact same columns and in the exact same order.

This was a big pain point in productive settings because fit and predict of a model pipeline, both calling transform, often get data from different sources, and, for instance, SQL does not care about the order of columns. On top, remainder=”drop” forced you to have also all dropped columns in transform. This contradicted at least my modelling workflow as I often specify all meaningful features explicitly and drop the rest by the remainder option. This then led to unwanted surprises when applying predict to new data in the end. It might also happen, that one forgets to remove the target variable from the training X and relies on the drop option. Usually, the application of the predictive model pipeline is on new data without the target variable. The error in this case, however, might be considered a good thing.

With pull request (PR) #19263, the ColumnTransformer only cares about the presence and names of the columns, not about their order. With remainder=”drop”, it only cares about the specified columns and ignores all other columns, even no matter if the dropped ones are different in fit and transform. Note that this only works with pandas dataframes as input (or an object that quacks alike).
```
df_new = pd.DataFrame({
    "age": [1, 9, 3],
    "another_noise": [pd.NA, -99, 1e-10],
    "pet": ["cat", "dog", "fish"],
})
pipe.predict(df_new)
```
You find these little code snippets as notebook at the usual place: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-10-21%20scikit-learn_v1_release_highlights.ipynb.

3. Poisson criterion for random forests

Scikit-learn v0.24 shipped with the new option criterion=”poisson” for DecisionTreeRegressor to split nodes based on the reduction of Poisson deviance. Version 1.0 passed this option further to the RandomForestRegressor in PR #19836. Random forests are often used models and valued for their ease of use. We even like to write blog posts about them:
- Strong Random Forests with XGBoost
- Feature Subsampling For Random Forest Regression
The Poisson splitting criterion has its place when modelling counts or frequencies. It allows for non-negative values to be modelled, but forbids non-positive predictions. This corresponds to $y_{train} \geq 0$ and $y_{predict} > 0$ .

4. The best example/tutorial of the year

It’s not visible from the release notes, but this deserves to be noted. PR #20281 added a fantastic example, more like a tutorial, on time-related feature engineering. You find a lot of interesting features, some of them shipped with the 1.0 release, e.g. time base cross validation, generation of cyclic b-splines and adding pairwise interactions to a linear model, usage of native categorical features in the HistGradientBoostingRegressor…

Take a look for yourself at this wonderful example.
2021-10-21

Feature Subsampling For Random Forest Regression

TLDR: The number of subsampled features is a main source of randomness and an important parameter in random forests. Mind the different default values across implementations.

Randomness in Random Forests

Random forests are very popular machine learning models. They are build from easily understandable and well visualizable decision trees and give usually good predictive performance without the need for excessive hyperparameter tuning. Some drawbacks are that they do not scale well to very large datasets and that their predictions are discontinuous on continuous features.

A key ingredient for random forests is—no surprise here—randomness. The two main sources for randomness are:

Feature subsampling in every node split when fitting decision trees.
Row subsampling (bagging) of the training dataset for each decision tree.

In this post, we want to investigate the first source, feature subsampling, with a special focus on regression problems on continuous targets (as opposed to classification).

Feature Subsampling

In his seminal paper, Leo Breiman introduced random forests and pointed out several advantages of feature subsamling per node split. We cite from his paper:

The forests studied here consist of using randomly selected inputs or combinations of inputs at each node to grow each tree. The resulting forests give accuracy that compare favorably with Adaboost. This class of procedures has desirable characteristics:
i Its accuracy is as good as Adaboost and sometimes better.
ii It’s relatively robust to outliers and noise.
iii It’s faster than bagging or boosting.
iv It gives useful internal estimates of error, strength, correlation and variable importance.
v It’s simple and easily parallelized.
Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).

Note the focus on comparing with Adaboost at that time and the, in today’s view, relatively small datasets used for empirical studies in this paper.

If the input data as p number of features (columns), implementations of random forests usually allow to specify how many features to consider at each split:

Implementation	Language	Parameter	Default
scikit-learn RandomForestRegressor	Python	max_features	p
scikit-learn RandomForestClassifier	Python	max_features	sqrt(p)
ranger	R	mtry	sqrt(p)
randomForest regression	R	mtry	p/3
randomForest classification	R	mtry	sqrt(p)
H2O regression	Python & R	mtries	p/3
H2O classification	Python & R	mtries	sqrt(p)

Note that the default of scikit-learn for regression is surprising because it switches of the randomness from feature subsampling rendering it equal to bagged trees!

While empirical studies on the impact of feature for good default choices focus on classification problems, see the literature review in Probst et al 2019, we consider a set of regression problems with continuous targets. Note that different results might be more related to different feature spaces than to the difference between classification and regression.

The hyperparameters mtry, sample size and node size are the parameters that control the randomness of the RF. […]. Out of these parameters, mtry is most influential both according to the literature and in our own experiments. The best value of mtry depends on the number of variables that are related to the outcome.
Probst, P. et al. “Hyperparameters and tuning strategies for random forest.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019): n. pag.

Benchmarks

We selected the following 13 datasets with regression problems:

Dataset	number of samples	number of used features p
Allstate	188,318	130
Bike_Sharing_Demand	17,379	12
Brazilian_houses	10’692	12
ames	1’460	79
black_friday	166’821	9
colleges	7’063	49
delays_zurich_transport	27’327	17
diamonds	53’940	6
la_crimes	1’468’825	25
medical_charges_nominal	163’065	11
nyc-taxi-green-dec-2016	581’835	14
particulate-matter-ukair-2017	394’299	9
taxi	581’835	18

Note that among those, there is no high dimensional dataset in the sense that p>number of samples.

On these, we fitted the scikit-learn (version 0.24) RandomForestRegressor (within a short pipeline handling missing values) with default parameters. We used 5-fold cross validation with 4 different values max_feature=p/3 (blue), sqrt(p) (orange), 0.9 p (green), and p (red). Now, we show the mean squared error with uncertainty bars (± one standard deviation of cross validation splits), the lower the better.

In addition, we also report the fit time of each (5-fold) fit in seconds, again the lower the better.

Note that sqrt(p) is often smaller than p/3. With this in mind, this graphs show that fit time is about proportional to the number of features subsampled.

Conclusion

The main tuning parameter of random forests is the number of features used for feature subsampling (max_features, mtry). Depending on the dataset, it has a relevant impact on the predictive performance.
The default of scikit-learn’s RandomForestRegressor seems odd. It produces bagged trees. This is a bit like using ridge regression with a zero penalty😉. However, it can be justified by our benchmarks above.

The full code can be found here:

https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-08-19%20random_forests_max_features.ipynb

2021-08-19

bias_mean	bias_count	bias_weights	bias_stderr	p_value
f64	u32	f64	f64	f64
0.5	10	10.0	0.477261	0.322121

miscalibration	discrimination	uncertainty	score
f64	f64	f64	f64
1.283333	7.233333	8.25	2.3

Author: Christian Lorentzen

Inference

Prediction

Model Fit

Diagnostics for Inference

Diagnostics for Prediction

Tweedie Distributions

Compound Poisson Gamma

Wright’s Generalized Bessel Function

The Integral Representation

Arc Length Minimization

Conclusion

From Mean-Variance Relation to Score Equations

Offsets and Sample Weights

Poisson GLM

Tweedie GLM

Example

Outlook

Intro

Mean-Variance Relation

Insurance Pricing Models

Convolution and Aggregation Invariance

Poisson Distribution

Frequency Example

Outlook

How it all started

Becoming a scikit-learn core developer

Summary as core developer

Further directions

Pull requests

Reviewing and steering

Lessons learned

Definitions

Empirical Distribution

Empirical Quantiles

Quantile Regression

Naive Histogram Visualisation

Dedicated Method

SQL Group-By Query

One-Hot encoded Matrix Multiplication

Conclusion

Setting the Stage for Least Squares Solvers

Benchmarking LSQR

The Marchenko–Pastur Distribution

Empirical Confirmation of the Marchenko–Pastur Distribution

Conclusion

1. Feature Names

2. ColumnTransformer allows changed order of columns

3. Poisson criterion for random forests

4. The best example/tutorial of the year

Randomness in Random Forests

Feature Subsampling

Benchmarks

Conclusion