A Tweedie Trilogy – Part I: Frequency and Aggregation Invariance

TLDR: In this first part of the Tweedie Trilogy, we will take a look at what happens to a GLM if we aggregate the data by a group-by operation. A frequency model for insurance pricing will serve as an example.

This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

Intro

Tweedie distributions and Generalised Linear Models (GLM) have an intertwined relationship. While GLMs are, in my view, one of the best reference models for estimating expectations, Tweedie distributions lie at the heart of expectation estimation. In fact, almost all GLMs used in practice rely on Tweedie distributions, with three notable exceptions: the binomial, the multinomial and the negative binomial distribution.

Mean-Variance Relation

“An index which distinguishes between some important exponential families” is the title of the original publication by Maurice Charles Kenneth Tweedie in 1984. Note, however, that Shaul K. Bar-Lev and Peter Enis published around the same time; as their 1986 paper was received in November 1983, the distribution could also be named the Bar-Lev & Enis distribution (see footnote 1). This index is nowadays called the Tweedie power parameter p. Recall that distributions of the exponential dispersion family always fulfil a mean-variance relationship. It's even a way to define them. For the Tweedie distribution, denoted \mathrm{Tw}_p(\mu, \phi), the relation reads

\begin{align*}
\operatorname{E}[Y] &= \mu
\\
\operatorname{Var}[Y] &= \phi \mu^p
\end{align*}

with dispersion parameter \phi. Some very common members are given in the following table.

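| p     | distribution           | domain Y                | domain \mu   |
|-------|------------------------|-------------------------|--------------|
| 0     | Normal / Gaussian      | \mathrm{R}              | \mathrm{R}   |
| 1     | Poisson                | 0, 1, 2, \ldots         | \mathrm{R}_+ |
| (1,2) | Compound Poisson-Gamma | \mathrm{R}_+ \cup \{0\} | \mathrm{R}_+ |
| 2     | Gamma                  | \mathrm{R}_+            | \mathrm{R}_+ |
| 3     | Inverse Gaussian       | \mathrm{R}_+            | \mathrm{R}_+ |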
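
The mean-variance relation can be checked empirically for two members of the table. The following is a minimal sketch using only the Python standard library; the helper `tweedie_variance`, the seed and the parameter values are assumptions chosen for illustration, not part of the original notebook.

```python
import random
from statistics import fmean


def tweedie_variance(mu, phi, p):
    """Variance implied by the relation Var[Y] = phi * mu**p."""
    return phi * mu**p


rng = random.Random(42)  # fixed seed for reproducibility
mu, phi, n_samples = 2.0, 0.5, 100_000

samples = {
    # p = 0, Normal: standard deviation sqrt(phi), independent of mu.
    0: [rng.gauss(mu, phi**0.5) for _ in range(n_samples)],
    # p = 2, Gamma: shape 1/phi and scale mu*phi give mean mu, variance phi*mu^2.
    2: [rng.gammavariate(1 / phi, mu * phi) for _ in range(n_samples)],
}

empirical = {}
for p, sample in samples.items():
    mean = fmean(sample)
    var = fmean([(v - mean) ** 2 for v in sample])
    empirical[p] = (mean, var)
    print(
        f"p={p}: mean={mean:.3f}, var={var:.3f}, "
        f"phi*mu^p={tweedie_variance(mu, phi, p):.3f}"
    )
```

Up to sampling noise, the empirical variances should be close to \phi \mu^p, while both sample means stay close to \mu.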

Insurance Pricing Models

In non-life insurance pricing, most claims happen somewhat randomly, typically in both their occurrence and their size. Take the theft of your bike or water damage to your basement due to flooding as an example. Pricing actuaries usually want to predict the expected loss E[Y|X] given some features X of a policy. The set of features could contain the purchasing price of your bike or the proximity of your house to a river.

Instead of directly modelling the expected loss per exposure w (e.g. the duration of the insurance contract), the most common approach is the famous frequency-severity split:

\begin{align*}
\operatorname{E}\left[\frac{Y}{w}\right] = \underbrace{\operatorname{E}\left[\frac{N}{w}\right]}_{\text{frequency}} \cdot
\underbrace{\operatorname{E}\left[\left. \frac{Y}{n}\right| N=n\right]}_{\text{severity}}
\end{align*}

For simplicity, the conditioning on X is suppressed; it would occur in every expectation. The first term \operatorname{E}\left[\frac{N}{w}\right] is the (expected) frequency, i.e. the number of claims per exposure (time). The second term \operatorname{E}\left[\left.\frac{Y}{n}\right| N=n\right] is the (expected) severity, i.e. the average claim size (per claim) given a fixed number of claims. Here, we focus on the frequency part.
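
To make the split concrete, here is a toy calculation of the empirical counterparts of the three quantities. The portfolio numbers are made up for illustration and do not come from any dataset; the variable names are assumptions of this sketch.

```python
import math

# Toy portfolio (made-up numbers): exposure in years, claim counts,
# and total claim amounts per policy.
exposure = [1.0, 0.5, 1.0, 0.25, 1.0]
claim_counts = [0, 1, 2, 0, 1]
claim_amounts = [0.0, 1200.0, 3100.0, 0.0, 700.0]

total_w = sum(exposure)
total_n = sum(claim_counts)
total_y = sum(claim_amounts)

frequency = total_n / total_w      # claims per unit of exposure
severity = total_y / total_n       # average amount per claim
loss_per_exposure = total_y / total_w

print(f"frequency         = {frequency:.4f}")
print(f"severity          = {severity:.2f}")
print(f"loss per exposure = {loss_per_exposure:.2f}")
print(math.isclose(loss_per_exposure, frequency * severity))
```

On observed totals the product of frequency and severity reproduces the loss per exposure exactly; the formula above states the analogous relation for expectations.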

Convolution and Aggregation Invariance

This property might first seem very theoretical, but it may be one of the most important properties for the estimation of expectations E[Y|X] with GLMs. It is in fact a property valid for the whole exponential dispersion family: The weighted mean of i.i.d. random variables has (almost) the same distribution!

If

\begin{align*}
Y_i &\overset{i.i.d}{\sim} \mathrm{Tw}_p(\mu, \phi/w_i) \,,
\\
w_+ &= \sum_i w_i \quad\text{with } w_i >0 \,,
\end{align*}

then

\begin{align*}
Y &=\sum_i^n \frac{w_i Y_i}{w_+} \sim \mathrm{Tw}_p(\mu, \phi/w_+) \,.
\end{align*}

It is obvious that the mean of Y is again \mu. But it is remarkable that Y has the same distribution with the same power parameter; only the second argument, the dispersion parameter, differs. The dispersion parameter, however, cancels out in GLM estimation. In fact, we will show that two GLMs, one fitted on the original data and one on the aggregated data, give identical results. Another way of saying the same in statistical terms is that (weighted) averages are sufficient statistics for the expectation within the exponential dispersion family.
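
The first two moments of the weighted average can be verified directly. The short derivation below uses only the independence of the Y_i and the mean-variance relation \operatorname{Var}[Y_i] = \frac{\phi}{w_i} \mu^p:

\begin{align*}
\operatorname{E}[Y] &= \sum_i \frac{w_i}{w_+} \operatorname{E}[Y_i] = \frac{\mu}{w_+} \sum_i w_i = \mu \,,
\\
\operatorname{Var}[Y] &= \sum_i \frac{w_i^2}{w_+^2} \operatorname{Var}[Y_i] = \sum_i \frac{w_i^2}{w_+^2} \frac{\phi}{w_i} \mu^p = \frac{\phi}{w_+} \mu^p \,.
\end{align*}

This only confirms the mean and the variance. That Y is again Tweedie distributed with the same power p is the stronger statement and does not follow from these two lines alone.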

This is quite an essential property for data aggregation. It means that one can aggregate rows with identical features and still do an analysis (of the conditional expectation) without loss of information.

The weighted average above can be written a bit more intuitively. For instance, a frequency Y_i=\frac{N_i}{w_i} has weighted average Y=\frac{\sum_i N_i}{\sum_i w_i}.
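
A quick numerical check of this identity, with made-up exposures and claim counts (assumptions for illustration only):

```python
import math

# Three policies sharing identical features (toy numbers).
exposure = [0.5, 1.0, 0.25]
claim_counts = [1, 0, 2]

# Individual frequencies Y_i = N_i / w_i.
frequencies = [n / w for n, w in zip(claim_counts, exposure)]

# Exposure-weighted average of the frequencies ...
weighted_average = sum(w * y for w, y in zip(exposure, frequencies)) / sum(exposure)
# ... equals total claims divided by total exposure.
pooled_frequency = sum(claim_counts) / sum(exposure)

print(frequencies)
print(math.isclose(weighted_average, pooled_frequency))
```

This is why aggregating a frequency dataset amounts to summing claim counts and exposures per group.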

Poisson Distribution

When modelling counts, the Poisson distribution is by far the easiest distribution one can think of. It only has a single parameter, is a member of the Tweedie family, and fulfils the mean-variance relation

\begin{equation*}
\operatorname{E}[N] = \mu = \operatorname{Var}[N] \,.
\end{equation*}

In particular, p=1. While the distribution is, strictly speaking, only for counts, i.e. N takes on non-negative integer values, Poisson regression also works for any non-negative response variable like N/w \in \mathrm{R}_+ \cup \{0\}.
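
One way to see why a non-integer response N/w poses no problem is the Poisson unit deviance, which is well defined for any non-negative y. The sketch below (function name and numbers are assumptions for illustration) also checks that a frequency with weight w contributes the same deviance as the raw count with mean w \mu:

```python
import math


def poisson_unit_deviance(y, mu):
    """Poisson unit deviance 2 * (y * log(y / mu) - y + mu), with 0 * log(0) := 0."""
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (log_term - y + mu)


n_claims, exposure, mu_freq = 2, 0.8, 0.15  # toy values; N / w = 2.5 is not an integer

# Weighted deviance of the frequency N / w around the expected frequency ...
dev_frequency = exposure * poisson_unit_deviance(n_claims / exposure, mu_freq)
# ... equals the deviance of the count N around the expected count w * mu.
dev_count = poisson_unit_deviance(n_claims, exposure * mu_freq)

print(math.isclose(dev_frequency, dev_count))
```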

Frequency Example

For demonstration, we fit a Poisson GLM on the French motor third-party liability claims dataset, cf. the corresponding scikit-learn example and case study 1 of the Swiss Association of Actuaries on the same dataset.

import numpy as np
import pandas as pd
from glum import GeneralizedLinearRegressor

# ... quite some code, abbreviated here: it loads the data into `df`
# and defines the list of feature columns `x_vars`.
y_freq = df["ClaimNb"] / df["Exposure"]
w_freq = df["Exposure"]
X = df[x_vars]
glm_params = {
    "alpha": 0,
    "drop_first": True,
    "gradient_tol": 1e-8,
}
glm_freq = GeneralizedLinearRegressor(
    family="poisson", **glm_params
).fit(X, y_freq, sample_weight=w_freq)
print(
    f"Total predicted number of claims = "
    f"{(w_freq * glm_freq.predict(X)).sum():_.2f}"
)
# Total predicted number of claims = 26_444.00

# Now aggregated
df_agg = df.groupby(x_vars, observed=True).sum().reset_index()
print(
    f"Aggregation reduced number of rows from {df.shape[0]:_} "
    f"to {df_agg.shape[0]:_}."
)
# Aggregation reduced number of rows from 678_013 to 133_413.
y_agg_freq = df_agg["ClaimNb"] / df_agg["Exposure"]
w_agg_freq = df_agg["Exposure"]
X_agg = df_agg[x_vars]
glm_agg_freq = GeneralizedLinearRegressor(
    family="poisson", **glm_params
).fit(X_agg, y_agg_freq, sample_weight=w_agg_freq)
print(
    f"Total predicted number of claims = "
    f"{(w_agg_freq * glm_agg_freq.predict(X_agg)).sum():_.2f}"
)
# Total predicted number of claims = 26_444.00

In fact, both models have the same intercept term and the same coefficients; they are really identical models (up to numerical precision):

print(
    f"intercept freq{'':<18}= {glm_freq.intercept_}\n"
    f"intercept freq aggregated model = {glm_agg_freq.intercept_}"
)
# intercept freq                  = -3.7564376764216747
# intercept freq aggregated model = -3.7564376764216747

np.max(np.abs(glm_freq.coef_ - glm_agg_freq.coef_)) < 1e-13
# True
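
The example above needs the dataset and the glum package. As a self-contained sketch of the same effect, the following fits a two-parameter Poisson GLM with log link on synthetic data, once on the individual rows and once on the aggregated rows. Everything here is an assumption for illustration: the single numeric feature, the simulation parameters, and the hand-written Newton solver `fit_poisson_glm`, which stands in for a real GLM library.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def fit_poisson_glm(x, y, w, max_iter=100, tol=1e-12):
    """Fit log(mu) = b0 + b1 * x by Newton's method on the weighted Poisson likelihood."""
    b0 = math.log(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            mu = math.exp(b0 + b1 * xi)
            resid = wi * (yi - mu)        # score contribution
            g0 += resid
            g1 += resid * xi
            h00 += wi * mu                # Fisher information entries
            h01 += wi * mu * xi
            h11 += wi * mu * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # solve the 2x2 Newton system
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Synthetic policies: one numeric feature x in {0, 1, 2}, exposure w in (0, 1].
rng = random.Random(0)
x = [rng.randrange(3) for _ in range(5_000)]
w = [rng.choice([0.25, 0.5, 0.75, 1.0]) for _ in x]
n = [sample_poisson(rng, wi * math.exp(-2.0 + 0.3 * xi)) for xi, wi in zip(x, w)]
y = [ni / wi for ni, wi in zip(n, w)]

# Aggregate rows with identical x: sum claim counts and exposures.
claims, expo = {}, {}
for xi, ni, wi in zip(x, n, w):
    claims[xi] = claims.get(xi, 0) + ni
    expo[xi] = expo.get(xi, 0.0) + wi
x_agg = sorted(claims)
w_agg = [expo[k] for k in x_agg]
y_agg = [claims[k] / expo[k] for k in x_agg]

coef_full = fit_poisson_glm(x, y, w)
coef_agg = fit_poisson_glm(x_agg, y_agg, w_agg)
print(f"full data:  {coef_full}")
print(f"aggregated: {coef_agg}")
```

Both fits solve the same score equations, so the two coefficient pairs should agree up to numerical precision even though the aggregated fit sees only three rows.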

Outlook

The full notebook can be found here.

Next week, part II of this trilogy will follow. There, we will meet some more of the quite remarkable properties of Tweedie distributions.

Further references:

  • Tweedie, M.C.K. (1984). “An index which distinguishes between some important exponential families”. Statistics: Applications and New Directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference, Indian Statistical Institute, Calcutta, pp. 579–604.
  • Bar-Lev, S.K., Enis, P. (1985). “Reproducibility in the one-parameter exponential family”. Metrika 32, 391–394. https://doi.org/10.1007/BF01897827
  • Bar-Lev, S.K., Enis, P. (1986). “Reproducibility and Natural Exponential Families with Power Variance Functions”. Ann. Statist. 14 (4), 1507–1522. https://doi.org/10.1214/aos/1176350173
  1. Many thanks to Prof. Mario Wüthrich for pointing out the references of Bar-Lev and Enis.
