Tag: Tweedie Trilogy

  • A Tweedie Trilogy — Part III: From Wright’s Generalized Bessel Function to Tweedie’s Compound Poisson Distribution

    TLDR: The scipy 1.7.0 release introduced Wright’s generalized Bessel function to the Python ecosystem. It is an important ingredient for the density and log-likelihood of Tweedie probability distributions. In this last part of the trilogy, I’d like to point out why it was important to have this function and share the endeavor of implementing this inconspicuous but highly intractable special function. The fun part is exploiting a free parameter in an integral representation, which can be optimized by curve fitting to the minimal arc length.

    This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

    See part I and part II.

    Tweedie Distributions

    As pointed out in part I and part II, the family of Tweedie distributions is a very special one with outstanding properties. They are central for estimating expectations with GLMs. The probability distributions have mainly positive (non-negative) support and are skewed, e.g. Poisson, Gamma, Inverse Gaussian and compound Poisson-Gamma.

    As members of the exponential dispersion family, a slight extension of the exponential family, the probability density can be written as

    \begin{align*}
    f(y; \theta, \phi) &= c(y, \phi) \exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right)
    \\
    \kappa(\theta) &= \kappa_p(\theta) = \frac{1}{2-p}((1-p)\theta)^{\frac{2-p}{1-p}}
    \end{align*}

    It is often more instructive to parametrise the distribution with p, \mu and \phi, using

    \begin{align*}
    \theta &= \begin{cases}
    \frac{\mu^{1-p}}{1-p}\,,\quad p\neq 1\\
    \log(\mu)\,,\quad p=1
    \end{cases}
    \\
    \kappa(\theta) &= \begin{cases}
    \frac{\mu^{2-p}}{2-p}\,,\quad p\neq 2\\
    \log(\mu)\,,\quad p=2
    \end{cases}
    \end{align*}

    and write

    \begin{align*}
    Y &\sim \mathrm{Tw}_p(\mu, \phi)
    \end{align*}
    Probability density of several Tweedie distributions.

    Compound Poisson Gamma

    A very special domain for the power parameter lies between Poisson and Gamma: 1<p<2. This range results in the compound Poisson-Gamma distribution, which is suitable if you have a random count process and each count itself carries a random amount. A well-known example is insurance claims: typically, there is a random number of claims, and each claim comes with a random claim cost.

    \begin{align*}
    N &\sim \mathrm{Poisson}(\lambda)\\
    X_i &\sim \mathrm{Gamma}(a, b)\\
    Y &= \sum_{i=1}^N X_i \sim \mathrm{CompPois}(\lambda, a, b)
    \end{align*}

    For Poisson count we have \operatorname{E}[N]=\lambda and \operatorname{Var}[N]=\lambda=\operatorname{E}[N], for the Gamma amount \operatorname{E}[X]=\frac{a}{b} and \operatorname{Var}[X]=\frac{a}{b^2}=\frac{1}{a}\operatorname{E}[X]^2. For the compound Poisson-Gamma variable, we obtain

    \begin{align*}
    \operatorname{E}[Y] &= \operatorname{E}[N] \operatorname{E}[X] = \lambda\frac{a}{b}=\mu\\
    \operatorname{Var}[Y] &=  \operatorname{Var}[N] \operatorname{E}[X]^2 +  \operatorname{E}[N] \operatorname{Var}[X] =  \phi \mu^p\\
    p &= \frac{a + 2}{a+1} \in (1, 2)\\
    \phi &= \frac{(\lambda a)^{1-p}}{(p-1)b^{2-p}}
    \end{align*}

    What’s so special here is that there is a point mass at zero, i.e., P(Y=0)=\exp(-\frac{\mu^{2-p}}{\phi(2-p)}) > 0. Hence, it is a suitable distribution for non-negative quantities with some exact zeros.

    Probability density for compound Poisson Gamma, point masses at zero are marked as points.
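
    To make the parameter mapping tangible, here is a small sketch (the helper name comp_pois_gamma_to_tweedie is mine, not from any library) that converts (\lambda, a, b) into (\mu, p, \phi) with the formulas above and checks the point mass at zero by simulation:

    import numpy as np

    def comp_pois_gamma_to_tweedie(lam, a, b):
        """Map compound Poisson-Gamma parameters (lam, a, b) to Tweedie (mu, p, phi)."""
        mu = lam * a / b
        p = (a + 2) / (a + 1)
        phi = (lam * a) ** (1 - p) / ((p - 1) * b ** (2 - p))
        return mu, p, phi

    lam, a, b = 1.0, 2.0, 0.5
    mu, p, phi = comp_pois_gamma_to_tweedie(lam, a, b)

    # Point mass at zero: P(Y=0) = exp(-mu^(2-p) / (phi * (2-p))) = P(N=0) = exp(-lam)
    print(np.exp(-mu ** (2 - p) / (phi * (2 - p))), np.exp(-lam))

    # Simulation check of the point mass, mean and variance
    rng = np.random.default_rng(0)
    N = rng.poisson(lam, size=100_000)
    Y = np.zeros(N.shape)
    Y[N > 0] = rng.gamma(shape=N[N > 0] * a, scale=1 / b)  # sum of N Gamma(a, b) draws
    print((Y == 0).mean(), Y.mean(), Y.var())  # ~ exp(-lam), mu, phi * mu**p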

    Code

    The rest of this post is about how to compute the density for this parameter range. The easy part is \exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right) which can be directly implemented. The real obstacle is the term c(y, \phi) which is given by

    \begin{align*}
    c(y, \phi) &= \frac{\Phi(-\alpha, 0, t)}{y}
    \\
    \alpha &= \frac{2 - p}{1 - p}
    \\
    t &= \frac{\left(\frac{(p - 1)\phi}{y}\right)^{\alpha}}{(2-p)\phi}
    \end{align*}

    This depends on Wright’s (generalized Bessel) function \Phi(a, b, z) as introduced in a 1933 paper by E. Wright.

    Wright’s Generalized Bessel Function

    According to DLMF 10.46, the function is defined as

    \begin{equation*}
    \Phi(a, b, z) = \sum_{k=0}^{\infty} \frac{z^k}{k!\Gamma(ak+b)}, \quad a > -1, b \in R, z \in C
    \end{equation*}

    The series converges for every z; \Phi is an entire function. We will focus on the positive real axis z=x\geq 0 and the range a\geq 0, b\geq 0 (note that a=-\alpha \in (0,\infty) for 1<p<2). For the compound Poisson-Gamma, we even have b=0.
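
    Once \Phi is available, for instance via scipy.special.wright_bessel (added in scipy 1.7.0), the density for 1<p<2 follows directly from the formulas above. A minimal sketch (the function name tweedie_pdf is mine, not a library function):

    import numpy as np
    from scipy.special import wright_bessel

    def tweedie_pdf(y, mu, phi, p):
        """Density of Tw_p(mu, phi) at y > 0 for 1 < p < 2 (sketch)."""
        theta = mu ** (1 - p) / (1 - p)
        kappa = mu ** (2 - p) / (2 - p)
        alpha = (2 - p) / (1 - p)                      # negative for 1 < p < 2
        t = ((p - 1) * phi / y) ** alpha / ((2 - p) * phi)
        c = wright_bessel(-alpha, 0.0, t) / y          # c(y, phi) from above
        return c * np.exp((y * theta - kappa) / phi)

    # Together with the point mass P(Y=0), the density should integrate to one.
    mu, phi, p = 4.0, 3.78, 4 / 3
    y = np.linspace(1e-6, 60, 200_000)
    total = tweedie_pdf(y, mu, phi, p).sum() * (y[1] - y[0])
    print(total + np.exp(-mu ** (2 - p) / (phi * (2 - p))))  # ~ 1.0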

    Implementing such a function, as done in scipy.special.wright_bessel, poses tremendous challenges even for this restricted parameter range. The first one is that it has three parameters, which is quite a lot. Then the series representation above, for instance, can always be used, but depending on the parameters, it requires a huge number of terms, particularly for large x. As each term involves the Gamma function, this becomes expensive very fast. One ends up using different representations and strategies for different parameter regions:

    • Small x: Taylor series according to definition
    • Small a: Taylor series in a around a=0
    • Large x: Asymptotic series due to Wright (1935)
    • Large a: Taylor series according to definition for a few terms around the approximate maximum term k_{max} due to Dunn & Smyth (2005)
    • General: Integral representation due to Luchko (2008)

    Dunn & Smyth investigated several evaluation strategies for the simpler Tweedie density, which amounts to Wright’s function with b=0, see Dunn & Smyth (2005). Luchko (2008) lists most of the above strategies for the full Wright’s function.

    Note that Dunn & Smyth (2008) provide another strategy to evaluate the Tweedie distribution function by means of the inverse Fourier transform. This does not involve Wright’s function, but also encounters complicated numerical integration of oscillatory functions.

    The Integral Representation

    This brings us deep into complex analysis: We start with Hankel’s contour integral representation of the reciprocal Gamma function.

    \begin{equation*}
    \frac{1}{\Gamma(z)} = \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-z} e^\zeta \; d\zeta
    \end{equation*}

    with the Hankel path Ha^- from negative infinity (A) just below the real axis, counter-clockwise with radius \epsilon>0 around the origin and just above the real axis back to minus infinity (D).

    Hankel contour Ha in the complex plane.

    In principle, one is free to choose any such path with the same start (A) and end point (D) as long as it does not cross the negative real axis. One usually lets AB and CD lie infinitesimally close to the negative real line. Very importantly, the radius \epsilon>0 is a free parameter! That is real magic🪄

    By interchanging sum and integral and using the series of the exponential, Wright’s function becomes

    \begin{align*}
    \Phi(a, b, z) &= \sum_{k=0}^{\infty} \frac{z^k}{k!} \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-(ak+b)} e^\zeta \; d\zeta
    \\
    &= \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-b} e^{\zeta + z\zeta^{-a}} \; d\zeta
    \end{align*}

    Now, one needs to do the tedious work and split the integral into the 3 path sections AB, BC, CD. Putting AB and CD together gives an integral over K, the circle BC gives an integral over P:

    \begin{align*}
    \Phi(a, b, x) &= \frac{1}{\pi} \int_{\epsilon}^\infty K(a, b, x, r) \; dr
    \\
     &+ \frac{\epsilon^{1-b}}{\pi} \int_0^\pi P(\epsilon, a, b, x, \varphi) \; d\varphi
    \\
    K(a, b, x, r) &= r^{-b}\exp(-r + x  r^{-a} \cos(\pi a)) 
    \\
    &\quad \sin(x \cdot r^{-a} \sin(\pi a) + \pi b)
    \\
    P(\epsilon, a, b, x, \varphi) &= \exp(\epsilon \cos(\varphi) + x  \epsilon^{-a}\cos(a \varphi))
    \\
    &\quad \cos(\epsilon \sin(\varphi) - x \cdot \epsilon^{-a} \sin(a \varphi) + (1-b) \varphi)
    \end{align*}

    What remains is to carry out the numerical integration, also known as quadrature. While this is an interesting topic in its own right, the sketch below shows that off-the-shelf quadrature already does the job; after that, let’s move on to the magic part.
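
    As a quick sanity check (this is not the scipy implementation), both integrals can be evaluated with an off-the-shelf routine such as scipy.integrate.quad; the helper name phi_via_contour is mine:

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import wright_bessel

    def phi_via_contour(a, b, x, eps=1.0):
        """Wright's function via the split Hankel contour; eps is the free radius."""
        def K(r):
            return (r ** -b * np.exp(-r + x * r ** -a * np.cos(np.pi * a))
                    * np.sin(x * r ** -a * np.sin(np.pi * a) + np.pi * b))

        def P(phi):
            return (np.exp(eps * np.cos(phi) + x * eps ** -a * np.cos(a * phi))
                    * np.cos(eps * np.sin(phi) - x * eps ** -a * np.sin(a * phi)
                             + (1 - b) * phi))

        int_K, _ = quad(K, eps, np.inf)
        int_P, _ = quad(P, 0, np.pi)
        return int_K / np.pi + eps ** (1 - b) * int_P / np.pi

    # The result should not depend on eps (up to numerical accuracy) and should
    # agree with scipy's implementation.
    print(phi_via_contour(0.5, 0.0, 1.0, eps=1.0), wright_bessel(0.5, 0.0, 1.0))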

    Arc Length Minimization

    If you have made it this far and think, wow, puh, uff, crazy 🤯😱, just hang on a little longer, because here comes the real fun part🤞

    It turns out that most of the time, the integral over P is the more difficult one. The worst behaviour an integrand can have is being wildly oscillatory. Here is one of my favorite examples:

    Integrands for a=5, b=1, x=100 and two choices of epsilon.

    With the naive choice of \epsilon=1, both integrands (blue) are—well—crazy. There is basically no chance the most sophisticated quadrature rule will work. And then look at the other choice of \epsilon\approx 4. Both curves seem well behaved (for P, we would need a closer look).

    So the idea is to find a good choice of \epsilon to make P well behaved. Well behaved here means most boring, if possible a straight line. What makes a straight line unique? In flat space, it is the shortest path between two points. Therefore, well behaved integrands have minimal arc length. That is what we want to minimize.

    The arc length S from x=a to x=b of a 1-dimensional function f is given by

    \begin{equation*}
    S = \int_a^b \sqrt{1 + f^\prime(x)^2} \; dx
    \end{equation*}

    Instead of f=P, we only take the oscillatory part of P and approximate the arc length with f(\varphi) = \epsilon \sin(\varphi) - x \epsilon^{-a} \sin(a \varphi) + (1-b) \varphi. For a single parameter point a, b, x this looks like

    Arc length and integrand P for different epsilon, given a=0.1, b=5, x=100.

    Note the logarithmic y-scale for the right plot of P. The optimal \epsilon=10 is plotted in red and behaves clearly better than smaller values of \epsilon.
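
    A minimal sketch of this idea (the helper optimal_eps is mine; scipy does not re-optimize per evaluation point but uses a fitted formula, see the steps below):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def optimal_eps(a, b, x, n=1_000):
        """Radius eps minimizing the arc length of the oscillatory part
        f(phi) = eps*sin(phi) - x*eps**(-a)*sin(a*phi) + (1 - b)*phi (sketch)."""
        phi = np.linspace(0.0, np.pi, n)

        def arc_length(eps):
            # f'(phi) of the oscillatory part; arc length = integral of sqrt(1 + f'^2)
            df = eps * np.cos(phi) - x * a * eps ** -a * np.cos(a * phi) + (1.0 - b)
            return np.sqrt(1.0 + df ** 2).mean() * np.pi

        return minimize_scalar(arc_length, bounds=(1e-3, 100.0), method="bounded").x

    # Compare with the figure above: optimal_eps(0.1, 5, 100) should land in the
    # vicinity of the highlighted eps = 10.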

    What remains to be done for an actual implementation is

    • Calculate minimal \epsilon for a large grid of values a, b, x.
    • Choose a function with some parameters.
    • Curve fitting (so again optimisation): Fit this function to the minimal \epsilon values on the grid by minimising least squares, as sketched after this list.
    • Implement some quadrature rules and use this choice of \epsilon in the hope that it intra- and extrapolates well.
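
    A sketch of these steps, reusing the optimal_eps helper from above on a (tiny) grid and fitting a purely illustrative parametric form in log space (scipy’s actual fitted formula is different and more refined):

    import itertools
    import numpy as np
    from scipy.optimize import curve_fit

    # Step 1: minimal-arc-length eps on a small parameter grid
    # (the real grid would be much larger).
    grid = list(itertools.product([0.1, 0.5, 1.0, 2.0, 5.0],   # a
                                  [0.0, 1.0, 5.0, 10.0],       # b
                                  [1.0, 10.0, 100.0, 500.0]))  # x
    A, B, X = map(np.array, zip(*grid))
    eps_opt = np.array([optimal_eps(a, b, x) for a, b, x in grid])

    # Step 2: choose a parametric form (hypothetical, for illustration only).
    def log_eps_model(abx, c0, c1, c2, c3):
        a, b, x = abx
        return c0 + c1 * np.log1p(a) + c2 * np.log1p(b) + c3 * np.log(x)

    # Step 3: least-squares fit; exp(log_eps_model(...)) then predicts eps
    # for arbitrary (a, b, x).
    popt, _ = curve_fit(log_eps_model, (A, B, X), np.log(eps_opt))
    print(popt)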

    This strategy turns out to work well in practice and is implemented in scipy. As the parameter space of 3 variables is huge, the integral representation breaks down in certain areas, e.g. huge values of \epsilon where the integrands just overflow numerically (in 64-bit floating point precision). But we have our other evaluation strategies for that.

    Conclusion

    An extensive notebook for Wright’s function, with all implementation strategies, can be found here.

    After an adventurous journey, we arrived at one implementation strategy of Wright’s generalised Bessel function, namely the integral representation. The path went deep into complex analysis and contour integration, then further to the arc length of a function and finally curve fitting via optimisation. I am really astonished how connected all those different areas of mathematics can be.

    Wright’s function is the missing piece to compute full likelihoods and probability functions of the Tweedie distribution family and is now available in the Python ecosystem via scipy.

    We are at the very end of this Tweedie trilogy. I hope it has been entertaining and it has become clear why Tweedie deserves to be celebrated.


  • A Tweedie Trilogy — Part II: Offsets

    TLDR: This second part of the trilogy will have a deeper look at offsets and sample weights of a GLM. Their non-equivalence stems from the mean-variance relationship. This time, we not only have a Poisson frequency but also a Gamma severity model.

    This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

    See part I.

    From Mean-Variance Relation to Score Equations

    In part I, we already introduced the mean-variance relation of a Tweedie random variable Y\sim Tw_p(\mu, \phi) with Tweedie power p, mean \mu and dispersion parameter \phi:

    \begin{align*}
    \operatorname{E}[Y] &= \mu
    \\
    \operatorname{Var}[Y] &= \phi \mu^p = \phi v(\mu)
    \end{align*}

    with variance function v(\mu).

    This variance function directly impacts the estimation of GLMs. Assume the task is to estimate the expectation of a random variable Y_i\sim Tw_p(\mu_i, \phi/w_i), given observations of the target y_i and of explanatory variables, aka features, x_i\in R^k. A GLM then assumes a link function g(\mu_i) = \sum_{j=1}^k x_{ij}\beta_j with coefficients \beta to be estimated via an optimization procedure, of which the first order condition, also called score equation, reads

    \begin{equation*}
    \sum_i w_i \frac{y_i - \mu_i}{v(\mu_i)g'(\mu_i)} x_{ij}= 0 \quad \forall j =1, \ldots, k
    \end{equation*}

    This shows that the higher the Tweedie power p, entering via v(\mu) only, the less weight is given to deviations of large values. In other words, higher Tweedie powers result in GLMs that are less and less sensitive to what happens at large (expected) values.

    This is also reflected in the deviance loss functions. They can be derived from the negative log-likelihood and are given by

    \begin{equation*}
    d_p(y, \mu) =
    	2 \cdot
    	\begin{cases}
    		\frac{\max(0, y^{2-p})}{(1-p)(2-p)}-\frac{y\mu^{1-p}}{1-p}+\frac{\mu^{2-p}}{2-p} & p \in \mathrm{R}\setminus \left((0,1] \cup \{2\}\right) \\
    		y\log\frac{y}{\mu} - y + \mu & p=1 \\
    		\frac{y}{\mu} - \log\frac{y}{\mu} - 1 & p=2
    	\end{cases}
    \end{equation*}

    These are the only strictly consistent scoring functions for the expectation (up to one multiplicative and one additive constant) that are homogeneous functions (of degree 2-p), see, e.g., Fissler et al (2022). The Poisson deviance (p=1), for example, has a degree of homogeneity of 1 and the same unit as the target variable. The Gamma deviance (p=2), on the other hand, is zero-homogeneous and completely agnostic to the scale of its arguments. This is another way of stating the above: the higher the Tweedie power, the less the deviance cares about large values.
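
    A quick numerical check of the degree 2-p homogeneity, using scikit-learn’s mean_tweedie_deviance as one readily available implementation of the deviance above:

    import numpy as np
    from sklearn.metrics import mean_tweedie_deviance

    y = np.array([1.0, 3.0, 10.0])
    mu = np.array([2.0, 2.5, 8.0])
    c = 5.0
    for p in [0, 1, 1.5, 2, 3]:
        d = mean_tweedie_deviance(y, mu, power=p)
        d_scaled = mean_tweedie_deviance(c * y, c * mu, power=p)
        # the ratio equals c**(2 - p), i.e. degree of homogeneity 2 - p
        print(f"p={p}: ratio={d_scaled / d:.4f}, c^(2-p)={c ** (2 - p):.4f}")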

    It is also connected to the fact that Tweedie distributions are the only distributions from the exponential dispersion family that are closed under scale transformations:

    \begin{align*}
    Y &\sim Tw_p(\mu, \phi) \\
    cY &\sim Tw_p(c\mu, c^{2-p}\phi) \quad \forall c>0
    \end{align*}

    Offsets and Sample Weights

    Poisson GLM

    When estimating counts with a Poisson GLM, there is often an exposure measure like time under consideration or underlying number of things (insurance policies, trees in a forest, radioactive atoms). One then often finds two different, but equivalent formulations of a Poisson GLM with log-link.

    • Sample weights: Model frequency y=\frac{N}{w} and fit with sample weights w to estimate \operatorname{E}[y] = \mu_y = \exp(x \beta).
    • Offsets: Model counts N, but account for the exposure w via an offset as \operatorname{E}[N]=\mu_N = \exp(x \beta + \log(w)) = w \mu_y.

    Note that each way models a different target, so we had to use subscripts to distinguish the mean parameters \mu.

    In this special case of a Poisson GLM with (canonical) log-link, both models are equivalent and will result in the exact same parameters \beta. You can plug it into the score equation to convince yourself.
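
    Indeed, plugging both formulations into the score equation above, with \mu_i = \exp(x_i \beta), v(\mu)=\mu and g'(\mu)=1/\mu, yields the same estimating equation:

    \begin{align*}
    \text{weights:}\quad &\sum_i w_i \frac{\frac{N_i}{w_i} - \mu_i}{\mu_i \cdot \frac{1}{\mu_i}} x_{ij} = \sum_i \left(N_i - w_i \mu_i\right) x_{ij} = 0
    \\
    \text{offset:}\quad &\sum_i \frac{N_i - w_i \mu_i}{w_i \mu_i \cdot \frac{1}{w_i \mu_i}} x_{ij} = \sum_i \left(N_i - w_i \mu_i\right) x_{ij} = 0
    \end{align*}

    Both lead to the very same \beta.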

    Tweedie GLM

    Very importantly, this simple equivalence of GLM formulations with offsets and with sample weights only holds for the Poisson GLM with log-link. It does not hold for any other Tweedie power or for other distributions from the exponential dispersion family.

    One can show that a Tweedie GLM with log-link and offset (additive in link space) \log(u) on target y with weights w is equivalent to the same Tweedie GLM but with target \frac{y}{u} and weights w u^{2-p}.

    So one can construct an equivalence between unweighted with offsets and weighted without offsets by setting u = \sqrt[2-p]{w}. But note that this does not work for a Gamma GLM, which has p=2.
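
    This claim can again be read off the score equation: with log-link, v(\mu)=\mu^p and m_i = \exp(x_i \beta), the offset model has mean u_i m_i and its i-th score contribution is

    \begin{align*}
    w_i \frac{y_i - u_i m_i}{(u_i m_i)^{p-1}} x_{ij}
    = w_i u_i^{2-p} \, \frac{\frac{y_i}{u_i} - m_i}{m_i^{p-1}} x_{ij} \,,
    \end{align*}

    which is exactly the contribution of a model with target \frac{y_i}{u_i}, weight w_i u_i^{2-p} and no offset. For p=2, the factor u_i^{2-p}=1, so an offset only changes the target and can never be traded against sample weights in a Gamma GLM.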

    Example

    We continue with the same dataset and model as in part I and show this (non-) equivalence with the offsets.

    from glum import GeneralizedLinearRegressor
    import numpy as np
    import pandas as pd
    
    # ... quite some code ... here we abbreviate.
    # Model frequency with weights (but without offsets)
    y_freq = df["ClaimNb"] / df["Exposure"]
    w_freq = df["Exposure"]
    X = df[x_vars]
    glm_params = {
        "alpha": 0,
        "drop_first": True,
        "gradient_tol": 1e-8,
    }
    glm_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X, y_freq, sample_weight=w_freq)
    
    # Model counts N = w * freq with offsets (but without weights)
    N = w_freq * y_freq
    glm_offset_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X, N, offset=np.log(w_freq))
    
    print(
        f"intercept freq{'':<8}= {glm_freq.intercept_}\n"
        f"intercept freq offset = {glm_offset_freq.intercept_}"
    )
    # intercept freq        = -3.756437676421677
    # intercept freq offset = -3.7564376764216725
    
    np.max(np.abs(glm_freq.coef_ - glm_offset_freq.coef_)) < 1e-13
    # True

    Next, we model the severity Y = \frac{loss}{N} with claim counts N as weights. As is standard, we use a Gamma GLM with log-link (which is not the canonical link this time).

    # Model severity with weights (but without offsets)
    y_sev = (df["ClaimAmount"] / df["ClaimNb"])
    w_sev = df["ClaimNb"].fillna(0)
    X = df[x_vars]
    # Filter out zero count (w_sev==0) rows
    w_gt_0 = w_sev > 0
    y_sev = y_sev[w_gt_0]
    X_sev = X[w_gt_0]
    w_sev = w_sev[w_gt_0]
    
    glm_sev = GeneralizedLinearRegressor(
        family="gamma", **glm_params
    ).fit(X_sev, y_sev, sample_weight=w_sev)
    
    # Note that the target is claim amount = w * sev.
    claim_amount = w_sev * y_sev
    glm_offset_sev = GeneralizedLinearRegressor(
        family="gamma", **glm_params
    ).fit(X_sev, claim_amount, offset=np.log(w_sev))
    
    print(
        f"intercept sev{'':<8}= {glm_sev.intercept_}\n"
        f"intercept sev offset = {glm_offset_sev.intercept_}"
    )
    # intercept sev        = 7.287909799461992
    # intercept sev offset = 7.236827150674156
    
    np.max(np.abs(glm_sev.coef_ - glm_offset_sev.coef_))
    # 0.2119162919285421

    The deviations might seem small, but they are there and add up:

    print(
        "Total predicted claim amounts with weights "
        f"{np.sum(w_sev * glm_sev.predict(X_sev)):_.2f}"
    )
    print(
        "Total predicted claim amounts offset       "
        f"{np.sum(glm_offset_sev.predict(X_sev, offset=np.log(w_sev))):_.2f}"
    )
    # Total predicted claim amounts with weights 49_309_687.30
    # Total predicted claim amounts offset       48_769_342.47

    Here, it becomes evident that the two models are quite different.

    Outlook

    The full notebook can be found here.

    The final part III of the Tweedie trilogy will follow in one week and go into details of the probability density function.

  • A Tweedie Trilogy — Part I: Frequency and Aggregation Invariance

    TLDR: In this first part of the Tweedie Trilogy, we will take a look at what happens to a GLM if we aggregate the data by a group-by operation. A frequency model for insurance pricing will serve as an example.

    This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

    Intro

    Tweedie distributions and Generalised Linear Models (GLM) have an intertwined relationship. While GLMs are, in my view, one of the best reference models for estimating expectations, Tweedie distributions lie at the heart of expectation estimation. In fact, basically all applied GLMs in practice use Tweedie distributions with three notable exceptions: the binomial, the multinomial and the negative binomial distribution.

    Mean-Variance Relation

    “An index which distinguishes between some important exponential families” is the title of Maurice Charles Kenneth Tweedie’s original 1984 publication. Note, however, that Shaul K. Bar-Lev and Peter Enis published around the same time; as their 1986 paper was received in November 1983, the distribution could also be named the Bar-Lev & Enis distribution.1 This index is meanwhile called the Tweedie power parameter p. Recall that distributions of the exponential dispersion family always fulfil a mean-variance relationship. It’s even a way to define them. For the Tweedie distribution, denoted Tw_p(\mu, \phi), the relation reads

    \begin{align*}
    \operatorname{E}[Y] &= \mu
    \\
    \operatorname{Var}[Y] &= \phi \mu^p
    \end{align*}

    with dispersion parameter \phi. Some very common members are given in the following table.

    p     | distribution           | domain Y                 | domain \mu
    ------|------------------------|--------------------------|-------------
    0     | Normal / Gaussian      | \mathrm{R}               | \mathrm{R}
    1     | Poisson                | 0, 1, 2, \ldots          | \mathrm{R}_+
    (1,2) | Compound Poisson-Gamma | \mathrm{R}_+ \cup \{0\}  | \mathrm{R}_+
    2     | Gamma                  | \mathrm{R}_+             | \mathrm{R}_+
    3     | Inverse Gaussian       | \mathrm{R}_+             | \mathrm{R}_+

    Insurance Pricing Models

    In non-life insurance pricing, most claims happen somewhat randomly, typically the occurrence as well as the size. Take the theft of your bike or water damage in your basement due to flooding as examples. Pricing actuaries usually want to predict the expected loss E[Y|X] given some features X of a policy. The set of features could contain the purchasing price of your bike or the proximity of your house to a river.

    Instead of directly modelling the expected loss per exposure w, e.g. the time duration of the insurance contract, the most widely used approach is the famous frequency-severity split:

    \begin{align*}
    \operatorname{E}\left[\frac{Y}{w}\right] = \underbrace{\operatorname{E}\left[\frac{N}{w}\right]}_{frequency} \cdot
    \underbrace{\operatorname{E}\left[\left. \frac{Y}{n}\right| N=n\right]}_{severity}
    \end{align*}

    For simplicity, the conditioning on X is suppressed; it would occur in every expectation. The first part \operatorname{E}\left[\frac{N}{w}\right] is the (expected) frequency, i.e. the number of claims per exposure (time). The second term \operatorname{E}\left[\left.\frac{Y}{N}\right| N\right] is the (expected) severity, i.e. the average claim size (per claim) given a fixed number of claims. Here, we focus on the frequency part.

    Convolution and Aggregation Invariance

    This property might first seem very theoretical, but it may be one of the most important properties for the estimation of expectations E[Y|X] with GLMs. It is in fact a property valid for the whole exponential dispersion family: The weighted mean of i.i.d. random variables has (almost) the same distribution!

    If

    \begin{align*}
    Y_i &\overset{i.i.d}{\sim} \mathrm{Tw}_p(\mu, \phi/w_i) \,,
    \\
    w_+ &= \sum_i w_i \quad\text{with } w_i >0 \,,
    \end{align*}

    then

    \begin{align*}
    Y &=\sum_{i=1}^n \frac{w_i Y_i}{w_+} \sim \mathrm{Tw}_p(\mu, \phi/w_+) \,.
    \end{align*}

    It is obvious that the mean of Y is again \mu. But it is remarkable that Y has the same distribution with the same power parameter; only the second argument, the dispersion parameter, differs. The dispersion parameter, however, cancels out in GLM estimation. In fact, we will show that two GLMs, one on the original data and one on aggregated data, give identical results. Another way of saying the same in statistical terms is that (weighted) averages are the sufficient statistic for the expectation within the exponential dispersion family.

    This is quite an essential property for data aggregation. It means that one can aggregate rows with identical features and still do an analysis (of the conditional expectation) without loss of information.

    The weighted average above can be written a bit more intuitively. For instance, a frequency Y_i=\frac{N_i}{w_i} has weighted average Y=\frac{\sum_i N_i}{\sum_i w_i}.
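
    A quick simulation check of this property for the simplest member, the Poisson case (p=1, see the next section), with Y_i = N_i/w_i and N_i \sim \mathrm{Poisson}(\mu w_i):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = 0.1                            # expected frequency
    w = rng.uniform(0.1, 1.0, size=50)  # exposures w_i
    w_plus = w.sum()

    # Many replications of the weighted mean Y = sum_i w_i Y_i / w_plus
    # with Y_i = N_i / w_i and N_i ~ Poisson(mu * w_i), i.e. Y_i ~ Tw_1(mu, 1/w_i).
    reps = 200_000
    N = rng.poisson(mu * w, size=(reps, w.size))
    Y = N.sum(axis=1) / w_plus

    # Tw_1(mu, 1/w_plus) means: mean mu and variance mu / w_plus.
    print(Y.mean(), mu)
    print(Y.var(), mu / w_plus)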

    Poisson Distribution

    When modelling counts, the Poisson distribution is by far the easiest distribution one can think of. It only has a single parameter, is a member of the Tweedie family, and fulfils the mean-variance relation

    \begin{equation*}
    \operatorname{E}[N] = \mu = \operatorname{Var}[N] \,.\end{equation*}

    In particular, p=1. While the distribution is strictly speaking only for counts, i.e. N takes on non-negative integer values, Poisson regression also works for any non-negative response variable like N/w \in \mathrm{R}.

    Frequency Example

    For demonstration, we fit a Poisson GLM on the French motor third-party liability claims dataset, cf. the corresponding scikit-learn example and case study 1 of the Swiss Association of Actuaries on the same dataset.

    from glum import GeneralizedLinearRegressor
    import numpy as np
    import pandas as pd
    
    # ... quite some code ... here we abbreviate.
    y_freq = df["ClaimNb"] / df["Exposure"]
    w_freq = df["Exposure"]
    X = df[x_vars]
    glm_params = {
        "alpha": 0,
        "drop_first": True,
        "gradient_tol": 1e-8,
    }
    glm_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X, y_freq, sample_weight=w_freq)
    print(
      f"Total predicted number of claims = "
      f"{(w_freq * glm_freq.predict(X)).sum():_.2f}"
    )
    # Total predicted number of claims = 26_444.00
    
    # Now aggregated
    df_agg = df.groupby(x_vars, observed=True).sum().reset_index()
    print(
        f"Aggregation reduced number of rows from {df.shape[0]:_}"
        f"to {df_agg.shape[0]:_}."
    )
    # Aggregation reduced number of rows from 678_013 to 133_413.
    y_agg_freq = df_agg["ClaimNb"] / df_agg["Exposure"]
    w_agg_freq = df_agg["Exposure"]
    X_agg = df_agg[x_vars]
    glm_agg_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X_agg, y_agg_freq, sample_weight=w_agg_freq)
    print(
        f"Total predicted number of claims = "
        f"{(w_agg_freq * glm_agg_freq.predict(X_agg)).sum():_.2f}"
    )
    # Total predicted number of claims = 26_444.00

    In fact, both models have the same intercept term and the same coefficients; they are really identical models (up to numerical precision):

    print(
        f"intercept freq{'':<18}= {glm_freq.intercept_}\n"
        f"intercept freq aggregated model = {glm_agg_freq.intercept_}"
    )
    # intercept freq                  = -3.7564376764216747
    # intercept freq aggregated model = -3.7564376764216747
    
    np.max(np.abs(glm_freq.coef_ - glm_agg_freq.coef_)) < 1e-13
    # True

    Outlook

    The full notebook can be found here.

    In the next week, part II of this trilogy will follow. There, we will meet some more of their quite remarkable properties.

    Further references:

    • Tweedie M.C.K. 1984. “An index which distinguishes between some important exponential families”. Statistics: Applications and New Directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference, Indian Statistical Institute, Calcutta, pp. 579–604.
    • Bar-Lev, S.K., Enis, P. Reproducibility in the one-parameter exponential family. Metrika 32, 391–394 (1985). https://doi.org/10.1007/BF01897827
    • Shaul K. Bar-Lev. Peter Enis. “Reproducibility and Natural Exponential Families with Power Variance Functions.” Ann. Statist. 14 (4) 1507 – 1522, December, 1986. https://doi.org/10.1214/aos/1176350173
    1. A great thanks to Prof. Mario Wüthrich for pointing out the references of Bar-Lev and Enis. ↩︎