Building Strong GLMs in Python via ML + XAI

In our latest post, we explained how to use ML + XAI to build strong generalized linear models with R. Let’s do the same with Python.

Insurance pricing data

We will use again a synthetic dataset with 1 Mio insurance policies, with reference:

Mayer, M., Meier, D. and Wuthrich, M.V. (2023),
SHAP for Actuaries: Explain any Model.
https://doi.org/10.2139/ssrn.4389797

Let’s start by loading and describing the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from sklearn.datasets import fetch_openml
from sklearn.inspection import PartialDependenceDisplay
from sklearn.metrics import mean_poisson_deviance
from sklearn.dummy import DummyRegressor
from lightgbm import LGBMRegressor
# We need preview version of glum that adds formulaic API
# !pip install git+https://github.com/Quantco/glum@glum-v3#egg=glum
from glum import GeneralizedLinearRegressor

# Load data

df = fetch_openml(data_id=45106, parser="pandas").frame
df.head()

# Continuous features
df.hist(["driver_age", "car_weight", "car_power", "car_age"])
_ = plt.suptitle("Histograms of continuous features", fontsize=15)

# Response and discrete features
fig, axes = plt.subplots(figsize=(8, 3), ncols=3)

for v, ax in zip(["claim_nb", "year", "town"], axes):
    df[v].value_counts(sort=False).sort_index().plot(kind="bar", ax=ax, rot=0, title=v)
plt.suptitle("Barplots of response and discrete features", fontsize=15)
plt.tight_layout()
plt.show()

# Rank correlations
corr = df.corr("spearman")
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.suptitle("Rank-correlogram", fontsize=15)
_ = sns.heatmap(
    corr, mask=mask, vmin=-0.7, vmax=0.7, center=0, cmap="vlag", square=True
)

Modeling

We fit a tuned Boosted Trees model to model log(E(claim count)) via Poisson deviance loss.
And perform a SHAP analysis to derive insights.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("claim_nb", axis=1), df["claim_nb"], test_size=0.1, random_state=30
)

# Tuning step not shown. Number of boosting rounds found via early stopping on CV performance
params = dict(
    learning_rate=0.05,
    objective="poisson",
    num_leaves=7,
    min_child_samples=50,
    min_child_weight=0.001,
    colsample_bynode=0.8,
    subsample=0.8,
    reg_alpha=3,
    reg_lambda=5,
    verbose=-1,
)

model_lgb = LGBMRegressor(n_estimators=360, **params)
model_lgb.fit(X_train, y_train)

# SHAP analysis
X_explain = X_train.sample(n=2000, random_state=937)
explainer = shap.Explainer(model_lgb)
shap_val = explainer(X_explain)

plt.suptitle("SHAP importance", fontsize=15)
shap.plots.bar(shap_val)

for s in [shap_val[:, 0:3], shap_val[:, 3:]]:
    shap.plots.scatter(s, color=shap_val, ymin=-0.5, ymax=1)

Here, we would come to the conclusions:

car_weight and year might be dropped, depending on the specify aim of the model.
Add a regression spline for driver_age.
Add an interaction between car_power and town.

Build strong GLM

Let’s build a GLM with these insights. Two important things:

Glum is an extremely powerful GLM implementation that was inspired by a pull request of our Christian Lorentzen.
In the upcoming version 3.0, it adds a formula API based of formulaic, a very performant formula parser. This gives a very easy way to add interaction effects, regression splines, dummy encodings etc.

model_glm = GeneralizedLinearRegressor(
    family="poisson",
    l1_ratio=1.0,
    alpha=1e-10,
    formula="car_power * C(town) + bs(driver_age, 7) + car_age",
)
model_glm.fit(X_train, y=y_train)  # 1 second on old laptop

# PDPs of both models
fig, ax = plt.subplots(2, 2, figsize=(7, 5))
cols = ("tab:blue", "tab:orange")
for color, name, model in zip(cols, ("GLM", "LGB"), (model_glm, model_lgb)):
    disp = PartialDependenceDisplay.from_estimator(
        model,
        features=["driver_age", "car_age", "car_power", "town"],
        X=X_explain,
        ax=ax if name == "GLM" else disp.axes_,
        line_kw={"label": name, "color": color},
    )
fig.suptitle("PDPs of both models", fontsize=15)
fig.tight_layout()

# Stratified PDP of car_power
for color, town in zip(("tab:blue", "tab:orange"), (0, 1)):
    mask = X_explain.town == town
    disp = PartialDependenceDisplay.from_estimator(
        model_glm,
        features=["car_power"],
        X=X_explain[mask],
        ax=None if town == 0 else disp.axes_,
        line_kw={"label": town, "color": color},
    )
plt.suptitle("PDP of car_power stratified by town (0 vs 1)", fontsize=15)
_ = plt.ylim(0, 0.2)

In this relatively simple situation, the mean Poisson deviance of our models are very simlar now:

model_dummy = DummyRegressor().fit(X_train, y=y_train)
deviance_null = mean_poisson_deviance(y_test, model_dummy.predict(X_test)) 

dev_imp = []
for name, model in zip(("GLM", "LGB", "Null"), (model_glm, model_lgb, model_dummy)):
    dev_imp.append((name, mean_poisson_deviance(y_test, model.predict(X_test))))
pd.DataFrame(dev_imp, columns=["Model", "Mean_Poisson_Deviance"])

Final words

Glum is an extremely powerful GLM implementation – we have only scratched its surface. You can expect more blogposts on Glum…
Having a formula interface is especially useful for adding interactions. Fingers crossed that the upcoming version 3.0 will soon be released.
Building GLMs via ML + XAI is so smooth, especially when you work with large data. For small data, you need to be careful to not add hidden overfitting to the model.

Click here for the full Python notebook

Comments

2 responses to “Building Strong GLMs in Python via ML + XAI”

2024-05-02

Matthias Schmidtblaicher

Thanks for the post! In case you haven’t seen it yet, we finally released glum v3 to conda-forge. Please let us know if you encounter any issue or features that you would like to add.

2024-11-23

Effect Plots in Python and R – Michael's and Christian's Blog

[…] both high-quality Python and R code. We will use the same data and models as in one of our latest posts on how to build strong GLMs via ML + […]

Insurance pricing data

Modeling

Build strong GLM

Final words

Comments

2 responses to “Building Strong GLMs in Python via ML + XAI”

Leave a Reply Cancel reply

More posts

Tabular foundation models: simply a tool for impressing managers?

Gap Safe Screening Rules for the Lasso landed in scikit-learn

Introducing LightSHAP

Key improvements in shapviz and kernelshap