{"id":1440,"date":"2024-02-02T19:15:49","date_gmt":"2024-02-02T18:15:49","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=1440"},"modified":"2024-04-12T08:43:50","modified_gmt":"2024-04-12T06:43:50","slug":"ml-xai-strong-glm-in-python","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2024\/02\/02\/ml-xai-strong-glm-in-python\/","title":{"rendered":"Building Strong GLMs in Python via ML + XAI"},"content":{"rendered":"\n<p>In our <a href=\"https:\/\/lorentzen.ch\/index.php\/2024\/01\/21\/ml-xai-strong-glm\/\">previous post<\/a>, we explained how to use ML + XAI to build strong generalized linear models with R. Let&#8217;s do the same with Python.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"695\" height=\"495\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-5.png\" alt=\"\" class=\"wp-image-1446\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-5.png 695w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-5-300x214.png 300w\" sizes=\"auto, (max-width: 695px) 100vw, 695px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Insurance pricing data<\/h2>\n\n\n\n<p>We will again use a synthetic dataset with 1 million insurance policies, with the following reference:<\/p>\n\n\n\n<p>Mayer, M., Meier, D. and Wuthrich, M.V. 
(2023), <br>SHAP for Actuaries: Explain any Model.<br><a href=\"https:\/\/doi.org\/10.2139\/ssrn.4389797\">https:\/\/doi.org\/10.2139\/ssrn.4389797<\/a><\/p>\n\n\n\n<p>Let&#8217;s start by loading and describing the data:<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport shap\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.inspection import PartialDependenceDisplay\nfrom sklearn.metrics import mean_poisson_deviance\nfrom sklearn.dummy import DummyRegressor\nfrom lightgbm import LGBMRegressor\n# We need the preview version of glum that adds the formulaic API\n# !pip install git+https:\/\/github.com\/Quantco\/glum@glum-v3#egg=glum\nfrom glum import GeneralizedLinearRegressor\n\n# Load data\n\ndf = fetch_openml(data_id=45106, parser=&quot;pandas&quot;).frame\ndf.head()\n\n# Continuous features\ndf.hist([&quot;driver_age&quot;, &quot;car_weight&quot;, &quot;car_power&quot;, &quot;car_age&quot;])\n_ = plt.suptitle(&quot;Histograms of continuous features&quot;, fontsize=15)\n\n# Response and discrete features\nfig, axes = plt.subplots(figsize=(8, 3), ncols=3)\n\nfor v, ax in zip([&quot;claim_nb&quot;, &quot;year&quot;, &quot;town&quot;], axes):\n    df[v].value_counts(sort=False).sort_index().plot(kind=&quot;bar&quot;, 
ax=ax, rot=0, title=v)\nplt.suptitle(&quot;Barplots of response and discrete features&quot;, fontsize=15)\nplt.tight_layout()\nplt.show()\n\n# Rank correlations\ncorr = df.corr(&quot;spearman&quot;)\nmask = np.triu(np.ones_like(corr, dtype=bool))\nplt.suptitle(&quot;Rank-correlogram&quot;, fontsize=15)\n_ = sns.heatmap(\n    corr, mask=mask, vmin=-0.7, vmax=0.7, center=0, cmap=&quot;vlag&quot;, square=True\n)\n\n<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"569\" height=\"457\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image.png\" alt=\"\" class=\"wp-image-1441\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image.png 569w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-300x241.png 300w\" sizes=\"auto, (max-width: 569px) 100vw, 569px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"791\" height=\"288\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-1.png\" alt=\"\" class=\"wp-image-1442\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-1.png 791w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-1-300x109.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-1-768x280.png 768w\" sizes=\"auto, (max-width: 791px) 100vw, 791px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"562\" height=\"520\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-2.png\" alt=\"\" class=\"wp-image-1443\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-2.png 562w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-2-300x278.png 300w\" sizes=\"auto, (max-width: 562px) 100vw, 562px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Modeling<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>We fit a tuned Boosted Trees model for log(E(claim count)) using the Poisson deviance loss.<\/li>\n\n\n\n<li>Then we perform a SHAP analysis to derive insights.<\/li>\n<\/ol>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(\n    df.drop(&quot;claim_nb&quot;, axis=1), df[&quot;claim_nb&quot;], test_size=0.1, random_state=30\n)\n\n# Tuning step not shown. 
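A hedged sketch of this step (assuming lightgbm.cv\n# with the lightgbm.early_stopping callback; fold count and round numbers\n# are illustrative) could look like:\n#\n#   import lightgbm as lgb\n#   cv_res = lgb.cv(\n#       params,  # the dict defined below\n#       lgb.Dataset(X_train, label=y_train),\n#       num_boost_round=1000,\n#       nfold=5,\n#       stratified=False,\n#       callbacks=[lgb.early_stopping(stopping_rounds=20)],\n#   )\n#\n# 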
The number of boosting rounds was found via early stopping on CV performance.\nparams = dict(\n    learning_rate=0.05,\n    objective=&quot;poisson&quot;,\n    num_leaves=7,\n    min_child_samples=50,\n    min_child_weight=0.001,\n    colsample_bynode=0.8,\n    subsample=0.8,\n    reg_alpha=3,\n    reg_lambda=5,\n    verbose=-1,\n)\n\nmodel_lgb = LGBMRegressor(n_estimators=360, **params)\nmodel_lgb.fit(X_train, y_train)\n\n# SHAP analysis\nX_explain = X_train.sample(n=2000, random_state=937)\nexplainer = shap.Explainer(model_lgb)\nshap_val = explainer(X_explain)\n\nplt.suptitle(&quot;SHAP importance&quot;, fontsize=15)\nshap.plots.bar(shap_val)\n\nfor s in [shap_val[:, 0:3], shap_val[:, 3:]]:\n    shap.plots.scatter(s, color=shap_val, ymin=-0.5, ymax=1)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"766\" height=\"413\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-3.png\" alt=\"\" class=\"wp-image-1444\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-3.png 766w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-3-300x162.png 300w\" sizes=\"auto, (max-width: 766px) 100vw, 766px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"888\" height=\"674\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-4.png\" alt=\"\" class=\"wp-image-1445\" style=\"aspect-ratio:1.317507418397626;width:822px;height:auto\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-4.png 888w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-4-300x228.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-4-768x583.png 768w\" sizes=\"auto, (max-width: 888px) 100vw, 888px\" \/><\/figure>\n\n\n\n<p>Here, we would draw the following conclusions:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><code>car_weight<\/code> and <code>year<\/code> might be dropped, 
depending on the specific aim of the model.<\/li>\n\n\n\n<li>Add a regression spline for <code>driver_age<\/code>.<\/li>\n\n\n\n<li>Add an interaction between <code>car_power<\/code> and <code>town<\/code>.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Building a strong GLM<\/h2>\n\n\n\n<p>Let&#8217;s build a GLM with these insights. Two important points:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/Quantco\/glum\">Glum<\/a> is an extremely powerful GLM implementation that was inspired by a pull request from our Christian Lorentzen.<\/li>\n\n\n\n<li>In the upcoming version 3.0, it adds a formula API based on <a href=\"https:\/\/github.com\/matthewwardrop\/formulaic\">formulaic<\/a>, a very performant formula parser. This gives a very easy way to add interaction effects, regression splines, dummy encodings, etc.<\/li>\n<\/ol>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">model_glm = GeneralizedLinearRegressor(\n    family=&quot;poisson&quot;,\n    l1_ratio=1.0,\n    alpha=1e-10,\n    formula=&quot;car_power * C(town) + bs(driver_age, 7) + car_age&quot;,\n)\nmodel_glm.fit(X_train, y=y_train)  # ~1 second on an old laptop\n\n# PDPs of both models\nfig, ax = plt.subplots(2, 2, figsize=(7, 5))\ncols = (&quot;tab:blue&quot;, &quot;tab:orange&quot;)\nfor color, name, model in zip(cols, (&quot;GLM&quot;, &quot;LGB&quot;), (model_glm, model_lgb)):\n    disp = 
PartialDependenceDisplay.from_estimator(\n        model,\n        features=[&quot;driver_age&quot;, &quot;car_age&quot;, &quot;car_power&quot;, &quot;town&quot;],\n        X=X_explain,\n        ax=ax if name == &quot;GLM&quot; else disp.axes_,\n        line_kw={&quot;label&quot;: name, &quot;color&quot;: color},\n    )\nfig.suptitle(&quot;PDPs of both models&quot;, fontsize=15)\nfig.tight_layout()\n\n# Stratified PDP of car_power\nfor color, town in zip((&quot;tab:blue&quot;, &quot;tab:orange&quot;), (0, 1)):\n    mask = X_explain.town == town\n    disp = PartialDependenceDisplay.from_estimator(\n        model_glm,\n        features=[&quot;car_power&quot;],\n        X=X_explain[mask],\n        ax=None if town == 0 else disp.axes_,\n        line_kw={&quot;label&quot;: town, &quot;color&quot;: color},\n    )\nplt.suptitle(&quot;PDP of car_power stratified by town (0 vs 1)&quot;, fontsize=15)\n_ = plt.ylim(0, 0.2)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"695\" height=\"495\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-5.png\" alt=\"\" class=\"wp-image-1446\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-5.png 695w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-5-300x214.png 300w\" sizes=\"auto, (max-width: 695px) 100vw, 695px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"599\" height=\"479\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-6.png\" alt=\"\" class=\"wp-image-1447\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-6.png 599w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-6-300x240.png 300w\" sizes=\"auto, (max-width: 599px) 100vw, 599px\" \/><\/figure>\n\n\n\n<p>In this relatively simple situation, the mean Poisson deviances of our models are now very similar:<\/p>\n\n\n\n<div 
class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">model_dummy = DummyRegressor().fit(X_train, y=y_train)\n\ndev_imp = []\nfor name, model in zip((&quot;GLM&quot;, &quot;LGB&quot;, &quot;Null&quot;), (model_glm, model_lgb, model_dummy)):\n    dev_imp.append((name, mean_poisson_deviance(y_test, model.predict(X_test))))\npd.DataFrame(dev_imp, columns=[&quot;Model&quot;, &quot;Mean_Poisson_Deviance&quot;])<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"122\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/02\/image-7.png\" alt=\"\" class=\"wp-image-1448\" style=\"aspect-ratio:2.098360655737705;width:318px;height:auto\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Final words<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Glum is an extremely powerful GLM implementation &#8211; we have only scratched its surface. You can expect more blog posts on Glum&#8230;<\/li>\n\n\n\n<li>Having a formula interface is especially useful for adding interactions. Fingers crossed that the upcoming version 3.0 will soon be released.<\/li>\n\n\n\n<li>Building GLMs via ML + XAI is so smooth, especially when you work with large data. 
For small data, you need to be careful not to introduce hidden overfitting into the model.<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2024-02-02%20strong_glm_with_xai.ipynb\">Click here for the full Python notebook<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We use Python to craft a strong GLM using insights from a boosted trees model.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16,17,9],"tags":[6],"class_list":["post-1440","post","type-post","status-publish","format-standard","hentry","category-machine-learning","category-programming","category-statistics","tag-python"],"featured_image_src":null,"author_info":{"display_name":"Michael Mayer","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/michael\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1440","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=1440"}],"version-history":[{"count":5,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1440\/revisions"}],"predecessor-version":[{"id":1534,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1440\/revisions\/1534"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=1440"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=1440"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/w
p-json\/wp\/v2\/tags?post=1440"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}