{"id":292,"date":"2021-06-23T20:40:24","date_gmt":"2021-06-23T18:40:24","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=292"},"modified":"2021-06-23T20:40:25","modified_gmt":"2021-06-23T18:40:25","slug":"shap-analysis-in-9-lines","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2021\/06\/23\/shap-analysis-in-9-lines\/","title":{"rendered":"SHAP Analysis in 9 Lines"},"content":{"rendered":"\n<p>Hello ML world<\/p>\n\n\n\n<p>Recently, together with Yang Liu, we have been investing some time to extend the R package <a href=\"https:\/\/github.com\/liuyanguu\/SHAPforxgboost\">SHAPforxgboost<\/a>. This package is designed to make beautiful SHAP plots for <a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/\">XGBoost<\/a> models, using the native treeshap implementation shipped with XGBoost. <\/p>\n\n\n\n<p><strong>Some of the new features of SHAPforxgboost<\/strong> <\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Added support for <a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/\">LightGBM<\/a> models, using the native treeshap implementation for LightGBM. 
So don&#8217;t get tricked by the package name &#8220;SHAPforxgboost&#8221; :-).<\/li><li>The function <code>shap.plot.dependence<\/code>() has received the option to select the heuristically strongest interacting feature on the color scale; see the last section for details.<\/li><li><code>shap.plot.dependence<\/code>() now allows jitter and alpha transparency.<\/li><li>The new function <code>shap.importance<\/code>() returns SHAP importances without plotting them.<\/li><li>Added a <a href=\"https:\/\/cran.r-project.org\/web\/packages\/SHAPforxgboost\/vignettes\/basic_workflow.html\">vignette<\/a> with the basic workflow to <a href=\"https:\/\/cran.r-project.org\/web\/packages\/SHAPforxgboost\/index.html\">CRAN<\/a>.<\/li><li>Added a logo:<\/li><\/ul>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/github.com\/liuyanguu\/SHAPforxgboost\/blob\/master\/man\/figures\/logo.png?raw=true\" alt=\"logo.png\"\/><\/figure><\/div>\n\n\n\n<p>An interesting alternative for calculating and plotting SHAP values of different tree-based models is the <a href=\"https:\/\/github.com\/ModelOriented\/treeshap\">treeshap<\/a> package by Szymon Maksymiuk et al. Keep an eye on this one &#8211; it is actively being developed!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is SHAP?<\/h2>\n\n\n\n<p>A couple of years ago, the concept of Shapley values, introduced in game theory in the 1950s, was rediscovered by Scott Lundberg and others as an interesting approach to explain predictions of ML models. <\/p>\n\n\n\n<p>The basic idea is to fairly decompose a prediction into additive contributions of the features. Repeating the process for many predictions provides a brilliant way to investigate the model as a whole. <\/p>\n\n\n\n<p>The main resource on the topic is Scott Lundberg&#8217;s <a href=\"https:\/\/github.com\/slundberg\/shap\">site<\/a>. 
Besides this, I&#8217;d recommend going through these two fantastic blog posts, even if you already know what SHAP values are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/towardsdatascience.com\/explain-your-model-with-the-shap-values-bc36aac4de3d\">https:\/\/towardsdatascience.com\/explain-your-model-with-the-shap-values-bc36aac4de3d<\/a><\/li><li><a href=\"https:\/\/meichenlu.com\/2018-11-10-SHAP-explainable-machine-learning\/\">https:\/\/meichenlu.com\/2018-11-10-SHAP-explainable-machine-learning\/<\/a><\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Illustration<\/h2>\n\n\n\n<p>As an example, we will try to model the log house prices of 20&#8217;000 houses sold in King County. The dataset is available e.g. on <a href=\"http:\/\/openml.org\">OpenML.org<\/a> under ID 42092.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"224\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/image-1024x224.png\" alt=\"\" class=\"wp-image-418\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/image-1024x224.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/image-300x66.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/image-768x168.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/image.png 1146w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Some rows and columns from the King County house dataset.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Fetch and prepare data<\/h3>\n\n\n\n<p>We start by downloading the data and preparing it for modelling.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-4097b0e9-1f33-4621-b8dd-b08316098152\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder 
horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-4097b0e9-1f33-4621-b8dd-b08316098152-tab-0\" aria-controls=\"ub-tabbed-content-4097b0e9-1f33-4621-b8dd-b08316098152-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-4097b0e9-1f33-4621-b8dd-b08316098152-panel-0\" aria-labelledby=\"ub-tabbed-content-4097b0e9-1f33-4621-b8dd-b08316098152-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(farff)\nlibrary(OpenML)\nlibrary(dplyr)\nlibrary(xgboost)\nlibrary(ggplot2)\nlibrary(SHAPforxgboost)\n\n# Load King County house prices dataset on OpenML\n# ID 42092, https:\/\/www.openml.org\/d\/42092\ndf &lt;- getOMLDataSet(data.id = 42092)$data\nhead(df)\n\n# Prepare\ndf &lt;- df %&gt;%\n  mutate(\n    log_price = log(price),\n    log_sqft_lot = log(sqft_lot),\n    year = 
as.numeric(substr(date, 1, 4)),\n    building_age = year - yr_built,\n    zipcode = as.integer(as.character(zipcode))\n  )\n\n# Define response and features\ny &lt;- \"log_price\"\nx &lt;- c(\"grade\", \"year\", \"building_age\", \"sqft_living\",\n       \"log_sqft_lot\", \"bedrooms\", \"bathrooms\", \"floors\", \"zipcode\",\n       \"lat\", \"long\", \"condition\", \"waterfront\")\n\n# random split\nset.seed(83454)\nix &lt;- sample(nrow(df), 0.8 * nrow(df))<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<h3 class=\"wp-block-heading\">Fit XGBoost model<\/h3>\n\n\n\n<p>Next, we fit a manually tuned XGBoost model to the data.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-8352d308-0a50-43df-b466-73605eed4a85\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-8352d308-0a50-43df-b466-73605eed4a85-tab-0\" aria-controls=\"ub-tabbed-content-8352d308-0a50-43df-b466-73605eed4a85-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" 
id=\"ub-tabbed-content-8352d308-0a50-43df-b466-73605eed4a85-panel-0\" aria-labelledby=\"ub-tabbed-content-8352d308-0a50-43df-b466-73605eed4a85-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>dtrain &lt;- xgb.DMatrix(data.matrix(df[ix, x]),\n                      label = df[ix, y])\ndvalid &lt;- xgb.DMatrix(data.matrix(df[-ix, x]),\n                      label = df[-ix, y])\n\nparams &lt;- list(\n  objective = \"reg:squarederror\",\n  learning_rate = 0.05,\n  subsample = 0.9,\n  colsample_bynode = 1,\n  reg_lambda = 2,\n  max_depth = 5\n)\n\nfit_xgb &lt;- xgb.train(\n  params,\n  data = dtrain,\n  watchlist = list(valid = dvalid),\n  early_stopping_rounds = 20,\n  print_every_n = 100,\n  nrounds = 10000 # early stopping\n)<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p>The resulting model consists of about 600 trees and reaches a validation RMSE of 0.16. This means that about 2\/3 of the predictions are within 16% of the observed price, using the <a href=\"https:\/\/en.wikipedia.org\/wiki\/68%E2%80%9395%E2%80%9399.7_rule\">empirical rule<\/a>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Compact SHAP analysis<\/h4>\n\n\n\n<p>ML models are rarely of any use without interpreting their results, so let&#8217;s use SHAP to peek into the model.<\/p>\n\n\n\n<p>The analysis includes a first plot with SHAP importances. 
Then, with decreasing importance, dependence plots are shown to get an impression of the effect of each feature.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-b447be11-9aef-4f70-9e10-a48207f0b45d\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-b447be11-9aef-4f70-9e10-a48207f0b45d-tab-0\" aria-controls=\"ub-tabbed-content-b447be11-9aef-4f70-9e10-a48207f0b45d-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-b447be11-9aef-4f70-9e10-a48207f0b45d-panel-0\" aria-labelledby=\"ub-tabbed-content-b447be11-9aef-4f70-9e10-a48207f0b45d-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" 
data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'># Step 1: Select some observations\nX &lt;- data.matrix(df[sample(nrow(df), 1000), x])\n\n# Step 2: Crunch SHAP values\nshap &lt;- shap.prep(fit_xgb, X_train = X)\n\n# Step 3: SHAP importance\nshap.plot.summary(shap)\n\n# Step 4: Loop over dependence plots in decreasing importance\nfor (v in shap.importance(shap, names_only = TRUE)) {\n  p &lt;- shap.plot.dependence(shap, v, color_feature = \"auto\", \n                            alpha = 0.5, jitter_width = 0.1) +\n    ggtitle(v)\n  print(p)\n}<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p>Some of the plots are shown below. The code actually produces all plots, see the corresponding <a href=\"https:\/\/htmlpreview.github.io\/?https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-06-23-shapforxgboost.html\">html<\/a> output on github.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-1-1024x731.png\" alt=\"\" class=\"wp-image-522\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-1-1024x731.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-1-300x214.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-1-768x549.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-1-1200x857.png 1200w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-1.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 1: SHAP importance for XGBoost model. The results make intuitive sense. 
Location and size are among the strongest predictors.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-3-1-1024x731.png\" alt=\"\" class=\"wp-image-526\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-3-1-1024x731.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-3-1-300x214.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-3-1-768x549.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-3-1-1200x857.png 1200w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-3-1.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2: SHAP dependence for the second strongest predictor. The larger the living area, the higher the log price. There is not much vertical scatter, indicating that living area acts quite additively on the predictions on the log scale.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-8-1024x731.png\" alt=\"\" class=\"wp-image-531\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-8-1024x731.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-8-300x214.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-8-768x549.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-8-1200x857.png 1200w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/xgbshap-8.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 3: SHAP dependence for a less important predictor. The effect of &#8220;condition&#8221; 4 vs 3 seems to depend on the zipcode (see the color). 
For some zipcodes, the condition does not have a big effect on the price, while for other zipcodes, the effect is clearly higher.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Same workflow for LightGBM<\/h3>\n\n\n\n<p>Let&#8217;s try out the <code>SHAPforxgboost<\/code> package with LightGBM. <\/p>\n\n\n\n<p><em>Note: LightGBM Version 3.2.1 on CRAN is not working properly under Windows. This will be fixed in the next release of LightGBM. As a temporary solution, you need to build it from the current <a href=\"https:\/\/github.com\/microsoft\/LightGBM\">master branch<\/a><\/em>.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-04510b3b-d984-4ddf-afb0-95e565214faa\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-04510b3b-d984-4ddf-afb0-95e565214faa-tab-0\" aria-controls=\"ub-tabbed-content-04510b3b-d984-4ddf-afb0-95e565214faa-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-04510b3b-d984-4ddf-afb0-95e565214faa-panel-0\" 
aria-labelledby=\"ub-tabbed-content-04510b3b-d984-4ddf-afb0-95e565214faa-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(lightgbm)\n\ndtrain &lt;- lgb.Dataset(data.matrix(df[ix, x]),\n                      label = df[ix, y])\ndvalid &lt;- lgb.Dataset(data.matrix(df[-ix, x]),\n                      label = df[-ix, y])\n\nparams &lt;- list(\n  objective = \"regression\",\n  learning_rate = 0.05,\n  subsample = 0.9,\n  reg_lambda = 2,\n  num_leaves = 15\n)\n\nfit_lgb &lt;- lgb.train(\n  params,\n  data = dtrain,\n  valids = list(valid = dvalid),\n  early_stopping_rounds = 20,\n  eval_freq = 100,\n  eval = \"rmse\",\n  nrounds = 10000\n)<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p>Early stopping on the validation data selects about 900 trees as optimal and also results in a validation RMSE of 0.16.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">SHAP analysis<\/h4>\n\n\n\n<p>We use exactly the same short snippet to analyze the model with SHAP.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-9abec7e2-804d-4939-9b43-b51a554bf558\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: 
flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-9abec7e2-804d-4939-9b43-b51a554bf558-tab-0\" aria-controls=\"ub-tabbed-content-9abec7e2-804d-4939-9b43-b51a554bf558-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-9abec7e2-804d-4939-9b43-b51a554bf558-panel-0\" aria-labelledby=\"ub-tabbed-content-9abec7e2-804d-4939-9b43-b51a554bf558-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>X &lt;- data.matrix(df[sample(nrow(df), 1000), x])\nshap &lt;- shap.prep(fit_lgb, X_train = X)\nshap.plot.summary(shap)\n\nfor (v in shap.importance(shap, names_only = TRUE)) {\n  p &lt;- shap.plot.dependence(shap, v, color_feature = \"auto\", \n                            alpha = 0.5, jitter_width = 0.1) +\n    ggtitle(v)\n  print(p)\n}<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p>Again, we only show some of the output and refer to the <a href=\"https:\/\/htmlpreview.github.io\/?https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-06-23-shapforxgboost.html\">html<\/a> of the corresponding rmarkdown. 
Overall, the model seems to be very similar to the one obtained by XGBoost.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-1-1024x731.png\" alt=\"\" class=\"wp-image-533\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-1-1024x731.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-1-300x214.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-1-768x549.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-1-1200x857.png 1200w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-1.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 4: SHAP importance for LightGBM. By chance, the order of importance is the same as for XGBoost.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-3-1024x731.png\" alt=\"\" class=\"wp-image-535\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-3-1024x731.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-3-300x214.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-3-768x549.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-3-1200x857.png 1200w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/lgbfit-3.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 5: The dependence plot for the living area also looks almost identical in shape to the one for the XGBoost model.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">How does the dependence plot select the color variable?<\/h2>\n\n\n\n<p>By default, Scott&#8217;s <a href=\"https:\/\/github.com\/slundberg\/shap\">shap<\/a> package for Python 
uses a statistical heuristic to colorize the points in the dependence plot by the variable with the potentially strongest interaction. The heuristic used by SHAPforxgboost is slightly different and directly uses conditional variances. More specifically, the variable X on the x-axis as well as each other feature Z_k is binned into categories. Then, for each Z_k, the conditional variance across the bins of X and Z_k is calculated. The Z_k with the highest conditional variance is selected as the color variable. <\/p>\n\n\n\n<p>Note that the heuristic does not depend on &#8220;SHAP interaction values&#8221;, which saves time (and these would not be available for LightGBM anyway).<\/p>\n\n\n\n<p>The following simple example shows that the heuristic works. First, a dataset is created and a model with three features and a strong interaction between x1 and x2 is fitted. Then, we look at the dependence plots to see whether they are consistent with the model\/data situation.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;r&quot;,&quot;mime&quot;:&quot;text\/x-rsrc&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;R&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;r&quot;}\">n &lt;- 1000\n\nset.seed(334)\n\ndf &lt;- data.frame(\n  x1 = runif(n),\n  x2 = runif(n),\n  x3 = runif(n)\n  ) %&gt;% \n  mutate(\n    y = x1 * x2 + x3 + runif(n)\n  )\nx &lt;- c(&quot;x1&quot;, &quot;x2&quot;, &quot;x3&quot;)\ndtrain &lt;- lgb.Dataset(data.matrix(df[, x]),\n                      label = df[, &quot;y&quot;])\n\nparams &lt;- list(\n  objective = &quot;regression&quot;,\n  
learning_rate = 0.05,\n  subsample = 0.9,\n  reg_lambda = 2,\n  num_leaves = 15\n)\n\nfit_lgb &lt;- lgb.train(\n  params,\n  data = dtrain,\n  eval = &quot;rmse&quot;,\n  nrounds = 100\n)\n\nshap &lt;- shap.prep(fit_lgb, X_train = data.matrix(df[, x]))\nshap.plot.summary(shap)\n\nshap.plot.dependence(shap, &quot;x1&quot;, color_feature = &quot;auto&quot;)\nshap.plot.dependence(shap, &quot;x2&quot;, color_feature = &quot;auto&quot;)\nshap.plot.dependence(shap, &quot;x3&quot;, color_feature = &quot;auto&quot;)<\/pre><\/div>\n\n\n\n<p>Here are the dependence plots for x1 and x3.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-2-1024x731.png\" alt=\"\" class=\"wp-image-544\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-2-1024x731.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-2-300x214.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-2-768x549.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-2-1200x857.png 1200w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-2.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 6: The dependence plot for x1 shows a clear interaction effect with the color variable x2. 
This is exactly as simulated in the data.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-4-1024x731.png\" alt=\"\" class=\"wp-image-545\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-4-1024x731.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-4-300x214.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-4-768x549.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-4-1200x857.png 1200w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/05\/interaction-4.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 7: The dependence plot for x3 does not show clear interaction effects, consistent with the data situation.<\/figcaption><\/figure>\n\n\n\n<p>The full R script and rmarkdown file of this post can be found on <a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\">github<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post shows how to make very generic and quick SHAP interpretations of XGBoost and LightGBM models.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[5],"class_list":["post-292","post","type-post","status-publish","format-standard","hentry","category-statistics","tag-r"],"featured_image_src":null,"author_info":{"display_name":"Michael 
Mayer","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/michael\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/292","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=292"}],"version-history":[{"count":41,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/292\/revisions"}],"predecessor-version":[{"id":597,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/292\/revisions\/597"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=292"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=292"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=292"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}