{"id":1338,"date":"2023-11-11T10:59:02","date_gmt":"2023-11-11T09:59:02","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=1338"},"modified":"2023-11-20T20:48:16","modified_gmt":"2023-11-20T19:48:16","slug":"permutation-shap-versus-kernel-shap","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2023\/11\/11\/permutation-shap-versus-kernel-shap\/","title":{"rendered":"Permutation SHAP versus Kernel SHAP"},"content":{"rendered":"\n<p>SHAP is the predominant way to interpret black-box ML models, especially for tree-based models with the blazingly fast TreeSHAP algorithm.<\/p>\n\n\n\n<p>For general models, two slower SHAP algorithms exist:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Permutation SHAP (\u0160trumbelj and Kononenko, 2010)<\/li>\n\n\n\n<li>Kernel SHAP (Lundberg and Lee, 2017)<\/li>\n<\/ol>\n\n\n\n<p>Kernel SHAP was introduced in [2] as an approximation to permutation SHAP. <\/p>\n\n\n\n<p>The <a href=\"https:\/\/cran.r-project.org\/web\/packages\/kernelshap\/index.html\">0.4.0 CRAN release<\/a> of our {kernelshap} package now contains an exact permutation SHAP algorithm for up to 14 features, and thus it becomes easy to make experiments between the two approaches.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Some initial statements about permutation SHAP and Kernel SHAP<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Exact permutation SHAP and exact Kernel SHAP have the same computational complexity.<\/li>\n\n\n\n<li>Technically, exact Kernel SHAP is still an approximation of exact permutation SHAP, so you should prefer the latter.<\/li>\n\n\n\n<li>Kernel SHAP assumes feature independence. Since features are never independent in practice: does this mean we should never use Kernel SHAP?<\/li>\n\n\n\n<li>Kernel SHAP can be calculated almost exactly for any number of features, while permutation SHAP approximations get more and more inprecise when the number of features gets too large.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Simulation 1<\/h3>\n\n\n\n<p>We will first work with the iris data because it has extremely strong correlations between features. To see the impact of having models with and without interactions, we work with a random forest model of increasing tree depth. 
## Simulation 1

We will first work with the iris data because its features are extremely strongly correlated. To see the impact of models with and without interactions, we fit random forests of increasing tree depth: depth 1 means no interactions, depth 2 means pairwise interactions, and so on.

```r
library(kernelshap)
library(ranger)

differences <- numeric(4)

set.seed(1)

for (depth in 1:4) {
  fit <- ranger(
    Sepal.Length ~ ., 
    mtry = 3,
    data = iris, 
    max.depth = depth
  )
  ps <- permshap(fit, iris[2:5], bg_X = iris)
  ks <- kernelshap(fit, iris[2:5], bg_X = iris)
  differences[depth] <- mean(abs(ks$S - ps$S))
}

differences  # for tree depth 1, 2, 3, 4
# 5.053249e-17 9.046443e-17 2.387905e-04 4.403375e-04

# SHAP values of first two rows with tree depth 4
ps
#      Sepal.Width Petal.Length Petal.Width      Species
# [1,]  0.11377616   -0.7130647  -0.1956012 -0.004437022
# [2,] -0.06852539   -0.7596562  -0.2259017 -0.006575266

ks
#      Sepal.Width Petal.Length Petal.Width      Species
# [1,]  0.11463191   -0.7125194  -0.1951810 -0.006258208
# [2,] -0.06828866   -0.7597391  -0.2259833 -0.006647530
```

- Up to pairwise interactions (tree depth 2), the mean absolute difference between the two (150 x 4) SHAP matrices is zero up to numerical precision.
- Even for interactions of order three or higher, the differences are small. This is unexpected, given that all iris features are strongly correlated!
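As a quick sanity check on these results, exact SHAP values satisfy the efficiency property: per row, they sum to the prediction minus the average background prediction. The following is a minimal sketch, assuming the `fit` and `ps` objects from the last loop iteration, and assuming that the `permshap()` output stores the average prediction over `bg_X` in `$baseline`.

```r
# Efficiency check (sketch): row sums of SHAP values plus the baseline
# should reproduce the model predictions up to numerical error.
# `ps$baseline` is assumed to hold the average prediction over bg_X.
pred <- predict(fit, data = iris)$predictions
max(abs(rowSums(ps$S) + ps$baseline - pred))  # should be close to 0
```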
## Simulation 2

Let's now use a data set with more features: the Miami house price data. As the modeling technique we use XGBoost, for which we would normally use TreeSHAP. Here too, we increase the tree depth from 1 to 3 to allow interactions of increasing order.

```r
library(xgboost)
library(shapviz)  # also provides the miami data

colnames(miami) <- tolower(colnames(miami))
miami$log_ocean <- log(miami$ocean_dist)
x <- c("log_ocean", "tot_lvg_area", "lnd_sqfoot", "structure_quality", "age", "month_sold")

# Train/valid split
set.seed(1)
ix <- sample(nrow(miami), 0.8 * nrow(miami))

y_train <- log(miami$sale_prc[ix])
y_valid <- log(miami$sale_prc[-ix])
X_train <- data.matrix(miami[ix, x])
X_valid <- data.matrix(miami[-ix, x])

dtrain <- xgb.DMatrix(X_train, label = y_train)
dvalid <- xgb.DMatrix(X_valid, label = y_valid)

# Fit via early stopping (depth 1 to 3)
differences <- numeric(3)

for (i in 1:3) {
  fit <- xgb.train(
    params = list(learning_rate = 0.15, objective = "reg:squarederror", max_depth = i),
    data = dtrain,
    watchlist = list(valid = dvalid),
    early_stopping_rounds = 20,
    nrounds = 1000,
    callbacks = list(cb.print.evaluation(period = 100))
  )
  ps <- permshap(fit, X = head(X_valid, 500), bg_X = head(X_valid, 500))
  ks <- kernelshap(fit, X = head(X_valid, 500), bg_X = head(X_valid, 500))
  differences[i] <- mean(abs(ks$S - ps$S))
}
differences # for tree depth 1, 2, 3
# 2.904010e-09 5.158383e-09 6.586577e-04

# SHAP values of top two rows for tree depth 3
ps
# log_ocean tot_lvg_area lnd_sqfoot structure_quality        age  month_sold
# 0.2224359   0.04941044  0.1266136         0.1360166 0.01036866 0.005557032
# 0.3674484   0.01045079  0.1192187         0.1180312 0.01426247 0.005465283

ks
# log_ocean tot_lvg_area lnd_sqfoot structure_quality        age  month_sold
# 0.2245202  0.049520308  0.1266020         0.1349770 0.01142703 0.003355770
# 0.3697167  0.009575195  0.1198201         0.1168738 0.01544061 0.003450425
```

Again, the same picture as with iris: essentially no difference for interactions up to order two, and only small differences for interactions of higher order.
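The {shapviz} package loaded above can turn these results into plots. A minimal sketch, assuming that `shapviz()` accepts `kernelshap()`/`permshap()` output directly via its connectors:

```r
# Visualize the Kernel SHAP results from Simulation 2 (sketch).
sv <- shapviz(ks)
sv_importance(sv)               # mean absolute SHAP values per feature
sv_dependence(sv, "log_ocean")  # SHAP dependence on log ocean distance
```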
href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2023-11-11%20Permutation-SHAP.R\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When do the two methods agree? When not?<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16,17,9],"tags":[5],"class_list":["post-1338","post","type-post","status-publish","format-standard","hentry","category-machine-learning","category-programming","category-statistics","tag-r"],"featured_image_src":null,"author_info":{"display_name":"Michael Mayer","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/michael\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1338","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=1338"}],"version-history":[{"count":4,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1338\/revisions"}],"predecessor-version":[{"id":1374,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1338\/revisions\/1374"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=1338"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=1338"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=1338"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}