{"id":1749,"date":"2024-08-23T11:42:52","date_gmt":"2024-08-23T09:42:52","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=1749"},"modified":"2025-03-02T14:55:05","modified_gmt":"2025-03-02T13:55:05","slug":"out-of-sample-imputation-with-missranger","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2024\/08\/23\/out-of-sample-imputation-with-missranger\/","title":{"rendered":"Out-of-sample Imputation with {missRanger}"},"content":{"rendered":"\n<p><a href=\"https:\/\/cran.r-project.org\/web\/packages\/missRanger\/index.html\">{missRanger}<\/a> is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"240\" height=\"278\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2024\/08\/image.png\" alt=\"\" class=\"wp-image-1750\" style=\"width:384px;height:auto\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Out-of-sample application<\/h2>\n\n\n\n<p>The newest CRAN release 2.6.0 offers out-of-sample application. This is useful for removing any leakage between train\/test data or during cross-validation. Furthermore, it allows you to fill missing values in user-provided data. By default, it uses the same number of PMM donors as during training, but you can change this by setting <code>pmm.k = nice value<\/code>.<\/p>\n\n\n\n<p>We distinguish <strong>two types of observations<\/strong> to be imputed:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Easy case: Only a single value is missing. Here, we simply apply the corresponding random forest to fill the one missing value.<\/li>\n\n\n\n<li>Hard case: Multiple values are missing. 
Here, we first fill the values univariately, and then repeatedly apply the corresponding random forests, in the hope that the effect of univariate imputation vanishes. If the values of two highly correlated features are missing, the imputations can be nonsensical. There is no way to mend this.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Example<\/h2>\n\n\n\n<p>To illustrate the technique with a simple example, we use the iris data.<\/p>\n\n\n\n<p>1. First, we randomly add 10% missing values.<br>2. Then, we make a train\/test split.<br>3. Next, we &#8220;fit&#8221; <code>missRanger()<\/code> to the training data.<br>4. Finally, we use its new <code>predict()<\/code> method to fill the test data.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-4e433800-cbf1-401a-a9e4-83e1fe73e409\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-4e433800-cbf1-401a-a9e4-83e1fe73e409-tab-0\" aria-controls=\"ub-tabbed-content-4e433800-cbf1-401a-a9e4-83e1fe73e409-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\"><br><br>R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" 
class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-4e433800-cbf1-401a-a9e4-83e1fe73e409-panel-0\" aria-labelledby=\"ub-tabbed-content-4e433800-cbf1-401a-a9e4-83e1fe73e409-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(missRanger)\n\n# 10% missings\nir &lt;- iris |&gt; \n  generateNA(p = 0.1, seed = 11)\n\n# Train\/test split stratified by Species\noos &lt;- c(1:10, 51:60, 101:110)\ntrain &lt;- ir[-oos, ]\ntest &lt;- ir[oos, ]\n\nhead(test)\n\n#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n# 1          5.1         3.5          1.4         0.2  setosa\n# 2          4.9         3.0          1.4         0.2  setosa\n# 3          4.7         3.2          1.3          NA  setosa\n# 4          4.6         3.1          1.5         0.2  setosa\n# 5          5.0         3.6          1.4         0.2  setosa\n# 6          5.4          NA          1.7          NA  setosa\n\nmr &lt;- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1)\ntest_filled &lt;- predict(mr, test, seed = 1)\nhead(test_filled)\n\n#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n# 1          5.1         3.5          1.4         0.2  setosa\n# 2          4.9         3.0          1.4         0.2  setosa\n# 3          4.7         3.2          1.3         0.2  setosa\n# 4          4.6         3.1          1.5         0.2  setosa\n# 5          5.0         3.6          1.4         0.2  setosa\n# 6          5.4         4.0          1.7         0.4  setosa\n\n# Original\nhead(iris)\n\n#   Sepal.Length Sepal.Width Petal.Length 
Petal.Width Species\n# 1          5.1         3.5          1.4         0.2  setosa\n# 2          4.9         3.0          1.4         0.2  setosa\n# 3          4.7         3.2          1.3         0.2  setosa\n# 4          4.6         3.1          1.5         0.2  setosa\n# 5          5.0         3.6          1.4         0.2  setosa\n# 6          5.4         3.9          1.7         0.4  setosa\n<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p>The results look reasonable, in this case even for the &#8220;hard case&#8221; row 6 with missing values in two variables. Here, it is probably the strong association with <code>Species<\/code> that helped to create good values.<\/p>\n\n\n\n<p>The new <code>predict()<\/code> also works with single row input.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Learn more about {missRanger}<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Basics<\/em>: <a href=\"https:\/\/mayer79.github.io\/missRanger\/articles\/missRanger.html\">https:\/\/mayer79.github.io\/missRanger\/articles\/missRanger.html<\/a><\/li>\n\n\n\n<li><em>Multiple imputation<\/em>: <a href=\"https:\/\/mayer79.github.io\/missRanger\/articles\/multiple_imputation.html\">https:\/\/mayer79.github.io\/missRanger\/articles\/multiple_imputation.html<\/a><\/li>\n\n\n\n<li><em>Working with survival data<\/em>: <a href=\"https:\/\/mayer79.github.io\/missRanger\/articles\/working_with_censoring.html\">https:\/\/mayer79.github.io\/missRanger\/articles\/working_with_censoring.html<\/a><\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2024-08-23-missranger.R\">The full R script<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Multivariate imputations with 
missRanger.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16,17,9],"tags":[5],"class_list":["post-1749","post","type-post","status-publish","format-standard","hentry","category-machine-learning","category-programming","category-statistics","tag-r"],"featured_image_src":null,"author_info":{"display_name":"Michael Mayer","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/michael\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1749","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=1749"}],"version-history":[{"count":7,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1749\/revisions"}],"predecessor-version":[{"id":1845,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1749\/revisions\/1845"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=1749"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=1749"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=1749"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}