Out-of-sample Imputation with {missRanger}

{missRanger} is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.

Out-of-sample application

The newest CRAN release 2.6.0 offers out-of-sample application. This is useful for avoiding leakage between train and test data, or between folds during cross-validation. Furthermore, it allows you to fill missing values in user-provided data. By default, it uses the same number of PMM donors as during training, but you can change this by setting pmm.k to a value of your choice.
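To make the leakage point concrete, here is a minimal sketch of fold-wise imputation in a 5-fold cross-validation. The data, the fold assignment, and all object names are assumptions chosen for illustration. The call predict(..., pmm.k = 3) shows how the number of donors used during training could be overridden.

```r
library(missRanger)

# Illustrative data (assumption): iris with 10% missing values
ir <- generateNA(iris, p = 0.1, seed = 11)

# Randomly assign each row to one of 5 folds
set.seed(1)
fold <- sample(rep(1:5, length.out = nrow(ir)))

imputed_folds <- vector("list", 5)
for (k in 1:5) {
  train_k <- ir[fold != k, ]
  valid_k <- ir[fold == k, ]

  # Fit the imputation forests on the training folds only
  mr_k <- missRanger(
    train_k, pmm.k = 5, keep_forests = TRUE, num.trees = 100, seed = k, verbose = 0
  )

  # Impute the held-out fold with those forests, here with 3 PMM donors
  imputed_folds[[k]] <- predict(mr_k, valid_k, pmm.k = 3, seed = k)
}

# Number of missing values left in each imputed fold
sapply(imputed_folds, function(d) sum(is.na(d)))
```

Because each held-out fold is filled by forests that never saw it, no information flows from the validation rows into the imputation model.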

We distinguish two types of observations to be imputed:

Easy case: Only a single value is missing. Here, we simply apply the corresponding random forest to fill the one missing value.
Hard case: Multiple values are missing. Here, we first fill the values univariately, and then repeatedly apply the corresponding random forests, with the hope that the effect of the initial univariate imputation vanishes. If values of two highly correlated features are missing, then the imputations can be nonsensical. There is no way to mend this.
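To see which rows of a dataset fall into which case, one can simply count the missing values per row. The following base R sketch does this for a small, hypothetical data frame. It only illustrates the distinction and is not how the package decides internally.

```r
# Hypothetical rows to be imputed, mirroring the numeric columns of iris
newdata <- data.frame(
  Sepal.Length = c(5.1, 4.7, 5.4),
  Sepal.Width  = c(3.5, 3.2, NA),
  Petal.Length = c(1.4, 1.3, 1.7),
  Petal.Width  = c(0.2, NA, NA)
)

# Count missing values per row
n_missing <- rowSums(is.na(newdata))

# 0 = nothing to do, 1 = easy case, 2 or more = hard case
case <- cut(
  n_missing,
  breaks = c(-Inf, 0, 1, Inf),
  labels = c("complete", "easy", "hard")
)

data.frame(n_missing, case)
```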

Example

To illustrate the technique with a simple example, we use the iris data.

1. First, we randomly add 10% missing values.
2. Then, we make a train/test split.
3. Next, we “fit” missRanger() to the training data.
4. Finally, we use its new predict() method to fill the test data.

library(missRanger)

# Add 10% missing values
ir <- iris |>
  generateNA(p = 0.1, seed = 11)

# Train/test split stratified by Species
oos <- c(1:10, 51:60, 101:110)
train <- ir[-oos, ]
test <- ir[oos, ]

head(test)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3          NA  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4          NA          1.7          NA  setosa

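# Fit on the training data; keep_forests = TRUE stores the forests needed by predict()
mr <- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1)

# Fill the test data using the forests fitted on the training data
test_filled <- predict(mr, test, seed = 1)
head(test_filled)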

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         4.0          1.7         0.4  setosa

# Original
head(iris)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

The results look reasonable, in this case even for the “hard case” row 6 with missing values in two variables. Here, it is probably the strong association with Species that helped to create good values.
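Instead of eyeballing the first rows, the imputed test values can be compared with the original iris values. The sketch below recreates the objects from the example and computes a mean absolute error per numeric column, using only the cells that were missing in the test data. The error measure and the helper names (truth, mae, sp_na) are assumptions for illustration.

```r
library(missRanger)

# Recreate the objects from the example
ir <- generateNA(iris, p = 0.1, seed = 11)
oos <- c(1:10, 51:60, 101:110)
train <- ir[-oos, ]
test <- ir[oos, ]

mr <- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1, verbose = 0)
test_filled <- predict(mr, test, seed = 1)
truth <- iris[oos, ]

# Mean absolute error per numeric column, over cells that were missing in test
num_cols <- names(test)[sapply(test, is.numeric)]
mae <- sapply(num_cols, function(v) {
  was_na <- is.na(test[[v]])
  if (!any(was_na)) return(NA_real_)
  mean(abs(test_filled[[v]][was_na] - truth[[v]][was_na]))
})
mae

# Share of correctly imputed Species labels (only if any were missing)
sp_na <- is.na(test$Species)
if (any(sp_na)) {
  mean(as.character(test_filled$Species[sp_na]) == as.character(truth$Species[sp_na]))
}
```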

The new predict() also works with single row input.
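A minimal sketch of the single-row use, assuming a fresh observation with two missing values. The values in new_row are hypothetical. The column names and types, and the factor levels of Species, are set to match the training data.

```r
library(missRanger)

# Fit on the training part of the example data
ir <- generateNA(iris, p = 0.1, seed = 11)
oos <- c(1:10, 51:60, 101:110)
mr <- missRanger(ir[-oos, ], pmm.k = 5, keep_forests = TRUE, seed = 1, verbose = 0)

# A single fresh observation with two missing values (hypothetical values)
new_row <- data.frame(
  Sepal.Length = 6.1,
  Sepal.Width = NA_real_,
  Petal.Length = 4.7,
  Petal.Width = NA_real_,
  Species = factor("versicolor", levels = levels(iris$Species))
)

filled_row <- predict(mr, new_row, seed = 1)
filled_row
```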

Learn more about {missRanger}

Basics: https://mayer79.github.io/missRanger/articles/missRanger.html
Multiple imputation: https://mayer79.github.io/missRanger/articles/multiple_imputation.html
Working with survival data: https://mayer79.github.io/missRanger/articles/working_with_censoring.html

Comments

2 responses to “Out-of-sample Imputation with {missRanger}”

2025-03-20

Chiara Sciascia

Hi! I have a question for you. I’m working with a panel dataset, but my missing values are spread through time. How can I adapt this out-of-sample imputation in order to impute the whole dataset?

1. 2025-03-20
  
  Michael Mayer
  
  I don’t see why you would need the out-of-sample mode in this case. Its main purpose is to act on fresh data rows, or in train/test situations.
