Michael's and Christian's Blog

Category: Machine Learning

Visualize SHAP Values without Tears
SHAP (SHapley Additive exPlanations, Lundberg and Lee, 2017) is an ingenious way to study black box models. SHAP values decompose – as fair as possible – predictions into additive feature contributions.

When it comes to SHAP, the Python implementation is the de-facto standard. It not only offers many SHAP algorithms, but also provides beautiful plots. In R, the situation is a bit more confusing. Different packages contain implementations of SHAP algorithms, e.g.,
- XGBoost,
- LightGBM,
- fastshap, and
- treeshap,
some of which with great visualizations. Plus there is SHAPforxgboost (see my recent post), originally designed to visualize the results of SHAP values calculated from XGBoost, but it can also be used more generally by now.

The shapviz package

In order to entangle calculation from visualization, the shapviz package was designed. It solely focuses on visualization of SHAP values. Closely following its README, it currently provides these plots:
- sv_waterfall(): Waterfall plots to study single predictions.
- sv_force(): Force plots as an alternative to waterfall plots.
- sv_importance(): Importance plots (bar and/or beeswarm plots) to study variable importance.
- sv_dependence(): Dependence plots to study feature effects (optionally colored by heuristically strongest interacting feature).
They require a “shapviz” object, which is built from two things only:
1. S: Matrix of SHAP values
2. X: Dataset with corresponding feature values
Furthermore, a “baseline” can be passed to represent an average prediction on the scale of the SHAP values.

A key feature of the “shapviz” package is that X is used for visualization only. Thus it is perfectly fine to use factor variables, even if the underlying model would not accept these.

To further simplify the use of shapviz, direct connectors to the packages
- XGBoost,
- LightGBM,
- fastshap, and
- treeshap
are available.

Installation

The package shapviz can be installed from CRAN or Github:
- devtools::install_github("shapviz")
- devtools::install_github("mayer79/shapviz")
Example

Shiny diamonds… let’s model their prices by four “c” variables with XGBoost, and create an explanation dataset with 2000 randomly picked diamonds.
R
library(shapviz) library(ggplot2) library(xgboost) set.seed(3653) X <- diamonds[c("carat", "cut", "color", "clarity")] dtrain <- xgb.DMatrix(data.matrix(X), label = diamonds$price) fit <- xgb.train( params = list(learning_rate = 0.1, objective = "reg:squarederror"), data = dtrain, nrounds = 65L ) # Explanation dataset X_small <- X[sample(nrow(X), 2000L), ]
Create “shapviz” object

One line of code creates a shapviz object. It contains SHAP values and feature values for the set of observations we are interested in. Note again that X is solely used as explanation dataset, not for calculating SHAP values.

In this example we construct the shapviz object directly from the fitted XGBoost model. Thus we also need to pass a corresponding prediction dataset X_pred used for calculating SHAP values by XGBoost.
R
shp <- shapviz(fit, X_pred = data.matrix(X_small), X = X_small)
Explaining one single prediction

Let’s start by explaining a single prediction by a waterfall plot or, alternatively, a force plot.
R
# Two types of visualizations sv_waterfall(shp, row_id = 1) sv_force(shp, row_id = 1
Waterfall plot

Factor/character variables are kept as they are, even if the underlying XGBoost model required them to be integer encoded.

Force plot

Explaining the model as a whole

We have decomposed 2000 predictions, not just one. This allows us to study variable importance at a global model level by studying average absolute SHAP values as a bar plot or by looking at beeswarm plots of SHAP values.
R
# Three types of variable importance plots sv_importance(shp) sv_importance(shp, kind = "bar") sv_importance(shp, kind = "both", alpha = 0.2, width = 0.2)
Beeswarm plot

Bar plot

Beeswarm plot overlaid with bar plot

A scatterplot of SHAP values of a feature like color against its observed values gives a great impression on the feature effect on the response. Vertical scatter gives additional info on interaction effects. shapviz offers a heuristic to pick another feature on the color scale with potential strongest interaction.
R
sv_dependence(shp, v = "color", "auto")
Dependence plot with automatic interaction colorization

Summary
- The “shapviz” has a single purpose: making SHAP plots.
- Its interface is optimized for existing SHAP crunching packages and can easily be used in future packages as well.
- All plots are highly customizable. Furthermore, they are all written with ggplot and allow corresponding modifications.
The complete R script can be found here.

References

Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30 (2017).
2022-06-10
Let the flashlight shine with plotly
There are different R packages devoted to model agnostic interpretability, DALEX and iml being among the best known. In 2019, I added flashlight

for a couple of reasons:
1. Its explainers work with case weights.
2. Multiple explainers can be combined to a multi-explainer.
3. Stratified calculation is possible.
Since almost all plots in flashlight are constructed with ggplot, it is super easy to turn them into interactive plotly objects: just add a simple ggplotly() to the end of the call.

However… it is not straightforward to show interactive plots in a blog! Thus, we show only screenshots of the resulting plots here and refer to the complete HTML report here: https://mayer79.github.io/flashlight_plotly/flashlight_plotly.html

We will use a sweet dataset with more than 20’000 houses to model house prices by a set of derived features such as the logarithmic living area. The location will be represented by the postal code.

Data preparation

We first load the data and prepare some of the columns for modeling. Furthermore, we specify the set of features and the response.
R
library(dplyr) library(flashlight) library(plotly) library(ranger) library(lme4) library(moderndive) library(splitTools) library(MetricsWeighted) set.seed(4933) data("house_prices") prep <- house_prices %>% mutate( log_price = log(price), log_sqft_living = log(sqft_living), log_sqft_lot = log(sqft_lot), log_sqft_basement = log1p(sqft_basement), year = as.numeric(format(date, '%Y')), age = year - yr_built ) x <- c( "year", "age", "log_sqft_living", "log_sqft_lot", "bedrooms", "bathrooms", "log_sqft_basement", "condition", "waterfront", "zipcode" ) y <- "log_price" head(prep[c(y, x)]) ## # A tibble: 6 x 11 ## log_price year age log_sqft_living log_sqft_lot bedrooms bathrooms ## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> ## 1 12.3 2014 59 7.07 8.64 3 1 ## 2 13.2 2014 63 7.85 8.89 3 2.25 ## 3 12.1 2015 82 6.65 9.21 2 1 ## 4 13.3 2014 49 7.58 8.52 4 3 ## 5 13.1 2015 28 7.43 9.00 3 2 ## 6 14.0 2014 13 8.60 11.5 4 4.5 ## # ... with 4 more variables: log_sqft_basement <dbl>, condition <fct>, ## # waterfront <lgl>, zipcode <fct>
Train / test split

Then, we split the dataset into 80% training and 20% test rows, stratified on the (binned) response log_price.
R
idx <- partition(prep[[y]], c(train = 0.8, test = 0.2), type = "stratified") train <- prep[idx$train, ] test <- prep[idx$test, ]
Models

We fit two models:
1. A linear mixed model with random postal code effect.
2. A random forest with 500 trees.
R
# Mixed-effects model fit_lmer <- lmer( update(reformulate(x, "log_price"), . ~ . - zipcode + (1 | zipcode)), data = train ) # Random forest fit_rf <- ranger( reformulate(x, "log_price"), always.split.variables = "zipcode", data = train ) cat("R-squared OOB:", fit_rf$r.squared) ## R-squared OOB: 0.8463311
Model inspection

Now, we are ready to inspect our two models regarding performance, variable importance, and effects.

Set up explainers

First, we pack all model dependent information into flashlights (the explainer objects) and combine them to a multiflashlight. As evaluation dataset, we pass the test data. This ensures that interpretability tools using the response (e.g., performance measures and permutation importance) are not being biased by overfitting.
R
fl_lmer <- flashlight(model = fit_lmer, label = "LMER") fl_rf <- flashlight( model = fit_rf, label = "RF", predict_function = function(mod, X) predict(mod, X)$predictions ) fls <- multiflashlight( list(fl_lmer, fl_rf), y = "log_price", data = test, metrics = list(RMSE = rmse, `R-squared` = r_squared) )
Model performance

Let’s evaluate model RMSE and R-squared on the hold-out dataset. Here, the mixed-effects model performs a tiny little bit better than the random forest:
R
(light_performance(fls) %>% plot(fill = "darkred") + labs(title = "Model performance", x = element_blank())) %>% ggplotly()
Model performance (png)

Permutation importance

Next, we inspect the variable strength based on permutation importance. It shows by how much the RMSE is being increased when shuffling a variable before prediction. The results are quite similar between the two models.
R
(light_importance(fls, v = x) %>% plot(fill = "darkred") + labs(title = "Permutation importance", y = "Drop in RMSE")) %>% ggplotly()
Variable importance (png)

ICE plot

To get an impression of the effect of the living area, we select 200 observations and profile their predictions with increasing (log) living area, keeping everything else fixed (Ceteris Paribus). These ICE (individual conditional expectation) plots are vertically centered in order to highlight potential interaction effects. If all curves coincide, there are no interaction effects and we can say that the effect of the feature is modelled in an additive way (no surprise for the additive linear mixed-effects model).
R
(light_ice(fls, v = "log_sqft_living", n_max = 200, center = "middle") %>% plot(alpha = 0.05, color = "darkred") + labs(title = "Centered ICE plot", y = "log_price (shifted)")) %>% ggplotly()
Partial dependence plots

Averaging many uncentered ICE curves provides the famous partial dependence plot, introduced in Friedman’s seminal paper on gradient boosting machines (2001).
R
(light_profile(fls, v = "log_sqft_living", n_bins = 21) %>% plot(rotate_x = FALSE) + labs(title = "Partial dependence plot", y = y) + scale_colour_viridis_d(begin = 0.2, end = 0.8)) %>% ggplotly()
Partial dependence plots (png)

Multiple effects visualized together

The last figure extends the partial dependence plot with three additional curves, all evaluated on the hold-out dataset:
- Average observed values
- Average predictions
- ALE plot (“accumulated local effects”, an alternative to partial dependence plots with relaxed Ceteris Paribus assumption)
R
(light_effects(fls, v = "log_sqft_living", n_bins = 21) %>% plot(use = "all") + labs(title = "Different effect estimates", y = y) + scale_colour_viridis_d(begin = 0.2, end = 0.8)) %>% ggplotly()
Multiple effects together (png)

Conclusion

Combining flashlight with plotly works well and provides nice, interactive plots. Using rmarkdown, an analysis like this look quite neat if shipped as an HTML like this one here: https://mayer79.github.io/flashlight_plotly/flashlight_plotly.html

The rmarkdown script can be found here on github.
2022-04-25
Random Forests with Monotonic Constraints
Lost in Translation between R and Python 7

Hello random forest friends

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

Monotonic constraints

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different. There, the majority of models do not serve as pure prediction machines but rather as fruitful source of information. Furthermore, even if used as prediction machine, the users of the models might expect a certain degree of consistency when “playing” with input values.

A classic example are statistical house appraisal models. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

Trees and monotonic constraints

Monotonicity constraints are especially simple to implement for decision trees. The rule is basically as follows:
If a monotonicity constraint would be violated by a split on feature X, it is rejected. (Or a large penalty is subtracted from the corresponding split gain.) This will imply monotonic behavior of predictions in X, keeping all other features fixed.

Tree ensembles like boosted trees or random forests will automatically inherit this property.

Boosted trees

Most implementations of boosted trees offer monotonicity constraints. Here is a selection:
- R & Python: XGBoost
- R & Python: LightGBM
- R & Python: CatBoost
- Python: Scikit-Learn histogram gradient booster
- R: gbm
What about random forests?

Unfortunately, the picture is completely different for random forests. At the time of writing, I am not aware of any random forest implementation in R or Python offering this useful feature.

Some options
1. Implement monotonic constrainted random forests from scratch.
2. Ask for this feature in existing implementations.
3. Be creative and use XGBoost to emulate random forests.
For the moment, let’s stick to option 3. In our last R <-> Python blog post, we demonstrated that XGBoost’s random forest mode works essentially as good as standard random forest implementations, at least in regression settings and using sensible defaults.

Warning: Be careful with imposing monotonicity constraints

Ask yourself: does the constraint really make sense for all possible values of other features? You will see that the answer is often “no”.

An example: If your house price model uses the features “number of rooms” and “living area”, then a monotonic constraint on “living area” might make sense (given any number of rooms), while such constraint would be non-sensical for the number of rooms. Why? Because having six rooms in a 1200 square feet home is not necessarily better than having just five rooms in an equally sized home.

Let’s try it out

We use a nice dataset containing information on over 20,000 sold houses in Kings County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

Some rows and columns from the Kings County house dataset.

The following R and Python codes
- fetch the data,
- prepare the ML setting,
- fit unconstrained XGBoost random forests using log sales price as response,
- and visualize the effect of log ground area by individual conditional expectation (ICE) curves.
An ICE curve for variable X shows how the prediction of one specific observation changes if the value of X changes. Repeating this for multiple observations gives an idea of the effect of X. The average over multiple ICE curves produces the famous partial dependent plot.
R

Python
library(farff) library(OpenML) library(dplyr) library(xgboost) set.seed(83454) rmse <- function(y, pred) { sqrt(mean((y-pred)^2)) } # Load King Country house prices dataset on OpenML # ID 42092, https://www.openml.org/d/42092 df <- getOMLDataSet(data.id = 42092)$data head(df) # Prepare df <- df %>% mutate( log_price = log(price), log_sqft_lot = log(sqft_lot), year = as.numeric(substr(date, 1, 4)), building_age = year - yr_built, zipcode = as.integer(as.character(zipcode)) ) # Define response and features y <- "log_price" x <- c("grade", "year", "building_age", "sqft_living", "log_sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode", "lat", "long", "condition", "waterfront") # random split ix <- sample(nrow(df), 0.8 * nrow(df)) y_test <- df[[y]][-ix] # Fit untuned, but good(!) XGBoost random forest dtrain <- xgb.DMatrix(data.matrix(df[ix, x]), label = df[ix, y]) params <- list( objective = "reg:squarederror", learning_rate = 1, num_parallel_tree = 500, subsample = 0.63, colsample_bynode = 1/3, reg_lambda = 0, max_depth = 20, min_child_weight = 2 ) system.time( # 25 s unconstrained <- xgb.train( params, data = dtrain, nrounds = 1, verbose = 0 ) ) pred <- predict(unconstrained, data.matrix(df[-ix, x])) # Test RMSE: 0.172 rmse(y_test, pred) # ICE curves via our flashlight package library(flashlight) pred_xgb <- function(m, X) predict(m, data.matrix(X[, x])) fl <- flashlight( model = unconstrained, label = "unconstrained", data = df[ix, ], predict_function = pred_xgb ) light_ice(fl, v = "log_sqft_lot", indices = 1:9, evaluate_at = seq(7, 11, by = 0.1)) %>% plot()

# Imports import numpy as np from sklearn.datasets import fetch_openml from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error from sklearn.inspection import PartialDependenceDisplay from xgboost import XGBRFRegressor # Fetch data from OpenML df = fetch_openml(data_id=42092, as_frame=True)["frame"] # Prepare data df = df.assign( year=lambda x: x.date.str[0:4].astype(int), zipcode=lambda x: x.zipcode.astype(int), log_sqft_lot=lambda x: np.log(x.sqft_lot), building_age=lambda x: x.year - x.yr_built, ) # Feature list xvars = [ "grade", "year", "building_age", "sqft_living", "log_sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode", "lat", "long", "condition", "waterfront", ] # Data split y_train, y_test, X_train, X_test = train_test_split( np.log(df["price"]), df[xvars], train_size=0.8, random_state=766 ) # Modeling - wall time: 39 seconds param_dict = dict( n_estimators=500, max_depth=20, learning_rate=1, subsample=0.63, colsample_bynode=1 / 3, reg_lambda=0, objective="reg:squarederror", min_child_weight=2, ) unconstrained = XGBRFRegressor(**param_dict).fit(X_train, y_train) # Test RMSE 0.176 pred = unconstrained.predict(X_test) print(f"RMSE: {mean_squared_error(y_test, pred, squared=False):.03f}") # ICE and PDP - wall time: 47 seconds PartialDependenceDisplay.from_estimator( unconstrained, X=X_train, features=["log_sqft_lot"], kind="both", subsample=20, random_state=1, )
Figure 1 (R output): ICE curves of log(ground area) for the first nine observations. Many non-monotonic parts are visible.

We clearly see many non-monotonic (and in this case counterintuitive) ICE curves.

What would a model give with monotonically increasing constraint on the ground area?
R

Python
# Monotonic increasing constraint (params$monotone_constraints <- 1 * (x == "log_sqft_lot")) system.time( # 179s monotonic <- xgb.train( params, data = dtrain, nrounds = 1, verbose = 0 ) ) pred <- predict(monotonic, data.matrix(df[-ix, x])) # Test RMSE: 0.176 rmse(y_test, pred) fl_m <- flashlight( model = monotonic, label = "monotonic", data = df[ix, ], predict_function = pred_xgb ) light_ice(fl_m, v = "log_sqft_lot", indices = 1:9, evaluate_at = seq(7, 11, by = 0.1)) %>% plot()

# One needs to pass the constraints as single string, which is rather ugly mc = "(" + ",".join([str(int(x == "log_sqft_lot")) for x in xvars]) + ")" print(mc) # Modeling - wall time 49 seconds constrained = XGBRFRegressor(monotone_constraints=mc, **param_dict) constrained.fit(X_train, y_train) # Test RMSE: 0.178 pred = constrained.predict(X_test) print(f"RMSE: {mean_squared_error(y_test, pred, squared=False):.03f}") # ICE and PDP - wall time 39 seconds PartialDependenceDisplay.from_estimator( constrained, X=X_train, features=["log_sqft_lot"], kind="both", subsample=20, random_state=1, )
Figure 2 (R output): ICE curves of the same observations as in Figure 1, but now with monotonic constraint. All curves are monotonically increasing.

We see:
1. It works! Each ICE curve in log(lot area) is monotonically increasing. This means that predictions are monotonically increasing in lot area, keeping all other feature values fixed.
2. The model performance is slightly worse. This is the price paid for receiving a more intuitive behaviour in an important feature.
3. In Python, both models take about the same time to fit (30-40 s on a 4 core i7 CPU laptop). Curiously, in R, the constrained model takes about six times longer to fit than the unconstrained one (170 s vs 30 s).
Summary
- Monotonic constraints help to create intuitive models.
- Unfortunately, as per now, native random forest implementations do not offer such constraints.
- Using XGBoost’s random forest mode is a temporary solution until native random forest implementations add this feature.
- Be careful to add too many constraints: does a constraint really make sense for all other (fixed) choices of feature values?
The Python notebook and R code can be found at:
- https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-11-13%20random%20forests%20monotone.ipynb
- https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-11-13%20random%20forests%20monotone.r
2021-11-13
Personal Highlights of Scikit-Learn 1.0
Yes! After more than 10 years, scikit-learn released its 1.0 version on 24 September 2021. In this post, I’d like to point out some personal highlights apart from the release highlights.

1. Feature Names

This one is listed in the release highlights, but deserves to be mentioned again.
```
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pandas as pd

df = pd.DataFrame({
    "pet": ["dog", "cat", "fish"],
    "age": [3, 7, 1],
    "noise": [-99, pd.NA, 1e-10],
    "target": [1, 0, 1],
})
y = df.pop("target")
X = df

preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), ["age"]),
        ("categorical", OneHotEncoder(), ["pet"]),
    ],
    verbose_feature_names_out=False,
    remainder="drop",
)

pipe = make_pipeline(preprocessor, LogisticRegression())
pipe.fit(X, y)
pipe[:-1].get_feature_names_out()
```
```
array(['age', 'pet_cat', 'pet_dog', 'pet_fish'], dtype=object)
```
This is not yet available for all estimators and transformers, but it is a big step towards SLEP007.

2. ColumnTransformer allows changed order of columns

Before this release, ColumnTransformer recorded the order of the columns of a dataframe during the fit method and required that a dataframe X passed to transform had the exact same columns and in the exact same order.

This was a big pain point in productive settings because fit and predict of a model pipeline, both calling transform, often get data from different sources, and, for instance, SQL does not care about the order of columns. On top, remainder=”drop” forced you to have also all dropped columns in transform. This contradicted at least my modelling workflow as I often specify all meaningful features explicitly and drop the rest by the remainder option. This then led to unwanted surprises when applying predict to new data in the end. It might also happen, that one forgets to remove the target variable from the training X and relies on the drop option. Usually, the application of the predictive model pipeline is on new data without the target variable. The error in this case, however, might be considered a good thing.

With pull request (PR) #19263, the ColumnTransformer only cares about the presence and names of the columns, not about their order. With remainder=”drop”, it only cares about the specified columns and ignores all other columns, even no matter if the dropped ones are different in fit and transform. Note that this only works with pandas dataframes as input (or an object that quacks alike).
```
df_new = pd.DataFrame({
    "age": [1, 9, 3],
    "another_noise": [pd.NA, -99, 1e-10],
    "pet": ["cat", "dog", "fish"],
})
pipe.predict(df_new)
```
You find these little code snippets as notebook at the usual place: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-10-21%20scikit-learn_v1_release_highlights.ipynb.

3. Poisson criterion for random forests

Scikit-learn v0.24 shipped with the new option criterion=”poisson” for DecisionTreeRegressor to split nodes based on the reduction of Poisson deviance. Version 1.0 passed this option further to the RandomForestRegressor in PR #19836. Random forests are often used models and valued for their ease of use. We even like to write blog posts about them:
- Strong Random Forests with XGBoost
- Feature Subsampling For Random Forest Regression
The Poisson splitting criterion has its place when modelling counts or frequencies. It allows for non-negative values to be modelled, but forbids non-positive predictions. This corresponds to $y_{train} \geq 0$ and $y_{predict} > 0$ .

4. The best example/tutorial of the year

It’s not visible from the release notes, but this deserves to be noted. PR #20281 added a fantastic example, more like a tutorial, on time-related feature engineering. You find a lot of interesting features, some of them shipped with the 1.0 release, e.g. time base cross validation, generation of cyclic b-splines and adding pairwise interactions to a linear model, usage of native categorical features in the HistGradientBoostingRegressor…

Take a look for yourself at this wonderful example.
2021-10-21