Category: Statistics

  • Quantiles And Their Estimation

    Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will give my best to provide a good mix of theory and practical examples.

    While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box-plots, in childrens’ weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide solvency II framework for insurance companies, in quality control and in many more fields.

    Definitions

    Often, one talks about quantiles, but rarely defines them. In what fallows, I borrow from Gneiting (2011).

    Definition 1: Quantile

    Given a cumulative probability distribution (CDF) F(x)=\mathbb{P}(X\leq x), the quantile at level \alpha \in (0,1) (ɑ-quantile for short), q_\alpha(F), is defined as

    \begin{equation*}
    q_\alpha(F) = \{x: \lim_{y\uparrow x} F(y)\leq \alpha \leq F(x)\} \,.
    \end{equation*}

    The inequalities of this definition are called coverage conditions. It is very important to note that quantiles are potentially set valued. Another way to write this set is as an interval:

    \begin{align*}
    q_\alpha(F) &\in [q_\alpha^-(F), q_\alpha^+(F)]
    \\
    q_\alpha^-(F) &= \sup\{x:F(x) < \alpha\} = \inf\{x:F(x) \geq \alpha\}
    \\
    q_\alpha^+(F) &= \inf\{x:F(x) > \alpha\} = \sup\{x:F(x) \leq \alpha\}
    \end{align*}

    For q_\alpha^-, we recover the usual quantile definition as the generalized inverse of F. But this is only one possible value. I will discuss examples of quantile intervals later on.

    To get acquainted a bit more, let’s plot the cumulative distribution function and the quantile function for some continuous distributions: Normal, Gamma and Pareto distribution. The parametrisation is such that all have mean 2, Normal and Gamma have variance 4, and Pareto has an infinite variance. For those continuous and strictly increasing distributions, all quantiles are unique, and therefore simplify to the inverse CDF q_\alpha^- in the above equations. Note that those three distributions have very different tail behaviour: The density of the Normal distribution has the famously fast decrease \propto e^{-x^2}, the Gamma density has an exponentially decreasing tail \propto e^{-x} and the Pareto density has a fat tail, i.e. an inverse power \propto \frac{1}{x^\alpha}.

    CDF (top) and quantile function (bottom) of several distributions: Normal N(\mu=2, \sigma^2=2)(left), Gamma Ga(\alpha=2, \beta=\frac{1}{2}) (middle) and Pareto Pa(\alpha=2)(right).

    There are at least two more equivalent ways to define quantiles. They will help us later to get a better visualisations.

    Definition 2: Quantile as minimiser

    Given a probability distribution F, the ɑ-quantile q_\alpha(F) is defined as any minimiser of the expected scoring function S

    \begin{align*}
    q_\alpha(F) &\in \argmin_x \mathbb{E}(S_\alpha(x, Y))
    \\
    S_\alpha(x, y) &= (\mathbf{1}_{x\geq y} - \alpha)(x - y)
    \end{align*}

    where the expectation is taken over Y \sim F.

    The scoring function or loss function S can be generalized to S_\alpha(x, y) = (\mathbb{1}_{x\geq y} – \alpha)(g(x) – g(y)) for any increasing function g, but the above version in definition 2 is by far the simplest one and coincides with the pinball loss used in quantile regression.

    This definition is super useful because it provides a tool to assess whether a given value really is a quantile. A plot will suffice.

    Having a definition in terms of a convex optimisation problem, there is another definition in terms of the first order condition of optimality. For continuous, strictly increasing distributions, this would be equivalent to setting the first derivative to zero. For our non-smooth scoring function with potentially set-valued solution, this gets more complicated, e.g. subdifferential or subgradients replacing derivatives. In the end, it amounts to a sign change of the expectation of the so called identification function V(x, y)=\mathbf{1}_{x\geq y}-\alpha.

    Empirical Distribution

    The empirical distribution provides an excellent example. Given a sample of n observations y_1, \ldots, y_n, the empirical distribution is given by F_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{x\geq y_i}. Let us start simple and take two observations y_1=1 and y_2=2. Plugging in this distribution in the definition 1 of quantiles gives the exact quantiles of this 2-point empirical CDF:

    \begin{equation}
    q_\alpha(F_2)=
    \begin{cases}
       1 &\text{if } \alpha<\frac{1}{2} \\
       [1, 2] &\text{if } \alpha=\frac{1}{2} \\
       2 &\text{else}
    \end{cases}
    \end{equation}

    Here we encounter the interval [1, 2] for \alpha=\frac{1}{2}. Again, I plot both the (empirical) CDF F_n and the quantiles.

    Empirical distribution function and exact quantiles of observations y=1 and y=2.

    In the left plot, the big dots unambiguously mark the values at x=1 and x=2. For the quantiles in the right plot, the vertical line at probability 0.5 really means that all those values between 1 and 2 are possible 0.5-quantiles, also known as median.

    If you wonder about the value 1 for quantiles of level smaller than 50%, the minimisation formulation helps. The following plot shows \mathbb{E}(S_\alpha(x, Y)) for \alpha=0.2 with a clear unique minimum at x=1.

    Expected scoring function (pinball loss) for \alpha=0.2 for the empirical CDF with observations 1 and 2.

    A note for the interested reader: The above empirical distribution is the same as the distribution of a Bernoulli random variable, except that the x-values are shifted, i.e. the Bernoulli random variables are canonically set to 0 and 1 instead of 1 and 2. Furthermore, there is a direct connection between quantiles and classification via the cost-weighted misclassification error, see Fissler, Lorentzen & Mayer (2022).

    Empirical Quantiles

    From the empirical CDF, it is only a small step to empirical quantiles. But what’s the difference anyway? While we saw the exact quantile of the empirical distribution, q_\alpha(F_n), an empirical or sample quantile estimate the true (population) quantile given a data sample, i.e. \hat{q}_\alpha(\{y_i\}) \approx q_\alpha(F).

    As an empirical CDF estimates the CDF of the true underlying (population) distribution, F_n=\hat{F} \approx F, one immediate way to estimate a quantile is:

    1. Estimate the CDF via the empirical CDF F_n.
    2. Use the exact quantile in analogy to Eq.(1) as an estimate.

    Very importantly, this is just one way to estimate quantiles from a sample. There are many, many more. Here is the outcome of the 20%-quantile of our tiny data sample y_1=1 and y_2=2.

    import numpy as np
    
    methods = [
        'inverted_cdf',
        'averaged_inverted_cdf',
        'closest_observation',
        'interpolated_inverted_cdf',
        'hazen',
        'weibull',
        'linear',
        'median_unbiased',
        'normal_unbiased',
        'nearest',
        'lower',
        'higher',
        'midpoint',
    ]
    alpha = 0.2
    for m in methods:
        estimate = np.quantile([1, 2], 0.2, method=m)
        print(f"{m:<25} {alpha}-quantile estimate = {estimate}")
    inverted_cdf              0.2-quantile estimate = 1
    averaged_inverted_cdf     0.2-quantile estimate = 1.0
    closest_observation       0.2-quantile estimate = 1
    interpolated_inverted_cdf 0.2-quantile estimate = 1.0
    hazen                     0.2-quantile estimate = 1.0
    weibull                   0.2-quantile estimate = 1.0
    linear                    0.2-quantile estimate = 1.2
    median_unbiased           0.2-quantile estimate = 1.0
    normal_unbiased           0.2-quantile estimate = 1.0
    nearest                   0.2-quantile estimate = 1
    lower                     0.2-quantile estimate = 1
    higher                    0.2-quantile estimate = 2
    midpoint                  0.2-quantile estimate = 1.5

    Note that the first 9 methods are the ones discussed in a famous paper of Hyndman & Fan (1996). The default method of both Python’s numpy.quantile and R’s quantile is linear, i.e. number 7 in Hyndman & Fan. Somewhat surprisingly, we observe that this default method is clearly biased in this case and overestimates the true quantile.

    For large sample sizes, the differences will get tiny and all methods converge finally to a true quantile, at least for continuous distributions. In order to assess the bias with small sample sizes for each method, I do a simulation. This is where the fun starts😄

    For all three selected distributions and for quantile levels 15%, 50% and 85%, I simulate 10000 random samples, each of sample size 10 and calculate the sample quantile. Then I take the mean over all 10000 simulations as well as the 5% and the 95% quantiles as a measure of uncertainty, i.e. 90% confidence intervals. After some coding, this results in the following plot. (I spare you the code at this point. You can find it in the linked notebook at the bottom of this post).

    Small sample bias (n=10) of different empirical quantile estimation methods (x-axis and color) based on 10000 simulations. Dots are the mean values, error bars cover a 90% confidence interval. The dotted horizontal line is the theoretical quantile value.
    left: Normal distribution; mid: Gamma distribution; right: Pareto distribution
    top: 15%-quantile; mid: 50%-quantile; bottom: 85%-quantile

    For the 15%-quantile, the default linear method always overestimates, but it does surprisingly well for the 85%-quantile of the Pareto distribution. Overall, I personally would prefer the median unbiased or the Hazen method. Interestingly, the Hazen method is one of the oldest, namely from Hazen (1914), and is the only one that fulfills all proposed properties of Hyndman & Fan, who propose the median unbiased method as default.

    Quantile Regression

    So far, the interest was in the quantile of a sample or distribution. To go one step further, one might ask for the conditional quantile of a response variable Y given some features or covariates X, q_\alpha(Y|X)=q_\alpha(F_{Y|X}). This is the realm of quantile regression as invented by Koenker & Bassett (1978). To stay within linear models, one simply replaces the squared error of ordinary least squares (OLS) by the pinball loss.

    Without going into any more details, I chose the Palmer’s penguin dataset and the body mass in gram as response variable, all the other variables as feature variables. And again, I model the 15%, 50% and 85% quantiles.

    import seaborn as sns
    
    df = sns.load_dataset("penguins").dropna()
    df
    speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
    0AdelieTorgersen39.118.7181.03750.0Male
    1AdelieTorgersen39.517.4186.03800.0Female
    2AdelieTorgersen40.318.0195.03250.0Female
    4AdelieTorgersen36.719.3193.03450.0Female
    5AdelieTorgersen39.320.6190.03650.0Male
    338GentooBiscoe47.213.7214.04925.0Female
    340GentooBiscoe46.814.3215.04850.0Female
    341GentooBiscoe50.415.7222.05750.0Male
    342GentooBiscoe45.214.8212.05200.0Female
    343GentooBiscoe49.916.1213.05400.0Male
    from sklearn.base import clone
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import QuantileRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, SplineTransformer
    
    y = df["body_mass_g"]
    X = df.drop(columns="body_mass_g")
    qr50 = Pipeline([
        ("column_transformer",
         ColumnTransformer([
             ("ohe", OneHotEncoder(drop="first"), ["species", "island", "sex"]),
             ("spline", SplineTransformer(n_knots=3, degree=2), ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]),
             
         ])
        ),
        ("quantile_regressor",
         QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
        )
    ])
    qr15 = clone(qr50)
    qr15.set_params(quantile_regressor__quantile=0.15)
    qr85 = clone(qr50)
    qr85.set_params(quantile_regressor__quantile=0.85)
    
    qr15.fit(X, y)
    qr50.fit(X, y)
    qr85.fit(X, y)

    This code snippet gives the three fitted linear quantile models. That was the easy part. I find it much harder to produce good visualisations.

    import polars as pl  # imported and used much earlier in the notebook
    
    df_obs = df.copy()
    df_obs["type"] = "observed"
    dfs = [pl.from_pandas(df_obs)]
    for m, name in [(qr15, "15%-q"), (qr50, "median"), (qr85, "85%-q")]:
        df_pred = df.copy()
        df_pred["type"] = "predicted_" + name
        df_pred["body_mass_g"] = m.predict(X)
        dfs.append(pl.from_pandas(df_pred))
    df_pred = pl.concat(dfs).to_pandas()
    
    sns.catplot(df_pred, x="species", y="body_mass_g", hue="type", alpha=0.5)
    Quantile regression estimates and observed values for species (x-axis).

    This plot per species seems suspicious. We would expect the 85%-quantile prediction in red to mostly be larger than the median (green), instead we detect several clusters. The reason behind it is the sex of the penguins which also enters the model. To demonstrate this fact, I plot the sex separately and make the species visible by the plot style. This time, I put the flipper length on the x-axis.

    sns.relplot(
        df_pred,
        x="flipper_length_mm",
        y="body_mass_g",
        hue="type",
        col="sex",
        style="species",
        alpha=0.5,
    )
    Quantile regression estimates and observed values vs flipper length (x-axis). Left: male; right: female

    This is a really nice plot:

    • One immediately observes the differences in sex on the left and right subplot.
    • The two clusters in each subplot can be well explained by the penguin species.
    • The 3 quantiles are now vertically in the expected order, 15% quantile in yellow the lowest, 85%-quantile in red the highest, the median in the middle, and some observations beyond those predicted quantiles.
    • The models are linear in flipper length with a positive, but slightly different slope per quantile level.

    As a final check, let us compute the coverage of each quantile model (in-sample).

    [
        np.mean(y <= qr15.predict(X)),
        np.mean(y <= qr50.predict(X)),
        np.mean(y <= qr85.predict(X)),
    ]
    [0.15015015015015015, 0.5225225225225225, 0.8618618618618619]

    Those numbers are close enough to 15%, 50% and 85%.

    As always, the full Python code is available as notebook: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2023-02-11%20empirical_quantiles.ipynb.

  • SHAP + XGBoost + Tidymodels = LOVE

    In this recent post, we have explained how to use Kernel SHAP for interpreting complex linear models. As plotting backend, we used our fresh CRAN package “shapviz“.

    “shapviz” has direct connectors to a couple of packages such as XGBoost, LightGBM, H2O, kernelshap, and more. Multiple times people asked me how to combine shapviz when the XGBoost model was fitted with Tidymodels. The workflow was not 100% clear to me as well, but the answer is actually very simple, thanks to Julia’s post where the plots were made with SHAPforxgboost, another cool package for visualization of SHAP values.

    Example with shiny diamonds

    Step 1: Preprocessing

    We first write the data preprocessing recipe and apply it to the data rows that we want to explain. In our case, its 1000 randomly sampled diamonds.

    library(tidyverse)
    library(tidymodels)
    library(shapviz)
    
    # Integer encode factors
    dia_recipe <- diamonds %>%
      recipe(price ~ carat + cut + clarity + color) %>% 
      step_integer(all_nominal())
    
    # Will explain THIS dataset later
    set.seed(2)
    dia_small <- diamonds[sample(nrow(diamonds), 1000), ]
    dia_small_prep <- bake(
      prep(dia_recipe), 
      has_role("predictor"),
      new_data = dia_small, 
      composition = "matrix"
    )
    head(dia_small_prep)
    
    #     carat cut clarity color
    #[1,]  0.57   5       4     4
    #[2,]  1.01   5       2     1
    #[3,]  0.45   1       4     3
    #[4,]  1.04   4       6     5
    #[5,]  0.90   3       6     4
    #[6,]  1.20   3       4     6
    

    Step 2: Fit Model

    The next step is to tune and build the model. For simplicity, we skipped the tuning part. Bad, bad 🙂

    # Just for illustration - in practice needs tuning!
    xgboost_model <- boost_tree(
      mode = "regression",
      trees = 200,
      tree_depth = 5,
      learn_rate = 0.05,
      engine = "xgboost"
    )
    
    dia_wf <- workflow() %>%
      add_recipe(dia_recipe) %>%
      add_model(xgboost_model)
    
    fit <- dia_wf %>%
      fit(diamonds)

    Step 3: SHAP Analysis

    We now need to call shapviz() on the fitted model. In order to have neat interpretations with the original factor labels, we not only pass the prediction data prepared in Step 1 via bake(), but also the original data structure.

    shap <- shapviz(extract_fit_engine(fit), X_pred = dia_small_prep, X = dia_small)
    
    sv_importance(shap, kind = "both", show_numbers = TRUE)
    sv_dependence(shap, "carat", color_var = "auto")
    sv_dependence(shap, "clarity", color_var = "auto")
    sv_force(shap, row_id = 1)
    sv_waterfall(shap, row_id = 1)
    
    Variable importance plot overlaid with SHAP summary beeswarms
    Dependence plot for carat. Note that clarity is shown with original labels, not only integers.
    Dependence plot for clarity. Note again that the x-scale uses the original factor levels, not the integer encoded values.
    Force plot of the first observation
    Waterfall plot for the first observation

    Summary

    Making SHAP analyses with XGBoost Tidymodels is super easy.

    The complete R script can be found here.

  • Dplyr-style without dplyr

    One of the reasons why we love the “dplyr” package: it plays so well together with the forward pipe operator `%>%` from the “magrittr” package. Actually, it is not a coincidence that both packages were released quite at the same time, in 2014.

    What does the pipe do? It puts the object on its left as the first argument into the function on its right: iris %>% head() is a funny way of writing head(iris). It helps to avoid long function chains like f(g(h(x))), or repeated assignments.

    In 2021 and version 4.1, R has received its native forward pipe operator |> so that we can write nice code like this:

    Imagine this without pipe…

    Since version 4.2, the piped object can be referenced by the underscore _, but just once for now, see an example below.

    To use the native pipe via CTRL-SHIFT-M in Posit/RStudio, tick this:

    Combined with the many great functions from the standard distribution of R, we can get a real “dplyr” feeling without even loading dplyr. Don’t get me wrong: I am a huge fan of the whole Tidyverse! But it is a great way to learn “Standard R”.

    Data chains

    Here a small selection of standard functions playing well together with the pipe: They take a data frame and return a modified data frame:

    • subset(): Select rows and columns of data frame
    • transform(): Add or overwrite columns in data frame
    • aggregate(): Grouped calculations
    • rbind(), cbind(): Bind rows/columns of data frame/matrix
    • merge(): Join data frames by key
    • head(), tail(): First/last few elements of object
    • reshape(): Transposition/Reshaping of data frame (no, I don’t understand the interface)
    library(ggplot2)  # Need diamonds
    
    # What does the native pipe do?
    quote(diamonds |> head())
    
    # OUTPUT
    # head(diamonds)
    
    # Grouped statistics
    diamonds |> 
      aggregate(cbind(price, carat) ~ color, FUN = mean)
    
    # OUTPUT
    #   color    price     carat
    # 1     D 3169.954 0.6577948
    # 2     E 3076.752 0.6578667
    # 3     F 3724.886 0.7365385
    # 4     G 3999.136 0.7711902
    # 5     H 4486.669 0.9117991
    # 6     I 5091.875 1.0269273
    # 7     J 5323.818 1.1621368
    
    # Join back grouped stats to relevant columns
    diamonds |> 
      subset(select = c(price, color, carat)) |> 
      transform(price_per_color = ave(price, color)) |> 
      head()
    
    # OUTPUT
    #   price color carat price_per_color
    # 1   326     E  0.23        3076.752
    # 2   326     E  0.21        3076.752
    # 3   327     E  0.23        3076.752
    # 4   334     I  0.29        5091.875
    # 5   335     J  0.31        5323.818
    # 6   336     J  0.24        5323.818
    
    # Plot transformed values
    diamonds |> 
      transform(
        log_price = log(price),
        log_carat = log(carat)
      ) |> 
      plot(log_price ~ log_carat, col = "chartreuse4", pch = ".", data = _)
    A simple scatterplot

    The plot does not look quite as sexy as “ggplot2”, but its a start.

    Other chains

    The pipe not only works perfectly with functions that modify a data frame. It also shines with many other functions often applied in a nested way. Here two examples:

    # Distribution of color within clarity
    diamonds |> 
      subset(select = c(color, clarity)) |> 
      table() |> 
      prop.table(margin = 2) |> 
      addmargins(margin = 1) |> 
      round(3)
    
    # OUTPUT
    # clarity
    # color      I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
    #     D   0.057 0.149 0.159 0.138 0.086 0.109 0.069 0.041
    #     E   0.138 0.186 0.186 0.202 0.157 0.196 0.179 0.088
    #     F   0.193 0.175 0.163 0.180 0.167 0.192 0.201 0.215
    #     G   0.202 0.168 0.151 0.191 0.263 0.285 0.273 0.380
    #     H   0.219 0.170 0.174 0.134 0.143 0.120 0.160 0.167
    #     I   0.124 0.099 0.109 0.095 0.118 0.072 0.097 0.080
    #     J   0.067 0.052 0.057 0.060 0.066 0.026 0.020 0.028
    #     Sum 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
    
    # Barplot from discrete column
    diamonds$color |> 
      table() |> 
      prop.table() |> 
      barplot(col = "chartreuse4", main = "Color")

    Wrap up

    • Piping is fun with and without dplyr.
    • It is a great motivation to learn standard R

    The complete R script can be found here.

  • Interpret Complex Linear Models with SHAP within Seconds

    A linear model with complex interaction effects can be almost as opaque as a typical black-box like XGBoost.

    XGBoost models are often interpreted with SHAP (Shapley Additive eXplanations): Each of e.g. 1000 randomly selected predictions is fairly decomposed into contributions of the features using the extremely fast TreeSHAP algorithm, providing a rich interpretation of the model as a whole. TreeSHAP was introduced in the Nature publication by Lundberg and Lee (2020).

    Can we do the same for non-tree-based models like a complex GLM or a neural network? Yes, but we have to resort to slower model-agnostic SHAP algorithms:

    In the limit, the two algorithms provide the same SHAP values.

    House prices

    We will use a great dataset with 14’000 house prices sold in Miami in 2016. The dataset was kindly provided by Prof. Steven Bourassa for research purposes and can be found on OpenML.

    The model

    We will model house prices by a Gamma regression with log-link. The model includes factors, linear components and natural cubic splines. The relationship of living area and distance to central district is modeled by letting the spline bases of the two features interact.

    library(OpenML)
    library(tidyverse)
    library(splines)
    library(doFuture)
    library(kernelshap)
    library(shapviz)
    
    raw <- OpenML::getOMLDataSet(43093)$data
    
    # Lump rare level 3 and log transform the land size
    prep <- raw %>%
      mutate(
        structure_quality = factor(structure_quality, labels = c(1, 2, 4, 4, 5)),
        log_landsize = log(LND_SQFOOT)
      )
    
    # 1) Build model
    xvars <- c("TOT_LVG_AREA", "log_landsize", "structure_quality",
               "CNTR_DIST", "age", "month_sold")
    
    fit <- glm(
      SALE_PRC ~ ns(log(CNTR_DIST), df = 4) * ns(log(TOT_LVG_AREA), df = 4) +
        log_landsize + structure_quality + ns(age, df = 4) + ns(month_sold, df = 4),
      family = Gamma("log"),
      data = prep
    )
    summary(fit)
    
    # Selected coefficients:
    # log_landsize: 0.22559  
    # structure_quality4: 0.63517305 
    # structure_quality5: 0.85360956   
    

    The model has 37 parameters. Some of the estimates are shown.

    Interpretation

    The workflow of a SHAP analysis is as follows:

    1. Sample 1000 rows to explain
    2. Sample 100 rows as background data used to estimate marginal expectations
    3. Calculate SHAP values. This can be done fully in parallel by looping over the rows selected in Step 1
    4. Analyze the SHAP values

    Step 2 is the only additional step compared with TreeSHAP. It is required both for SHAP sampling values and Kernel SHAP.

    # 1) Select rows to explain
    set.seed(1)
    X <- prep[sample(nrow(prep), 1000), xvars]
    
    # 2) Select small representative background data
    bg_X <- prep[sample(nrow(prep), 100), ]
    
    # 3) Calculate SHAP values in fully parallel mode
    registerDoFuture()
    plan(multisession, workers = 6)  # Windows
    # plan(multicore, workers = 6)   # Linux, macOS, Solaris
    
    system.time( # <10 seconds
      shap_values <- kernelshap(
        fit, X, bg_X = bg_X, parallel = T, parallel_args = list(.packages = "splines")
      )
    )

    Thanks to parallel processing and some implementation tricks, we were able to decompose 1000 predictions within 10 seconds! By default, kernelshap() uses exact calculations up to eight features (exact regarding the background data), which would need an infinite amount of Monte-Carlo-sampling steps.

    Note that glm() has a very efficient predict() function. GAMs, neural networks, random forests etc. usually take more time, e.g. 5 minutes to do the crunching.

    Analyze the SHAP values

    # 4) Analyze them
    sv <- shapviz(shap_values)
    
    sv_importance(sv, show_numbers = TRUE) +
      ggtitle("SHAP Feature Importance")
    
    sv_dependence(sv, "log_landsize")
    sv_dependence(sv, "structure_quality")
    sv_dependence(sv, "age")
    sv_dependence(sv, "month_sold")
    sv_dependence(sv, "TOT_LVG_AREA", color_var = "auto")
    sv_dependence(sv, "CNTR_DIST", color_var = "auto")
    
    # Slope of log_landsize: 0.2255946
    diff(sv$S[1:2, "log_landsize"]) / diff(sv$X[1:2, "log_landsize"])
    
    # Difference between structure quality 4 and 5: 0.2184365
    diff(sv$S[2:3, "structure_quality"])
    SHAP Importance: Living area and the distance to the central district are the two most important predictors. The month (within 2016) impacts the predicted prices by +-1.3% on average.
    SHAP dependence plot of “log_landsize”. The effect is linear. The slope 0.22559 agrees with the model coefficient.
    Dependence plot for “structure_quality”: The difference between structure quality 4 and 5 is 0.2184365. This equals the difference in regression coefficients.
    Dependence plot of “living_area”: The effect is very steep. The more central, the steeper. We cannot easily compare these numbers with the output of the linear regression.

    Summary

    • Interpreting complex linear models with SHAP is an option. There seems to be a correspondence between regression coefficients and SHAP dependence, at least for additive components.
    • Kernel SHAP in R is fast. For models with slower predict() functions (e.g. GAMs, random forests, or neural nets), we often need to wait a couple of minutes.

    The complete R script can be found here.

  • The Unfairness of AI Fairness

    Fairness in Artificial Intelligence (AI) and Machine Learning (ML) is a recent and hot topic. As ML models are used in insurance pricing, the fairness topic also applies there. Just last month, Lindholm, Richman, Tsanakas and Wüthrich published a discussion paper on this subject that sheds new light on established AI fairness criteria. This post provides a short summary of this discussion paper with a few comments of my own. I recommend the interested reader to jump to the original: A Discussion of Discrimination and Fairness in Insurance Pricing.

    First of all, I’d like to state that fairness in the form of solidarity and risk sharing was always at the heart of insurance and, as such, is very very old. The recent discussions regarding fairness has a different focus. It comes with the rise of successful ML models that can easily make use of the information contained in large amounts of data (many feature variables). A statistician might just call that multivariate statistical models. Insurance pricing is a domain where ML models (including GLMs) are successfully applied for quite some time (at least since the 1990s), and where at the same time protected information like gender and ethnicity might be available in the data. This led the European Council to forbid gender in insurance pricing.

    The important point is—and here speaks the statistician again—that not using a certain features does in no way guarantee that this protected information is not used by a model. A car model or type, for instance, is correlated with the gender of the owner. This is called proxy discrimination.

    The brilliant idea of Lindholm et al. was to construct an example where a protected feature does not influence the actuarial best price. So, everyone would agree that this is a fair model. But it turns out that the most common (statistical) definitions of AI fairness all fail. All of them judge this best price model as unfair. To be explicit, the following three group fairness axioms were analysed:

    • Independence axiom / Statistical parity / Demographic parity
    • Separation axiom / Equalized odds / Disparate mistreatment
    • Sufficiency axiom / Predictive parity

    On top of that, these 3 fairness criteria may force different insurance companies to exclude different non-protected variables from their pricing models.

    How to conclude? It turns out that fairness is a complicated matter. It has many sociological, cultural and moral aspects. Apart from this broad spectrum, one particular challenge is to give precise mathematical definitions. This topic seems to be, as the paper suggests, open for discussion.

  • Kernel SHAP in R and Python

    Lost in Translation between R and Python 9

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    Kernel SHAP

    SHAP is one of the most used model interpretation technique in Machine Learning. It decomposes predictions into additive contributions of the features in a fair way. For tree-based methods, the fast TreeSHAP algorithm exists. For general models, one has to resort to computationally expensive Monte-Carlo sampling or the faster Kernel SHAP algorithm. Kernel SHAP uses a regression trick to get the SHAP values of an observation with a comparably small number of calls to the predict function of the model. Still, it is much slower than TreeSHAP.

    Two good references for Kernel SHAP:

    1. Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30, 2017.
    2. Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3457-3465, 2021.

    In our last post, we introduced our new “kernelshap” package in R. Since then, the package has been substantially improved, also by the big help of David Watson:

    1. The package now supports multi-dimensional predictions.
    2. It received a massive speed-up
    3. Additionally, parallel computing can be activated for even faster calculations.
    4. The interface has become more intuitive.
    5. If the number of features is small (up to ten or eleven), it can provide exact Kernel SHAP values just like the reference Python implementation.
    6. For a larger number of features, it now uses partly-exact (“hybrid”) calculations, very similar to the logic in the Python implementation.

    With those changes, the R implementation is about to meet the Python version at eye level.

    Example with four features

    In the following, we use the diamonds data to fit a linear regression with

    • log(price) as response
    • log(carat) as numeric feature
    • clarity, color and cut as categorical features (internally dummy encoded)
    • interactions between log(carat) and the other three “C” variables. Note that the interactions are very weak

    Then, we calculate SHAP decompositions for about 1000 diamonds (every 53th diamond), using 120 diamonds as background dataset. In this case, both R and Python will use exact calculations based on m=2^4 – 2 = 14 possible binary on-off vectors (a value of 1 representing a feature value picked from the original observation, a value of 0 a value picked from the background data).

    library(ggplot2)
    library(kernelshap)
    
    # Turn ordinal factors into unordered
    ord <- c("clarity", "color", "cut")
    diamonds[, ord] <- lapply(diamonds[ord], factor, ordered = FALSE)
    
    # Fit model
    fit <- lm(log(price) ~ log(carat) * (clarity + color + cut), data = diamonds)
    
    # Subset of 120 diamonds used as background data
    bg_X <- diamonds[seq(1, nrow(diamonds), 450), ]
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), c("carat", ord)]
    
    # Exact KernelSHAP (5 seconds)
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X)  
    )
    ks
    
    # SHAP values of first 2 observations:
    #          carat     clarity     color        cut
    # [1,] -2.050074 -0.28048747 0.1281222 0.01587382
    # [2,] -2.085838  0.04050415 0.1283010 0.03731644
    
    # Using parallel backend
    library("doFuture")
    
    registerDoFuture()
    plan(multisession, workers = 2)  # Windows
    # plan(multicore, workers = 2)   # Linux, macOS, Solaris
    
    # 3 seconds on second call
    system.time(
      ks3 <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    
    # Visualization
    library(shapviz)
    
    sv <- shapviz(ks)
    sv_importance(sv, "bee")
    import numpy as np
    import pandas as pd
    from plotnine.data import diamonds
    from statsmodels.formula.api import ols
    from shap import KernelExplainer
    
    # Turn categoricals into integers because, inconveniently, kernel SHAP
    # requires numpy array as input
    ord = ["clarity", "color", "cut"]
    x = ["carat"] + ord
    diamonds[ord] = diamonds[ord].apply(lambda x: x.cat.codes)
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color))", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)  # 74 seconds
    sv[0:2]
    
    # array([[-2.05007406, -0.28048747,  0.12812216,  0.01587382],
    #        [-2.0858379 ,  0.04050415,  0.12830103,  0.03731644]])
    SHAP summary plot (R model)

    The results match, hurray!

    Example with nine features

    The computation effort of running exact Kernel SHAP explodes with the number of features. For nine features, the number of relevant on-off vectors is 2^9 – 2 = 510, i.e. about 36 times larger than with four features.

    We now modify above example, adding five additional features to the model. Note that the model structure is completely non-sensical. We just use it to get a feeling about what impact a 36 times larger workload has.

    Besides exact calculations, we use an almost exact hybrid approach for both R and Python, using 126 on-off vectors (p*(p+1) for the exact part and 4p for the sampling part, where p is the number of features), resulting in a significant speed-up both in R and Python.

    fit <- lm(
      log(price) ~ log(carat) * (clarity + color + cut) + x + y + z + table + depth, 
      data = diamonds
    )
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), setdiff(names(diamonds), "price")]
    
    # Exact Kernel SHAP: 61 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, exact = TRUE)  
    )
    ks
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624
    
    # Default, using an almost exact hybrid algorithm: 17 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624
    x = ["carat"] + ord + ["table", "depth", "x", "y", "z"]
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color)) + table + depth + x + y + z", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values: 12 minutes
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])
    
    # Now, using a hybrid between exact and sampling: 5 minutes
    sv = ks.shap_values(X_small, nsamples=126)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])

    Again, the results are essentially the same between R and Python, but also between the hybrid algorithm and the exact algorithm. This is interesting, because the hybrid algorithm is significantly faster than the exact one.

    Wrap-Up

    • R is catching up with Python’s superb “shap” package.
    • For two non-trivial linear regressions with interactions, the “kernelshap” package in R provides the same output as Python.
    • The hybrid between exact and sampling KernelSHAP (as implemented in Python and R) offers a very good trade-off between speed and accuracy.
    • kernelshap()in R is fast!

    The Python and R codes can be found here:

    The examples were run on a Windows notebook with an Intel i7-8650U 4 core CPU.

  • Kernel SHAP

    Our last posts were on SHAP, one of the major ways to shed light into black-box Machine Learning models. SHAP values decompose predictions in a fair way into additive contributions from each feature. Decomposing many predictions and then analyzing the SHAP values gives a relatively quick and informative picture of the fitted model at hand.

    In their 2017 paper on SHAP, Scott Lundberg and Su-In Lee presented Kernel SHAP, an algorithm to calculate SHAP values for any model with numeric predictions. Compared to Monte-Carlo sampling (e.g. implemented in R package “fastshap”), Kernel SHAP is much more efficient.

    I had one problem with Kernel SHAP: I never really understood how it works!

    Then I found this article by Covert and Lee (2021). The article not only explains all the details of Kernel SHAP, it also offers an version that would iterate until convergence. As a by-product, standard errors of the SHAP values can be calculated on the fly.

    This article motivated me to implement the “kernelshap” package in R, complementing “shapr” that uses a different logic.

    The new “kernelshap” package in R

    The interface is quite simple: You need to pass three things to its main function kernelshap():

    • X: matrix/data.frame/tibble/data.table of observations to explain. Each column is a feature.
    • pred_fun: function that takes an object like X and provides one number per row.
    • bg_X: matrix/data.frame/tibble/data.table representing the background dataset used to calculate marginal expectation. Typically, between 100 and 200 rows.

    Example

    We will use Keras to build a deep learning model with 631 parameters on diamonds data. Then we decompose 500 predictions with kernelshap() and visualize them with “shapviz”.

    We will fit a Gamma regression with log link the four “C” features:

    • carat
    • color
    • clarity
    • cut
    library(tidyverse)
    library(keras)
    
    # Response and covariates
    y <- as.numeric(diamonds$price)
    X <- scale(data.matrix(diamonds[c("carat", "color", "cut", "clarity")]))
    
    # Input layer: we have 4 covariates
    input <- layer_input(shape = 4)
    
    # Two hidden layers with contracting number of nodes
    output <- input %>%
      layer_dense(units = 30, activation = "tanh") %>% 
      layer_dense(units = 15, activation = "tanh") %>% 
      layer_dense(units = 1, activation = k_exp)
    
    # Create and compile model
    nn <- keras_model(inputs = input, outputs = output)
    summary(nn)
    
    # Gamma regression loss
    loss_gamma <- function(y_true, y_pred) {
      -k_log(y_true / y_pred) + y_true / y_pred
    }
    
    nn %>% 
      compile(
        optimizer = optimizer_adam(learning_rate = 0.001),
        loss = loss_gamma
      )
    
    # Callbacks
    cb <- list(
      callback_early_stopping(patience = 20),
      callback_reduce_lr_on_plateau(patience = 5)
    )
    
    # Fit model
    history <- nn %>% 
      fit(
        x = X,
        y = y,
        epochs = 100,
        batch_size = 400, 
        validation_split = 0.2,
        callbacks = cb
      )
    
    history$metrics[c("loss", "val_loss")] %>% 
      data.frame() %>% 
      mutate(epoch = row_number()) %>% 
      filter(epoch >= 3) %>% 
      pivot_longer(cols = c("loss", "val_loss")) %>% 
    ggplot(aes(x = epoch, y = value, group = name, color = name)) +
      geom_line(size = 1.4)

    Interpretation via KernelSHAP

    In order to peak into the fitted model, we apply the Kernel SHAP algorithm to decompose 500 randomly selected diamond predictions. We use the same subset as background dataset required by the Kernel SHAP algorithm.

    Afterwards, we will study

    • Some SHAP values and their standard errors
    • One waterfall plot
    • A beeswarm summary plot to get a rough picture of variable importance and the direction of the feature effects
    • A SHAP dependence plot for carat
    # Interpretation on 500 randomly selected diamonds
    library(kernelshap)
    library(shapviz)
    
    sample(1)
    ind <- sample(nrow(X), 500)
    
    dia_small <- X[ind, ]
    
    # 77 seconds
    system.time(
      ks <- kernelshap(
        dia_small, 
        pred_fun = function(X) as.numeric(predict(nn, X, batch_size = nrow(X))), 
        bg_X = dia_small
      )
    )
    ks
    
    # Output
    # 'kernelshap' object representing 
    # - SHAP matrix of dimension 500 x 4 
    # - feature data.frame/matrix of dimension 500 x 4 
    # - baseline value of 3744.153
    # 
    # SHAP values of first 2 observations:
    #         carat     color       cut   clarity
    # [1,] -110.738 -240.2758  5.254733 -720.3610
    # [2,] 2379.065  263.3112 56.413680  452.3044
    # 
    # Corresponding standard errors:
    #         carat      color       cut  clarity
    # [1,] 2.064393 0.05113337 0.1374942 2.150754
    # [2,] 2.614281 0.84934844 0.9373701 0.827563
    
    sv <- shapviz(ks, X = diamonds[ind, x])
    sv_waterfall(sv, 1)
    sv_importance(sv, "both")
    sv_dependence(sv, "carat", "auto")

    Note the small standard errors of the SHAP values of the first two diamonds. They are only approximate because the background data is only a sample from an unknown population. Still, they give a good impression on the stability of the results.

    The waterfall plot shows a diamond with not super nice clarity and color, pulling down the value of this diamond. Note that, even if the model is working with scaled numeric feature values, the plot shows the original feature values.

    SHAP waterfall plot of one diamond. Note its bad clarity.

    The SHAP summary plot shows that “carat” is, unsurprisingly, the most important variable and that high carat mean high value. “cut” is not very important, except if it is extremely bad.

    SHAP summary plot with bars representing average absolute values as measure of importance.

    Our last plot is a SHAP dependence plot for “carat”: the effect makes sense, and we can spot some interaction with color. For worse colors (H-J), the effect of carat is a bit less strong as for the very white diamonds.

    Dependence plot for “carat”

    Short wrap-up

    • Standard Kernel SHAP in R, yeahhhhh 🙂
    • The Github version is relatively fast, so you can even decompose 500 observations of a deep learning model within 1-2 minutes.

    The complete R script can be found here.

  • shapviz goes H2O

    In a recent post, I introduced the initial version of the “shapviz” package. Its motto: do one thing, but do it well: visualize SHAP values.

    The initial community feedback was very positive, and a couple of things have been improved in version 0.2.0. Here the main changes:

    1. “shapviz” now works with tree-based models of the h2o package in R.
    2. Additionally, it wraps the shapr package, which implements an improved version of Kernel SHAP taking into account feature dependence.
    3. A simple interface to collapse SHAP values of dummy variables was added.
    4. The default importance plot is now a bar plot, instead of the (slower) beeswarm plot. In later releases, the latter might be moved to a separate function sv_summary() for consistency with other packages.
    5. Importance plot and dependence plot now work neatly with ggplotly(). The other plot types cannot be translated with ggplotly() because they use geoms from outside ggplot. At least I do not know how to do this…

    Example

    Let’s build an H2O gradient boosted trees model to explain diamond prices. Then, we explain the model with our “shapviz” package. Note that H2O itself also offers some SHAP plots. “shapviz” is directly applied to the fitted H2O model. This means you don’t have to write a single superfluous line of code.

    library(shapviz)
    library(tidyverse)
    library(h2o)
    
    h2o.init()
    
    set.seed(1)
    
    # Get rid of that darn ordinals
    ord <- c("clarity", "cut", "color")
    diamonds[, ord] <- lapply(diamonds[, ord], factor, ordered = FALSE)
    
    # Minimally tuned GBM with 260 trees, determined by early-stopping with CV
    dia_h2o <- as.h2o(diamonds)
    fit <- h2o.gbm(
      c("carat", "clarity", "color", "cut"),
      y = "price",
      training_frame = dia_h2o,
      nfolds = 5,
      learn_rate = 0.05,
      max_depth = 4,
      ntrees = 10000,
      stopping_rounds = 10,
      score_each_iteration = TRUE
    )
    fit
    
    # SHAP analysis on about 2000 diamonds
    X_small <- diamonds %>%
      filter(carat <= 2.5) %>%
      sample_n(2000) %>%
      as.h2o()
    
    shp <- shapviz(fit, X_pred = X_small)
    
    sv_importance(shp, show_numbers = TRUE)
    sv_importance(shp, show_numbers = TRUE, kind = "bee")
    sv_dependence(shp, "color", "auto", alpha = 0.5)
    sv_force(shp, row_id = 1)
    sv_waterfall(shp, row_id = 1)

    Summary and importance plots

    The SHAP importance and SHAP summary plots clearly show that carat is the most important variable. On average, it impacts the prediction by 3247 USD. The effect of “cut” is much smaller. Its impact on the predictions, on average, is plus or minus 112 USD.

    SHAP summary plot
    SHAP importance plot

    SHAP dependence plot

    The SHAP dependence plot shows the effect of “color” on the prediction: The better the color (close to “D”), the higher the price. Using a correlation based heuristic, the plot selected carat on the color scale to show that the color effect is hightly influenced by carat in the sense that the impact of color increases with larger diamond weight. This clearly makes sense!

    Dependence plot for “color”

    Waterfall and force plot

    Finally, the waterfall and force plots show how a single prediction is decomposed into contributions from each feature. While this does not tell much about the model itself, it might be helpful to explain what SHAP values are and to debug strange predictions.

    Waterfall plot
    Force plot

    Short wrap-up

    • Combining “shapviz” and H2O is fun. Okay, that one was subjective :-).
    • Good visualization of ML models is extremely helpful and reassuring.

    The complete R script can be found here.

  • Visualize SHAP Values without Tears

    SHAP (SHapley Additive exPlanations, Lundberg and Lee, 2017) is an ingenious way to study black box models. SHAP values decompose – as fair as possible – predictions into additive feature contributions.

    When it comes to SHAP, the Python implementation is the de-facto standard. It not only offers many SHAP algorithms, but also provides beautiful plots. In R, the situation is a bit more confusing. Different packages contain implementations of SHAP algorithms, e.g.,

    some of which with great visualizations. Plus there is SHAPforxgboost (see my recent post), originally designed to visualize the results of SHAP values calculated from XGBoost, but it can also be used more generally by now.

    The shapviz package

    In order to entangle calculation from visualization, the shapviz package was designed. It solely focuses on visualization of SHAP values. Closely following its README, it currently provides these plots:

    • sv_waterfall(): Waterfall plots to study single predictions.
    • sv_force(): Force plots as an alternative to waterfall plots.
    • sv_importance(): Importance plots (bar and/or beeswarm plots) to study variable importance.
    • sv_dependence(): Dependence plots to study feature effects (optionally colored by heuristically strongest interacting feature).

    They require a “shapviz” object, which is built from two things only:

    1. S: Matrix of SHAP values
    2. X: Dataset with corresponding feature values

    Furthermore, a “baseline” can be passed to represent an average prediction on the scale of the SHAP values.

    A key feature of the “shapviz” package is that X is used for visualization only. Thus it is perfectly fine to use factor variables, even if the underlying model would not accept these.

    To further simplify the use of shapviz, direct connectors to the packages

    are available.

    Installation

    The package shapviz can be installed from CRAN or Github:

    • devtools::install_github("shapviz")
    • devtools::install_github("mayer79/shapviz")

    Example

    Shiny diamonds… let’s model their prices by four “c” variables with XGBoost, and create an explanation dataset with 2000 randomly picked diamonds.

    library(shapviz)
    library(ggplot2)
    library(xgboost)
    
    set.seed(3653)
    
    X <- diamonds[c("carat", "cut", "color", "clarity")]
    dtrain <- xgb.DMatrix(data.matrix(X), label = diamonds$price)
    
    fit <- xgb.train(
      params = list(learning_rate = 0.1, objective = "reg:squarederror"), 
      data = dtrain,
      nrounds = 65L
    )
    
    # Explanation dataset
    X_small <- X[sample(nrow(X), 2000L), ]

    Create “shapviz” object

    One line of code creates a shapviz object. It contains SHAP values and feature values for the set of observations we are interested in. Note again that X is solely used as explanation dataset, not for calculating SHAP values.

    In this example we construct the shapviz object directly from the fitted XGBoost model. Thus we also need to pass a corresponding prediction dataset X_pred used for calculating SHAP values by XGBoost.

    shp <- shapviz(fit, X_pred = data.matrix(X_small), X = X_small)

    Explaining one single prediction

    Let’s start by explaining a single prediction by a waterfall plot or, alternatively, a force plot.

    # Two types of visualizations
    sv_waterfall(shp, row_id = 1)
    sv_force(shp, row_id = 1
    Waterfall plot

    Factor/character variables are kept as they are, even if the underlying XGBoost model required them to be integer encoded.

    Force plot

    Explaining the model as a whole

    We have decomposed 2000 predictions, not just one. This allows us to study variable importance at a global model level by studying average absolute SHAP values as a bar plot or by looking at beeswarm plots of SHAP values.

    # Three types of variable importance plots
    sv_importance(shp)
    sv_importance(shp, kind = "bar")
    sv_importance(shp, kind = "both", alpha = 0.2, width = 0.2)
    Beeswarm plot
    Bar plot
    Beeswarm plot overlaid with bar plot

    A scatterplot of SHAP values of a feature like color against its observed values gives a great impression on the feature effect on the response. Vertical scatter gives additional info on interaction effects. shapviz offers a heuristic to pick another feature on the color scale with potential strongest interaction.

    sv_dependence(shp, v = "color", "auto")
    Dependence plot with automatic interaction colorization

    Summary

    • The “shapviz” has a single purpose: making SHAP plots.
    • Its interface is optimized for existing SHAP crunching packages and can easily be used in future packages as well.
    • All plots are highly customizable. Furthermore, they are all written with ggplot and allow corresponding modifications.

    The complete R script can be found here.

    References

    Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30 (2017).

  • From Least Squares Benchmarks to the Marchenko–Pastur Distribution

    In this blog post, I tell the story how I learned about a theorem for random matrices of the two Ukrainian🇺🇦 mathematicians Vladimir Marchenko and Leonid Pastur. It all started with benchmarking least squares solvers in scipy.

    Setting the Stage for Least Squares Solvers

    Least squares starts with a matrix A \in \mathbb{R}^{n,p} and a vector b \in \mathbb{R}^{n} and one is interested in solution vectors x \in \mathbb{R}^{p} fulfilling

    x^\star = \argmin_x ||Ax-b||_2^2 \,.

    You can read more about least squares in our earlier post Least Squares Minimal Norm Solution. There are many possible ways to tackle this problem and many algorithms are available. One standard solver is LSQR with the following properties:

    • Iterative solver, which terminates when some stopping criteria are smaller than a user specified tolerance.
    • It only uses matrix-vector products. Therefore, it is suitable for large and sparse matrices A.
    • It effectively solves the normal equations A^T A = A^Tb based on a bidiagonalization procedure of Golub and Kahan (so never really calculating A^T A).
    • It is, unfortunately, susceptible to ill-conditioned matrices A, which we demonstrated in our earlier post.

    Wait, what is an ill-conditioned matrix? This is most easily explained with the help of the singular value decomposition (SVD). Any real valued matrix permits a decomposition into three parts:

    A = U S V^T \,.

    U and V are orthogonal matrices, but not of further interest to us. The matrix S on the other side is (rectangular) diagonal with only non-negative entries s_i = S_{ii} \geq 0. A matrix A is said to be ill-conditioned if it has a large condition number, which can be defined as the ratio of largest and smallest singular value, \mathrm{cond}(A) = \frac{\max_i s_i}{\min_i s_i} = \frac{s_{\mathrm{max}}}{s_{\mathrm{min}}}. Very often, large condition numbers make numerical computations difficult or less precise.

    Benchmarking LSQR

    One day, I decided to benchmark the computation time of least squares solvers provided by scipy, in particular LSQR. I wanted results for different sizes n, p of the matrix dimensions. So I needed to somehow generate different matrices A. There are a myriad ways to do that. Naive as I was, I did the most simple thing and used standard Normal (Gaussian) distributed random matrices A_{ij} \sim \mathcal{N}(0, 1) and ran benchmarks on those. Let’s see how that looks in Python.

    from collections import OrderedDict
    from functools import partial
    
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import lstsq
    from scipy.sparse.linalg import lsqr
    import seaborn as sns
    from neurtu import Benchmark, delayed
    
    plt.ion()
    
    p_list = [100, 500]
    rng = np.random.default_rng(42)
    X = rng.standard_normal(max(p_list) ** 2 * 2)
    y = rng.standard_normal(max(p_list) * 2)
    
    def bench_cases():    
        for p in p_list:
            for n in np.arange(0.1, 2.1, 0.1):
                n = int(p*n)
                A = X[:n*p].reshape(n, p)
                b = y[:n]
                for solver in ['lstsq', 'lsqr']:
                    tags = OrderedDict(n=n, p=p, solver=solver)
                    if solver == 'lstsq':
                        solve = delayed(lstsq, tags=tags)
                    elif solver == 'lsqr':
                        solve = delayed(
                            partial(
                                lsqr, atol=1e-10, btol=1e-10, iter_lim=1e6
                            ),
                            tags=tags)
    
                    yield solve(A, b)
    
    bench = Benchmark(wall_time=True)
    df = bench(bench_cases())
    
    g = sns.relplot(x='n', y='wall_time', hue='solver', col='p',
                kind='line', facet_kws={'sharex': False, 'sharey': False},
                data=df.reset_index(), marker='o')
    g.set_titles("p = {col_name}")
    g.set_axis_labels("n", "Wall time (s)")
    g.set(xscale="linear", yscale="log")
    
    plt.subplots_adjust(top=0.9)
    g.fig.suptitle('Benchmark LSQR vs LSTSQ')
    for ax in g.axes.flatten():
        ax.tick_params(labelbottom=True)
    Timing benchmark of LSQR and standard LAPACK LSTSQ solvers for different sizes of n on the x-axis and p=100 left, p=500 right.

    The left plot already looks a bit suspicious around n=p. But what is happening on the right side? Where does this spike of LSQR come from? And why does the standard least squares solver, SVD-based lstsq, not show this spike?

    When I saw these results, I thought something might be wrong with LSQR and opened an issue on the scipy github repository, see https://github.com/scipy/scipy/issues/11204. The community there is really fantastic. Brett Naul pointed me to ….

    The Marchenko–Pastur Distribution

    The Marchenko–Pastur distribution is the distribution of the eigenvalues (singular values of square matrices) of certain random matrices in the large sample limit. Given a random matrix A \in \mathbb{R}^{n,p} with i.i.d. entries A_{ij} having zero mean, \mathbb{E}[A_{ij}] = 0, and finite variance, \mathrm{Var}[A_{ij}] = \sigma^2 < \infty, we define the matrix Y_n = \frac{1}{n}A^T A \in \mathbb{R}^{p,p}. As square and even symmetric matrix, Y_n has a simpler SVD, namely Y_n = V \Sigma V^T. One can in fact show that V is the same as in the SVD of A and that the diagonal matrix \Sigma = \frac{1}{n}S^TS contains the squared singular values of A and \min(0, p-n) extra zero values. The (diagonal) values of \Sigma are called eigenvalues \lambda_1, \ldots, \lambda_p of Y_n. Note/ that the eigenvalues are themselves random variables, and we are interested in their probability distribution or probability measure. We define the (random) measure \mu_p(B) = \frac{1}{p} \#\{\lambda_j \in B\} for all intervals B \subset \mathbb{R}.

    The theorem of Marchenko and Pastur then states that for n, p \rightarrow \infty with \frac{p}{n} \rightarrow \rho , we have \mu_p \rightarrow \mu, where

    \mu(B) = \begin{cases} (1-\frac{1}{\rho})\mathbb{1}_{0\in B} + \nu(B),\quad &\rho > 1 \\ \nu(B), \quad & 0\leq\rho\leq 1 \end{cases} \,,
    \nu(x) = \frac{1}{2\pi\sigma^2} \frac{\sqrt{(\rho_+ - x)(x - \rho_-)}}{\rho x} \mathbb{1}_{x \in [\rho_-, \rho_+]} dx\,,
    \rho_{\pm} = \sigma^2 (1\pm\sqrt{\rho})^2 \,.

    We can at least derive the point mass at zero for \rho>1 \Leftrightarrow p>n: We said above that \Sigma contains p-n extra zeros and those correspond to a density of \frac{p-n}{p}=1 – \frac{1}{\rho} at zero.

    A lot of math so far. Just note that the assumptions on A are exactly met by the one in our benchmark above. Also note that the normal equations can be expressed in terms of Y_n as n Y_n x = A^Tb.

    Empirical Confirmation of the Marchenko–Pastur Distribution

    Before we come back to the spikes in our benchmark, let us have a look and see how good the Marchenko–Pastur distribution is approximated for finite sample size. We choose n=1000, p=500 which gives \rho=\frac{1}{2}. We plot a histrogram of the eigenvalues next to the Marchenko–Pastur distribution.

    def marchenko_pastur_mu(x, rho, sigma2=1):
        x = np.atleast_1d(x).astype(float)
        rho_p = sigma2 * (1 + np.sqrt(rho)) ** 2
        rho_m = sigma2 * (1 - np.sqrt(rho)) ** 2
        mu = np.zeros_like(x)
        is_nonzero = (rho_m < x) & (x < rho_p)
        x_valid = x[is_nonzero]
        factor = 1 / (2 * np.pi * sigma2 * rho)
        mu[is_nonzero] = factor / x_valid
        mu[is_nonzero] *= np.sqrt((rho_p - x_valid) * (x_valid - rho_m))
        if rho > 1:
            mu[x == 0] = 1 - 1 / rho
        return mu
    
    fig, ax = plt.subplots()
    
    n, p = 1000, 500
    A = X.reshape(n, p)
    Y = 1/n * A.T @ A
    eigenvals, _ = np.linalg.eig(Y)
    ax.hist(eigenvals.real, bins=50, density=True, label="histogram")
    x = np.linspace(0, np.max(eigenvals.real), 100)
    ax.plot(x, marchenko_pastur_mu(x, rho=p/n), label="MP distribution")
    ax.legend()
    ax.set_xlabel("eigenvalue")
    ax.set_ylabel("probability")
    ax.set_title("Empirical evidence for n=1000, p=500, rho=0.5")
    Hitogram of eigenvalues for n=100, p=500 versus Marchenko–Pastur distribution

    I have to say, I am very impressed by this good agreement for n=1000, which is far from being huge.

    Conclusion

    Let’s visualize the Marchenko–Pastur distribution Y_n for several ratios \rho and fix \sigma=1:

    fig, ax = plt.subplots()
    
    rho_list = [0.5, 1, 1.5]
    x = np.linspace(0, 5, 1000)[1:] # exclude 0
    for rho in rho_list:
        y = marchenko_pastur_mu(x, rho)
        line, = ax.plot(x, y, label=f"rho={rho}")
        # plot zero point mass
        if rho > 1:
            ax.scatter(0, marchenko_pastur_mu(0, rho), color = line.get_color())
    
    ax.set_ylim(None, 1.2)
    ax.legend()
    ax.set_title("Marchenko-Pastur Distribution")
    ax.set_xlabel("x")
    ax.set_ylabel("dmu/dx")
    Marchenko-Pastur distribution for several ratios rho. The green dot is the point mass at zero.

    From this figure it becomes obvious that the closer the ratio \rho = 1, the higher the probability for very tiny eigenvalues. This results in a high probability for an ill-conditioned matrix A^TA coming from an ill-conditioned matrix A. Let’s confirm that:

    p = 500
    n_vec = []
    c_vec = []
    for n in np.arange(0.1, 2.05, 0.05):
        n = int(p*n)
        A = X[:n*p].reshape(n, p)
        n_vec.append(n)
        c_vec.append(np.linalg.cond(A))
    
    fig, ax = plt.subplots()
    ax.plot(n_vec, c_vec)
    ax.set_xlabel("n")
    ax.set_ylabel("condition number of A")
    ax.set_title("Condition Number of A for p=500")
    Condition number of the random matrix A

    As a result of the ill-conditioned A, the LSQR solver has problems to achieve its tolerance criterion, needs more iterations, and takes longer time. This is exactly what we observed in the benchmark plots: the peak occurred around n=p. The SVD-based lstsq solver, on the other hand, does not use an iterative scheme and does not need more time for ill-conditioned matrices.

    You find the accompanying notebook here: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2022-05-22%20from%20least%20squares%20benchmarks%20to%20the%20marchenko-pastur%20distribution.ipynb

  • Let the flashlight shine with plotly

    There are different R packages devoted to model agnostic interpretability, DALEX and iml being among the best known. In 2019, I added flashlight 

    logo.png

    for a couple of reasons:

    1. Its explainers work with case weights.
    2. Multiple explainers can be combined to a multi-explainer.
    3. Stratified calculation is possible.

    Since almost all plots in flashlight are constructed with ggplot, it is super easy to turn them into interactive plotly objects: just add a simple ggplotly() to the end of the call.

    However… it is not straightforward to show interactive plots in a blog! Thus, we show only screenshots of the resulting plots here and refer to the complete HTML report here: https://mayer79.github.io/flashlight_plotly/flashlight_plotly.html

    We will use a sweet dataset with more than 20’000 houses to model house prices by a set of derived features such as the logarithmic living area. The location will be represented by the postal code.

    Data preparation

    We first load the data and prepare some of the columns for modeling. Furthermore, we specify the set of features and the response.

    library(dplyr)
    library(flashlight)
    library(plotly)
    library(ranger)
    library(lme4)
    library(moderndive)
    library(splitTools)
    library(MetricsWeighted)
    
    set.seed(4933)
    
    data("house_prices")
    
    prep <- house_prices %>% 
      mutate(
        log_price = log(price),
        log_sqft_living = log(sqft_living),
        log_sqft_lot = log(sqft_lot),
        log_sqft_basement = log1p(sqft_basement),
        year = as.numeric(format(date, '%Y')),
        age = year - yr_built
      )
    
    x <- c(
      "year", "age", "log_sqft_living", "log_sqft_lot", 
      "bedrooms", "bathrooms", "log_sqft_basement", 
      "condition", "waterfront", "zipcode"
    )
    
    y <- "log_price"
    
    head(prep[c(y, x)])
    
    ## # A tibble: 6 x 11
    ##   log_price  year   age log_sqft_living log_sqft_lot bedrooms bathrooms
    ##       <dbl> <dbl> <dbl>           <dbl>        <dbl>    <int>     <dbl>
    ## 1      12.3  2014    59            7.07         8.64        3      1   
    ## 2      13.2  2014    63            7.85         8.89        3      2.25
    ## 3      12.1  2015    82            6.65         9.21        2      1   
    ## 4      13.3  2014    49            7.58         8.52        4      3   
    ## 5      13.1  2015    28            7.43         9.00        3      2   
    ## 6      14.0  2014    13            8.60        11.5         4      4.5 
    ## # ... with 4 more variables: log_sqft_basement <dbl>, condition <fct>,
    ## #   waterfront <lgl>, zipcode <fct>

    Train / test split

    Then, we split the dataset into 80% training and 20% test rows, stratified on the (binned) response log_price.

    idx <- partition(prep[[y]], c(train = 0.8, test = 0.2), type = "stratified")
    
    train <- prep[idx$train, ]
    test <- prep[idx$test, ]

    Models

    We fit two models:

    1. A linear mixed model with random postal code effect.
    2. A random forest with 500 trees.
    # Mixed-effects model
    fit_lmer <- lmer(
      update(reformulate(x, "log_price"), . ~ . - zipcode + (1 | zipcode)),
      data = train
    )
    
    # Random forest
    fit_rf <- ranger(
      reformulate(x, "log_price"),
      always.split.variables = "zipcode",
      data = train
    )
    cat("R-squared OOB:", fit_rf$r.squared)
    ## R-squared OOB: 0.8463311

    Model inspection

    Now, we are ready to inspect our two models regarding performance, variable importance, and effects.

    Set up explainers

    First, we pack all model dependent information into flashlights (the explainer objects) and combine them to a multiflashlight. As evaluation dataset, we pass the test data. This ensures that interpretability tools using the response (e.g., performance measures and permutation importance) are not being biased by overfitting.

    fl_lmer <- flashlight(model = fit_lmer, label = "LMER")
    fl_rf <- flashlight(
      model = fit_rf,
      label = "RF",
      predict_function = function(mod, X) predict(mod, X)$predictions
    )
    fls <- multiflashlight(
      list(fl_lmer, fl_rf),
      y = "log_price",
      data = test,
      metrics = list(RMSE = rmse, `R-squared` = r_squared)
    )

    Model performance

    Let’s evaluate model RMSE and R-squared on the hold-out dataset. Here, the mixed-effects model performs a tiny little bit better than the random forest:

    (light_performance(fls) %>%
      plot(fill = "darkred") +
        labs(title = "Model performance", x = element_blank())) %>%
      ggplotly()
    Model performance (png)

    Permutation importance

    Next, we inspect the variable strength based on permutation importance. It shows by how much the RMSE is being increased when shuffling a variable before prediction. The results are quite similar between the two models.

    (light_importance(fls, v = x) %>%
        plot(fill = "darkred") +
        labs(title = "Permutation importance", y = "Drop in RMSE")) %>%
      ggplotly()
    Variable importance (png)

    ICE plot

    To get an impression of the effect of the living area, we select 200 observations and profile their predictions with increasing (log) living area, keeping everything else fixed (Ceteris Paribus). These ICE (individual conditional expectation) plots are vertically centered in order to highlight potential interaction effects. If all curves coincide, there are no interaction effects and we can say that the effect of the feature is modelled in an additive way (no surprise for the additive linear mixed-effects model).

    (light_ice(fls, v = "log_sqft_living", n_max = 200, center = "middle") %>%
        plot(alpha = 0.05, color = "darkred") +
        labs(title = "Centered ICE plot", y = "log_price (shifted)")) %>%
      ggplotly()

    Partial dependence plots

    Averaging many uncentered ICE curves provides the famous partial dependence plot, introduced in Friedman’s seminal paper on gradient boosting machines (2001).

    (light_profile(fls, v = "log_sqft_living", n_bins = 21) %>%
        plot(rotate_x = FALSE) +
        labs(title = "Partial dependence plot", y = y) +
        scale_colour_viridis_d(begin = 0.2, end = 0.8)) %>%
      ggplotly()
    Partial dependence plots (png)

    Multiple effects visualized together

    The last figure extends the partial dependence plot with three additional curves, all evaluated on the hold-out dataset:

    • Average observed values
    • Average predictions
    • ALE plot (“accumulated local effects”, an alternative to partial dependence plots with relaxed Ceteris Paribus assumption)
    (light_effects(fls, v = "log_sqft_living", n_bins = 21) %>%
        plot(use = "all")  +
        labs(title = "Different effect estimates", y = y) +
        scale_colour_viridis_d(begin = 0.2, end = 0.8)) %>%
      ggplotly()
    Multiple effects together (png)

    Conclusion

    Combining flashlight with plotly works well and provides nice, interactive plots. Using rmarkdown, an analysis like this look quite neat if shipped as an HTML like this one here: https://mayer79.github.io/flashlight_plotly/flashlight_plotly.html

    The rmarkdown script can be found here on github.

  • X-Mas Tree with 10 Lines of R Code

    Besides the many negative aspects of going through a pandemic, there are also certain positive ones like having time to write short blog posts like this.

    This one picks up a topic that was intensively discussed a couple of years ago on Wolfram’s page: Namely that the damped sine wave

    f(t) = t sin(t)

    can be used to draw a Christmas tree. Throw in some 3D animation using the R package rgl and the tree begins to become virtual reality…

    Here is our version using just ten lines of R code:

    library(rgl)
    
    t <- seq(0, 100, by = 0.7)^0.6
    x <- t * c(sin(t), sin(t + pi))
    y <- t * c(cos(t), cos(t + pi))
    z <- -2 * c(t, t)
    color <- rep(c("darkgreen", "gold"), each = length(t))
    
    open3d(windowRect = c(100, 100, 600, 600), zoom = 0.9)
    bg3d("black")
    spheres3d(x, y, z, radius = 0.3, color = color)
    
    # On screen (skip if export)
    play3d(spin3d(axis = c(0, 0, 1), rpm = 4))
    
    # Export (requires 3rd party tool "ImageMagick" resp. magick-package)
    # movie3d(spin3d(axis = c(0, 0, 1), rpm = 4), duration = 30, dir = getwd())
    Exported as gif using magick

    Christian and me wish you a relaxing time over Christmas. Take care of the people you love and stay healthy and safe.

    Code and animation can be found on github.

  • Feature Subsampling For Random Forest Regression

    TLDR: The number of subsampled features is a main source of randomness and an important parameter in random forests. Mind the different default values across implementations.

    Randomness in Random Forests

    Random forests are very popular machine learning models. They are build from easily understandable and well visualizable decision trees and give usually good predictive performance without the need for excessive hyperparameter tuning. Some drawbacks are that they do not scale well to very large datasets and that their predictions are discontinuous on continuous features.

    A key ingredient for random forests is—no surprise here—randomness. The two main sources for randomness are:

    • Feature subsampling in every node split when fitting decision trees.
    • Row subsampling (bagging) of the training dataset for each decision tree.

    In this post, we want to investigate the first source, feature subsampling, with a special focus on regression problems on continuous targets (as opposed to classification).

    Feature Subsampling

    In his seminal paper, Leo Breiman introduced random forests and pointed out several advantages of feature subsamling per node split. We cite from his paper:

    The forests studied here consist of using randomly selected inputs or combinations of inputs at each node to grow each tree. The resulting forests give accuracy that compare favorably with Adaboost. This class of procedures has desirable characteristics:

    i Its accuracy is as good as Adaboost and sometimes better.

    ii It’s relatively robust to outliers and noise.

    iii It’s faster than bagging or boosting.

    iv It gives useful internal estimates of error, strength, correlation and variable importance.

    v It’s simple and easily parallelized.

    Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).

    Note the focus on comparing with Adaboost at that time and the, in today’s view, relatively small datasets used for empirical studies in this paper.

    If the input data as p number of features (columns), implementations of random forests usually allow to specify how many features to consider at each split:

    ImplementationLanguageParameterDefault
    scikit-learn RandomForestRegressorPythonmax_featuresp
    scikit-learn RandomForestClassifierPythonmax_featuressqrt(p)
    rangerRmtrysqrt(p)
    randomForest regressionRmtryp/3
    randomForest classificationRmtrysqrt(p)
    H2O regressionPython & Rmtriesp/3
    H2O classificationPython & Rmtriessqrt(p)

    Note that the default of scikit-learn for regression is surprising because it switches of the randomness from feature subsampling rendering it equal to bagged trees!

    While empirical studies on the impact of feature for good default choices focus on classification problems, see the literature review in Probst et al 2019, we consider a set of regression problems with continuous targets. Note that different results might be more related to different feature spaces than to the difference between classification and regression.

    The hyperparameters mtry, sample size and node size are the parameters that control the randomness of the RF. […]. Out of these parameters, mtry is most influential both according to the literature and in our own experiments. The best value of mtry depends on the number of variables that are related to the outcome.

    Probst, P. et al. “Hyperparameters and tuning strategies for random forest.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019): n. pag.

    Benchmarks

    We selected the following 13 datasets with regression problems:

    Datasetnumber of samplesnumber of used features p
    Allstate188,318130
    Bike_Sharing_Demand17,37912
    Brazilian_houses10’69212
    ames1’46079
    black_friday166’8219
    colleges7’06349
    delays_zurich_transport27’32717
    diamonds53’9406
    la_crimes1’468’82525
    medical_charges_nominal163’06511
    nyc-taxi-green-dec-2016581’83514
    particulate-matter-ukair-2017394’2999
    taxi581’83518

    Note that among those, there is no high dimensional dataset in the sense that p>number of samples.

    On these, we fitted the scikit-learn (version 0.24) RandomForestRegressor (within a short pipeline handling missing values) with default parameters. We used 5-fold cross validation with 4 different values max_feature=p/3 (blue), sqrt(p) (orange), 0.9 p (green), and p (red). Now, we show the mean squared error with uncertainty bars (± one standard deviation of cross validation splits), the lower the better.

    In addition, we also report the fit time of each (5-fold) fit in seconds, again the lower the better.

    Note that sqrt(p) is often smaller than p/3. With this in mind, this graphs show that fit time is about proportional to the number of features subsampled.

    Conclusion

    • The main tuning parameter of random forests is the number of features used for feature subsampling (max_features, mtry). Depending on the dataset, it has a relevant impact on the predictive performance.
    • The default of scikit-learn’s RandomForestRegressor seems odd. It produces bagged trees. This is a bit like using ridge regression with a zero penalty😉. However, it can be justified by our benchmarks above.

    The full code can be found here:

    https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-08-19%20random_forests_max_features.ipynb

  • SHAP Analysis in 9 Lines

    Hello ML world

    Recently, together with Yang Liu, we have been investing some time to extend the R package SHAPforxgboost. This package is designed to make beautiful SHAP plots for XGBoost models, using the native treeshap implementation shipped with XGBoost.

    Some of the new features of SHAPforxgboost

    • Added support for LightGBM models, using the native treeshap implementation for LightGBM. So don’t get tricked by the package name “SHAPforxgboost” :-).
    • The function shap.plot.dependence() has received the option to select the heuristically strongest interacting feature on the color scale, see last section for details.
    • shap.plot.dependence() now allows jitter and alpha transparency.
    • The new function shap.importance() returns SHAP importances without plotting them.
    • Added vignette with basic workflow to CRAN.
    • Added logo:
    logo.png

    An interesting alternative to calculate and plot SHAP values for different tree-based models is the treeshap package by Szymon Maksymiuk et al. Keep an eye on this one – it is actively being developed!

    What is SHAP?

    A couple of years ago, the concept of Shapely values from game theory from the 1950ies was discovered e.g. by Scott Lundberg as an interesting approach to explain predictions of ML models.

    The basic idea is to decompose a prediction in a fair way into additive contributions of features. Repeating the process for many predictions provides a brilliant way to investigate the model as a whole.

    The main resource on the topic is Scott Lundberg’s site. Besides this, I’d recomment to go through these two fantastic blog posts, even if you already know what SHAP values are:

    Illustration

    As an example, we will try to model log house prices of 20’000 sold houses in Kings County. The dataset is available e.g. on OpenML.org under ID 42092.

    Some rows and columns from the Kings County house dataset.

    Fetch and prepare data

    We start by downloading the data and preparing it for modelling.

    library(farff)
    library(OpenML)
    library(dplyr)
    library(xgboost)
    library(ggplot2)
    library(SHAPforxgboost)
    
    # Load King Country house prices dataset on OpenML
    # ID 42092, https://www.openml.org/d/42092
    df <- getOMLDataSet(data.id = 42092)$data
    head(df)
    
    # Prepare
    df <- df %>%
      mutate(
        log_price = log(price),
        log_sqft_lot = log(sqft_lot),
        year = as.numeric(substr(date, 1, 4)),
        building_age = year - yr_built,
        zipcode = as.integer(as.character(zipcode))
      )
    
    # Define response and features
    y <- "log_price"
    x <- c("grade", "year", "building_age", "sqft_living",
           "log_sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode",
           "lat", "long", "condition", "waterfront")
    
    # random split
    set.seed(83454)
    ix <- sample(nrow(df), 0.8 * nrow(df))

    Fit XGBoost model

    Next, we fit a manually tuned XGBoost model to the data.

    dtrain <- xgb.DMatrix(data.matrix(df[ix, x]),
                          label = df[ix, y])
    dvalid <- xgb.DMatrix(data.matrix(df[-ix, x]),
                          label = df[-ix, y])
    
    params <- list(
      objective = "reg:squarederror",
      learning_rate = 0.05,
      subsample = 0.9,
      colsample_bynode = 1,
      reg_lambda = 2,
      max_depth = 5
    )
    
    fit_xgb <- xgb.train(
      params,
      data = dtrain,
      watchlist = list(valid = dvalid),
      early_stopping_rounds = 20,
      print_every_n = 100,
      nrounds = 10000 # early stopping
    )

    The resulting model consists of about 600 trees and reaches a validation RMSE of 0.16. This means that about 2/3 of the predictions are within 16% of the observed price, using the empirical rule.

    Compact SHAP analysis

    ML models are rarely of any use without interpreting its results, so let’s use SHAP to peak into the model.

    The analysis includes a first plot with SHAP importances. Then, with decreasing importance, dependence plots are shown to get an impression on the effects of each feature.

    # Step 1: Select some observations
    X <- data.matrix(df[sample(nrow(df), 1000), x])
    
    # Step 2: Crunch SHAP values
    shap <- shap.prep(fit_xgb, X_train = X)
    
    # Step 3: SHAP importance
    shap.plot.summary(shap)
    
    # Step 4: Loop over dependence plots in decreasing importance
    for (v in shap.importance(shap, names_only = TRUE)) {
      p <- shap.plot.dependence(shap, v, color_feature = "auto", 
                                alpha = 0.5, jitter_width = 0.1) +
        ggtitle(v)
      print(p)
    }

    Some of the plots are shown below. The code actually produces all plots, see the corresponding html output on github.

    Figure 1: SHAP importance for XGBoost model. The results make intuitive sense. Location and size are among the strongest predictors.
    Figure 2: SHAP dependence for the second strongest predictor. The larger the living area, the higher the log price. There is not much vertical scatter, indicating that living area acts quite additively on the predictions on the log scale.
    Figure 3: SHAP dependence for a less important predictor. The effect of “condition” 4 vs 3 seems to depend on the zipcode (see the color). For some zipcodes, the condition does not have a big effect on the price, while for other zipcodes, the effect is clearly higher.

    Same workflow for LightGBM

    Let’s try out the SHAPforxgboost package with LightGBM.

    Note: LightGBM Version 3.2.1 on CRAN is not working properly under Windows. This will be fixed in the next release of LightGBM. As a temporary solution, you need to build it from the current master branch.

    library(lightgbm)
    
    dtrain <- lgb.Dataset(data.matrix(df[ix, x]),
                          label = df[ix, y])
    dvalid <- lgb.Dataset(data.matrix(df[-ix, x]),
                          label = df[-ix, y])
    
    params <- list(
      objective = "regression",
      learning_rate = 0.05,
      subsample = 0.9,
      reg_lambda = 2,
      num_leaves = 15
    )
    
    fit_lgb <- lgb.train(
      params,
      data = dtrain,
      valids = list(valid = dvalid),
      early_stopping_rounds = 20,
      eval_freq = 100,
      eval = "rmse",
      nrounds = 10000
    )

    Early stopping on the validation data selects about 900 trees as being optimal and results in a validation RMSE of also 0.16.

    SHAP analysis

    We use exactly the same short snippet to analyze the model by SHAP.

    X <- data.matrix(df[sample(nrow(df), 1000), x])
    shap <- shap.prep(fit_lgb, X_train = X)
    shap.plot.summary(shap)
    
    for (v in shap.importance(shap, names_only = TRUE)) {
      p <- shap.plot.dependence(shap, v, color_feature = "auto", 
                                alpha = 0.5, jitter_width = 0.1) +
        ggtitle(v)
      print(p)
    }

    Again, we only show some of the output and refer to the html of the corresponding rmarkdown. Overall, the model seems to be very similar to the one obtained by XGBoost.

    Figure 4: SHAP importance for LightGBM. By chance, the order of importance is the same as for XGBoost.
    Figure 5: The dependence plot for the living area also looks identical in shape than for the XGBoost model.

    How does the dependence plot selects the color variable?

    By default, Scott’s shap package for Python uses a statistical heuristic to colorize the points in the dependence plot by the variable with possibly strongest interaction. The heuristic used by SHAPforxgboost is slightly different and directly uses conditional variances. More specifically, the variable X on the x-axis as well as each other feature Z_k is binned into categories. Then, for each Z_k, the conditional variance across binned X and Z_k is calculated. The Z_k with the highest conditional variance is selected as the color variable.

    Note that the heuristic does not depend on “shap interaction values” in order to save time (and because these would not be available for LightGBM).

    The following simple example shows how/that it is working. First, a dataset is created and a model with three features and strong interaction between x1 and x2 is being fitted. Then, we look at the dependence plots to see if it is consistent with the model/data situation.

    n <- 1000
    
    set.seed(334)
    
    df <- data.frame(
      x1 = runif(n),
      x2 = runif(n),
      x3 = runif(n)
      ) %>% 
      mutate(
        y = x1 * x2 + x3 + runif(n)
      )
    x <- c("x1", "x2", "x3")
    dtrain <- lgb.Dataset(data.matrix(df[, x]),
                          label = df[, "y"])
    
    params <- list(
      objective = "regression",
      learning_rate = 0.05,
      subsample = 0.9,
      reg_lambda = 2,
      num_leaves = 15
    )
    
    fit_lgb <- lgb.train(
      params,
      data = dtrain,
      eval = "rmse",
      nrounds = 100
    )
    
    shap <- shap.prep(fit_lgb, X_train = data.matrix(df[, x]))
    shap.plot.summary(shap)
    
    shap.plot.dependence(shap, "x1", color_feature = "auto")
    shap.plot.dependence(shap, "x2", color_feature = "auto")
    shap.plot.dependence(shap, "x3", color_feature = "auto")

    Here the dependence plots for x1 and x3.

    Figure 6: The dependence plots for x1 shows a clear interaction effect with the color variable x2. This is as simulated in the data.
    Figure 7: The dependence plots for x3 does not show clear interaction effects, consistent with the data situation.

    The full R script and rmarkdown file of this post can be found on github.

  • Strong Random Forests with XGBoost

    Lost in Translation between R and Python 6

    Hello random forest friends

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    The last one was on diamond duplicates and grouped sampling.

    XGBoost’s random forests

    For sure, XGBoost is well known for its excellent gradient boosting trees implementation. Although less obvious, it is no secret that it also offers a way to fit single trees in parallel, emulating random forests, see the great explanations on the official XGBoost page. Still, there seems to exist quite some confusion on how to choose certain parameters in order to get good results. It is the aim of this post to clarify this.

    Also LightGBM offers a random forest mode. We will investigate it in a later post.

    Why would you want to use XGBoost to fit a random forest?

    1. Interaction & monotonic constraints are available for XGBoost, but typically not for random forest implementations. A separate post will follow to illustrate this in the random forest setting.
    2. XGBoost can natively deal with missing values in an elegant way, unlike many random forest algorithms.
    3. You can stick to the same data preparation pipeline.

    I had additional reasons in mind, e.g. using non-standard loss functions, but this did not turn out to work well. This is possibly due to the fact that XGBoost uses a quadratic approximation to the loss, which is exact only for the mean squared error loss (MSE).

    How to enable the ominous random forest mode?

    Following the official explanations, we would need to set

    • num_parallel_tree to the number of trees in the forest,
    • learning_rate and num_boost_round to 1.

    There are further valuable tips, e.g. to set row and column subsampling to values below one to resemble true random forests.

    Still, most of the regularization parameters of XGBoost tend to favour simple trees, while the idea of a random forest is to aggregate deep, overfitted trees. These regularization parameters have to be changed as well in order to get good results.

    So voila my suggestions.

    Suggestions for parameters

    • learning_rate=1 (see above)
    • num_boost_round=1 (see above)
      Has to be set in train(), not in the parameter list. It is called nrounds in R.
    • subsample=0.63
      A random forest draws a bootstrap sample to fit each tree. This means about 0.63 of the rows will enter one or multiple times into the model, leaving 37% out. While XGBoost does not offer such sampling with replacement, we can still introduce the necessary randomness in the dataset used to fit a tree by skipping 37% of the rows per tree.
    • colsample_bynode=floor(sqrt(m))/m
      Column subsampling per split is the main source of randomness in a random forest. A good default is usually to sample the square root of the number of features m or m/3. XGBoost offers different colsample_by* parameters, but it is important to sample per split resp. per node, not by tree. Otherwise, it might happen that important features are missing in a tree altogether, leading to overall bad predictions.
    • num_parallel_tree
      The number of trees. Native implementations of random forests usually use a default value between 100 and 500. The more, the better—but slower.
    • reg_lambda=0
      XGBoost uses a default L2 penalty of 1! This will typically lead to shallow trees, colliding with the idea of a random forest to have deep, wiggly trees. In my experience, leaving this parameter at its default will lead to extremely bad XGBoost random forest fits.
      Set it to zero or a value close to zero.
    • max_depth=20
      Random forests usually train very deep trees, while XGBoost’s default is 6. A value of 20 corresponds to the default in the h2o random forest, so let’s go for their choice.
    • min_child_weight=2
      The default of XGBoost is 1, which tends to be slightly too greedy in random forest mode. For binary classification, you would need to set it to a value close or equal to 0.

    Of course these parameters can be tuned by cross-validation, but one of the reasons to love random forests is their good performance even with default parameters.

    Compared to optimized random forests, XGBoost’s random forest mode is quite slow. At the cost of performance, choose

    • lower max_depth,
    • higher min_child_weight, and/or
    • smaller num_parallel_tree.

    Let’s try it out with regression

    We will use a nice house price dataset, consisting of information on over 20,000 sold houses in Kings County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    Some rows and columns from the Kings County house dataset.

    The following R resp. Python codes fetch the data, prepare the ML setting and fit a native random forest with good defaults. In R, we use the ranger package, in Python the implementation of scikit-learn.

    The response variable is the logarithmic sales price. A healthy set of 13 variables are used as features.

    library(farff)
    library(OpenML)
    library(dplyr)
    library(ranger)
    library(xgboost)
    
    set.seed(83454)
    
    rmse <- function(y, pred) {
      sqrt(mean((y-pred)^2))
    }
    
    # Load King Country house prices dataset on OpenML
    # ID 42092, https://www.openml.org/d/42092
    df <- getOMLDataSet(data.id = 42092)$data
    head(df)
    
    # Prepare
    df <- df %>%
      mutate(
        log_price = log(price),
        year = as.numeric(substr(date, 1, 4)),
        building_age = year - yr_built,
        zipcode = as.integer(as.character(zipcode))
    )
    
    # Define response and features
    y <- "log_price"
    x <- c("grade", "year", "building_age", "sqft_living",
           "sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode",
           "lat", "long", "condition", "waterfront")
    m <- length(x)
    
    # random split
    ix <- sample(nrow(df), 0.8 * nrow(df))
    
    # Fit untuned random forest
    system.time( # 3 s
      fit_rf <- ranger(reformulate(x, y), data = df[ix, ])
    )
    y_test <- df[-ix, y]
    
    # Test RMSE: 0.173
    rmse(y_test, predict(fit_rf, df[-ix, ])$pred)
    # object.size(fit_rf) # 180 MB
    # Imports
    import numpy as np
    import pandas as pd
    
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    
    
    def rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))
      
    # Fetch data from OpenML
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    print("Shape: ", df.shape)
    df.head()
    
    # Prepare data
    df = df.assign(
        year = lambda x: x.date.str[0:4].astype(int),
        zipcode = lambda x: x.zipcode.astype(int)
    ).assign(
        building_age = lambda x: x.year - x.yr_built,
    )
    
    # Feature list
    xvars = [
        "grade", "year", "building_age", "sqft_living", 
        "sqft_lot", "bedrooms", "bathrooms", "floors", 
        "zipcode", "lat", "long", "condition", "waterfront"
    ]
    
    # Data split
    y_train, y_test, X_train, X_test = train_test_split(
        np.log(df["price"]), df[xvars], 
        train_size=0.8, random_state=766
    )
    
    # Fit scikit-learn rf
    rf = RandomForestRegressor(
        n_estimators=500, 
        max_features="sqrt", 
        max_depth=20,
        n_jobs=-1, 
        random_state=104
    )
    
    rf.fit(X_train, y_train)  # Wall time 3 s
    
    # Test RMSE: 0.176
    print(f"RMSE: {rmse(y_test, rf.predict(X_test)):.03f}")

    Both in R and Python, the test RMSE is between 0.17 and 0.18, i.e. about 2/3 of the test predictions are within 18% of the observed value. Not bad!
    Note: The test performance depends on the split seed, so it does not make sense to directly compare the R and Python performance.

    With XGBoost’s random forest mode

    Now let’s try to reach the same performance with XGBoost’s random forest implementation using the above parameter suggestions.

    # Fit untuned, but good(!) XGBoost random forest
    dtrain <- xgb.DMatrix(data.matrix(df[ix, x]),
                          label = df[ix, y])
    
    params <- list(
      objective = "reg:squarederror",
      learning_rate = 1,
      num_parallel_tree = 500,
      subsample = 0.63,
      colsample_bynode = floor(sqrt(m)) / m,
      reg_lambda = 0,
      max_depth = 20,
      min_child_weight = 2
    )
    
    system.time( # 20 s
      fit_xgb <- xgb.train(
        params,
        data = dtrain,
        nrounds = 1,
        verbose = 0
      )
    )
    
    pred <- predict(fit_xgb, data.matrix(df[-ix, x]))
    
    # Test RMSE: 0.174
    rmse(y_test, pred)
    # xgb.save(fit_xgb, "xgb.model") # 140 MB
    import xgboost as xgb
    
    dtrain = xgb.DMatrix(X_train, label=y_train)
    
    m = len(xvars)
    
    params = dict(
        objective="reg:squarederror",
        learning_rate=1,
        num_parallel_tree=500,
        subsample=0.63,
        colsample_bynode=int(np.sqrt(m))/m,
        reg_lambda=0,
        max_depth=20,
        min_child_weight=2
    )
    
    rf_xgb = xgb.train(  # Wall time 34 s
        params, 
        dtrain, 
        num_boost_round=1
    )
    preds = rf_xgb.predict(xgb.DMatrix(X_test))
    
    # 0.177
    print(f"RMSE: {rmse(y_test, preds):.03f}")

    We see:

    • The performance of the XGBoost random forest is essentially as good as the native random forest implementations. And all this without any parameter tuning!
    • XGBoost is much slower than the optimized random forest implementations. If this is a problem, e.g. reduce the tree depth. In this example, Python takes almost twice as much time as R. No idea why!
      The timings were made on a usual 4 core i7 processor.
    • Disk space required to store the model objects is comparable between XGBoost and native random forest implementations.

    What if you would run the same model with XGBoost defaults?

    • With default reg_lambda=1:
      The performance would end up at a catastrophic RMSE of 0.35!
    • With default max_depth=6:
      The RMSE would be much worse (0.23) as well.
    • With colsample_bytree instead of colsample_bynode:
      The RMSE would deteriorate to 0.27.

    Thus: It is essential to set some values to a good “random forest” default!

    Does it always work that good?

    Definitively not in classification settings. However, in regression settings with the MSE loss, XGBoost’s random forest mode is often as accurate as native implementations.

    • Classification models
      In my experience, the XGBoost random forest mode does not work as good as a native random forest for classification, possibly due to the fact that it uses only an approximation to the loss function.
    • Other regression examples
      Using the setting of our last “R <–> Python” post (diamond duplicates and grouped sampling) and the same parameters as above, we get the following test RMSEs: With ranger (R code in link below): 0.1043, with XGBoost: 0.1042. Sweet!

    Wrap up

    • With the right default parameters, XGBoost’s random forest mode reaches similar performance on regression problems than native random forest packages. Without any tuning!
    • For losses other than MSE, it does not work so well.

    The Python notebook and R code can be found at: