Category: Programming

  • Interpret Complex Linear Models with SHAP within Seconds

    A linear model with complex interaction effects can be almost as opaque as a typical black-box like XGBoost.

    XGBoost models are often interpreted with SHAP (SHapley Additive exPlanations): each of, say, 1000 randomly selected predictions is fairly decomposed into contributions of the features using the extremely fast TreeSHAP algorithm, providing a rich interpretation of the model as a whole. TreeSHAP was introduced in the Nature Machine Intelligence publication by Lundberg et al. (2020).

    Can we do the same for non-tree-based models like a complex GLM or a neural network? Yes, but we have to resort to slower model-agnostic SHAP algorithms:

    • SHAP sampling values, based on Monte-Carlo sampling
    • Kernel SHAP

    In the limit, the two algorithms provide the same SHAP values.

    House prices

    We will use a great dataset with about 14,000 houses sold in Miami in 2016. The dataset was kindly provided by Prof. Steven Bourassa for research purposes and can be found on OpenML.

    The model

    We will model house prices by a Gamma regression with log-link. The model includes factors, linear components and natural cubic splines. The relationship of living area and distance to central district is modeled by letting the spline bases of the two features interact.

    library(OpenML)
    library(tidyverse)
    library(splines)
    library(doFuture)
    library(kernelshap)
    library(shapviz)
    
    raw <- OpenML::getOMLDataSet(43093)$data
    
    # Lump rare level 3 and log transform the land size
    prep <- raw %>%
      mutate(
        structure_quality = factor(structure_quality, labels = c(1, 2, 4, 4, 5)),
        log_landsize = log(LND_SQFOOT)
      )
    
    # 1) Build model
    xvars <- c("TOT_LVG_AREA", "log_landsize", "structure_quality",
               "CNTR_DIST", "age", "month_sold")
    
    fit <- glm(
      SALE_PRC ~ ns(log(CNTR_DIST), df = 4) * ns(log(TOT_LVG_AREA), df = 4) +
        log_landsize + structure_quality + ns(age, df = 4) + ns(month_sold, df = 4),
      family = Gamma("log"),
      data = prep
    )
    summary(fit)
    
    # Selected coefficients:
    # log_landsize: 0.22559  
    # structure_quality4: 0.63517305 
    # structure_quality5: 0.85360956   
    

    The model has 37 parameters. Some of the estimates are shown.

    Interpretation

    The workflow of a SHAP analysis is as follows:

    1. Sample 1000 rows to explain
    2. Sample 100 rows as background data used to estimate marginal expectations
    3. Calculate SHAP values. This can be done fully in parallel by looping over the rows selected in Step 1
    4. Analyze the SHAP values

    Step 2 is the only additional step compared with TreeSHAP. It is required both for SHAP sampling values and Kernel SHAP.
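    To make the role of the background data concrete, here is a minimal Python sketch of the four steps with a brute-force Shapley decomposition. It is not how kernelshap() works internally; the toy model, the simulated data and all function names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shap(predict, x, bg_X):
    """Exact SHAP values of one row x; expectations are averages over bg_X."""
    p = len(x)

    def value(subset):
        # Features in `subset` are fixed at x, all others vary over the background
        data = bg_X.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]
        return predict(data).mean()

    phi = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight of a coalition of this size
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for s in combinations(others, size):
                phi[j] += weight * (value(s + (j,)) - value(s))
    return phi


def toy_model(X):
    # Hypothetical model: one additive term and one interaction
    return 2 * X[:, 0] + X[:, 1] * X[:, 2]


rng = np.random.default_rng(1)
data = rng.normal(size=(500, 3))

X = data[rng.choice(500, size=5, replace=False)]           # Step 1: rows to explain
bg_X = data[rng.choice(500, size=100, replace=False)]      # Step 2: background data
S = np.array([exact_shap(toy_model, x, bg_X) for x in X])  # Step 3: SHAP values

# Step 4: per row, the SHAP values sum to prediction minus average background prediction
baseline = toy_model(bg_X).mean()
print(np.allclose(S.sum(axis=1), toy_model(X) - baseline))
```

    The brute-force loop visits all feature subsets, so it is only feasible for a handful of features. This is exactly why faster algorithms like Kernel SHAP are needed.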

    # 1) Select rows to explain
    set.seed(1)
    X <- prep[sample(nrow(prep), 1000), xvars]
    
    # 2) Select small representative background data
    bg_X <- prep[sample(nrow(prep), 100), ]
    
    # 3) Calculate SHAP values in fully parallel mode
    registerDoFuture()
    plan(multisession, workers = 6)  # Windows
    # plan(multicore, workers = 6)   # Linux, macOS, Solaris
    
    system.time( # <10 seconds
      shap_values <- kernelshap(
        fit, X, bg_X = bg_X, parallel = TRUE, parallel_args = list(.packages = "splines")
      )
    )

    Thanks to parallel processing and some implementation tricks, we were able to decompose 1000 predictions within 10 seconds! By default, kernelshap() uses exact calculations for up to eight features (exact with respect to the background data). Monte-Carlo sampling would need an infinite number of sampling steps to reach the same values.

    Note that glm() has a very efficient predict() function. GAMs, neural networks, random forests etc. usually take more time, e.g. 5 minutes to do the crunching.

    Analyze the SHAP values

    # 4) Analyze them
    sv <- shapviz(shap_values)
    
    sv_importance(sv, show_numbers = TRUE) +
      ggtitle("SHAP Feature Importance")
    
    sv_dependence(sv, "log_landsize")
    sv_dependence(sv, "structure_quality")
    sv_dependence(sv, "age")
    sv_dependence(sv, "month_sold")
    sv_dependence(sv, "TOT_LVG_AREA", color_var = "auto")
    sv_dependence(sv, "CNTR_DIST", color_var = "auto")
    
    # Slope of log_landsize: 0.2255946
    diff(sv$S[1:2, "log_landsize"]) / diff(sv$X[1:2, "log_landsize"])
    
    # Difference between structure quality 4 and 5: 0.2184365
    diff(sv$S[2:3, "structure_quality"])

    SHAP importance: Living area and the distance to the central district are the two most important predictors. The month (within 2016) impacts the predicted prices by ±1.3% on average.

    SHAP dependence plot of “log_landsize”: The effect is linear. The slope 0.22559 agrees with the model coefficient.

    Dependence plot for “structure_quality”: The difference between structure quality 4 and 5 is 0.2184365. This equals the difference in regression coefficients.

    Dependence plot of “TOT_LVG_AREA” (living area): The effect is very steep. The more central, the steeper. We cannot easily compare these numbers with the output of the linear regression.

    Summary

    • Interpreting complex linear models with SHAP is an option. There seems to be a correspondence between regression coefficients and SHAP dependence, at least for additive components.
    • Kernel SHAP in R is fast. For models with slower predict() functions (e.g. GAMs, random forests, or neural nets), we often need to wait a couple of minutes.

    The complete R script can be found here.

  • Histograms, Gradient Boosted Trees, Group-By Queries and One-Hot Encoding

    This post shows how filling histograms can be done in very different ways thereby connecting very different areas: from gradient boosted trees to SQL queries to one-hot encoding. Let’s jump into it!

    Modern gradient boosted trees (GBT) like LightGBM, XGBoost and the HistGradientBoostingRegressor of scikit-learn all use two techniques on top of standard gradient boosting:

    • 2nd order Taylor expansion of the loss which amounts to using gradients and hessians.
    • One histogram per feature: bin the feature and fill the histogram with the gradients and hessians.
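
    As a rough illustration of the first technique, the following sketch computes gradients and hessians of the binary log loss with respect to the raw prediction and uses them for one Newton step, i.e. minus the sum of gradients divided by the sum of hessians. Regularisation and learning rate are ignored, and the function names are made up for this example.

```python
import numpy as np


def logloss(y, raw):
    # Binary log loss as a function of the raw (log-odds) prediction
    return np.log1p(np.exp(raw)) - y * raw


def grad_hess(y, raw):
    # First and second derivative of the log loss with respect to raw
    prob = 1.0 / (1.0 + np.exp(-raw))
    return prob - y, prob * (1.0 - prob)


y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
raw = np.zeros_like(y)  # start with raw prediction 0, i.e. probability 0.5

g, h = grad_hess(y, raw)

# Minimising the 2nd order Taylor expansion of the summed loss gives the Newton step
step = -g.sum() / h.sum()
print(step)  # 0.4

print(logloss(y, raw).sum(), logloss(y, raw + step).sum())  # the loss decreases
```

    The sums of gradients and hessians are exactly the quantities stored per bin in the histograms of the second technique.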

    The filling of the histograms is often the bottleneck when fitting GBTs. While filling a single histogram is very fast, this operation is executed many times: for each boosting round, for each tree split and for each feature. This is the reason why GBT implementations have dedicated routines for it. We look into this operation from different angles.

    For the coming (I)Python code snippets to work (# %% indicates a new notebook cell), we need the following imports.

    import duckdb                    # v0.5.1
    import matplotlib.pyplot as plt  # v3.6.1
    from matplotlib.ticker import MultipleLocator
    import numpy as np               # v1.23.4
    import pandas as pd              # v1.5.0
    import pyarrow as pa             # v9.0.0
    import tabmat                    # v3.1.2
    
    from sklearn.ensemble._hist_gradient_boosting.histogram import (
        _build_histogram_root,
    )                                # v1.1.2
    from sklearn.ensemble._hist_gradient_boosting.common import (
      HISTOGRAM_DTYPE
    )

    Naive Histogram Visualisation

    As a starter, we create a small table with two columns: bin index and value of the hessian.

    def highlight(df):
        if df["bin"] == 0:
            return ["background-color: rgb(255, 128, 128)"] * len(df)
        elif df["bin"] == 1:
            return ["background-color: rgb(128, 255, 128)"] * len(df)
        else:
            return ['background-color: rgb(128, 128, 255)'] * len(df)
    
    df = pd.DataFrame({"bin": [0, 2, 1, 0, 1], "hessian": [1.5, 1, 2, 2.5, 3]})
    df.style.apply(highlight, axis=1)
       bin   hessian
    0    0  1.500000
    1    2  1.000000
    2    1  2.000000
    3    0  2.500000
    4    1  3.000000

    A histogram then sums up all the hessian values belonging to the same bin. The result looks like the following.

    Above table visualised as histogram
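
    For this small table, the summation per bin can be sketched with plain numpy. np.bincount is used here only for illustration; it is not one of the methods timed in this post.

```python
import numpy as np

bins = np.array([0, 2, 1, 0, 1])
hessians = np.array([1.5, 1.0, 2.0, 2.5, 3.0])

# Sum of hessians per bin: bin 0 -> 1.5 + 2.5, bin 1 -> 2 + 3, bin 2 -> 1
hist = np.bincount(bins, weights=hessians, minlength=3)
# Number of observations per bin
counts = np.bincount(bins, minlength=3)

print(hist)    # [4. 5. 1.]
print(counts)  # [2 2 1]
```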

    Dedicated Method

    We simulate filling the histogram of a single feature. To this end, we draw 1,000,000 random values for gradients and hessians as well as for the bin indices.

    import duckdb
    import pyarrow as pa
    import numpy as np
    import tabmat
    
    from sklearn.ensemble._hist_gradient_boosting.histogram import (
        _build_histogram_root,
    )
    from sklearn.ensemble._hist_gradient_boosting.common import HISTOGRAM_DTYPE
    
    
    rng = np.random.default_rng(42)
    n_obs = 1_000_000
    n_bins = 256
    binned_feature = rng.integers(0, n_bins, size=n_obs, dtype=np.uint8)
    gradients = rng.normal(size=n_obs).astype(np.float32)
    hessians = rng.lognormal(size=n_obs).astype(np.float32)

    Now we use the dedicated (and private!) single-threaded method _build_histogram_root from scikit-learn to fill a histogram.

    hist_root = np.zeros((1, n_bins), dtype=HISTOGRAM_DTYPE)
    %time _build_histogram_root(0, binned_feature, gradients, hessians, hist_root)
    # Wall time: 1.38 ms

    This executes in around 1.4 ms. This is quite fast. But again, imagine 100 boosting rounds with 10 tree splits on average and 100 features. This means this is done around 100,000 times and would therefore take roughly 2 minutes.

    Let’s have a look at the first 5 bins:

    hist_root[:, 0:5]
    array([[(-79.72386998, 6508.89500265, 3894),
            ( 37.98393589, 6460.63222205, 3998),
            ( 53.54256977, 6492.22722797, 3805),
            ( 21.19542398, 6797.34159299, 3928),
            ( 16.24716742, 6327.03757573, 3875)]],
          dtype=[('sum_gradients', '<f8'), ('sum_hessians', '<f8'), ('count', '<u4')])

    SQL Group-By Query

    Someone familiar with SQL and database queries might immediately see how this task can be formulated as SQL group-by-aggregate query. To demonstrate it on our simulated data, we use DuckDB as well as Apache Arrow (the file format as well as the Python library pyarrow). You can read more about DuckDB in our post DuckDB: Quacking SQL.

    # %%
    con = duckdb.connect()
    arrow_table = pa.Table.from_pydict(
        {
            "bin": binned_feature,
            "gradients": gradients,
            "hessians": hessians,
    })
    # Read data once to make timing fairer
    arrow_result = con.execute("SELECT * FROM arrow_table")
    
    # %%
    %%time
    arrow_result = con.execute("""
    SELECT
        bin as bin,
        SUM(gradients) as sum_gradients,
        SUM(hessians) as sum_hessians,
        COUNT(*) as count
    FROM arrow_table
    GROUP BY bin
    """).arrow()
    # Wall time: 6.52 ms

    On my laptop, this takes about 6.5 ms and, upon sorting, gives the same results:

    arrow_result.sort_by("bin").slice(length=5)
    pyarrow.Table
    bin: uint8
    sum_gradients: double
    sum_hessians: double
    count: int64
    ----
    bin: [[0,1,2,3,4]]
    sum_gradients: [[-79.72386997545254,37.98393589106854,53.54256977112527,21.195423980039777,16.247167424764484]]
    sum_hessians: [[6508.895002648234,6460.632222048938,6492.227227974683,6797.341592986137,6327.037575732917]]
    count: [[3894,3998,3805,3928,3875]]

    As we have the table as an Arrow table, we can stay within pyarrow:

    %%time
    arrow_result = arrow_table.group_by("bin").aggregate(
        [
            ("gradients", "sum"),
            ("hessians", "sum"),
            ("bin", "count"),
        ]
    )
    # Wall time: 10.8 ms

    The fact that DuckDB is faster than Arrow on this task might be due to the large effort invested in parallelised group-by operations, see their post Parallel Grouped Aggregation in DuckDB for more information.

    One-Hot encoded Matrix Multiplication

    I think it is very interesting that filling histograms can be written as a matrix multiplication! The trick is to view the feature as a categorical feature and use its one-hot encoded matrix representation. This blows up memory, of course. Note that one-hot encoding is usually encountered in generalized linear models (GLM), where it is used to incorporate nominal categorical features with no internal ordering into the design matrix.

    For our demonstration, we use a numpy index trick to construct the one-hot encoded matrix employing the fact that the binned feature already contains the right indices.

    # %%
    %%time
    m_OHE = np.eye(n_bins)[binned_feature].T
    vec = np.column_stack((gradients, hessians, np.ones_like(gradients)))
    # Wall time: 770 ms
    
    # %%
    %time result_ohe = m_OHE @ vec
    # Wall time: 199 ms
    
    # %%
    result_ohe[:5]
    array([[ -79.72386998, 6508.89500265, 3894.        ],
           [  37.98393589, 6460.63222205, 3998.        ],
           [  53.54256977, 6492.22722797, 3805.        ],
           [  21.19542398, 6797.34159299, 3928.        ],
           [  16.24716742, 6327.03757573, 3875.        ]])

    This is way slower, but, somewhat surprisingly, produces the same result.
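
    To see why the matrix product fills a histogram, it helps to look at a tiny case. The sketch below uses the five-row table from the beginning of the post and the same numpy index trick; the variable names are chosen for this example only.

```python
import numpy as np

bins = np.array([0, 2, 1, 0, 1])
hessians = np.array([1.5, 1.0, 2.0, 2.5, 3.0])

# Rows = bins, columns = observations; each column has a single 1
one_hot = np.eye(3)[bins].T
print(one_hot)
# [[1. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 1.]
#  [0. 1. 0. 0. 0.]]

# Each row of one_hot picks the observations of one bin and sums them up
vec = np.column_stack((hessians, np.ones_like(hessians)))
result = one_hot @ vec
print(result)
# [[4. 2.]
#  [5. 2.]
#  [1. 1.]]
```

    The first column holds the sum of hessians per bin, the second one the counts.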

    The one-hot encoded matrix is very sparse, with only one non-zero value per column, i.e. only one out of 256 (number of bins) values is non-zero. This structure can be exploited to reduce both CPU time as well as memory consumption, with the help of the package tabmat that was built to accelerate GLMs. Unfortunately, tabmat only provides a matrix-vector multiplication (and the sandwich product, of course), but no matrix-matrix multiplication. So we have to do a little extra work.

    # %%
    %time m_categorical = tabmat.CategoricalMatrix(cat_vec=binned_feature)
    # Wall time: 21.5 ms
    
    # %%
    # tabmat needs contiguous arrays with dtype = Python float = float64
    vec = np.asfortranarray(vec, dtype=float)
    
    # %%
    %%time
    tabmat_result = np.column_stack(
        (
            vec[:, 0] @ m_categorical,
            vec[:, 1] @ m_categorical,
            vec[:, 2] @ m_categorical,
        )
    )
    # Wall time: 4.82 ms
    
    # %%
    tabmat_result[0:5]
    array([[ -79.72386998, 6508.89500265, 3894.        ],
           [  37.98393589, 6460.63222205, 3998.        ],
           [  53.54256977, 6492.22722797, 3805.        ],
           [  21.19542398, 6797.34159299, 3928.        ],
           [  16.24716742, 6327.03757573, 3875.        ]])

    While the timing of this approach is quite good, the construction of a CategoricalMatrix requires more time than the matrix-vector multiplication.

    Conclusion

    In the end, the special (Cython) routine of scikit-learn is the fastest of our tested methods for filling histograms. The other GBT libraries have their own, even more specialised routines, which might be a reason for even faster fit times. What we learned in this post is that this seemingly simple task plays a crucial part in modern GBTs and can be accomplished by very different approaches. These different approaches uncover connections between algorithms of quite different domains.

    The full code as ipython notebook can be found at https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2022-10-31%20histogram-GBT-GroupBy-OHE.ipynb.

  • Kernel SHAP in R and Python

    Lost in Translation between R and Python 9

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    Kernel SHAP

    SHAP is one of the most used model interpretation techniques in Machine Learning. It decomposes predictions into additive contributions of the features in a fair way. For tree-based methods, the fast TreeSHAP algorithm exists. For general models, one has to resort to computationally expensive Monte-Carlo sampling or the faster Kernel SHAP algorithm. Kernel SHAP uses a regression trick to get the SHAP values of an observation with a comparably small number of calls to the predict function of the model. Still, it is much slower than TreeSHAP.

    Two good references for Kernel SHAP:

    1. Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30, 2017.
    2. Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3457-3465, 2021.

    In our last post, we introduced our new “kernelshap” package in R. Since then, the package has been substantially improved, also by the big help of David Watson:

    1. The package now supports multi-dimensional predictions.
    2. It received a massive speed-up.
    3. Additionally, parallel computing can be activated for even faster calculations.
    4. The interface has become more intuitive.
    5. If the number of features is small (up to ten or eleven), it can provide exact Kernel SHAP values just like the reference Python implementation.
    6. For a larger number of features, it now uses partly-exact (“hybrid”) calculations, very similar to the logic in the Python implementation.

    With those changes, the R implementation is about to meet the Python version at eye level.

    Example with four features

    In the following, we use the diamonds data to fit a linear regression with

    • log(price) as response
    • log(carat) as numeric feature
    • clarity, color and cut as categorical features (internally dummy encoded)
    • interactions between log(carat) and the other three “C” variables. Note that the interactions are very weak

    Then, we calculate SHAP decompositions for about 1000 diamonds (every 53rd diamond), using 120 diamonds as background dataset. In this case, both R and Python will use exact calculations based on m=2^4 – 2 = 14 possible binary on-off vectors (a value of 1 representing a feature value picked from the original observation, a value of 0 a value picked from the background data).
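
    The regression trick with on-off vectors can be sketched in a few lines of numpy. This is a simplified illustration under our own assumptions (toy model, simulated background data, hypothetical function names), not the implementation used by “kernelshap” or “shap”: all 2^p – 2 on-off vectors are enumerated, weighted with the Shapley kernel, and a weighted least-squares problem is solved under the constraint that the SHAP values sum to the prediction minus the baseline.

```python
from itertools import product
from math import comb

import numpy as np


def kernel_shap_exact(predict, x, bg_X):
    """Kernel SHAP for one row x, enumerating all 2^p - 2 on-off vectors."""
    p = len(x)

    def value(z):
        # z[j] = 1: take feature j from x; z[j] = 0: keep the background values
        data = np.where(np.asarray(z, dtype=bool), x, bg_X)
        return predict(data).mean()

    v0 = value(np.zeros(p))  # average background prediction (baseline)
    v1 = value(np.ones(p))   # prediction for x itself

    # All on-off vectors except "all off" and "all on"
    Z = np.array([z for z in product([0, 1], repeat=p) if 0 < sum(z) < p])
    size = Z.sum(axis=1)
    # Shapley kernel weights
    w = np.array([(p - 1) / (comb(p, int(s)) * s * (p - s)) for s in size])
    v = np.array([value(z) for z in Z])

    # Weighted least squares with constraint sum(phi) = v1 - v0
    A = Z.T @ (w[:, None] * Z)
    b = Z.T @ (w * (v - v0))
    A_inv_b = np.linalg.solve(A, b)
    A_inv_1 = np.linalg.solve(A, np.ones(p))
    phi = A_inv_b - A_inv_1 * (A_inv_b.sum() - (v1 - v0)) / A_inv_1.sum()
    return v0, phi


def toy_model(X):
    # Hypothetical model with four features and one interaction
    return 2 * X[:, 0] + X[:, 1] * X[:, 2] - X[:, 3]


rng = np.random.default_rng(0)
bg_X = rng.normal(size=(120, 4))
x = np.array([1.0, -0.5, 2.0, 0.3])

baseline, phi = kernel_shap_exact(toy_model, x, bg_X)
print(phi)
print(np.isclose(baseline + phi.sum(), toy_model(x[None, :])[0]))
```

    With four features, the list of on-off vectors has 14 rows, so the model is evaluated on 14 modified copies of the background data per explained observation.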

    library(ggplot2)
    library(kernelshap)
    
    # Turn ordinal factors into unordered
    ord <- c("clarity", "color", "cut")
    diamonds[, ord] <- lapply(diamonds[ord], factor, ordered = FALSE)
    
    # Fit model
    fit <- lm(log(price) ~ log(carat) * (clarity + color + cut), data = diamonds)
    
    # Subset of 120 diamonds used as background data
    bg_X <- diamonds[seq(1, nrow(diamonds), 450), ]
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), c("carat", ord)]
    
    # Exact KernelSHAP (5 seconds)
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X)  
    )
    ks
    
    # SHAP values of first 2 observations:
    #          carat     clarity     color        cut
    # [1,] -2.050074 -0.28048747 0.1281222 0.01587382
    # [2,] -2.085838  0.04050415 0.1283010 0.03731644
    
    # Using parallel backend
    library("doFuture")
    
    registerDoFuture()
    plan(multisession, workers = 2)  # Windows
    # plan(multicore, workers = 2)   # Linux, macOS, Solaris
    
    # 3 seconds on second call
    system.time(
      ks3 <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    
    # Visualization
    library(shapviz)
    
    sv <- shapviz(ks)
    sv_importance(sv, "bee")

    The same analysis in Python:

    import numpy as np
    import pandas as pd
    from plotnine.data import diamonds
    from statsmodels.formula.api import ols
    from shap import KernelExplainer
    
    # Turn categoricals into integers because, inconveniently, kernel SHAP
    # requires numpy array as input
    ord = ["clarity", "color", "cut"]
    x = ["carat"] + ord
    diamonds[ord] = diamonds[ord].apply(lambda x: x.cat.codes)
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color))", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)  # 74 seconds
    sv[0:2]
    
    # array([[-2.05007406, -0.28048747,  0.12812216,  0.01587382],
    #        [-2.0858379 ,  0.04050415,  0.12830103,  0.03731644]])

    SHAP summary plot (R model)

    The results match, hurray!

    Example with nine features

    The computation effort of running exact Kernel SHAP explodes with the number of features. For nine features, the number of relevant on-off vectors is 2^9 – 2 = 510, i.e. about 36 times larger than with four features.

    We now modify the above example, adding five additional features to the model. Note that the model structure is completely nonsensical. We just use it to get a feeling for the impact of a 36 times larger workload.

    Besides exact calculations, we use an almost exact hybrid approach for both R and Python, using 126 on-off vectors (p*(p+1) for the exact part and 4p for the sampling part, where p is the number of features), resulting in a significant speed-up both in R and Python.
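
    The numbers of on-off vectors quoted above are quickly verified. The split of p*(p+1) into vectors with 1, 2, p-2 and p-1 ones is our assumption about where the formula comes from (it only holds if p is at least 5); the function names are made up.

```python
from math import comb


def n_exact(p):
    # All on-off vectors except "all off" and "all on"
    return 2**p - 2


def n_hybrid(p):
    # Assumed exact part: vectors with 1, 2, p-2 or p-1 ones, i.e. p * (p + 1)
    exact_part = 2 * comb(p, 1) + 2 * comb(p, 2)
    # Sampling part as stated in the text
    sampling_part = 4 * p
    return exact_part + sampling_part


print(n_exact(4), n_exact(9))              # 14 510
print(round(n_exact(9) / n_exact(4), 1))   # 36.4
print(n_hybrid(9))                         # 126
```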

    fit <- lm(
      log(price) ~ log(carat) * (clarity + color + cut) + x + y + z + table + depth, 
      data = diamonds
    )
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), setdiff(names(diamonds), "price")]
    
    # Exact Kernel SHAP: 61 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, exact = TRUE)  
    )
    ks
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624
    
    # Default, using an almost exact hybrid algorithm: 17 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624

    The same analysis in Python:

    x = ["carat"] + ord + ["table", "depth", "x", "y", "z"]
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color)) + table + depth + x + y + z", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values: 12 minutes
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])
    
    # Now, using a hybrid between exact and sampling: 5 minutes
    sv = ks.shap_values(X_small, nsamples=126)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])

    Again, the results are essentially the same between R and Python, but also between the hybrid algorithm and the exact algorithm. This is interesting, because the hybrid algorithm is significantly faster than the exact one.

    Wrap-Up

    • R is catching up with Python’s superb “shap” package.
    • For two non-trivial linear regressions with interactions, the “kernelshap” package in R provides the same output as Python.
    • The hybrid between exact and sampling KernelSHAP (as implemented in Python and R) offers a very good trade-off between speed and accuracy.
    • kernelshap() in R is fast!

    The Python and R codes can be found here:

    The examples were run on a Windows notebook with an Intel i7-8650U 4 core CPU.

  • Kernel SHAP

    Our last posts were on SHAP, one of the major ways to shed light into black-box Machine Learning models. SHAP values decompose predictions in a fair way into additive contributions from each feature. Decomposing many predictions and then analyzing the SHAP values gives a relatively quick and informative picture of the fitted model at hand.

    In their 2017 paper on SHAP, Scott Lundberg and Su-In Lee presented Kernel SHAP, an algorithm to calculate SHAP values for any model with numeric predictions. Compared to Monte-Carlo sampling (e.g. implemented in R package “fastshap”), Kernel SHAP is much more efficient.

    I had one problem with Kernel SHAP: I never really understood how it works!

    Then I found this article by Covert and Lee (2021). The article not only explains all the details of Kernel SHAP, it also offers a version that iterates until convergence. As a by-product, standard errors of the SHAP values can be calculated on the fly.

    This article motivated me to implement the “kernelshap” package in R, complementing “shapr” that uses a different logic.

    The new “kernelshap” package in R

    The interface is quite simple: You need to pass three things to its main function kernelshap():

    • X: matrix/data.frame/tibble/data.table of observations to explain. Each column is a feature.
    • pred_fun: function that takes an object like X and provides one number per row.
    • bg_X: matrix/data.frame/tibble/data.table representing the background dataset used to calculate marginal expectation. Typically, between 100 and 200 rows.

    Example

    We will use Keras to build a deep learning model with 631 parameters on diamonds data. Then we decompose 500 predictions with kernelshap() and visualize them with “shapviz”.

    We will fit a Gamma regression with log link on the four “C” features:

    • carat
    • color
    • clarity
    • cut

    library(tidyverse)
    library(keras)
    
    # Response and covariates
    y <- as.numeric(diamonds$price)
    X <- scale(data.matrix(diamonds[c("carat", "color", "cut", "clarity")]))
    
    # Input layer: we have 4 covariates
    input <- layer_input(shape = 4)
    
    # Two hidden layers with contracting number of nodes
    output <- input %>%
      layer_dense(units = 30, activation = "tanh") %>% 
      layer_dense(units = 15, activation = "tanh") %>% 
      layer_dense(units = 1, activation = k_exp)
    
    # Create and compile model
    nn <- keras_model(inputs = input, outputs = output)
    summary(nn)
    
    # Gamma regression loss
    loss_gamma <- function(y_true, y_pred) {
      -k_log(y_true / y_pred) + y_true / y_pred
    }
    
    nn %>% 
      compile(
        optimizer = optimizer_adam(learning_rate = 0.001),
        loss = loss_gamma
      )
    
    # Callbacks
    cb <- list(
      callback_early_stopping(patience = 20),
      callback_reduce_lr_on_plateau(patience = 5)
    )
    
    # Fit model
    history <- nn %>% 
      fit(
        x = X,
        y = y,
        epochs = 100,
        batch_size = 400, 
        validation_split = 0.2,
        callbacks = cb
      )
    
    history$metrics[c("loss", "val_loss")] %>% 
      data.frame() %>% 
      mutate(epoch = row_number()) %>% 
      filter(epoch >= 3) %>% 
      pivot_longer(cols = c("loss", "val_loss")) %>% 
    ggplot(aes(x = epoch, y = value, group = name, color = name)) +
      geom_line(size = 1.4)
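
    The custom loss_gamma() above equals half the Gamma unit deviance plus a constant, so for a fixed observation it is minimal when the prediction equals the observed value. A small numpy sketch (a re-implementation for illustration only, with a made-up observed value) confirms this:

```python
import numpy as np


def loss_gamma(y_true, y_pred):
    # Same formula as the Keras loss above, written with numpy
    return -np.log(y_true / y_pred) + y_true / y_pred


y = 3000.0                                     # hypothetical observed price
candidates = np.linspace(1000.0, 6000.0, 501)  # candidate predictions in steps of 10
losses = loss_gamma(y, candidates)

best = candidates[np.argmin(losses)]
print(best)          # 3000.0
print(losses.min())  # 1.0, i.e. -log(1) + 1
```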

    Interpretation via KernelSHAP

    In order to peek into the fitted model, we apply the Kernel SHAP algorithm to decompose 500 randomly selected diamond predictions. We use the same subset as background dataset required by the Kernel SHAP algorithm.

    Afterwards, we will study

    • Some SHAP values and their standard errors
    • One waterfall plot
    • A beeswarm summary plot to get a rough picture of variable importance and the direction of the feature effects
    • A SHAP dependence plot for carat

    # Interpretation on 500 randomly selected diamonds
    library(kernelshap)
    library(shapviz)
    
    set.seed(1)
    ind <- sample(nrow(X), 500)
    
    dia_small <- X[ind, ]
    
    # 77 seconds
    system.time(
      ks <- kernelshap(
        dia_small, 
        pred_fun = function(X) as.numeric(predict(nn, X, batch_size = nrow(X))), 
        bg_X = dia_small
      )
    )
    ks
    
    # Output
    # 'kernelshap' object representing 
    # - SHAP matrix of dimension 500 x 4 
    # - feature data.frame/matrix of dimension 500 x 4 
    # - baseline value of 3744.153
    # 
    # SHAP values of first 2 observations:
    #         carat     color       cut   clarity
    # [1,] -110.738 -240.2758  5.254733 -720.3610
    # [2,] 2379.065  263.3112 56.413680  452.3044
    # 
    # Corresponding standard errors:
    #         carat      color       cut  clarity
    # [1,] 2.064393 0.05113337 0.1374942 2.150754
    # [2,] 2.614281 0.84934844 0.9373701 0.827563
    
    # Show the original (unscaled) feature values in the plots
    sv <- shapviz(ks, X = diamonds[ind, colnames(X)])
    sv_waterfall(sv, 1)
    sv_importance(sv, "both")
    sv_dependence(sv, "carat", "auto")

    Note the small standard errors of the SHAP values of the first two diamonds. They are only approximate because the background data is only a sample from an unknown population. Still, they give a good impression on the stability of the results.

    The waterfall plot shows a diamond with not super nice clarity and color, pulling down the value of this diamond. Note that, even though the model is working with scaled numeric feature values, the plot shows the original feature values.

    SHAP waterfall plot of one diamond. Note its bad clarity.
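
    A waterfall plot is just the additive decomposition read from baseline to prediction. Assuming the first explained diamond is the one printed in the output above, the arithmetic behind it can be sketched as follows (numbers copied from that output):

```python
baseline = 3744.153
shap_row = {
    "carat": -110.738,
    "color": -240.2758,
    "cut": 5.254733,
    "clarity": -720.3610,
}

# Start at the baseline and add the contributions, largest absolute value first
running = baseline
for feature, contribution in sorted(shap_row.items(), key=lambda kv: -abs(kv[1])):
    running += contribution
    print(f"{feature:>8}: {contribution:10.3f} -> {running:9.3f}")

# Baseline plus all SHAP values gives the model prediction for this diamond
prediction = baseline + sum(shap_row.values())
print(round(prediction, 2))  # 2678.03
```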

    The SHAP summary plot shows that “carat” is, unsurprisingly, the most important variable and that high carat means high value. “cut” is not very important, except if it is extremely bad.

    SHAP summary plot with bars representing average absolute values as measure of importance.

    Our last plot is a SHAP dependence plot for “carat”: the effect makes sense, and we can spot some interaction with color. For worse colors (H-J), the effect of carat is a bit weaker than for the very white diamonds.

    Dependence plot for “carat”

    Short wrap-up

    • Standard Kernel SHAP in R, yeahhhhh 🙂
    • The GitHub version is relatively fast, so you can even decompose 500 observations of a deep learning model within 1-2 minutes.

    The complete R script can be found here.

  • shapviz goes H2O

    In a recent post, I introduced the initial version of the “shapviz” package. Its motto: do one thing, but do it well: visualize SHAP values.

    The initial community feedback was very positive, and a couple of things have been improved in version 0.2.0. Here are the main changes:

    1. “shapviz” now works with tree-based models of the h2o package in R.
    2. Additionally, it wraps the shapr package, which implements an improved version of Kernel SHAP taking into account feature dependence.
    3. A simple interface to collapse SHAP values of dummy variables was added.
    4. The default importance plot is now a bar plot, instead of the (slower) beeswarm plot. In later releases, the latter might be moved to a separate function sv_summary() for consistency with other packages.
    5. Importance plot and dependence plot now work neatly with ggplotly(). The other plot types cannot be translated with ggplotly() because they use geoms from outside ggplot. At least I do not know how to do this…
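
    Item 3 above (collapsing SHAP values of dummy variables) works because SHAP values are additive: the contributions of all dummy columns of one factor can simply be summed per row. The following minimal Python sketch illustrates the idea only; the function name collapse_shap and the numbers are made up, and this is not the interface of “shapviz”.

```python
def collapse_shap(rows, groups):
    """Sum the SHAP values of dummy columns into one column per factor.

    rows:   list of dicts mapping column name -> SHAP value (one dict per observation)
    groups: dict mapping new column name -> list of dummy column names to be summed
    """
    grouped = {name for cols in groups.values() for name in cols}
    collapsed = []
    for row in rows:
        # Keep all columns that are not part of any group
        new_row = {k: v for k, v in row.items() if k not in grouped}
        # Additivity: the contribution of a factor is the sum over its dummies
        for new_name, cols in groups.items():
            new_row[new_name] = sum(row[c] for c in cols)
        collapsed.append(new_row)
    return collapsed


# Hypothetical SHAP values of two diamonds (one-hot encoded color)
S = [
    {"carat": 1200.0, "color_D": 150.0, "color_E": -20.0, "color_F": -10.0},
    {"carat": -800.0, "color_D": -30.0, "color_E": 90.0, "color_F": -5.0},
]
S_collapsed = collapse_shap(S, {"color": ["color_D", "color_E", "color_F"]})
print(S_collapsed)
```

    The row sums of the SHAP values are unchanged by collapsing, so all plots remain consistent with the predictions.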

    Example

    Let’s build an H2O gradient boosted trees model to explain diamond prices. Then, we explain the model with our “shapviz” package. Note that H2O itself also offers some SHAP plots. “shapviz” is directly applied to the fitted H2O model. This means you don’t have to write a single superfluous line of code.

    library(shapviz)
    library(tidyverse)
    library(h2o)
    
    h2o.init()
    
    set.seed(1)
    
    # Get rid of those darn ordinals
    ord <- c("clarity", "cut", "color")
    diamonds[, ord] <- lapply(diamonds[, ord], factor, ordered = FALSE)
    
    # Minimally tuned GBM with 260 trees, determined by early-stopping with CV
    dia_h2o <- as.h2o(diamonds)
    fit <- h2o.gbm(
      c("carat", "clarity", "color", "cut"),
      y = "price",
      training_frame = dia_h2o,
      nfolds = 5,
      learn_rate = 0.05,
      max_depth = 4,
      ntrees = 10000,
      stopping_rounds = 10,
      score_each_iteration = TRUE
    )
    fit
    
    # SHAP analysis on about 2000 diamonds
    X_small <- diamonds %>%
      filter(carat <= 2.5) %>%
      sample_n(2000) %>%
      as.h2o()
    
    shp <- shapviz(fit, X_pred = X_small)
    
    sv_importance(shp, show_numbers = TRUE)
    sv_importance(shp, show_numbers = TRUE, kind = "bee")
    sv_dependence(shp, "color", "auto", alpha = 0.5)
    sv_force(shp, row_id = 1)
    sv_waterfall(shp, row_id = 1)

    Summary and importance plots

    The SHAP importance and SHAP summary plots clearly show that carat is the most important variable. On average, it impacts the prediction by 3247 USD. The effect of “cut” is much smaller. Its impact on the predictions, on average, is plus or minus 112 USD.

    SHAP summary plot
    SHAP importance plot

    SHAP dependence plot

    The SHAP dependence plot shows the effect of “color” on the prediction: The better the color (close to “D”), the higher the price. Using a correlation-based heuristic, the plot selected carat on the color scale to show that the color effect is highly influenced by carat in the sense that the impact of color increases with larger diamond weight. This clearly makes sense!

    Dependence plot for “color”

    Waterfall and force plot

    Finally, the waterfall and force plots show how a single prediction is decomposed into contributions from each feature. While this does not tell much about the model itself, it might be helpful to explain what SHAP values are and to debug strange predictions.

    Waterfall plot
    Force plot

    Short wrap-up

    • Combining “shapviz” and H2O is fun. Okay, that one was subjective :-).
    • Good visualization of ML models is extremely helpful and reassuring.

    The complete R script can be found here.

  • Visualize SHAP Values without Tears

    SHAP (SHapley Additive exPlanations, Lundberg and Lee, 2017) is an ingenious way to study black box models. SHAP values decompose – as fairly as possible – predictions into additive feature contributions.
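
    To make “additive feature contributions” concrete, here is a minimal Python sketch that calculates exact Shapley values by brute force for a toy model with three features. The model, the background data and all names are invented for illustration. Real SHAP implementations such as TreeSHAP or Kernel SHAP use much faster algorithms, so read this as a definition in code rather than as a practical method.

```python
from itertools import permutations
from statistics import mean


def predict(x):
    # Toy model with an interaction between features 1 and 2 (made up)
    return 2.0 * x[0] + x[1] * x[2] + 1.0


# Small background dataset used to "switch off" features (made up)
background = [(0.0, 1.0, 2.0), (1.0, 0.0, 1.0), (2.0, 2.0, 0.0), (1.0, 1.0, 1.0)]


def value(x, coalition):
    # Average prediction when the features in `coalition` are fixed at x
    # and the remaining features are taken from each background row
    return mean(
        predict(tuple(x[j] if j in coalition else row[j] for j in range(len(x))))
        for row in background
    )


def shapley_values(x):
    # Average the marginal contribution of each feature over all feature orderings
    p = len(x)
    phi = [0.0] * p
    orders = list(permutations(range(p)))
    for order in orders:
        coalition = set()
        for j in order:
            before = value(x, coalition)
            coalition.add(j)
            phi[j] += value(x, coalition) - before
    return [v / len(orders) for v in phi]


x = (2.0, 1.0, 3.0)
phi = shapley_values(x)
baseline = value(x, set())  # average prediction on the background data

# Additivity: baseline + sum of SHAP values equals the prediction
print(phi, baseline + sum(phi), predict(x))
```

    The last line shows the key property used by all plots discussed below: the baseline plus the sum of the SHAP values reproduces the prediction.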

    When it comes to SHAP, the Python implementation is the de-facto standard. It not only offers many SHAP algorithms, but also provides beautiful plots. In R, the situation is a bit more confusing. Different packages contain implementations of SHAP algorithms, e.g.,

    some of which come with great visualizations. Plus there is SHAPforxgboost (see my recent post), originally designed to visualize the results of SHAP values calculated from XGBoost, but it can also be used more generally by now.

    The shapviz package

    In order to disentangle calculation from visualization, the shapviz package was designed. It solely focuses on visualization of SHAP values. Closely following its README, it currently provides these plots:

    • sv_waterfall(): Waterfall plots to study single predictions.
    • sv_force(): Force plots as an alternative to waterfall plots.
    • sv_importance(): Importance plots (bar and/or beeswarm plots) to study variable importance.
    • sv_dependence(): Dependence plots to study feature effects (optionally colored by heuristically strongest interacting feature).

    They require a “shapviz” object, which is built from two things only:

    1. S: Matrix of SHAP values
    2. X: Dataset with corresponding feature values

    Furthermore, a “baseline” can be passed to represent an average prediction on the scale of the SHAP values.

    A key feature of the “shapviz” package is that X is used for visualization only. Thus it is perfectly fine to use factor variables, even if the underlying model would not accept these.

    To further simplify the use of shapviz, direct connectors to the packages

    are available.

    Installation

    The package shapviz can be installed from CRAN or GitHub:

    • install.packages("shapviz")
    • devtools::install_github("mayer79/shapviz")

    Example

    Shiny diamonds… let’s model their prices by four “c” variables with XGBoost, and create an explanation dataset with 2000 randomly picked diamonds.

    library(shapviz)
    library(ggplot2)
    library(xgboost)
    
    set.seed(3653)
    
    X <- diamonds[c("carat", "cut", "color", "clarity")]
    dtrain <- xgb.DMatrix(data.matrix(X), label = diamonds$price)
    
    fit <- xgb.train(
      params = list(learning_rate = 0.1, objective = "reg:squarederror"), 
      data = dtrain,
      nrounds = 65L
    )
    
    # Explanation dataset
    X_small <- X[sample(nrow(X), 2000L), ]

    Create “shapviz” object

    One line of code creates a shapviz object. It contains SHAP values and feature values for the set of observations we are interested in. Note again that X is solely used as explanation dataset, not for calculating SHAP values.

    In this example we construct the shapviz object directly from the fitted XGBoost model. Thus we also need to pass a corresponding prediction dataset X_pred used for calculating SHAP values by XGBoost.

    shp <- shapviz(fit, X_pred = data.matrix(X_small), X = X_small)

    Explaining one single prediction

    Let’s start by explaining a single prediction by a waterfall plot or, alternatively, a force plot.

    # Two types of visualizations
    sv_waterfall(shp, row_id = 1)
    sv_force(shp, row_id = 1)
    Waterfall plot

    Factor/character variables are kept as they are, even if the underlying XGBoost model required them to be integer encoded.

    Force plot

    Explaining the model as a whole

    We have decomposed 2000 predictions, not just one. This allows us to study variable importance at a global model level by studying average absolute SHAP values as a bar plot or by looking at beeswarm plots of SHAP values.

    # Three types of variable importance plots
    sv_importance(shp)
    sv_importance(shp, kind = "bar")
    sv_importance(shp, kind = "both", alpha = 0.2, width = 0.2)
    Beeswarm plot
    Bar plot
    Beeswarm plot overlaid with bar plot
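
    The bar plot is nothing more than the column-wise mean of the absolute SHAP values, sorted in decreasing order. A minimal Python sketch of this calculation, using a tiny made-up SHAP matrix (the numbers are not taken from the diamonds model):

```python
def shap_importance(S, feature_names):
    """Mean absolute SHAP value per feature, sorted from most to least important."""
    n = len(S)
    imp = {
        name: sum(abs(row[j]) for row in S) / n
        for j, name in enumerate(feature_names)
    }
    return dict(sorted(imp.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical SHAP matrix: one row per diamond, one column per feature
feature_names = ["carat", "cut", "color", "clarity"]
S = [
    [3000.0, -100.0, 200.0, 50.0],
    [-2500.0, 150.0, -300.0, -50.0],
    [1000.0, -50.0, 100.0, 20.0],
]

imp = shap_importance(S, feature_names)
print(imp)
```

    Since positive and negative contributions do not cancel out under the absolute value, a feature with strong effects in both directions is correctly rated as important.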

    A scatterplot of SHAP values of a feature like color against its observed values gives a great impression of the feature effect on the response. Vertical scatter gives additional info on interaction effects. shapviz offers a heuristic to pick another feature on the color scale with potentially the strongest interaction.

    sv_dependence(shp, v = "color", "auto")
    Dependence plot with automatic interaction colorization

    Summary

    • The “shapviz” package has a single purpose: making SHAP plots.
    • Its interface is optimized for existing SHAP crunching packages and can easily be used in future packages as well.
    • All plots are highly customizable. Furthermore, they are all written with ggplot and allow corresponding modifications.

    The complete R script can be found here.

    References

    Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30 (2017).

  • Let the flashlight shine with plotly

    There are different R packages devoted to model agnostic interpretability, DALEX and iml being among the best known. In 2019, I added flashlight for a couple of reasons:

    1. Its explainers work with case weights.
    2. Multiple explainers can be combined to a multi-explainer.
    3. Stratified calculation is possible.

    Since almost all plots in flashlight are constructed with ggplot, it is super easy to turn them into interactive plotly objects: just add a simple ggplotly() to the end of the call.

    However… it is not straightforward to show interactive plots in a blog! Thus, we show only screenshots of the resulting plots here and refer to the complete HTML report here: https://mayer79.github.io/flashlight_plotly/flashlight_plotly.html

    We will use a sweet dataset with more than 20’000 houses to model house prices by a set of derived features such as the logarithmic living area. The location will be represented by the postal code.

    Data preparation

    We first load the data and prepare some of the columns for modeling. Furthermore, we specify the set of features and the response.

    library(dplyr)
    library(flashlight)
    library(plotly)
    library(ranger)
    library(lme4)
    library(moderndive)
    library(splitTools)
    library(MetricsWeighted)
    
    set.seed(4933)
    
    data("house_prices")
    
    prep <- house_prices %>% 
      mutate(
        log_price = log(price),
        log_sqft_living = log(sqft_living),
        log_sqft_lot = log(sqft_lot),
        log_sqft_basement = log1p(sqft_basement),
        year = as.numeric(format(date, '%Y')),
        age = year - yr_built
      )
    
    x <- c(
      "year", "age", "log_sqft_living", "log_sqft_lot", 
      "bedrooms", "bathrooms", "log_sqft_basement", 
      "condition", "waterfront", "zipcode"
    )
    
    y <- "log_price"
    
    head(prep[c(y, x)])
    
    ## # A tibble: 6 x 11
    ##   log_price  year   age log_sqft_living log_sqft_lot bedrooms bathrooms
    ##       <dbl> <dbl> <dbl>           <dbl>        <dbl>    <int>     <dbl>
    ## 1      12.3  2014    59            7.07         8.64        3      1   
    ## 2      13.2  2014    63            7.85         8.89        3      2.25
    ## 3      12.1  2015    82            6.65         9.21        2      1   
    ## 4      13.3  2014    49            7.58         8.52        4      3   
    ## 5      13.1  2015    28            7.43         9.00        3      2   
    ## 6      14.0  2014    13            8.60        11.5         4      4.5 
    ## # ... with 4 more variables: log_sqft_basement <dbl>, condition <fct>,
    ## #   waterfront <lgl>, zipcode <fct>

    Train / test split

    Then, we split the dataset into 80% training and 20% test rows, stratified on the (binned) response log_price.

    idx <- partition(prep[[y]], c(train = 0.8, test = 0.2), type = "stratified")
    
    train <- prep[idx$train, ]
    test <- prep[idx$test, ]
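
    The idea behind stratifying on a binned numeric response can be sketched in a few lines of Python: sort the response, cut it into quantile bins, and split each bin separately. This is only an illustration of the principle under my own assumptions (10 bins, simple rounding); it is not the implementation used by splitTools.

```python
import random


def stratified_partition(y, p_train=0.8, n_bins=10, seed=1):
    """Split row indices into train/test, stratified on quantile bins of y."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    bin_size = -(-len(y) // n_bins)  # ceiling division
    train, test = [], []
    # Consecutive chunks of the sorted indices form the quantile bins
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        rng.shuffle(bin_idx)
        cut = round(p_train * len(bin_idx))
        train += bin_idx[:cut]
        test += bin_idx[cut:]
    return sorted(train), sorted(test)


# Simulated log prices (made up)
rng = random.Random(0)
y = [rng.gauss(13.0, 0.5) for _ in range(1000)]

train, test = stratified_partition(y)
mean_train = sum(y[i] for i in train) / len(train)
mean_test = sum(y[i] for i in test) / len(test)
print(len(train), len(test), mean_train, mean_test)
```

    Because every bin contributes the same share to both parts, the response distributions of the training and test data are very similar.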

    Models

    We fit two models:

    1. A linear mixed model with random postal code effect.
    2. A random forest with 500 trees.
    # Mixed-effects model
    fit_lmer <- lmer(
      update(reformulate(x, "log_price"), . ~ . - zipcode + (1 | zipcode)),
      data = train
    )
    
    # Random forest
    fit_rf <- ranger(
      reformulate(x, "log_price"),
      always.split.variables = "zipcode",
      data = train
    )
    cat("R-squared OOB:", fit_rf$r.squared)
    ## R-squared OOB: 0.8463311

    Model inspection

    Now, we are ready to inspect our two models regarding performance, variable importance, and effects.

    Set up explainers

    First, we pack all model dependent information into flashlights (the explainer objects) and combine them to a multiflashlight. As evaluation dataset, we pass the test data. This ensures that interpretability tools using the response (e.g., performance measures and permutation importance) are not being biased by overfitting.

    fl_lmer <- flashlight(model = fit_lmer, label = "LMER")
    fl_rf <- flashlight(
      model = fit_rf,
      label = "RF",
      predict_function = function(mod, X) predict(mod, X)$predictions
    )
    fls <- multiflashlight(
      list(fl_lmer, fl_rf),
      y = "log_price",
      data = test,
      metrics = list(RMSE = rmse, `R-squared` = r_squared)
    )

    Model performance

    Let’s evaluate model RMSE and R-squared on the hold-out dataset. Here, the mixed-effects model performs a tiny little bit better than the random forest:

    (light_performance(fls) %>%
      plot(fill = "darkred") +
        labs(title = "Model performance", x = element_blank())) %>%
      ggplotly()
    Model performance (png)

    Permutation importance

    Next, we inspect the variable strength based on permutation importance. It shows by how much the RMSE is being increased when shuffling a variable before prediction. The results are quite similar between the two models.

    (light_importance(fls, v = x) %>%
        plot(fill = "darkred") +
        labs(title = "Permutation importance", y = "Increase in RMSE")) %>%
      ggplotly()
    Variable importance (png)
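
    For readers who want to see the mechanics, here is a minimal Python sketch of permutation importance with respect to the RMSE. The toy model and the simulated data are made up, and the function name permutation_importance is just a label for this sketch; flashlight does the equivalent for us in R.

```python
import random
from math import sqrt


def rmse(y, pred):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))


def permutation_importance(predict, X, y, seed=1):
    """Increase in RMSE after shuffling one feature column at a time."""
    rng = random.Random(seed)
    base = rmse(y, [predict(row) for row in X])
    result = {}
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the link between feature j and the response
        X_perm = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, col)]
        result[j] = rmse(y, [predict(row) for row in X_perm]) - base
    return result


def predict(row):
    # Toy model: strong effect of feature 0, weak effect of feature 1,
    # feature 2 is ignored
    return 3.0 * row[0] + 0.5 * row[1]


rng = random.Random(0)
X = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
y = [predict(row) + rng.gauss(0, 0.1) for row in X]

imp = permutation_importance(predict, X, y)
print(imp)
```

    The unused feature gets an importance of exactly zero because shuffling it does not change any prediction.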

    ICE plot

    To get an impression of the effect of the living area, we select 200 observations and profile their predictions with increasing (log) living area, keeping everything else fixed (Ceteris Paribus). These ICE (individual conditional expectation) plots are vertically centered in order to highlight potential interaction effects. If all curves coincide, there are no interaction effects and we can say that the effect of the feature is modelled in an additive way (no surprise for the additive linear mixed-effects model).

    (light_ice(fls, v = "log_sqft_living", n_max = 200, center = "middle") %>%
        plot(alpha = 0.05, color = "darkred") +
        labs(title = "Centered ICE plot", y = "log_price (shifted)")) %>%
      ggplotly()
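
    The claim that centered ICE curves coincide for additive effects can be checked with a small Python sketch. The two toy models and the grid are made up, and centering at the middle grid value is my reading of center = "middle".

```python
def ice_curves(predict, X, j, grid):
    """One curve per row: predictions when feature j runs over the grid."""
    return [[predict(row[:j] + (g,) + row[j + 1:]) for g in grid] for row in X]


def center_middle(curves):
    """Shift each curve vertically so that it is 0 at the middle grid value."""
    mid = len(curves[0]) // 2
    return [[v - curve[mid] for v in curve] for curve in curves]


def spread(curves):
    """Largest vertical distance between curves over all grid values."""
    return max(max(vals) - min(vals) for vals in zip(*curves))


def additive(row):
    return 0.8 * row[0] + 0.1 * row[1]


def interacting(row):
    return 0.8 * row[0] * row[1]


X = [(7.0, 1.0), (7.5, 2.0), (8.0, 3.0)]
grid = [6.5, 7.0, 7.5, 8.0, 8.5]

spread_additive = spread(center_middle(ice_curves(additive, X, 0, grid)))
spread_interacting = spread(center_middle(ice_curves(interacting, X, 0, grid)))
print(spread_additive, spread_interacting)
```

    For the additive model, the centered curves are identical up to floating point error, while the interaction model produces curves that fan out.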

    Partial dependence plots

    Averaging many uncentered ICE curves provides the famous partial dependence plot, introduced in Friedman’s seminal paper on gradient boosting machines (2001).

    (light_profile(fls, v = "log_sqft_living", n_bins = 21) %>%
        plot(rotate_x = FALSE) +
        labs(title = "Partial dependence plot", y = y) +
        scale_colour_viridis_d(begin = 0.2, end = 0.8)) %>%
      ggplotly()
    Partial dependence plots (png)
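
    In code, the relation between ICE curves and partial dependence is a plain average per grid value. A minimal Python sketch with a made-up toy model and three made-up rows:

```python
def partial_dependence(predict, X, j, grid):
    """Average prediction over all rows when feature j is fixed at each grid value."""
    pd_values = []
    for g in grid:
        # One point of each (uncentered) ICE curve ...
        preds = [predict(row[:j] + (g,) + row[j + 1:]) for row in X]
        # ... averaged over the rows
        pd_values.append(sum(preds) / len(preds))
    return pd_values


def predict(row):
    # Toy model with an interaction (made up)
    return 0.8 * row[0] * row[1]


X = [(7.0, 1.0), (7.5, 2.0), (8.0, 3.0)]
grid = [7.0, 8.0]

pd_values = partial_dependence(predict, X, 0, grid)
print(pd_values)
```

    Here, the average of the second feature is 2, so the partial dependence values are about 0.8 * 7 * 2 and 0.8 * 8 * 2.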

    Multiple effects visualized together

    The last figure extends the partial dependence plot with three additional curves, all evaluated on the hold-out dataset:

    • Average observed values
    • Average predictions
    • ALE plot (“accumulated local effects”, an alternative to partial dependence plots with relaxed Ceteris Paribus assumption)
    (light_effects(fls, v = "log_sqft_living", n_bins = 21) %>%
        plot(use = "all")  +
        labs(title = "Different effect estimates", y = y) +
        scale_colour_viridis_d(begin = 0.2, end = 0.8)) %>%
      ggplotly()
    Multiple effects together (png)

    Conclusion

    Combining flashlight with plotly works well and provides nice, interactive plots. Using rmarkdown, an analysis like this looks quite neat if shipped as an HTML report like this one here: https://mayer79.github.io/flashlight_plotly/flashlight_plotly.html

    The rmarkdown script can be found here on github.

  • DuckDB: Quacking SQL

    Lost in Translation between R and Python 8

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    DuckDB

    DuckDB is a fantastic in-process SQL database management system written completely in C++. Check its official documentation and other blogposts like this to get a feeling of its superpowers. It is getting better and better!

    Some of the highlights:

    • Easy installation in R and Python, made possible via language bindings.
    • Multiprocessing and fast.
    • Allows working with data bigger than RAM.
    • Can fire SQL queries on R and Pandas tables.
    • Can fire SQL queries on (multiple!) csv and/or Parquet files.
    • Quacks Apache Arrow.

    Installation

    DuckDB is super easy to install:

    • R: install.packages("duckdb")
    • Python: pip install duckdb

    Additional packages required to run the code of this post are indicated in the code.

    A first query

    Let’s start by loading a dataset, initializing DuckDB and running a simple query.

    The dataset we use here contains information on over 20,000 sold houses in King County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    library(OpenML)
    library(duckdb)
    library(tidyverse)
    
    # Load data
    df <- getOMLDataSet(data.id = 42092)$data
    
    # Initialize duckdb, register df and materialize first query
    con = dbConnect(duckdb())
    duckdb_register(con, name = "df", df = df)
    con %>% 
      dbSendQuery("SELECT * FROM df limit 5") %>% 
      dbFetch()
    
    # Python
    import duckdb
    import pandas as pd
    from sklearn.datasets import fetch_openml
    
    # Load data
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    
    # Initialize duckdb, register df and fire first query
    # If out-of-RAM: duckdb.connect("py.duckdb", config={"temp_directory": "a_directory"})
    con = duckdb.connect()
    con.register("df", df)
    con.execute("SELECT * FROM df limit 5").fetchdf()
    Result of first query (from R)

    Average price per grade

    If you like SQL, then you can do your data preprocessing and simple analyses with DuckDB. Here, we calculate the average house price per online grade (the higher the grade, the better the house).

    query <- 
      "
      SELECT AVG(price) avg_price, grade 
      FROM df 
      GROUP BY grade
      ORDER BY grade
      "
    avg <- con %>% 
      dbSendQuery(query) %>% 
      dbFetch()
    
    avg
    
    # Average price per grade
    query = """
      SELECT AVG(price) avg_price, grade 
      FROM df 
      GROUP BY grade
      ORDER BY grade
      """
    avg = con.execute(query).fetchdf()
    avg
    R output

    Highlight: queries to files

    The last query will be applied directly to files on disk. To demonstrate this fantastic feature, we first save “df” as a parquet file and “avg” as a csv file.

    arrow::write_parquet(df, "housing.parquet")  # requires the arrow package
    write.csv(avg, "housing_avg.csv", row.names = FALSE)
    
    # Save df and avg to different file types
    df.to_parquet("housing.parquet")  # pyarrow=7
    avg.to_csv("housing_avg.csv", index=False)

    Let’s load some columns of the “housing.parquet” data, but only rows with grades having an average price of more than one million USD. Agreed, that query does not make too much sense but I hope you get the idea…😃

    # "Complex" query
    query2 <- "
      SELECT price, sqft_living, A.grade, avg_price
      FROM 'housing.parquet' A
      LEFT JOIN 'housing_avg.csv' B
      ON A.grade = B.grade
      WHERE B.avg_price > 1000000
      "
    
    expensive_grades <- con %>% 
      dbSendQuery(query2) %>% 
      dbFetch()
    
    head(expensive_grades)
    
    # dbDisconnect(con)
    
    # Complex query
    query2 = """
      SELECT price, sqft_living, A.grade, avg_price
      FROM 'housing.parquet' A
      LEFT JOIN 'housing_avg.csv' B
      ON A.grade = B.grade
      WHERE B.avg_price > 1000000
      """
    expensive_grades = con.execute(query2).fetchdf()
    expensive_grades
    
    # con.close()
    R output

    Last words

    • DuckDB is cool!
    • If you have strong SQL skills but do not know R or Python so well, this is a great way to get used to those programming languages.
    • If you are unfamiliar with SQL but like R and/or Python, you can use DuckDB for a while and end up being an SQL addict.
    • If your analysis involves combining many large files during preprocessing, then you can try the trick shown in the last example of this post.

    The Python notebook and R code can be found at:

  • Avoid loops in R! Really?

    It must have been around the year 2000 when I wrote my first snippet of SPLUS/R code. One thing I learned back then:

    Loops are slow. Replace them with

    1. vectorized calculations or
    2. if vectorization is not possible, use sapply() et al.

    Since then, the R core team and the community have invested tons of time to improve R and also to make it faster. There are things like Rcpp and parallel computing to speed up loops.

    But what relatively few R users know: loops are not that slow anymore. We want to demonstrate this using two examples.

    Example 1: sqrt()

    We use three ways to calculate the square root of a vector of random numbers:

    1. Vectorized calculation. This will be the way to go because it is internally optimized in C.
    2. A loop. This must be super slow for large vectors.
    3. vapply() (as safe alternative to sapply).

    The three approaches are then compared via bench::mark() regarding their speed for different vector lengths n. The results are compared first regarding absolute median times, and secondly (using an independent run), on a relative scale (1 is the vectorized approach).

    library(tidyverse)
    library(bench)
    
    # Calculate square root for each element in loop
    sqrt_loop <- function(x) {
      out <- numeric(length(x))
      for (i in seq_along(x)) {
        out[i] <- sqrt(x[i])
      }
      out
    }
    
    # Example
    sqrt_loop(1:4) # 1.000000 1.414214 1.732051 2.000000
    
    # Compare its performance with two alternatives
    sqrt_benchmark <- function(n) {
      x <- rexp(n)
      mark(
        vectorized = sqrt(x),
        loop = sqrt_loop(x),
        vapply = vapply(x, sqrt, FUN.VALUE = 0.0),
        # relative = TRUE
      )
    }
    
    # Combine results of multiple benchmarks and plot results
    multiple_benchmarks <- function(one_bench, N) {
      res <- vector("list", length(N))
      for (i in seq_along(N)) {
        res[[i]] <- one_bench(N[i]) %>% 
          mutate(n = N[i], expression = names(expression))
      }
      
      ggplot(bind_rows(res), aes(n, median, color = expression)) +
        geom_point(size = 3) +
        geom_line(size = 1) +
        scale_x_log10() +
        ggtitle(deparse1(substitute(one_bench))) +
        theme(legend.position = c(0.8, 0.15))
    }
    
    # Apply simulation
    multiple_benchmarks(sqrt_benchmark, N = 10^seq(3, 6, 0.25))

    Absolute timings

    Absolute median times on the “sqrt()” task

    Relative timings (using a second run)

    Relative median times of a separate run on the “sqrt()” task

    We see:

    • Run times increase quite linearly with vector size.
    • Vectorization is more than ten times faster than the naive loop.
    • Most strikingly, vapply() is much slower than the naive loop. Would you have thought this?

    Example 2: paste()

    For the second example, we use a less simple function, namely

    paste("Number", prettyNum(x, digits = 5))

    What will our three approaches (vectorized, naive loop, vapply) show on this task?

    pretty_paste <- function(x) {
      paste("Number", prettyNum(x, digits = 5))
    }
    
    # Example
    pretty_paste(pi) # "Number 3.1416"
    
    # Again, call pretty_paste() for each element in a loop
    paste_loop <- function(x) {
      out <- character(length(x))
      for (i in seq_along(x)) {
        out[i] <- pretty_paste(x[i])
      }
      out
    }
    
    # Compare its performance with two alternatives
    paste_benchmark <- function(n) {
      x <- rexp(n)
      mark(
        vectorized = pretty_paste(x),
        loop = paste_loop(x),
        vapply = vapply(x, pretty_paste, FUN.VALUE = ""),
        # relative = TRUE
      )
    }
    
    multiple_benchmarks(paste_benchmark, N = 10^seq(3, 5, 0.25))

    Absolute timings

    Absolute median times on the “paste()” task

    Relative timings (using a second run)

    Relative median times of a separate run on the “paste()” task
    • In contrast to the first example, vapply() is now as fast as the naive loop.
    • The time advantage of the vectorized approach is much less impressive. The loop takes in median only 50% longer.

    Conclusion

    1. Vectorization is fast and easy to read. If available, use this. No surprise.
    2. If you use vapply/sapply/lapply, do it for the style, not for the speed. In some cases, the loop will be faster. And, depending on the situation and the audience, a loop might actually be even easier to read.

    The code can be found on github.

    The runs have been made on a Windows 11 system with a four core Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz processor.