Tag: Lost in Translation

Series: Lost in Translation between R and Python

  • Effect Plots in Python and R

Christian and I did some code magic: highly effective plots that help to build and inspect any model:

    The functionality is best described by its output:

(Figures: effect plot output of the Python and R implementations.)

    The plots show different types of feature effects relevant in modeling:

    • Average observed: Descriptive effect (also interesting without model).
    • Average predicted: Combined effect of all features. Also called “M Plot” (Apley 2020).
• Partial dependence: Effect of one feature, keeping the other feature values constant (Friedman 2001); see the sketch after this list.
    • Number of observations or sum of case weights: Feature value distribution.
    • R only: Accumulated local effects, an alternative to partial dependence (Apley 2020).
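To make the partial dependence idea concrete (referenced in the list above), here is a minimal sketch of how such a curve could be computed by hand. The names model, X, feature, and grid are placeholders and not part of either package, which use far more efficient implementations internally.

import numpy as np

def partial_dependence(model, X, feature, grid):
    # Partial dependence (Friedman 2001): for each grid value, force the
    # feature to that value for all rows and average the predictions.
    averages = []
    for g in grid:
        X_mod = X.copy()
        X_mod[feature] = g  # all other feature values stay as observed
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)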

    Both implementations…

    • are highly efficient thanks to {Polars} in Python and {collapse} in R, and work on datasets with millions of observations,
    • support case weights with all their statistics, ideal in insurance applications,
    • calculate average residuals (not shown in the plots above),
    • provide standard deviations/errors of average observed and bias,
• allow switching to Plotly for interactive plots, and
• are highly customizable (the R package, e.g., allows collapsing rare levels after calculating statistics via the update() method, or sorting the features by main effect importance).

    In the spirit of our “Lost In Translation” series, we provide both high-quality Python and R code. We will use the same data and models as in one of our latest posts on how to build strong GLMs via ML + XAI.

    Example

Let’s build a Poisson LightGBM model to explain the claim frequency given six traditional features in a pricing dataset on motor liability claims. 80% of the 1 million rows are used for training, the other 20% for evaluation. Hyperparameters have been slightly tuned (not shown).

    library(OpenML)
    library(lightgbm)
    
    dim(df <- getOMLDataSet(data.id = 45106L)$data)  # 1000000 7
    head(df)
    
    #   year town driver_age car_weight car_power car_age claim_nb
    # 0 2018    1         51       1760       173       3        0
    # 1 2019    1         41       1760       248       2        0
    # 2 2018    1         25       1240       111       2        0
    # 3 2019    0         40       1010        83       9        0
    # 4 2018    0         43       2180       169       5        0
    # 5 2018    1         45       1170       149       1        1
    
    yvar <- "claim_nb"
    xvars <- setdiff(colnames(df), yvar)
    
    ix <- 1:800000
    train <- df[ix, ]
    test <- df[-ix, ]
    X_train <- data.matrix(train[xvars])
    X_test <- data.matrix(test[xvars])
    
    # Training, using slightly optimized parameters found via cross-validation
    params <- list(
      learning_rate = 0.05,
      objective = "poisson",
      num_leaves = 7,
      min_data_in_leaf = 50,
      min_sum_hessian_in_leaf = 0.001,
      colsample_bynode = 0.8,
      bagging_fraction = 0.8,
      lambda_l1 = 3,
      lambda_l2 = 5,
      num_threads = 7
    )
    
    set.seed(1)
    
    fit <- lgb.train(
      params = params,
      data = lgb.Dataset(X_train, label = train$claim_nb),
      nrounds = 300
    )
    import matplotlib.pyplot as plt
    from lightgbm import LGBMRegressor
    from sklearn.datasets import fetch_openml
    
    df = fetch_openml(data_id=45106, parser="pandas").frame
    df.head()
    
    #   year town driver_age car_weight car_power car_age claim_nb
    # 0 2018    1         51       1760       173       3        0
    # 1 2019    1         41       1760       248       2        0
    # 2 2018    1         25       1240       111       2        0
    # 3 2019    0         40       1010        83       9        0
    # 4 2018    0         43       2180       169       5        0
    # 5 2018    1         45       1170       149       1        1
    
    # Train model on 80% of the data
    y = df.pop("claim_nb")
    n_train = 800_000
    X_train, y_train = df.iloc[:n_train], y.iloc[:n_train]
    X_test, y_test = df.iloc[n_train:], y.iloc[n_train:]
    
    params = {
        "learning_rate": 0.05,
        "objective": "poisson",
        "num_leaves": 7,
        "min_child_samples": 50,
        "min_child_weight": 0.001,
        "colsample_bynode": 0.8,
        "subsample": 0.8,
        "reg_alpha": 3,
        "reg_lambda": 5,
        "verbose": -1,
    }
    
    model = LGBMRegressor(n_estimators=300, **params, random_state=1)
    model.fit(X_train, y_train)

Let’s inspect the main effects of the model on the test data.

    library(effectplots)
    
    # 0.3 s
    feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb) |>
      plot(share_y = "all")
    from model_diagnostics.calibration import plot_marginal
    
    fig, axes = plt.subplots(3, 2, figsize=(8, 8), sharey=True, layout="tight")
    
    # 2.3 s
    for i, (feat, ax) in enumerate(zip(X_test.columns, axes.flatten())):
        plot_marginal(
            y_obs=y_test,
            y_pred=model.predict(X_test),
            X=X_test,
            feature_name=feat,
            predict_function=model.predict,
            ax=ax,
        )
        ax.set_title(feat)
        if i != 1:
            ax.legend().remove()

    The output can be seen at the beginning of this blog post.

Here are some model insights:

    • Average predictions closely match observed frequencies. No clear bias is visible.
• Partial dependence shows that year and car_weight have almost no impact (regarding their main effects), while the driver_age and car_power effects seem strongest. The shared y axes help to compare the effects.
• Except for car_weight, the partial dependence curve closely follows the average predictions. This means that the model effect really seems to come from the feature on the x axis, and not from some other correlated feature (as with car_weight, which is actually strongly correlated with car_power).

    Final words

• Inspecting models has become much more relaxed with the functions above.
    • The packages used offer much more functionality. Try them out! Or we will show them in later posts ;).

    References

    1. Apley, Daniel W., and Jingyu Zhu. 2020. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82 (4): 1059–1086. doi:10.1111/rssb.12377.
    2. Friedman, Jerome H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (5): 1189–1232. doi:10.1214/aos/1013203451.

R script, Python notebook

  • SHAP Values of Additive Models

Within only a few years, SHAP (Shapley additive explanations) has emerged as the number 1 way to investigate black-box models. The basic idea is to decompose model predictions into additive contributions of the features in a fair way. Studying decompositions of many predictions allows deriving global properties of the model.
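As a quick illustration of this additivity (the “local accuracy” property), here is a small helper: the baseline plus the sum of the SHAP values should reproduce each prediction. The names explanation (any shap.Explanation object) and predictions are placeholders, not taken from the post’s code.

import numpy as np

def check_additivity(explanation, predictions, tol=1e-8):
    # Baseline plus row-wise sum of SHAP values should equal the predictions
    recomposed = explanation.base_values + explanation.values.sum(axis=1)
    return np.allclose(recomposed, predictions, atol=tol)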

    What happens if we apply SHAP algorithms to additive models? Why would this ever make sense?

    In the spirit of our “Lost In Translation” series, we provide both high-quality Python and R code.

    The models

    Let’s build the models using a dataset with three highly correlated covariates and a (deterministic) response.

    library(lightgbm)
    library(kernelshap)
    library(shapviz)
    
    #===================================================================
    # Make small data
    #===================================================================
    
    make_data <- function(n = 100) {
      x1 <- seq(0.01, 1, length = n)
      data.frame(
        x1 = x1,
        x2 = log(x1),
        x3 = x1 > 0.7
      ) |>
        transform(y = 1 + 0.2 * x1 + 0.5 * x2 + x3 + sin(2 * pi * x1))
    }
    df <- make_data()
    head(df)
    cor(df) |>
      round(2)
    
    #      x1   x2   x3    y
    # x1 1.00 0.90 0.80 0.46
    # x2 0.90 1.00 0.58 0.58
    # x3 0.80 0.58 1.00 0.51
    # y  0.46 0.58 0.51 1.00
    
    #===================================================================
    # Additive linear model and additive boosted trees
    #===================================================================
    
    # Linear regression
    fit_lm <- lm(y ~ poly(x1, 3) + poly(x2, 3) + x3, data = df)
    summary(fit_lm)
    
    # Boosted trees
    xvars <- setdiff(colnames(df), "y")
    X <- data.matrix(df[xvars])
    
    params <- list(
      learning_rate = 0.05,
      objective = "mse",
      max_depth = 1,
      colsample_bynode = 0.7
    )
    
    fit_lgb <- lgb.train(
      params = params,
      data = lgb.Dataset(X, label = df$y),
      nrounds = 300
    )
    import numpy as np
    import lightgbm as lgb
    import shap
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    
    #===================================================================
    # Make small data
    #===================================================================
    
    def make_data(n=100):
        x1 = np.linspace(0.01, 1, n)
        x2 = np.log(x1)
        x3 = x1 > 0.7
        X = np.column_stack((x1, x2, x3))
    
        y = 1 + 0.2 * x1 + 0.5 * x2 + x3 + np.sin(2 * np.pi * x1)
        
        return X, y
    
    X, y = make_data()
    
    #===================================================================
    # Additive linear model and additive boosted trees
    #===================================================================
    
    # Linear model with polynomial terms
    poly = PolynomialFeatures(degree=3, include_bias=False)
    
    preprocessor =  ColumnTransformer(
        transformers=[
            ("poly0", poly, [0]),
            ("poly1", poly, [1]),
            ("other", "passthrough", [2]),
        ]
    )
    
    model_lm = Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("lm", LinearRegression()),
        ]
    )
    _ = model_lm.fit(X, y)
    
    # Boosted trees with single-split trees
    params = dict(
        learning_rate=0.05,
        objective="mse",
        max_depth=1,
        colsample_bynode=0.7,
    )
    
    model_lgb = lgb.train(
        params=params,
        train_set=lgb.Dataset(X, label=y),
        num_boost_round=300,
    )

    SHAP

    For both models, we use exact permutation SHAP and exact Kernel SHAP. Furthermore, the linear model is analyzed with “additive SHAP”, and the tree-based model with TreeSHAP.

Do the algorithms provide the same results?

    system.time({  # 1s
      shap_lm <- list(
        add = shapviz(additive_shap(fit_lm, df)),
        kern = kernelshap(fit_lm, X = df[xvars], bg_X = df),
        perm = permshap(fit_lm, X = df[xvars], bg_X = df)
      )
    
      shap_lgb <- list(
        tree = shapviz(fit_lgb, X),
        kern = kernelshap(fit_lgb, X = X, bg_X = X),
        perm = permshap(fit_lgb, X = X, bg_X = X)
      )
    })
    
    # Consistent SHAP values for linear regression
    all.equal(shap_lm$add$S, shap_lm$perm$S)
    all.equal(shap_lm$kern$S, shap_lm$perm$S)
    
    # Consistent SHAP values for boosted trees
all.equal(shap_lgb$tree$S, shap_lgb$perm$S)
all.equal(shap_lgb$kern$S, shap_lgb$perm$S)
    
    # Linear coefficient of x3 equals slope of SHAP values
    tail(coef(fit_lm), 1)                # 1.112096
    diff(range(shap_lm$kern$S[, "x3"]))  # 1.112096
    
sv_dependence(shap_lm$add, xvars, color_var = NULL)
    shap_lm = {
        "add": shap.Explainer(model_lm.predict, masker=X, algorithm="additive")(X),
        "perm": shap.Explainer(model_lm.predict, masker=X, algorithm="exact")(X),
        "kern": shap.KernelExplainer(model_lm.predict, data=X).shap_values(X),
    }
    
    shap_lgb = {
        "tree": shap.Explainer(model_lgb)(X),
        "perm": shap.Explainer(model_lgb.predict, masker=X, algorithm="exact")(X),
        "kern": shap.KernelExplainer(model_lgb.predict, data=X).shap_values(X),
    }
    
    # Consistency for additive linear regression
    eps = 1e-12
    assert np.abs(shap_lm["add"].values - shap_lm["perm"].values).max() < eps
    assert np.abs(shap_lm["perm"].values - shap_lm["kern"]).max() < eps
    
    # Consistency for additive boosted trees
    assert np.abs(shap_lgb["tree"].values - shap_lgb["perm"].values).max() < eps
    assert np.abs(shap_lgb["perm"].values - shap_lgb["kern"]).max() < eps
    
    # Linear effect of last feature in the fitted model
    model_lm.named_steps["lm"].coef_[-1]  # 1.112096
    
    # Linear effect of last feature derived from SHAP values (ignore the sign)
    shap_lm["perm"][:, 2].values.ptp()    # 1.112096
    
    shap.plots.scatter(shap_lm["add"])
    SHAP dependence plot of the additive linear model and the additive explainer (Python).

Yes – within each model, the three algorithms provide the same SHAP values. Furthermore, the SHAP values reconstruct the additive components of the features.

    Didactically, this is very helpful when introducing SHAP as a method: Pick a white-box and a black-box model and compare their SHAP dependence plots. For the white-box model, you simply see the additive components, while the dependence plots of the black-box model show scatter due to interactions.

Remark: The exact equivalence between algorithms is lost when

    • there are too many features for exact procedures (~10+ features), and/or when
    • the background data of Kernel/Permutation SHAP does not agree with the training data. This leads to slightly different estimates of the baseline value, which itself influences the calculation of SHAP values.

    Final words

    • SHAP algorithms applied to additive models typically give identical results. Slight differences might occur because sampling versions of the algos are used, or a different baseline value is estimated.
    • The resulting SHAP values describe the additive components.
    • Didactically, it helps to see SHAP analyses of white-box and black-box models side by side.

R script, Python notebook

  • Geographic SHAP

    Lost in Translation between R and Python 10

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    This post is heavily based on the new {shapviz} vignette.

    Setting

    Besides other features, a model with geographic components contains features like

    • latitude and longitude,
    • postal code, and/or
• other features that depend on location, e.g., distance to the nearest restaurant.

    Like any feature, the effect of a single geographic feature can be described using SHAP dependence plots. However, studying the effect of latitude (or any other location dependent feature) alone is often not very illuminating – simply due to strong interaction effects and correlations with other geographic features.

That’s where the additivity of SHAP values comes into play: the sum of the SHAP values of all geographic components represents the total geographic effect, and this sum can be visualized as a heatmap or 3D scatterplot against latitude/longitude (or any other geographic representation).

    A first example

    For illustration, we will use a beautiful house price dataset containing information on about 14’000 houses sold in 2016 in Miami-Dade County. Some of the columns are as follows:

    • SALE_PRC: Sale price in USD: Its logarithm will be our model response.
    • LATITUDE, LONGITUDE: Coordinates
    • CNTR_DIST: Distance to central business district
    • OCEAN_DIST: Distance (ft) to the ocean
    • RAIL_DIST: Distance (ft) to the next railway track
    • HWY_DIST: Distance (ft) to next highway
    • TOT_LVG_AREA: Living area in square feet
    • LND_SQFOOT: Land area in square feet
    • structure_quality: Measure of building quality (1: worst to 5: best)
    • age: Age of the building in years

(The coordinate and distance features are the geographic components.) For more background on this dataset, see Mayer et al. [2].

    We will fit an XGBoost model to explain log(price) as a function of lat/long, size, and quality/age.

    devtools::install_github("ModelOriented/shapviz", dependencies = TRUE)
    library(xgboost)
    library(ggplot2)
    library(shapviz)  # Needs development version 0.9.0 from github
    
    head(miami)
    
    x_coord <- c("LATITUDE", "LONGITUDE")
    x_nongeo <- c("TOT_LVG_AREA", "LND_SQFOOT", "structure_quality", "age")
    x <- c(x_coord, x_nongeo)
    
    # Train/valid split
    set.seed(1)
    ix <- sample(nrow(miami), 0.8 * nrow(miami))
    X_train <- data.matrix(miami[ix, x])
    X_valid <- data.matrix(miami[-ix, x])
    y_train <- log(miami$SALE_PRC[ix])
    y_valid <- log(miami$SALE_PRC[-ix])
    
    # Fit XGBoost model with early stopping
    dtrain <- xgb.DMatrix(X_train, label = y_train)
    dvalid <- xgb.DMatrix(X_valid, label = y_valid)
    
    params <- list(learning_rate = 0.2, objective = "reg:squarederror", max_depth = 5)
    
    fit <- xgb.train(
      params = params, 
      data = dtrain, 
      watchlist = list(valid = dvalid), 
      early_stopping_rounds = 20,
      nrounds = 1000,
      callbacks = list(cb.print.evaluation(period = 100))
    )
    %load_ext lab_black
    
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_openml
    
    df = fetch_openml(data_id=43093, as_frame=True)
    X, y = df.data, np.log(df.target)
    X.head()
    
    # Data split and model
    from sklearn.model_selection import train_test_split
    import xgboost as xgb
    
    x_coord = ["LONGITUDE", "LATITUDE"]
    x_nongeo = ["TOT_LVG_AREA", "LND_SQFOOT", "structure_quality", "age"]
    x = x_coord + x_nongeo
    
    X_train, X_valid, y_train, y_valid = train_test_split(
        X[x], y, test_size=0.2, random_state=30
    )
    
    # Fit XGBoost model with early stopping
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)
    
    params = dict(learning_rate=0.2, objective="reg:squarederror", max_depth=5)
    
    fit = xgb.train(
        params=params,
        dtrain=dtrain,
        evals=[(dvalid, "valid")],
        verbose_eval=100,
        early_stopping_rounds=20,
        num_boost_round=1000,
    )
    

    SHAP dependence plots

Let’s first study selected SHAP dependence plots, evaluated on the validation dataset with around 2800 observations. Note that we could just as well use the training data for this purpose, but it is a bit large.

    sv <- shapviz(fit, X_pred = X_valid)
    sv_dependence(
      sv, 
      v = c("TOT_LVG_AREA", "structure_quality", "LONGITUDE", "LATITUDE"), 
      alpha = 0.2
    )
    import shap
    
    xgb_explainer = shap.Explainer(fit)
    shap_values = xgb_explainer(X_valid)
    
    v = ["TOT_LVG_AREA", "structure_quality", "LONGITUDE", "LATITUDE"]
    shap.plots.scatter(shap_values[:, v], color=shap_values[:, v])
    SHAP dependence plots of selected features (Python output).

Total coordinate effect

    And now the two-dimensional plot of the sum of SHAP values:

    sv_dependence2D(sv, x = "LONGITUDE", y = "LATITUDE") +
      coord_equal()
    shap_coord = shap_values[:, x_coord]
    plt.scatter(*list(shap_coord.data.T), c=shap_coord.values.sum(axis=1), s=4)
    ax = plt.gca()
    ax.set_aspect("equal", adjustable="box")
    plt.colorbar()
    plt.title("Total location effect")
    plt.show()
    Sum of SHAP values on color scale against coordinates (Python output).

The last plot gives a good impression of price levels, but note:

1. Since we have modeled logarithmic prices, the effects are on a relative scale (a value of 0.1 corresponds to a factor of exp(0.1) ≈ 1.105, i.e., about 10% above average).
2. Due to interaction effects with non-geographic components, the location effects might depend on features like living area. This is not visible in the plot above. We will now modify the model to improve this aspect.

    Two modifications

We will now change the above model in two ways, not unlike the model in Mayer et al. [2].

    1. We will use additional geographic features like distance to railway track or to the ocean.
    2. We will use interaction constraints to allow only interactions between geographic features.

    The second step leads to a model that is additive in each non-geographic component and also additive in the combined location effect. According to the technical report of Mayer [1], SHAP dependence plots of additive components in a boosted trees model are shifted versions of corresponding partial dependence plots (evaluated at observed values). This allows a “Ceteris Paribus” interpretation of SHAP dependence plots of corresponding components.

    # Extend the feature set
    more_geo <- c("CNTR_DIST", "OCEAN_DIST", "RAIL_DIST", "HWY_DIST")
    x2 <- c(x, more_geo)
    
    X_train2 <- data.matrix(miami[ix, x2])
    X_valid2 <- data.matrix(miami[-ix, x2])
    
    dtrain2 <- xgb.DMatrix(X_train2, label = y_train)
    dvalid2 <- xgb.DMatrix(X_valid2, label = y_valid)
    
    # Build interaction constraint vector
    ic <- c(
      list(which(x2 %in% c(x_coord, more_geo)) - 1),
      as.list(which(x2 %in% x_nongeo) - 1)
    )
    
    # Modify parameters
    params$interaction_constraints <- ic
    
    fit2 <- xgb.train(
      params = params, 
      data = dtrain2, 
      watchlist = list(valid = dvalid2), 
      early_stopping_rounds = 20,
      nrounds = 1000,
      callbacks = list(cb.print.evaluation(period = 100))
    )
    
    # SHAP analysis
    sv2 <- shapviz(fit2, X_pred = X_valid2)
    
    # Two selected features: Thanks to additivity, structure_quality can be read as 
    # Ceteris Paribus
    sv_dependence(sv2, v = c("structure_quality", "LONGITUDE"), alpha = 0.2)
    
    # Total geographic effect (Ceteris Paribus thanks to additivity)
    sv_dependence2D(sv2, x = "LONGITUDE", y = "LATITUDE", add_vars = more_geo) +
      coord_equal()
    # Extend the feature set
    more_geo = ["CNTR_DIST", "OCEAN_DIST", "RAIL_DIST", "HWY_DIST"]
    x2 = x + more_geo
    
    X_train2, X_valid2 = train_test_split(X[x2], test_size=0.2, random_state=30)
    
    dtrain2 = xgb.DMatrix(X_train2, label=y_train)
    dvalid2 = xgb.DMatrix(X_valid2, label=y_valid)
    
    # Build interaction constraint vector
    ic = [x_coord + more_geo, *[[z] for z in x_nongeo]]
    
    # Modify parameters
    params["interaction_constraints"] = ic
    
    fit2 = xgb.train(
        params=params,
        dtrain=dtrain2,
        evals=[(dvalid2, "valid")],
        verbose_eval=100,
        early_stopping_rounds=20,
        num_boost_round=1000,
    )
    
    # SHAP analysis
    xgb_explainer2 = shap.Explainer(fit2)
    shap_values2 = xgb_explainer2(X_valid2)
    
    v = ["structure_quality", "LONGITUDE"]
    shap.plots.scatter(shap_values2[:, v], color=shap_values2[:, v])
    
    # Total location effect
    shap_coord2 = shap_values2[:, x_coord]
    c = shap_values2[:, x_coord + more_geo].values.sum(axis=1)
    plt.scatter(*list(shap_coord2.data.T), c=c, s=4)
    ax = plt.gca()
    ax.set_aspect("equal", adjustable="box")
    plt.colorbar()
    plt.title("Total location effect")
    plt.show()
    SHAP dependence plots of an additive feature (structure quality, no vertical scatter per unique feature value) and one of the geographic features (Python output).
Sum of all geographic features (color) against coordinates. There are no interactions with non-geographic features, so the effect can be read Ceteris Paribus (Python output).

    Again, the resulting total geographic effect looks reasonable.

    Wrap-Up

    • SHAP values of all geographic components in a model can be summed up and plotted on the color scale against coordinates (or some other geographic representation). This gives a lightning fast impression of the location effects.
    • Interaction constraints between geographic and non-geographic features lead to Ceteris Paribus interpretation of total geographic effects.

    The Python and R notebooks can be found here:

    References

    1. Mayer, Michael. 2022. “SHAP for Additively Modeled Features in a Boosted Trees Model.” https://arxiv.org/abs/2207.14490.
    2. Mayer, Michael, Steven C. Bourassa, Martin Hoesli, and Donato Flavio Scognamiglio. 2022. “Machine Learning Applications to Land and Structure Valuation.” Journal of Risk and Financial Management.

  • Kernel SHAP in R and Python

    Lost in Translation between R and Python 9

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    Kernel SHAP

SHAP is one of the most used model interpretation techniques in machine learning. It decomposes predictions into additive contributions of the features in a fair way. For tree-based methods, the fast TreeSHAP algorithm exists. For general models, one has to resort to computationally expensive Monte-Carlo sampling or the faster Kernel SHAP algorithm. Kernel SHAP uses a regression trick to get the SHAP values of an observation with a comparatively small number of calls to the predict function of the model. Still, it is much slower than TreeSHAP.

    Two good references for Kernel SHAP:

    1. Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30, 2017.
    2. Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3457-3465, 2021.

In our last post, we introduced our new “kernelshap” package in R. Since then, the package has been substantially improved, also thanks to the great help of David Watson:

    1. The package now supports multi-dimensional predictions.
2. It received a massive speed-up.
    3. Additionally, parallel computing can be activated for even faster calculations.
    4. The interface has become more intuitive.
    5. If the number of features is small (up to ten or eleven), it can provide exact Kernel SHAP values just like the reference Python implementation.
    6. For a larger number of features, it now uses partly-exact (“hybrid”) calculations, very similar to the logic in the Python implementation.

With those changes, the R implementation is now on par with the Python version.

    Example with four features

    In the following, we use the diamonds data to fit a linear regression with

    • log(price) as response
    • log(carat) as numeric feature
    • clarity, color and cut as categorical features (internally dummy encoded)
• interactions between log(carat) and the other three “C” variables. Note that the interactions are very weak.

Then, we calculate SHAP decompositions for about 1000 diamonds (every 53rd diamond), using 120 diamonds as background dataset. In this case, both R and Python will use exact calculations based on 2^4 – 2 = 14 possible binary on-off vectors (a value of 1 representing a feature value picked from the original observation, a value of 0 a value picked from the background data).
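As a small illustration (not part of either package), the following lines enumerate those 14 on-off vectors and print the Shapley kernel weight that Kernel SHAP assigns to each coalition size in its weighted regression; M denotes the number of features.

from itertools import product
from math import comb

M = 4  # number of features
vectors = [z for z in product([0, 1], repeat=M) if 0 < sum(z) < M]
print(len(vectors))  # 14: all-zeros and all-ones are excluded

def kernel_weight(M, k):
    # Shapley kernel weight of a coalition with k of M features switched on
    return (M - 1) / (comb(M, k) * k * (M - k))

for k in range(1, M):
    print(k, kernel_weight(M, k))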

    library(ggplot2)
    library(kernelshap)
    
    # Turn ordinal factors into unordered
    ord <- c("clarity", "color", "cut")
    diamonds[, ord] <- lapply(diamonds[ord], factor, ordered = FALSE)
    
    # Fit model
    fit <- lm(log(price) ~ log(carat) * (clarity + color + cut), data = diamonds)
    
    # Subset of 120 diamonds used as background data
    bg_X <- diamonds[seq(1, nrow(diamonds), 450), ]
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), c("carat", ord)]
    
    # Exact KernelSHAP (5 seconds)
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X)  
    )
    ks
    
    # SHAP values of first 2 observations:
    #          carat     clarity     color        cut
    # [1,] -2.050074 -0.28048747 0.1281222 0.01587382
    # [2,] -2.085838  0.04050415 0.1283010 0.03731644
    
    # Using parallel backend
    library("doFuture")
    
    registerDoFuture()
    plan(multisession, workers = 2)  # Windows
    # plan(multicore, workers = 2)   # Linux, macOS, Solaris
    
    # 3 seconds on second call
    system.time(
      ks3 <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    
    # Visualization
    library(shapviz)
    
    sv <- shapviz(ks)
    sv_importance(sv, "bee")
    import numpy as np
    import pandas as pd
    from plotnine.data import diamonds
    from statsmodels.formula.api import ols
    from shap import KernelExplainer
    
    # Turn categoricals into integers because, inconveniently, kernel SHAP
    # requires numpy array as input
    ord = ["clarity", "color", "cut"]
    x = ["carat"] + ord
    diamonds[ord] = diamonds[ord].apply(lambda x: x.cat.codes)
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color))", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)  # 74 seconds
    sv[0:2]
    
    # array([[-2.05007406, -0.28048747,  0.12812216,  0.01587382],
    #        [-2.0858379 ,  0.04050415,  0.12830103,  0.03731644]])
    SHAP summary plot (R model)

    The results match, hurray!

    Example with nine features

The computational effort of running exact Kernel SHAP explodes with the number of features. For nine features, the number of relevant on-off vectors is 2^9 – 2 = 510, i.e. about 36 times larger than with four features.

We now modify the above example, adding five additional features to the model. Note that the model structure is completely nonsensical. We just use it to get a feeling for what impact a 36 times larger workload has.

Besides exact calculations, we use an almost exact hybrid approach for both R and Python, using 126 on-off vectors (p(p+1) = 90 for the exact part and 4p = 36 for the sampling part, where p = 9 is the number of features), resulting in a significant speed-up both in R and Python.

    fit <- lm(
      log(price) ~ log(carat) * (clarity + color + cut) + x + y + z + table + depth, 
      data = diamonds
    )
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), setdiff(names(diamonds), "price")]
    
    # Exact Kernel SHAP: 61 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, exact = TRUE)  
    )
    ks
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624
    
    # Default, using an almost exact hybrid algorithm: 17 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624
    x = ["carat"] + ord + ["table", "depth", "x", "y", "z"]
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color)) + table + depth + x + y + z", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values: 12 minutes
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])
    
    # Now, using a hybrid between exact and sampling: 5 minutes
    sv = ks.shap_values(X_small, nsamples=126)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])

    Again, the results are essentially the same between R and Python, but also between the hybrid algorithm and the exact algorithm. This is interesting, because the hybrid algorithm is significantly faster than the exact one.

    Wrap-Up

    • R is catching up with Python’s superb “shap” package.
    • For two non-trivial linear regressions with interactions, the “kernelshap” package in R provides the same output as Python.
    • The hybrid between exact and sampling KernelSHAP (as implemented in Python and R) offers a very good trade-off between speed and accuracy.
• kernelshap() in R is fast!

    The Python and R codes can be found here:

    The examples were run on a Windows notebook with an Intel i7-8650U 4 core CPU.

  • DuckDB: Quacking SQL

    Lost in Translation between R and Python 8

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    DuckDB

    DuckDB is a fantastic in-process SQL database management system written completely in C++. Check its official documentation and other blogposts like this to get a feeling of its superpowers. It is getting better and better!

    Some of the highlights:

    • Easy installation in R and Python, made possible via language bindings.
    • Multiprocessing and fast.
• Allows working with data bigger than RAM.
    • Can fire SQL queries on R and Pandas tables.
• Can fire SQL queries on (multiple!) csv and/or Parquet files (see the sketch after this list).
    • Quacks Apache Arrow.
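As a tiny sketch of the multi-file feature mentioned above: DuckDB accepts glob patterns directly in the FROM clause, so several Parquet (or csv) files can be queried as one table. The file pattern below is purely hypothetical.

import duckdb

con = duckdb.connect()
# Query all Parquet files matching the (hypothetical) pattern as a single table
# con.execute("SELECT COUNT(*) FROM 'data/part-*.parquet'").fetchdf()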

    Installation

    DuckDB is super easy to install:

    • R: install.packages("duckdb")
    • Python: pip install duckdb

    Additional packages required to run the code of this post are indicated in the code.

    A first query

    Let’s start by loading a dataset, initializing DuckDB and running a simple query.

    The dataset we use here contains information on over 20,000 sold houses in Kings County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    library(OpenML)
    library(duckdb)
    library(tidyverse)
    
    # Load data
    df <- getOMLDataSet(data.id = 42092)$data
    
    # Initialize duckdb, register df and materialize first query
    con = dbConnect(duckdb())
    duckdb_register(con, name = "df", df = df)
    con %>% 
      dbSendQuery("SELECT * FROM df limit 5") %>% 
      dbFetch()
    import duckdb
    import pandas as pd
    from sklearn.datasets import fetch_openml
    
    # Load data
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    
    # Initialize duckdb, register df and fire first query
    # If out-of-RAM: duckdb.connect("py.duckdb", config={"temp_directory": "a_directory"})
    con = duckdb.connect()
    con.register("df", df)
    con.execute("SELECT * FROM df limit 5").fetchdf()
    Result of first query (from R)

    Average price per grade

If you like SQL, then you can do your data preprocessing and simple analyses with DuckDB. Here, we calculate the average house price per grade (the higher the grade, the better the house).

    query <- 
      "
      SELECT AVG(price) avg_price, grade 
      FROM df 
      GROUP BY grade
      ORDER BY grade
      "
    avg <- con %>% 
      dbSendQuery(query) %>% 
      dbFetch()
    
    avg
    
    # Average price per grade
    query = """
      SELECT AVG(price) avg_price, grade 
      FROM df 
      GROUP BY grade
      ORDER BY grade
      """
    avg = con.execute(query).fetchdf()
    avg
    R output

    Highlight: queries to files

    The last query will be applied directly to files on disk. To demonstrate this fantastic feature, we first save “df” as a parquet file and “avg” as a csv file.

    write_parquet(df, "housing.parquet")
    write.csv(avg, "housing_avg.csv", row.names = FALSE)
    
    # Save df and avg to different file types
    df.to_parquet("housing.parquet")  # pyarrow=7
    avg.to_csv("housing_avg.csv", index=False)

Let’s load some columns of the “housing.parquet” data, but only rows with grades having an average price above one million USD. Agreed, that query does not make too much sense, but I hope you get the idea…😃

    # "Complex" query
    query2 <- "
      SELECT price, sqft_living, A.grade, avg_price
      FROM 'housing.parquet' A
      LEFT JOIN 'housing_avg.csv' B
      ON A.grade = B.grade
      WHERE B.avg_price > 1000000
      "
    
    expensive_grades <- con %>% 
      dbSendQuery(query2) %>% 
      dbFetch()
    
    head(expensive_grades)
    
    # dbDisconnect(con)
    # Complex query
    query2 = """
      SELECT price, sqft_living, A.grade, avg_price
      FROM 'housing.parquet' A
      LEFT JOIN 'housing_avg.csv' B
      ON A.grade = B.grade
      WHERE B.avg_price > 1000000
      """
    expensive_grades = con.execute(query2).fetchdf()
    expensive_grades
    
    # con.close()
    R output

    Last words

    • DuckDB is cool!
    • If you have strong SQL skills but do not know R or Python so well, this is a great way to get used to those programming languages.
• If you are unfamiliar with SQL but like R and/or Python, you can use DuckDB for a while and end up being an SQL addict.
    • If your analysis involves combining many large files during preprocessing, then you can try the trick shown in the last example of this post.

    The Python notebook and R code can be found at:

  • Random Forests with Monotonic Constraints

    Lost in Translation between R and Python 7

    Hello random forest friends

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    Monotonic constraints

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different: there, the majority of models do not serve as pure prediction machines but rather as a fruitful source of information. Furthermore, even if used as a prediction machine, the users of the model might expect a certain degree of consistency when “playing” with input values.

    A classic example are statistical house appraisal models. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

    One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

    Trees and monotonic constraints

    Monotonicity constraints are especially simple to implement for decision trees. The rule is basically as follows:
    If a monotonicity constraint would be violated by a split on feature X, it is rejected. (Or a large penalty is subtracted from the corresponding split gain.) This will imply monotonic behavior of predictions in X, keeping all other features fixed.
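A conceptual sketch of that rule (not the actual XGBoost or LightGBM code): for an increasing constraint on X, a candidate split on X is only kept if the child covering the smaller X values does not predict more than the child covering the larger X values.

def split_allowed(left_value, right_value, constraint):
    # left_value/right_value: child predictions for low/high X
    # constraint: +1 (increasing), -1 (decreasing), 0 (none)
    if constraint == +1:
        return left_value <= right_value
    if constraint == -1:
        return left_value >= right_value
    return True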

    Tree ensembles like boosted trees or random forests will automatically inherit this property.

    Boosted trees

Most implementations of boosted trees, e.g., XGBoost and LightGBM, offer monotonicity constraints.

    What about random forests?

    Unfortunately, the picture is completely different for random forests. At the time of writing, I am not aware of any random forest implementation in R or Python offering this useful feature.

    Some options

1. Implement monotonically constrained random forests from scratch.
    2. Ask for this feature in existing implementations.
    3. Be creative and use XGBoost to emulate random forests.

For the moment, let’s stick to option 3. In our last R <-> Python blog post, we demonstrated that XGBoost’s random forest mode works essentially as well as standard random forest implementations, at least in regression settings and using sensible defaults.

    Warning: Be careful with imposing monotonicity constraints

    Ask yourself: does the constraint really make sense for all possible values of other features? You will see that the answer is often “no”.

    An example: If your house price model uses the features “number of rooms” and “living area”, then a monotonic constraint on “living area” might make sense (given any number of rooms), while such constraint would be non-sensical for the number of rooms. Why? Because having six rooms in a 1200 square feet home is not necessarily better than having just five rooms in an equally sized home.

    Let’s try it out

    We use a nice dataset containing information on over 20,000 sold houses in Kings County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    Some rows and columns from the Kings County house dataset.

    The following R and Python codes

    • fetch the data,
    • prepare the ML setting,
    • fit unconstrained XGBoost random forests using log sales price as response,
    • and visualize the effect of log ground area by individual conditional expectation (ICE) curves.

An ICE curve for variable X shows how the prediction of one specific observation changes if the value of X changes. Repeating this for multiple observations gives an idea of the effect of X. The average over multiple ICE curves produces the famous partial dependence plot.
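For illustration only, here is a minimal way to compute a single ICE curve by hand; model (anything with a predict() method) and the pandas Series x_row are placeholders, and the packages used below handle this (plus plotting) for us.

import pandas as pd

def ice_curve(model, x_row, feature, grid):
    # Repeat the observation once per grid value, then sweep the feature
    X_rep = pd.DataFrame([x_row] * len(grid)).reset_index(drop=True)
    X_rep[feature] = list(grid)
    return model.predict(X_rep)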

    library(farff)
    library(OpenML)
    library(dplyr)
    library(xgboost)
    
    set.seed(83454)
    
    rmse <- function(y, pred) {
      sqrt(mean((y-pred)^2))
    }
    
# Load King County house prices dataset on OpenML
    # ID 42092, https://www.openml.org/d/42092
    df <- getOMLDataSet(data.id = 42092)$data
    head(df)
    
    # Prepare
    df <- df %>%
      mutate(
        log_price = log(price),
        log_sqft_lot = log(sqft_lot),
        year = as.numeric(substr(date, 1, 4)),
        building_age = year - yr_built,
        zipcode = as.integer(as.character(zipcode))
      )
    
    # Define response and features
    y <- "log_price"
    x <- c("grade", "year", "building_age", "sqft_living",
           "log_sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode",
           "lat", "long", "condition", "waterfront")
    
    # random split
    ix <- sample(nrow(df), 0.8 * nrow(df))
    y_test <- df[[y]][-ix]
    
    # Fit untuned, but good(!) XGBoost random forest
    dtrain <- xgb.DMatrix(data.matrix(df[ix, x]),
                          label = df[ix, y])
    
    params <- list(
      objective = "reg:squarederror",
      learning_rate = 1,
      num_parallel_tree = 500,
      subsample = 0.63,
      colsample_bynode = 1/3,
      reg_lambda = 0,
      max_depth = 20,
      min_child_weight = 2
    )
    
    system.time( # 25 s
      unconstrained <- xgb.train(
        params,
        data = dtrain,
        nrounds = 1,
        verbose = 0
      )
    )
    
    pred <- predict(unconstrained, data.matrix(df[-ix, x]))
    
    # Test RMSE: 0.172
    rmse(y_test, pred)
    
    # ICE curves via our flashlight package
    library(flashlight)
    
    pred_xgb <- function(m, X) predict(m, data.matrix(X[, x]))
    
    fl <- flashlight(
      model = unconstrained,
      label = "unconstrained",
      data = df[ix, ],
      predict_function = pred_xgb
    )
    
    light_ice(fl, v = "log_sqft_lot", indices = 1:9,
              evaluate_at = seq(7, 11, by = 0.1)) %>%
      plot()
    # Imports
    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.inspection import PartialDependenceDisplay
    from xgboost import XGBRFRegressor
    
    # Fetch data from OpenML
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    
    # Prepare data
    df = df.assign(
        year=lambda x: x.date.str[0:4].astype(int),
        zipcode=lambda x: x.zipcode.astype(int),
        log_sqft_lot=lambda x: np.log(x.sqft_lot),
        building_age=lambda x: x.year - x.yr_built,
    )
    
    # Feature list
    xvars = [
        "grade",
        "year",
        "building_age",
        "sqft_living",
        "log_sqft_lot",
        "bedrooms",
        "bathrooms",
        "floors",
        "zipcode",
        "lat",
        "long",
        "condition",
        "waterfront",
    ]
    
    # Data split
    y_train, y_test, X_train, X_test = train_test_split(
        np.log(df["price"]), df[xvars], train_size=0.8, random_state=766
    )
    
    # Modeling - wall time: 39 seconds
    param_dict = dict(
        n_estimators=500,
        max_depth=20,
        learning_rate=1,
        subsample=0.63,
        colsample_bynode=1 / 3,
        reg_lambda=0,
        objective="reg:squarederror",
        min_child_weight=2,
    )
    
    unconstrained = XGBRFRegressor(**param_dict).fit(X_train, y_train)
    
    # Test RMSE 0.176
    pred = unconstrained.predict(X_test)
    print(f"RMSE: {mean_squared_error(y_test, pred, squared=False):.03f}")
    
    # ICE and PDP - wall time: 47 seconds
    PartialDependenceDisplay.from_estimator(
        unconstrained,
        X=X_train,
        features=["log_sqft_lot"],
        kind="both",
        subsample=20,
        random_state=1,
    )

    Figure 1 (R output): ICE curves of log(ground area) for the first nine observations. Many non-monotonic parts are visible.

    We clearly see many non-monotonic (and in this case counterintuitive) ICE curves.

    What would a model give with monotonically increasing constraint on the ground area?

    # Monotonic increasing constraint
    (params$monotone_constraints <- 1 * (x == "log_sqft_lot"))
    
    system.time( #  179s
      monotonic <- xgb.train(
        params,
        data = dtrain,
        nrounds = 1,
        verbose = 0
      )
    )
    
    pred <- predict(monotonic, data.matrix(df[-ix, x]))
    
    # Test RMSE: 0.176
    rmse(y_test, pred)
    
    fl_m <- flashlight(
      model = monotonic,
      label = "monotonic",
      data = df[ix, ],
      predict_function = pred_xgb
    )
    
    light_ice(fl_m, v = "log_sqft_lot", indices = 1:9,
              evaluate_at = seq(7, 11, by = 0.1)) %>%
      plot()
    # One needs to pass the constraints as single string, which is rather ugly
    mc = "(" + ",".join([str(int(x == "log_sqft_lot")) for x in xvars]) + ")"
    print(mc)
    
    # Modeling - wall time 49 seconds
    constrained = XGBRFRegressor(monotone_constraints=mc, **param_dict)
    constrained.fit(X_train, y_train)
    
    # Test RMSE: 0.178
    pred = constrained.predict(X_test)
    print(f"RMSE: {mean_squared_error(y_test, pred, squared=False):.03f}")
    
    # ICE and PDP - wall time 39 seconds
    PartialDependenceDisplay.from_estimator(
        constrained,
        X=X_train,
        features=["log_sqft_lot"],
        kind="both",
        subsample=20,
        random_state=1,
    )
    Figure 2 (R output): ICE curves of the same observations as in Figure 1, but now with monotonic constraint. All curves are monotonically increasing.

    We see:

    1. It works! Each ICE curve in log(lot area) is monotonically increasing. This means that predictions are monotonically increasing in lot area, keeping all other feature values fixed.
    2. The model performance is slightly worse. This is the price paid for receiving a more intuitive behaviour in an important feature.
    3. In Python, both models take about the same time to fit (30-40 s on a 4 core i7 CPU laptop). Curiously, in R, the constrained model takes about six times longer to fit than the unconstrained one (170 s vs 30 s).

    Summary

    • Monotonic constraints help to create intuitive models.
    • Unfortunately, as per now, native random forest implementations do not offer such constraints.
    • Using XGBoost’s random forest mode is a temporary solution until native random forest implementations add this feature.
• Be careful not to add too many constraints: does a constraint really make sense for all other (fixed) choices of feature values?

    The Python notebook and R code can be found at:

  • Strong Random Forests with XGBoost

    Lost in Translation between R and Python 6

    Hello random forest friends

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    The last one was on diamond duplicates and grouped sampling.

    XGBoost’s random forests

For sure, XGBoost is well known for its excellent gradient boosting trees implementation. Although less obvious, it is no secret that it also offers a way to fit single trees in parallel, emulating random forests; see the great explanations on the official XGBoost page. Still, there seems to exist quite some confusion on how to choose certain parameters in order to get good results. It is the aim of this post to clarify this.

    Also LightGBM offers a random forest mode. We will investigate it in a later post.

    Why would you want to use XGBoost to fit a random forest?

    1. Interaction & monotonic constraints are available for XGBoost, but typically not for random forest implementations. A separate post will follow to illustrate this in the random forest setting.
    2. XGBoost can natively deal with missing values in an elegant way, unlike many random forest algorithms.
    3. You can stick to the same data preparation pipeline.

    I had additional reasons in mind, e.g. using non-standard loss functions, but this did not turn out to work well. This is possibly due to the fact that XGBoost uses a quadratic approximation to the loss, which is exact only for the mean squared error loss (MSE).

    How to enable the ominous random forest mode?

    Following the official explanations, we would need to set

    • num_parallel_tree to the number of trees in the forest,
    • learning_rate and num_boost_round to 1.

    There are further valuable tips, e.g. to set row and column subsampling to values below one to resemble true random forests.

    Still, most of the regularization parameters of XGBoost tend to favour simple trees, while the idea of a random forest is to aggregate deep, overfitted trees. These regularization parameters have to be changed as well in order to get good results.

So voilà, my suggestions.

    Suggestions for parameters

    • learning_rate=1 (see above)
    • num_boost_round=1 (see above)
      Has to be set in train(), not in the parameter list. It is called nrounds in R.
    • subsample=0.63
A random forest draws a bootstrap sample to fit each tree. This means about 63% of the rows will enter the model one or multiple times, leaving 37% out (see the short derivation after this list). While XGBoost does not offer such sampling with replacement, we can still introduce the necessary randomness in the dataset used to fit a tree by skipping 37% of the rows per tree.
    • colsample_bynode=floor(sqrt(m))/m
Column subsampling per split is the main source of randomness in a random forest. A good default is usually to sample the square root of the number of features m, or m/3. XGBoost offers different colsample_by* parameters, but it is important to sample per split (i.e., per node), not per tree. Otherwise, it might happen that important features are missing in a tree altogether, leading to overall bad predictions.
    • num_parallel_tree
      The number of trees. Native implementations of random forests usually use a default value between 100 and 500. The more, the better—but slower.
    • reg_lambda=0
      XGBoost uses a default L2 penalty of 1! This will typically lead to shallow trees, colliding with the idea of a random forest to have deep, wiggly trees. In my experience, leaving this parameter at its default will lead to extremely bad XGBoost random forest fits.
      Set it to zero or a value close to zero.
    • max_depth=20
      Random forests usually train very deep trees, while XGBoost’s default is 6. A value of 20 corresponds to the default in the h2o random forest, so let’s go for their choice.
    • min_child_weight=2
The default of XGBoost is 1, which tends to be slightly too greedy in random forest mode. For binary classification, you would need to set it to a value close to or equal to 0.
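As referenced in the subsample bullet above, here is a short numeric check of where the 0.63 comes from (plain probability, nothing package-specific): the chance that a given row never appears in a bootstrap sample of n rows is (1 - 1/n)^n, which approaches exp(-1) ≈ 0.37, so about 63% of the distinct rows appear at least once.

import numpy as np

n = 20_000
print(1 - (1 - 1 / n) ** n)  # ~0.632: share of distinct rows drawn at least once
print(1 - np.exp(-1))        # ~0.632: limiting value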

    Of course these parameters can be tuned by cross-validation, but one of the reasons to love random forests is their good performance even with default parameters.

    Compared to optimized random forests, XGBoost’s random forest mode is quite slow. At the cost of performance, choose

    • lower max_depth,
    • higher min_child_weight, and/or
    • smaller num_parallel_tree.

    Let’s try it out with regression

    We will use a nice house price dataset, consisting of information on over 20,000 sold houses in Kings County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    Some rows and columns from the Kings County house dataset.

The following R and Python code fetches the data, prepares the ML setting, and fits a native random forest with good defaults. In R, we use the ranger package; in Python, the implementation of scikit-learn.

The response variable is the logarithmic sales price. A healthy set of 13 variables is used as features.

    library(farff)
    library(OpenML)
    library(dplyr)
    library(ranger)
    library(xgboost)
    
    set.seed(83454)
    
    rmse <- function(y, pred) {
      sqrt(mean((y-pred)^2))
    }
    
# Load King County house prices dataset on OpenML
    # ID 42092, https://www.openml.org/d/42092
    df <- getOMLDataSet(data.id = 42092)$data
    head(df)
    
    # Prepare
    df <- df %>%
      mutate(
        log_price = log(price),
        year = as.numeric(substr(date, 1, 4)),
        building_age = year - yr_built,
        zipcode = as.integer(as.character(zipcode))
    )
    
    # Define response and features
    y <- "log_price"
    x <- c("grade", "year", "building_age", "sqft_living",
           "sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode",
           "lat", "long", "condition", "waterfront")
    m <- length(x)
    
    # random split
    ix <- sample(nrow(df), 0.8 * nrow(df))
    
    # Fit untuned random forest
    system.time( # 3 s
      fit_rf <- ranger(reformulate(x, y), data = df[ix, ])
    )
    y_test <- df[-ix, y]
    
    # Test RMSE: 0.173
    rmse(y_test, predict(fit_rf, df[-ix, ])$pred)
    # object.size(fit_rf) # 180 MB
    # Imports
    import numpy as np
    import pandas as pd
    
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    
    
    def rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))
      
    # Fetch data from OpenML
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    print("Shape: ", df.shape)
    df.head()
    
    # Prepare data
    df = df.assign(
        year = lambda x: x.date.str[0:4].astype(int),
        zipcode = lambda x: x.zipcode.astype(int)
    ).assign(
        building_age = lambda x: x.year - x.yr_built,
    )
    
    # Feature list
    xvars = [
        "grade", "year", "building_age", "sqft_living", 
        "sqft_lot", "bedrooms", "bathrooms", "floors", 
        "zipcode", "lat", "long", "condition", "waterfront"
    ]
    
    # Data split
    y_train, y_test, X_train, X_test = train_test_split(
        np.log(df["price"]), df[xvars], 
        train_size=0.8, random_state=766
    )
    
    # Fit scikit-learn rf
    rf = RandomForestRegressor(
        n_estimators=500, 
        max_features="sqrt", 
        max_depth=20,
        n_jobs=-1, 
        random_state=104
    )
    
    rf.fit(X_train, y_train)  # Wall time 3 s
    
    # Test RMSE: 0.176
    print(f"RMSE: {rmse(y_test, rf.predict(X_test)):.03f}")

    Both in R and Python, the test RMSE is between 0.17 and 0.18, i.e. about 2/3 of the test predictions are within 18% of the observed value. Not bad!
    Note: The test performance depends on the split seed, so it does not make sense to directly compare the R and Python performance.

    With XGBoost’s random forest mode

    Now let’s try to reach the same performance with XGBoost’s random forest implementation using the above parameter suggestions.

    # Fit untuned, but good(!) XGBoost random forest
    dtrain <- xgb.DMatrix(data.matrix(df[ix, x]),
                          label = df[ix, y])
    
    params <- list(
      objective = "reg:squarederror",
      learning_rate = 1,
      num_parallel_tree = 500,
      subsample = 0.63,
      colsample_bynode = floor(sqrt(m)) / m,
      reg_lambda = 0,
      max_depth = 20,
      min_child_weight = 2
    )
    
    system.time( # 20 s
      fit_xgb <- xgb.train(
        params,
        data = dtrain,
        nrounds = 1,
        verbose = 0
      )
    )
    
    pred <- predict(fit_xgb, data.matrix(df[-ix, x]))
    
    # Test RMSE: 0.174
    rmse(y_test, pred)
    # xgb.save(fit_xgb, "xgb.model") # 140 MB
    import xgboost as xgb
    
    dtrain = xgb.DMatrix(X_train, label=y_train)
    
    m = len(xvars)
    
    params = dict(
        objective="reg:squarederror",
        learning_rate=1,
        num_parallel_tree=500,
        subsample=0.63,
        colsample_bynode=int(np.sqrt(m))/m,
        reg_lambda=0,
        max_depth=20,
        min_child_weight=2
    )
    
    rf_xgb = xgb.train(  # Wall time 34 s
        params, 
        dtrain, 
        num_boost_round=1
    )
    preds = rf_xgb.predict(xgb.DMatrix(X_test))
    
    # 0.177
    print(f"RMSE: {rmse(y_test, preds):.03f}")

    We see:

    • The performance of the XGBoost random forest is essentially as good as the native random forest implementations. And all this without any parameter tuning!
    • XGBoost is much slower than the optimized random forest implementations. If this is a problem, reduce the tree depth, for example. In this example, Python takes almost twice as long as R. No idea why!
      The timings were measured on an ordinary four-core i7 processor.
    • Disk space required to store the model objects is comparable between XGBoost and native random forest implementations.

    What if you ran the same model with XGBoost’s defaults?

    • With default reg_lambda=1:
      The performance would end up at a catastrophic RMSE of 0.35!
    • With default max_depth=6:
      The RMSE would be much worse (0.23) as well.
    • With colsample_bytree instead of colsample_bynode:
      The RMSE would deteriorate to 0.27.

    Thus: It is essential to set some values to a good “random forest” default!
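
    To verify such numbers yourself, a small ablation loop does the trick. Here is a sketch in Python (not part of the original post), reusing params, dtrain, X_test, y_test and rmse from the code above:

    # Sketch: re-fit the XGBoost "random forest" with one default restored at a
    # time and compare the test RMSE. Assumes the objects from the Python code above.
    import xgboost as xgb

    ablations = {
        "good RF defaults": {},
        "default reg_lambda=1": {"reg_lambda": 1},
        "default max_depth=6": {"max_depth": 6},
        "colsample_bytree instead": {
            "colsample_bynode": 1.0,
            "colsample_bytree": params["colsample_bynode"],
        },
    }

    for name, override in ablations.items():
        booster = xgb.train({**params, **override}, dtrain, num_boost_round=1)
        pred = booster.predict(xgb.DMatrix(X_test))
        print(f"{name:30s} test RMSE: {rmse(y_test, pred):.3f}")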

    Does it always work that well?

    Definitely not in classification settings. However, in regression settings with the MSE loss, XGBoost’s random forest mode is often as accurate as native implementations.

    • Classification models
      In my experience, the XGBoost random forest mode does not work as well as a native random forest for classification, possibly because it uses only an approximation of the loss function.
    • Other regression examples
      Using the setting of our last “R <–> Python” post (diamond duplicates and grouped sampling) and the same parameters as above, we get the following test RMSEs: With ranger (R code in link below): 0.1043, with XGBoost: 0.1042. Sweet!

    Wrap up

    • With the right default parameters, XGBoost’s random forest mode reaches a performance on regression problems similar to that of native random forest packages. Without any tuning!
    • For losses other than MSE, it does not work so well.

    The Python notebook and R code can be found at:

  • A Curious Fact on the Diamonds Dataset

    Lost in Translation between R and Python 5

    Hello regression world

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    The last two posts included a deep dive into historic mortality rates as well as a study of a beautiful regression formula.

    Diamonds data

    One of the most used datasets to teach regression is the diamonds dataset. It describes 54’000 diamonds by

    • their price,
    • the four “C” variables (carat, color, cut, clarity),
    • as well as by perspective measurements table, depth, x, y, and z.

    The dataset is readily available, e.g. in

    • R package ggplot2,
    • Python package plotnine,
    • and the fantastic OpenML database.

    Question: How many times did you use diamonds data to compare regression techniques like random forests and gradient boosting?

    Answer: Probably a lot!

    The curious fact

    We recently stumbled over a curious fact regarding that dataset. 26% of the diamonds are duplicates regarding price and the four “C” variables. Within duplicates, the perspective variables table, depth, x, y, and z would differ as if a diamond had been measured from different angles.

    In order to illustrate the issue, let us add the two auxiliary variables

    • id: group id of diamonds with identical price and four “C”, and
    • id_size: number of rows in that id

    to the dataset and consider a couple of examples. You can view both the R and Python code, but the specific output will differ because of language-specific naming of the group ids.

    library(tidyverse)
    
    # We add group id and its size
    dia <- diamonds %>% 
      group_by(carat, cut, clarity, color, price) %>% 
      mutate(id = cur_group_id(),
             id_size = n()) %>% 
      ungroup() %>% 
      arrange(id)
    
    # Proportion of duplicates
    1 - max(dia$id) / nrow(dia)  # 0.26
    
    # Some examples
    dia %>% 
      filter(id_size > 1) %>%
      head(10)
    
    # Most frequent
    dia %>% 
      arrange(-id_size) %>% 
      head(.$id_size[1])
    
    # A random large diamond appearing multiple times
    dia %>% 
      filter(id_size > 3) %>% 
      arrange(-carat) %>% 
      head(.$id_size[1])
    import numpy as np
    import pandas as pd
    from plotnine.data import diamonds
    
    # Variable groups
    cat_vars = ["cut", "color", "clarity"]
    xvars = cat_vars + ["carat"]
    all_vars = xvars + ["price"]
    
    print("Shape: ", diamonds.shape)
    
    # Add id and id_size
    df = diamonds.copy()
    df["id"] = df.groupby(all_vars).ngroup()
    df["id_size"] = df.groupby(all_vars)["price"].transform(len)
    df.sort_values("id", inplace=True)
    
    print(f'Proportion of dupes: {1 - df["id"].max() / df.shape[0]:.0%}')
    
    print("Random examples")
    print(df[df.id_size > 1].head(10))
    
    print("Most frequent")
    print(df.sort_values(["id_size", "id"]).tail(13))
    
    print("A random large diamond appearing multiple times")
    df[df.id_size > 3].sort_values("carat").tail(6)
    Table 1: Some duplicates in the four “C” variables and price (Python output).
    Table 2: One of the two(!) diamonds appearing a whopping 43 times (Python output).
    Table 3: A large, 2.01 carat diamond appears six times (Python output).

    Of course, having the same id does not necessarily mean that the rows really describe the same diamond. price and the four “C”s could coincide purely by chance. Nevertheless: there are exactly six diamonds of 2.01 carat and a price of 16,778 USD in the dataset. And they all have the same color, cut and clarity. This cannot be coincidence!

    Why would this be problematic?

    In the presence of grouped data, standard validation techniques tend to reward overfitting.

    This becomes immediately clear having in mind the 2.01 carat diamond from Table 3. Standard cross-validation (CV) uses random or stratified sampling and would scatter the six rows of that diamond across multiple CV folds. Highly flexible algorithms like random forests or nearest-neighbour regression could exploit this by memorizing the price of this diamond in-fold and do very well out-of-fold. As a consequence, the stated CV performance would be too good and the choice of the modeling technique and its hyperparameters suboptimal.

    With grouped data, a good approach is often to randomly sample whole groups instead of single rows. Such grouped splitting ensures that all rows of a group end up in the same fold, removing the tendency to overfit described above.

    Note 1. In our case of duplicates, a simple alternative to grouped splitting would be to remove the duplicates altogether. However, the occurrence of duplicates is just one of many situations where grouped or clustered samples appear in reality.

    Note 2. The same considerations not only apply to cross-validation but also to simple train/validation/test splits.
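
    To make the difference between random and grouped splitting concrete, here is a small, self-contained check (a sketch with scikit-learn, independent of the evaluation code below): with GroupKFold, no group ever ends up in both the training and the test part of a fold.

    import numpy as np
    from sklearn.model_selection import GroupKFold, KFold

    # Toy data: 100 "diamonds", each appearing six times (like the 2.01 carat one)
    groups = np.repeat(np.arange(100), 6)
    X = np.random.rand(groups.size, 3)

    for name, cv in [("KFold", KFold(5, shuffle=True, random_state=0)),
                     ("GroupKFold", GroupKFold(5))]:
        leaked = sum(
            len(set(groups[tr]) & set(groups[te]))
            for tr, te in cv.split(X, groups=groups)
        )
        print(f"{name}: {leaked} groups appear on both sides of a split")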

    Evaluation

    What does this mean for our diamonds dataset? Using five-fold CV, we will estimate the true root-mean-squared error (RMSE) of a random forest predicting log price by the four “C”. We run this experiment twice: once creating the folds by random splitting and once by grouped splitting. How heavily will the results from random splitting be biased?

    library(ranger)
    library(splitTools) # one of our packages on CRAN
    
    set.seed(8325)
    
    # We model log(price)
    dia <- dia %>% 
      mutate(y = log(price))
    
    # Helper function: calculate rmse
    rmse <- function(obs, pred) {
      sqrt(mean((obs - pred)^2))
    }
    
    # Helper function: fit model on one fold and evaluate
    fit_on_fold <- function(fold, data) {
      fit <- ranger(y ~ carat + cut + color + clarity, data = data[fold, ])
      rmse(data$y[-fold], predict(fit, data[-fold, ])$pred)
    }
      
    # 5-fold CV for different split types
    cross_validate <- function(type, data) {
      folds <- create_folds(data$id, k = 5, type = type)
      mean(sapply(folds, fit_on_fold, data = data))
    }
    
    # Apply and plot
    (results <- sapply(c("basic", "grouped"), cross_validate, data = dia))
    barplot(results, col = "orange", ylab = "RMSE by 5-fold CV")
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score, GroupKFold, KFold
    from sklearn.metrics import make_scorer, mean_squared_error
    import seaborn as sns
    
    rmse = make_scorer(mean_squared_error, squared=False)
    
    # Prepare y, X
    df = df.sample(frac=1, random_state=6345)
    y = np.log(df.price)
    X = df[xvars].copy()
    
    # Correctly ordered integer encoding
    X[cat_vars] = X[cat_vars].apply(lambda x: x.cat.codes)
    
    # Cross-validation
    results = {}
    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", 
                               min_samples_leaf=5, n_jobs=-1)
    for nm, strategy in zip(("basic", "grouped"), (KFold, GroupKFold)):
        results[nm] = cross_val_score(
            rf, X, y, cv=strategy(), scoring=rmse, groups=df.id
        ).mean()
    print(results)
    
    res = pd.DataFrame(results.items())
    sns.barplot(x=0, y=1, data=res);
    Figure 1: Test root-mean-squared error using different splitting methods (R output).

    The RMSE of grouped CV (11%) is 8%-10% higher than that of random CV (10%). The standard technique therefore seems to be considerably biased.

    Final remarks

    • The diamonds dataset is not only a brilliant example to demonstrate regression techniques but also a great way to show the importance of a clean validation strategy (in this case: grouped splitting).
    • Blind or automatic ML would most probably fail to detect non-trivial data structures like in this case and therefore use inappropriate validation strategies. The resulting model would be somewhere between suboptimal and dangerous. Just that nobody would know it!
    • The first step towards a good model validation strategy is data understanding. This is a mix of knowing the data source, how the data was generated, the meaning of columns and rows, descriptive statistics etc.

    The Python notebook and R code can be found at:

  • A Beautiful Regression Formula

    Lost in Translation between R and Python 4

    Hello statistics aficionados

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    The last one was a deep dive into historic mortality rates.

    No Covid-19, no public data for a change: This post focusses on a real beauty, namely a decomposition of the R-squared in a linear regression model

    E(y) = \alpha + \sum_{j = 1}^p x_j \beta_j

    fitted by least-squares. If the response y and all p covariables are standardized to variance 1 beforehand, then the R-squared can be obtained as the cross-product of the fitted coefficients and the usual correlations between each covariable and the response:

    R^2 = \sum_{j = 1}^p \text{cor}(y, x_j)\hat\beta_j.

    Two elegant derivations can be found in this answer to the same question, written by the number one contributor to Cross Validated: whuber. Look up a couple of his posts – and statistics will suddenly feel super easy and clear.
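
    For the impatient, here is a compact version of the argument (a sketch; see the linked answer for the details). With standardized variables, the empirical variance of y is 1 and the empirical covariance of y with each x_j equals their correlation. Combined with the orthogonality of OLS residuals and fitted values, this yields

    R^2 = \widehat{\text{Var}}(\hat y) = \widehat{\text{Cov}}(y, \hat y) = \sum_{j = 1}^p \hat\beta_j \widehat{\text{Cov}}(y, x_j) = \sum_{j = 1}^p \text{cor}(y, x_j)\hat\beta_j.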

    Direct consequences of the formula are:

    1. If a covariable is uncorrelated with the response, it cannot contribute to the R-squared, i.e. neither improve nor worsen. This is not obvious.
    2. A correlated covariable only improves R-squared if its coefficient is non-zero. Put differently: if the effect of a covariable is already fully covered by the other covariables, it does not improve the R-squared. This is somewhat obvious.

    Note that all formulas refer to in-sample calculations.

    Rather than diving deeper into the math, we simply demonstrate the result with short R and Python code based on the famous iris dataset.

    y <- "Sepal.Width"
    x <- c("Sepal.Length", "Petal.Length", "Petal.Width")
    
    # Scaled version of iris
    iris2 <- data.frame(scale(iris[c(y, x)]))
    
    # Fit model 
    fit <- lm(reformulate(x, y), data = iris2)
    summary(fit) # multiple R-squared: 0.524
    (betas <- coef(fit)[x])
    # Sepal.Length Petal.Length  Petal.Width 
    #    1.1533143   -2.3734841    0.9758767 
    
    # Correlations (scaling does not matter here)
    (cors <- cor(iris[, y], iris[x]))
    # Sepal.Length Petal.Length Petal.Width
    #   -0.1175698   -0.4284401  -0.3661259
     
    # The R-squared?
    sum(betas * cors) # 0.524
    # Import packages
    import numpy as np
    import pandas as pd
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    
    # Load data
    iris = datasets.load_iris(as_frame=True).data
    print("The data:", iris.head(3), sep = "\n")
    
    # Specify response
    yvar = "sepal width (cm)"
    
    # Correlations of everyone with response
    cors = iris.corrwith(iris[yvar]).drop(yvar)
    print("\nCorrelations:", cors, sep = "\n")
    
    # Prepare scaled response and covariables
    X = StandardScaler().fit_transform(iris.drop(yvar, axis=1))
    y = StandardScaler().fit_transform(iris[[yvar]])
    
    # Fit linear regression
    OLS = LinearRegression().fit(X, y)
    betas = OLS.coef_[0]
    print("\nScaled coefs:", betas, sep = "\n")
    
    # R-squared via scikit-learn: 0.524
    print(f"\nUsual R-squared:\t {OLS.score(X, y): .3f}")
    
    # R-squared via decomposition: 0.524
    rsquared = betas @ cors.values
    print(f"Applying the formula:\t {rsquared: .3f}")

    Indeed: the cross-product of coefficients and correlations equals the R-squared of 52%.

    The Python notebook and R code can be found at:

  • Swiss Mortality

    Lost in Translation between R and Python 3

    Hello again!

    This is the third article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    Post 2: https://lorentzen.ch/index.php/2021/01/26/covid-19-deaths-per-mio/

    Before diving into the data analysis, I would like to share with you that writing this text brought back some memories from my first year as an actuary. Indeed, I started in life insurance and soon switched to non-life. While investigating mortality may be seen as a dry matter, it reveals some interesting statistical problems which have to be properly addressed before drawing conclusions.

    Similar to Post 2, we use publicly available data, this time from the Human Mortality Database and from the Federal Statistical Office of Switzerland, in order to calculate crude death rates (CDR), i.e. the number of deaths per person alive per year. This time, we look at longer time periods of 20 and of over 100 years and take all causes of death into account. We caution against any misinterpretation: we show only crude death rates, which do not take into account any demographic shifts like changing age distributions or effects from measures taken against COVID-19.
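
    To make the aggregation explicit: if, for each week w, the data provide a death count D_w and a crude rate R_w = D_w / E_w for some exposure E_w, then the implied exposure is D_w / R_w, and the yearly CDR per country can be computed as

    \text{CDR} = \frac{\sum_w D_w}{\sum_w D_w / R_w},

    which is exactly the “population pro rata temporis” trick used in the code below.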

    Let us start with producing the first figure. We fetch the data from the internet, pick some countries of interest, focus on males and females combined only, aggregate and plot. The Python version uses the visualization library Altair, which can generate interactive charts. Unfortunately, we can only show a static version in the blog post. If someone knows how to render Vega-Lite charts in WordPress, I’d be very interested in a secure solution.

    library(tidyverse)
    library(lubridate)
    
    # Fetch data
    df_original = read_csv(
      "https://www.mortality.org/Public/STMF/Outputs/stmf.csv",
      skip = 2
    )
    
    # 1. Select countries of interest and only "both" sexes
    # Note: Germany "DEUTNP" and "USA" have short time series
    # 2. Change to ISO-3166-1 ALPHA-3 codes
    # 3. Create population pro rata temporis (exposure) to ease aggregation
    df_mortality <- df_original %>% 
      filter(CountryCode %in% c("CAN", "CHE", "FRATNP", "GBRTENW", "SWE"),
             Sex == "b") %>% 
      mutate(CountryCode = recode(CountryCode, "FRATNP" = "FRA",
                                  "GBRTENW" = "England & Wales"),
             population = DTotal / RTotal,
             Year = ymd(Year, truncated = 2))
    
    # Data aggregation per year and country
    df <- df_mortality %>%
      group_by(Year, CountryCode) %>% 
      summarise(CDR = sum(DTotal) / sum(population), 
                .groups = "drop")
    
    ggplot(df, aes(x = Year, y = CDR, color = CountryCode)) +
      geom_line(size = 1) +
      ylab("Crude Death Rate per Year") +
      theme(legend.position = c(0.2, 0.8))
    import pandas as pd
    import altair as alt
    
    
    # Fetch data
    df_mortality = pd.read_csv(
        "https://www.mortality.org/Public/STMF/Outputs/stmf.csv",
        skiprows=2,
    )
    
    # Select countries of interest and only "both" sexes
    # Note: Germany "DEUTNP" and "USA" have short time series
    df_mortality = df_mortality[
        df_mortality["CountryCode"].isin(["CAN", "CHE", "FRATNP", "GBRTENW", "SWE"])
        & (df_mortality["Sex"] == "b")
    ].copy()
    
    # Change to ISO-3166-1 ALPHA-3 codes
    df_mortality["CountryCode"].replace(
        {"FRATNP": "FRA", "GBRTENW": "England & Wales"}, inplace=True
    )
    
    # Create population pro rata temporis (exposure) to ease aggregation
    df_mortality = df_mortality.assign(
        population=lambda df: df["DTotal"] / df["RTotal"]
    )
    
    # Data aggregation per year and country
    df_mortality = (
        df_mortality.groupby(["Year", "CountryCode"])[["population", "DTotal"]]
        .sum()
        .assign(CDR=lambda x: x["DTotal"] / x["population"])
        # .filter(items=["CDR"])  # make df even smaller
        .reset_index()
        .assign(Year=lambda x: pd.to_datetime(x["Year"], format="%Y"))
    )
    
    chart = (
        alt.Chart(df_mortality)
        .mark_line()
        .encode(
            x="Year:T",
            y=alt.Y("CDR:Q", scale=alt.Scale(zero=False)),
            color="CountryCode:N",
        )
        .properties(title="Crude Death Rate per Year")
        .interactive()
    )
    # chart.save("crude_death_rate.html")
    chart
    Crude death rate (CDR) for Canada (CAN), Switzerland (CHE), England & Wales, France (FRA) and Sweden (SWE). Data as of 07.02.2021.

    Note that the y-axis does not start at zero. Nevertheless, we see values between 0.007 and 0.014, so the largest values are about twice the smallest. While 2020 shows a clear rise in mortality, for some countries more dramatic than for others, the values of 2021 are still preliminary. For 2021, the data is still incomplete and the yearly CDR is based on a small observation period and hence on a smaller population pro rata temporis. On top of that, there might be effects from seasonality. To sum up, this means that there is a larger uncertainty for 2021 than for previous whole years.

    For Switzerland, it is also possible to collect data for over 100 years. As the code for data fetching and preparation becomes a bit lengthy, we won’t bother you with it. You can find it in the notebooks linked below. Note that we added the value of 2020 from the figure above. This seems legitimate, as the CDRs of the two data sources agree within less than 1% relative error.

    Crude death rate (CDR) for Switzerland from 1901 to 2020.

    Again, note that the left y-axis does not start at zero, but the right y-axis does. One can see several interesting facts:

    • The Swiss population has been growing for the last 120 years, with the only exception around 1976.
    • The Spanish flu between 1918 and 1920 caused by far the largest peak in mortality in the last 120 years.
    • The Second World War is not visible in Swiss mortality.
    • Overall, the mortality is decreasing.

    The Python notebook and R code can be found at:

  • Covid-19 Deaths per Mio

    Lost in Translation between R and Python 2

    Hello again!

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    Post 1: https://lorentzen.ch/index.php/2021/01/07/illustrating-the-central-limit-theorem/

    In Post 2, we use publicly available data from the European Centre for Disease Prevention and Control to calculate Covid-19 deaths per Mio persons over time and across countries. We will use slim Python and R code to

    • fetch the data directly from the internet,
    • prepare and restructure it for plotting and
    • plot a curve per selected country.

    Note that different countries use different definitions of whom to count as a Covid-19 death, and these definitions might also have changed over time. So be careful with comparisons!

    library(tidyverse)
    
    # Source and countries
    link <- "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"
    countries <- c("Switzerland", "United_States_of_America", 
                   "Germany", "Sweden")
    
    # Import
    df0 <- read_csv(link)
    
    # Data prep
    df <- df0 %>%
      mutate(Date = lubridate::dmy(dateRep),
             Deaths = deaths_weekly / (popData2019 / 1e6))  %>%
      rename(Country = countriesAndTerritories) %>%
      filter(Date >= "2020-03-01",
             Country %in% countries)
    
    # Plot
    ggplot(df, aes(x = Date, y = Deaths, color = Country)) +
      geom_line(size = 1) +
      ylab("Weekly deaths per Mio") +
      theme(legend.position = c(0.2, 0.85))
    import pandas as pd
    
    # Source and countries
    url = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"
    countries = ["Switzerland", "United_States_of_America", 
                 "Germany", "Sweden"]
    
    # Fetch data
    df0 = pd.read_csv(url)
    # df0.head()
    
    # Prepare data
    df = df0.assign(
        Date=lambda x: pd.to_datetime(x["dateRep"], format="%d/%m/%Y"),
        Deaths=lambda x: x["deaths_weekly"] / x["popData2019"] * 1e6,
    ).rename(columns={"countriesAndTerritories": "Country"})
    df = df.loc[
        (df["Country"].isin(countries)) & (df["Date"] >= "2020-03-01"),
        ["Country", "Date", "Deaths"],
    ]
    df = df.pivot(index="Date", columns="Country")
    df = df.droplevel(0, axis=1)
    
    # Plot
    ax = df.plot()
    ax.set_ylabel('Weekly Covid-19 deaths per Mio');

    Weekly Covid-19 deaths per Mio inhabitants as of January 26, 2021 (Python output).

    The code can be found on https://github.com/mayer79/covid with some other analyses regarding viruses.

  • Illustrating The Central Limit Theorem

    Lost in Translation between R and Python 1

    This is the first article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    Let’s start with a little bit of statistics – it won’t be the last time, friends: Illustrating the Central Limit Theorem (CLT).

    Take a sample of a random variable X with finite variance. The CLT says: no matter how “unnormally” distributed X is, its sample mean will be approximately normally distributed, at least if the sample size is not too small. This classic result is the basis for constructing simple confidence intervals and hypothesis tests for the (true) mean of X; check out Wikipedia for a lot of additional information.
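
    In formula form, with \mu = E(X) and \sigma^2 = \text{Var}(X) < \infty, the CLT states that the mean of a sample of n independent copies of X is approximately normally distributed:

    \bar X_n = \frac{1}{n}\sum_{i = 1}^n X_i \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n.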

    The code below illustrates this famous statistical result by simulation, using a very asymmetrically distributed X, namely X = 1 with probability 0.2 and X = 0 otherwise. X could represent the result of asking a randomly picked person whether they smoke. In such a poll, the mean of the collected results would be a statistical estimate of the proportion of people who smoke.

    Curiously, by a tiny modification, the same code will also illustrate another key result in statistics – the Law of Large Numbers: For growing sample size, the distribution of the sample mean of X contracts to the expectation E(X).
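
    In the same notation, the Law of Large Numbers states that

    \bar X_n \to E(X) \quad \text{(in probability) as } n \to \infty,

    which is exactly the contraction you will see when fixing the x-scale in the plots below.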

    # Fix seed, set constants
    set.seed(2006)
    sample_sizes <- c(1, 10, 30, 1000)
    nsims <- 10000
    
    # Helper function: Mean of one sample of X
    one_mean <- function(n, p = c(0.8, 0.2)) {
      mean(sample(0:1, n, replace = TRUE, prob = p))
    }
    # one_mean(10)
    
    # Simulate and plot
    par(mfrow = c(2, 2), mai = rep(0.4, 4))
    
    for (n in sample_sizes) {
      means <- replicate(nsims, one_mean(n))
      hist(means, breaks = "FD", 
           # xlim = 0:1, # uncomment for LLN
           main = sprintf("n=%i", n))
    }
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    # Fix seed, set constants
    np.random.seed(100)
    sample_sizes = [1, 10, 30, 1000]
    nsims = 10_000
    
    # Helper function: Mean of one sample
    def one_mean(n, p=0.2):
        return np.random.binomial(1, p, n).mean()
    
    # Simulate and plot
    fig, axes = plt.subplots(2, 2, figsize=(8, 8))
    
    for i, n in enumerate(sample_sizes):
        means = [one_mean(n) for ell in range(nsims)]
        ax = axes[i // 2, i % 2]
        ax.hist(means, 50)
        ax.title.set_text(f'$n = {n}$')
        ax.set_xlabel('mean')
        # ax.set_xlim(0, 1)  # uncomment for LLN
    fig.tight_layout()

    Result: The Central Limit Theorem

    The larger the samples, the closer the histogram of the simulated means resembles a symmetric bell-shaped curve (R output for illustration).

    Result: The Law of Large Numbers

    Fixing the x-scale illustrates – for free(!) – the Law of Large Numbers: the distribution of the mean contracts more and more to the expectation 0.2 (R output for illustration).

    See also the Python notebook https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-01-07 Illustrating The Central Limit Theorem.ipynb and, for many great posts on R, http://www.R-bloggers.com.