Tag: Python

Python programming language

  • Converting CSV to Parquet

    Conversion from CSV to Parquet in streaming mode? No problem for the two powerhouses Polars and DuckDB. We can even throw in some data preprocessing steps in between, like column selection, data filters, or sorts.

    Edit: Streaming writing (or “lazy sinking”) of data with Polars was introduced with release 1.25.2 in March 2025, thanks Christian for pointing this out.

    pip install polars

    pip install duckdb


    Run times are on a normal laptop, dedicating 8 threads to the crunching.

    Let’s first generate a 2 GB CSV file:

    import duckdb  # 1.2.1
    import numpy as np  # 1.26.4
    import polars as pl  # 1.25.2
    
    n = 100_000_000
    
    rng = np.random.default_rng(42)
    
    df = pl.DataFrame(
        {
            "X": rng.choice(["a", "b", "c"], n),
            "Y": rng.uniform(0, 1, n),
            "Z": rng.choice([1, 2, 3, 4, 5], n),
        }
    )
    
    df.write_csv("data.csv")

    Polars

    Let’s use Polars in Lazy mode to connect to the CSV, apply some data operations, and stream the result into a Parquet file.

    # Native API with POLARS_MAX_THREADS = 8
    (
        pl.scan_csv("data.csv")
        .filter(pl.col("X") == "a")
        .drop("X")
        .sort(["Y", "Z"])
        .sink_parquet("data.parquet", row_group_size=100_000)  # "zstd" compression
    )
    # 3.5 s

    In case you prefer to write SQL code, you can alternatively use the SQL API of Polars. Curiously, run time is substantially longer:

    # Via SQL API (slower!?)
    (
        pl.scan_csv("data.csv")
        .sql("SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z")
        .sink_parquet("data.parquet", row_group_size=100_000)
    )
    
    # 6.8 s

    In both cases, the result looks as expected, and the resulting Parquet file is about 170 MB in size.

    pl.scan_parquet("data.parquet").head(5).collect()
    
    # Output
            Y   Z
          f64 i64
    3.7796e-8	4
    5.0273e-8	5
    5.7652e-8	4
    8.0578e-8	3
    8.1598e-8	4

    DuckDB

    As an alternative, we use DuckDB. Thread pool size and RAM limit can be set on the fly. Setting a low memory limit (e.g., 500 MB) will lead to longer run times, but it works.

    con = duckdb.connect(config={"threads": 8, "memory_limit": "4GB"})
    
    con.sql(
        """
        COPY (
            SELECT Y, Z
            FROM 'data.csv'
            WHERE X == 'a'
            ORDER BY Y, Z
        ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
        """
    )
    
    # 3.9 s

    Again, the output looks as expected. The Parquet file is again about 170 MB in size, thanks to using the same compression (“zstd”) as with Polars.

    con.sql("SELECT * FROM 'data.parquet' LIMIT 5")
    
    # Output
    ┌────────────────────────┬───────┐
    │           Y            │   Z   │
    │         double         │ int64 │
    ├────────────────────────┼───────┤
    │  3.779571322581887e-08 │     4 │
    │ 5.0273087692787044e-08 │     5 │
    │   5.76523543349694e-08 │     4 │
    │  8.057776434977626e-08 │     3 │
    │  8.159834352650108e-08 │     4 │
    └────────────────────────┴───────┘

    Final words

    • With Polars or DuckDB, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
    • We can apply filters, selects, sorts etc. on the fly.

    Python notebook

  • Effect Plots in Python and R

    Christian and I did some code magic: highly effective plots that help to build and inspect any model.

    The functionality is best described by its output:

    [Example output plots: Python and R versions]

    The plots show different types of feature effects relevant in modeling (the first three are sketched in code after the following list):

    • Average observed: Descriptive effect (also interesting without model).
    • Average predicted: Combined effect of all features. Also called “M Plot” (Apley 2020).
    • Partial dependence: Effect of one feature, keeping other feature values constant (Friedman 2001).
    • Number of observations or sum of case weights: Feature value distribution.
    • R only: Accumulated local effects, an alternative to partial dependence (Apley 2020).
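
    To make the first three statistics concrete, here is a minimal sketch of mine (not the packages’ actual implementation) that computes average observed, average predicted, and partial dependence for one numeric feature; model, X, and y are assumed to be a fitted estimator with a predict method, a pandas DataFrame, and the response.

    import numpy as np
    import pandas as pd
    
    def simple_effects(model, X, y, feature, n_bins=20):
        """Average observed, average predicted, and partial dependence per feature bin."""
        edges = np.unique(np.quantile(X[feature], np.linspace(0, 1, n_bins + 1)))
        bins = pd.cut(X[feature], edges, include_lowest=True)
    
        out = (
            pd.DataFrame(
                {
                    "bin": bins,
                    "avg_observed": np.asarray(y),
                    "avg_predicted": model.predict(X),
                }
            )
            .groupby("bin", observed=True)
            .mean(numeric_only=True)
        )
    
        # Partial dependence (Friedman 2001): set the feature to each bin midpoint
        # for all rows, predict, and average.
        pdp = []
        for interval in out.index:
            X_mod = X.copy()
            X_mod[feature] = interval.mid
            pdp.append(model.predict(X_mod).mean())
        out["partial_dependence"] = pdp
        return out
    
    # e.g. simple_effects(model, X, y, "driver_age")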

    Both implementations…

    • are highly efficient thanks to {Polars} in Python and {collapse} in R, and work on datasets with millions of observations,
    • support case weights with all their statistics, ideal in insurance applications,
    • calculate average residuals (not shown in the plots above),
    • provide standard deviations/errors of average observed and bias,
    • allow switching to Plotly for interactive plots, and
    • are highly customizable (the R package, e.g., allows collapsing rare levels after calculating statistics via the update() method, or sorting the features by main effect importance).

    In the spirit of our “Lost In Translation” series, we provide both high-quality Python and R code. We will use the same data and models as in one of our latest posts on how to build strong GLMs via ML + XAI.

    Example

    Let’s build a Poisson LightGBM model to explain the claim frequency given six traditional features in a pricing dataset on motor liability claims. 80% of the 1 million rows are used for training, the other 20% for evaluation. Hyper-parameters have been slightly tuned (not shown).

    library(OpenML)
    library(lightgbm)
    
    dim(df <- getOMLDataSet(data.id = 45106L)$data)  # 1000000 7
    head(df)
    
    #   year town driver_age car_weight car_power car_age claim_nb
    # 0 2018    1         51       1760       173       3        0
    # 1 2019    1         41       1760       248       2        0
    # 2 2018    1         25       1240       111       2        0
    # 3 2019    0         40       1010        83       9        0
    # 4 2018    0         43       2180       169       5        0
    # 5 2018    1         45       1170       149       1        1
    
    yvar <- "claim_nb"
    xvars <- setdiff(colnames(df), yvar)
    
    ix <- 1:800000
    train <- df[ix, ]
    test <- df[-ix, ]
    X_train <- data.matrix(train[xvars])
    X_test <- data.matrix(test[xvars])
    
    # Training, using slightly optimized parameters found via cross-validation
    params <- list(
      learning_rate = 0.05,
      objective = "poisson",
      num_leaves = 7,
      min_data_in_leaf = 50,
      min_sum_hessian_in_leaf = 0.001,
      colsample_bynode = 0.8,
      bagging_fraction = 0.8,
      lambda_l1 = 3,
      lambda_l2 = 5,
      num_threads = 7
    )
    
    set.seed(1)
    
    fit <- lgb.train(
      params = params,
      data = lgb.Dataset(X_train, label = train$claim_nb),
      nrounds = 300
    )
    import matplotlib.pyplot as plt
    from lightgbm import LGBMRegressor
    from sklearn.datasets import fetch_openml
    
    df = fetch_openml(data_id=45106, parser="pandas").frame
    df.head()
    
    #   year town driver_age car_weight car_power car_age claim_nb
    # 0 2018    1         51       1760       173       3        0
    # 1 2019    1         41       1760       248       2        0
    # 2 2018    1         25       1240       111       2        0
    # 3 2019    0         40       1010        83       9        0
    # 4 2018    0         43       2180       169       5        0
    # 5 2018    1         45       1170       149       1        1
    
    # Train model on 80% of the data
    y = df.pop("claim_nb")
    n_train = 800_000
    X_train, y_train = df.iloc[:n_train], y.iloc[:n_train]
    X_test, y_test = df.iloc[n_train:], y.iloc[n_train:]
    
    params = {
        "learning_rate": 0.05,
        "objective": "poisson",
        "num_leaves": 7,
        "min_child_samples": 50,
        "min_child_weight": 0.001,
        "colsample_bynode": 0.8,
        "subsample": 0.8,
        "reg_alpha": 3,
        "reg_lambda": 5,
        "verbose": -1,
    }
    
    model = LGBMRegressor(n_estimators=300, **params, random_state=1)
    model.fit(X_train, y_train)

    Let’s inspect the main effects of the model on the test data.

    library(effectplots)
    
    # 0.3 s
    feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb) |>
      plot(share_y = "all")
    from model_diagnostics.calibration import plot_marginal
    
    fig, axes = plt.subplots(3, 2, figsize=(8, 8), sharey=True, layout="tight")
    
    # 2.3 s
    for i, (feat, ax) in enumerate(zip(X_test.columns, axes.flatten())):
        plot_marginal(
            y_obs=y_test,
            y_pred=model.predict(X_test),
            X=X_test,
            feature_name=feat,
            predict_function=model.predict,
            ax=ax,
        )
        ax.set_title(feat)
        if i != 1:
            ax.legend().remove()

    The output can be seen at the beginning of this blog post.

    Here are some model insights:

    • Average predictions closely match observed frequencies. No clear bias is visible.
    • Partial dependence shows that the year and the car weight have almost no impact (regarding their main effects), while the driver_age and car_power effects seem strongest. The shared y axes help to assess this.
    • Except for car_weight, the partial dependence curve closely follows the average predictions. This means that the model effect really comes from the feature on the x axis, and not from some other correlated feature (as, e.g., with car_weight, which is strongly correlated with car_power).

    Final words

    • Inspecting models has become much more relaxed with the above functions.
    • The packages used offer much more functionality. Try them out! Or we will show them in later posts ;).

    References

    1. Apley, Daniel W., and Jingyu Zhu. 2020. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82 (4): 1059–1086. doi:10.1111/rssb.12377.
    2. Friedman, Jerome H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (5): 1189–1232. doi:10.1214/aos/1013203451.

    R script, Python notebook

  • SHAP Values of Additive Models

    Within only a few years, SHAP (Shapley additive explanations) has emerged as the number 1 way to investigate black-box models. The basic idea is to decompose model predictions into additive contributions of the features in a fair way. Studying decompositions of many predictions allows us to derive global properties of the model.

    What happens if we apply SHAP algorithms to additive models? Why would this ever make sense?

    In the spirit of our “Lost In Translation” series, we provide both high-quality Python and R code.

    The models

    Let’s build the models using a dataset with three highly correlated covariates and a (deterministic) response.

    library(lightgbm)
    library(kernelshap)
    library(shapviz)
    
    #===================================================================
    # Make small data
    #===================================================================
    
    make_data <- function(n = 100) {
      x1 <- seq(0.01, 1, length = n)
      data.frame(
        x1 = x1,
        x2 = log(x1),
        x3 = x1 > 0.7
      ) |>
        transform(y = 1 + 0.2 * x1 + 0.5 * x2 + x3 + sin(2 * pi * x1))
    }
    df <- make_data()
    head(df)
    cor(df) |>
      round(2)
    
    #      x1   x2   x3    y
    # x1 1.00 0.90 0.80 0.46
    # x2 0.90 1.00 0.58 0.58
    # x3 0.80 0.58 1.00 0.51
    # y  0.46 0.58 0.51 1.00
    
    #===================================================================
    # Additive linear model and additive boosted trees
    #===================================================================
    
    # Linear regression
    fit_lm <- lm(y ~ poly(x1, 3) + poly(x2, 3) + x3, data = df)
    summary(fit_lm)
    
    # Boosted trees
    xvars <- setdiff(colnames(df), "y")
    X <- data.matrix(df[xvars])
    
    params <- list(
      learning_rate = 0.05,
      objective = "mse",
      max_depth = 1,
      colsample_bynode = 0.7
    )
    
    fit_lgb <- lgb.train(
      params = params,
      data = lgb.Dataset(X, label = df$y),
      nrounds = 300
    )
    import numpy as np
    import lightgbm as lgb
    import shap
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    
    #===================================================================
    # Make small data
    #===================================================================
    
    def make_data(n=100):
        x1 = np.linspace(0.01, 1, n)
        x2 = np.log(x1)
        x3 = x1 > 0.7
        X = np.column_stack((x1, x2, x3))
    
        y = 1 + 0.2 * x1 + 0.5 * x2 + x3 + np.sin(2 * np.pi * x1)
        
        return X, y
    
    X, y = make_data()
    
    #===================================================================
    # Additive linear model and additive boosted trees
    #===================================================================
    
    # Linear model with polynomial terms
    poly = PolynomialFeatures(degree=3, include_bias=False)
    
    preprocessor =  ColumnTransformer(
        transformers=[
            ("poly0", poly, [0]),
            ("poly1", poly, [1]),
            ("other", "passthrough", [2]),
        ]
    )
    
    model_lm = Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("lm", LinearRegression()),
        ]
    )
    _ = model_lm.fit(X, y)
    
    # Boosted trees with single-split trees
    params = dict(
        learning_rate=0.05,
        objective="mse",
        max_depth=1,
        colsample_bynode=0.7,
    )
    
    model_lgb = lgb.train(
        params=params,
        train_set=lgb.Dataset(X, label=y),
        num_boost_round=300,
    )

    SHAP

    For both models, we use exact permutation SHAP and exact Kernel SHAP. Furthermore, the linear model is analyzed with “additive SHAP”, and the tree-based model with TreeSHAP.

    Do the algorithms provide the same values?

    system.time({  # 1s
      shap_lm <- list(
        add = shapviz(additive_shap(fit_lm, df)),
        kern = kernelshap(fit_lm, X = df[xvars], bg_X = df),
        perm = permshap(fit_lm, X = df[xvars], bg_X = df)
      )
    
      shap_lgb <- list(
        tree = shapviz(fit_lgb, X),
        kern = kernelshap(fit_lgb, X = X, bg_X = X),
        perm = permshap(fit_lgb, X = X, bg_X = X)
      )
    })
    
    # Consistent SHAP values for linear regression
    all.equal(shap_lm$add$S, shap_lm$perm$S)
    all.equal(shap_lm$kern$S, shap_lm$perm$S)
    
    # Consistent SHAP values for boosted trees
    all.equal(shap_lgb$tree$S, shap_lgb$perm$S)
    all.equal(shap_lgb$kern$S, shap_lgb$perm$S)
    
    # Linear coefficient of x3 equals slope of SHAP values
    tail(coef(fit_lm), 1)                # 1.112096
    diff(range(shap_lm$kern$S[, "x3"]))  # 1.112096
    
    sv_dependence(shap_lm$add, xvars)
    sv_dependence(shap_lm$add, xvars, color_var = NULL)
    shap_lm = {
        "add": shap.Explainer(model_lm.predict, masker=X, algorithm="additive")(X),
        "perm": shap.Explainer(model_lm.predict, masker=X, algorithm="exact")(X),
        "kern": shap.KernelExplainer(model_lm.predict, data=X).shap_values(X),
    }
    
    shap_lgb = {
        "tree": shap.Explainer(model_lgb)(X),
        "perm": shap.Explainer(model_lgb.predict, masker=X, algorithm="exact")(X),
        "kern": shap.KernelExplainer(model_lgb.predict, data=X).shap_values(X),
    }
    
    # Consistency for additive linear regression
    eps = 1e-12
    assert np.abs(shap_lm["add"].values - shap_lm["perm"].values).max() < eps
    assert np.abs(shap_lm["perm"].values - shap_lm["kern"]).max() < eps
    
    # Consistency for additive boosted trees
    assert np.abs(shap_lgb["tree"].values - shap_lgb["perm"].values).max() < eps
    assert np.abs(shap_lgb["perm"].values - shap_lgb["kern"]).max() < eps
    
    # Linear effect of last feature in the fitted model
    model_lm.named_steps["lm"].coef_[-1]  # 1.112096
    
    # Linear effect of last feature derived from SHAP values (ignore the sign)
    shap_lm["perm"][:, 2].values.ptp()    # 1.112096
    
    shap.plots.scatter(shap_lm["add"])
    SHAP dependence plot of the additive linear model and the additive explainer (Python).

    Yes – within each model, the three algorithms provide the same SHAP values. Furthermore, the SHAP values reconstruct the additive components of the features.

    Didactically, this is very helpful when introducing SHAP as a method: Pick a white-box and a black-box model and compare their SHAP dependence plots. For the white-box model, you simply see the additive components, while the dependence plots of the black-box model show scatter due to interactions.

    Remark: The exact equivalence between algorithms is lost when

    • there are too many features for exact procedures (~10+ features), and/or when
    • the background data of Kernel/Permutation SHAP does not agree with the training data. This leads to slightly different estimates of the baseline value, which itself influences the calculation of SHAP values (see the sketch below).
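
    As a small illustration of the second point, here is a sketch of mine (reusing model_lm and X from above): the baseline of the exact explainer is just the mean prediction over the background data, so a different background shifts both the baseline and the SHAP values.

    import numpy as np
    import shap
    
    # Exact SHAP with the full data vs. a subsample as background
    ex_full = shap.Explainer(model_lm.predict, masker=X, algorithm="exact")(X)
    ex_sub = shap.Explainer(model_lm.predict, masker=X[:25], algorithm="exact")(X)
    
    # Baseline = average prediction over the respective background data
    print(ex_full.base_values[0], model_lm.predict(X).mean())
    print(ex_sub.base_values[0], model_lm.predict(X[:25]).mean())
    
    # Different baselines imply (slightly) different SHAP values
    print(np.abs(ex_full.values - ex_sub.values).max())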

    Final words

    • SHAP algorithms applied to additive models typically give identical results. Slight differences might occur because sampling versions of the algos are used, or a different baseline value is estimated.
    • The resulting SHAP values describe the additive components.
    • Didactically, it helps to see SHAP analyses of white-box and black-box models side by side.

    R script, Python notebook

  • A Tweedie Trilogy — Part III: From Wright’s Generalized Bessel Function to Tweedie’s Compound Poisson Distribution

    TLDR: The scipy 1.7.0 release introduced Wright’s generalized Bessel function in the Python ecosystem. It is an important ingredient for the density and log-likelihood of Tweedie probability distributions. In this last part of the trilogy, I’d like to point out why it was important to have this function and share the endeavor of implementing this inconspicuous but highly intractable special function. The fun part is exploiting a free parameter in an integral representation, which can be optimized by curve fitting to the minimal arc length.

    This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

    See part I and part II.

    Tweedie Distributions

    As pointed out in part I and part II, the family of Tweedie distributions is a very special one with outstanding properties. They are central for estimating expectations with GLMs. The probability distributions have mainly positive (non-negative) support and are skewed, e.g. Poisson, Gamma, Inverse Gaussian and compound Poisson-Gamma.

    As members of the exponential dispersion family, a slight extension of the exponential family, the probability density can be written as

    \begin{align*}
    f(y; \theta, \phi) &= c(y, \phi) \exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right)
    \\
    \kappa(\theta) &= \kappa_p(\theta) = \frac{1}{2-p}((1-p)\theta)^{\frac{2-p}{1-p}}
    \end{align*}

    It is often more instructive to parametrise the distribution with p, \mu and \phi, using

    \begin{align*}
    \theta &= \begin{cases}
    \frac{\mu^{1-p}}{1-p}\,,\quad p\neq 1\\
    \log(\mu)\,,\quad p=1
    \end{cases}
    \\
    \kappa(\theta) &= \begin{cases}
    \frac{\mu^{2-p}}{2-p}\,,\quad p\neq 2\\
    \log(\mu)\,,\quad p=2
    \end{cases}
    \end{align*}

    and write

    \begin{align*}
    Y &\sim \mathrm{Tw}_p(\mu, \phi)
    \end{align*}
    Probability density of several Tweedie distributions.

    Compound Poisson Gamma

    A very special domain for the power parameter is between Poisson and Gamma: 1<p<2. This range results in the compound Poisson-Gamma distribution, which is suitable if you have a random count process and if each count itself has a random amount. A well-known example is insurance claims. Typically, there is a random number of insurance claims, and each and every claim has a random amount of claim costs.

    \begin{align*}
    N &\sim \mathrm{Poisson}(\lambda)\\
    X_i &\sim \mathrm{Gamma}(a, b)\\
    Y &= \sum_{i=1}^N X_i \sim \mathrm{CompPois}(\lambda, a, b)
    \end{align*}

    For Poisson count we have \operatorname{E}[N]=\lambda and \operatorname{Var}[N]=\lambda=\operatorname{E}[N], for the Gamma amount \operatorname{E}[X]=\frac{a}{b} and \operatorname{Var}[X]=\frac{a}{b^2}=\frac{1}{a}\operatorname{E}[X]^2. For the compound Poisson-Gamma variable, we obtain

    \begin{align*}
    \operatorname{E}[Y] &= \operatorname{E}[N] \operatorname{E}[X] = \lambda\frac{a}{b}=\mu\\
    \operatorname{Var}[Y] &=  \operatorname{Var}[N] \operatorname{E}[X]^2 +  \operatorname{E}[N] \operatorname{Var}[X] =  \phi \mu^p\\
    p &= \frac{a + 2}{a+1} \in (1, 2)\\
    \phi &= \frac{(\lambda a)^{1-p}}{(p-1)b^{2-p}}
    \end{align*}

    What’s so special here is that there is a point mass at zero, i.e., P(Y=0)=\exp(-\frac{\mu^{2-p}}{\phi(2-p)}) > 0. Hence, it is a suitable distribution for non-negative quantities with some exact zeros.

    Probability density for compound Poisson Gamma, point masses at zero are marked as points.
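
    Before turning to the density, a quick numerical check of the moment formulas and the point mass via simulation (my own sketch, not from the original post):

    import numpy as np
    
    rng = np.random.default_rng(0)
    lam, a, b = 2.0, 3.0, 1.0  # Poisson rate, Gamma shape and rate
    n_sim = 1_000_000
    
    # Y = sum_{i=1}^N X_i with N ~ Poisson(lam) and X_i ~ Gamma(a, b)
    N = rng.poisson(lam, size=n_sim)
    Y = np.zeros(n_sim)
    Y[N > 0] = rng.gamma(shape=a * N[N > 0], scale=1 / b)  # sum of N iid Gammas is Gamma(a * N, b)
    
    # Implied Tweedie parameters
    p = (a + 2) / (a + 1)
    mu = lam * a / b
    phi = (lam * a) ** (1 - p) / ((p - 1) * b ** (2 - p))
    
    print(Y.mean(), mu)                                               # ~6.0
    print(Y.var(), phi * mu**p)                                       # ~24.0
    print((Y == 0).mean(), np.exp(-mu ** (2 - p) / (phi * (2 - p))))  # ~0.135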

    Code

    The rest of this post is about how to compute the density for this parameter range. The easy part is \exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right) which can be directly implemented. The real obstacle is the term c(y, \phi) which is given by

    \begin{align*}
    c(y, \phi) &= \frac{\Phi(-\alpha, 0, t)}{y}
    \\
    \alpha &= \frac{2 - p}{1 - p}
    \\
    t &= \frac{\left(\frac{(p - 1)\phi}{y}\right)^{\alpha}}{(2-p)\phi}
    \end{align*}

    This depends on Wright’s (generalized Bessel) function \Phi(a, b, z) as introduced in a 1933 paper by E. Wright.
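
    Assuming scipy >= 1.7, the density for 1<p<2 can be assembled directly from these formulas, using the wright_bessel function discussed in the next section; a minimal sketch of mine (the point mass at zero is handled separately):

    import numpy as np
    from scipy import integrate
    from scipy.special import wright_bessel
    
    def tweedie_density(y, p, mu, phi):
        """Density of Tw_p(mu, phi) at y > 0 for 1 < p < 2."""
        theta = mu ** (1 - p) / (1 - p)
        kappa = mu ** (2 - p) / (2 - p)
        alpha = (2 - p) / (1 - p)  # negative for 1 < p < 2
        t = ((p - 1) * phi / y) ** alpha / ((2 - p) * phi)
        return wright_bessel(-alpha, 0, t) / y * np.exp((y * theta - kappa) / phi)
    
    p, mu, phi = 1.5, 1.0, 1.0
    p0 = np.exp(-mu ** (2 - p) / (phi * (2 - p)))  # point mass at zero
    
    # Sanity checks: total probability mass and mean
    mass, _ = integrate.quad(tweedie_density, 0, np.inf, args=(p, mu, phi))
    mean, _ = integrate.quad(lambda y: y * tweedie_density(y, p, mu, phi), 0, np.inf)
    print(p0 + mass, mean)  # ~1.0 and ~1.0 = mu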

    Wright’s Generalized Bessel Function

    According to DLMF 10.46, the function is defined as

    \begin{equation*}
    \Phi(a, b, z) = \sum_{k=0}^{\infty} \frac{z^k}{k!\Gamma(ak+b)}, \quad a > -1, b \in R, z \in C
    \end{equation*}

    which converges everywhere because it is an entire function. We will focus on the positive real axis z=x\geq 0 and the range a\geq 0, b\geq 0 (note that a=-\alpha \in (0,\infty) for 1<p<2). For the compound Poisson-Gamma, we even have b=0.
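
    The series definition translates directly into code. As a quick sanity check (my sketch), a truncated sum, with each term evaluated in log space to avoid overflow, already matches scipy.special.wright_bessel for moderate x:

    import numpy as np
    from scipy.special import gammaln, wright_bessel
    
    def wright_series(a, b, x, k_max=200):
        """Truncated Taylor series of Wright's generalized Bessel function."""
        k = np.arange(k_max + 1)
        # terms x**k / (k! * Gamma(a*k + b)), each computed in log space
        log_terms = k * np.log(x) - gammaln(k + 1) - gammaln(a * k + b)
        return np.exp(log_terms).sum()
    
    print(wright_series(1.5, 0.5, 10.0), wright_bessel(1.5, 0.5, 10.0))
    print(wright_series(0.5, 0.0, 5.0), wright_bessel(0.5, 0.0, 5.0))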

    Implementation of such a function as done in scipy.special.wright_bessel, even for the restricted parameter range, poses tremendous challenges. The first one is that it has three parameters, which is quite a lot. Then the series representation above, for instance, can always be used, but depending on the parameters, it will require a huge number of terms, particularly for large x. As each term involves the Gamma function, this becomes expensive very fast. One ends up using different representations and strategies for different parameter regions:

    • Small x: Taylor series according to definition
    • Small a: Taylor series in a=0
    • Large x: Asymptotic series due to Wright (1935)
    • Large a: Taylor series according to definition for a few terms around the approximate maximum term k_{max} due to Dunn & Smyth (2005)
    • General: Integral representation due to Luchko (2008)

    Dunn & Smyth investigated several evaluation strategies for the simpler Tweedie density which amounts to Wright’s functions with b=0, see Dunn & Smyth (2005). Luchko (2008) lists most of the above strategies for the full Wright’s function.

    Note that Dunn & Smyth (2008) provide another strategy to evaluate the Tweedie distribution function by means of the inverse Fourier transform. This does not involve Wright’s function, but also encounters complicated numerical integration of oscillatory functions.

    The Integral Representation

    This brings us deep into complex analysis: We start with Hankel’s contour integral representation of the reciprocal Gamma function.

    \begin{equation*}
    \frac{1}{\Gamma(z)} = \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-z} e^\zeta \; d\zeta
    \end{equation*}

    with the Hankel path Ha^- from negative infinity (A) just below the real axis, counter-clockwise with radius \epsilon>0 around the origin and just above the real axis back to minus infinity (D).

    Hankel contour Ha in the complex plane.

    In principle, one is free to choose any such path with the same start (A) and end point (D) as long as one does not cross the negative real axis. One usually lets AB and CD be infinitesimally close to the negative real line. Very importantly, the radius \epsilon>0 is a free parameter! That is real magic🪄

    By interchanging sum and integral and using the series of the exponential, Wright’s function becomes

    \begin{align*}
    \Phi(a, b, z) &= \sum_{k=0}^{\infty} \frac{z^k}{k!} \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-(ak+b)} e^\zeta \; d\zeta
    \\
    &= \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-b} e^{\zeta + z\zeta^{-a}} \; d\zeta
    \end{align*}

    Now, one needs to do the tedious work and split the integral into the 3 path sections AB, BC, CD. Putting AB and CD together gives an integral over K, the circle BC gives an integral over P:

    \begin{align*}
    \Phi(a, b, x) &= \frac{1}{\pi} \int_{\epsilon}^\infty K(a, b, x, r) \; dr
    \\
     &+ \frac{\epsilon^{1-b}}{\pi} \int_0^\pi P(\epsilon, a, b, x, \varphi) \; d\varphi
    \\
    K(a, b, x, r) &= r^{-b}\exp(-r + x  r^{-a} \cos(\pi a)) 
    \\
    &\quad \sin(x \cdot r^{-a} \sin(\pi a) + \pi b)
    \\
    P(\epsilon, a, b, x, \varphi) &= \exp(\epsilon \cos(\varphi) + x  \epsilon^{-a}\cos(a \varphi))
    \\
    &\quad \cos(\epsilon \sin(\varphi) - x \cdot \epsilon^{-a} \sin(a \varphi) + (1-b) \varphi)
    \end{align*}

    What remains is to carry out the numerical integration, also known as quadrature. While this is an interesting topic in its own right, let’s move to the magic part.
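
    Before that, here is a sketch of mine (not the scipy implementation) that plugs K and P into scipy.integrate.quad for a benign parameter point and a hand-picked \epsilon, and compares against scipy.special.wright_bessel:

    import numpy as np
    from scipy import integrate
    from scipy.special import wright_bessel
    
    def K(r, a, b, x):
        return (
            r ** -b
            * np.exp(-r + x * r ** -a * np.cos(np.pi * a))
            * np.sin(x * r ** -a * np.sin(np.pi * a) + np.pi * b)
        )
    
    def P(phi, eps, a, b, x):
        return np.exp(eps * np.cos(phi) + x * eps ** -a * np.cos(a * phi)) * np.cos(
            eps * np.sin(phi) - x * eps ** -a * np.sin(a * phi) + (1 - b) * phi
        )
    
    def wright_via_integrals(a, b, x, eps):
        int_K, _ = integrate.quad(K, eps, np.inf, args=(a, b, x))
        int_P, _ = integrate.quad(P, 0, np.pi, args=(eps, a, b, x))
        return int_K / np.pi + eps ** (1 - b) * int_P / np.pi
    
    a, b, x = 1.5, 1.0, 2.0
    print(wright_via_integrals(a, b, x, eps=1.0), wright_bessel(a, b, x))
    # For larger x, a naive eps makes the integrands oscillate wildly -- see below.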

    Arc Length Minimization

    If you have come so far and say, wow, puh, uff, crazy, 🤯😱 Just keep on a little bit because here comes the real fun part🤞

    It turns out that most of the time, the integral over P is the most difficult. The worst behaviour an integrand can have is being wildly oscillatory. Here is one of my favorite examples:

    Integrands for a=5, b=1, x=100 and two choices of epsilon.

    With the naive choice of \epsilon=1, both integrands (blue) are—well—crazy. There is basically no chance the most sophisticated quadrature rule will work. And then look at the other choice of \epsilon\approx 4. Both curves seem well behaved (for P, we would need a closer look).

    So the idea is to find a good choice of \epsilon to make P well behaved. Well behaved here means most boring, if possible a straight line. What makes a straight line unique? In flat space, it is the shortest path between two points. Therefore, well behaved integrands have minimal arc length. That is what we want to minimize.

    The arc length S from x=a to x=b of a 1-dimensional function f is given by

    \begin{equation*}
    S = \int_a^b \sqrt{1 + f^\prime(x)^2} \; dx
    \end{equation*}

    Instead of f=P, we only take the oscillatory part of P and approximate the arc length using f(\varphi) = \epsilon \sin(\varphi) - x \epsilon^{-a} \sin(a \varphi) + (1-b) \varphi. For a single parameter point a, b, x this looks like

    Arc length and integrand P for different epsilon, given a=0.1, b=5, x=100.

    Note the logarithmic y-scale for the right plot of P. The optimal \epsilon=10 is plotted in red and behaves clearly better than smaller values of \epsilon.
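
    In code, the \epsilon selection for a single parameter point looks roughly like this (a sketch of mine): compute the arc length of the oscillatory part numerically and minimize it over \epsilon with scipy.

    import numpy as np
    from scipy import integrate, optimize
    
    def arc_length(eps, a, b, x):
        """Arc length of f(phi) = eps*sin(phi) - x*eps**(-a)*sin(a*phi) + (1 - b)*phi on [0, pi]."""
        def df(phi):  # derivative of f with respect to phi
            return eps * np.cos(phi) - a * x * eps ** -a * np.cos(a * phi) + (1 - b)
    
        val, _ = integrate.quad(lambda phi: np.sqrt(1 + df(phi) ** 2), 0, np.pi)
        return val
    
    a, b, x = 5.0, 1.0, 100.0  # the parameter point of the example figure above
    res = optimize.minimize_scalar(
        arc_length, bounds=(0.5, 20), args=(a, b, x), method="bounded"
    )
    print(res.x)  # data-driven choice of epsilon for this parameter point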

    What remains to be done for an actual implementation is

    • Calculate the minimal \epsilon for a large grid of values a, b, x.
    • Choose a parametrized function \epsilon(a, b, x).
    • Curve fitting (so again optimisation): Fit this function to the minimal \epsilon of the grid via minimising least squares.
    • Implement some quadrature rules and use this choice of \epsilon in the hope that it interpolates and extrapolates well.

    This strategy turns out to work well in practice and is implemented in scipy. As the parameter space of 3 variables is huge, the integral representation breaks down in certain areas, e.g. huge values of \epsilon where the integrands just overflow numerically (in 64-bit floating point precision). But we have our other evaluation strategies for that.

    Conclusion

    An extensive notebook for Wright’s function, with all implementation strategies can be found here.

    After an adventurous journey, we arrived at one implementation strategy of Wright’s generalised Bessel function, namely the integral representation. The path went deep into complex analysis and contour integration, then further to the arc length of a function and finally curve fitting via optimisation. I am really astonished how connected all those different areas of mathematics can be.

    Wright’s function is the missing piece to compute full likelihoods and probability functions of the Tweedie distribution family and is now available in the Python ecosystem via scipy.

    We are at the very end of this Tweedie trilogy. I hope it has been entertaining and it has become clear why Tweedie deserves to be celebrated.

    Further references:

  • A Tweedie Trilogy — Part II: Offsets

    TLDR: This second part of the trilogy will have a deeper look at offsets and sample weights of a GLM. Their non-equivalence stems from the mean-variance relationship. This time, we not only have a Poisson frequency but also a Gamma severity model.

    This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

    See part I.

    From Mean-Variance Relation to Score Equations

    In part I, we already introduced the mean-variance relation of a Tweedie random variable Y\sim Tw_p(\mu, \phi) with Tweedie power p, mean \mu and dispersion parameter \phi:

    \begin{align*}
    \operatorname{E}[Y] &= \mu
    \\
    \operatorname{Var}[Y] &= \phi \mu^p = \phi v(\mu)
    \end{align*}

    with variance function v(\mu).

    This variance function directly impacts the estimation of GLMs. Assume the task is to estimate the expectation of a random variable Y_i\sim Tw_p(\mu_i, \phi/w_i), given observations of the target y_i and of explanatory variables, aka features, x_i\in R^k. A GLM then assumes a link function g(\mu_i) = \sum_{j=1}^k x_{ij}\beta_j with coefficients \beta to be estimated via an optimization procedure, of which the first order condition, also called score equation, reads

    \begin{equation*}
    \sum_i w_i \frac{y_i - \mu_i}{v(\mu_i)g'(\mu_i)} x_{ij}= 0 \quad \forall j =1, \ldots, k
    \end{equation*}

    This shows that the higher the Tweedie power p, entering via v(\mu) only, the less weight is given to deviations of large values. In other words, higher Tweedie powers result in GLMs that are less and less sensitive to what happens at large (expected) values.

    This is also reflected in the deviance loss functions. They can be derived from the negative log-likelihood and are given by

    \begin{equation*}
    d_p(y, \mu) =
    	2 \cdot
    	\begin{cases}
    		\frac{\max(0, y^{2-p})}{(1-p)(2-p)}-\frac{y\mu^{1-p}}{1-p}+\frac{\mu^{2-p}}{2-p} & p \in \mathrm{R}\setminus \left((0,1] \cup \{2\}\right) \\
    		y\log\frac{y}{\mu} - y + \mu & p=1 \\
    		\frac{y}{\mu} - \log\frac{y}{\mu} - 1 & p=2
    	\end{cases}
    \end{equation*}

    These are the only strictly consistent scoring functions for the expectation (up to one multiplicative and one additive constant) that are homogeneous functions (of degree 2-p), see, e.g., Fissler et al (2022). The Poisson deviance (p=1), for example, has a degree of homogeneity of 1 and the same unit as the target variable. The Gamma deviance (p=2), on the other hand, is zero-homogeneous and is completely agnostic to the scale of its arguments. This is another way of stating the above: the higher the Tweedie power the less it cares about large values.
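
    This degree of homogeneity is easy to check numerically with scikit-learn’s Tweedie deviance (a quick sketch of mine):

    import numpy as np
    from sklearn.metrics import mean_tweedie_deviance
    
    rng = np.random.default_rng(0)
    y = rng.gamma(2.0, 1.0, size=1000)         # positive observations
    mu = y * rng.uniform(0.8, 1.2, size=1000)  # some predictions
    c = 10.0                                   # scale factor
    
    for p in [0, 1, 1.5, 2, 3]:
        d = mean_tweedie_deviance(y, mu, power=p)
        d_scaled = mean_tweedie_deviance(c * y, c * mu, power=p)
        # homogeneity of degree 2 - p: d_p(c*y, c*mu) = c**(2 - p) * d_p(y, mu)
        print(p, np.isclose(d_scaled, c ** (2 - p) * d))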

    It is also connected to the fact that Tweedie distributions are the only distributions from the exponential dispersion family that are closed under scale transformations:

    \begin{align*}
    Y &\sim Tw_p(\mu, \phi) \\
    cY &\sim Tw_p(c\mu, c^{2-p}\phi) \quad \forall c>0
    \end{align*}

    Offsets and Sample Weights

    Poisson GLM

    When estimating counts with a Poisson GLM, there is often an exposure measure like time under consideration or underlying number of things (insurance policies, trees in a forest, radioactive atoms). One then often finds two different, but equivalent formulations of a Poisson GLM with log-link.

    • Sample weights: Model frequency y=\frac{N}{w} and fit with sample weights w to estimate \operatorname{E}[y] = \mu_y = \exp(x \beta).
    • Offsets: Model counts N, but account for the exposure w via an offset as \operatorname{E}[N]=\mu_N = \exp(x \beta + \log(w)) = w \mu_y.

    Note that each way models a different target, so we had to use subscripts to distinguish the mean parameters \mu.

    In this special case of a Poisson GLM with (canonical) log-link, both models are equivalent and will result in the exact same parameters \beta. You can plug it into the score equation to convince yourself.

    Tweedie GLM

    Very importantly, this simple equivalence of GLM formulations with offsets and with sample weights only holds for the Poisson GLM with log-link. It does not hold for any other Tweedie parameter or even other distributions from the exponential dispersion family.

    One can show that a Tweedie GLM with log-link and offset (additive in link space) \log(u) on target y with weights w is equivalent to the same Tweedie GLM but with target \frac{y}{u} and weights w u^{2-p}.

    So one can construct an equivalence between unweighted with offsets and weighted without offsets by setting u = \sqrt[2-p]{w}. But note that this does not work for a Gamma GLM, which has p=2 (so that 2-p=0).
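
    To see where the u^{2-p} comes from, write m_i = \exp\left(\sum_j x_{ij}\beta_j\right) so that the offset model has \mu_i = u_i m_i, and plug the log-link (g'(\mu) = 1/\mu) and v(\mu) = \mu^p into the score equation from above (a short derivation sketch):

    \begin{align*}
    \sum_i w_i \frac{y_i - u_i m_i}{(u_i m_i)^{p-1}} x_{ij}
    &= \sum_i w_i u_i^{2-p}\, m_i^{1-p} \left(\frac{y_i}{u_i} - m_i\right) x_{ij} = 0 \quad \forall j
    \end{align*}

    This is exactly the score equation of an offset-free GLM with target \frac{y_i}{u_i}, sample weights w_i u_i^{2-p}, and mean m_i.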

    Example

    We continue with the same dataset and model as in part I and show this (non-) equivalence with the offsets.

    from glum import GeneralizedLinearRegressor
    import numpy as np
    import pandas as pd
    
    # ... quite some code ... here we abbreviate.
    # Model frequency with weights (but without offsets)
    y_freq = df["ClaimNb"] / df["Exposure"]
    w_freq = df["Exposure"]
    X = df[x_vars]
    glm_params = {
        "alpha": 0,
        "drop_first": True,
        "gradient_tol": 1e-8,
    }
    glm_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X, y_freq, sample_weight=w_freq)
    
    # Model counts N = w * freq with offsets (but without weights)
    N = w_freq * y_freq
    glm_offset_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X, N, offset=np.log(w_freq))
    
    print(
        f"intercept freq{'':<8}= {glm_freq.intercept_}\n"
        f"intercept freq offset = {glm_offset_freq.intercept_}"
    )
    # intercept freq        = -3.756437676421677
    # intercept freq offset = -3.7564376764216725
    
    np.max(np.abs(glm_freq.coef_ - glm_offset_freq.coef_)) < 1e-13
    # True

    Next, we model the severity Y = \frac{\mathrm{loss}}{N} with claim counts N as weights. As is standard, we use a Gamma GLM with log-link (which is not canonical this time).

    # Model severity with weights (but without offsets)
    y_sev = (df["ClaimAmount"] / df["ClaimNb"])
    w_sev = df["ClaimNb"].fillna(0)
    X = df[x_vars]
    # Filter out zero count (w_sev==0) rows
    w_gt_0 = w_sev > 0
    y_sev = y_sev[w_gt_0]
    X_sev = X[w_gt_0]
    w_sev = w_sev[w_gt_0]
    
    glm_sev = GeneralizedLinearRegressor(
        family="gamma", **glm_params
    ).fit(X_sev, y_sev, sample_weight=w_sev)
    
    # Note that the target is claim amount = w * sev.
    claim_amount = w_sev * y_sev
    glm_offset_sev = GeneralizedLinearRegressor(
        family="gamma", **glm_params
    ).fit(X_sev, claim_amount, offset=np.log(w_sev))
    
    print(
        f"intercept sev{'':<8}= {glm_sev.intercept_}\n"
        f"intercept sev offset = {glm_offset_sev.intercept_}"
    )
    # intercept sev        = 7.287909799461992
    # intercept sev offset = 7.236827150674156
    
    np.max(np.abs(glm_sev.coef_ - glm_offset_sev.coef_))
    # 0.2119162919285421

    The deviations might seem small, but they are there and add up:

    print(
        "Total predicted claim amounts with weights "
        f"{np.sum(w_sev * glm_sev.predict(X_sev)):_.2f}"
    )
    print(
        "Total predicted claim amounts offset       "
        f"{np.sum(glm_offset_sev.predict(X_sev, offset=np.log(w_sev))):_.2f}"
    )
    # Total predicted claim amounts with weights 49_309_687.30
    # Total predicted claim amounts offset       48_769_342.47

    Here, it becomes evident that the two models are quite different.

    Outlook

    The full notebook can be found here.

    The final part III of the Tweedie trilogy will follow in one week and go into details of the probability density function.

  • A Tweedie Trilogy — Part I: Frequency and Aggregation Invariance

    TLDR: In this first part of the Tweedie Trilogy, we will take a look at what happens to a GLM if we aggregate the data by a group-by operation. A frequency model for insurance pricing will serve as an example.

    This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

    Intro

    Tweedie distributions and Generalised Linear Models (GLM) have an intertwined relationship. While GLMs are, in my view, one of the best reference models for estimating expectations, Tweedie distributions lie at the heart of expectation estimation. In fact, basically all applied GLMs in practice use Tweedie distributions with three notable exceptions: the binomial, the multinomial and the negative binomial distribution.

    Mean-Variance Relation

    “An index which distinguishes between some important exponential families” is the original publication title of Maurice Charles Kenneth Tweedie in 1984—but note that Shaul K. Bar-Lev and Peter Enis published around the same time; as their 1986 paper was received in November 1983, the distribution could also be named Bar-Lev & Enis distribution.1 This index is nowadays called the Tweedie power parameter p. Recall that distributions of the exponential dispersion family always fulfil a mean-variance relationship. It’s even a way to define them. For the Tweedie distribution, denoted Tw_p(\mu, \phi), the relation reads

    \begin{align*}
    \operatorname{E}[Y] &= \mu
    \\
    \operatorname{Var}[Y] &= \phi \mu^p
    \end{align*}

    with dispersion parameter \phi. Some very common members are given in the following table.

    p        distribution              domain Y                    domain \mu
    0        Normal / Gaussian         \mathrm{R}                  \mathrm{R}
    1        Poisson                   0, 1, 2, \ldots             \mathrm{R}_+
    (1,2)    Compound Poisson-Gamma    \mathrm{R}_+ \cup \{0\}     \mathrm{R}_+
    2        Gamma                     \mathrm{R}_+                \mathrm{R}_+
    3        inverse Gaussian          \mathrm{R}_+                \mathrm{R}_+

    Insurance Pricing Models

    In non-life insurance pricing, most claims happen somewhat randomly, typically the occurrence as well as the size. Take the theft of your bike or a water damage of your basement due to flooding as an example. Pricing actuaries usually want to predict the expected loss E[Y|X] given some features X of a policy. The set of features could contain the purchasing price of your bike or the proximity of your house to a river.

    Instead of directly modelling the expected loss per exposure w, e.g. the time duration of the insurance contract, the most used approach is the famous frequency-severity split:

    \begin{align*}
    \operatorname{E}\left[\frac{Y}{w}\right] = \underbrace{\operatorname{E}\left[\frac{N}{w}\right]}_{frequency} \cdot
    \underbrace{\operatorname{E}\left[\left. \frac{Y}{n}\right| N=n\right]}_{severity}
    \end{align*}

    For simplicity, the conditioning on X is suppressed; it would occur in every expectation. The first part \operatorname{E}\left[\frac{N}{w}\right] is the (expected) frequency, i.e. the number of claims per exposure (time). The second term \operatorname{E}\left[\left.\frac{Y}{N}\right| N\right] is the (expected) severity, i.e. the average claim size (per claim) given a fixed number of claims. Here, we focus on the frequency part.

    Convolution and Aggregation Invariance

    This property might first seem very theoretical, but it may be one of the most important properties for the estimation of expectations E[Y|X] with GLMs. It is in fact a property valid for the whole exponential dispersion family: The weighted mean of i.i.d. random variables has (almost) the same distribution!

    If

    \begin{align*}
    Y_i &\overset{i.i.d}{\sim} \mathrm{Tw}_p(\mu, \phi/w_i) \,,
    \\
    w_+ &= \sum_i w_i \quad\text{with } w_i >0 \,,
    \end{align*}

    then

    \begin{align*}
    Y &=\sum_{i=1}^n \frac{w_i Y_i}{w_+} \sim \mathrm{Tw}_p(\mu, \phi/w_+) \,.
    \end{align*}

    It is obvious that the mean of Y is again \mu. But it is remarkable that it has the same distribution with the same power parameter; only the second argument with the dispersion parameter differs. The dispersion parameter, however, cancels out in GLM estimations. In fact, we will show that two GLMs, one on the original and one on the aggregated data, give identical results. Another way of saying the same in statistical terms is that (weighted) averages are the sufficient statistic for the expectation within the exponential dispersion family.

    This is quite an essential property for data aggregation. It means that one can aggregate rows with identical features and still do an analysis (of the conditional expectation) without loss of information.

    The weighted average above can be written a bit more intuitively. For instance, a frequency Y_i=\frac{N_i}{w_i} has weighted average Y=\frac{\sum_i N_i}{\sum_i w_i}.

    Poisson Distribution

    When modelling counts, the Poisson distribution is by far the easiest distribution one can think of. It only has a single parameter, is a member of the Tweedie family, and fulfils the mean-variance relation

    \begin{equation*}
    \operatorname{E}[N] = \mu = \operatorname{Var}[N] \,.
    \end{equation*}

    In particular, p=1. While the distribution is, strictly speaking, only for counts, i.e. N takes on non-negative integer values, Poisson regression also works for any non-negative response variable like N/w \in \mathrm{R}_+ \cup \{0\}.

    Frequency Example

    For demonstration, we fit a Poisson GLM on the French motor third-party liability claims dataset, cf. the corresponding scikit-learn example and the case study 1 of the Swiss Association of Actuaries on the same dataset.

    from glum import GeneralizedLinearRegressor
    import numpy as np
    import pandas as pd
    
    # ... quite some code ... here we abbreviate.
    y_freq = df["ClaimNb"] / df["Exposure"]
    w_freq = df["Exposure"]
    X = df[x_vars]
    glm_params = {
        "alpha": 0,
        "drop_first": True,
        "gradient_tol": 1e-8,
    }
    glm_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X, y_freq, sample_weight=w_freq)
    print(
      f"Total predicted number of claims = "
      f"{(w_freq * glm_freq.predict(X)).sum():_.2f}"
    )
    # Total predicted number of claims = 26_444.00
    
    # Now aggregated
    df_agg = df.groupby(x_vars, observed=True).sum().reset_index()
    print(
        f"Aggregation reduced number of rows from {df.shape[0]:_}"
        f"to {df_agg.shape[0]:_}."
    )
    # Aggregation reduced number of rows from 678_013 to 133_413.
    y_agg_freq = df_agg["ClaimNb"] / df_agg["Exposure"]
    w_agg_freq = df_agg["Exposure"]
    X_agg = df_agg[x_vars]
    glm_agg_freq = GeneralizedLinearRegressor(
        family="poisson", **glm_params
    ).fit(X_agg, y_agg_freq, sample_weight=w_agg_freq)
    print(
        f"Total predicted number of claims = "
        f"{(w_agg_freq * glm_agg_freq.predict(X_agg)).sum():_.2f}"
    )
    # Total predicted number of claims = 26_444.00

    In fact, both models have the same intercept term and same coefficients, they are really identical models (up to numerical precision):

    print(
        f"intercept freq{'':<18}= {glm_freq.intercept_}\n"
        f"intercept freq aggregated model = {glm_agg_freq.intercept_}"
    )
    # intercept freq                  = -3.7564376764216747
    # intercept freq aggregated model = -3.7564376764216747
    
    np.max(np.abs(glm_freq.coef_ - glm_agg_freq.coef_)) < 1e-13
    # True

    Outlook

    The full notebook can be found here.

    Next week, part II of this trilogy will follow. There, we will meet some more of the Tweedie family’s quite remarkable properties.

    Further references:

    • Tweedie M.C.K. 1984. “An index which distinguishes between some important exponential families”. Statistics: Applications and New Directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference, Indian Statistical Institute, Calcutta, pp. 579–604.
    • Bar-Lev, S.K., Enis, P. Reproducibility in the one-parameter exponential family. Metrika 32, 391–394 (1985). https://doi.org/10.1007/BF01897827
    • Shaul K. Bar-Lev. Peter Enis. “Reproducibility and Natural Exponential Families with Power Variance Functions.” Ann. Statist. 14 (4) 1507 – 1522, December, 1986. https://doi.org/10.1214/aos/1176350173
    1. A great thanks to Prof. Mario Wüthrich for pointing out the references of Bar-Lev and Enis. ↩︎
  • Building Strong GLMs in Python via ML + XAI

    In our latest post, we explained how to use ML + XAI to build strong generalized linear models with R. Let’s do the same with Python.

    Insurance pricing data

    We will again use a synthetic dataset with 1 million insurance policies, with reference:

    Mayer, M., Meier, D. and Wuthrich, M.V. (2023),
    SHAP for Actuaries: Explain any Model.
    https://doi.org/10.2139/ssrn.4389797

    Let’s start by loading and describing the data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import shap
    from sklearn.datasets import fetch_openml
    from sklearn.inspection import PartialDependenceDisplay
    from sklearn.metrics import mean_poisson_deviance
    from sklearn.dummy import DummyRegressor
    from lightgbm import LGBMRegressor
    # We need preview version of glum that adds formulaic API
    # !pip install git+https://github.com/Quantco/glum@glum-v3#egg=glum
    from glum import GeneralizedLinearRegressor
    
    # Load data
    
    df = fetch_openml(data_id=45106, parser="pandas").frame
    df.head()
    
    # Continuous features
    df.hist(["driver_age", "car_weight", "car_power", "car_age"])
    _ = plt.suptitle("Histograms of continuous features", fontsize=15)
    
    # Response and discrete features
    fig, axes = plt.subplots(figsize=(8, 3), ncols=3)
    
    for v, ax in zip(["claim_nb", "year", "town"], axes):
        df[v].value_counts(sort=False).sort_index().plot(kind="bar", ax=ax, rot=0, title=v)
    plt.suptitle("Barplots of response and discrete features", fontsize=15)
    plt.tight_layout()
    plt.show()
    
    # Rank correlations
    corr = df.corr("spearman")
    mask = np.triu(np.ones_like(corr, dtype=bool))
    plt.suptitle("Rank-correlogram", fontsize=15)
    _ = sns.heatmap(
        corr, mask=mask, vmin=-0.7, vmax=0.7, center=0, cmap="vlag", square=True
    )
    
    

    Modeling

    1. We fit a tuned Boosted Trees model to model log(E(claim count)) via Poisson deviance loss.
    2. And perform a SHAP analysis to derive insights.
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("claim_nb", axis=1), df["claim_nb"], test_size=0.1, random_state=30
    )
    
    # Tuning step not shown. Number of boosting rounds found via early stopping on CV performance
    params = dict(
        learning_rate=0.05,
        objective="poisson",
        num_leaves=7,
        min_child_samples=50,
        min_child_weight=0.001,
        colsample_bynode=0.8,
        subsample=0.8,
        reg_alpha=3,
        reg_lambda=5,
        verbose=-1,
    )
    
    model_lgb = LGBMRegressor(n_estimators=360, **params)
    model_lgb.fit(X_train, y_train)
    
    # SHAP analysis
    X_explain = X_train.sample(n=2000, random_state=937)
    explainer = shap.Explainer(model_lgb)
    shap_val = explainer(X_explain)
    
    plt.suptitle("SHAP importance", fontsize=15)
    shap.plots.bar(shap_val)
    
    for s in [shap_val[:, 0:3], shap_val[:, 3:]]:
        shap.plots.scatter(s, color=shap_val, ymin=-0.5, ymax=1)

    Here, we would come to the following conclusions:

    1. car_weight and year might be dropped, depending on the specific aim of the model.
    2. Add a regression spline for driver_age.
    3. Add an interaction between car_power and town.

    Build strong GLM

    Let’s build a GLM with these insights. Two important things:

    1. Glum is an extremely powerful GLM implementation that was inspired by a pull request of our Christian Lorentzen.
    2. In the upcoming version 3.0, it adds a formula API based on formulaic, a very performant formula parser. This gives a very easy way to add interaction effects, regression splines, dummy encodings etc.
    model_glm = GeneralizedLinearRegressor(
        family="poisson",
        l1_ratio=1.0,
        alpha=1e-10,
        formula="car_power * C(town) + bs(driver_age, 7) + car_age",
    )
    model_glm.fit(X_train, y=y_train)  # 1 second on old laptop
    
    # PDPs of both models
    fig, ax = plt.subplots(2, 2, figsize=(7, 5))
    cols = ("tab:blue", "tab:orange")
    for color, name, model in zip(cols, ("GLM", "LGB"), (model_glm, model_lgb)):
        disp = PartialDependenceDisplay.from_estimator(
            model,
            features=["driver_age", "car_age", "car_power", "town"],
            X=X_explain,
            ax=ax if name == "GLM" else disp.axes_,
            line_kw={"label": name, "color": color},
        )
    fig.suptitle("PDPs of both models", fontsize=15)
    fig.tight_layout()
    
    # Stratified PDP of car_power
    for color, town in zip(("tab:blue", "tab:orange"), (0, 1)):
        mask = X_explain.town == town
        disp = PartialDependenceDisplay.from_estimator(
            model_glm,
            features=["car_power"],
            X=X_explain[mask],
            ax=None if town == 0 else disp.axes_,
            line_kw={"label": town, "color": color},
        )
    plt.suptitle("PDP of car_power stratified by town (0 vs 1)", fontsize=15)
    _ = plt.ylim(0, 0.2)

    In this relatively simple situation, the mean Poisson deviances of our models are very similar now:

    model_dummy = DummyRegressor().fit(X_train, y=y_train)
    deviance_null = mean_poisson_deviance(y_test, model_dummy.predict(X_test)) 
    
    dev_imp = []
    for name, model in zip(("GLM", "LGB", "Null"), (model_glm, model_lgb, model_dummy)):
        dev_imp.append((name, mean_poisson_deviance(y_test, model.predict(X_test))))
    pd.DataFrame(dev_imp, columns=["Model", "Mean_Poisson_Deviance"])

    Final words

    • Glum is an extremely powerful GLM implementation – we have only scratched its surface. You can expect more blogposts on Glum…
    • Having a formula interface is especially useful for adding interactions. Fingers crossed that the upcoming version 3.0 will soon be released.
    • Building GLMs via ML + XAI is so smooth, especially when you work with large data. For small data, you need to be careful to not add hidden overfitting to the model.

    Click here for the full Python notebook

  • An Open Source Journey with Scikit-Learn

    In this post, I’d like to tell the story of my journey into the open source world of Python with a focus on scikit-learn. My hope is that it encourages others to start or to keep contributing and have endurance for bigger picture changes.

    How it all started

    Back in 2015/2016, I was working as a non-life pricing actuary. The standard vendor desktop applications we used for generalized linear models (GLM) suffered from system discontinuities, manual error-prone steps, and a lack of modern machine learning capabilities (not even out-of-sample model comparison).

    Python was then on the rise for data science. Numpy, scipy and pandas had laid the foundations, then came deep learning alias neural net frameworks leading to tensorflow and pytorch. XGBoost was also a game changer visible in Kaggle competition leaderboards. All those projects came as open source with thriving communities and possibilities to contribute.

    While the R base package always comes with splendid dataframes (I guess they invented it) and battle-proven GLMs out of the box, the Python side for GLMs was not that well developed. So I started with GLMs in statsmodels and generalized linear mixed models (a.k.a. hierarchical or multilevel models) in pymc (then called pymc3). My first open source contributions in the Python world were small issues in statsmodels and a little later the bug report pymc#2640 about memory alignment issues which was caused by joblib#563.

    To my great surprise the famous machine learning library scikit-learn did not have GLMs, only penalized linear models and logistic regression, but no Poisson or Gamma GLMs which are essential in non-life insurance pricing. Fortunately, I was not the first one to notice this lack. There was already an open issue scikit-learn#5975 with many people asking for this feature. Just nobody had contributed a pull request (PR) yet.

    That’s when I said to myself: It should not fail just because no one implements it. I really like open source and gained some programming experience during my PhD in particle physics, mainly C++. Eventually, I boldly (because I was still a newbie) opened the PR scikit-learn#9405 in summer 2017.

    Becoming a scikit-learn core developer

    This PR turned out to be essential for the development of GLMs and for becoming a scikit-learn core developer. I dare say that I almost got crazy trying to convince the core developers that GLMs are really that useful for supervised machine learning and that GLMs should land in scikit-learn. In retrospect, this was the hardest part and it took me almost 2 years of patience and repeating my arguments; some example comments are given below:

    comment example 1

    comment example 2

    comment example 3

    As I wanted to demonstrate the full utility of GLMs, this PR had become much too large for review and inclusion: +4000 lines of code with several solvers, penalty matrices, 3 examples, a lot of documentation and good test coverage (and a lot of things I would do differently today).

    The conclusion was to carve out a minimal GLM implementation using the L-BFGS solver of scipy. This way, I met Roman Yurchak, with whom it was a pleasure to work. It took a little 🇨🇭Swiss chocolate🍫 incentive to finally get scikit-learn#14300 (still +2900 loc) reviewed and merged in spring 2020. Almost 3 years after opening my original PR, it was released in scikit-learn version 0.23!

    I guess it was mainly this work and perseverance around GLMs that caught the attention of the core developers and that motivated them to vote for me: In summer 2020, I was invited to become a scikit-learn core developer and gladly accepted.

    Summary as core developer

    Further directions

    My work on GLMs was easily extensible to other estimators in the form of loss functions. Again, to my surprise, loss functions, a core element for supervised learning, were re-implemented again and again within scikit-learn. So, based on Roman’s idea in #15123, I started a project to unify them, and by unifying also extending several tree estimator classes with poisson and gamma losses (and making existing ones more stable and faster).

    As loss functions are such important core components, they have basically 2 major requirements: be numerically stable and fast. That’s why I went with Cython (preferred way for fast code in scikit-learn) in scikit-learn#20567 and guess which loop it closed? Again, I met segfault errors caused by joblib#563. This time, it motivated another core developer to quite an investment in fixing it in joblib#1254.

    Another story branch is the dedicated GLM Python library glum. The authors took my original way too long GLM PR as a starting point and developed one of the most feature rich and fastest GLM implementations out there. This is almost like a dream come true.

    A summary of my contributions over those 3 intensive years as scikit-learn core developer is best given in several categories.

    Pull requests

    A summary of my contributions in terms of code may be:

    • Unified loss module, unified naming of losses, poisson and gamma losses for GLMs and decision tree based models
    • LinearModelLoss and NewtonSolver (newton-cholesky) for GLMs like LogisticRegression and PoissonRegressor as well as further solver improvements
    • QuantileRegressor (linear quantile regression) and quantile/pinball loss for HistGradientBoostingRegressor (HGBT). BTW, linear quantile regression is much harder than GLM solvers!
    • SplineTransformer
    • Interaction constraints and feature subsampling for HGBT

    Reviewing and steering

    Among the biggest changes in newer scikit-learn history are two scikit-learn enhancement proposals (SLEP)

    For both, I did one of the 2 obligatory reviews. Then, maybe the technically most challenging review I can remember was on:

    Keep in mind that review(er)s are by far the scarcest resource of scikit-learn.

    I also like to mention PR#25753 which changed the governance to be more inclusive, in particular with respect to voting rights.

    Lessons learned

    Just before the end, a few critical words must be allowed.

    • Scikit-learn is focused a lot on stability. For some items of my wish list to land in scikit-learn, it would have again taken years. This time, I decided to release my own library model-diagnostics and I enjoy the freedom to use cutting edge components like polars.
    • As a part-time statistician, I consider certain design choices a bit poor, e.g. that a classifier’s predict implicitly uses a 50% threshold instead of returning a predicted probability (which is what predict_proba does); see the small illustration after this list. Hard to change!!! At least, PR#26120 might improve that to some extent.
    • I ponder a lot on the pipeline concept. At first, it was an eye-opener for me to think of feature preprocessing as part of the estimator. The scikit-learn API is built around the pipeline design with fit, transform and predict. But modern model classes like gradient boosted trees (XGBoost, LightGBM, HGBT) don’t need a preprocessing pipeline anymore, e.g., they can natively deal with categorical features and missing values. Yet it is hard to pass the information which feature to treat as categorical through a pipeline, see scikit-learn#18894.
    • It is still a very painful experience to specify design matrices of linear models, in particular interaction terms, see scikit-learn#15263, #19533 and #25412. Doing that in a pipeline with a ColumnTransformer is just very complicated and prohibits a lot of optimizations (mostly for categoricals), which is one of the reasons glum is faster.
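
    To illustrate the second point above, here is a minimal sketch with made-up data (not from the post): for a binary classifier, predict() is just predict_proba() thresholded at 50%.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    
    # Hypothetical toy data, only for illustration
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = rng.integers(0, 2, size=200)
    
    clf = LogisticRegression().fit(X, y)
    proba = clf.predict_proba(X)[:, 1]
    # predict() agrees with thresholding the predicted probability at 0.5
    print(np.array_equal(clf.predict(X), (proba > 0.5).astype(int)))  # True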

    One of the greatest rewards of this journey was that I learned a lot, about Python, machine learning, rigorous reviews, CI/CD, open source communities, endurance. But even more so, I had the pleasure to meet and work with some kind, brilliant and gifted people like Roman Yurchak, Alexandre Gramfort, Olivier Grisel, Thomas Fan, Nicolas Hug, Adrin Jalali and many more. I am really grateful to be a part of something bigger than its parts.

  • Model Diagnostics in Python

    🚀Version 1.0.0 of the new Python package for model-diagnostics was just released on PyPI. If you use (machine learning or statistical or other) models to predict a mean, median, quantile or expectile, this library offers tools to assess the calibration of your models and to compare and decompose predictive model performance scores.🚀

    pip install model-diagnostics

    After having finished our paper (or better: user guide) “Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice” last year, I realised that there is no Python package that supports the proposed diagnostic tools (which are not completely new). Most of the required building blocks are there, but putting them together to get a result quickly amounts to a lot of code. Therefore, I decided to publish a new package.

    By the way, I really never wanted to write a plotting library. But it turned out that arranging results until they are ready to be visualised amounts to quite a large part of the source code. I hope this was worth the effort. Your feedback is very welcome, either here in the comments or as feature request or bug report under https://github.com/lorentzenchr/model-diagnostics/issues.

    For a jump start, I recommend going directly to the two examples:

    To give a glimpse of the functionality, here are some short code snippets.

    from model_diagnostics.calibration import compute_bias
    from model_diagnostics.calibration import plot_reliability_diagram
    
    
    y_obs = list(range(10))
    y_pred = [2, 1, 3, 3, 6, 8, 5, 5, 8, 9.]
    plot_reliability_diagram(
        y_obs=y_obs,
        y_pred=y_pred,
        n_bootstrap=1000,
        confidence_level=0.9,
    )
    compute_bias(y_obs=y_obs, y_pred=y_pred)
    bias_mean  bias_count  bias_weights  bias_stderr  p_value
    f64        u32         f64           f64          f64
    0.5        10          10.0          0.477261     0.322121
    from model_diagnostics.scoring import SquaredError, decompose
    
    
    decompose(
        y_obs=y_obs,
        y_pred=y_pred,
        scoring_function=SquaredError(),
    )
    miscalibration  discrimination  uncertainty  score
    f64             f64             f64          f64
    1.283333        7.233333        8.25         2.3

    This score decomposition is additive (and unique):

    \begin{equation*}
    \mathrm{score} = \mathrm{miscalibration} - \mathrm{discrimination} + \mathrm{uncertainty}
    \end{equation*}
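
    As a quick sanity check, the numbers of the decomposition table above indeed satisfy this identity:

    # Sanity check with the values from the table above
    miscalibration, discrimination, uncertainty = 1.283333, 7.233333, 8.25
    print(round(miscalibration - discrimination + uncertainty, 6))  # 2.3, the score column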

    As usual, the code snippets are collected in a notebook: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2023-07-16%20model-diagnostics.ipynb.

  • Geographic SHAP

    Lost in Translation between R and Python 10

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    This post is heavily based on the new {shapviz} vignette.

    Setting

    Besides other features, a model with geographic components contains features like

    • latitude and longitude,
    • postal code, and/or
    • other features that depend on location, e.g., distance to next restaurant.

    Like any feature, the effect of a single geographic feature can be described using SHAP dependence plots. However, studying the effect of latitude (or any other location dependent feature) alone is often not very illuminating – simply due to strong interaction effects and correlations with other geographic features.

    That’s where the additivity of SHAP values comes into play: The sum of SHAP values of all geographic components represent the total geographic effect, and this sum can be visualized as a heatmap or 3D scatterplot against latitude/longitude (or any other geographic representation).

    A first example

    For illustration, we will use a beautiful house price dataset containing information on about 14’000 houses sold in 2016 in Miami-Dade County. Some of the columns are as follows:

    • SALE_PRC: Sale price in USD: Its logarithm will be our model response.
    • LATITUDE, LONGITUDE: Coordinates
    • CNTR_DIST: Distance to central business district
    • OCEAN_DIST: Distance (ft) to the ocean
    • RAIL_DIST: Distance (ft) to the next railway track
    • HWY_DIST: Distance (ft) to next highway
    • TOT_LVG_AREA: Living area in square feet
    • LND_SQFOOT: Land area in square feet
    • structure_quality: Measure of building quality (1: worst to 5: best)
    • age: Age of the building in years

    (The coordinates and the distance features are the geographic components.) For more background on this dataset, see Mayer et al. [2].

    We will fit an XGBoost model to explain log(price) as a function of lat/long, size, and quality/age.

    devtools::install_github("ModelOriented/shapviz", dependencies = TRUE)
    library(xgboost)
    library(ggplot2)
    library(shapviz)  # Needs development version 0.9.0 from github
    
    head(miami)
    
    x_coord <- c("LATITUDE", "LONGITUDE")
    x_nongeo <- c("TOT_LVG_AREA", "LND_SQFOOT", "structure_quality", "age")
    x <- c(x_coord, x_nongeo)
    
    # Train/valid split
    set.seed(1)
    ix <- sample(nrow(miami), 0.8 * nrow(miami))
    X_train <- data.matrix(miami[ix, x])
    X_valid <- data.matrix(miami[-ix, x])
    y_train <- log(miami$SALE_PRC[ix])
    y_valid <- log(miami$SALE_PRC[-ix])
    
    # Fit XGBoost model with early stopping
    dtrain <- xgb.DMatrix(X_train, label = y_train)
    dvalid <- xgb.DMatrix(X_valid, label = y_valid)
    
    params <- list(learning_rate = 0.2, objective = "reg:squarederror", max_depth = 5)
    
    fit <- xgb.train(
      params = params, 
      data = dtrain, 
      watchlist = list(valid = dvalid), 
      early_stopping_rounds = 20,
      nrounds = 1000,
      callbacks = list(cb.print.evaluation(period = 100))
    )
    %load_ext lab_black
    
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_openml
    
    df = fetch_openml(data_id=43093, as_frame=True)
    X, y = df.data, np.log(df.target)
    X.head()
    
    # Data split and model
    from sklearn.model_selection import train_test_split
    import xgboost as xgb
    
    x_coord = ["LONGITUDE", "LATITUDE"]
    x_nongeo = ["TOT_LVG_AREA", "LND_SQFOOT", "structure_quality", "age"]
    x = x_coord + x_nongeo
    
    X_train, X_valid, y_train, y_valid = train_test_split(
        X[x], y, test_size=0.2, random_state=30
    )
    
    # Fit XGBoost model with early stopping
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)
    
    params = dict(learning_rate=0.2, objective="reg:squarederror", max_depth=5)
    
    fit = xgb.train(
        params=params,
        dtrain=dtrain,
        evals=[(dvalid, "valid")],
        verbose_eval=100,
        early_stopping_rounds=20,
        num_boost_round=1000,
    )
    

    SHAP dependence plots

    Let’s first study selected SHAP dependence plots, evaluated on the validation dataset with around 2800 observations. Note that we could as well use the training data for this purpose, but it is a bit large.

    sv <- shapviz(fit, X_pred = X_valid)
    sv_dependence(
      sv, 
      v = c("TOT_LVG_AREA", "structure_quality", "LONGITUDE", "LATITUDE"), 
      alpha = 0.2
    )
    import shap
    
    xgb_explainer = shap.Explainer(fit)
    shap_values = xgb_explainer(X_valid)
    
    v = ["TOT_LVG_AREA", "structure_quality", "LONGITUDE", "LATITUDE"]
    shap.plots.scatter(shap_values[:, v], color=shap_values[:, v])
    SHAP dependence plots of selected features (Python output).

    Total coordinate effect

    And now the two-dimensional plot of the sum of SHAP values:

    sv_dependence2D(sv, x = "LONGITUDE", y = "LATITUDE") +
      coord_equal()
    shap_coord = shap_values[:, x_coord]
    plt.scatter(*list(shap_coord.data.T), c=shap_coord.values.sum(axis=1), s=4)
    ax = plt.gca()
    ax.set_aspect("equal", adjustable="box")
    plt.colorbar()
    plt.title("Total location effect")
    plt.show()
    Sum of SHAP values on color scale against coordinates (Python output).

    The last plot gives a good impression of price levels, but note:

    1. Since we have modeled logarithmic prices, the effects are on relative scale (0.1 means about 10% above average).
    2. Due to interaction effects with non-geographic components, the location effects might depend on features like living area. This is not visible in the above plot. We will modify the model now to improve this aspect.

    Two modifications

    We will now change the above model in two ways, not unlike the model in Mayer et al. [2].

    1. We will use additional geographic features like distance to railway track or to the ocean.
    2. We will use interaction constraints to allow only interactions between geographic features.

    The second step leads to a model that is additive in each non-geographic component and also additive in the combined location effect. According to the technical report of Mayer [1], SHAP dependence plots of additive components in a boosted trees model are shifted versions of corresponding partial dependence plots (evaluated at observed values). This allows a “Ceteris Paribus” interpretation of SHAP dependence plots of corresponding components.

    # Extend the feature set
    more_geo <- c("CNTR_DIST", "OCEAN_DIST", "RAIL_DIST", "HWY_DIST")
    x2 <- c(x, more_geo)
    
    X_train2 <- data.matrix(miami[ix, x2])
    X_valid2 <- data.matrix(miami[-ix, x2])
    
    dtrain2 <- xgb.DMatrix(X_train2, label = y_train)
    dvalid2 <- xgb.DMatrix(X_valid2, label = y_valid)
    
    # Build interaction constraint vector
    ic <- c(
      list(which(x2 %in% c(x_coord, more_geo)) - 1),
      as.list(which(x2 %in% x_nongeo) - 1)
    )
    
    # Modify parameters
    params$interaction_constraints <- ic
    
    fit2 <- xgb.train(
      params = params, 
      data = dtrain2, 
      watchlist = list(valid = dvalid2), 
      early_stopping_rounds = 20,
      nrounds = 1000,
      callbacks = list(cb.print.evaluation(period = 100))
    )
    
    # SHAP analysis
    sv2 <- shapviz(fit2, X_pred = X_valid2)
    
    # Two selected features: Thanks to additivity, structure_quality can be read as 
    # Ceteris Paribus
    sv_dependence(sv2, v = c("structure_quality", "LONGITUDE"), alpha = 0.2)
    
    # Total geographic effect (Ceteris Paribus thanks to additivity)
    sv_dependence2D(sv2, x = "LONGITUDE", y = "LATITUDE", add_vars = more_geo) +
      coord_equal()
    # Extend the feature set
    more_geo = ["CNTR_DIST", "OCEAN_DIST", "RAIL_DIST", "HWY_DIST"]
    x2 = x + more_geo
    
    X_train2, X_valid2 = train_test_split(X[x2], test_size=0.2, random_state=30)
    
    dtrain2 = xgb.DMatrix(X_train2, label=y_train)
    dvalid2 = xgb.DMatrix(X_valid2, label=y_valid)
    
    # Build interaction constraint vector
    ic = [x_coord + more_geo, *[[z] for z in x_nongeo]]
    
    # Modify parameters
    params["interaction_constraints"] = ic
    
    fit2 = xgb.train(
        params=params,
        dtrain=dtrain2,
        evals=[(dvalid2, "valid")],
        verbose_eval=100,
        early_stopping_rounds=20,
        num_boost_round=1000,
    )
    
    # SHAP analysis
    xgb_explainer2 = shap.Explainer(fit2)
    shap_values2 = xgb_explainer2(X_valid2)
    
    v = ["structure_quality", "LONGITUDE"]
    shap.plots.scatter(shap_values2[:, v], color=shap_values2[:, v])
    
    # Total location effect
    shap_coord2 = shap_values2[:, x_coord]
    c = shap_values2[:, x_coord + more_geo].values.sum(axis=1)
    plt.scatter(*list(shap_coord2.data.T), c=c, s=4)
    ax = plt.gca()
    ax.set_aspect("equal", adjustable="box")
    plt.colorbar()
    plt.title("Total location effect")
    plt.show()
    SHAP dependence plots of an additive feature (structure quality, no vertical scatter per unique feature value) and one of the geographic features (Python output).
    Sum of all geographic features (color) against coordinates. There are no interactions to non-geographic features, so the effect can be read Ceteris Paribus (Python output).

    Again, the resulting total geographic effect looks reasonable.

    Wrap-Up

    • SHAP values of all geographic components in a model can be summed up and plotted on the color scale against coordinates (or some other geographic representation). This gives a lightning fast impression of the location effects.
    • Interaction constraints between geographic and non-geographic features lead to Ceteris Paribus interpretation of total geographic effects.

    The Python and R notebooks can be found here:

    References

    1. Mayer, Michael. 2022. “SHAP for Additively Modeled Features in a Boosted Trees Model.” https://arxiv.org/abs/2207.14490.
    2. Mayer, Michael, Steven C. Bourassa, Martin Hoesli, and Donato Flavio Scognamiglio. 2022. “Machine Learning Applications to Land and Structure Valuation.” Journal of Risk and Financial Management.

  • Quantiles And Their Estimation

    Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will do my best to provide a good mix of theory and practical examples.

    While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box plots, in children’s weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide Solvency II framework for insurance companies, in quality control, and in many more fields.

    Definitions

    Often, one talks about quantiles, but rarely defines them. In what follows, I borrow from Gneiting (2011).

    Definition 1: Quantile

    Given a cumulative distribution function (CDF) F(x)=\mathbb{P}(X\leq x), the quantile at level \alpha \in (0,1) (ɑ-quantile for short), q_\alpha(F), is defined as

    \begin{equation*}
    q_\alpha(F) = \{x: \lim_{y\uparrow x} F(y)\leq \alpha \leq F(x)\} \,.
    \end{equation*}

    The inequalities of this definition are called coverage conditions. It is very important to note that quantiles are potentially set valued. Another way to write this set is as an interval:

    \begin{align*}
    q_\alpha(F) &\in [q_\alpha^-(F), q_\alpha^+(F)]
    \\
    q_\alpha^-(F) &= \sup\{x:F(x) < \alpha\} = \inf\{x:F(x) \geq \alpha\}
    \\
    q_\alpha^+(F) &= \inf\{x:F(x) > \alpha\} = \sup\{x:F(x) \leq \alpha\}
    \end{align*}

    For q_\alpha^-, we recover the usual quantile definition as the generalized inverse of F. But this is only one possible value. I will discuss examples of quantile intervals later on.

    To get acquainted a bit more, let’s plot the cumulative distribution function and the quantile function for some continuous distributions: Normal, Gamma and Pareto distribution. The parametrisation is such that all have mean 2, Normal and Gamma have variance 4, and Pareto has an infinite variance. For those continuous and strictly increasing distributions, all quantiles are unique, and therefore simplify to the inverse CDF q_\alpha^- in the above equations. Note that those three distributions have very different tail behaviour: The density of the Normal distribution has the famously fast decrease \propto e^{-x^2}, the Gamma density has an exponentially decreasing tail \propto e^{-x} and the Pareto density has a fat tail, i.e. an inverse power \propto \frac{1}{x^\alpha}.

    CDF (top) and quantile function (bottom) of several distributions: Normal N(\mu=2, \sigma^2=2)(left), Gamma Ga(\alpha=2, \beta=\frac{1}{2}) (middle) and Pareto Pa(\alpha=2)(right).

    There are at least two more equivalent ways to define quantiles. They will help us later to get a better visualisations.

    Definition 2: Quantile as minimiser

    Given a probability distribution F, the ɑ-quantile q_\alpha(F) is defined as any minimiser of the expected scoring function S

    \begin{align*}
    q_\alpha(F) &\in \argmin_x \mathbb{E}(S_\alpha(x, Y))
    \\
    S_\alpha(x, y) &= (\mathbf{1}_{x\geq y} - \alpha)(x - y)
    \end{align*}

    where the expectation is taken over Y \sim F.

    The scoring function or loss function S can be generalized to S_\alpha(x, y) = (\mathbf{1}_{x\geq y} - \alpha)(g(x) - g(y)) for any increasing function g, but the above version in definition 2 is by far the simplest one and coincides with the pinball loss used in quantile regression.

    This definition is super useful because it provides a tool to assess whether a given value really is a quantile. A plot will suffice.

    Since definition 2 poses a convex optimisation problem, there is another definition in terms of the first-order condition of optimality. For continuous, strictly increasing distributions, this would be equivalent to setting the first derivative to zero. For our non-smooth scoring function with potentially set-valued solution, this gets more complicated, e.g. subdifferentials or subgradients replace derivatives. In the end, it amounts to a sign change of the expectation of the so-called identification function V(x, y)=\mathbf{1}_{x\geq y}-\alpha.

    Empirical Distribution

    The empirical distribution provides an excellent example. Given a sample of n observations y_1, \ldots, y_n, the empirical distribution is given by F_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{x\geq y_i}. Let us start simple and take two observations y_1=1 and y_2=2. Plugging this distribution into definition 1 of quantiles gives the exact quantiles of this 2-point empirical CDF:

    \begin{equation}
    q_\alpha(F_2)=
    \begin{cases}
       1 &\text{if } \alpha<\frac{1}{2} \\
       [1, 2] &\text{if } \alpha=\frac{1}{2} \\
       2 &\text{else}
    \end{cases}
    \end{equation}

    Here we encounter the interval [1, 2] for \alpha=\frac{1}{2}. Again, I plot both the (empirical) CDF F_n and the quantiles.

    Empirical distribution function and exact quantiles of observations y=1 and y=2.

    In the left plot, the big dots unambiguously mark the values at x=1 and x=2. For the quantiles in the right plot, the vertical line at probability 0.5 really means that all those values between 1 and 2 are possible 0.5-quantiles, also known as median.

    If you wonder about the value 1 for quantiles of level smaller than 50%, the minimisation formulation helps. The following plot shows \mathbb{E}(S_\alpha(x, Y)) for \alpha=0.2 with a clear unique minimum at x=1.

    Expected scoring function (pinball loss) for \alpha=0.2 for the empirical CDF with observations 1 and 2.
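
    If you like to verify this numerically, here is a minimal sketch (not taken from the post’s notebook) that evaluates the expected pinball loss on a grid of candidate values for the two-point sample. It reproduces both the unique minimum at x=1 for \alpha=0.2 and the whole interval [1, 2] as minimiser for \alpha=0.5.

    import numpy as np
    
    y = np.array([1.0, 2.0])       # the two observations
    x = np.linspace(0.0, 3.0, 31)  # candidate quantile values on a grid
    
    def expected_pinball(x, y, alpha):
        # S_alpha(x, y) = (1{x >= y} - alpha) * (x - y), averaged over the sample
        ind = (x[:, None] >= y).astype(float)
        return np.mean((ind - alpha) * (x[:, None] - y), axis=1)
    
    loss_02 = expected_pinball(x, y, alpha=0.2)
    loss_05 = expected_pinball(x, y, alpha=0.5)
    
    print(x[np.argmin(loss_02)])              # 1.0 -> unique 0.2-quantile
    mins = x[np.isclose(loss_05, loss_05.min())]
    print(mins[0], mins[-1])                  # 1.0 2.0 -> the median is the whole interval [1, 2]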

    A note for the interested reader: The above empirical distribution is the same as the distribution of a Bernoulli random variable, except that the x-values are shifted, i.e. the Bernoulli random variables are canonically set to 0 and 1 instead of 1 and 2. Furthermore, there is a direct connection between quantiles and classification via the cost-weighted misclassification error, see Fissler, Lorentzen & Mayer (2022).

    Empirical Quantiles

    From the empirical CDF, it is only a small step to empirical quantiles. But what’s the difference anyway? While we saw the exact quantile of the empirical distribution, q_\alpha(F_n), an empirical or sample quantile estimates the true (population) quantile given a data sample, i.e. \hat{q}_\alpha(\{y_i\}) \approx q_\alpha(F).

    As an empirical CDF estimates the CDF of the true underlying (population) distribution, F_n=\hat{F} \approx F, one immediate way to estimate a quantile is:

    1. Estimate the CDF via the empirical CDF F_n.
    2. Use the exact quantile in analogy to Eq.(1) as an estimate.

    Very importantly, this is just one way to estimate quantiles from a sample. There are many, many more. Here is the outcome for the 20%-quantile of our tiny data sample y_1=1 and y_2=2.

    import numpy as np
    
    methods = [
        'inverted_cdf',
        'averaged_inverted_cdf',
        'closest_observation',
        'interpolated_inverted_cdf',
        'hazen',
        'weibull',
        'linear',
        'median_unbiased',
        'normal_unbiased',
        'nearest',
        'lower',
        'higher',
        'midpoint',
    ]
    alpha = 0.2
    for m in methods:
        estimate = np.quantile([1, 2], alpha, method=m)
        print(f"{m:<25} {alpha}-quantile estimate = {estimate}")
    inverted_cdf              0.2-quantile estimate = 1
    averaged_inverted_cdf     0.2-quantile estimate = 1.0
    closest_observation       0.2-quantile estimate = 1
    interpolated_inverted_cdf 0.2-quantile estimate = 1.0
    hazen                     0.2-quantile estimate = 1.0
    weibull                   0.2-quantile estimate = 1.0
    linear                    0.2-quantile estimate = 1.2
    median_unbiased           0.2-quantile estimate = 1.0
    normal_unbiased           0.2-quantile estimate = 1.0
    nearest                   0.2-quantile estimate = 1
    lower                     0.2-quantile estimate = 1
    higher                    0.2-quantile estimate = 2
    midpoint                  0.2-quantile estimate = 1.5

    Note that the first 9 methods are the ones discussed in the famous paper by Hyndman & Fan (1996). The default method of both Python’s numpy.quantile and R’s quantile is linear, i.e. number 7 in Hyndman & Fan. Somewhat surprisingly, we observe that this default method is clearly biased in this case and overestimates the true quantile.

    For large sample sizes, the differences will get tiny and all methods finally converge to the true quantile, at least for continuous distributions. In order to assess the bias with small sample sizes for each method, I run a simulation. This is where the fun starts😄

    For all three selected distributions and for quantile levels 15%, 50% and 85%, I simulate 10000 random samples, each of sample size 10 and calculate the sample quantile. Then I take the mean over all 10000 simulations as well as the 5% and the 95% quantiles as a measure of uncertainty, i.e. 90% confidence intervals. After some coding, this results in the following plot. (I spare you the code at this point. You can find it in the linked notebook at the bottom of this post).
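
    To give an idea of how such a simulation can look (this is a minimal sketch, not the notebook code), here is a single distribution and quantile level, assuming a Normal distribution with mean 2 and standard deviation 2:

    import numpy as np
    from scipy import stats
    
    rng = np.random.default_rng(0)
    alpha, n, n_sim = 0.15, 10, 10_000
    true_q = stats.norm.ppf(alpha, loc=2, scale=2)  # theoretical 15%-quantile
    
    for method in ["linear", "hazen", "median_unbiased"]:
        estimates = np.array(
            [np.quantile(rng.normal(2, 2, size=n), alpha, method=method) for _ in range(n_sim)]
        )
        ci_low, ci_high = np.quantile(estimates, [0.05, 0.95])
        print(
            f"{method:<16} mean={estimates.mean():.3f} "
            f"90% interval=[{ci_low:.3f}, {ci_high:.3f}] true={true_q:.3f}"
        )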

    Small sample bias (n=10) of different empirical quantile estimation methods (x-axis and color) based on 10000 simulations. Dots are the mean values, error bars cover a 90% confidence interval. The dotted horizontal line is the theoretical quantile value.
    left: Normal distribution; mid: Gamma distribution; right: Pareto distribution
    top: 15%-quantile; mid: 50%-quantile; bottom: 85%-quantile

    For the 15%-quantile, the default linear method always overestimates, but it does surprisingly well for the 85%-quantile of the Pareto distribution. Overall, I personally would prefer the median unbiased or the Hazen method. Interestingly, the Hazen method is one of the oldest, namely from Hazen (1914), and is the only one that fulfills all proposed properties of Hyndman & Fan, who propose the median unbiased method as default.

    Quantile Regression

    So far, the interest was in the quantile of a sample or distribution. To go one step further, one might ask for the conditional quantile of a response variable Y given some features or covariates X, q_\alpha(Y|X)=q_\alpha(F_{Y|X}). This is the realm of quantile regression as invented by Koenker & Bassett (1978). To stay within linear models, one simply replaces the squared error of ordinary least squares (OLS) by the pinball loss.

    Without going into any more details, I chose the Palmer penguins dataset with the body mass in gram as response variable and all the other variables as features. And again, I model the 15%, 50% and 85% quantiles.

    import seaborn as sns
    
    df = sns.load_dataset("penguins").dropna()
    df
         species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
    0     Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
    1     Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
    2     Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
    4     Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female
    5     Adelie  Torgersen            39.3           20.6              190.0       3650.0    Male
    338   Gentoo     Biscoe            47.2           13.7              214.0       4925.0  Female
    340   Gentoo     Biscoe            46.8           14.3              215.0       4850.0  Female
    341   Gentoo     Biscoe            50.4           15.7              222.0       5750.0    Male
    342   Gentoo     Biscoe            45.2           14.8              212.0       5200.0  Female
    343   Gentoo     Biscoe            49.9           16.1              213.0       5400.0    Male
    from sklearn.base import clone
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import QuantileRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, SplineTransformer
    
    y = df["body_mass_g"]
    X = df.drop(columns="body_mass_g")
    qr50 = Pipeline([
        ("column_transformer",
         ColumnTransformer([
             ("ohe", OneHotEncoder(drop="first"), ["species", "island", "sex"]),
             ("spline", SplineTransformer(n_knots=3, degree=2), ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]),
             
         ])
        ),
        ("quantile_regressor",
         QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
        )
    ])
    qr15 = clone(qr50)
    qr15.set_params(quantile_regressor__quantile=0.15)
    qr85 = clone(qr50)
    qr85.set_params(quantile_regressor__quantile=0.85)
    
    qr15.fit(X, y)
    qr50.fit(X, y)
    qr85.fit(X, y)

    This code snippet gives the three fitted linear quantile models. That was the easy part. I find it much harder to produce good visualisations.

    import polars as pl  # imported and used much earlier in the notebook
    
    df_obs = df.copy()
    df_obs["type"] = "observed"
    dfs = [pl.from_pandas(df_obs)]
    for m, name in [(qr15, "15%-q"), (qr50, "median"), (qr85, "85%-q")]:
        df_pred = df.copy()
        df_pred["type"] = "predicted_" + name
        df_pred["body_mass_g"] = m.predict(X)
        dfs.append(pl.from_pandas(df_pred))
    df_pred = pl.concat(dfs).to_pandas()
    
    sns.catplot(df_pred, x="species", y="body_mass_g", hue="type", alpha=0.5)
    Quantile regression estimates and observed values for species (x-axis).

    This plot per species seems suspicious. We would expect the 85%-quantile predictions in red to mostly be larger than the median (green); instead, we detect several clusters. The reason is the sex of the penguins, which also enters the model. To demonstrate this fact, I plot the sexes separately and make the species visible via the plot style. This time, I put the flipper length on the x-axis.

    sns.relplot(
        df_pred,
        x="flipper_length_mm",
        y="body_mass_g",
        hue="type",
        col="sex",
        style="species",
        alpha=0.5,
    )
    Quantile regression estimates and observed values vs flipper length (x-axis). Left: male; right: female

    This is a really nice plot:

    • One immediately observes the differences in sex on the left and right subplot.
    • The two clusters in each subplot can be well explained by the penguin species.
    • The 3 quantiles are now vertically in the expected order: the 15%-quantile in yellow is the lowest, the 85%-quantile in red the highest, and the median lies in the middle, with some observations beyond those predicted quantiles.
    • The models are linear in flipper length with a positive, but slightly different slope per quantile level.

    As a final check, let us compute the coverage of each quantile model (in-sample).

    [
        np.mean(y <= qr15.predict(X)),
        np.mean(y <= qr50.predict(X)),
        np.mean(y <= qr85.predict(X)),
    ]
    [0.15015015015015015, 0.5225225225225225, 0.8618618618618619]

    Those numbers are close enough to 15%, 50% and 85%.

    As always, the full Python code is available as notebook: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2023-02-11%20empirical_quantiles.ipynb.

  • Histograms, Gradient Boosted Trees, Group-By Queries and One-Hot Encoding

    This post shows how filling histograms can be done in very different ways thereby connecting very different areas: from gradient boosted trees to SQL queries to one-hot encoding. Let’s jump into it!

    Modern gradient boosted trees (GBT) like LightGBM, XGBoost and the HistGradientBoostingRegressor of scikit-learn all use two techniques on top of standard gradient boosting:

    • 2nd order Taylor expansion of the loss which amounts to using gradients and hessians.
    • One histogram per feature: bin the feature and fill the histogram with the gradients and hessians.

    The filling of the histograms is often the bottleneck when fitting GBTs. While filling a single histogram is very fast, this operation is executed many times: for each boosting round, for each tree split and for each feature. This is the reason why GBT implementations have dedicated routines for it. We look into this operation from different angles.

    For the coming (I)Python code snippets to work (# %% indicates a new notebook cell), we need the following imports.

    import duckdb                    # v0.5.1
    import matplotlib.pyplot as plt  # v.3.6.1
    from matplotlib.ticker import MultipleLocator
    import numpy as np               # v1.23.4
    import pandas as pd              # v1.5.0
    import pyarrow as pa             # v9.0.0
    import tabmat                    # v3.1.2
    
    from sklearn.ensemble._hist_gradient_boosting.histogram import (
        _build_histogram_root,
    )                                # v1.1.2
    from sklearn.ensemble._hist_gradient_boosting.common import (
      HISTOGRAM_DTYPE
    )

    Naive Histogram Visualisation

    As a starter, we create a small table with two columns: bin index and value of the hessian.

    def highlight(df):
        if df["bin"] == 0:
            return ["background-color: rgb(255, 128, 128)"] * len(df)
        elif df["bin"] == 1:
            return ["background-color: rgb(128, 255, 128)"] * len(df)
        else:
            return ['background-color: rgb(128, 128, 255)'] * len(df)
    
    df = pd.DataFrame({"bin": [0, 2, 1, 0, 1], "hessian": [1.5, 1, 2, 2.5, 3]})
    df.style.apply(highlight, axis=1)
      bin hessian
    0 0 1.500000
    1 2 1.000000
    2 1 2.000000
    3 0 2.500000
    4 1 3.000000

    A histogram then sums up all the hessian values belonging to the same bin. The result looks like the following.

    Above table visualised as histogram
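
    With the toy table above, this is nothing but a group-by sum, sketched here for clarity (not part of the original post):

    import pandas as pd
    
    # Total hessian per bin of the toy table
    df = pd.DataFrame({"bin": [0, 2, 1, 0, 1], "hessian": [1.5, 1, 2, 2.5, 3]})
    print(df.groupby("bin")["hessian"].sum())
    # bin
    # 0    4.0
    # 1    5.0
    # 2    1.0
    # Name: hessian, dtype: float64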

    Dedicated Method

    We simulate filling the histogram of a single feature. To this end, we draw 1,000,000 random values for gradients and hessians as well as the bin indices.

    import duckdb
    import pyarrow as pa
    import numpy as np
    import tabmat
    
    from sklearn.ensemble._hist_gradient_boosting.histogram import (
        _build_histogram_root,
    )
    from sklearn.ensemble._hist_gradient_boosting.common import HISTOGRAM_DTYPE
    
    
    rng = np.random.default_rng(42)
    n_obs = 1_000_000
    n_bins = 256
    binned_feature = rng.integers(0, n_bins, size=n_obs, dtype=np.uint8)
    gradients = rng.normal(size=n_obs).astype(np.float32)
    hessians = rng.lognormal(size=n_obs).astype(np.float32)

    Now we use the dedicated (and private!) single-threaded method _build_histogram_root from scikit-learn to fill a histogram.

    hist_root = np.zeros((1, n_bins), dtype=HISTOGRAM_DTYPE)
    %time _build_histogram_root(0, binned_feature, gradients, hessians, hist_root)
    # Wall time: 1.38 ms

    This executes in around 1.4 ms. This is quite fast. But again, imagine 100 boosting rounds with 10 tree splits on average and 100 features. This means this is done around 100,000 times and would therefore take roughly 2 minutes.

    Let’s have a look at the first 5 bins:

    hist_root[:, 0:5]
    array([[(-79.72386998, 6508.89500265, 3894),
            ( 37.98393589, 6460.63222205, 3998),
            ( 53.54256977, 6492.22722797, 3805),
            ( 21.19542398, 6797.34159299, 3928),
            ( 16.24716742, 6327.03757573, 3875)]],
          dtype=[('sum_gradients', '<f8'), ('sum_hessians', '<f8'), ('count', '<u4')])
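
    Just for comparison (not part of the original post), the same per-bin sums can be computed with plain numpy via np.bincount, reusing the arrays from the simulation above. It is not as fast as the dedicated Cython routine, but it nicely shows what the histogram filling actually does:

    # Plain-numpy histogram filling: one weighted bin count per statistic
    sum_gradients = np.bincount(binned_feature, weights=gradients, minlength=n_bins)
    sum_hessians = np.bincount(binned_feature, weights=hessians, minlength=n_bins)
    counts = np.bincount(binned_feature, minlength=n_bins)
    print(sum_gradients[:5], sum_hessians[:5], counts[:5])
    # matches the first 5 bins of hist_root above (up to float summation order)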

    SQL Group-By Query

    Someone familiar with SQL and database queries might immediately see how this task can be formulated as a SQL group-by-aggregate query. To demonstrate it on our simulated data, we use DuckDB as well as Apache Arrow (the file format as well as the Python library pyarrow). You can read more about DuckDB in our post DuckDB: Quacking SQL.

    # %%
    con = duckdb.connect()
    arrow_table = pa.Table.from_pydict(
        {
            "bin": binned_feature,
            "gradients": gradients,
            "hessians": hessians,
        }
    )
    # Read data once to make timing fairer
    arrow_result = con.execute("SELECT * FROM arrow_table")
    
    # %%
    %%time
    arrow_result = con.execute("""
    SELECT
        bin as bin,
        SUM(gradients) as sum_gradients,
        SUM(hessians) as sum_hessians,
        COUNT() as count
    FROM arrow_table
    GROUP BY bin
    """).arrow()
    # Wall time: 6.52 ms

    On my laptop, this takes about 6.5 ms and, upon sorting, gives the same results:

    arrow_result.sort_by("bin").slice(length=5)
    pyarrow.Table
    bin: uint8
    sum_gradients: double
    sum_hessians: double
    count: int64
    ----
    bin: [[0,1,2,3,4]]
    sum_gradients: [[-79.72386997545254,37.98393589106854,53.54256977112527,21.195423980039777,16.247167424764484]]
    sum_hessians: [[6508.895002648234,6460.632222048938,6492.227227974683,6797.341592986137,6327.037575732917]]
    count: [[3894,3998,3805,3928,3875]]

    As we have the table as an Arrow table, we can stay within pyarrow:

    %%time
    arrow_result = arrow_table.group_by("bin").aggregate(
        [
            ("gradients", "sum"),
            ("hessians", "sum"),
            ("bin", "count"),
        ]
    )
    # Wall time: 10.8 ms

    The fact that DuckDB is faster than Arrow on this task might have to do with the large effort invested in parallelised group-by operations; see their post Parallel Grouped Aggregation in DuckDB for more info.

    One-Hot encoded Matrix Multiplication

    I think it is very interesting that filling histograms can be written as a matrix multiplication! The trick is to view the feature as a categorical feature and use its one-hot encoded matrix representation. This blows up memory, of course. Note that one-hot encoding is usually encountered with generalized linear models (GLMs) in order to incorporate nominal categorical features, which have no internal ordering, in the design matrix.

    For our demonstration, we use a numpy index trick to construct the one-hot encoded matrix employing the fact that the binned feature already contains the right indices.

    # %%
    %%time
    m_OHE = np.eye(n_bins)[binned_feature].T
    vec = np.column_stack((gradients, hessians, np.ones_like(gradients)))
    # Wall time: 770 ms
    
    # %%
    %time result_ohe = m_OHE @ vec
    # Wall time: 199 ms
    
    # %%
    result_ohe[:5]
    array([[ -79.72386998, 6508.89500265, 3894.        ],
           [  37.98393589, 6460.63222205, 3998.        ],
           [  53.54256977, 6492.22722797, 3805.        ],
           [  21.19542398, 6797.34159299, 3928.        ],
           [  16.24716742, 6327.03757573, 3875.        ]])

    This is way slower but, perhaps surprisingly, produces the same result.

    The one-hot encoded matrix is very sparse, with only one non-zero value per column, i.e. only one out of 256 (number of bins) values is non-zero. This structure can be exploited to reduce both CPU time as well as memory consumption, with the help of the package tabmat that was built to accelerate GLMs. Unfortunately, tabmat only provides a matrix-vector multiplication (and the sandwich product, of course), but no matrix-matrix multiplication. So we have to do a little extra work.

    # %%
    %time m_categorical = tabmat.CategoricalMatrix(cat_vec=binned_feature)
    # Wall time: 21.5 ms
    
    # %%
    # tabmat needs contiguous arrays with dtype = Python float = float64
    vec = np.asfortranarray(vec, dtype=float)
    
    # %%
    %%time
    tabmat_result = np.column_stack(
        (
            vec[:, 0] @ m_categorical,
            vec[:, 1] @ m_categorical,
            vec[:, 2] @ m_categorical,
        )
    )
    # Wall time: 4.82 ms
    
    # %%
    tabmat_result[0:5]
    array([[ -79.72386998, 6508.89500265, 3894.        ],
           [  37.98393589, 6460.63222205, 3998.        ],
           [  53.54256977, 6492.22722797, 3805.        ],
           [  21.19542398, 6797.34159299, 3928.        ],
           [  16.24716742, 6327.03757573, 3875.        ]])

    While the timing of this approach is quite good, the construction of a CategoricalMatrix requires more time than the matrix-vector multiplication.

    Conclusion

    In the end, the special (Cython) routine of scikit-learn is the fastest of our tested methods for filling histograms. The other GBT libraries have their own, even more specialised routines, which might be a reason for even faster fit times. What we learned in this post is that this seemingly simple task plays a very crucial part in modern GBTs and can be accomplished by very different approaches. These different approaches uncover connections between algorithms from quite different domains.

    The full code as ipython notebook can be found at https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2022-10-31%20histogram-GBT-GroupBy-OHE.ipynb.

  • Kernel SHAP in R and Python

    Lost in Translation between R and Python 9

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    Kernel SHAP

    SHAP is one of the most used model interpretation techniques in machine learning. It decomposes predictions into additive contributions of the features in a fair way. For tree-based methods, the fast TreeSHAP algorithm exists. For general models, one has to resort to computationally expensive Monte Carlo sampling or the faster Kernel SHAP algorithm. Kernel SHAP uses a regression trick to get the SHAP values of an observation with a comparatively small number of calls to the predict function of the model. Still, it is much slower than TreeSHAP.

    Two good references for Kernel SHAP:

    1. Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30, 2017.
    2. Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3457-3465, 2021.

    In our last post, we introduced our new “kernelshap” package in R. Since then, the package has been substantially improved, also thanks to the big help of David Watson:

    1. The package now supports multi-dimensional predictions.
    2. It received a massive speed-up.
    3. Additionally, parallel computing can be activated for even faster calculations.
    4. The interface has become more intuitive.
    5. If the number of features is small (up to ten or eleven), it can provide exact Kernel SHAP values just like the reference Python implementation.
    6. For a larger number of features, it now uses partly-exact (“hybrid”) calculations, very similar to the logic in the Python implementation.

    With those changes, the R implementation is about to meet the Python version at eye level.

    Example with four features

    In the following, we use the diamonds data to fit a linear regression with

    • log(price) as response
    • log(carat) as numeric feature
    • clarity, color and cut as categorical features (internally dummy encoded)
    • interactions between log(carat) and the other three “C” variables. Note that the interactions are very weak

    Then, we calculate SHAP decompositions for about 1000 diamonds (every 53rd diamond), using 120 diamonds as background dataset. In this case, both R and Python will use exact calculations based on m = 2^4 - 2 = 14 possible binary on-off vectors (a value of 1 representing a feature value picked from the original observation, a value of 0 a value picked from the background data).
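
    As a small aside (not part of the original post), these on-off vectors are easy to enumerate; dropping the all-zeros and the all-ones vector indeed leaves 14 of them for p = 4 features:

    from itertools import product
    
    p = 4
    # all binary vectors of length p except (0, 0, 0, 0) and (1, 1, 1, 1)
    onoff = [z for z in product([0, 1], repeat=p) if 0 < sum(z) < p]
    print(len(onoff))  # 14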

    library(ggplot2)
    library(kernelshap)
    
    # Turn ordinal factors into unordered
    ord <- c("clarity", "color", "cut")
    diamonds[, ord] <- lapply(diamonds[ord], factor, ordered = FALSE)
    
    # Fit model
    fit <- lm(log(price) ~ log(carat) * (clarity + color + cut), data = diamonds)
    
    # Subset of 120 diamonds used as background data
    bg_X <- diamonds[seq(1, nrow(diamonds), 450), ]
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), c("carat", ord)]
    
    # Exact KernelSHAP (5 seconds)
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X)  
    )
    ks
    
    # SHAP values of first 2 observations:
    #          carat     clarity     color        cut
    # [1,] -2.050074 -0.28048747 0.1281222 0.01587382
    # [2,] -2.085838  0.04050415 0.1283010 0.03731644
    
    # Using parallel backend
    library("doFuture")
    
    registerDoFuture()
    plan(multisession, workers = 2)  # Windows
    # plan(multicore, workers = 2)   # Linux, macOS, Solaris
    
    # 3 seconds on second call
    system.time(
      ks3 <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    
    # Visualization
    library(shapviz)
    
    sv <- shapviz(ks)
    sv_importance(sv, "bee")
    import numpy as np
    import pandas as pd
    from plotnine.data import diamonds
    from statsmodels.formula.api import ols
    from shap import KernelExplainer
    
    # Turn categoricals into integers because, inconveniently, kernel SHAP
    # requires numpy array as input
    ord = ["clarity", "color", "cut"]
    x = ["carat"] + ord
    diamonds[ord] = diamonds[ord].apply(lambda x: x.cat.codes)
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color))", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)  # 74 seconds
    sv[0:2]
    
    # array([[-2.05007406, -0.28048747,  0.12812216,  0.01587382],
    #        [-2.0858379 ,  0.04050415,  0.12830103,  0.03731644]])
    SHAP summary plot (R model)

    The results match, hurray!

    Example with nine features

    The computational effort of running exact Kernel SHAP explodes with the number of features. For nine features, the number of relevant on-off vectors is 2^9 - 2 = 510, i.e. about 36 times larger than with four features.

    We now modify the above example, adding five additional features to the model. Note that the model structure is completely nonsensical. We just use it to get a feeling for what impact a 36 times larger workload has.

    Besides exact calculations, we use an almost exact hybrid approach for both R and Python, using 126 on-off vectors (p*(p+1) for the exact part and 4p for the sampling part, where p is the number of features; for p = 9 this gives 90 + 36 = 126), resulting in a significant speed-up both in R and Python.

    fit <- lm(
      log(price) ~ log(carat) * (clarity + color + cut) + x + y + z + table + depth, 
      data = diamonds
    )
    
    # Subset of 1018 diamonds to explain
    X_small <- diamonds[seq(1, nrow(diamonds), 53), setdiff(names(diamonds), "price")]
    
    # Exact Kernel SHAP: 61 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, exact = TRUE)  
    )
    ks
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624
    
    # Default, using an almost exact hybrid algorithm: 17 seconds
    system.time(
      ks <- kernelshap(fit, X_small, bg_X = bg_X, parallel = TRUE)  
    )
    #          carat        cut     color     clarity         depth         table          x           y            z
    # [1,] -1.842799 0.01424231 0.1266108 -0.27033874 -0.0007084443  0.0017787647 -0.1720782 0.001330275 -0.006445693
    # [2,] -1.876709 0.03856957 0.1266546  0.03932912 -0.0004202636 -0.0004871776 -0.1739880 0.001397792 -0.006560624
    x = ["carat"] + ord + ["table", "depth", "x", "y", "z"]
    X = diamonds[x].to_numpy()
    
    # Fit model with interactions and dummy variables
    fit = ols(
      "np.log(price) ~ np.log(carat) * (C(clarity) + C(cut) + C(color)) + table + depth + x + y + z", 
      data=diamonds
    ).fit()
    
    # Background data (120 rows)
    bg_X = X[0:len(X):450]
    
    # Define subset of 1018 diamonds to explain
    X_small = X[0:len(X):53]
    
    # Calculate KernelSHAP values: 12 minutes
    ks = KernelExplainer(
      model=lambda X: fit.predict(pd.DataFrame(X, columns=x)), 
      data = bg_X
    )
    sv = ks.shap_values(X_small)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])
    
    # Now, using a hybrid between exact and sampling: 5 minutes
    sv = ks.shap_values(X_small, nsamples=126)
    sv[0:2]
    # array([[-1.84279897e+00, -2.70338744e-01,  1.26610769e-01,
    #          1.42423108e-02,  1.77876470e-03, -7.08444295e-04,
    #         -1.72078182e-01,  1.33027467e-03, -6.44569296e-03],
    #        [-1.87670887e+00,  3.93291219e-02,  1.26654599e-01,
    #          3.85695742e-02, -4.87177593e-04, -4.20263565e-04,
    #         -1.73988040e-01,  1.39779179e-03, -6.56062359e-03]])

    Again, the results are essentially the same between R and Python, but also between the hybrid algorithm and the exact algorithm. This is interesting, because the hybrid algorithm is significantly faster than the exact one.

    Wrap-Up

    • R is catching up with Python’s superb “shap” package.
    • For two non-trivial linear regressions with interactions, the “kernelshap” package in R provides the same output as Python.
    • The hybrid between exact and sampling KernelSHAP (as implemented in Python and R) offers a very good trade-off between speed and accuracy.
    • kernelshap() in R is fast!

    The Python and R codes can be found here:

    The examples were run on a Windows notebook with an Intel i7-8650U 4 core CPU.

  • From Least Squares Benchmarks to the Marchenko–Pastur Distribution

    In this blog post, I tell the story how I learned about a theorem for random matrices of the two Ukrainian🇺🇦 mathematicians Vladimir Marchenko and Leonid Pastur. It all started with benchmarking least squares solvers in scipy.

    Setting the Stage for Least Squares Solvers

    Least squares starts with a matrix A \in \mathbb{R}^{n,p} and a vector b \in \mathbb{R}^{n} and one is interested in solution vectors x \in \mathbb{R}^{p} fulfilling

    x^\star = \argmin_x ||Ax-b||_2^2 \,.

    You can read more about least squares in our earlier post Least Squares Minimal Norm Solution. There are many possible ways to tackle this problem and many algorithms are available. One standard solver is LSQR with the following properties:

    • Iterative solver, which terminates when some stopping criteria are smaller than a user specified tolerance.
    • It only uses matrix-vector products. Therefore, it is suitable for large and sparse matrices A.
    • It effectively solves the normal equations A^T A x = A^T b based on a bidiagonalization procedure of Golub and Kahan (so never really calculating A^T A).
    • It is, unfortunately, susceptible to ill-conditioned matrices A, which we demonstrated in our earlier post.

    Wait, what is an ill-conditioned matrix? This is most easily explained with the help of the singular value decomposition (SVD). Any real valued matrix permits a decomposition into three parts:

    A = U S V^T \,.

    U and V are orthogonal matrices, but not of further interest to us. The matrix S, on the other hand, is (rectangular) diagonal with only non-negative entries s_i = S_{ii} \geq 0. A matrix A is said to be ill-conditioned if it has a large condition number, which can be defined as the ratio of largest and smallest singular value, \mathrm{cond}(A) = \frac{\max_i s_i}{\min_i s_i} = \frac{s_{\mathrm{max}}}{s_{\mathrm{min}}}. Very often, large condition numbers make numerical computations difficult or less precise.
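
    As a small illustration (not part of the original post), the condition number can be computed directly from the singular values of a random Gaussian matrix:

    import numpy as np
    
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 50))
    s = np.linalg.svd(A, compute_uv=False)  # singular values, in decreasing order
    print(s.max() / s.min())                # ratio of largest to smallest singular value
    print(np.linalg.cond(A))                # numpy's helper returns the same number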

    Benchmarking LSQR

    One day, I decided to benchmark the computation time of least squares solvers provided by scipy, in particular LSQR. I wanted results for different sizes n, p of the matrix dimensions. So I needed to somehow generate different matrices A. There are myriad ways to do that. Naive as I was, I did the simplest thing and used standard Normal (Gaussian) distributed random matrices A_{ij} \sim \mathcal{N}(0, 1) and ran benchmarks on those. Let’s see how that looks in Python.

    from collections import OrderedDict
    from functools import partial
    
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import lstsq
    from scipy.sparse.linalg import lsqr
    import seaborn as sns
    from neurtu import Benchmark, delayed
    
    plt.ion()
    
    p_list = [100, 500]
    rng = np.random.default_rng(42)
    X = rng.standard_normal(max(p_list) ** 2 * 2)
    y = rng.standard_normal(max(p_list) * 2)
    
    def bench_cases():    
        for p in p_list:
            for n in np.arange(0.1, 2.1, 0.1):
                n = int(p*n)
                A = X[:n*p].reshape(n, p)
                b = y[:n]
                for solver in ['lstsq', 'lsqr']:
                    tags = OrderedDict(n=n, p=p, solver=solver)
                    if solver == 'lstsq':
                        solve = delayed(lstsq, tags=tags)
                    elif solver == 'lsqr':
                        solve = delayed(
                            partial(
                                lsqr, atol=1e-10, btol=1e-10, iter_lim=1e6
                            ),
                            tags=tags)
    
                    yield solve(A, b)
    
    bench = Benchmark(wall_time=True)
    df = bench(bench_cases())
    
    g = sns.relplot(x='n', y='wall_time', hue='solver', col='p',
                kind='line', facet_kws={'sharex': False, 'sharey': False},
                data=df.reset_index(), marker='o')
    g.set_titles("p = {col_name}")
    g.set_axis_labels("n", "Wall time (s)")
    g.set(xscale="linear", yscale="log")
    
    plt.subplots_adjust(top=0.9)
    g.fig.suptitle('Benchmark LSQR vs LSTSQ')
    for ax in g.axes.flatten():
        ax.tick_params(labelbottom=True)
    Timing benchmark of LSQR and standard LAPACK LSTSQ solvers for different sizes of n on the x-axis and p=100 left, p=500 right.

    The left plot already looks a bit suspicious around n=p. But what is happening on the right side? Where does this spike of LSQR come from? And why does the standard least squares solver, SVD-based lstsq, not show this spike?

    When I saw these results, I thought something might be wrong with LSQR and opened an issue on the scipy github repository, see https://github.com/scipy/scipy/issues/11204. The community there is really fantastic. Brett Naul pointed me to ….

    The Marchenko–Pastur Distribution

    The Marchenko–Pastur distribution is the distribution of the eigenvalues (singular values of square matrices) of certain random matrices in the large sample limit. Given a random matrix A \in \mathbb{R}^{n,p} with i.i.d. entries A_{ij} having zero mean, \mathbb{E}[A_{ij}] = 0, and finite variance, \mathrm{Var}[A_{ij}] = \sigma^2 < \infty, we define the matrix Y_n = \frac{1}{n}A^T A \in \mathbb{R}^{p,p}. As a square and even symmetric matrix, Y_n has a simpler SVD, namely Y_n = V \Sigma V^T. One can in fact show that V is the same as in the SVD of A and that the diagonal matrix \Sigma = \frac{1}{n}S^TS contains the squared singular values of A and \max(0, p-n) extra zero values. The (diagonal) values of \Sigma are called eigenvalues \lambda_1, \ldots, \lambda_p of Y_n. Note that the eigenvalues are themselves random variables, and we are interested in their probability distribution or probability measure. We define the (random) measure \mu_p(B) = \frac{1}{p} \#\{\lambda_j \in B\} for all intervals B \subset \mathbb{R}.

    The theorem of Marchenko and Pastur then states that for n, p \rightarrow \infty with \frac{p}{n} \rightarrow \rho , we have \mu_p \rightarrow \mu, where

    \mu(B) = \begin{cases} (1-\frac{1}{\rho})\mathbb{1}_{0\in B} + \nu(B),\quad &\rho > 1 \\ \nu(B), \quad & 0\leq\rho\leq 1 \end{cases} \,,
    \nu(x) = \frac{1}{2\pi\sigma^2} \frac{\sqrt{(\rho_+ - x)(x - \rho_-)}}{\rho x} \mathbb{1}_{x \in [\rho_-, \rho_+]} dx\,,
    \rho_{\pm} = \sigma^2 (1\pm\sqrt{\rho})^2 \,.

    We can at least derive the point mass at zero for \rho>1 \Leftrightarrow p>n: We said above that \Sigma contains p-n extra zeros, and those correspond to a point mass of \frac{p-n}{p}=1 - \frac{1}{\rho} at zero.

    A lot of math so far. Just note that the assumptions on A are exactly met by the one in our benchmark above. Also note that the normal equations can be expressed in terms of Y_n as n Y_n x = A^Tb.

    Empirical Confirmation of the Marchenko–Pastur Distribution

    Before we come back to the spikes in our benchmark, let us have a look at how well the Marchenko–Pastur distribution is approximated for finite sample sizes. We choose n=1000, p=500, which gives \rho=\frac{1}{2}. We plot a histogram of the eigenvalues next to the Marchenko–Pastur distribution.

    def marchenko_pastur_mu(x, rho, sigma2=1):
        """Density of the Marchenko–Pastur distribution evaluated at x.

        For rho > 1, the value returned at x == 0 is the point mass 1 - 1/rho
        (handy for plotting, even though it is not a density value).
        """
        x = np.atleast_1d(x).astype(float)
        rho_p = sigma2 * (1 + np.sqrt(rho)) ** 2  # right edge of the support
        rho_m = sigma2 * (1 - np.sqrt(rho)) ** 2  # left edge of the support
        mu = np.zeros_like(x)
        is_nonzero = (rho_m < x) & (x < rho_p)
        x_valid = x[is_nonzero]
        factor = 1 / (2 * np.pi * sigma2 * rho)
        mu[is_nonzero] = factor / x_valid
        mu[is_nonzero] *= np.sqrt((rho_p - x_valid) * (x_valid - rho_m))
        if rho > 1:
            mu[x == 0] = 1 - 1 / rho  # point mass at zero
        return mu
    
    fig, ax = plt.subplots()
    
    n, p = 1000, 500
    A = X.reshape(n, p)
    Y = 1/n * A.T @ A
    eigenvals, _ = np.linalg.eig(Y)
    ax.hist(eigenvals.real, bins=50, density=True, label="histogram")
    x = np.linspace(0, np.max(eigenvals.real), 100)
    ax.plot(x, marchenko_pastur_mu(x, rho=p/n), label="MP distribution")
    ax.legend()
    ax.set_xlabel("eigenvalue")
    ax.set_ylabel("probability")
    ax.set_title("Empirical evidence for n=1000, p=500, rho=0.5")
    Histogram of eigenvalues for n=1000, p=500 versus the Marchenko–Pastur distribution

    I have to say, I am very impressed by this good agreement for n=1000, which is far from being huge.

    Conclusion

    Let’s visualize the Marchenko–Pastur distribution of the eigenvalues of Y_n for several ratios \rho, fixing \sigma=1:

    fig, ax = plt.subplots()
    
    rho_list = [0.5, 1, 1.5]
    x = np.linspace(0, 5, 1000)[1:] # exclude 0
    for rho in rho_list:
        y = marchenko_pastur_mu(x, rho)
        line, = ax.plot(x, y, label=f"rho={rho}")
        # plot zero point mass
        if rho > 1:
            ax.scatter(0, marchenko_pastur_mu(0, rho), color=line.get_color())
    
    ax.set_ylim(None, 1.2)
    ax.legend()
    ax.set_title("Marchenko-Pastur Distribution")
    ax.set_xlabel("x")
    ax.set_ylabel("dmu/dx")
    Marchenko-Pastur distribution for several ratios rho. The green dot is the point mass at zero.

    From this figure it becomes obvious that the closer the ratio \rho is to 1, the higher the probability of very tiny eigenvalues. This results in a high probability of an ill-conditioned matrix A^TA coming from an ill-conditioned matrix A. Let’s confirm that:

    p = 500
    n_vec = []
    c_vec = []
    for ratio in np.arange(0.1, 2.05, 0.05):  # ratio n / p
        n = int(p * ratio)
        A = X[:n * p].reshape(n, p)
        n_vec.append(n)
        c_vec.append(np.linalg.cond(A))
    
    fig, ax = plt.subplots()
    ax.plot(n_vec, c_vec)
    ax.set_xlabel("n")
    ax.set_ylabel("condition number of A")
    ax.set_title("Condition Number of A for p=500")
    Condition number of the random matrix A

    As a result of the ill-conditioned A, the LSQR solver has problems achieving its tolerance criterion, needs more iterations, and therefore takes more time. This is exactly what we observed in the benchmark plots: the peak occurred around n=p. The SVD-based lstsq solver, on the other hand, does not use an iterative scheme and does not need more time for ill-conditioned matrices.
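
    To illustrate this point, here is a small sketch (sizes chosen by me, much smaller than in the benchmark) that compares LSQR iteration counts for a tall matrix with n much larger than p against a nearly square one with n close to p, using scipy.sparse.linalg.lsqr:

    # Sketch: LSQR needs more iterations when A is ill-conditioned (n close to p).
    import numpy as np
    from scipy.sparse.linalg import lsqr
    
    rng = np.random.default_rng(0)
    p = 200
    for n in (1000, 210):  # n >> p vs. n close to p
        A = rng.standard_normal((n, p))
        b = rng.standard_normal(n)
        itn = lsqr(A, b)[2]  # third return value is the iteration count
        print(f"n={n}, cond(A)={np.linalg.cond(A):.1f}, LSQR iterations={itn}")

    The nearly square case should show a much larger condition number and noticeably more iterations, mirroring the run time spike around n=p.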

    You find the accompanying notebook here: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2022-05-22%20from%20least%20squares%20benchmarks%20to%20the%20marchenko-pastur%20distribution.ipynb

  • DuckDB: Quacking SQL

    Lost in Translation between R and Python 8

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    DuckDB

    DuckDB is a fantastic in-process SQL database management system written completely in C++. Check its official documentation and other blog posts like this one to get a feeling for its superpowers. It is getting better and better!

    Some of the highlights:

    • Easy installation in R and Python, made possible via language bindings.
    • Multithreaded and fast.
    • Allows working with data bigger than RAM.
    • Can fire SQL queries on R and Pandas tables.
    • Can fire SQL queries on (multiple!) csv and/or Parquet files (see the minimal sketch after this list).
    • Quacks Apache Arrow.
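
    Here is the minimal sketch referenced in the list above. It assumes a hypothetical folder data/ containing several Parquet files and only illustrates the glob syntax, not a complete workflow:

    # Hypothetical example: one SQL query over all Parquet files matching a glob
    import duckdb
    
    con = duckdb.connect()
    con.sql("SELECT COUNT(*) AS n_rows FROM 'data/*.parquet'").show()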

    Installation

    DuckDB is super easy to install:

    • R: install.packages("duckdb")
    • Python: pip install duckdb

    Additional packages required to run the code of this post are indicated in the code.

    A first query

    Let’s start by loading a dataset, initializing DuckDB and running a simple query.

    The dataset we use here contains information on over 20,000 sold houses in King County, Washington. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    library(OpenML)
    library(duckdb)
    library(tidyverse)
    
    # Load data
    df <- getOMLDataSet(data.id = 42092)$data
    
    # Initialize duckdb, register df and materialize first query
    con <- dbConnect(duckdb())
    duckdb_register(con, name = "df", df = df)
    con %>% 
      dbSendQuery("SELECT * FROM df limit 5") %>% 
      dbFetch()
    import duckdb
    import pandas as pd
    from sklearn.datasets import fetch_openml
    
    # Load data
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    
    # Initialize duckdb, register df and fire first query
    # If out-of-RAM: duckdb.connect("py.duckdb", config={"temp_directory": "a_directory"})
    con = duckdb.connect()
    con.register("df", df)
    con.execute("SELECT * FROM df limit 5").fetchdf()
    Result of first query (from R)

    Average price per grade

    If you like SQL, then you can do your data preprocessing and simple analyses with DuckDB. Here, we calculate the average house price per grade (the higher the grade, the better the house).

    query <- 
      "
      SELECT AVG(price) avg_price, grade 
      FROM df 
      GROUP BY grade
      ORDER BY grade
      "
    avg <- con %>% 
      dbSendQuery(query) %>% 
      dbFetch()
    
    avg
    
    # Average price per grade
    query = """
      SELECT AVG(price) avg_price, grade 
      FROM df 
      GROUP BY grade
      ORDER BY grade
      """
    avg = con.execute(query).fetchdf()
    avg
    R output

    Highlight: queries to files

    The last query of this post will be applied directly to files on disk. To demonstrate this fantastic feature, we first save “df” as a Parquet file and “avg” as a csv file.

    library(arrow)  # needed for write_parquet()
    write_parquet(df, "housing.parquet")
    write.csv(avg, "housing_avg.csv", row.names = FALSE)
    
    # Save df and avg to different file types
    df.to_parquet("housing.parquet")  # pyarrow=7
    avg.to_csv("housing_avg.csv", index=False)

    Let’s load some columns of the "housing.parquet" data, but only rows with grades having an average price above one million USD. Agreed, that query does not make too much sense, but I hope you get the idea…😃

    # "Complex" query
    query2 <- "
      SELECT price, sqft_living, A.grade, avg_price
      FROM 'housing.parquet' A
      LEFT JOIN 'housing_avg.csv' B
      ON A.grade = B.grade
      WHERE B.avg_price > 1000000
      "
    
    expensive_grades <- con %>% 
      dbSendQuery(query2) %>% 
      dbFetch()
    
    head(expensive_grades)
    
    # dbDisconnect(con)
    # Complex query
    query2 = """
      SELECT price, sqft_living, A.grade, avg_price
      FROM 'housing.parquet' A
      LEFT JOIN 'housing_avg.csv' B
      ON A.grade = B.grade
      WHERE B.avg_price > 1000000
      """
    expensive_grades = con.execute(query2).fetchdf()
    expensive_grades
    
    # con.close()
    R output

    Last words

    • DuckDB is cool!
    • If you have strong SQL skills but do not know R or Python so well, this is a great way to get used to those programming languages.
    • If you are unfamiliar with SQL but like R and/or Python, you can use DuckDB for a while and end up being an SQL addict.
    • If your analysis involves combining many large files during preprocessing, then you can try the trick shown in the last example of this post.

    The Python notebook and R code can be found at: