Tag: Python

Python programming language

  • Random Forests with Monotonic Constraints

    Lost in Translation between R and Python 7

    Hello random forest friends

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    Monotonic constraints

    On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different. There, the majority of models do not serve as pure prediction machines but rather as a fruitful source of information. Furthermore, even if used as a prediction machine, the users of the model might expect a certain degree of consistency when “playing” with input values.

    A classic example is the statistical house appraisal model. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

    One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

    Trees and monotonic constraints

    Monotonicity constraints are especially simple to implement for decision trees. The rule is basically as follows:
    If a monotonicity constraint would be violated by a split on feature X, it is rejected. (Or a large penalty is subtracted from the corresponding split gain.) This will imply monotonic behavior of predictions in X, keeping all other features fixed.

    Tree ensembles like boosted trees or random forests will automatically inherit this property.
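
    To make the split rule concrete, here is a minimal Python sketch (not the internals of any particular library), assuming an increasing constraint: a candidate split is only accepted if the mean response in the left child does not exceed the mean response in the right child.

    import numpy as np

    def split_allowed_increasing(y_left, y_right):
        """Accept a split on a feature with an increasing constraint only if
        the left child's mean response <= the right child's mean response."""
        return np.mean(y_left) <= np.mean(y_right)

    # Example: responses falling into the two children of a candidate split
    y_left = np.array([3.1, 2.9, 3.0])   # rows with lower feature values
    y_right = np.array([2.0, 2.2])       # rows with higher feature values
    print(split_allowed_increasing(y_left, y_right))  # False -> reject the split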

    Boosted trees

    Most implementations of boosted trees offer monotonicity constraints. Here is a selection:

    What about random forests?

    Unfortunately, the picture is completely different for random forests. At the time of writing, I am not aware of any random forest implementation in R or Python offering this useful feature.

    Some options

    1. Implement monotonically constrained random forests from scratch.
    2. Ask for this feature in existing implementations.
    3. Be creative and use XGBoost to emulate random forests.

    For the moment, let’s stick to option 3. In our last R <-> Python blog post, we demonstrated that XGBoost’s random forest mode works essentially as well as standard random forest implementations, at least in regression settings and using sensible defaults.

    Warning: Be careful with imposing monotonicity constraints

    Ask yourself: does the constraint really make sense for all possible values of other features? You will see that the answer is often “no”.

    An example: If your house price model uses the features "number of rooms" and "living area", then a monotonic constraint on "living area" might make sense (given any number of rooms), while such a constraint would be nonsensical for the number of rooms. Why? Because having six rooms in a 1200 square feet home is not necessarily better than having just five rooms in an equally sized home.

    Let’s try it out

    We use a nice dataset containing information on over 20,000 sold houses in King County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    Some rows and columns from the King County house dataset.

    The following R and Python codes

    • fetch the data,
    • prepare the ML setting,
    • fit unconstrained XGBoost random forests using log sales price as response,
    • and visualize the effect of log ground area by individual conditional expectation (ICE) curves.

    An ICE curve for variable X shows how the prediction of one specific observation changes if the value of X changes. Repeating this for multiple observations gives an idea of the effect of X. The average over multiple ICE curves produces the famous partial dependence plot.
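
    Before the full R and Python code below, here is a minimal Python sketch of the ICE idea for any fitted model with a predict method: take one observation, replace the feature of interest by a grid of values, and predict. The function and grid names are only placeholders.

    import numpy as np

    def ice_curve(model, X, row_index, feature, grid):
        """Predictions for one observation with `feature` replaced by grid values."""
        X_rep = X.iloc[np.repeat(row_index, len(grid))].copy()
        X_rep[feature] = grid
        return model.predict(X_rep)

    # Hypothetical usage with the model and data fitted below:
    # grid = np.arange(7, 11.1, 0.1)
    # curve = ice_curve(unconstrained, X_train, 0, "log_sqft_lot", grid)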

    library(farff)
    library(OpenML)
    library(dplyr)
    library(xgboost)
    
    set.seed(83454)
    
    rmse <- function(y, pred) {
      sqrt(mean((y-pred)^2))
    }
    
    # Load King County house prices dataset on OpenML
    # ID 42092, https://www.openml.org/d/42092
    df <- getOMLDataSet(data.id = 42092)$data
    head(df)
    
    # Prepare
    df <- df %>%
      mutate(
        log_price = log(price),
        log_sqft_lot = log(sqft_lot),
        year = as.numeric(substr(date, 1, 4)),
        building_age = year - yr_built,
        zipcode = as.integer(as.character(zipcode))
      )
    
    # Define response and features
    y <- "log_price"
    x <- c("grade", "year", "building_age", "sqft_living",
           "log_sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode",
           "lat", "long", "condition", "waterfront")
    
    # random split
    ix <- sample(nrow(df), 0.8 * nrow(df))
    y_test <- df[[y]][-ix]
    
    # Fit untuned, but good(!) XGBoost random forest
    dtrain <- xgb.DMatrix(data.matrix(df[ix, x]),
                          label = df[ix, y])
    
    params <- list(
      objective = "reg:squarederror",
      learning_rate = 1,
      num_parallel_tree = 500,
      subsample = 0.63,
      colsample_bynode = 1/3,
      reg_lambda = 0,
      max_depth = 20,
      min_child_weight = 2
    )
    
    system.time( # 25 s
      unconstrained <- xgb.train(
        params,
        data = dtrain,
        nrounds = 1,
        verbose = 0
      )
    )
    
    pred <- predict(unconstrained, data.matrix(df[-ix, x]))
    
    # Test RMSE: 0.172
    rmse(y_test, pred)
    
    # ICE curves via our flashlight package
    library(flashlight)
    
    pred_xgb <- function(m, X) predict(m, data.matrix(X[, x]))
    
    fl <- flashlight(
      model = unconstrained,
      label = "unconstrained",
      data = df[ix, ],
      predict_function = pred_xgb
    )
    
    light_ice(fl, v = "log_sqft_lot", indices = 1:9,
              evaluate_at = seq(7, 11, by = 0.1)) %>%
      plot()
    # Imports
    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.inspection import PartialDependenceDisplay
    from xgboost import XGBRFRegressor
    
    # Fetch data from OpenML
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    
    # Prepare data
    df = df.assign(
        year=lambda x: x.date.str[0:4].astype(int),
        zipcode=lambda x: x.zipcode.astype(int),
        log_sqft_lot=lambda x: np.log(x.sqft_lot),
        building_age=lambda x: x.year - x.yr_built,
    )
    
    # Feature list
    xvars = [
        "grade",
        "year",
        "building_age",
        "sqft_living",
        "log_sqft_lot",
        "bedrooms",
        "bathrooms",
        "floors",
        "zipcode",
        "lat",
        "long",
        "condition",
        "waterfront",
    ]
    
    # Data split
    y_train, y_test, X_train, X_test = train_test_split(
        np.log(df["price"]), df[xvars], train_size=0.8, random_state=766
    )
    
    # Modeling - wall time: 39 seconds
    param_dict = dict(
        n_estimators=500,
        max_depth=20,
        learning_rate=1,
        subsample=0.63,
        colsample_bynode=1 / 3,
        reg_lambda=0,
        objective="reg:squarederror",
        min_child_weight=2,
    )
    
    unconstrained = XGBRFRegressor(**param_dict).fit(X_train, y_train)
    
    # Test RMSE 0.176
    pred = unconstrained.predict(X_test)
    print(f"RMSE: {mean_squared_error(y_test, pred, squared=False):.03f}")
    
    # ICE and PDP - wall time: 47 seconds
    PartialDependenceDisplay.from_estimator(
        unconstrained,
        X=X_train,
        features=["log_sqft_lot"],
        kind="both",
        subsample=20,
        random_state=1,
    )

    Figure 1 (R output): ICE curves of log(ground area) for the first nine observations. Many non-monotonic parts are visible.

    We clearly see many non-monotonic (and in this case counterintuitive) ICE curves.

    What would a model give with monotonically increasing constraint on the ground area?

    # Monotonic increasing constraint
    (params$monotone_constraints <- 1 * (x == "log_sqft_lot"))
    
    system.time( #  179s
      monotonic <- xgb.train(
        params,
        data = dtrain,
        nrounds = 1,
        verbose = 0
      )
    )
    
    pred <- predict(monotonic, data.matrix(df[-ix, x]))
    
    # Test RMSE: 0.176
    rmse(y_test, pred)
    
    fl_m <- flashlight(
      model = monotonic,
      label = "monotonic",
      data = df[ix, ],
      predict_function = pred_xgb
    )
    
    light_ice(fl_m, v = "log_sqft_lot", indices = 1:9,
              evaluate_at = seq(7, 11, by = 0.1)) %>%
      plot()
    # One needs to pass the constraints as a single string, which is rather ugly
    mc = "(" + ",".join([str(int(x == "log_sqft_lot")) for x in xvars]) + ")"
    print(mc)
    
    # Modeling - wall time 49 seconds
    constrained = XGBRFRegressor(monotone_constraints=mc, **param_dict)
    constrained.fit(X_train, y_train)
    
    # Test RMSE: 0.178
    pred = constrained.predict(X_test)
    print(f"RMSE: {mean_squared_error(y_test, pred, squared=False):.03f}")
    
    # ICE and PDP - wall time 39 seconds
    PartialDependenceDisplay.from_estimator(
        constrained,
        X=X_train,
        features=["log_sqft_lot"],
        kind="both",
        subsample=20,
        random_state=1,
    )

    Figure 2 (R output): ICE curves of the same observations as in Figure 1, but now with monotonic constraint. All curves are monotonically increasing.

    We see:

    1. It works! Each ICE curve in log(lot area) is monotonically increasing. This means that predictions are monotonically increasing in lot area, keeping all other feature values fixed.
    2. The model performance is slightly worse. This is the price paid for receiving a more intuitive behaviour in an important feature.
    3. In Python, both models take about the same time to fit (30-40 s on a 4 core i7 CPU laptop). Curiously, in R, the constrained model takes about six times longer to fit than the unconstrained one (170 s vs 30 s).

    Summary

    • Monotonic constraints help to create intuitive models.
    • Unfortunately, as of now, native random forest implementations do not offer such constraints.
    • Using XGBoost’s random forest mode is a temporary solution until native random forest implementations add this feature.
    • Be careful not to add too many constraints: does a constraint really make sense for all other (fixed) choices of feature values?

    The Python notebook and R code can be found at:

  • Personal Highlights of Scikit-Learn 1.0

    Yes! After more than 10 years, scikit-learn released its 1.0 version on 24 September 2021. In this post, I’d like to point out some personal highlights apart from the release highlights.

    1. Feature Names

    This one is listed in the release highlights, but deserves to be mentioned again.

    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    import pandas as pd
    
    df = pd.DataFrame({
        "pet": ["dog", "cat", "fish"],
        "age": [3, 7, 1],
        "noise": [-99, pd.NA, 1e-10],
        "target": [1, 0, 1],
    })
    y = df.pop("target")
    X = df
    
    preprocessor = ColumnTransformer(
        [
            ("numerical", StandardScaler(), ["age"]),
            ("categorical", OneHotEncoder(), ["pet"]),
        ],
        verbose_feature_names_out=False,
        remainder="drop",
    )
    
    pipe = make_pipeline(preprocessor, LogisticRegression())
    pipe.fit(X, y)
    pipe[:-1].get_feature_names_out()
    array(['age', 'pet_cat', 'pet_dog', 'pet_fish'], dtype=object)

    This is not yet available for all estimators and transformers, but it is a big step towards SLEP007.

    2. ColumnTransformer allows changed order of columns

    Before this release, ColumnTransformer recorded the order of the columns of a dataframe during the fit method and required that a dataframe X passed to transform had the exact same columns and in the exact same order.

    This was a big pain point in production settings because fit and predict of a model pipeline, both calling transform, often get data from different sources, and, for instance, SQL does not care about the order of columns. On top of that, remainder=”drop” forced you to have all dropped columns present in transform as well. This contradicted at least my modelling workflow, as I often specify all meaningful features explicitly and drop the rest via the remainder option, which then led to unwanted surprises when applying predict to new data in the end. It might also happen that one forgets to remove the target variable from the training X and relies on the drop option. Usually, the predictive model pipeline is applied to new data without the target variable; the resulting error in this case, however, might be considered a good thing.

    With pull request (PR) #19263, the ColumnTransformer only cares about the presence and names of the columns, not about their order. With remainder=”drop”, it only cares about the specified columns and ignores all others, even if the dropped columns differ between fit and transform. Note that this only works with pandas dataframes as input (or an object that quacks alike).

    df_new = pd.DataFrame({
        "age": [1, 9, 3],
        "another_noise": [pd.NA, -99, 1e-10],
        "pet": ["cat", "dog", "fish"],
    })
    pipe.predict(df_new)

    You find these little code snippets as notebook at the usual place: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-10-21%20scikit-learn_v1_release_highlights.ipynb.

    3. Poisson criterion for random forests

    Scikit-learn v0.24 shipped with the new option criterion=”poisson” for DecisionTreeRegressor to split nodes based on the reduction of Poisson deviance. Version 1.0 passed this option on to the RandomForestRegressor in PR #19836. Random forests are widely used models, valued for their ease of use. We even like to write blog posts about them:

    The Poisson splitting criterion has its place when modelling counts or frequencies. It allows non-negative target values but requires strictly positive predictions. This corresponds to y_{train} \geq 0 and y_{predict} > 0.
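
    A minimal usage sketch with made-up count data (the parameter values are arbitrary):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(1000, 3))
    y = rng.poisson(lam=np.exp(X[:, 0]))  # non-negative integer counts

    rf_poisson = RandomForestRegressor(criterion="poisson", n_estimators=100, random_state=0)
    rf_poisson.fit(X, y)
    print(rf_poisson.predict(X[:5]))  # strictly positive predictions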

    4. The best example/tutorial of the year

    It’s not visible from the release notes, but this deserves to be noted. PR #20281 added a fantastic example, more like a tutorial, on time-related feature engineering. You find a lot of interesting features, some of them shipped with the 1.0 release, e.g. time-based cross-validation, generation of cyclic B-splines, adding pairwise interactions to a linear model, and usage of native categorical features in the HistGradientBoostingRegressor.

    Take a look for yourself at this wonderful example.

  • Feature Subsampling For Random Forest Regression

    TLDR: The number of subsampled features is a main source of randomness and an important parameter in random forests. Mind the different default values across implementations.

    Randomness in Random Forests

    Random forests are very popular machine learning models. They are built from easily understandable and well visualizable decision trees and usually give good predictive performance without the need for excessive hyperparameter tuning. Some drawbacks are that they do not scale well to very large datasets and that their predictions are discontinuous in continuous features.

    A key ingredient for random forests is—no surprise here—randomness. The two main sources for randomness are:

    • Feature subsampling in every node split when fitting decision trees.
    • Row subsampling (bagging) of the training dataset for each decision tree.

    In this post, we want to investigate the first source, feature subsampling, with a special focus on regression problems on continuous targets (as opposed to classification).

    Feature Subsampling

    In his seminal paper, Leo Breiman introduced random forests and pointed out several advantages of feature subsampling per node split. We cite from his paper:

    The forests studied here consist of using randomly selected inputs or combinations of inputs at each node to grow each tree. The resulting forests give accuracy that compare favorably with Adaboost. This class of procedures has desirable characteristics:

    i. Its accuracy is as good as Adaboost and sometimes better.

    ii. It’s relatively robust to outliers and noise.

    iii. It’s faster than bagging or boosting.

    iv. It gives useful internal estimates of error, strength, correlation and variable importance.

    v. It’s simple and easily parallelized.

    Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).

    Note the focus on comparing with Adaboost at that time and the, in today’s view, relatively small datasets used for empirical studies in this paper.

    If the input data has p features (columns), implementations of random forests usually allow specifying how many features to consider at each split:

    Implementation                        Language      Parameter     Default
    scikit-learn RandomForestRegressor    Python        max_features  p
    scikit-learn RandomForestClassifier   Python        max_features  sqrt(p)
    ranger                                R             mtry          sqrt(p)
    randomForest regression               R             mtry          p/3
    randomForest classification           R             mtry          sqrt(p)
    H2O regression                        Python & R    mtries        p/3
    H2O classification                    Python & R    mtries        sqrt(p)

    Note that the default of scikit-learn for regression is surprising because it switches off the randomness from feature subsampling, rendering it equal to bagged trees!
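
    As a small illustration (a sketch, not a benchmark), a RandomForestRegressor that considers all p features per split behaves like bagging of deterministic trees; the only remaining randomness is the bootstrap sampling of rows.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=10, random_state=0)

    # max_features=None means "use all p features at each split"
    rf_no_subsampling = RandomForestRegressor(n_estimators=50, max_features=None, random_state=0)
    bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)

    rf_no_subsampling.fit(X, y)
    bagged_trees.fit(X, y)  # conceptually the same model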

    While empirical studies on the impact of feature subsampling and on good default choices focus on classification problems, see the literature review in Probst et al. 2019, we consider a set of regression problems with continuous targets. Note that differing results might be related more to different feature spaces than to the difference between classification and regression.

    The hyperparameters mtry, sample size and node size are the parameters that control the randomness of the RF. […]. Out of these parameters, mtry is most influential both according to the literature and in our own experiments. The best value of mtry depends on the number of variables that are related to the outcome.

    Probst, P. et al. “Hyperparameters and tuning strategies for random forest.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019): n. pag.

    Benchmarks

    We selected the following 13 datasets with regression problems:

    Dataset                          Number of samples    Number of used features p
    Allstate                         188,318              130
    Bike_Sharing_Demand              17,379               12
    Brazilian_houses                 10,692               12
    ames                             1,460                79
    black_friday                     166,821              9
    colleges                         7,063                49
    delays_zurich_transport          27,327               17
    diamonds                         53,940               6
    la_crimes                        1,468,825            25
    medical_charges_nominal          163,065              11
    nyc-taxi-green-dec-2016          581,835              14
    particulate-matter-ukair-2017    394,299              9
    taxi                             581,835              18

    Note that among those, there is no high-dimensional dataset in the sense that p > number of samples.

    On these, we fitted the scikit-learn (version 0.24) RandomForestRegressor (within a short pipeline handling missing values) with default parameters. We used 5-fold cross-validation with 4 different values of max_features: p/3 (blue), sqrt(p) (orange), 0.9 p (green), and p (red). Now, we show the mean squared error with uncertainty bars (± one standard deviation across cross-validation splits); the lower, the better.
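
    A minimal sketch of this setup (dataset loading and missing-value handling omitted; the function name is hypothetical):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_validate

    def benchmark_max_features(X, y, p):
        """5-fold CV of a default RandomForestRegressor for several max_features choices."""
        candidates = {"p/3": max(1, p // 3), "sqrt(p)": max(1, int(np.sqrt(p))),
                      "0.9 p": max(1, int(0.9 * p)), "p": p}
        results = {}
        for label, mf in candidates.items():
            rf = RandomForestRegressor(max_features=mf, n_jobs=-1, random_state=0)
            cv = cross_validate(rf, X, y, cv=5, scoring="neg_mean_squared_error")
            results[label] = {"mse": -cv["test_score"].mean(),
                              "fit_time_s": cv["fit_time"].sum()}
        return results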

    In addition, we also report the fit time of each (5-fold) fit in seconds, again the lower the better.

    Note that sqrt(p) is often smaller than p/3. With this in mind, these graphs show that the fit time is roughly proportional to the number of subsampled features.

    Conclusion

    • The main tuning parameter of random forests is the number of features used for feature subsampling (max_features, mtry). Depending on the dataset, it has a relevant impact on the predictive performance.
    • The default of scikit-learn’s RandomForestRegressor seems odd. It produces bagged trees. This is a bit like using ridge regression with a zero penalty😉. However, it can be justified by our benchmarks above.

    The full code can be found here:

    https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-08-19%20random_forests_max_features.ipynb

  • Strong Random Forests with XGBoost

    Lost in Translation between R and Python 6

    Hello random forest friends

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    The last one was on diamond duplicates and grouped sampling.

    XGBoost’s random forests

    For sure, XGBoost is well known for its excellent gradient boosting trees implementation. Although less obvious, it is no secret that it also offers a way to fit single trees in parallel, emulating random forests, see the great explanations on the official XGBoost page. Still, there seems to be quite some confusion about how to choose certain parameters in order to get good results. The aim of this post is to clarify this.

    Also LightGBM offers a random forest mode. We will investigate it in a later post.

    Why would you want to use XGBoost to fit a random forest?

    1. Interaction & monotonic constraints are available for XGBoost, but typically not for random forest implementations. A separate post will follow to illustrate this in the random forest setting.
    2. XGBoost can natively deal with missing values in an elegant way, unlike many random forest algorithms.
    3. You can stick to the same data preparation pipeline.

    I had additional reasons in mind, e.g. using non-standard loss functions, but this did not turn out to work well. This is possibly due to the fact that XGBoost uses a quadratic approximation to the loss, which is exact only for the mean squared error loss (MSE).

    How to enable the ominous random forest mode?

    Following the official explanations, we would need to set

    • num_parallel_tree to the number of trees in the forest,
    • learning_rate and num_boost_round to 1.

    There are further valuable tips, e.g. to set row and column subsampling to values below one to resemble true random forests.

    Still, most of the regularization parameters of XGBoost tend to favour simple trees, while the idea of a random forest is to aggregate deep, overfitted trees. These regularization parameters have to be changed as well in order to get good results.

    So voilà, my suggestions.

    Suggestions for parameters

    • learning_rate=1 (see above)
    • num_boost_round=1 (see above)
      Has to be set in train(), not in the parameter list. It is called nrounds in R.
    • subsample=0.63
      A random forest draws a bootstrap sample to fit each tree. This means about 63% of the rows will enter the model one or multiple times, leaving about 37% out (see the short check after this list). While XGBoost does not offer such sampling with replacement, we can still introduce the necessary randomness into the dataset used to fit a tree by skipping 37% of the rows per tree.
    • colsample_bynode=floor(sqrt(m))/m
      Column subsampling per split is the main source of randomness in a random forest. A good default is usually to sample the square root of the number of features m or m/3. XGBoost offers different colsample_by* parameters, but it is important to sample per split, i.e. per node, and not per tree. Otherwise, it might happen that important features are missing from a tree altogether, leading to overall bad predictions.
    • num_parallel_tree
      The number of trees. Native implementations of random forests usually use a default value between 100 and 500. The more, the better—but slower.
    • reg_lambda=0
      XGBoost uses a default L2 penalty of 1! This will typically lead to shallow trees, colliding with the idea of a random forest to have deep, wiggly trees. In my experience, leaving this parameter at its default will lead to extremely bad XGBoost random forest fits.
      Set it to zero or a value close to zero.
    • max_depth=20
      Random forests usually train very deep trees, while XGBoost’s default is 6. A value of 20 corresponds to the default in the h2o random forest, so let’s go for their choice.
    • min_child_weight=2
      The default of XGBoost is 1, which tends to be slightly too greedy in random forest mode. For binary classification, you would need to set it to a value close or equal to 0.
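
    As mentioned in the subsample bullet above, the 63% stems from the probability that a given row appears at least once in a bootstrap sample of size n, which is 1 - (1 - 1/n)^n ≈ 1 - e^{-1} ≈ 0.632. A quick numerical check:

    import numpy as np

    n = 100_000
    rng = np.random.default_rng(0)
    bootstrap = rng.integers(0, n, size=n)       # sample row indices with replacement
    frac_used = np.unique(bootstrap).size / n    # fraction of rows drawn at least once
    print(frac_used, 1 - np.exp(-1))             # both roughly 0.632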

    Of course these parameters can be tuned by cross-validation, but one of the reasons to love random forests is their good performance even with default parameters.

    Compared to optimized random forests, XGBoost’s random forest mode is quite slow. At the cost of performance, choose

    • lower max_depth,
    • higher min_child_weight, and/or
    • smaller num_parallel_tree.

    Let’s try it out with regression

    We will use a nice house price dataset, consisting of information on over 20,000 sold houses in King County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

    Some rows and columns from the King County house dataset.

    The following R and Python code fetches the data, prepares the ML setting and fits a native random forest with good defaults. In R, we use the ranger package, in Python the implementation of scikit-learn.

    The response variable is the logarithmic sales price. A healthy set of 13 variables is used as features.

    library(farff)
    library(OpenML)
    library(dplyr)
    library(ranger)
    library(xgboost)
    
    set.seed(83454)
    
    rmse <- function(y, pred) {
      sqrt(mean((y-pred)^2))
    }
    
    # Load King County house prices dataset on OpenML
    # ID 42092, https://www.openml.org/d/42092
    df <- getOMLDataSet(data.id = 42092)$data
    head(df)
    
    # Prepare
    df <- df %>%
      mutate(
        log_price = log(price),
        year = as.numeric(substr(date, 1, 4)),
        building_age = year - yr_built,
        zipcode = as.integer(as.character(zipcode))
    )
    
    # Define response and features
    y <- "log_price"
    x <- c("grade", "year", "building_age", "sqft_living",
           "sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode",
           "lat", "long", "condition", "waterfront")
    m <- length(x)
    
    # random split
    ix <- sample(nrow(df), 0.8 * nrow(df))
    
    # Fit untuned random forest
    system.time( # 3 s
      fit_rf <- ranger(reformulate(x, y), data = df[ix, ])
    )
    y_test <- df[-ix, y]
    
    # Test RMSE: 0.173
    rmse(y_test, predict(fit_rf, df[-ix, ])$pred)
    # object.size(fit_rf) # 180 MB
    # Imports
    import numpy as np
    import pandas as pd
    
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    
    
    def rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))
      
    # Fetch data from OpenML
    df = fetch_openml(data_id=42092, as_frame=True)["frame"]
    print("Shape: ", df.shape)
    df.head()
    
    # Prepare data
    df = df.assign(
        year = lambda x: x.date.str[0:4].astype(int),
        zipcode = lambda x: x.zipcode.astype(int)
    ).assign(
        building_age = lambda x: x.year - x.yr_built,
    )
    
    # Feature list
    xvars = [
        "grade", "year", "building_age", "sqft_living", 
        "sqft_lot", "bedrooms", "bathrooms", "floors", 
        "zipcode", "lat", "long", "condition", "waterfront"
    ]
    
    # Data split
    y_train, y_test, X_train, X_test = train_test_split(
        np.log(df["price"]), df[xvars], 
        train_size=0.8, random_state=766
    )
    
    # Fit scikit-learn rf
    rf = RandomForestRegressor(
        n_estimators=500, 
        max_features="sqrt", 
        max_depth=20,
        n_jobs=-1, 
        random_state=104
    )
    
    rf.fit(X_train, y_train)  # Wall time 3 s
    
    # Test RMSE: 0.176
    print(f"RMSE: {rmse(y_test, rf.predict(X_test)):.03f}")

    Both in R and Python, the test RMSE is between 0.17 and 0.18, i.e. about 2/3 of the test predictions are within 18% of the observed value. Not bad!
    Note: The test performance depends on the split seed, so it does not make sense to directly compare the R and Python performance.

    With XGBoost’s random forest mode

    Now let’s try to reach the same performance with XGBoost’s random forest implementation using the above parameter suggestions.

    # Fit untuned, but good(!) XGBoost random forest
    dtrain <- xgb.DMatrix(data.matrix(df[ix, x]),
                          label = df[ix, y])
    
    params <- list(
      objective = "reg:squarederror",
      learning_rate = 1,
      num_parallel_tree = 500,
      subsample = 0.63,
      colsample_bynode = floor(sqrt(m)) / m,
      reg_lambda = 0,
      max_depth = 20,
      min_child_weight = 2
    )
    
    system.time( # 20 s
      fit_xgb <- xgb.train(
        params,
        data = dtrain,
        nrounds = 1,
        verbose = 0
      )
    )
    
    pred <- predict(fit_xgb, data.matrix(df[-ix, x]))
    
    # Test RMSE: 0.174
    rmse(y_test, pred)
    # xgb.save(fit_xgb, "xgb.model") # 140 MB
    import xgboost as xgb
    
    dtrain = xgb.DMatrix(X_train, label=y_train)
    
    m = len(xvars)
    
    params = dict(
        objective="reg:squarederror",
        learning_rate=1,
        num_parallel_tree=500,
        subsample=0.63,
        colsample_bynode=int(np.sqrt(m))/m,
        reg_lambda=0,
        max_depth=20,
        min_child_weight=2
    )
    
    rf_xgb = xgb.train(  # Wall time 34 s
        params, 
        dtrain, 
        num_boost_round=1
    )
    preds = rf_xgb.predict(xgb.DMatrix(X_test))
    
    # 0.177
    print(f"RMSE: {rmse(y_test, preds):.03f}")

    We see:

    • The performance of the XGBoost random forest is essentially as good as the native random forest implementations. And all this without any parameter tuning!
    • XGBoost is much slower than the optimized random forest implementations. If this is a problem, reduce the tree depth, for example. In this example, Python takes almost twice as much time as R. No idea why!
      The timings were made on a usual 4 core i7 processor.
    • Disk space required to store the model objects is comparable between XGBoost and native random forest implementations.

    What if you would run the same model with XGBoost defaults?

    • With default reg_lambda=1:
      The performance would end up at a catastrophic RMSE of 0.35!
    • With default max_depth=6:
      The RMSE would be much worse (0.23) as well.
    • With colsample_bytree instead of colsample_bynode:
      The RMSE would deteriorate to 0.27.

    Thus: It is essential to set some values to a good “random forest” default!

    Does it always work that well?

    Definitely not in classification settings. However, in regression settings with the MSE loss, XGBoost’s random forest mode is often as accurate as native implementations.

    • Classification models
      In my experience, the XGBoost random forest mode does not work as well as a native random forest for classification, possibly due to the fact that it uses only an approximation to the loss function.
    • Other regression examples
      Using the setting of our last “R <–> Python” post (diamond duplicates and grouped sampling) and the same parameters as above, we get the following test RMSEs: With ranger (R code in link below): 0.1043, with XGBoost: 0.1042. Sweet!

    Wrap up

    • With the right default parameters, XGBoost’s random forest mode reaches similar performance on regression problems as native random forest packages. Without any tuning!
    • For losses other than MSE, it does not work so well.

    The Python notebook and R code can be found at:

  • Least Squares Minimal Norm Solution

    Least squares is a cornerstone of linear algebra, optimization and therefore also of statistical and machine learning models. Given a matrix A \in \mathbb{R}^{n \times p} and a vector b \in \mathbb{R}^n, we search for

    x^\star = \argmin_{x \in \mathbb{R}^p} ||Ax - b||^2

    Translation for regression problems: Search for coefficients β→x given the design or features matrix X→A and target y→b.

    In the case of a singular matrix A or an underdetermined setting n<p, the above definition is not precise and permits many solutions x. In those cases, a more precise definition is the minimum norm solution of least squares:

    x^\star = \argmin_{x \in \mathbb{R}^p} ||x||^2
    \quad \text{subject to} \quad
    ||Ax - b||^2 = \min_{z \in \mathbb{R}^p} ||Az - b||^2

    In words: We solve the least squares problem and choose the solution with minimal Euclidean norm.
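
    In NumPy, both the SVD-based np.linalg.lstsq and the Moore-Penrose pseudoinverse return exactly this minimum norm solution; a small sketch with a rank-deficient matrix:

    import numpy as np

    A = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # rank 1, so many least squares solutions
    b = np.array([1.0, 2.0, 2.0])

    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimum norm least squares solution
    x_pinv = np.linalg.pinv(A) @ b                   # same result via the pseudoinverse
    print(x_lstsq, x_pinv)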

    We will show step by step what this means on a series of overdetermined systems. In linear regression with more observations than features, n > p, one says the system is overdetermined, leading to ||Ax^\star - b||^2 > 0 in most cases.

    Regular System

    Let’s start with a well-behaved example. To have good control over the matrix, we construct it by its singular value decomposition (SVD) A=USV’ with orthogonal matrices U and V and diagonal matrix S. Recall that S has only non-negative entries and — for regular matrices — even strictly positive values. Therefore, we start by setting \operatorname{diag}(S) = (1, 2, 3, 4, 5).

    from pprint import pprint
    import matplotlib.pyplot as plt
    import numpy as np
    from numpy.linalg import norm
    from scipy.stats import ortho_group
    from scipy import linalg
    from scipy.sparse import linalg as spla
    
    
    def generate_U_S_Vt(n=10, p=5, random_state=532):
        """Generate SVD to construct a regular matrix A.
        
        A has n rows, p columns.
        
        Returns
        -------
        U: orthogonal matrix
        S: diagonal matrix
        Vt: orthogonal matrix
        """
        r = min(n, p)
        S = np.diag(1.0 * np.arange(1, 1 + r))
        if n > p:
            # add rows with value 0
            S = np.concatenate((S, np.zeros((n - p, p))), axis=0)
        elif p > n:
            # add columns with value 0
            S = np.concatenate((S, np.zeros((n, p - n))), axis=1)
        U = ortho_group.rvs(n, random_state=random_state)
        Vt = ortho_group.rvs(p, random_state=random_state + 1)
        return U, S, Vt
    
    
    def solve_least_squares(A, b):
        """Solve least squares with several methods.
        
        Returns
        ------
        x : dictionary with solver and solution
        """
        x = {}
        x["gelsd"] = linalg.lstsq(A, b, lapack_driver="gelsd")[0]
        x["gelsy"] = linalg.lstsq(A, b, lapack_driver="gelsy")[0]
        x["lsqr"] = spla.lsqr(A, b)[0]
        x["lsmr"] = spla.lsmr(A, b)[0]
        x["normal_eq"] = linalg.solve(A.T @ A, A.T @ b, assume_a="sym")
    
        return x
    
    
    def print_dict(d):
        np.set_string_function(np.array2string)
        pprint(d)
        np.set_string_function(None)
    
    
    np.set_printoptions(precision=5)
    
    
    n = 10
    p = 5
    
    U, S, Vt = generate_U_S_Vt(n=n, p=p)
    A = U @ S @ Vt
    
    x_true = np.round(6 * Vt.T[:p, 0])  # interesting choice
    rng = np.random.default_rng(157)
    noise = rng.standard_normal(n)
    b = A @ x_true + noise
    
    S_inv = np.copy(S.T)
    S_inv[S_inv>0] = 1/S_inv[S_inv>0]
    
    x_exact = Vt.T @ S_inv @ U.T @ b
    
    print(f"x_exact = {x_exact}")
    print_dict(solve_least_squares(A, b))
    x_exact = [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ]
    {'gelsd': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
     'gelsy': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
     'lsmr': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
     'lsqr': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
     'normal_eq': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ]}

    We see that all least squares solvers do well on regular systems, be it the LAPACK routines for direct least squares, GELSD and GELSY, the iterative solvers LSMR and LSQR or solving the normal equations by the LAPACK routine SYSV for symmetric matrices.

    Singular System

    We convert our regular matrix in a singular one by setting its first diagonal element to zero.

    S[0, 0] = 0
    A = U @ S @ Vt
    
    S_inv = np.copy(S.T)
    S_inv[S_inv>0] = 1/S_inv[S_inv>0]
    
    # Minimum Norm Solution
    x_exact = Vt.T @ S_inv @ U.T @ b
    
    print(f"x_exact = {x_exact}")
    x_solution = solve_least_squares(A, b)
    print_dict(x_solution)
    x_exact = [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ]
    {'gelsd': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'gelsy': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'lsmr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'lsqr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'normal_eq': [-0.08393 -0.60784  0.17531 -0.57127 -0.50437]}

    As promised by their descriptions, the first four solvers find the minimum norm solution. If you run the code yourself, you will get a LinAlgWarning from the normal equation solver. The output confirms it: solving normal equations for singular systems is a bad idea. However, a closer look reveals the following:

    print(f"norm of x:\n"
          f"x_exact: {norm(x_exact)}\n"
          f"normal_eq: {norm(x_solution['normal_eq'])}\n"
         )
    print(f"norm of Ax-b:\n"
          f"x_exact: {norm(A @ x_exact - b)}\n"
          f"normal_eq: {norm(A @ x_solution['normal_eq'] - b)}"
         )
    norm of x:
    x_exact:   0.5092520023062155
    normal_eq: 0.993975690303498
    
    norm of Ax-b:
    x_exact:   6.9594032092014935
    normal_eq: 6.9594032092014935 

    Although the solution of the normal equations seems far off, it reaches the same minimum of the least squares problem and is thus a viable solution. Still, a solution vector whose norm is larger than that of the minimum norm solution is an undesirable property, especially in iterative algorithms.

    Minimum Norm

    As the blog post title advertises the minimum norm solution and a figure is still missing, we will visualize the many solutions of the least squares problem. But how can we systematically construct different solutions? Luckily, we already have the SVD of A. The columns of V that multiply the zero singular values of S, let’s call them V1, span the null space of A, i.e. A V1 t = 0 for any vector t. In our example, there is only one such zero diagonal element, so V1 is just one column and t reduces to a number.

    t = np.linspace(-3, 3, 100)  # free parameter
    # column vectors of x_lsq are least squares solutions
    x_lsq = (x_exact + Vt.T[:, 0] * t.reshape(-1, 1)).T
    x_norm = np.linalg.norm(x_lsq, axis=0)
    lsq_norm = np.linalg.norm(A @ x_lsq - b.reshape(-1, 1), axis=0)
    
    plt.plot(t, 2 * x_norm, label="2 * norm of x")
    plt.plot(t, lsq_norm, label="norm of Ax-b")
    plt.legend()
    plt.xlabel("t")
    plt.ylabel("Norm")
    plt.title("Euclidean norm of solution and residual")
    Norm of solution vector and residual of least squares

    In order to have both lines in one figure, we scaled the norm of the solution vector by a factor of two. We see that all vectors achieve the same objective, i.e. Euclidean norm of the residuals Ax – b, while t=0 has minimum norm among those solution vectors.

    Tiny Perturbation of b

    Let us also have a look what happens if we add a tiny perturbation to the vector b.

    eps = 1e-10
    print(f"x_exact = {x_exact}")
    print_dict(solve_least_squares(A, b + eps))
    x_exact = [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ]
    {'gelsd': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'gelsy': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'lsmr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'lsqr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'normal_eq': [-0.12487 -0.41176  0.23093 -0.48548 -0.35104]}

    We see that the first four solvers are stable, while solving the normal equations shows large deviations compared to the unperturbed system above. For example, compare the first vector element of -0.21 vs -0.12, even for a perturbation as tiny as 1.0e-10.

    Ill-Conditioned System

    The last section showed an example of an ill-conditioned system. By ill-conditioned we mean a huge difference between the largest and smallest singular value of A, the ratio of which is called the condition number. We can achieve this by setting the first diagonal element of S to a tiny positive number instead of exactly zero.
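
    As a self-contained aside (not part of the original notebook), the condition number is the ratio of the largest to the smallest singular value and can be computed with np.linalg.cond:

    import numpy as np

    B = np.diag([1e-10, 2.0, 3.0, 4.0, 5.0])    # tiny but positive first singular value
    sigma = np.linalg.svd(B, compute_uv=False)
    print(sigma.max() / sigma.min())  # 5e10
    print(np.linalg.cond(B))          # same number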

    S[0, 0] = 1e-10
    A = U @ S @ Vt
    
    S_inv = np.copy(S.T)
    S_inv[S_inv>0] = 1/S_inv[S_inv>0]
    
    # Minimum Norm Solution
    x_exact = Vt.T @ S_inv @ U.T @ b
    
    print(f"x_exact = {x_exact}")
    print_dict(solve_least_squares(A, b))
    x_exact = [ 9.93195e+09 -4.75650e+10 -1.34911e+10 -2.08104e+10 -3.71960e+10]
    {'gelsd': [ 9.93194e+09 -4.75650e+10 -1.34910e+10 -2.08104e+10 -3.71960e+10],
     'gelsy': [ 9.93196e+09 -4.75650e+10 -1.34911e+10 -2.08104e+10 -3.71961e+10],
     'lsmr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'lsqr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
     'normal_eq': [  48559.67679 -232557.57746  -65960.92822 -101747.66128 -181861.06429]}
    print(f"norm of x:\n"
          f"x_exact:   {norm(x_exact)}\n"
          f"lsqr:      {norm(x_solution['lsqr'])}\n"
          f"normal_eq: {norm(x_solution['normal_eq'])}\n"
         )
    print(f"norm of Ax-b:\n"
          f"x_exact:   {norm(A @ x_exact - b)}\n"
          f"lsqr:      {norm(A @ x_solution['lsqr'] - b)}\n"
          f"normal_eq: {norm(A @ x_solution['normal_eq'] - b)}"
         )
    norm of x:
    x_exact:   66028022639.34349
    lsqr:      0.5092520023062157
    normal_eq: 0.993975690303498
    
    norm of Ax-b:
    x_exact:   2.1991587442017146
    lsqr:      6.959403209201494
    normal_eq: 6.959403209120507

    Several points are interesting to observe:

    • The exact solution has a very large norm. A tiny change in the matrix A compared to the singular system changed the solution dramatically!
    • Normal equation and iterative solvers LSQR and LSMR fail badly and don’t find the solution with minimal residual. All three seem to find solutions with the same norm as the singular system from above.
    • The iterative solvers indeed find exactly the same solutions as for the singular system.

    Note that LSQR and LSMR can be fixed by requiring higher accuracy via the parameters atol and btol.
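
    A possible invocation is sketched below; the exact tolerances needed may vary, and the condition number limit conlim may have to be raised as well.

    # Tighter tolerances for the iterative solvers on the ill-conditioned system
    x_lsqr_tight = spla.lsqr(A, b, atol=1e-16, btol=1e-16, conlim=1e12)[0]
    x_lsmr_tight = spla.lsmr(A, b, atol=1e-16, btol=1e-16, conlim=1e12)[0]
    print(norm(A @ x_lsqr_tight - b), norm(A @ x_lsmr_tight - b))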

    Wrap-Up

    Solving least squares problems is fundamental for many applications. While regular systems are more or less easy to solve, singular as well as ill-conditioned systems have intricacies: multiple solutions and sensitivity to small perturbations. At least the minimum norm solution always gives a well-defined, unique answer, and direct solvers find it reliably.

    The Python notebook can be found in the usual github repository: 2021-05-02 least_squares_minimum_norm_solution.ipynb

    References

    Very good, concise reference:

    Good source with respect to Ordinary Least Squares (OLS)

    • Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani, Surprises in High-Dimensional Ridgeless Least Squares Interpolation
      https://arxiv.org/abs/1903.08560
  • A Curious Fact on the Diamonds Dataset

    Lost in Translation between R and Python 5

    Hello regression world

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    The last two included a deep dive into historic mortality rates as well as studying a beautiful regression formula.

    Diamonds data

    One of the most used datasets to teach regression is the diamonds dataset. It describes 54’000 diamonds by

    • their price,
    • the four “C” variables (carat, color, cut, clarity),
    • as well as by perspective measurements table, depth, x, y, and z.

    The dataset is readily available, e.g. in

    • R package ggplot2,
    • Python package plotnine,
    • and the fantastic OpenML database.

    Question: How many times did you use diamonds data to compare regression techniques like random forests and gradient boosting?

    Answer: Probably a lot!

    The curious fact

    We recently stumbled over a curious fact regarding that dataset. 26% of the diamonds are duplicates regarding price and the four “C” variables. Within duplicates, the perspective variables table, depth, x, y, and z would differ as if a diamond had been measured from different angles.

    In order to illustrate the issue, let us add the two auxiliary variables

    • id: group id of diamonds with identical price and four “C”, and
    • id_size: number of rows in that id

    to the dataset and consider a couple of examples. You can view both R and Python code – but the specific output will differ because of language-specific naming of group ids.

    library(tidyverse)
    
    # We add group id and its size
    dia <- diamonds %>% 
      group_by(carat, cut, clarity, color, price) %>% 
      mutate(id = cur_group_id(),
             id_size = n()) %>% 
      ungroup() %>% 
      arrange(id)
    
    # Proportion of duplicates
    1 - max(dia$id) / nrow(dia)  # 0.26
    
    # Some examples
    dia %>% 
      filter(id_size > 1) %>%
      head(10)
    
    # Most frequent
    dia %>% 
      arrange(-id_size) %>% 
      head(.$id_size[1])
    
    # A random large diamond appearing multiple times
    dia %>% 
      filter(id_size > 3) %>% 
      arrange(-carat) %>% 
      head(.$id_size[1])
    import numpy as np
    import pandas as pd
    from plotnine.data import diamonds
    
    # Variable groups
    cat_vars = ["cut", "color", "clarity"]
    xvars = cat_vars + ["carat"]
    all_vars = xvars + ["price"]
    
    print("Shape: ", diamonds.shape)
    
    # Add id and id_size
    df = diamonds.copy()
    df["id"] = df.groupby(all_vars).ngroup()
    df["id_size"] = df.groupby(all_vars)["price"].transform(len)
    df.sort_values("id", inplace=True)
    
    print(f'Proportion of dupes: {1 - df["id"].max() / df.shape[0]:.0%}')
    
    print("Random examples")
    print(df[df.id_size > 1].head(10))
    
    print("Most frequent")
    print(df.sort_values(["id_size", "id"]).tail(13))
    
    print("A random large diamond appearing multiple times")
    df[df.id_size > 3].sort_values("carat").tail(6)
    Table 1: Some duplicates in the four “C” variables and price (Python output).
    Table 2: One of the two(!) diamonds appearing a whopping 43 times (Python output).
    Table 3: A large, 2.01 carat diamond appears six times (Python output).

    Of course, having the same id does not necessarily mean that the rows really describe the same diamond. price and the four “C”s could coincide purely by chance. Nevertheless: there are exactly six diamonds of 2.01 carat and a price of 16,778 USD in the dataset. And they all have the same color, cut and clarity. This cannot be coincidence!

    Why would this be problematic?

    In the presence of grouped data, standard validation techniques tend to reward overfitting.

    This becomes immediately clear having in mind the 2.01 carat diamond from Table 3. Standard cross-validation (CV) uses random or stratified sampling and would scatter the six rows of that diamond across multiple CV folds. Highly flexible algorithms like random forests or nearest-neighbour regression could exploit this by memorizing the price of this diamond in-fold and do very well out-of-fold. As a consequence, the stated CV performance would be too good and the choice of the modeling technique and its hyperparameters suboptimal.

    With grouped data, a good approach is often to randomly sample the whole group instead of single rows. Using such grouped splitting ensures that all rows in the same group would end up in the same fold, removing the above described tendency to overfit.

    Note 1. In our case of duplicates, a simple alternative to grouped splitting would be to remove the duplicates altogether. However, the occurrence of duplicates is just one of many situations where grouped or clustered samples appear in reality.

    Note 2. The same considerations not only apply to cross-validation but also to simple train/validation/test splits.
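
    As a tiny illustration of grouped splitting (independent of the diamonds code below), scikit-learn’s GroupKFold guarantees that all rows sharing a group id land in the same fold:

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X_toy = np.arange(8).reshape(-1, 1)
    groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # e.g. our diamond id

    for fold, (train_idx, test_idx) in enumerate(GroupKFold(n_splits=4).split(X_toy, groups=groups)):
        print(fold, "test groups:", np.unique(groups[test_idx]))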

    Evaluation

    What does this mean regarding our diamonds dataset? Using five-fold CV, we will estimate the true root-mean-squared error (RMSE) of a random forest predicting log price by the four “C”. We run this experiment twice: one time, we create the folds by random splitting and the other time by grouped splitting. How heavily will the results from random splitting be biased?

    library(ranger)
    library(splitTools) # one of our packages on CRAN
    
    set.seed(8325)
    
    # We model log(price)
    dia <- dia %>% 
      mutate(y = log(price))
    
    # Helper function: calculate rmse
    rmse <- function(obs, pred) {
      sqrt(mean((obs - pred)^2))
    }
    
    # Helper function: fit model on one fold and evaluate
    fit_on_fold <- function(fold, data) {
      fit <- ranger(y ~ carat + cut + color + clarity, data = data[fold, ])
      rmse(data$y[-fold], predict(fit, data[-fold, ])$pred)
    }
      
    # 5-fold CV for different split types
    cross_validate <- function(type, data) {
      folds <- create_folds(data$id, k = 5, type = type)
      mean(sapply(folds, fit_on_fold, data = dia))
    }
    
    # Apply and plot
    (results <- sapply(c("basic", "grouped"), cross_validate, data = dia))
    barplot(results, col = "orange", ylab = "RMSE by 5-fold CV")
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score, GroupKFold, KFold
    from sklearn.metrics import make_scorer, mean_squared_error
    import seaborn as sns
    
    rmse = make_scorer(mean_squared_error, squared=False)
    
    # Prepare y, X
    df = df.sample(frac=1, random_state=6345)
    y = np.log(df.price)
    X = df[xvars].copy()
    
    # Correctly ordered integer encoding
    X[cat_vars] = X[cat_vars].apply(lambda x: x.cat.codes)
    
    # Cross-validation
    results = {}
    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", 
                               min_samples_leaf=5, n_jobs=-1)
    for nm, strategy in zip(("basic", "grouped"), (KFold, GroupKFold)):
        results[nm] = cross_val_score(
            rf, X, y, cv=strategy(), scoring=rmse, groups=df.id
        ).mean()
    print(results)
    
    res = pd.DataFrame(results.items())
    sns.barplot(x=0, y=1, data=res);
    Figure 1: Test root-mean-squared error using different splitting methods (R output).

    The RMSE (11%) of grouped CV is 8%-10% higher than that of random CV (10%). The standard technique therefore seems to be considerably biased.

    Final remarks

    • The diamonds dataset is not only a brilliant example to demonstrate regression techniques but also a great way to show the importance of a clean validation strategy (in this case: grouped splitting).
    • Blind or automatic ML would most probably fail to detect non-trivial data structures like in this case and therefore use inappropriate validation strategies. The resulting model would be somewhere between suboptimal and dangerous. Just that nobody would know it!
    • The first step towards a good model validation strategy is data understanding. This is a mix of knowing the data source, how the data was generated, the meaning of columns and rows, descriptive statistics etc.

    The Python notebook and R code can be found at:

  • A Beautiful Regression Formula

    Lost in Translation between R and Python 4

    Hello statistics aficionados

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    The last one was a deep dive into historic mortality rates.

    No Covid-19, no public data for a change: This post focusses on a real beauty, namely a decomposition of the R-squared in a linear regression model

    E(y) = \alpha + \sum_{j = 1}^p x_j \beta_j

    fitted by least-squares. If the response y and all p covariables are standardized to variance 1 beforehand, then the R-squared can be obtained as the cross-product of the fitted coefficients and the usual correlations between each covariable and the response:

    R^2 = \sum_{j = 1}^p \text{cor}(y, x_j)\hat\beta_j.

    Two elegant derivations can be found in this answer to the same question, written by the number 1 contributor to crossvalidated: whuber. Look up a couple of his posts – and statistics will suddenly feel super easy and clear.

    Direct consequences of the formula are:

    1. If a covariable is uncorrelated with the response, it cannot contribute to the R-squared, i.e. neither improve nor worsen. This is not obvious.
    2. A correlated covariable only improves R-squared if its coefficient is non-zero. Put differently: if the effect of a covariable is already fully covered by the other covariables, it does not improve the R-squared. This is somewhat obvious.

    Note that all formulas refer to in-sample calculations.

    Since we do not want to bore you with math, we simply demonstrate the result with short R and Python codes based on the famous iris dataset.

    y <- "Sepal.Width"
    x <- c("Sepal.Length", "Petal.Length", "Petal.Width")
    
    # Scaled version of iris
    iris2 <- data.frame(scale(iris[c(y, x)]))
    
    # Fit model 
    fit <- lm(reformulate(x, y), data = iris2)
    summary(fit) # multiple R-squared: 0.524
    (betas <- coef(fit)[x])
    # Sepal.Length Petal.Length  Petal.Width 
    #    1.1533143   -2.3734841    0.9758767 
    
    # Correlations (scaling does not matter here)
    (cors <- cor(iris[, y], iris[x]))
    # Sepal.Length Petal.Length Petal.Width
    #   -0.1175698   -0.4284401  -0.3661259
     
    # The R-squared?
    sum(betas * cors) # 0.524
    # Import packages
    import numpy as np
    import pandas as pd
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    
    # Load data
    iris = datasets.load_iris(as_frame=True).data
    print("The data:", iris.head(3), sep = "\n")
    
    # Specify response
    yvar = "sepal width (cm)"
    
    # Correlations of everyone with response
    cors = iris.corrwith(iris[yvar]).drop(yvar)
    print("\nCorrelations:", cors, sep = "\n")
    
    # Prepare scaled response and covariables
    X = StandardScaler().fit_transform(iris.drop(yvar, axis=1))
    y = StandardScaler().fit_transform(iris[[yvar]])
    
    # Fit linear regression
    OLS = LinearRegression().fit(X, y)
    betas = OLS.coef_[0]
    print("\nScaled coefs:", betas, sep = "\n")
    
    # R-squared via scikit-learn: 0.524
    print(f"\nUsual R-squared:\t {OLS.score(X, y): .3f}")
    
    # R-squared via decomposition: 0.524
    rsquared = betas @ cors.values
    print(f"Applying the formula:\t {rsquared: .3f}")

    Indeed: the cross-product of coefficients and correlations equals the R-squared of 52%.

    The Python notebook and R code can be found at:

  • Swiss Mortality

    Lost in Translation between R and Python 3

    Hello again!

    This is the third article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.

    Post 2: https://lorentzen.ch/index.php/2021/01/26/covid-19-deaths-per-mio/

    Before diving into the data analysis, I would like to share with you that writing this text brought back some memories from my first year as an actuary. Indeed, I started in life insurance and soon switched to non-life. While investigating mortality may be seen as a dry matter, it reveals some interesting statistical problems which have to be properly addressed before drawing conclusions.

    Similar to Post 2, we use publicly available data, this time from the Human Mortality Database and from the Federal Statistical Office of Switzerland, in order to calculate crude death rates, i.e. the number of deaths per person alive per year. This time, we look at longer time periods of 20 and of over 100 years and take all causes of death into account. We caution against any misinterpretation: we show only crude death rates (CDR), which do not account for demographic shifts like a changing age distribution or for effects of measures taken against COVID-19.
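
    As a minimal numeric sketch of this definition (made-up numbers, not taken from the data sources used below):

    # Hypothetical toy numbers, only to illustrate the definition of a crude death rate (CDR)
    deaths = 70_000            # deaths observed during one calendar year
    person_years = 8_500_000   # average population exposed during that year
    cdr = deaths / person_years
    print(f"CDR = {cdr:.4f} deaths per person per year")  # ~0.0082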

    Let us start by producing the first figure. We fetch the data from the internet, pick some countries of interest, focus on both sexes combined, aggregate and plot. The Python version uses the visualization library altair, which can generate interactive charts. Unfortunately, we can only show a static version in the blog post. If someone knows how to render Vega-Lite charts in WordPress, I’d be very interested in a secure solution.

    library(tidyverse)
    library(lubridate)
    
    # Fetch data
    df_original <- read_csv(
      "https://www.mortality.org/Public/STMF/Outputs/stmf.csv",
      skip = 2
    )
    
    # 1. Select countries of interest and only "both" sexes
    # Note: Germany "DEUTNP" and "USA" have short time series
    # 2. Change to ISO-3166-1 ALPHA-3 codes
    # 3. Create population pro rata temporis (exposure) to ease aggregation
    df_mortality <- df_original %>% 
      filter(CountryCode %in% c("CAN", "CHE", "FRATNP", "GBRTENW", "SWE"),
             Sex == "b") %>% 
      mutate(CountryCode = recode(CountryCode, "FRATNP" = "FRA",
                                  "GBRTENW" = "England & Wales"),
             population = DTotal / RTotal,
             Year = ymd(Year, truncated = 2))
    
    # Data aggregation per year and country
    df <- df_mortality %>%
      group_by(Year, CountryCode) %>% 
      summarise(CDR = sum(DTotal) / sum(population), 
                .groups = "drop")
    
    ggplot(df, aes(x = Year, y = CDR, color = CountryCode)) +
      geom_line(size = 1) +
      ylab("Crude Death Rate per Year") +
      theme(legend.position = c(0.2, 0.8))
    import pandas as pd
    import altair as alt
    
    
    # Fetch data
    df_mortality = pd.read_csv(
        "https://www.mortality.org/Public/STMF/Outputs/stmf.csv",
        skiprows=2,
    )
    
    # Select countries of interest and only "both" sexes
    # Note: Germany "DEUTNP" and "USA" have short time series
    df_mortality = df_mortality[
        df_mortality["CountryCode"].isin(["CAN", "CHE", "FRATNP", "GBRTENW", "SWE"])
        & (df_mortality["Sex"] == "b")
    ].copy()
    
    # Change to ISO-3166-1 ALPHA-3 codes
    df_mortality["CountryCode"].replace(
        {"FRATNP": "FRA", "GBRTENW": "England & Wales"}, inplace=True
    )
    
    # Create population pro rata temporis (exposure) to ease aggregation
    df_mortality = df_mortality.assign(
        population=lambda df: df["DTotal"] / df["RTotal"]
    )
    
    # Data aggregation per year and country
    df_mortality = (
        df_mortality.groupby(["Year", "CountryCode"])[["population", "DTotal"]]
        .sum()
        .assign(CDR=lambda x: x["DTotal"] / x["population"])
        # .filter(items=["CDR"])  # make df even smaller
        .reset_index()
        .assign(Year=lambda x: pd.to_datetime(x["Year"], format="%Y"))
    )
    
    chart = (
        alt.Chart(df_mortality)
        .mark_line()
        .encode(
            x="Year:T",
            y=alt.Y("CDR:Q", scale=alt.Scale(zero=False)),
            color="CountryCode:N",
        )
        .properties(title="Crude Death Rate per Year")
        .interactive()
    )
    # chart.save("crude_death_rate.html")
    chart
    Crude death rate (CDR) for Canada (CAN), Switzerland (CHE), England & Wales, France (FRA) and Sweden (SWE). Data as of 07.02.2021.

    Note that the y-axis does not start at zero. Still, the values range from about 0.007 to 0.014, i.e. they differ by a factor of two. While 2020 shows a clear rise in mortality, more dramatic for some countries than for others, the values for 2021 are still preliminary: the data is incomplete, so the yearly CDR is based on a short observation period and hence on a smaller population pro rata temporis. On top of that, there may be seasonal effects. To sum up, the uncertainty for 2021 is larger than for previous whole years.
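
    To see how a short observation window shrinks the exposure, here is a hypothetical sketch (made-up numbers, not from the data above):

    # Hypothetical illustration: an annualized CDR based on only a few weeks of data
    population = 8_500_000
    weeks_observed = 6                            # early in the year
    exposure = population * weeks_observed / 52   # person-years observed so far ("pro rata temporis")
    deaths_observed = 8_000
    cdr_annualized = deaths_observed / exposure
    print(f"Exposure: {exposure:,.0f} person-years")   # ~981,000 instead of ~8.5 Mio
    print(f"Annualized CDR: {cdr_annualized:.4f}")     # rests on far fewer person-years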

    For Switzerland, it is also possible to collect data for over 100 years. As the code for data fetching and preparation becomes a bit lengthy, we won’t bother you with it here; you can find it in the notebooks linked below. Note that we added the value of 2020 from the figure above. This seems legitimate, as the CDRs from the two data sources agree to within less than 1% relative error.

    Crude death rate (CDR) for Switzerland from 1901 to 2020.

    Again, note that the left y-axis does not start at zero, but the right y-axis does. One can see several interesting facts:

    • The Swiss population has been growing for the last 120 years, with the only exception being around 1976.
    • The Spanish flu between 1918 and 1920 caused by far the largest peak in mortality in the last 120 years.
    • The Second World War is not visible in the Swiss mortality figures.
    • Overall, the mortality is decreasing.

    The Python notebook and R code can be found at:

  • Covid-19 Deaths per Mio

    Lost in Translation between R and Python 2

    Hello again!

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    Post 1: https://lorentzen.ch/index.php/2021/01/07/illustrating-the-central-limit-theorem/

    In Post 2, we use publicly available data from the European Centre for Disease Prevention and Control to calculate Covid-19 deaths per Mio persons over time and across countries. We will use slim Python and R code to

    • fetch the data directly from the internet,
    • prepare and restructure it for plotting and
    • plot a curve per selected country.

    Note that different countries use different definitions of whom to count as a Covid-19 death, and these definitions might also have changed over time. So be careful with comparisons!

    library(tidyverse)
    
    # Source and countries
    link <- "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"
    countries <- c("Switzerland", "United_States_of_America", 
                   "Germany", "Sweden")
    
    # Import
    df0 <- read_csv(link)
    
    # Data prep
    df <- df0 %>%
      mutate(Date = lubridate::dmy(dateRep),
             Deaths = deaths_weekly / (popData2019 / 1e6))  %>%
      rename(Country = countriesAndTerritories) %>%
      filter(Date >= "2020-03-01",
             Country %in% countries)
    
    # Plot
    ggplot(df, aes(x = Date, y = Deaths, color = Country)) +
      geom_line(size = 1) +
      ylab("Weekly deaths per Mio") +
      theme(legend.position = c(0.2, 0.85))
    import pandas as pd
    
    # Source and countries
    url = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"
    countries = ["Switzerland", "United_States_of_America", 
                 "Germany", "Sweden"]
    
    # Fetch data
    df0 = pd.read_csv(url)
    # df0.head()
    
    # Prepare data
    df = df0.assign(
        Date=lambda x: pd.to_datetime(x["dateRep"], format="%d/%m/%Y"),
        Deaths=lambda x: x["deaths_weekly"] / x["popData2019"] * 1e6,
    ).rename(columns={"countriesAndTerritories": "Country"})
    df = df.loc[
        (df["Country"].isin(countries)) & (df["Date"] >= "2020-03-01"),
        ["Country", "Date", "Deaths"],
    ]
    df = df.pivot(index="Date", columns="Country")
    df = df.droplevel(0, axis=1)
    
    # Plot
    ax = df.plot()
    ax.set_ylabel('Weekly Covid-19 deaths per Mio');

    Weekly Covid-19 deaths per Mio inhabitants as per January 26, 2021 (Python output).

    The code can be found on https://github.com/mayer79/covid with some other analyses regarding viruses.

  • Illustrating The Central Limit Theorem

    Lost in Translation between R and Python 1

    This is the first article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    Let’s start with a little bit of statistics (it won’t be the last time, friends): illustrating the Central Limit Theorem (CLT).

    Take a sample of a random variable X with finite variance. The CLT says: No matter how “unnormally” distributed X is, its sample mean will be approximately normally distributed, at least if the sample size is not too small. This classic result is the basis to construct simple confidence intervals and hypothesis tests for the (true) mean of X, check out Wikipedia for a lot of additional information.
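
    Stated a bit more formally (a standard textbook formulation, added here for reference): if \(X_1, \dots, X_n\) are independent copies of X with mean \(\mu\) and finite variance \(\sigma^2\), then

    \[ \sqrt{n}\,\left(\bar{X}_n - \mu\right) \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2), \qquad \text{i.e.} \qquad \bar{X}_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n. \]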

    The code below illustrates this famous statistical result by simulation, using a very asymmetrically distributed X, namely X = 1 with probability 0.2 and X = 0 otherwise. X could represent the result of asking a randomly picked person whether they smoke. In such a poll, the mean of the collected sample would be a statistical estimate of the proportion of people who smoke.

    Curiously, by a tiny modification, the same code will also illustrate another key result in statistics – the Law of Large Numbers: For growing sample size, the distribution of the sample mean of X contracts to the expectation E(X).

    # Fix seed, set constants
    set.seed(2006)
    sample_sizes <- c(1, 10, 30, 1000)
    nsims <- 10000
    
    # Helper function: Mean of one sample of X
    one_mean <- function(n, p = c(0.8, 0.2)) {
      mean(sample(0:1, n, replace = TRUE, prob = p))
    }
    # one_mean(10)
    
    # Simulate and plot
    par(mfrow = c(2, 2), mai = rep(0.4, 4))
    
    for (n in sample_sizes) {
      means <- replicate(nsims, one_mean(n))
      hist(means, breaks = "FD", 
           # xlim = 0:1, # uncomment for LLN
           main = sprintf("n=%i", n))
    }
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    # Fix seed, set constants
    np.random.seed(100)
    sample_sizes = [1, 10, 30, 1000]
    nsims = 10_000
    
    # Helper function: Mean of one sample
    def one_mean(n, p=0.2):
        return np.random.binomial(1, p, n).mean()
    
    # Simulate and plot
    fig, axes = plt.subplots(2, 2, figsize=(8, 8))
    
    for i, n in enumerate(sample_sizes):
        means = [one_mean(n) for _ in range(nsims)]
        ax = axes[i // 2, i % 2]
        ax.hist(means, 50)
        ax.title.set_text(f'$n = {n}$')
        ax.set_xlabel('mean')
        # ax.set_xlim(0, 1)  # uncomment for LLN
    fig.tight_layout()

    Result: The Central Limit Theorem

    The larger the samples, the more closely the histogram of the simulated means resembles a symmetric bell-shaped curve (R output for illustration).

    Result: The Law of Large Numbers

    Fixing the x-scale illustrates, for free(!), the Law of Large Numbers: the distribution of the sample mean contracts more and more towards the expectation 0.2 (R output for illustration).
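
    As an optional sanity check (a sketch, not part of the original notebooks), one can compare the simulated standard deviation of the sample mean with its theoretical value sqrt(p(1 - p)/n):

    import numpy as np

    np.random.seed(100)
    p, nsims = 0.2, 10_000

    for n in [1, 10, 30, 1000]:
        # nsims sample means, each based on n Bernoulli(p) draws
        means = np.random.binomial(1, p, size=(nsims, n)).mean(axis=1)
        theory = np.sqrt(p * (1 - p) / n)  # standard deviation of the sample mean
        print(f"n = {n:>4}: simulated sd {means.std():.4f} vs. theoretical {theory:.4f}")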

    See also the Python notebook https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-01-07 Illustrating The Central Limit Theorem.ipynb and, for many great posts on R, http://www.R-bloggers.com.