Author: Michael Mayer

  • A Beautiful Regression Formula

    Lost in Translation between R and Python 4

    Hello statistics aficionados

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    The last one was a deep dive into historic mortality rates.

    No Covid-19, no public data for a change: This post focusses on a real beauty, namely a decomposition of the R-squared in a linear regression model

    E(y) = \alpha + \sum_{j = 1}^p x_j \beta_j

    fitted by least-squares. If the response y and all p covariables are standardized to variance 1 beforehand, then the R-squared can be obtained as the cross-product of the fitted coefficients and the usual correlations between each covariable and the response:

    R^2 = \sum_{j = 1}^p \text{cor}(y, x_j)\hat\beta_j.

    Two elegant derivations can be found in this answer to the same question, written by the number 1 contributor to crossvalidated: whuber. Look up a couple of his posts – and statistics will suddenly feel super easy and clear.

    Direct consequences of the formula are:

    1. If a covariable is uncorrelated with the response, it cannot contribute to the R-squared, i.e. neither improve nor worsen. This is not obvious.
    2. A correlated covariable only improves R-squared if its coefficient is non-zero. Put differently: if the effect of a covariable is already fully covered by the other covariables, it does not improve the R-squared. This is somewhat obvious.

    Note that all formulas refer to in-sample calculations.

    Since we do not want to bore you with math, we simply demonstrate the result with short R and Python codes based on the famous iris dataset.

    y <- "Sepal.Width"
    x <- c("Sepal.Length", "Petal.Length", "Petal.Width")
    
    # Scaled version of iris
    iris2 <- data.frame(scale(iris[c(y, x)]))
    
    # Fit model 
    fit <- lm(reformulate(x, y), data = iris2)
    summary(fit) # multiple R-squared: 0.524
    (betas <- coef(fit)[x])
    # Sepal.Length Petal.Length  Petal.Width 
    #    1.1533143   -2.3734841    0.9758767 
    
    # Correlations (scaling does not matter here)
    (cors <- cor(iris[, y], iris[x]))
    # Sepal.Length Petal.Length Petal.Width
    #   -0.1175698   -0.4284401  -0.3661259
     
    # The R-squared?
    sum(betas * cors) # 0.524
    # Import packages
    import numpy as np
    import pandas as pd
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    
    # Load data
    iris = datasets.load_iris(as_frame=True).data
    print("The data:", iris.head(3), sep = "\n")
    
    # Specify response
    yvar = "sepal width (cm)"
    
    # Correlations of everyone with response
    cors = iris.corrwith(iris[yvar]).drop(yvar)
    print("\nCorrelations:", cors, sep = "\n")
    
    # Prepare scaled response and covariables
    X = StandardScaler().fit_transform(iris.drop(yvar, axis=1))
    y = StandardScaler().fit_transform(iris[[yvar]])
    
    # Fit linear regression
    OLS = LinearRegression().fit(X, y)
    betas = OLS.coef_[0]
    print("\nScaled coefs:", betas, sep = "\n")
    
    # R-squared via scikit-learn: 0.524
    print(f"\nUsual R-squared:\t {OLS.score(X, y): .3f}")
    
    # R-squared via decomposition: 0.524
    rsquared = betas @ cors.values
    print(f"Applying the formula:\t {rsquared: .3f}")

    Indeed: the cross-product of coefficients and correlations equals the R-squared of 52%.

    The Python notebook and R code can be found at:

  • Covid-19 Deaths per Mio

    Lost in Translation between R and Python 2

    Hello again!

    This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    Post 1: https://lorentzen.ch/index.php/2021/01/07/illustrating-the-central-limit-theorem/

    In Post 2, we use a publicly available data of the European Centre for Disease Prevention and Control to calculate Covid-19 deaths per Mio persons over time and across countries . We will use slim Python and R codes to

    • fetch the data directly from the internet,
    • prepare and restructure it for plotting and
    • plot a curve per selected country.

    Note that different countries use different definitions of whom to count as Covid-19 death and these definitions might also have changed over time. So be careful with comparisons!

    library(tidyverse)
    
    # Source and countries
    link <- "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"
    countries <- c("Switzerland", "United_States_of_America", 
                   "Germany", "Sweden")
    
    # Import
    df0 <- read_csv(link)
    
    # Data prep
    df <- df0 %>%
      mutate(Date = lubridate::dmy(dateRep),
             Deaths = deaths_weekly / (popData2019 / 1e6))  %>%
      rename(Country = countriesAndTerritories) %>%
      filter(Date >= "2020-03-01",
             Country %in% countries)
    
    # Plot
    ggplot(df, aes(x = Date, y = Deaths, color = Country)) +
      geom_line(size = 1) +
      ylab("Weekly deaths per Mio") +
      theme(legend.position = c(0.2, 0.85))
    import pandas as pd
    
    # Source and countries
    url = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"
    countries = ["Switzerland", "United_States_of_America", 
                 "Germany", "Sweden"]
    
    # Fetch data
    df0 = pd.read_csv(url)
    # df0.head()
    
    # Prepare data
    df = df0.assign(
        Date=lambda x: pd.to_datetime(x["dateRep"], format="%d/%m/%Y"),
        Deaths=lambda x: x["deaths_weekly"] / x["popData2019"] * 1e6,
    ).rename(columns={"countriesAndTerritories": "Country"})
    df = df.loc[
        (df["Country"].isin(countries)) & (df["Date"] >= "2020-03-01"),
        ["Country", "Date", "Deaths"],
    ]
    df = df.pivot(index="Date", columns="Country")
    df = df.droplevel(0, axis=1)
    
    # Plot
    ax = df.plot()
    ax.set_ylabel('Weekly Covid-19 deaths per Mio');

    Weekly Covid-19 deaths per Mio inhabitants as per January 26, 2021 (Python output).

    The code can be found on https://github.com/mayer79/covid with some other analyses regarding viruses.

  • Illustrating The Central Limit Theorem

    Lost in Translation between R and Python 1

    This is the first article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

    Let’s start with a little bit of statistics – it wont be the last time, friends: Illustrating the Central Limit Theorem (CLT).

    Take a sample of a random variable X with finite variance. The CLT says: No matter how “unnormally” distributed X is, its sample mean will be approximately normally distributed, at least if the sample size is not too small. This classic result is the basis to construct simple confidence intervals and hypothesis tests for the (true) mean of X, check out Wikipedia for a lot of additional information.

    The code below illustrates this famous statistical result by simulation, using a very asymmetrically distributed X, namely X = 1 with probability 0.2 and X=0 otherwise. X could represent the result of asking a randomly picked person whether he smokes. Conducting such a poll, the mean of the collected sample of such results would be a statistical estimate of the proportion of people smoking.

    Curiously, by a tiny modification, the same code will also illustrate another key result in statistics – the Law of Large Numbers: For growing sample size, the distribution of the sample mean of X contracts to the expectation E(X).

    # Fix seed, set constants
    set.seed(2006)
    sample_sizes <- c(1, 10, 30, 1000)
    nsims <- 10000
    
    # Helper function: Mean of one sample of X
    one_mean <- function(n, p = c(0.8, 0.2)) {
      mean(sample(0:1, n, replace = TRUE, prob = p))
    }
    # one_mean(10)
    
    # Simulate and plot
    par(mfrow = c(2, 2), mai = rep(0.4, 4))
    
    for (n in sample_sizes) {
      means <- replicate(nsims, one_mean(n))
      hist(means, breaks = "FD", 
           # xlim = 0:1, # uncomment for LLN
           main = sprintf("n=%i", n))
    }
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    # Fix seed, set constants
    np.random.seed(100)
    sample_sizes = [1, 10, 30, 1000]
    nsims = 10_000
    
    # Helper function: Mean of one sample
    def one_mean(n, p=0.2):
        return np.random.binomial(1, p, n).mean()
    
    # Simulate and plot
    fig, axes = plt.subplots(2, 2, figsize=(8, 8))
    
    for i, n in enumerate(sample_sizes):
        means = [one_mean(n) for ell in range(nsims)]
        ax = axes[i // 2, i % 2]
        ax.hist(means, 50)
        ax.title.set_text(f'$n = {n}$')
        ax.set_xlabel('mean')
        # ax.set_xlim(0, 1)  # uncomment for LLN
    fig.tight_layout()

    Result: The Central Limit Theorem

    The larger the samples, the closer the histogram of the simulated means resembles a symmetric bell shaped curve (R-Output for illustration).

    Result: The Law of Large Number

    Fixing the x-scale illustrates – for free(!) – the Law of Large Numbers: The distribution of the mean contracts more and more to the expectation 0.2 (R-Output for illustration).

    See also the python notebook https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2021-01-07 Illustrating The Central Limit Theorem.ipynb and for many great posts on R,  http://www.R-bloggers.com.