{"id":239,"date":"2021-03-14T11:20:54","date_gmt":"2021-03-14T10:20:54","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=239"},"modified":"2021-03-14T11:20:56","modified_gmt":"2021-03-14T10:20:56","slug":"a-beautiful-regression-formula","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2021\/03\/14\/a-beautiful-regression-formula\/","title":{"rendered":"A Beautiful Regression Formula"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Lost in Translation between R and Python 4<\/h1>\n\n\n\n<p>Hello statistics aficionados<\/p>\n\n\n\n<p>This is the next article in our series <strong>&#8220;Lost in Translation between R and Python&#8221;<\/strong>. The aim of this series is to provide high-quality R <strong>and<\/strong> Python 3 code to achieve some non-trivial tasks. If you want to learn R, check out the R tab below. Similarly, if you want to learn Python, the Python tab will be your friend.<\/p>\n\n\n\n<p>The previous post was a deep dive into historic <a href=\"https:\/\/lorentzen.ch\/wp-admin\/post.php?post=179&amp;action=edit\" data-type=\"URL\" data-id=\"https:\/\/lorentzen.ch\/wp-admin\/post.php?post=179&amp;action=edit\">mortality<\/a> rates.<\/p>\n\n\n\n<p>No Covid-19, no public data for a change: This post focuses on a real beauty, namely a decomposition of the R-squared in a linear regression model<\/p>\n\n\n\n<div class=\"wp-block-katex-display-block katex-eq\" data-katex-display=\"true\"><pre>E(y) = \\alpha + \\sum_{j = 1}^p x_j \\beta_j<\/pre><\/div>\n\n\n\n<p>fitted by least squares. 
If the <strong>response y and all p covariables are standardized to variance 1<\/strong> beforehand, then the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Coefficient_of_determination\">R-squared<\/a> can be obtained as the cross-product of the fitted coefficients and the usual correlations between each covariable and the response:<\/p>\n\n\n\n<div class=\"wp-block-katex-display-block katex-eq\" data-katex-display=\"true\"><pre>R^2 = \\sum_{j = 1}^p \\text{cor}(y, x_j)\\hat\\beta_j.<\/pre><\/div>\n\n\n\n<p>Two elegant derivations can be found in <a href=\"https:\/\/stats.stackexchange.com\/questions\/437919\/why-is-r-squared-equal-to-the-sum-of-standardized-coefficients-times-the-correla\/513296#513296\">this answer<\/a> to the same question, written by the number 1 contributor to <a href=\"https:\/\/stats.stackexchange.com\/\">Cross Validated<\/a>: whuber. Look up a couple of his posts &#8211; and statistics will suddenly feel super easy and clear.<\/p>\n\n\n\n<p>Direct consequences of the formula are:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>If a covariable is uncorrelated with the response, its term in the sum is zero, so it contributes nothing to the R-squared directly. It can change the R-squared only indirectly, through the coefficients of the other covariables. This is not obvious.<\/li><li>A correlated covariable only contributes to the R-squared if its coefficient is non-zero. Put differently: if the effect of a covariable is already fully covered by the other covariables, it does not improve the R-squared. This <em>is<\/em> somewhat obvious.<\/li><\/ol>\n\n\n\n<p>Note that all formulas refer to in-sample calculations.<\/p>\n\n\n\n<p>Since we do not want to bore you with math, we simply demonstrate the result with short R and Python code based on the famous iris dataset. 
<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-tab-0\" aria-controls=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><div role=\"tab\" id=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-tab-1\" aria-controls=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-panel-1\" aria-selected=\"false\" class=\"wp-block-ub-tabbed-content-tab-title-wrap\" style=\"--ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">Python<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-panel-0\" aria-labelledby=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-tab-0\" tabindex=\"0\">\n\n<div 
class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>y &lt;- \"Sepal.Width\"\nx &lt;- c(\"Sepal.Length\", \"Petal.Length\", \"Petal.Width\")\n\n# Scaled version of iris\niris2 &lt;- data.frame(scale(iris[c(y, x)]))\n\n# Fit model \nfit &lt;- lm(reformulate(x, y), data = iris2)\nsummary(fit) # multiple R-squared: 0.524\n(betas &lt;- coef(fit)[x])\n# Sepal.Length Petal.Length  Petal.Width \n#    1.1533143   -2.3734841    0.9758767 \n\n# Correlations (scaling does not matter here)\n(cors &lt;- cor(iris[, y], iris[x]))\n# Sepal.Length Petal.Length Petal.Width\n#   -0.1175698   -0.4284401  -0.3661259\n \n# The R-squared?\nsum(betas * cors) # 0.524<\/pre><\/div>\n\n<\/div><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap ub-hide\" id=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-panel-1\" aria-labelledby=\"ub-tabbed-content-945a4423-1935-4285-8342-b1ec3f944c56-tab-1\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"python\",\"mime\":\"text\/x-python\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"Python\",\"maxHeight\":\"400px\",\"modeName\":\"python\"}'># Import packages\nimport numpy as np\nimport pandas as pd\nfrom sklearn import datasets\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.linear_model import LinearRegression\n\n# Load data\niris = 
datasets.load_iris(as_frame=True).data\nprint(\"The data:\", iris.head(3), sep = \"\\n\")\n\n# Specify response\nyvar = \"sepal width (cm)\"\n\n# Correlations of each covariable with the response\ncors = iris.corrwith(iris[yvar]).drop(yvar)\nprint(\"\\nCorrelations:\", cors, sep = \"\\n\")\n\n# Prepare scaled response and covariables\nX = StandardScaler().fit_transform(iris.drop(yvar, axis=1))\ny = StandardScaler().fit_transform(iris[[yvar]])\n\n# Fit linear regression\nOLS = LinearRegression().fit(X, y)\nbetas = OLS.coef_[0]\nprint(\"\\nScaled coefs:\", betas, sep = \"\\n\")\n\n# R-squared via scikit-learn: 0.524\nprint(f\"\\nUsual R-squared:\\t {OLS.score(X, y): .3f}\")\n\n# R-squared via decomposition: 0.524\nrsquared = betas @ cors.values\nprint(f\"Applying the formula:\\t {rsquared: .3f}\")<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p><strong>Indeed: the cross-product of coefficients and correlations equals the R-squared of 52%.<\/strong><\/p>\n\n\n\n<p>The Python notebook and R code can be found at:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Python: <a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-03-14%20rsquared_decomposition.py\">https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-03-14%20rsquared_decomposition.py<\/a><\/li><li>R: <a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-03-14%20rsquared_decomposition.r\">https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-03-14%20rsquared_decomposition.r<\/a><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;R <-> Python&#8221; continued&#8230; A beautiful formula for 
R-squared.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[10,6,5],"class_list":["post-239","post","type-post","status-publish","format-standard","hentry","category-statistics","tag-lost-in-translation","tag-python","tag-r"],"featured_image_src":null,"author_info":{"display_name":"Michael Mayer","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/michael\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/239","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=239"}],"version-history":[{"count":14,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/239\/revisions"}],"predecessor-version":[{"id":256,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/239\/revisions\/256"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=239"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=239"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=239"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}