{"id":1074,"date":"2023-02-11T22:42:16","date_gmt":"2023-02-11T21:42:16","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=1074"},"modified":"2023-02-22T23:11:38","modified_gmt":"2023-02-22T22:11:38","slug":"quantiles-and-their-estimation","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2023\/02\/11\/quantiles-and-their-estimation\/","title":{"rendered":"Quantiles And Their Estimation"},"content":{"rendered":"\n<p>Applied statistics is dominated by the ubiquitous <strong>mean<\/strong>. For a change, this post is dedicated to <strong>quantiles<\/strong>. I will give my best to provide a good mix of theory and practical examples.<\/p>\n\n\n\n<p>While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box-plots, in childrens&#8217; <a href=\"https:\/\/www.who.int\/tools\/child-growth-standards\/standards\/weight-for-age\">weight-for-age<\/a> curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide <a href=\"https:\/\/eur-lex.europa.eu\/legal-content\/EN\/TXT\/?uri=CELEX%3A02009L0138-20210630\">solvency II<\/a> framework for insurance companies, in quality control and in many more fields.<\/p>\n\n\n<div class=\"wp-block-ub-table-of-contents-block ub_table-of-contents\" id=\"ub_table-of-contents-84d0be4f-9bf3-4f18-91dc-9f4476425ce6\" data-linktodivider=\"false\" data-showtext=\"show\" data-hidetext=\"hide\" data-scrolltype=\"auto\" data-enablesmoothscroll=\"false\" data-initiallyhideonmobile=\"false\" data-initiallyshow=\"true\"><div class=\"ub_table-of-contents-header-container\" style=\"\">\n\t\t\t<div class=\"ub_table-of-contents-header\" style=\"text-align: left; \">\n\t\t\t\t<div class=\"ub_table-of-contents-title\" style=\"\"><\/div>\n\t\t\t\t\n\t\t\t<\/div>\n\t\t<\/div><div class=\"ub_table-of-contents-extra-container\" style=\"\">\n\t\t\t<div class=\"ub_table-of-contents-container ub_table-of-contents-1-column \">\n\t\t\t\t<ul style=\"\"><li style=\"\"><a href=\"https:\/\/lorentzen.ch\/index.php\/2023\/02\/11\/quantiles-and-their-estimation\/#0-definitions\" style=\"\">Definitions<\/a><\/li><li style=\"\"><a href=\"https:\/\/lorentzen.ch\/index.php\/2023\/02\/11\/quantiles-and-their-estimation\/#1-empirical-distribution\" style=\"\">Empirical Distribution<\/a><\/li><li style=\"\"><a href=\"https:\/\/lorentzen.ch\/index.php\/2023\/02\/11\/quantiles-and-their-estimation\/#2-empirical-quantiles\" style=\"\">Empirical Quantiles<\/a><\/li><li style=\"\"><a href=\"https:\/\/lorentzen.ch\/index.php\/2023\/02\/11\/quantiles-and-their-estimation\/#3-quantile-regression\" style=\"\">Quantile Regression<\/a><\/li><\/ul>\n\t\t\t<\/div>\n\t\t<\/div><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"0-definitions\">Definitions<\/h2>\n\n\n\n<p>Often, one talks about quantiles, but rarely defines them. In what fallows, I borrow from <a href=\"https:\/\/doi.org\/10.48550\/arXiv.0912.0902\">Gneiting (2011)<\/a>.<\/p>\n\n\n<div style=\"border: 2px solid #000000; border-radius: 0%; background-color: inherit; \" class=\"ub-styled-box ub-bordered-box wp-block-ub-styled-box\" id=\"ub-styled-box-d836f3c5-4c4b-420c-b908-3d2bc2fe6e9b\">\n<p class=\"has-large-font-size\" id=\"ub-styled-box-bordered-content-\">Definition 1: Quantile<\/p>\n\n\n\n<p>Given a cumulative probability distribution (CDF) <code><span class=\"katex-eq\" data-katex-display=\"false\">F(x)=\\mathbb{P}(X\\leq x)<\/span><\/code>, the <strong>quantile<\/strong> at level <code><span class=\"katex-eq\" data-katex-display=\"false\">\\alpha \\in (0,1)<\/span><\/code> (\u0251-quantile for short), <code><span class=\"katex-eq\" data-katex-display=\"false\">q_\\alpha(F)<\/span><\/code>, is defined as<\/p>\n\n\n\n<div class=\"wp-block-katex-display-block katex-eq\" data-katex-display=\"true\"><pre>\\begin{equation*}\nq_\\alpha(F) = \\{x: \\lim_{y\\uparrow x} F(y)\\leq \\alpha \\leq F(x)\\} \\,.\n\\end{equation*}<\/pre><\/div>\n\n\n<\/div>\n\n\n<p>The inequalities of this definition are called coverage conditions. It is very important to note that quantiles are potentially set valued. Another way to write this set is as an interval:<\/p>\n\n\n\n<div class=\"wp-block-katex-display-block katex-eq\" data-katex-display=\"true\"><pre>\\begin{align*}\nq_\\alpha(F) &amp;\\in [q_\\alpha^-(F), q_\\alpha^+(F)]\n\\\\\nq_\\alpha^-(F) &amp;= \\sup\\{x:F(x) &lt; \\alpha\\} = \\inf\\{x:F(x) \\geq \\alpha\\}\n\\\\\nq_\\alpha^+(F) &amp;= \\inf\\{x:F(x) &gt; \\alpha\\} = \\sup\\{x:F(x) \\leq \\alpha\\}\n\\end{align*}<\/pre><\/div>\n\n\n\n<p>For <span class=\"katex-eq\" data-katex-display=\"false\">q_\\alpha^-<\/span>, we recover the usual quantile definition as the generalized inverse of <span class=\"katex-eq\" data-katex-display=\"false\">F<\/span>. But this is only one possible value. I will discuss examples of quantile intervals later on.<\/p>\n\n\n\n<p>To get acquainted a bit more, let&#8217;s plot the cumulative distribution function and the quantile function for some continuous distributions: Normal, Gamma and Pareto distribution. The parametrisation is such that all have mean 2, Normal and Gamma have variance 4, and Pareto has an infinite variance. For those continuous and strictly increasing distributions, all quantiles are unique, and therefore simplify to the inverse CDF <span class=\"katex-eq\" data-katex-display=\"false\">q_\\alpha^-<\/span> in the above equations. Note that those three distributions have very different tail behaviour: The density of the Normal distribution has the famously fast decrease <span class=\"katex-eq\" data-katex-display=\"false\">\\propto e^{-x^2}<\/span>, the Gamma density has an exponentially decreasing tail <span class=\"katex-eq\" data-katex-display=\"false\">\\propto e^{-x}<\/span> and the Pareto density has a fat tail, i.e. an inverse power <span class=\"katex-eq\" data-katex-display=\"false\">\\propto \\frac{1}{x^\\alpha}<\/span>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"984\" height=\"590\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image.png\" alt=\"\" class=\"wp-image-1083\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image.png 984w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-300x180.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-768x460.png 768w\" sizes=\"auto, (max-width: 984px) 100vw, 984px\" \/><figcaption class=\"wp-element-caption\">CDF (top) and quantile function (bottom) of several distributions: Normal <span class=\"katex-eq\" data-katex-display=\"false\">N(\\mu=2, \\sigma^2=2)<\/span>(left), Gamma <span class=\"katex-eq\" data-katex-display=\"false\">Ga(\\alpha=2, \\beta=\\frac{1}{2})<\/span> (middle) and Pareto <span class=\"katex-eq\" data-katex-display=\"false\">Pa(\\alpha=2)<\/span>(right).<\/figcaption><\/figure>\n\n\n\n<p>There are at least two more equivalent ways to define quantiles. They will help us later to get a better visualisations.<\/p>\n\n\n<div style=\"border: 2px solid #000000; border-radius: 0%; background-color: inherit; \" class=\"ub-styled-box ub-bordered-box wp-block-ub-styled-box\" id=\"ub-styled-box-c7cc2026-8b7e-4596-85a5-9370ce49bc6b\">\n<p class=\"has-large-font-size\" id=\"ub-styled-box-bordered-content-\">Definition 2: Quantile as minimiser<\/p>\n\n\n\n<p>Given a probability distribution <code><span class=\"katex-eq\" data-katex-display=\"false\">F<\/span><\/code>, the \u0251-quantile <code><span class=\"katex-eq\" data-katex-display=\"false\">q_\\alpha(F)<\/span><\/code> is defined as any minimiser of the expected scoring function <span class=\"katex-eq\" data-katex-display=\"false\">S<\/span><\/p>\n\n\n\n<div class=\"wp-block-katex-display-block katex-eq\" data-katex-display=\"true\"><pre>\\begin{align*}\nq_\\alpha(F) &amp;\\in \\argmin_x \\mathbb{E}(S_\\alpha(x, Y))\n\\\\\nS_\\alpha(x, y) &amp;= (\\mathbf{1}_{x\\geq y} - \\alpha)(x - y)\n\\end{align*}<\/pre><\/div>\n\n\n\n<p>where the expectation is taken over <span class=\"katex-eq\" data-katex-display=\"false\">Y \\sim F<\/span>.<\/p>\n\n\n<\/div>\n\n\n<p>The scoring function or loss function <span class=\"katex-eq\" data-katex-display=\"false\">S<\/span> can be generalized to <span class=\"katex-eq\" data-katex-display=\"false\">S_\\alpha(x, y) = (\\mathbb{1}_{x\\geq y} - \\alpha)(g(x) - g(y))<\/span> for any increasing function <span class=\"katex-eq\" data-katex-display=\"false\">g<\/span>, but the above version in definition 2 is by far the simplest one and coincides with the pinball loss used in quantile regression.<\/p>\n\n\n\n<p>This definition is super useful because it provides a tool to assess whether a given value really is a quantile. A plot will suffice.<\/p>\n\n\n\n<p>Having a definition in terms of a convex optimisation problem, there is another definition in terms of the first order condition of optimality. For continuous, strictly increasing distributions, this would be equivalent to setting the first derivative to zero. For our non-smooth scoring function with potentially set-valued solution, this gets more complicated, e.g. subdifferential or subgradients replacing derivatives. In the end, it amounts to a sign change of the expectation of the so called identification function <span class=\"katex-eq\" data-katex-display=\"false\">V(x, y)=\\mathbf{1}_{x\\geq y}-\\alpha<\/span>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1-empirical-distribution\">Empirical Distribution<\/h2>\n\n\n\n<p>The empirical distribution provides an excellent example. Given a sample of <em>n<\/em> observations <span class=\"katex-eq\" data-katex-display=\"false\">y_1, \\ldots, y_n<\/span>, the empirical distribution is given by <span class=\"katex-eq\" data-katex-display=\"false\">F_n(x)=\\frac{1}{n}\\sum_{i=1}^n \\mathbf{1}_{x\\geq y_i}<\/span>. Let us start simple and take two observations <span class=\"katex-eq\" data-katex-display=\"false\">y_1=1<\/span> and <span class=\"katex-eq\" data-katex-display=\"false\">y_2=2<\/span>. Plugging in this distribution in the definition 1 of quantiles gives the exact quantiles of this 2-point empirical CDF:<\/p>\n\n\n\n<div class=\"wp-block-katex-display-block katex-eq\" data-katex-display=\"true\"><pre>\\begin{equation}\nq_\\alpha(F_2)=\n\\begin{cases}\n   1 &amp;\\text{if } \\alpha&lt;\\frac{1}{2} \\\\\n   [1, 2] &amp;\\text{if } \\alpha=\\frac{1}{2} \\\\\n   2 &amp;\\text{else}\n\\end{cases}\n\\end{equation}<\/pre><\/div>\n\n\n\n<p>Here we encounter the interval [1, 2] for <span class=\"katex-eq\" data-katex-display=\"false\">\\alpha=\\frac{1}{2}<\/span>. Again, I plot both the (empirical) CDF <span class=\"katex-eq\" data-katex-display=\"false\">F_n<\/span> and the quantiles.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"789\" height=\"390\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-1.png\" alt=\"\" class=\"wp-image-1113\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-1.png 789w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-1-300x148.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-1-768x380.png 768w\" sizes=\"auto, (max-width: 789px) 100vw, 789px\" \/><figcaption class=\"wp-element-caption\">Empirical distribution function and exact quantiles of observations y=1 and y=2.<\/figcaption><\/figure>\n\n\n\n<p>In the left plot, the big dots unambiguously mark the values at x=1 and x=2. For the quantiles in the right plot, the vertical line at probability 0.5 really means that all those values between 1 and 2 are possible 0.5-quantiles, also known as median.<\/p>\n\n\n\n<p>If you wonder about the value 1 for quantiles of level smaller than 50%, the minimisation formulation helps. The following plot shows <span class=\"katex-eq\" data-katex-display=\"false\">\\mathbb{E}(S_\\alpha(x, Y))<\/span> for <span class=\"katex-eq\" data-katex-display=\"false\">\\alpha=0.2<\/span> with a clear unique minimum at x=1.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"536\" height=\"393\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-2.png\" alt=\"\" class=\"wp-image-1114\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-2.png 536w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-2-300x220.png 300w\" sizes=\"auto, (max-width: 536px) 100vw, 536px\" \/><figcaption class=\"wp-element-caption\">Expected scoring function (pinball loss) for <span class=\"katex-eq\" data-katex-display=\"false\">\\alpha=0.2<\/span> for the empirical CDF with observations 1 and 2.<\/figcaption><\/figure>\n\n\n\n<p>A note for the interested reader: The above empirical distribution is the same as the distribution of a Bernoulli random variable, except that the x-values are shifted, i.e. the Bernoulli random variables are canonically set to 0 and 1 instead of 1 and 2. Furthermore, there is a direct connection between quantiles and classification via the cost-weighted misclassification error, see <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2202.12780\">Fissler, Lorentzen &amp; Mayer (2022)<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2-empirical-quantiles\">Empirical Quantiles<\/h2>\n\n\n\n<p>From the empirical CDF, it is only a small step to empirical quantiles. But what&#8217;s the difference anyway? While we saw the <em>exact<\/em> quantile of the empirical distribution, <span class=\"katex-eq\" data-katex-display=\"false\">q_\\alpha(F_n)<\/span>, an empirical or sample quantile estimate the <em>true<\/em> (population) quantile given a data sample, i.e. <span class=\"katex-eq\" data-katex-display=\"false\">\\hat{q}_\\alpha(\\{y_i\\}) \\approx q_\\alpha(F)<\/span>.<\/p>\n\n\n\n<p>As an empirical CDF estimates the CDF of the <em>true<\/em> underlying (population) distribution, <span class=\"katex-eq\" data-katex-display=\"false\">F_n=\\hat{F} \\approx F<\/span>, one immediate way to estimate a quantile is:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Estimate the CDF via the empirical CDF <span class=\"katex-eq\" data-katex-display=\"false\">F_n<\/span>.<\/li>\n\n\n\n<li>Use the exact quantile in analogy to Eq.(1) as an estimate.<\/li>\n<\/ol>\n\n\n\n<p>Very importantly, this is just one way to estimate quantiles from a sample. There are many, many more. Here is the outcome of the 20%-quantile of our tiny data sample <span class=\"katex-eq\" data-katex-display=\"false\">y_1=1<\/span> and <span class=\"katex-eq\" data-katex-display=\"false\">y_2=2<\/span>.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">import numpy as np\n\nmethods = [\n    'inverted_cdf',\n    'averaged_inverted_cdf',\n    'closest_observation',\n    'interpolated_inverted_cdf',\n    'hazen',\n    'weibull',\n    'linear',\n    'median_unbiased',\n    'normal_unbiased',\n    'nearest',\n    'lower',\n    'higher',\n    'midpoint',\n]\nalpha = 0.2\nfor m in methods:\n    estimate = np.quantile([1, 2], 0.2, method=m)\n    print(f&quot;{m:&lt;25} {alpha}-quantile estimate = {estimate}&quot;)<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">inverted_cdf              0.2-quantile estimate = 1\naveraged_inverted_cdf     0.2-quantile estimate = 1.0\nclosest_observation       0.2-quantile estimate = 1\ninterpolated_inverted_cdf 0.2-quantile estimate = 1.0\nhazen                     0.2-quantile estimate = 1.0\nweibull                   0.2-quantile estimate = 1.0\nlinear                    0.2-quantile estimate = 1.2\nmedian_unbiased           0.2-quantile estimate = 1.0\nnormal_unbiased           0.2-quantile estimate = 1.0\nnearest                   0.2-quantile estimate = 1\nlower                     0.2-quantile estimate = 1\nhigher                    0.2-quantile estimate = 2\nmidpoint                  0.2-quantile estimate = 1.5<\/code><\/pre>\n\n\n\n<p>Note that the first 9 methods are the ones discussed in a famous paper of <a href=\"https:\/\/doi.org\/10.1080\/00031305.1996.10473566\" data-type=\"URL\" data-id=\"https:\/\/doi.org\/10.1080\/00031305.1996.10473566\">Hyndman &amp; Fan (1996)<\/a>. The default method of both Python&#8217;s <code><a href=\"https:\/\/numpy.org\/doc\/stable\/reference\/generated\/numpy.quantile.html#numpy-quantile\">numpy.quantile<\/a><\/code> and R&#8217;s <code><a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/stats\/html\/quantile.html\">quantile<\/a><\/code> is <strong>linear<\/strong>, i.e. number 7 in Hyndman &amp; Fan. Somewhat surprisingly, we observe that this default method is clearly biased in this case and overestimates the true quantile.<\/p>\n\n\n\n<p>For large sample sizes, the differences will get tiny and all methods converge finally to a true quantile, at least for continuous distributions. In order to assess the bias with small sample sizes for each method, I do a simulation. This is where the fun starts\ud83d\ude04 <\/p>\n\n\n\n<p>For all three selected distributions and for quantile levels 15%, 50% and 85%, I simulate 10000 random samples, each of sample size 10 and calculate the sample quantile. Then I take the mean over all 10000 simulations as well as the 5% and the 95% quantiles as a measure of uncertainty, i.e. 90% confidence intervals. After some coding, this results in the following plot. (I spare you the code at this point. You can find it in the linked notebook at the bottom of this post).<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"758\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/quantile_simulation-1024x758.png\" alt=\"\" class=\"wp-image-1145\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/quantile_simulation-1024x758.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/quantile_simulation-300x222.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/quantile_simulation-768x568.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/quantile_simulation-1536x1137.png 1536w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/quantile_simulation-2048x1515.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Small sample bias (n=10) of different empirical quantile estimation methods (x-axis and color) based on 10000 simulations. Dots are the mean values, error bars cover a 90% confidence interval. The dotted horizontal line is the theoretical quantile value.<br>left: Normal distribution; mid: Gamma distribution; right: Pareto distribution<br>top: 15%-quantile; mid: 50%-quantile; bottom: 85%-quantile <\/figcaption><\/figure>\n\n\n\n<p>For the 15%-quantile, the default linear method always overestimates, but it does surprisingly well for the 85%-quantile of the Pareto distribution. Overall, I personally would prefer the median unbiased or the Hazen method. Interestingly, the Hazen method is one of the oldest, namely from <a href=\"https:\/\/doi.org\/10.1061\/taceat.0002563\">Hazen (1914)<\/a>, and is the only one that fulfills all proposed properties of Hyndman &amp; Fan, who propose the median unbiased method as default.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"3-quantile-regression\">Quantile Regression<\/h2>\n\n\n\n<p>So far, the interest was in the quantile of a sample or distribution. To go one step further, one might ask for the conditional quantile of a response variable Y given some features or covariates X, <span class=\"katex-eq\" data-katex-display=\"false\">q_\\alpha(Y|X)=q_\\alpha(F_{Y|X})<\/span>. This is the realm of quantile regression as invented by <a href=\"https:\/\/doi.org\/10.2307\/1913643\">Koenker &amp; Bassett (1978)<\/a>. To stay within linear models, one simply replaces the squared error of ordinary least squares (OLS) by the pinball loss.<\/p>\n\n\n\n<p>Without going into any more details, I chose the <a href=\"https:\/\/github.com\/allisonhorst\/palmerpenguins\">Palmer&#8217;s penguin dataset<\/a> and the body mass in gram as response variable, all the other variables as feature variables. And again, I model the 15%, 50% and 85% quantiles. <\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">import seaborn as sns\n\ndf = sns.load_dataset(&quot;penguins&quot;).dropna()\ndf<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-table alignwide has-small-font-size\"><table><thead><tr><th><\/th><th>species<\/th><th>island<\/th><th>bill_length_mm<\/th><th>bill_depth_mm<\/th><th>flipper_length_mm<\/th><th>body_mass_g<\/th><th>sex<\/th><\/tr><\/thead><tbody><tr><th>0<\/th><td>Adelie<\/td><td>Torgersen<\/td><td>39.1<\/td><td>18.7<\/td><td>181.0<\/td><td>3750.0<\/td><td>Male<\/td><\/tr><tr><th>1<\/th><td>Adelie<\/td><td>Torgersen<\/td><td>39.5<\/td><td>17.4<\/td><td>186.0<\/td><td>3800.0<\/td><td>Female<\/td><\/tr><tr><th>2<\/th><td>Adelie<\/td><td>Torgersen<\/td><td>40.3<\/td><td>18.0<\/td><td>195.0<\/td><td>3250.0<\/td><td>Female<\/td><\/tr><tr><th>4<\/th><td>Adelie<\/td><td>Torgersen<\/td><td>36.7<\/td><td>19.3<\/td><td>193.0<\/td><td>3450.0<\/td><td>Female<\/td><\/tr><tr><th>5<\/th><td>Adelie<\/td><td>Torgersen<\/td><td>39.3<\/td><td>20.6<\/td><td>190.0<\/td><td>3650.0<\/td><td>Male<\/td><\/tr><tr><th>&#8230;<\/th><td>&#8230;<\/td><td>&#8230;<\/td><td>&#8230;<\/td><td>&#8230;<\/td><td>&#8230;<\/td><td>&#8230;<\/td><td>&#8230;<\/td><\/tr><tr><th>338<\/th><td>Gentoo<\/td><td>Biscoe<\/td><td>47.2<\/td><td>13.7<\/td><td>214.0<\/td><td>4925.0<\/td><td>Female<\/td><\/tr><tr><th>340<\/th><td>Gentoo<\/td><td>Biscoe<\/td><td>46.8<\/td><td>14.3<\/td><td>215.0<\/td><td>4850.0<\/td><td>Female<\/td><\/tr><tr><th>341<\/th><td>Gentoo<\/td><td>Biscoe<\/td><td>50.4<\/td><td>15.7<\/td><td>222.0<\/td><td>5750.0<\/td><td>Male<\/td><\/tr><tr><th>342<\/th><td>Gentoo<\/td><td>Biscoe<\/td><td>45.2<\/td><td>14.8<\/td><td>212.0<\/td><td>5200.0<\/td><td>Female<\/td><\/tr><tr><th>343<\/th><td>Gentoo<\/td><td>Biscoe<\/td><td>49.9<\/td><td>16.1<\/td><td>213.0<\/td><td>5400.0<\/td><td>Male<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">from sklearn.base import clone\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.linear_model import QuantileRegressor\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import OneHotEncoder, SplineTransformer\n\ny = df[&quot;body_mass_g&quot;]\nX = df.drop(columns=&quot;body_mass_g&quot;)\nqr50 = Pipeline([\n    (&quot;column_transformer&quot;,\n     ColumnTransformer([\n         (&quot;ohe&quot;, OneHotEncoder(drop=&quot;first&quot;), [&quot;species&quot;, &quot;island&quot;, &quot;sex&quot;]),\n         (&quot;spline&quot;, SplineTransformer(n_knots=3, degree=2), [&quot;bill_length_mm&quot;, &quot;bill_depth_mm&quot;, &quot;flipper_length_mm&quot;]),\n         \n     ])\n    ),\n    (&quot;quantile_regressor&quot;,\n     QuantileRegressor(quantile=0.5, alpha=0, solver=&quot;highs&quot;)\n    )\n])\nqr15 = clone(qr50)\nqr15.set_params(quantile_regressor__quantile=0.15)\nqr85 = clone(qr50)\nqr85.set_params(quantile_regressor__quantile=0.85)\n\nqr15.fit(X, y)\nqr50.fit(X, y)\nqr85.fit(X, y)<\/pre><\/div>\n\n\n\n<p>This code snippet gives the three fitted linear quantile models. That was the easy part. I find it much harder to produce good visualisations.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">import polars as pl  # imported and used much earlier in the notebook\n\ndf_obs = df.copy()\ndf_obs[&quot;type&quot;] = &quot;observed&quot;\ndfs = [pl.from_pandas(df_obs)]\nfor m, name in [(qr15, &quot;15%-q&quot;), (qr50, &quot;median&quot;), (qr85, &quot;85%-q&quot;)]:\n    df_pred = df.copy()\n    df_pred[&quot;type&quot;] = &quot;predicted_&quot; + name\n    df_pred[&quot;body_mass_g&quot;] = m.predict(X)\n    dfs.append(pl.from_pandas(df_pred))\ndf_pred = pl.concat(dfs).to_pandas()\n\nsns.catplot(df_pred, x=&quot;species&quot;, y=&quot;body_mass_g&quot;, hue=&quot;type&quot;, alpha=0.5)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"645\" height=\"490\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-3.png\" alt=\"\" class=\"wp-image-1157\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-3.png 645w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-3-300x228.png 300w\" sizes=\"auto, (max-width: 645px) 100vw, 645px\" \/><figcaption class=\"wp-element-caption\">Quantile regression estimates and observed values for species (x-axis).<\/figcaption><\/figure>\n\n\n\n<p>This plot per species seems suspicious. We would expect the 85%-quantile prediction in red to mostly be larger than the median (green), instead we detect several clusters. The reason behind it is the sex of the penguins which also enters the model. To demonstrate this fact, I plot the sex separately and make the species visible by the plot style. This time, I put the flipper length on the x-axis.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">sns.relplot(\n    df_pred,\n    x=&quot;flipper_length_mm&quot;,\n    y=&quot;body_mass_g&quot;,\n    hue=&quot;type&quot;,\n    col=&quot;sex&quot;,\n    style=&quot;species&quot;,\n    alpha=0.5,\n)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"433\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-5-1024x433.png\" alt=\"\" class=\"wp-image-1159\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-5-1024x433.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-5-300x127.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-5-768x325.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2023\/02\/image-5.png 1159w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Quantile regression estimates and observed values vs flipper length (x-axis). Left: male; right: female<\/figcaption><\/figure>\n\n\n\n<p>This is a really nice plot:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>One immediately observes the differences in sex on the left and right subplot.<\/li>\n\n\n\n<li>The two clusters in each subplot can be well explained by the penguin species.<\/li>\n\n\n\n<li>The 3 quantiles are now vertically in the expected order, 15% quantile in yellow the lowest, 85%-quantile in red the highest, the median in the middle, and some observations beyond those predicted quantiles.<\/li>\n\n\n\n<li>The models are linear in flipper length with a positive, but slightly different slope per quantile level.<\/li>\n<\/ul>\n\n\n\n<p>As a final check, let us compute the coverage of each quantile model (in-sample).<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">[\n    np.mean(y &lt;= qr15.predict(X)),\n    np.mean(y &lt;= qr50.predict(X)),\n    np.mean(y &lt;= qr85.predict(X)),\n]<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">[0.15015015015015015, 0.5225225225225225, 0.8618618618618619]<\/code><\/pre>\n\n\n\n<p>Those numbers are close enough to 15%, 50% and 85%.<\/p>\n\n\n\n<p>As always, the full Python code is available as notebook: <a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2023-02-11%20empirical_quantiles.ipynb\">https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2023-02-11%20empirical_quantiles.ipynb<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will give my best to provide a good mix of theory and practical examples. While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16,9],"tags":[6],"class_list":["post-1074","post","type-post","status-publish","format-standard","hentry","category-machine-learning","category-statistics","tag-python"],"featured_image_src":null,"author_info":{"display_name":"Christian Lorentzen","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/christian\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1074","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=1074"}],"version-history":[{"count":100,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1074\/revisions"}],"predecessor-version":[{"id":1197,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1074\/revisions\/1197"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=1074"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=1074"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=1074"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}