Quantiles And Their Estimation

Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will do my best to provide a good mix of theory and practical examples.

While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box plots, in children's weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide Solvency II framework for insurance companies, in quality control and in many more fields.

Definitions

Often, one talks about quantiles, but rarely defines them. In what follows, I borrow from Gneiting (2011).

Definition 1: Quantile

Given a cumulative distribution function (CDF) F(x)=\mathbb{P}(X\leq x), the quantile at level \alpha \in (0,1) (\alpha-quantile for short), q_\alpha(F), is defined as

\begin{equation*}
q_\alpha(F) = \{x: \lim_{y\uparrow x} F(y)\leq \alpha \leq F(x)\} \,.
\end{equation*}

The inequalities of this definition are called coverage conditions. It is very important to note that quantiles are potentially set-valued. Another way to write this set is as an interval:

\begin{align*}
q_\alpha(F) &\in [q_\alpha^-(F), q_\alpha^+(F)]
\\
q_\alpha^-(F) &= \sup\{x:F(x) < \alpha\} = \inf\{x:F(x) \geq \alpha\}
\\
q_\alpha^+(F) &= \inf\{x:F(x) > \alpha\} = \sup\{x:F(x) \leq \alpha\}
\end{align*}

For q_\alpha^-, we recover the usual quantile definition as the generalized inverse of F. But this is only one possible value. I will discuss examples of quantile intervals later on.
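
To make this concrete for a finite sample, i.e. for the empirical distribution that will appear further below, here is a minimal sketch of the two interval endpoints. The helper names q_minus and q_plus are mine and not from the original post.

import numpy as np

def q_minus(sample, alpha):
    """Lower quantile q_alpha^- = inf{x: F_n(x) >= alpha} of the empirical CDF F_n."""
    y = np.sort(np.asarray(sample))
    n = len(y)
    # F_n(y_(k)) = k/n at the k-th order statistic, so the infimum is the
    # smallest order statistic y_(k) with k/n >= alpha, i.e. k = ceil(alpha * n).
    k = int(np.ceil(alpha * n))
    return y[k - 1]

def q_plus(sample, alpha):
    """Upper quantile q_alpha^+ = sup{x: F_n(x) <= alpha} of the empirical CDF F_n."""
    y = np.sort(np.asarray(sample))
    n = len(y)
    # F_n(x) <= alpha holds for all x below y_(k+1) with k = floor(alpha * n),
    # so the supremum is y_(k+1), capped at the largest observation.
    k = int(np.floor(alpha * n))
    return y[min(k, n - 1)]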

To get acquainted a bit more, let's plot the cumulative distribution function and the quantile function for some continuous distributions: the Normal, Gamma and Pareto distributions. The parametrisation is such that all have mean 2, the Normal and Gamma have variance 4, and the Pareto has infinite variance. For these continuous and strictly increasing distributions, all quantiles are unique and therefore simplify to the inverse CDF q_\alpha^- in the above equations. Note that the three distributions have very different tail behaviour: the density of the Normal distribution has the famously fast decrease \propto e^{-x^2}, the Gamma density has an exponentially decreasing tail \propto e^{-x}, and the Pareto density has a fat, power-law tail \propto \frac{1}{x^{\alpha+1}} (with \alpha here denoting the Pareto shape parameter, not a quantile level).

CDF (top) and quantile function (bottom) of several distributions: Normal N(\mu=2, \sigma^2=2) (left), Gamma Ga(\alpha=2, \beta=\frac{1}{2}) (middle) and Pareto Pa(\alpha=2) (right).
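
A minimal sketch of such a figure with scipy.stats, where the quantile function is the ppf; the Gamma and Pareto parametrisations below are my reading of the stated moments and may differ from the figure.

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

distributions = {
    "Normal": stats.norm(loc=2, scale=2),  # mean 2, variance 4
    "Gamma": stats.gamma(a=1, scale=2),    # mean 2, variance 4 (illustrative choice)
    "Pareto": stats.pareto(b=2),           # mean 2, infinite variance
}

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))
p = np.linspace(0.01, 0.99, 500)
for j, (name, dist) in enumerate(distributions.items()):
    x = np.linspace(dist.ppf(0.001), dist.ppf(0.95), 500)
    axes[0, j].plot(x, dist.cdf(x))  # CDF F(x)
    axes[0, j].set_title(f"{name} CDF")
    axes[1, j].plot(p, dist.ppf(p))  # quantile function, the inverse CDF
    axes[1, j].set_title(f"{name} quantile function")
fig.tight_layout()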

There are at least two more equivalent ways to define quantiles. They will later help us to get better visualisations.

Definition 2: Quantile as minimiser

Given a probability distribution F, the \alpha-quantile q_\alpha(F) is defined as any minimiser of the expected scoring function S

\begin{align*}
q_\alpha(F) &\in \argmin_x \mathbb{E}(S_\alpha(x, Y))
\\
S_\alpha(x, y) &= (\mathbf{1}_{x\geq y} - \alpha)(x - y)
\end{align*}

where the expectation is taken over Y \sim F.

The scoring function or loss function S can be generalized to S_\alpha(x, y) = (\mathbf{1}_{x\geq y} - \alpha)(g(x) - g(y)) for any increasing function g, but the above version in Definition 2 is by far the simplest one and coincides with the pinball loss used in quantile regression.

This definition is super useful because it provides a tool to assess whether a given value really is a quantile. A plot will suffice.

Having defined quantiles via a convex optimisation problem, there is yet another definition in terms of the first-order condition of optimality. For continuous, strictly increasing distributions, this would be equivalent to setting the first derivative to zero. For our non-smooth scoring function with potentially set-valued solution, this gets more complicated, with subdifferentials or subgradients replacing derivatives. In the end, it amounts to a sign change of the expectation of the so-called identification function V(x, y)=\mathbf{1}_{x\geq y}-\alpha.
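
In code, the scoring function and the identification function are one-liners; a minimal sketch with function names of my own choosing, with pinball_loss reused further below.

def pinball_loss(x, y, alpha):
    """Scoring function S_alpha(x, y) = (1{x >= y} - alpha) * (x - y)."""
    return ((x >= y) - alpha) * (x - y)

def identification(x, y, alpha):
    """Identification function V(x, y) = 1{x >= y} - alpha."""
    return (x >= y) - alpha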

Empirical Distribution

The empirical distribution provides an excellent example. Given a sample of n observations y_1, \ldots, y_n, the empirical distribution is given by F_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{x\geq y_i}. Let us start simple and take two observations y_1=1 and y_2=2. Plugging this distribution into Definition 1 gives the exact quantiles of this 2-point empirical CDF:

\begin{equation}
q_\alpha(F_2)=
\begin{cases}
   1 &\text{if } \alpha<\frac{1}{2} \\
   [1, 2] &\text{if } \alpha=\frac{1}{2} \\
   2 &\text{else}
\end{cases}
\end{equation}

Here we encounter the interval [1, 2] for \alpha=\frac{1}{2}. Again, I plot both the (empirical) CDF F_n and the quantiles.

Empirical distribution function and exact quantiles of observations y=1 and y=2.

In the left plot, the big dots unambiguously mark the values of the CDF at x=1 and x=2. For the quantiles in the right plot, the vertical line at probability 0.5 really means that all values between 1 and 2 are possible 0.5-quantiles, also known as the median.
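
As a quick check, the helpers q_minus and q_plus from above reproduce exactly the three cases of Eq. (1):

for a in [0.2, 0.5, 0.8]:
    print(f"alpha={a}: [{q_minus([1, 2], a)}, {q_plus([1, 2], a)}]")
# alpha=0.2: [1, 1]
# alpha=0.5: [1, 2]
# alpha=0.8: [2, 2]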

If you wonder about the value 1 for quantiles of level smaller than 50%, the minimisation formulation helps. The following plot shows \mathbb{E}(S_\alpha(x, Y)) for \alpha=0.2 with a clear unique minimum at x=1.

Expected scoring function (pinball loss) for \alpha=0.2 for the empirical CDF with observations 1 and 2.
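
Such a plot is easily reproduced with the pinball_loss helper from above; this is a sketch, not the original plotting code.

import matplotlib.pyplot as plt
import numpy as np

xs = np.linspace(0, 3, 301)
scores = [np.mean(pinball_loss(x, np.array([1.0, 2.0]), alpha=0.2)) for x in xs]

plt.plot(xs, scores)
plt.axvline(1, linestyle=":", color="grey")  # unique minimum at x = 1
plt.xlabel("x")
plt.ylabel("expected pinball loss for alpha=0.2")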

A note for the interested reader: The above empirical distribution is the same as the distribution of a Bernoulli random variable, except that the x-values are shifted, i.e. the Bernoulli random variables are canonically set to 0 and 1 instead of 1 and 2. Furthermore, there is a direct connection between quantiles and classification via the cost-weighted misclassification error, see Fissler, Lorentzen & Mayer (2022).

Empirical Quantiles

From the empirical CDF, it is only a small step to empirical quantiles. But what's the difference anyway? While we saw the exact quantile of the empirical distribution, q_\alpha(F_n), an empirical or sample quantile estimates the true (population) quantile from a data sample, i.e. \hat{q}_\alpha(\{y_i\}) \approx q_\alpha(F).

As an empirical CDF estimates the CDF of the true underlying (population) distribution, F_n=\hat{F} \approx F, one immediate way to estimate a quantile is:

  1. Estimate the CDF via the empirical CDF F_n.
  2. Use the exact quantile in analogy to Eq. (1) as the estimate; see the short sketch right below.
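
With the q_minus helper from above, this plug-in estimate amounts to taking the lower quantile q_\alpha^- of the empirical CDF; it should coincide with numpy's inverted_cdf method used below.

# Plug-in estimate: exact (lower) quantile of the empirical CDF of the sample.
q_minus([1, 2], 0.2)                             # 1
np.quantile([1, 2], 0.2, method="inverted_cdf")  # 1, the same estimator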

Very importantly, this is just one way to estimate quantiles from a sample. There are many, many more. Here is the outcome for the 20%-quantile of our tiny data sample y_1=1 and y_2=2, using the different estimation methods available in numpy.quantile.

import numpy as np

methods = [
    'inverted_cdf',
    'averaged_inverted_cdf',
    'closest_observation',
    'interpolated_inverted_cdf',
    'hazen',
    'weibull',
    'linear',
    'median_unbiased',
    'normal_unbiased',
    'nearest',
    'lower',
    'higher',
    'midpoint',
]
alpha = 0.2
for m in methods:
    estimate = np.quantile([1, 2], alpha, method=m)
    print(f"{m:<25} {alpha}-quantile estimate = {estimate}")
inverted_cdf              0.2-quantile estimate = 1
averaged_inverted_cdf     0.2-quantile estimate = 1.0
closest_observation       0.2-quantile estimate = 1
interpolated_inverted_cdf 0.2-quantile estimate = 1.0
hazen                     0.2-quantile estimate = 1.0
weibull                   0.2-quantile estimate = 1.0
linear                    0.2-quantile estimate = 1.2
median_unbiased           0.2-quantile estimate = 1.0
normal_unbiased           0.2-quantile estimate = 1.0
nearest                   0.2-quantile estimate = 1
lower                     0.2-quantile estimate = 1
higher                    0.2-quantile estimate = 2
midpoint                  0.2-quantile estimate = 1.5

Note that the first 9 methods are the ones discussed in the famous paper by Hyndman & Fan (1996). The default method of both Python's numpy.quantile and R's quantile is linear, i.e. number 7 in Hyndman & Fan. Somewhat surprisingly, we observe that this default method is clearly biased in this case and overestimates the true quantile.

For large sample sizes, the differences become tiny and all methods eventually converge to the true quantile, at least for continuous distributions. In order to assess the bias of each method for small sample sizes, I run a simulation. This is where the fun starts😄

For all three selected distributions and for the quantile levels 15%, 50% and 85%, I simulate 10000 random samples, each of sample size 10, and calculate the sample quantile. Then I take the mean over all 10000 simulations as well as the 5% and 95% quantiles as a measure of uncertainty, i.e. 90% confidence intervals. After some coding, this results in the following plot. (I spare you the code at this point; you can find it in the linked notebook at the bottom of this post.)

Small sample bias (n=10) of different empirical quantile estimation methods (x-axis and color) based on 10000 simulations. Dots are the mean values, error bars cover a 90% confidence interval. The dotted horizontal line is the theoretical quantile value.
left: Normal distribution; mid: Gamma distribution; right: Pareto distribution
top: 15%-quantile; mid: 50%-quantile; bottom: 85%-quantile
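
For readers who prefer not to open the notebook, here is a heavily condensed sketch of such a simulation (not the notebook code), restricted to the Normal distribution, the 15%-quantile and a handful of methods.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sim = 0.15, 10, 10_000
dist = stats.norm(loc=2, scale=2)  # mean 2, variance 4, as in the text
true_q = dist.ppf(alpha)           # theoretical 15%-quantile

# One row per simulated sample, n observations each.
samples = dist.rvs(size=(n_sim, n), random_state=rng)
for m in ["linear", "hazen", "median_unbiased", "inverted_cdf"]:
    estimates = np.quantile(samples, alpha, method=m, axis=1)
    ci_low, ci_high = np.quantile(estimates, [0.05, 0.95])
    print(f"{m:<16} mean={estimates.mean():.3f}  90% CI=[{ci_low:.3f}, {ci_high:.3f}]  true={true_q:.3f}")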

For the 15%-quantile, the default linear method always overestimates, but it does surprisingly well for the 85%-quantile of the Pareto distribution. Overall, I personally would prefer the median-unbiased or the Hazen method. Interestingly, the Hazen method is one of the oldest, dating back to Hazen (1914), and is the only one that fulfills all the desirable properties proposed by Hyndman & Fan, who recommend the median-unbiased method as default.

Quantile Regression

So far, we were interested in the quantile of a sample or distribution. To go one step further, one might ask for the conditional quantile of a response variable Y given features or covariates X, q_\alpha(Y|X)=q_\alpha(F_{Y|X}). This is the realm of quantile regression, invented by Koenker & Bassett (1978). To stay within linear models, one simply replaces the squared error of ordinary least squares (OLS) by the pinball loss.
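
In formulas, with the pinball loss S_\alpha from Definition 2 and a linear model x^\top \beta, the \alpha-quantile regression coefficients solve

\begin{equation*}
\hat{\beta}_\alpha \in \argmin_{\beta} \frac{1}{n} \sum_{i=1}^n S_\alpha(x_i^\top \beta, y_i)\,,
\end{equation*}

whereas OLS minimises \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\top \beta)^2.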

Without going into any more details, I chose the Palmer penguins dataset with the body mass in grams as response variable and all other variables as features. And again, I model the 15%, 50% and 85% quantiles.

import seaborn as sns

df = sns.load_dataset("penguins").dropna()
df
     species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
0    Adelie   Torgersen  39.1            18.7           181.0              3750.0       Male
1    Adelie   Torgersen  39.5            17.4           186.0              3800.0       Female
2    Adelie   Torgersen  40.3            18.0           195.0              3250.0       Female
4    Adelie   Torgersen  36.7            19.3           193.0              3450.0       Female
5    Adelie   Torgersen  39.3            20.6           190.0              3650.0       Male
...
338  Gentoo   Biscoe     47.2            13.7           214.0              4925.0       Female
340  Gentoo   Biscoe     46.8            14.3           215.0              4850.0       Female
341  Gentoo   Biscoe     50.4            15.7           222.0              5750.0       Male
342  Gentoo   Biscoe     45.2            14.8           212.0              5200.0       Female
343  Gentoo   Biscoe     49.9            16.1           213.0              5400.0       Male
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import QuantileRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

y = df["body_mass_g"]
X = df.drop(columns="body_mass_g")

# Preprocessing: one-hot encode the categorical features and expand the
# numerical features with quadratic splines, then fit an unpenalised
# linear quantile regression for the median.
qr50 = Pipeline([
    ("column_transformer",
     ColumnTransformer([
         ("ohe", OneHotEncoder(drop="first"), ["species", "island", "sex"]),
         ("spline", SplineTransformer(n_knots=3, degree=2),
          ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]),
     ])
    ),
    ("quantile_regressor",
     QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
    )
])

# Clone the median model and only change the quantile level.
qr15 = clone(qr50)
qr15.set_params(quantile_regressor__quantile=0.15)
qr85 = clone(qr50)
qr85.set_params(quantile_regressor__quantile=0.85)

qr15.fit(X, y)
qr50.fit(X, y)
qr85.fit(X, y)

This code snippet gives the three fitted linear quantile models. That was the easy part. I find it much harder to produce good visualisations.

import polars as pl  # imported and used much earlier in the notebook

# Stack the observed values and the three quantile predictions into one long dataframe.
df_obs = df.copy()
df_obs["type"] = "observed"
dfs = [pl.from_pandas(df_obs)]
for m, name in [(qr15, "15%-q"), (qr50, "median"), (qr85, "85%-q")]:
    df_pred = df.copy()
    df_pred["type"] = "predicted_" + name
    df_pred["body_mass_g"] = m.predict(X)
    dfs.append(pl.from_pandas(df_pred))
df_pred = pl.concat(dfs).to_pandas()

sns.catplot(df_pred, x="species", y="body_mass_g", hue="type", alpha=0.5)
Quantile regression estimates and observed values for species (x-axis).

This plot per species seems suspicious. We would expect the 85%-quantile prediction in red to mostly be larger than the median (green); instead, we detect several clusters. The reason is the sex of the penguins, which also enters the model. To demonstrate this, I plot the sexes separately and make the species visible by the plot style. This time, I put the flipper length on the x-axis.

sns.relplot(
    df_pred,
    x="flipper_length_mm",
    y="body_mass_g",
    hue="type",
    col="sex",
    style="species",
    alpha=0.5,
)
Quantile regression estimates and observed values vs flipper length (x-axis). Left: male; right: female

This is a really nice plot:

  • One immediately observes the difference between the sexes in the left and right subplots.
  • The two clusters within each subplot are well explained by the penguin species.
  • The three quantiles are now vertically in the expected order: the 15%-quantile (yellow) lowest, the median (green) in the middle and the 85%-quantile (red) highest, with some observations lying beyond those predicted quantiles.
  • The models are linear in flipper length with a positive, but slightly different, slope per quantile level.

As a final check, let us compute the coverage of each quantile model (in-sample).

[
    np.mean(y <= qr15.predict(X)),
    np.mean(y <= qr50.predict(X)),
    np.mean(y <= qr85.predict(X)),
]
[0.15015015015015015, 0.5225225225225225, 0.8618618618618619]

Those numbers are close enough to 15%, 50% and 85%. This is expected: the first-order condition of the pinball loss, i.e. the identification function from above, forces roughly a fraction \alpha of the observations to lie below the fitted quantile.

As always, the full Python code is available as notebook: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2023-02-11%20empirical_quantiles.ipynb.

