Answer by Silverfish for How can I reconstruct a normal distribution from a set of percentiles?

You have a system of equations, all of the form $\mu + z_i \sigma \approx q_i$ where $q_i$ is one of your percentiles (or more generally, quantiles), and $z_i$ is the corresponding z-score. So $z \approx 1.96$ if you take the 97.5th percentile, for example. The equations only hold approximately rather than exactly due to sampling error. If we include an error term then we could write exact equations

$$\mu + \sigma z_i + \varepsilon_i = q_i$$

which look suspiciously like a regression equation...

Your system is overdetermined because you have five equations to estimate only two parameters. You could pick two equations and solve them simultaneously as Stephan Kolassa suggests. You almost certainly can't solve all five simultaneously because sampling variation means the parameter estimates in the five equations are not going to be consistent with each other. But you could take the "least squares" approach, trying to minimise the sum of squared errors across your five equations. This is simpler than it sounds: as the hint above suggests we just need to regress the quantiles against their z-scores. The intercept will be your estimated mean, and the slope your estimated standard deviation.

Here's R code for a function using your percentiles as defaults:

estimate_params <- function(
    percentiles,
    cum_probs = c(0.03, 0.10, 0.50, 0.90, 0.97)) {
  z_scores <- qnorm(cum_probs)
  model <- lm(percentiles ~ z_scores)
  params <- coefficients(model)
  names(params) <- c("mu", "sigma")
  params
}

Does this work? Let's simulate 10,000 data points with mean 42 and standard deviation 10.

set.seed(100)
true_mu <- 42
true_sigma <- 10
N <- 1e4
cum_probs <- c(0.03, 0.10, 0.50, 0.90, 0.97)
underlying_data <- rnorm(N, mean = true_mu, sd = true_sigma)
quantiles <- quantile(underlying_data, probs = cum_probs)
print(quantiles)
#>       3%      10%      50%      90%      97% 
#> 23.08953 29.56984 42.02545 54.68026 60.71053 
estimate_params(quantiles)
#>        mu     sigma 
#> 42.015122  9.936526

Not bad! Maybe it feels like overkill to use regression to solve what looked like a simple algebraic problem, but the regression routine in your software will be heavily optimised, so it's not a lot of work computationally. Still, if you are always taking the 3rd, 10th, 50th, 90th and 97th percentiles, there's no need to run a fresh regression analysis each time. We can just precompute a set of coefficients. Then every time you get your observed quantiles, you simply multiply them by these coefficients and sum the products to get your estimate $\hat \mu$ for the mean or $\hat \sigma$ for the SD. Even less computational work! Let's look at your equations again, in matrix form, assuming you are given $n$ quantiles.

$$\begin{bmatrix} 1 & z_1 \\ 1 & z_2 \\ \vdots & \vdots \\ 1 & z_n \end{bmatrix} \begin{bmatrix} \hat \mu \\ \hat \sigma \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix} $$

The matrix on the left is called the design matrix, commonly denoted $\mathbf X$. Denote the vector of observed quantiles as $\mathbf q$. The least-squares solution for the estimated parameters is $(\mathbf{X}^t\mathbf{X})^{-1} \mathbf{X}^t \mathbf{q}$. But if your percentiles are fixed, then the only part of this that varies from sample to sample is the observed quantiles $\mathbf{q}$ and we can precompute the matrix $(\mathbf{X}^t\mathbf{X})^{-1} \mathbf{X}^t$ that multiplies them. We continue the previous R script.

precompute_coefficients <- function(cum_probs) {
  n <- length(cum_probs)
  z_scores <- qnorm(cum_probs)
  X <- matrix(c(rep(1, n), z_scores), ncol = 2, byrow = FALSE)
  solve(t(X) %*% X) %*% t(X)
}
our_coefs <- precompute_coefficients(cum_probs = cum_probs)

# print to 8 decimal places
t(matrix(sprintf("%.8f", our_coefs), ncol = 2, byrow = TRUE))
#>      [,1]          [,2]          [,3]         [,4]         [,5]        
#> [1,] "0.20000000"  "0.20000000"  "0.20000000" "0.20000000" "0.20000000"
#> [2,] "-0.18155223" "-0.12370764" "0.00000000" "0.12370764" "0.18155223"

# check matrix multiplication gives same result as regression
our_coefs %*% quantiles
#>           [,1]
#> [1,] 42.015122
#> [2,]  9.936526

So with your percentiles, you can easily calculate the least-squares estimates of the mean and SD as:

\begin{align}
\hat \mu &= \frac{q_{0.03} + q_{0.1} + q_{0.5} + q_{0.9} + q_{0.97}}{5} \\
\hat \sigma &= 0.18155223(q_{0.97} - q_{0.03}) + 0.12370764(q_{0.9} - q_{0.1})
\end{align}
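If these five percentiles are what you will always be given, you could wrap those numbers in a small helper. This is a hypothetical convenience function of my own (the coefficients are hard-coded, so it is only valid for the 3rd, 10th, 50th, 90th and 97th percentiles):

# Hypothetical convenience wrapper: hard-codes the least-squares coefficients
# for the 3rd, 10th, 50th, 90th and 97th percentiles
estimate_params_fixed <- function(q) {
  mu_hat    <- mean(q)
  sigma_hat <- 0.18155223 * (q[5] - q[1]) + 0.12370764 * (q[4] - q[2])
  c(mu = unname(mu_hat), sigma = unname(sigma_hat))
}
estimate_params_fixed(quantiles)
# should reproduce the regression estimates above, up to rounding of the
# printed coefficients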


In fact some of your percentiles are more informative than others, so treating them all the same way is not the best choice. The issue is that some sample percentiles have a lower variance than others — in other words they consistently tend to come out slightly closer to the true (population) percentile than the ones with high variance. We should give these estimates more importance in our calculation, which we could do by applying weighted least squares in our regression.

The variances you need come from the asymptotic distribution of the $p_i$-th sample quantile of a continuous random variable:

$$\text{Sample quantile } p_i \sim \mathcal{N}\left(x_i, \frac{p_i(1-p_i)}{Nf(x_i)^2} \right)$$

where the sample size is $N$, the cumulative distribution function is $F(x)$, the probability density function is $f(x) = F'(x)$, and the corresponding population quantile is $x_i = F^{-1}(p_i)$.
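To make the formula concrete, here is a tiny illustrative helper (my own, not needed for anything that follows) that evaluates this asymptotic variance for given $p_i$, $N$ and $f(x_i)$:

# Illustrative helper: asymptotic variance of the p-th sample quantile, given
# the sample size N and the density f evaluated at the population quantile
quantile_asymptotic_var <- function(p, N, f_at_quantile) {
  p * (1 - p) / (N * f_at_quantile^2)
}
# e.g. the sample median of N = 1e4 standard normal observations
quantile_asymptotic_var(p = 0.5, N = 1e4, f_at_quantile = dnorm(0))
#> [1] 0.0001570796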

Write $\Phi$ and $\phi$ for the CDF and PDF of the standard normal distribution.

Now consider the case where the underlying distribution is itself normal, and let $z_i = \Phi^{-1}(p_i)$ be the z-score corresponding to our quantile. Looking at the PDF curves, the normal density $f$ must be a factor of $\sigma$ wider than the standard normal density $\phi$, so it can only be $1/\sigma$ times the height (to ensure both curves enclose an area of one). Hence $f(x_i) = \phi(z_i) / \sigma$ and our asymptotic distribution is:

$$\text{Sample quantile } p_i \sim \mathcal{N}\left(\mu + z_i \sigma, \frac{\sigma^2 p_i(1-p_i)}{N\phi(z_i)^2} \right)$$

which unfortunately features the unknown $\sigma$ in the numerator of our sampling variance. The good news is that this unknown factor $\sigma^2$ is the same for all our quantiles.
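If you want to convince yourself of the $f(x_i) = \phi(z_i)/\sigma$ step numerically, a quick check with arbitrary values is:

# Quick check that dnorm(x, mu, sigma) equals dnorm(z) / sigma, where
# z = (x - mu) / sigma; the values of mu, sigma and x here are arbitrary
mu <- 42
sigma <- 10
x <- seq(20, 60, by = 5)
z <- (x - mu) / sigma
all.equal(dnorm(x, mean = mu, sd = sigma), dnorm(z) / sigma)
#> [1] TRUE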

Let's see how much difference this makes in practice by taking 1000 samples from our previous underlying distribution, and checking how much variation there is in our observed percentiles.

set.seed(100)
true_mu <- 42
true_sigma <- 10
N <- 1e4
cum_probs <- c(0.03, 0.10, 0.50, 0.90, 0.97)
z_scores <- qnorm(cum_probs)
generate_quantiles <- function() {
  underlying_data <- rnorm(N, mean = true_mu,
                           sd = true_sigma)
  quantile(underlying_data, probs = cum_probs)
}
repeated_quantiles <- replicate(1000, generate_quantiles())
print(repeated_quantiles[, 1:8])
#>         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
#> 3%  23.08953 23.04794 23.07882 23.44405 23.41065 22.73976 23.06871 23.44198
#> 10% 29.56984 29.08846 29.00956 29.32837 29.45000 29.15969 28.90494 29.40374
#> 50% 42.02545 41.97033 41.86143 41.83007 41.99313 41.95614 41.81269 42.17988
#> 90% 54.68026 54.51736 54.97305 54.84518 55.07962 54.67819 54.72953 55.17290
#> 97% 60.71053 60.72449 60.75950 60.42161 61.04273 60.54929 60.50323 60.87246
observed_quantile_vars <- apply(repeated_quantiles, 1, var)
theoretical_quantile_vars <- true_sigma^2 * cum_probs * (1-cum_probs) /
                             (N * dnorm(z_scores)^2)
names(theoretical_quantile_vars) <- names(observed_quantile_vars)
print(observed_quantile_vars)
#>         3%        10%        50%        90%        97% 
#> 0.06781734 0.02857057 0.01576452 0.02919835 0.06312368 
print(theoretical_quantile_vars)
#>         3%        10%        50%        90%        97% 
#> 0.06285495 0.02922110 0.01570796 0.02922110 0.06285495 

So exactly as our formula suggests, we should be putting more trust in the sample median because it has considerably less sampling variability than the more extreme percentiles. As is often the way, our sample gives us a better idea what the middle of our distribution looks like, while the picture in the tails is less clear.

Now let's see to what extent we benefit from taking weights that are proportional to the reciprocals of the observation variances. (Proportionality means we can forget about the unknown $\sigma^2$, fortunately. We can also ignore the factor of $N$.) This would be the correct thing to do if our observations were independent. In fact the sample quantiles will clearly be correlated with each other — if the 50th percentile is high, then we'd expect the 51st to be high too — and ideally we'd calculate weights based on the inverse of the variance-covariance matrix instead of the reciprocals of the variances. This is left as an exercise to the reader! (Note that you don't have to use a theoretical formula for the variance-covariance matrix: if you'll accept a slightly imperfect result, you could just estimate it by repeated sampling, similarly to how the code above estimates the variances before comparing them to the theoretical ones.) The following R script is a continuation of the previous one.

estimate_params_using_weights <- function(
    percentiles,
    cum_probs = c(0.03, 0.10, 0.50, 0.90, 0.97)) {
  z_scores <- qnorm(cum_probs)
  weights <- (dnorm(z_scores)^2) / (cum_probs * (1-cum_probs))
  model <- lm(percentiles ~ z_scores, weights = weights)
  params <- coefficients(model)
  names(params) <- c("mu", "sigma")
  return(params)
}

# Estimate parameters from the quantiles we repeatedly sampled before
param_estimates_without_weights <- apply(
  repeated_quantiles,
  2, # means we apply to each column
  estimate_params)
print(param_estimates_without_weights[, 1:7])
#>            [,1]      [,2]     [,3]      [,4]     [,5]     [,6]     [,7]
#> mu    42.015122 41.869716 41.93647 41.973855 42.19523 41.81662 41.80382
#> sigma  9.936526  9.986012 10.05289  9.869981 10.00277 10.02124  9.99102
param_estimates_with_weights <- apply(
  repeated_quantiles,
  2,
  estimate_params_using_weights)
print(param_estimates_with_weights[, 1:7])
#>            [,1]      [,2]     [,3]      [,4]     [,5]     [,6]     [,7]
#> mu    42.042689 41.884098 41.92686 41.957198 42.15190 41.88012 41.80940
#> sigma  9.899188  9.968667 10.07343  9.892829 10.00188 10.00382 10.01361

Looking at the first 7 sets of parameter estimates, it isn't obvious that applying weights was very beneficial: in the very first sample it makes both the $\hat \mu$ and $\hat \sigma$ estimates worse! Nevertheless, we do better to judge across all 1000 samples.

# Density plot of mu estimates
mu_estimate_density_without_weights <- density(param_estimates_without_weights[1, ])
mu_estimate_density_with_weights <- density(param_estimates_with_weights[1, ])
plot(mu_estimate_density_with_weights, col = "red", main = "Estimated mu")
lines(mu_estimate_density_without_weights, col = "blue")
legend(x = min(param_estimates_without_weights[1, ]),
       y = max(mu_estimate_density_with_weights$y),
       legend = c("weighted", "unweighted"),
       col = c("red", "blue"),
       lty = 1)

[Figure: Estimating mu from sample quantiles with and without weighting (density plot)]

# Check the variance of our estimators
apply(
  param_estimates_without_weights,
  1, # applying variance over the rows
  var)
#>          mu       sigma 
#> 0.014694863 0.006743304 
apply(
  param_estimates_with_weights,
  1,
  var)
#>         mu      sigma 
#> 0.01180019 0.00622899 

So we do see a reduction in the variance of our estimates when we apply weighting, and the density plot shows a sharper peak around the true value. These weights are not theoretically optimal since they did not take correlation of sample quantiles into account, but they still seem to be better than not weighting at all. We could also try optimising the weights empirically, as Stephan's answer hints at, which may be helpful if your underlying distribution is not normal.
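As a rough sketch of the covariance-based refinement that was left as an exercise above, you could estimate the variance-covariance matrix of the sample quantiles empirically from the repeated samples and plug it into the generalised least-squares formula $(\mathbf{X}^t\mathbf{\Sigma}^{-1}\mathbf{X})^{-1} \mathbf{X}^t \mathbf{\Sigma}^{-1} \mathbf{q}$. This is only an illustration, assuming repeated_quantiles and z_scores from the earlier script are still in your workspace:

# Sketch only: generalised least squares with an empirically estimated
# covariance matrix of the sample quantiles
Sigma_hat <- cov(t(repeated_quantiles)) # 5 x 5 covariance across the 1000 samples
Sigma_inv <- solve(Sigma_hat)
X_gls <- cbind(1, z_scores)             # same design matrix as before
gls_coefs <- solve(t(X_gls) %*% Sigma_inv %*% X_gls) %*% t(X_gls) %*% Sigma_inv
gls_coefs %*% repeated_quantiles[, 1]   # GLS estimates of mu and sigma for the first sample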

As before, we can precompute the coefficients we want rather than running a fresh WLS regression on each new sample result. The difference between weighted and ordinary least-squares regression is the presence of the weights matrix $\mathbf{W}$ in the formula. Since we ignored covariances, for us this is simply a diagonal matrix of the weights we found earlier. Our vector of parameter estimates is given by $(\mathbf{X}^t\mathbf{W}\mathbf{X})^{-1} \mathbf{X}^t \mathbf{W}$ (which we precompute) multiplied by the vector of quantiles $\mathbf{q}$.

precompute_coefficients_using_weights <- function(cum_probs) {
  n <- length(cum_probs)
  z_scores <- qnorm(cum_probs)
  weights <- (dnorm(z_scores)^2) / (cum_probs * (1-cum_probs))
  W <- diag(weights)
  X <- matrix(c(rep(1, n), z_scores), ncol = 2, byrow = FALSE)
  solve(t(X) %*% W %*% X) %*% t(X) %*% W
}
our_coefs_with_weights <- precompute_coefficients_using_weights(cum_probs = cum_probs)

# print to 8 decimal places
t(matrix(sprintf("%.8f", our_coefs_with_weights), ncol = 2, byrow = TRUE))
#>      [,1]          [,2]          [,3]          [,4]         [,5]        
#> [1,] "0.09705444"  "0.20876532"  "0.38836046"  "0.20876532" "0.09705444"
#> [2,] "-0.13300941" "-0.19494866" "-0.00000000" "0.19494866" "0.13300941"

# check matrix multiplication gives same result as regression
our_coefs_with_weights %*% repeated_quantiles[, 1]
#>           [,1]
#> [1,] 42.042689
#> [2,]  9.899188

We see the weighted coefficients de-emphasise the extreme percentiles.

\begin{align}
\hat \mu &= 0.38836046 q_{0.5} + 0.20876532(q_{0.1} + q_{0.9}) + 0.09705444(q_{0.03} + q_{0.97}) \\
\hat \sigma &= 0.13300941(q_{0.97} - q_{0.03}) + 0.19494866(q_{0.9} - q_{0.1})
\end{align}
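Again, if these five percentiles are what you will always receive, the weighted coefficients can be wrapped in a small helper. As before, this is a hypothetical convenience function of my own with the coefficients hard-coded:

# Hypothetical convenience wrapper for the weighted closed-form estimates,
# valid only for the 3rd, 10th, 50th, 90th and 97th percentiles
estimate_params_fixed_weighted <- function(q) {
  mu_hat    <- 0.38836046 * q[3] + 0.20876532 * (q[2] + q[4]) + 0.09705444 * (q[1] + q[5])
  sigma_hat <- 0.13300941 * (q[5] - q[1]) + 0.19494866 * (q[4] - q[2])
  c(mu = unname(mu_hat), sigma = unname(sigma_hat))
}
estimate_params_fixed_weighted(repeated_quantiles[, 1])
# should reproduce the WLS estimates for the first sample (roughly 42.04 and 9.90),
# up to rounding of the printed coefficients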

