Median, quantiles and quantile regression as linear programming problems

Quantile regression is a somewhat exotic type of regression [1,2,3]. It can be seen as a generalization of $\ell_1$ or LAD (least absolute deviations) regression [4], and, just as with LAD regression, we can formulate and solve it as an LP.

First I want to discuss some preliminaries: how to find the median and the quantiles of a data vector $\color{darkblue}y$. That will give us the tools to formulate the quantile regression problem as an LP. The reason for adding these preliminary steps is to develop some intuition about how quantile regression problems are defined. I found that most papers just "define" the underlying optimization problem, without much justification. I hope to show with these small auxiliary models how we arrive at the quantile regression model. Along the way, we encounter some interesting titbits. I'll discuss a few details that papers typically gloss over or even skip, but that I find fascinating.

Problem 1: Finding the median as an optimization problem

The (sample) median is the "middle observation". Conceptually, finding it is simple: sort the data and pick the middle observation. For an even number of observations, we have two middle observations, and often the average of these two is used. Here, in our models, we are a little more relaxed about this: we allow any value between the two middle observations. Roughly, the median can be defined here as a number $\color{darkred}m$ such that half of the data is below and half of the data is above $\color{darkred}m$.

Suppose we have data $\color{darkblue}y_i$. Then the following non-linear optimization problem can find the median $\color{darkred}m$:

NLP model for finding the median
\[\min\>\sum_i \left|\color{darkblue}y_i - \color{darkred}m\right|\]

You may want to think a little bit about how this model indeed finds the median (or rather a median with 50% of the data below and 50% above).
This model looks deceptively simple. Actually, this is a nasty non-differentiable NLP problem. Many general-purpose NLP solvers may have a really difficult time with it. Most of them will not be able to establish (local) optimality. In the box below are some of the frightening messages you may encounter. Standard NLP solvers expect smoothness: functions and gradients should be continuous. Ignoring this may be a bad idea.
This is a great example of why we always need to reformulate absolute values. Absolute values are wolves in sheep's clothing and they will eat you alive.
Note that this model also works if the $\color{darkblue}y_i$ are endogenous (decision variables) instead of just exogenous data. We will use this property later.
Why not just sort the data? This is related to the previous point. Sorting data is easy, sorting decision variables is a different thing altogether and much, much more complicated.
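
One informal way to see why the objective above yields a median, sketched here only for values of $\color{darkred}m$ that do not coincide with a data point (so that the derivative exists): each term $|\color{darkblue}y_i-\color{darkred}m|$ has slope $+1$ when $\color{darkblue}y_i \lt \color{darkred}m$ and slope $-1$ when $\color{darkblue}y_i \gt \color{darkred}m$, so

\[\frac{d}{d\color{darkred}m}\sum_i \left|\color{darkblue}y_i - \color{darkred}m\right| = \#\{i : \color{darkblue}y_i \lt \color{darkred}m\} - \#\{i : \color{darkblue}y_i \gt \color{darkred}m\}\]

The objective decreases as long as fewer observations lie below $\color{darkred}m$ than above it, and increases once more observations lie below than above. The minimum is therefore reached where the two counts balance, which is the median. At the data points themselves the slope jumps, and that is exactly the non-differentiability that troubles the NLP solvers.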

----     11 PARAMETER y  data

i1 0.172,    i2 0.843,    i3 0.550,    i4 0.301,    i5 0.292


**** SOLVER STATUS     4 Terminated By Solver
**** MODEL STATUS      7 Feasible Solution
**** OBJECTIVE VALUE                0.9297


                           LOWER          LEVEL          UPPER         MARGINAL

---- VAR m                 -INF            0.3011        +INF            1.0000  NOPT
---- VAR z                 -INF            0.9297        +INF             .

  m  median
  z  objective



Scary messages from different solvers:


Conopt:
 ** Feasible solution. Convergence too slow. The change in objective
    has been less than 3.0000E-12 for 20 consecutive iterations


MINOS:
 EXIT - The current point cannot be improved.


SNOPT:
 EXIT - Current point cannot be improved.


IPOPT:
 EXIT: Restoration Failed!
 Final point is feasible: scaled constraint violation (0) is below tol (1e-08) and unscaled constraint violation (0) is below constr_viol_tol (0.0001).


Knitro:
  EXIT: Primal feasible solution estimate cannot be improved; desired accuracy
        in dual feasibility could not be achieved.
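
As a cross-check on the solver log above, the objective can be evaluated directly. The sketch below is plain Python and not part of the original GAMS models. It uses the rounded data values from the listing and relies on the fact that a piecewise linear convex function attains its minimum at one of its breakpoints, which here are the data points. The function name sum_abs_dev is made up for this illustration.

```python
# Rounded data from the GAMS listing (the solver worked with more digits).
y = [0.172, 0.843, 0.550, 0.301, 0.292]

def sum_abs_dev(m, data):
    """Objective of the NLP model: sum of absolute deviations from m."""
    return sum(abs(v - m) for v in data)

# The objective is piecewise linear and convex in m, with breakpoints at the
# data points, so it is enough to compare the objective at those points.
best_m = min(y, key=lambda m: sum_abs_dev(m, y))
best_z = sum_abs_dev(best_m, y)

print(f"m = {best_m:.3f}, z = {best_z:.3f}")
```

Up to the rounding of the data, this agrees with the levels m = 0.3011 and z = 0.9297 in the listing. That suggests the NLP solver did stop at the optimal point, but could not prove it.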

We can reformulate absolute values in a linear fashion, so we end up with an LP. There are (at least) two ways to do this:

LP model 1 for finding the median
\[\begin{align}\min&\sum_i\color{darkred}e_i \\ & \color{darkred}e_i\ge\color{darkblue}y_i -\color{darkred}m&& \forall i \\ & \color{darkred}e_i \ge -(\color{darkblue}y_i -\color{darkred}m) && \forall i \\ &\color{darkred}e_i\ge 0\end{align}\]

or using variable splitting:

LP model 2 for finding the median
\[\begin{align}\min&\sum_i \left(\color{darkred}e^+_i+\color{darkred}e^-_i\right) \\ & \color{darkred}e^+_i-\color{darkred}e^-_i =\color{darkblue}y_i -\color{darkred}m && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align}\]

For the last model it may be instructive to show the errors $\color{darkred}e^+_i,\color{darkred}e^-_i$:

----     69 VARIABLE e2.L  absolute value, formulation 2

             +           -

i1       0.129
i2                   0.542
i3                   0.249
i5       0.009

For every observation $i$ we see that at most one of $\color{darkred}e^+_i,\color{darkred}e^-_i$ is nonzero. This is what we would expect. Note that there is one error that is zero: observation i4 is the median itself, so it does not appear in the listing.
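
A short argument for why an optimal solution never has both parts positive, added here as a sketch: suppose $\color{darkred}e^+_i>0$ and $\color{darkred}e^-_i>0$ for some $i$, and let $\delta=\min(\color{darkred}e^+_i,\color{darkred}e^-_i)>0$. Then

\[(\color{darkred}e^+_i-\delta)-(\color{darkred}e^-_i-\delta) = \color{darkred}e^+_i-\color{darkred}e^-_i = \color{darkblue}y_i-\color{darkred}m\]

so subtracting $\delta$ from both variables keeps the constraint satisfied, keeps both variables nonnegative, and lowers the objective by $2\delta$. Such a solution can therefore not be optimal.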

These models do not have the same problems as the NLP model. They solve easily to optimality. In the next sections, we will see how LP model 2 is really the building block for quantile regression.

Even though this model is just a piece of the puzzle, we can learn a lot from it.

Problem 2: Finding a quantile as an optimization problem

We can generalize the concept of a median to quantiles. E.g., a 0.75 quantile gives us the number such that 75% of the data is below (and 25% is above). Here is an optimization model for this:

NLP model for finding the $\tau$-th quantile
\[\min\>\sum_{i|\color{darkblue}y_i\ge\color{darkred}q} \color{darkblue}\tau\left|\color{darkblue}y_i - \color{darkred}q \right| +\sum_{i|\color{darkblue}y_i\lt\color{darkred}q} (1-\color{darkblue}\tau)\left|\color{darkblue}y_i-\color{darkred}q \right| \]

This is essentially a weighted version of our median model, with weights $\color{darkblue}\tau$ and $(1-\color{darkblue}\tau)$. We don't even try to solve this as stated. Note that the summations themselves are already difficult, because the index sets depend on the decision variable $\color{darkred}q$. However, when we take a step back we see that the first summation is over the positive parts of $\color{darkblue}y_i-\color{darkred}q$ and the second one over the negative parts. Splitting a quantity into positive and negative parts is something we already know how to do: variable splitting.

LP model for finding the $\tau$-th quantile
\[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot \color{darkred}e^+_i +(1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right) \\ &\color{darkred}e^+_i -\color{darkred}e^-_i = \color{darkblue}y_i - \color{darkred}q && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align} \]

Let's try this on some random data:

----     15 PARAMETER y  data

i1  25.457,    i2  85.894,    i3  59.534,    i4  37.102,    i5  36.299,    i6  30.165,    i7  41.485,    i8  87.064
i9  16.040,    i10 55.019,    i11 99.831,    i12 62.086,    i13 99.202,    i14 78.603,    i15 21.762,    i16 67.575
i17 24.357,    i18 32.507,    i19 70.204,    i20 49.182,    i21 42.373,    i22 41.630,    i23 21.834,    i24 23.509
i25 63.020


----     15 SET t  quantile levels

0   ,    0.25,    0.5 ,    0.75,    1


----     52 PARAMETER quantiles  Solution

0    16.040,    0.25 30.165,    0.5  42.373,    0.75 67.575,    1    99.831

Here we solved 5 LPs for quantile levels 0, 0.25, 0.5, 0.75, and finally 1.
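
The quantiles listed above can be cross-checked without an LP solver. The Python sketch below is an illustration only, not the method used in this post: for a fixed $q$ the cheapest choice of the split variables is the positive and negative part of $y_i-q$, and because the resulting objective is piecewise linear and convex in $q$, it is enough to try the data points themselves. The helper names weighted_loss and quantile_by_enumeration are made up here.

```python
# Data copied from the GAMS listing above (25 observations).
y = [25.457, 85.894, 59.534, 37.102, 36.299, 30.165, 41.485, 87.064,
     16.040, 55.019, 99.831, 62.086, 99.202, 78.603, 21.762, 67.575,
     24.357, 32.507, 70.204, 49.182, 42.373, 41.630, 21.834, 23.509,
     63.020]

def weighted_loss(q, tau, data):
    """LP objective for a fixed q: tau*e_plus + (1-tau)*e_minus per point."""
    total = 0.0
    for v in data:
        e_plus = max(v - q, 0.0)    # positive part of y_i - q
        e_minus = max(q - v, 0.0)   # negative part of y_i - q
        total += tau * e_plus + (1.0 - tau) * e_minus
    return total

def quantile_by_enumeration(tau, data):
    """Best q among the data points (hypothetical helper, not from the post)."""
    return min(data, key=lambda q: weighted_loss(q, tau, data))

quantiles = {tau: quantile_by_enumeration(tau, y)
             for tau in (0.0, 0.25, 0.5, 0.75, 1.0)}
for tau, q in quantiles.items():
    print(tau, q)
```

The enumeration reproduces the five values in the listing. For the levels 0 and 1 the optimum is not unique: with $\tau=0$ every $q$ at or below the smallest observation has objective zero, and with $\tau=1$ every $q$ at or above the largest one does. Both the enumeration and, typically, a simplex-based LP solver return the extreme observation itself.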

When we do the same in R, we see identical results.

On some data sets, we may see slight differences between the optimization model and the R code. This is because R has 9 different types of quantiles! To match the optimization model more closely, use type=1.
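
The type=1 definition is the inverse of the empirical distribution function. A minimal Python sketch of that rule follows. It is an illustration written for this post, not R's actual implementation (R additionally guards against floating-point fuzz in $n\cdot p$), and the function name quantile_type1 is made up here.

```python
import math

def quantile_type1(data, p):
    """Inverse empirical CDF: smallest order statistic x_(k) with k/n >= p."""
    s = sorted(data)
    n = len(s)
    # 1-based rank k = ceil(n*p); p = 0 is mapped to the minimum.
    k = max(math.ceil(n * p), 1)
    return s[k - 1]

sample = [3.0, 1.0, 4.0, 2.0]
levels = (0.0, 0.25, 0.5, 0.75, 1.0)
result = [quantile_type1(sample, p) for p in levels]
print(result)
```

For $p=0.5$ this rule returns 2 on the four-point sample, while the LP model accepts any value between 2 and 3 as optimal and an interpolating quantile type such as R's default returns 2.5. Ambiguity of this kind arises when $n\cdot\tau$ is an integer. In the 25-point example, $n\cdot\tau$ is fractional for $\tau$ = 0.25, 0.5 and 0.75, so the LP optimum is unique there.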

Problem 3: Quantile regression

Finally, let's look at quantile regression. The underlying optimization model looks like [1]:

NLP model for quantile regression
\[\begin{align}\min&\sum_{i|\color{darkred}e_i\ge 0} \color{darkblue}\tau|\color{darkred}e_i| +\sum_{i|\color{darkred}e_i\lt 0} (1-\color{darkblue}\tau) |\color{darkred}e_i| \\ & \color{darkred}e = \color{darkblue}y-\color{darkblue}X\cdot\color{darkred}\beta\end{align}\]

We can linearize this with our familiar tools:

LP model for quantile regression
\[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot\color{darkred}e^+_i + (1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right)\\ & \color{darkred}e^+_i - \color{darkred}e^-_i = \color{darkblue}y_i-\sum_j \color{darkblue}X_{i,j}\cdot\color{darkred}\beta_j && \forall i\\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align}\]
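
To make the structure of this LP concrete, here is a small Python sketch that assembles it in the generic form $\min c^Tx$ subject to $Ax=b$. This is an illustration under assumptions, not the GAMS implementation from the appendix: the column ordering, the helper names build_quantile_lp and point_from_beta, and the tiny data set are all made up here. No solver is called. The code only checks that a point built from a given $\beta$ is feasible and evaluates its objective.

```python
def build_quantile_lp(X, y, tau):
    """Assemble min c'x s.t. A x = b for the quantile regression LP.

    Column order (an arbitrary choice for this sketch):
    [beta_1..beta_k, e_plus_1..e_plus_n, e_minus_1..e_minus_n].
    The beta columns are free; the e columns are nonnegative.
    """
    n, k = len(X), len(X[0])
    c = [0.0] * k + [tau] * n + [1.0 - tau] * n
    A = []
    for i in range(n):
        e_plus = [1.0 if r == i else 0.0 for r in range(n)]
        e_minus = [-1.0 if r == i else 0.0 for r in range(n)]
        # Row i: sum_j X[i][j]*beta_j + e_plus_i - e_minus_i = y_i
        A.append(list(X[i]) + e_plus + e_minus)
    b = list(y)
    lower = [None] * k + [0.0] * (2 * n)   # None marks a free variable
    return c, A, b, lower

def point_from_beta(X, y, beta):
    """Cheapest feasible x for a fixed beta: split each residual by sign."""
    resid = [yi - sum(xij * bj for xij, bj in zip(row, beta))
             for row, yi in zip(X, y)]
    return (list(beta) + [max(r, 0.0) for r in resid]
            + [max(-r, 0.0) for r in resid])

# Tiny made-up data set: intercept column plus one regressor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 2.0, 5.0]
c, A, b, lower = build_quantile_lp(X, y, tau=0.75)

x = point_from_beta(X, y, beta=[1.0, 1.0])
row_values = [sum(a * v for a, v in zip(row, x)) for row in A]
objective = sum(ci * xi for ci, xi in zip(c, x))
print(row_values, objective)
```

The arrays c, A, b and the bounds could be handed to any LP solver that accepts equality constraints and free variables. With $X$ consisting of a single column of ones, the model collapses to the quantile LP of Problem 2, with $\beta_1$ playing the role of $q$.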

When we run this model on the data from [3], we get:

----     96 PARAMETER estimates  

         suppins         age       white      female      totchr   intercept

0.25     453.444      16.083     338.083      16.056     782.472   -1412.889
0.5      687.222      35.111     632.889    -260.556    1332.833   -2252.556
0.75     708.409      87.364     801.682    -554.591    2855.318   -4512.045

The analysis in [3] focuses on the variable totchr. Indeed the coefficient for this variable changes a lot for different quantile levels. The R code from [3] produces:

> results <- rq(Y ~ X, data=mydata, tau=c(0.25, 0.5, 0.75))
> summary(results)

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.25

Coefficients:
            Value       Std. Error  t value     Pr(>|t|)   
(Intercept) -1412.88889   433.20179    -3.26150     0.00112
Xsuppins      453.44444    75.05348     6.04162     0.00000
Xtotchr       782.47222    37.55769    20.83388     0.00000
Xage           16.08333     6.19162     2.59760     0.00943
Xfemale        16.05556    72.20278     0.22237     0.82404
Xwhite        338.08333    71.51522     4.72743     0.00000

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.5

Coefficients:
            Value       Std. Error  t value     Pr(>|t|)   
(Intercept) -2252.55556   846.23023    -2.66187     0.00781
Xsuppins      687.22222   137.29264     5.00553     0.00000
Xtotchr      1332.83333    74.77913    17.82360     0.00000
Xage           35.11111    11.29450     3.10869     0.00190
Xfemale      -260.55556   150.46285    -1.73169     0.08343
Xwhite        632.88889   243.05734     2.60387     0.00926

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.75

Coefficients:
            Value       Std. Error  t value     Pr(>|t|)   
(Intercept) -4512.04545  2350.56284    -1.91956     0.05501
Xsuppins      708.40909   375.76929     1.88522     0.05950
Xtotchr      2855.31818   196.12587    14.55860     0.00000
Xage           87.36364    30.98410     2.81963     0.00484
Xfemale      -554.59091   378.71501    -1.46440     0.14319
Xwhite        801.68182   370.96108     2.16109     0.03077

We see our LP model can reproduce the R results.

Conclusion

Quantile regression is based on a rather simple, but non-obvious LP model. I have tried to informally derive this model using small LP models for the calculation of the median and the quantiles of a data vector. After this, the LP model for the quantile regression problem is rather obvious.

Explicitly formulating and implementing the LP model helps to make things a bit less of a black box. After this exercise, running quantile regressions in, say, R is more transparent: you know what you are doing.

References

[1] Koenker, R., and Bassett, G. W. (1978). “Regression Quantiles.” Econometrica 46:33–50.
[2] Quantile Regression, https://en.wikipedia.org/wiki/Quantile_regression
[3] Ani Katchova, Quantile Regression, https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression (a very good introduction with examples)
[4] Linear Programming and L1 Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
[5] Median, https://en.wikipedia.org/wiki/Median
[6] Quantile, https://en.wikipedia.org/wiki/Quantile

Appendix A: GAMS code for non-smooth NLP model to calculate the median

Try with different NLP solvers to see what happens.

To have GAMS accept a function like abs() inside an NLP model, we need to declare the model as a DNLP. This is a warning that trouble may be ahead.

$ontext

   find median using a non-smooth NLP model

$offtext

set i /i1*i5/;

parameter y(i) 'data';
y(i) = uniform(0,1);
display y;

variable
   m 'median'
   z 'objective'
;

equation sumabs;

sumabs.. z =e= sum(i, abs(y(i)-m));

* initial value
m.l = 0.5;

model median /sumabs/;
option dnlp=conopt;
solve median minimizing z using dnlp;

display m.l;

Appendix B: GAMS code for finding the median as an LP

Both LP formulations are implemented here.

$ontext

   find median using an LP model

   use two formulations for the absolute values

$offtext

set
   i /i1*i5/
;

parameter y(i) 'data';
y(i) = uniform(0,1);
display y;

*-------------------------------------------------------------
* formulation 1: bounding
*-------------------------------------------------------------

variables
   m    'median'
   z    'objective'
;
positive variables
   e(i) 'absolute value, formulation 1'
;

equations
   obj1      'objective, formulation 1'
   bound1(i)
   bound2(i)
;

obj1.. z =e= sum(i, e(i));
bound1(i).. e(i) =g= y(i)-m;
bound2(i).. e(i) =g= -(y(i)-m);

model medianLP1 /obj1,bound1,bound2/;
solve medianLP1 minimizing z using lp;

display m.l;

*-------------------------------------------------------------
* formulation 2: variable splitting
*-------------------------------------------------------------
set
   pm /'+','-'/
;

positive variables
   e2(i,pm) 'absolute value, formulation 2'
;

equations
   obj2      'objective, formulation 2'
   split(i) 'variable splitting'
;

obj2.. z =e= sum((i,pm), e2(i,pm));
split(i).. e2(i,'+')-e2(i,'-') =e= y(i)-m;

model medianLP2 /obj2,split/;
solve medianLP2 minimizing z using lp;

display m.l;

Appendix C: GAMS code for finding quantiles.

$ontext

   find quantiles using an LP model

$offtext

set
   i 'observations'    /i1*i25/
   t 'quantile levels' /'0','0.25','0.5','0.75','1'/
;

parameter y(i) 'data';
y(i) = uniform(10,100);

display y,t;

*-------------------------------------------------------------
* variable splitting LP Model
*-------------------------------------------------------------

set
   pm /'+','-'/
;

scalar tau 'quantile';

positive variables e(i,pm) 'absolute value';

variable
   z 'objective variable'
   q 'quantile'
;

equations
   obj      'objective'
   split(i) 'variable splitting'
;

obj.. z =e= sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,'+')-e(i,'-') =e= y(i)-q;

model quantileLP /obj,split/;

*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------

parameter quantiles(t) "Solution";

loop(t,
   tau = t.val;
   solve quantileLP minimizing z using lp;
   quantiles(t) = q.l;
);
display quantiles;

Appendix D: GAMS code for quantile regression

$ontext

   Quantile regression optimization problem

   Data from:

      https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression

$offtext

*-------------------------------------------------------------
* data from csv file
*-------------------------------------------------------------

sets
   i 'observations'
   j0 'column names in csv file'
;

parameter data(i,*) 'all data';

$set csv d:\downloads\quantile_health.csv
$set gdx quantile_health.gdx

$call csv2gdx %csv% output=%gdx% id=data useHeader=T index=1 values=(2..8)

$gdxin %gdx%
$onmulti
$loaddc i=Dim1 j0=Dim2 data

display i,j0,data;

*-------------------------------------------------------------
* setup of regression data y,X
*-------------------------------------------------------------

set
   j 'independent variables' /intercept,suppins,totchr,age,female,white/
;

parameter
   y(i)   'dependent variable'
   X(i,j) 'independent variables'
;

y(i) = data(i,'totexp');
X(i,j) = data(i,j);
X(i,'intercept') = 1;

*-------------------------------------------------------------
* quantile regression LP model
*-------------------------------------------------------------

set
   pm /'+','-'/
;

scalar tau 'quantile';

positive variables e(i,pm) 'absolute value';

variables
   z 'objective variable'
   beta(j) 'estimates'
;

equations
   obj      'objective'
   split(i) 'variable splitting'
;

obj.. z =e= sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,'+')-e(i,'-') =e= y(i)-sum(j,X(i,j)*beta(j));

model quantileLP /obj,split/;

*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------

set q 'quantile levels'/'0.25','0.5','0.75'/;

parameter estimates(q,j);

loop(q,
   tau = q.val;
   solve quantileLP minimizing z using lp;
   estimates(q,j) = beta.l(j);
);

display estimates;
