Quantile regression is a somewhat exotic type of regression [1,2,3]. It can be seen as a generalization of \(\ell_1\) or LAD regression [4], and just like LAD regression, we can formulate and solve it as an LP.
First I want to discuss some preliminaries: how to find the median and the quantiles of a data vector \(\color{darkblue}y\). That will give us the tools to formulate the quantile regression problem as an LP. The reason for adding these preliminary steps is to develop some intuition about how quantile regression problems are defined. I found that most papers just "define" the underlying optimization problem, without much justification. I hope to show with these small auxiliary models how we arrive at the quantile regression model. Along the way, we encounter some interesting titbits. I'll discuss a few details that papers typically gloss over or even skip, but that I find fascinating.
Problem 1: Finding the median as an optimization problem
The (sample) median is the "middle observation". Conceptually this can be done quite simply: sort the data and pick the middle observation. For an even number of observations, we have two middle observations. Often we use the average of these two. Here, in our models, we are a little bit more relaxed about this: allow something in between the two middle observations. Roughly, the median can be defined here as a number \(\color{darkred}m\) such that half of the data is below and half the data is above \(\color{darkred}m\).
Suppose we have data \(\color{darkblue}y_i\). Then the following non-linear optimization problem can find the median \(\color{darkred}m\):
| NLP model for finding the median |
|---|
| \[\min\>\sum_i \left|\color{darkblue}y_i - \color{darkred}m\right|\] |
- You may want to think a little bit about how this model indeed finds the median (or rather a median with 50% of the data below and 50% above).
- This model looks deceptively simple. Actually, this is a nasty non-differentiable NLP problem. Many general-purpose NLP solvers may have a really difficult time with it. Most of them will not be able to establish (local) optimality. In the box below are some of the frightening messages you may encounter. Standard NLP solvers expect smoothness: functions and gradients should be continuous. Ignoring this may be a bad idea.
- This is a great example of why we always need to reformulate absolute values. Absolute values are wolves in sheep's clothing and they will eat you alive.
- Note that this model works also if the \(\color{darkblue}y_i\) are endogenous instead of just exogenous data. We will use this property later.
- Why not just sort the data? This is related to the previous point. Sorting data is easy, sorting decision variables is a different thing altogether and much, much more complicated.
---- 11 PARAMETER y data
i1 0.172, i2 0.843, i3 0.550, i4 0.301, i5 0.292
**** SOLVER STATUS 4 Terminated By Solver
**** MODEL STATUS 7 Feasible Solution
**** OBJECTIVE VALUE 0.9297
LOWER LEVEL UPPER MARGINAL
---- VAR m -INF 0.3011 +INF 1.0000 NOPT
---- VAR z -INF 0.9297 +INF .
m median
z objective
Scary messages from different solvers:
Conopt:
** Feasible solution. Convergence too slow. The change in objective
has been less than 3.0000E-12 for 20 consecutive iterations
MINOS:
EXIT - The current point cannot be improved.
SNOPT:
EXIT - Current point cannot be improved.
IPOPT:
EXIT: Restoration Failed!
Final point is feasible: scaled constraint violation (0) is below tol (1e-08) and unscaled constraint violation (0) is below constr_viol_tol (0.0001).
Knitro:
EXIT: Primal feasible solution estimate cannot be improved; desired accuracy
in dual feasibility could not be achieved.
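To see why the sum of absolute deviations picks out a median, and why smooth solvers struggle at the optimum, here is a minimal sketch in Python (not a language used elsewhere in this post). It relies on the objective being convex and piecewise linear with kinks at the data points, so it is enough to compare the data points themselves. The helper names `sum_abs` and `slope_bounds` are my own, and the data are the rounded values printed in the listing above.

```python
# Data as printed (rounded) in the GAMS listing.
y = [0.172, 0.843, 0.550, 0.301, 0.292]


def sum_abs(m, data):
    """Objective of the NLP model: sum of |y_i - m|."""
    return sum(abs(v - m) for v in data)


def slope_bounds(m, data):
    """Left and right derivative of the objective with respect to m."""
    below = sum(1 for v in data if v < m)
    above = sum(1 for v in data if v > m)
    equal = len(data) - below - above
    return below - above - equal, below - above + equal


# The objective is convex and piecewise linear with kinks at the data,
# so a minimizer can always be found among the data points.
best = min(y, key=lambda m: sum_abs(m, y))
left, right = slope_bounds(best, y)
print(best, round(sum_abs(best, y), 4), (left, right))
```

At the best point the one-sided slopes have opposite signs: as many observations lie below as above, which is the median property. The gradient jumps across zero instead of vanishing, so a solver that looks for a zero gradient cannot confirm optimality there.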
We can reformulate absolute values in a linear fashion, so we end up with an LP. There are (at least) two ways to do this:
| LP model 1 for finding the median |
|---|
| \[\begin{align}\min&\sum_i\color{darkred}e_i \\ & \color{darkred}e_i\ge\color{darkblue}y_i -\color{darkred}m&& \forall i \\ & \color{darkred}e_i \ge -(\color{darkblue}y_i -\color{darkred}m) && \forall i \\ &\color{darkred}e_i\ge 0\end{align}\] |
or using variable splitting:
| LP model 2 for finding the median |
|---|
| \[\begin{align}\min&\sum_i \left(\color{darkred}e^+_i+\color{darkred}e^-_i\right) \\ & \color{darkred}e^+_i-\color{darkred}e^-_i =\color{darkblue}y_i -\color{darkred}m && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align}\] |
For the last model it may be instructive to show the errors \(\color{darkred}e^+_i,\color{darkred}e^-_i\):
---- 69 VARIABLE e2.L absolute value, formulation 2

| | + | - |
|---|---|---|
| i1 | | 0.129 |
| i2 | 0.542 | |
| i3 | 0.249 | |
| i5 | | 0.009 |
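For every observation \(i\) we see that at most one of \(\color{darkred}e^+_i,\color{darkred}e^-_i\) is nonzero. This is what we would expect: if both were positive, we could lower both by the same amount, keep the constraint satisfied, and reduce the objective. Note that there is one observation (i4, the median itself) for which the error is zero.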
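The split can be checked by hand for a fixed \(m\). Below is a minimal Python sketch (my own illustration, not part of the GAMS models) that assumes the median \(m = 0.301\) and the rounded data from the first listing. The function name `split` is hypothetical.

```python
# Rounded data from the listing and the median found by the LP (observation i4).
y = {"i1": 0.172, "i2": 0.843, "i3": 0.550, "i4": 0.301, "i5": 0.292}
m = 0.301


def split(residual):
    """Return (e_plus, e_minus) with e_plus - e_minus == residual and at most one nonzero."""
    return max(0.0, residual), max(0.0, -residual)


errors = {i: split(v - m) for i, v in y.items()}
for i, (e_plus, e_minus) in errors.items():
    print(f"{i}  {e_plus:6.3f}  {e_minus:6.3f}")

# For fixed m this is the smallest objective value LP model 2 can reach.
objective = sum(e_plus + e_minus for e_plus, e_minus in errors.values())
print(round(objective, 4))
```

For a given \(m\), this choice of \(e^+_i, e^-_i\) is the cheapest one that satisfies the equality constraint, so the LP objective coincides with the sum of absolute deviations.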
These models will not have the same problems as the NLP model. They solve easily to optimality. In the next paragraphs, we will see how LP Model 2 is really the building block for Quantile Regression.
Even though this model is just a piece of the puzzle, we can learn a lot from it.
Problem 2: Finding a quantile as an optimization problem
We can generalize the concept of a median to quantiles. E.g., a 0.75 quantile gives us the number such that 75% of the data is below (and 25% is above). Here is an optimization model for this:
| NLP model for finding the \(\tau\)-th quantile |
|---|
| \[\min\>\sum_{i|\color{darkblue}y_i\ge\color{darkred}q} \color{darkblue}\tau\left|\color{darkblue}y_i - \color{darkred}q \right| +\sum_{i|\color{darkblue}y_i\lt\color{darkred}q} (1-\color{darkblue}\tau)\left|\color{darkblue}y_i-\color{darkred}q \right| \] |
This is essentially a weighted version of our median model, with weights \(\color{darkblue}\tau\) and \((1-\color{darkblue}\tau)\). We don't even try to solve this as stated. Note that the summations themselves are already difficult, as the index sets depend on the variable \(\color{darkred}q\). However, when we take a step back we see that the first summation is over the positive parts of \(\color{darkblue}y_i-\color{darkred}q\) and the second one over the negative parts. Splitting a quantity into its positive and negative parts is something we already know how to do: variable splitting.
| LP model for finding the \(\tau\)-th quantile |
|---|
| \[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot \color{darkred}e^+_i +(1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right) \\ &\color{darkred}e^+_i -\color{darkred}e^-_i = \color{darkblue}y_i - \color{darkred}q && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align} \] |
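An informal argument for why these weights produce a quantile, added here as a sketch. The symbol \(\rho_\tau\) and the counts \(N_{<}\), \(N_{>}\) are notation I introduce for this argument only; \(\rho_\tau\) follows common usage in the quantile regression literature. The loss per observation is

\[\rho_{\tau}(r) = \tau\cdot\max(r,0) + (1-\tau)\cdot\max(-r,0)\]

so the objective is \(f(q)=\sum_i \rho_\tau(y_i-q)\). Let \(N_{<}\) be the number of observations below \(q\) and \(N_{>}\) the number above, and assume \(q\) is not itself a data point. Moving \(q\) slightly to the right makes every observation above \(q\) cheaper at rate \(\tau\) and every observation below \(q\) more expensive at rate \(1-\tau\):

\[f'(q) = (1-\tau)\,N_{<} - \tau\,N_{>} = N_{<} - \tau\,n \qquad\text{with } n = N_{<}+N_{>}\]

The slope is negative while fewer than \(\tau n\) observations lie below \(q\) and positive once more than \(\tau n\) do. The minimum is therefore at the point where the fraction of data below \(q\) passes \(\tau\). For \(\tau=0.5\) this is the median argument again.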
Let's try this on some random data:
---- 15 PARAMETER y data
i1 25.457, i2 85.894, i3 59.534, i4 37.102, i5 36.299, i6 30.165, i7 41.485, i8 87.064
i9 16.040, i10 55.019, i11 99.831, i12 62.086, i13 99.202, i14 78.603, i15 21.762, i16 67.575
i17 24.357, i18 32.507, i19 70.204, i20 49.182, i21 42.373, i22 41.630, i23 21.834, i24 23.509
i25 63.020
---- 15 SET t quantile levels
0 , 0.25, 0.5 , 0.75, 1
---- 52 PARAMETER quantiles Solution
0 16.040, 0.25 30.165, 0.5 42.373, 0.75 67.575, 1 99.831
Here we solved 5 LPs for quantile levels 0, 0.25, 0.5, 0.75, and finally 1.
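These numbers can be cross-checked without an LP solver. The following Python sketch is my own check, not the method used in this post. It evaluates the LP objective for a fixed \(q\) and searches over the data points only, which is sufficient because the objective is convex and piecewise linear with kinks at the data. The names `pinball_loss` and `quantile` are hypothetical.

```python
# The 25 observations from the listing above.
y = [25.457, 85.894, 59.534, 37.102, 36.299, 30.165, 41.485, 87.064,
     16.040, 55.019, 99.831, 62.086, 99.202, 78.603, 21.762, 67.575,
     24.357, 32.507, 70.204, 49.182, 42.373, 41.630, 21.834, 23.509,
     63.020]


def pinball_loss(q, data, tau):
    """LP objective for fixed q: tau*e_plus + (1-tau)*e_minus, summed over the data."""
    return sum(tau * max(0.0, v - q) + (1.0 - tau) * max(0.0, q - v) for v in data)


def quantile(data, tau):
    """Minimize the loss over the data points (a minimizer always exists among them)."""
    return min(data, key=lambda q: pinball_loss(q, data, tau))


levels = [0, 0.25, 0.5, 0.75, 1]
quantiles = {tau: quantile(y, tau) for tau in levels}
print(quantiles)
```

For levels 0 and 1 the LP has many optimal solutions (any \(q\) at or beyond the extreme observation has zero cost). Restricting the search to data points returns the minimum and the maximum, which is also what the LP solver reported here.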
When we do the same in R, we see identical results:
On some data sets, we may see slight differences between the optimization model and the R code. This is because R has 9 different types of quantiles! To more precisely match the optimization model use type=1:
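Where such differences come from can be illustrated with a tiny hypothetical data set in which \(n\tau\) is an integer. The Python sketch below is my own illustration. The function `type1_quantile` reflects my reading of R's type 1 (the inverse of the empirical distribution function) and is not R's actual code.

```python
import math


def pinball_loss(q, data, tau):
    """Objective of the quantile LP for a fixed q."""
    return sum(tau * max(0.0, v - q) + (1.0 - tau) * max(0.0, q - v) for v in data)


def type1_quantile(data, tau):
    """Order-statistic rule: the ceil(n*tau)-th smallest value (the smallest for tau = 0)."""
    s = sorted(data)
    k = max(math.ceil(len(s) * tau), 1)
    return s[k - 1]


# Hypothetical data, chosen so that n*tau is an integer.
y = [1.0, 2.0, 3.0, 4.0]
tau = 0.5

# Every q between the two middle observations has the same objective value.
losses = {q: pinball_loss(q, y, tau) for q in (2.0, 2.5, 3.0)}
print(losses)
print(type1_quantile(y, tau))
```

When the optimization model has a whole interval of optimal solutions, the LP solver may return either end of it, while an interpolating quantile definition reports a value inside the interval. All of these are valid minimizers of the same objective.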
Problem 3: Quantile regression
Finally, let's look at quantile regression. The underlying optimization model looks like [1]:
| NLP model for quantile regression |
|---|
| \[\begin{align}\min&\sum_{i|\color{darkred}e_i\ge 0} \color{darkblue}\tau|\color{darkred}e_i| +\sum_{i|\color{darkred}e_i\lt 0} (1-\color{darkblue}\tau) |\color{darkred}e_i| \\ & \color{darkred}e = \color{darkblue}y-\color{darkblue}X\cdot\color{darkred}\beta\end{align}\] |
We can linearize this with our familiar tools:
| LP model for quantile regression |
|---|
| \[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot\color{darkred}e^+_i + (1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right)\\ & \color{darkred}e^+_i - \color{darkred}e^-_i = \color{darkblue}y_i-\sum_j \color{darkblue}X_{i,j}\cdot\color{darkred}\beta_j && \forall i\\ &\color{darkred}e^+_i,\color{darkred}e^-_i\ge 0\end{align}\] |
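To get a feel for this model without an LP solver, here is a brute-force Python sketch for the special case of an intercept plus one slope. It is not the method used in this post and not how LP solvers work. It assumes the standard LP fact that an optimal solution exists at a vertex, which for this model means a line passing through two observations, so it simply tries all such lines. The data and the names `check_loss` and `fit_line` are hypothetical, and the approach is only practical for tiny examples.

```python
from itertools import combinations


def check_loss(a, b, xs, ys, tau):
    """LP objective for the line y = a + b*x: tau*e_plus + (1-tau)*e_minus."""
    total = 0.0
    for x, y in zip(xs, ys):
        r = y - (a + b * x)
        total += tau * max(0.0, r) + (1.0 - tau) * max(0.0, -r)
    return total


def fit_line(xs, ys, tau):
    """Best line among those through two data points with different x values."""
    candidates = []
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 != x2:
            b = (y2 - y1) / (x2 - x1)
            candidates.append((y1 - b * x1, b))
    return min(candidates, key=lambda ab: check_loss(ab[0], ab[1], xs, ys, tau))


# Hypothetical data: y = x except for one large outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 100.0]
a, b = fit_line(xs, ys, tau=0.5)
print(a, b, check_loss(a, b, xs, ys, 0.5))
```

With \(\tau=0.5\) this is LAD regression: the fitted line follows the four well-behaved points and leaves the outlier with one large residual, unlike a least-squares fit, which would be pulled toward it.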
When we run this model on the data from [3], we get:
---- 96 PARAMETER estimates
          suppins         age       white      female      totchr   intercept
0.25      453.444      16.083     338.083      16.056     782.472   -1412.889
0.5       687.222      35.111     632.889    -260.556    1332.833   -2252.556
0.75      708.409      87.364     801.682    -554.591    2855.318   -4512.045
The analysis in [3] focuses on the variable totchr. Indeed the coefficient for this variable changes a lot for different quantile levels. The R code from [3] produces:
> results <- rq(Y ~ X, data=mydata, tau=c(0.25, 0.5, 0.75))
> summary(results)

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)
tau: [1] 0.25
Coefficients:
            Value        Std. Error   t value    Pr(>|t|)
(Intercept) -1412.88889    433.20179   -3.26150    0.00112
Xsuppins      453.44444     75.05348    6.04162    0.00000
Xtotchr       782.47222     37.55769   20.83388    0.00000
Xage           16.08333      6.19162    2.59760    0.00943
Xfemale        16.05556     72.20278    0.22237    0.82404
Xwhite        338.08333     71.51522    4.72743    0.00000

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)
tau: [1] 0.5
Coefficients:
            Value        Std. Error   t value    Pr(>|t|)
(Intercept) -2252.55556    846.23023   -2.66187    0.00781
Xsuppins      687.22222    137.29264    5.00553    0.00000
Xtotchr      1332.83333     74.77913   17.82360    0.00000
Xage           35.11111     11.29450    3.10869    0.00190
Xfemale      -260.55556    150.46285   -1.73169    0.08343
Xwhite        632.88889    243.05734    2.60387    0.00926

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)
tau: [1] 0.75
Coefficients:
            Value        Std. Error   t value    Pr(>|t|)
(Intercept) -4512.04545   2350.56284   -1.91956    0.05501
Xsuppins      708.40909    375.76929    1.88522    0.05950
Xtotchr      2855.31818    196.12587   14.55860    0.00000
Xage           87.36364     30.98410    2.81963    0.00484
Xfemale      -554.59091    378.71501   -1.46440    0.14319
Xwhite        801.68182    370.96108    2.16109    0.03077
We see our LP model can reproduce the R results.
Conclusion
Quantile regression is based on a rather simple, but non-obvious LP model. I have tried to informally derive this model using small LP models for the calculation of the median and the quantiles of a data vector. After this, the LP model for the quantile regression problem is rather obvious.
Explicitly formulating and implementing the LP model helps to make things a bit less of a black box. After this exercise, running quantile regressions in, say, R is more transparent: you know what you are doing.
References
1. Koenker, R., and Bassett, G. W. (1978). "Regression Quantiles." Econometrica 46:33–50.
2. Quantile Regression, https://en.wikipedia.org/wiki/Quantile_regression
3. Ani Katchova, Quantile Regression, https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression (a very good introduction with examples)
4. Linear Programming and L1 Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
5. Median, https://en.wikipedia.org/wiki/Median
6. Quantile, https://en.wikipedia.org/wiki/Quantile
Appendix A: GAMS code for non-smooth NLP model to calculate the median
Try with different NLP solvers to see what happens.
To have GAMS accept a function like abs() inside an NLP model, we need to declare the model as a DNLP. This is a warning that trouble may be ahead.
$ontext
   find median using a non-smooth NLP model
$offtext

set i /i1*i5/;

parameter y(i) 'data';
y(i) = uniform(0,1);
display y;

variable m 'median', z 'objective';
equation sumabs;

sumabs.. z =e= sum(i, abs(y(i)-m));

* initial value
m.l = 0.5;

model median /sumabs/;
option dnlp=conopt;
solve median minimizing z using dnlp;
display m.l;
Appendix B: GAMS code for finding the median as an LP
Both LP formulations are implemented here.
$ontext
   find median using an LP model
   use two formulations for the absolute values
$offtext

set i /i1*i5/ ;
parameter y(i) 'data';
y(i) = uniform(0,1);
display y;

*-------------------------------------------------------------
* formulation 1: bounding
*-------------------------------------------------------------
variables m 'median', z 'objective';
positive variables e(i) 'absolute value, formulation 1';
equations obj1 'objective, formulation 1', bound1(i), bound2(i);

obj1..      z =e= sum(i, e(i));
bound1(i).. e(i) =g= y(i)-m;
bound2(i).. e(i) =g= -(y(i)-m);

model medianLP1 /obj1,bound1,bound2/;
solve medianLP1 minimizing z using lp;
display m.l;

*-------------------------------------------------------------
* formulation 2: variable splitting
*-------------------------------------------------------------
set pm /'+','-'/ ;
positive variables e2(i,pm) 'absolute value, formulation 2';
equations obj2 'objective, formulation 2', split(i) 'variable splitting';

obj2..     z =e= sum((i,pm), e2(i,pm));
split(i).. e2(i,'+')-e2(i,'-') =e= y(i)-m;

model medianLP2 /obj2,split/;
solve medianLP2 minimizing z using lp;
display m.l;
Appendix C: GAMS code for finding quantiles.
$ontext
   find quantiles using an LP model
$offtext

set
   i 'observations' /i1*i25/
   t 'quantile levels' /'0','0.25','0.5','0.75','1'/
;
parameter y(i) 'data';
y(i) = uniform(10,100);
display y,t;

*-------------------------------------------------------------
* variable splitting LP Model
*-------------------------------------------------------------
set pm /'+','-'/ ;
scalar tau 'quantile';
positive variables e(i,pm) 'absolute value';
variable z 'objective variable', q 'quantile';
equations obj 'objective', split(i) 'variable splitting';

obj..      z =e= sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,'+')-e(i,'-') =e= y(i)-q;

model quantileLP /obj,split/;

*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------
parameter quantiles(t) "Solution";
loop(t,
   tau = t.val;
   solve quantileLP minimizing z using lp;
   quantiles(t) = q.l;
);
display quantiles;
Appendix D: GAMS code for quantile regression
$ontext
   Quantile regression optimization problem
   Data from:
   https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression
$offtext

*-------------------------------------------------------------
* data from csv file
*-------------------------------------------------------------
sets
   i 'observations'
   j0 'column names in csv file'
;
parameter data(i,*) 'all data';

$set csv d:\downloads\quantile_health.csv
$set gdx quantile_health.gdx
$call csv2gdx %csv% output=%gdx% id=data useHeader=T index=1 values=(2..8)
$gdxin %gdx%
$onmulti
$loaddc i=Dim1 j0=Dim2 data
display i,j0,data;

*-------------------------------------------------------------
* setup of regression data y,X
*-------------------------------------------------------------
set j 'independent variables' /intercept,suppins,totchr,age,female,white/ ;
parameter
   y(i) 'dependent variable'
   X(i,j) 'independent variables'
;
y(i) = data(i,'totexp');
X(i,j) = data(i,j);
X(i,'intercept') = 1;

*-------------------------------------------------------------
* quantile regression LP model
*-------------------------------------------------------------
set pm /'+','-'/ ;
scalar tau 'quantile';
positive variables e(i,pm) 'absolute value';
variables z 'objective variable', beta(j) 'estimates';
equations obj 'objective', split(i) 'variable splitting';

obj..      z =e= sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,'+')-e(i,'-') =e= y(i)-sum(j,X(i,j)*beta(j));

model quantileLP /obj,split/;

*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------
set q 'quantile levels' /'0.25','0.5','0.75'/;
parameter estimates(q,j);
loop(q,
   tau = q.val;
   solve quantileLP minimizing z using lp;
   estimates(q,j) = beta.l(j);
);
display estimates;