Channel: Yet Another Math Programming Consultant

Median, quantiles and quantile regression as linear programming problems

Quantile regression is a bit of an exotic type of regression [1,2,3]. It can be seen as a generalization of \(\ell_1\) or LAD regression [4], and just as LAD regression we can formulate and solve it as an LP.

First I want to discuss some preliminaries: how to find the median and the quantiles of a data vector \(\color{darkblue}y\). That will give us the tools to formulate the quantile regression problem as an LP. The reason for adding these preliminary steps is to develop some intuition about how quantile regression problems are defined. I found that most papers just "define" the underlying optimization problem, without much justification. I hope to show with these small auxiliary models how we arrive at the quantile regression model. Along the way, we encounter some interesting titbits. I'll discuss a few details that papers typically gloss over or even skip, but that I find fascinating.

Problem 1: Finding the median as an optimization problem


The (sample) median is the "middle observation". Conceptually, finding it is quite simple: sort the data and pick the middle observation. For an even number of observations, there are two middle observations, and often the average of these two is used. Here, in our models, we are a little bit more relaxed about this: any value between the two middle observations is allowed. Roughly, the median can be defined here as a number \(\color{darkred}m\) such that half of the data is below and half of the data is above \(\color{darkred}m\).

Suppose we have data \(\color{darkblue}y_i\). Then the following non-linear optimization problem can find the median \(\color{darkred}m\):

NLP model for finding the median
\[\min\>\sum_i \left|\color{darkblue}y_i - \color{darkred}m\right|\]


  • You may want to think a little bit about how this model indeed finds the median (or rather a median with 50% of the data below and 50% above).
  • This model looks deceptively simple. Actually, this is a nasty non-differentiable NLP problem. Many general-purpose NLP solvers may have a really difficult time with it. Most of them will not be able to establish (local) optimality. In the box below are some of the frightening messages you may encounter. Standard NLP solvers expect smoothness: functions and gradients should be continuous. Ignoring this may be a bad idea. 
  • This is a great example of why we always need to reformulate absolute values. Absolute values are wolves in sheep's clothing and they will eat you alive. 
  • Note that this model works also if the \(\color{darkblue}y_i\) are endogenous instead of just exogenous data. We will use this property later. 
  • Why not just sort the data? This is related to the previous point. Sorting data is easy, sorting decision variables is a different thing altogether and much, much more complicated. 
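
To build some intuition for the first bullet: the objective is piecewise linear and convex in \(\color{darkred}m\), so a minimum is attained at one of the data points. The following is a minimal Python sketch (not part of the GAMS models in this post; the helper name sum_abs_dev is made up for illustration) that evaluates the objective at every observation, using the rounded data values from the listing in this section.

```python
import statistics

# Rounded data values from the GAMS listing in this post
y = [0.172, 0.843, 0.550, 0.301, 0.292]

def sum_abs_dev(m, data):
    """Objective of the median model: sum of absolute deviations from m."""
    return sum(abs(v - m) for v in data)

# The objective is piecewise linear and convex in m, so it is enough to
# try the data points themselves as candidates for the minimizer.
best_m = min(y, key=lambda m: sum_abs_dev(m, y))

for m in sorted(y):
    print(f"m = {m:.3f}  objective = {sum_abs_dev(m, y):.4f}")
print("minimizer:", best_m, " statistics.median:", statistics.median(y))
```

Moving \(\color{darkred}m\) away from the middle observation brings it closer to fewer points than it moves away from, which is why the sum of absolute deviations can only go up.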


----     11 PARAMETER y  data

i1 0.172,    i2 0.843,    i3 0.550,    i4 0.301,    i5 0.292


**** SOLVER STATUS     4 Terminated By Solver
**** MODEL STATUS      7 Feasible Solution
**** OBJECTIVE VALUE                0.9297


                       LOWER     LEVEL     UPPER    MARGINAL

---- VAR m              -INF      0.3011     +INF      1.0000 NOPT
---- VAR z              -INF      0.9297     +INF       .

  m  median
  z  objective


Scary messages from different solvers:


Conopt:
** Feasible solution. Convergence too slow. The change in objective
   has been less than 3.0000E-12 for 20 consecutive iterations


MINOS:
EXIT - The current point cannot be improved.


SNOPT:
EXIT - Current point cannot be improved.


IPOPT:
EXIT: Restoration Failed!
Final point is feasible: scaled constraint violation (0) is below tol (1e-08) and unscaled constraint violation (0) is below constr_viol_tol (0.0001).


Knitro:
EXIT: Primal feasible solution estimate cannot be improved; desired accuracy
in dual feasibility could not be achieved.


We can reformulate absolute values in a linear fashion, so we end up with an LP. There are (at least) two ways to do this:



LP model 1 for finding the median
\[\begin{align}\min&\sum_i\color{darkred}e_i \\ & \color{darkred}e_i\ge\color{darkblue}y_i -\color{darkred}m&& \forall i \\ & \color{darkred}e_i \ge -(\color{darkblue}y_i  -\color{darkred}m) && \forall i \\ &\color{darkred}e_i\ge 0\end{align}\]

or using variable splitting:

LP model 2 for finding the median
\[\begin{align}\min&\sum_i \left(\color{darkred}e^+_i+\color{darkred}e^-_i\right) \\ & \color{darkred}e^+_i-\color{darkred}e^-_i =\color{darkblue}y_i -\color{darkred}m && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align}\]

For the last model it may be instructive to show the errors \(\color{darkred}e^+_i,\color{darkred}e^-_i\):

----     69 VARIABLE e2.L  absolute value, formulation 2

             +           -

i1                   0.129
i2       0.542
i3       0.249
i5                   0.009

For every observation \(i\) we see that at most one of \(\color{darkred}e^+_i,\color{darkred}e^-_i\) is nonzero. This is what we would expect. Note that there is one error that is zero: observation i4, which is the median itself and therefore does not show up in the listing.
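
The splitting can be mimicked outside the LP. Below is a small Python sketch (an illustration only; the function name split is made up) that splits each residual \(\color{darkblue}y_i-\color{darkred}m\) into its positive and negative part, using the rounded data and the rounded median 0.301. Note that the LP constraints alone do not force one of the two parts to be zero. It is the minimization that does this: adding the same amount to both parts keeps the constraint satisfied but increases the objective.

```python
# Rounded data and median from the listings in this post
y = [0.172, 0.843, 0.550, 0.301, 0.292]
m = 0.301

def split(residual):
    """Return (e_plus, e_minus), both >= 0, with e_plus - e_minus == residual."""
    return max(0.0, residual), max(0.0, -residual)

parts = [split(v - m) for v in y]

for i, (e_plus, e_minus) in enumerate(parts, start=1):
    print(f"i{i}   +: {e_plus:.3f}   -: {e_minus:.3f}")

# e_plus + e_minus recovers the absolute value of each residual
total = sum(e_plus + e_minus for e_plus, e_minus in parts)
print(f"objective = {total:.4f}")
```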

These models will not have the same problems as the NLP model. They solve easily to optimality. In the next paragraphs, we will see how LP Model 2 is really the building block for Quantile Regression. 

Even though this model is just a piece of the puzzle, we can learn a lot from it. 

Problem 2: Finding a quantile as an optimization problem


We can generalize the concept of a median to quantiles. E.g., a 0.75 quantile gives us the number such that 75% of the data is below (and 25% is above).  Here is an optimization model for this:


NLP model for finding the \(\tau\)-th quantile 
\[\min\>\sum_{i|\color{darkblue}y_i\ge\color{darkred}q} \color{darkblue}\tau\left|\color{darkblue}y_i - \color{darkred}q \right| +\sum_{i|\color{darkblue}y_i\lt\color{darkred}q} (1-\color{darkblue}\tau)\left|\color{darkblue}y_i-\color{darkred}q \right| \]


This is essentially a weighted version of our median model, with weights \(\color{darkblue}\tau\) and \((1-\color{darkblue}\tau)\). We don't even try to solve this as stated. Note that the summations themselves are already difficult, as the index sets depend on the decision variable \(\color{darkred}q\). However, when we take a step back we see that the first summation is over the positive parts of \(\color{darkblue}y_i-\color{darkred}q\) and the second one over the negative parts. Splitting a quantity into positive and negative parts is something we already know how to do: variable splitting.


LP model for finding the \(\tau\)-th quantile 
\[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot  \color{darkred}e^+_i +(1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right) \\ &\color{darkred}e^+_i -\color{darkred}e^-_i = \color{darkblue}y_i - \color{darkred}q && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align} \]
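
An informal argument for why the minimizer is indeed a \(\color{darkblue}\tau\)-th quantile (a sketch under the notation introduced here, not taken from the cited papers): write the objective as a function \(f(\color{darkred}q)\) and let \(n_<(\color{darkred}q)\) be the number of observations strictly below \(\color{darkred}q\), out of \(n\) observations in total. Away from the data points \(f\) is differentiable, and each observation above \(\color{darkred}q\) contributes \(-\color{darkblue}\tau\) to the slope while each observation below contributes \(1-\color{darkblue}\tau\):

\[\begin{align} f(\color{darkred}q) &= \sum_i \left( \color{darkblue}\tau \max(\color{darkblue}y_i-\color{darkred}q,0) + (1-\color{darkblue}\tau) \max(\color{darkred}q-\color{darkblue}y_i,0) \right) \\ f'(\color{darkred}q) &= (1-\color{darkblue}\tau)\, n_<(\color{darkred}q) - \color{darkblue}\tau \left(n - n_<(\color{darkred}q)\right) = n_<(\color{darkred}q) - \color{darkblue}\tau n \end{align}\]

So \(f\) decreases as long as fewer than \(\color{darkblue}\tau n\) observations lie below \(\color{darkred}q\) and increases once more than \(\color{darkblue}\tau n\) do. The minimum is where the fraction of data below \(\color{darkred}q\) crosses \(\color{darkblue}\tau\). For \(\color{darkblue}\tau=0.5\) this is the median model again (up to a constant factor in the objective).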


Let's try this on some random data:


----     15 PARAMETER y  data

i1 25.457, i2 85.894, i3 59.534, i4 37.102, i5 36.299, i6 30.165, i7 41.485, i8 87.064
i9 16.040, i10 55.019, i11 99.831, i12 62.086, i13 99.202, i14 78.603, i15 21.762, i16 67.575
i17 24.357, i18 32.507, i19 70.204, i20 49.182, i21 42.373, i22 41.630, i23 21.834, i24 23.509
i25 63.020


---- 15 SET t quantile levels

0, 0.25, 0.5, 0.75, 1


---- 52 PARAMETER quantiles Solution

0 16.040,    0.25 30.165,    0.5 42.373,    0.75 67.575,    1 99.831


Here we solved 5 LPs for quantile levels 0, 0.25, 0.5, 0.75, and finally 1.
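
These numbers can be cross-checked without an LP solver. The sketch below is plain Python (not the GAMS model; the name check_loss is made up for illustration) and relies on the objective being piecewise linear in \(\color{darkred}q\), so that some data point is always among the minimizers. For levels 0 and 1 the minimizer is not unique (any value below the smallest or above the largest observation is also optimal); restricting the candidates to data points picks the minimum and the maximum.

```python
# Data from the listing above
y = [25.457, 85.894, 59.534, 37.102, 36.299, 30.165, 41.485, 87.064,
     16.040, 55.019, 99.831, 62.086, 99.202, 78.603, 21.762, 67.575,
     24.357, 32.507, 70.204, 49.182, 42.373, 41.630, 21.834, 23.509,
     63.020]

def check_loss(q, data, tau):
    """Quantile objective: tau times the positive parts of y_i - q
    plus (1 - tau) times the negative parts."""
    return sum(tau * max(0.0, v - q) + (1 - tau) * max(0.0, q - v)
               for v in data)

levels = [0, 0.25, 0.5, 0.75, 1]

# Try every observation as a candidate for q and keep the best one
quantiles = {tau: min(y, key=lambda q: check_loss(q, y, tau))
             for tau in levels}

for tau, q in quantiles.items():
    print(f"{tau:<5} {q:.3f}")
```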

When we do the same in R, using the quantile() function, we see identical results.

On some data sets, we may see slight differences between the optimization model and the R code. This is because R's quantile() function supports 9 different types of quantiles! To match the optimization model more precisely, use type=1.



Problem 3: Quantile regression


Finally, let's look at quantile regression. The underlying optimization model looks like [1]:


NLP model for quantile regression 
\[\begin{align}\min&\sum_{i|\color{darkred}e_i\ge 0} \color{darkblue}\tau|\color{darkred}e_i| +\sum_{i|\color{darkred}e_i\lt 0} (1-\color{darkblue}\tau) |\color{darkred}e_i| \\ & \color{darkred}e  = \color{darkblue}y-\color{darkblue}X\cdot\color{darkred}\beta\end{align}\]

We can linearize this with our familiar tools:

LP model for quantile regression 
\[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot\color{darkred}e^+_i + (1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right)\\ & \color{darkred}e^+_i - \color{darkred}e^-_i = \color{darkblue}y_i-\sum_j \color{darkblue}X_{i,j}\cdot\color{darkred}\beta_j && \forall i\\ &\color{darkred}e^+_i,\color{darkred}e^-_i\ge 0\end{align}\]
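
For a tiny problem the model can be checked by brute force. The Python sketch below is an illustration only: the data are made up (points on the line \(y=1+2x\) plus one outlier) and the names check_loss and fit_line are hypothetical. It leans on a standard property of this LP: when \(\color{darkblue}X\) has full column rank, some optimal solution fits as many observations exactly as there are coefficients. With an intercept and a slope, that means we can try every line through two observations and keep the best. This is hopeless for real data sets, where we just solve the LP.

```python
from itertools import combinations

# Made-up illustration data: points on y = 1 + 2x, plus one outlier at x = 2
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 100.0, 7.0, 9.0]

def check_loss(beta, tau):
    """Quantile regression objective for beta = (intercept, slope)."""
    b0, b1 = beta
    total = 0.0
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)  # residual, split into e+ and e- below
        total += tau * max(0.0, e) + (1 - tau) * max(0.0, -e)
    return total

def fit_line(tau):
    """Brute force: try every line through two observations, keep the best."""
    candidates = []
    for i, k in combinations(range(len(x)), 2):
        if x[i] == x[k]:
            continue  # would be a vertical line: skip
        b1 = (y[k] - y[i]) / (x[k] - x[i])
        b0 = y[i] - b1 * x[i]
        candidates.append((b0, b1))
    return min(candidates, key=lambda beta: check_loss(beta, tau))

beta_median = fit_line(0.5)
print("tau = 0.5:", beta_median, "objective:", check_loss(beta_median, 0.5))
```

For \(\color{darkblue}\tau=0.5\) the fit ignores the outlier and recovers the line through the other four points, which illustrates the robustness of median (LAD) regression.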

When we run this model on the data from [3], we get:


----     96 PARAMETER estimates

          suppins         age       white      female      totchr   intercept

0.25      453.444      16.083     338.083      16.056     782.472   -1412.889
0.5       687.222      35.111     632.889    -260.556    1332.833   -2252.556
0.75      708.409      87.364     801.682    -554.591    2855.318   -4512.045


The analysis in [3] focuses on the variable totchr. Indeed the coefficient for this variable changes a lot for different quantile levels. The R code from [3] produces:


> results <- rq(Y ~ X, data=mydata, tau=c(0.25, 0.5, 0.75))
> summary(results)

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.25

Coefficients:
                  Value  Std. Error     t value    Pr(>|t|)
(Intercept) -1412.88889   433.20179    -3.26150     0.00112
Xsuppins      453.44444    75.05348     6.04162     0.00000
Xtotchr       782.47222    37.55769    20.83388     0.00000
Xage           16.08333     6.19162     2.59760     0.00943
Xfemale        16.05556    72.20278     0.22237     0.82404
Xwhite        338.08333    71.51522     4.72743     0.00000

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.5

Coefficients:
                  Value  Std. Error     t value    Pr(>|t|)
(Intercept) -2252.55556   846.23023    -2.66187     0.00781
Xsuppins      687.22222   137.29264     5.00553     0.00000
Xtotchr      1332.83333    74.77913    17.82360     0.00000
Xage           35.11111    11.29450     3.10869     0.00190
Xfemale      -260.55556   150.46285    -1.73169     0.08343
Xwhite        632.88889   243.05734     2.60387     0.00926

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.75

Coefficients:
                  Value  Std. Error     t value    Pr(>|t|)
(Intercept) -4512.04545  2350.56284    -1.91956     0.05501
Xsuppins      708.40909   375.76929     1.88522     0.05950
Xtotchr      2855.31818   196.12587    14.55860     0.00000
Xage           87.36364    30.98410     2.81963     0.00484
Xfemale      -554.59091   378.71501    -1.46440     0.14319
Xwhite        801.68182   370.96108     2.16109     0.03077


We see our LP model can reproduce the R results.


Conclusion


Quantile regression is based on a rather simple, but non-obvious LP model. I have tried to informally derive this model using small LP models for the calculation of the median and the quantiles of a data vector. After this, the LP model for the quantile regression problem is rather obvious.

Explicitly formulating and implementing the LP model helps to make things a bit less of a black box. After this exercise, running quantile regressions in, say, R is more transparent: you know what you are doing.

References


  1. Koenker, R., and Bassett, G. W. (1978). “Regression Quantiles.” Econometrica 46:33–50.
  2. Quantile Regression, https://en.wikipedia.org/wiki/Quantile_regression
  3. Ani Katchova, Quantile Regression, https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression, this is a very good introduction with examples
  4. Linear Programming and L1 Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
  5. Median, https://en.wikipedia.org/wiki/Median
  6. Quantile, https://en.wikipedia.org/wiki/Quantile


Appendix A: GAMS code for non-smooth NLP model to calculate the median


Try with different NLP solvers to see what happens.

To have GAMS accept a function like abs() inside an NLP model, we need to declare the model as a DNLP. This is a warning that trouble may be ahead.


$ontext

  find median using a non-smooth NLP model

$offtext

set i /i1*i5/;

parameter y(i) 'data';
y(i) = uniform(0,1);
display y;

variable
   m 'median'
   z 'objective'
;

equation sumabs;

sumabs.. z =e= sum(i, abs(y(i)-m));

* initial value
m.l = 0.5;

model median /sumabs/;
option dnlp=conopt;
solve median minimizing z using dnlp;

display m.l;



Appendix B: GAMS code for finding the median as an LP


Both LP formulations are implemented here.


$ontext

  find median using an LP model

  use two formulations for the absolute values

$offtext

set
  i /i1*i5/
;

parameter y(i) 'data';
y(i) = uniform(0,1);
display y;


*-------------------------------------------------------------
* formulation 1: bounding
*-------------------------------------------------------------

variables
   m    'median'
   z    'objective'
;
positive variables
   e(i) 'absolute value, formulation 1'
;

equations
   obj1      'objective, formulation 1'
   bound1(i)
   bound2(i)
;

obj1.. z =e= sum(i, e(i));
bound1(i).. e(i) =g= y(i)-m;
bound2(i).. e(i) =g= -(y(i)-m);

model medianLP1 /obj1,bound1,bound2/;
solve medianLP1 minimizing z using lp;

display m.l;

*-------------------------------------------------------------
* formulation 2: variable splitting
*-------------------------------------------------------------
set
  pm /'+','-'/
;

positive variables
   e2(i,pm) 'absolute value, formulation 2'
;

equations
   obj2      'objective, formulation 2'
   split(i)  'variable splitting'
;

obj2.. z =e= sum((i,pm), e2(i,pm));
split(i).. e2(i,'+')-e2(i,'-') =e= y(i)-m;

model medianLP2 /obj2,split/;
solve medianLP2 minimizing z using lp;

display m.l;



Appendix C: GAMS code for finding quantiles


$ontext

  find quantiles using an LP model

$offtext

set
  i 'observations'    /i1*i25/
  t 'quantile levels' /'0','0.25','0.5','0.75','1'/
;

parameter y(i) 'data';
y(i) = uniform(10,100);

display y,t;

*-------------------------------------------------------------
* variable splitting LP Model
*-------------------------------------------------------------

set
  pm /'+','-'/
;

scalar tau 'quantile';

positive variables e(i,pm) 'absolute value';

variable
   z 'objective variable'
   q 'quantile'
;


equations
   obj       'objective'
   split(i)  'variable splitting'
;

obj.. z =e= sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,'+')-e(i,'-') =e= y(i)-q;

model quantileLP /obj,split/;


*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------

parameter quantiles(t) "Solution";

loop(t,
   tau = t.val;
   solve quantileLP minimizing z using lp;
   quantiles(t) = q.l;
);
display quantiles;



Appendix D: GAMS code for quantile regression

$ontext

  Quantile regression optimization problem

  Data from:

     https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression

$offtext


*-------------------------------------------------------------
* data from csv file
*-------------------------------------------------------------


sets
   i  'observations'
   j0 'column names in csv file'
;

parameter data(i,*) 'all data';

$set csv d:\downloads\quantile_health.csv
$set gdx quantile_health.gdx

$call csv2gdx %csv% output=%gdx% id=data useHeader=T index=1 values=(2..8)

$gdxin %gdx%
$onmulti
$loaddc i=Dim1 j0=Dim2 data

display i,j0,data;

*-------------------------------------------------------------
* setup of regression data y,X
*-------------------------------------------------------------

set
  j 'independent variables' /intercept,suppins,totchr,age,female,white/
;

parameter
   y(i)   'dependent variable'
   X(i,j) 'independent variables'
;

y(i) = data(i,'totexp');
X(i,j) = data(i,j);
X(i,'intercept') = 1;


*-------------------------------------------------------------
* quantile regression LP model
*-------------------------------------------------------------

set
  pm /'+','-'/
;

scalar tau 'quantile';

positive variables e(i,pm) 'absolute value';

variables
   z       'objective variable'
   beta(j) 'estimates'
;

equations
   obj       'objective'
   split(i)  'variable splitting'
;

obj.. z =e= sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,'+')-e(i,'-') =e= y(i)-sum(j,X(i,j)*beta(j));

model quantileLP /obj,split/;


*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------

set q 'quantile levels' /'0.25','0.5','0.75'/;

parameter estimates(q,j);

loop(q,
   tau = q.val;
   solve quantileLP minimizing z using lp;
   estimates(q,j) = beta.l(j);
);

display estimates;


