Channel: Yet Another Math Programming Consultant

Micro-econometrics: discrete choice models

In this post, I want to discuss some statistical models from [1] and implement them in GAMS. The main point is to emphasize that these are all (nonlinear) optimization problems. Formulating them explicitly, instead of calling canned routines in a statistical package, helps to get a better understanding of what is really going on. At least for me, not using a black-box routine forces me to understand the underlying optimization models. Another application is to make the estimation part of a larger GAMS model. Some mathematical programming models need some estimation code before the real model can be attacked. If the rest of the model is in GAMS, it may be a little bit easier to also use GAMS for the estimation tasks.


Notation


When discussing statistical optimization models, it is always important to understand what is meant by \(x\) (or \(X\)). From a pure mathematical optimization point of view, \(x\) is often used to indicate decision variables. In statistics, \(X\) is often a data matrix. 

Similarly, the terms parameter and coefficient may mean different things in optimization and statistics. In mathematical programming, parameters and coefficients are constants. In a regression, these terms indicate the quantities we want to estimate: they are decision variables in the mathematical optimization model.

This can cause some confusion.

Data


The problem we want to use in this experiment is from [2]: does a new method of teaching economics (PSI) lead to improvements in grades (GRADE)? The statistical variables (not optimization variables) are:

  • GRADE: dependent variable indicating better grades in later economics classes,
  • PSI: use of new teaching method (Personalized System of Instruction),
  • GPA: Grade Point Average,
  • TUCE: test score on the subject before entering the class

 

Data
OBS      GPA      TUCE     PSI      GRADE 
1 2.66 20 0 0
2 2.89 22 0 0
3 3.28 24 0 0
4 2.92 12 0 0
5 4.00 21 0 1
6 2.86 17 0 0
7 2.76 17 0 0
8 2.87 21 0 0
9 3.03 25 0 0
10 3.92 29 0 1
11 2.63 20 0 0
12 3.32 23 0 0
13 3.57 23 0 0
14 3.26 25 0 1
15 3.53 26 0 0
16 2.74 19 0 0
17 2.75 25 0 0
18 2.83 19 0 0
19 3.12 23 1 0
20 3.16 25 1 1
21 2.06 22 1 0
22 3.62 28 1 1
23 2.89 14 1 0
24 3.51 26 1 0
25 3.54 24 1 1
26 2.83 27 1 1
27 3.39 17 1 1
28 2.67 24 1 0
29 3.65 21 1 1
30 4.00 23 1 1
31 3.10 21 1 0
32 2.39 19 1 1

Data Prep


To work with a more familiar notation, I create parameters \(y\) and \(X\). Also, a column of ones is added to \(X\), so we don't have to worry about adding a constant term to each of the models. So we form the following derived data:


----     79 PARAMETER y  dependent variable (grade)

case5  1.000,    case10 1.000,    case14 1.000,    case20 1.000,    case22 1.000,    case25 1.000,    case26 1.000
case27 1.000,    case29 1.000,    case30 1.000,    case32 1.000


----     79 PARAMETER X  independent variables

          constant         gpa        tuce         psi

case1        1.000       2.660      20.000
case2        1.000       2.890      22.000
case3        1.000       3.280      24.000
case4        1.000       2.920      12.000
case5        1.000       4.000      21.000
case6        1.000       2.860      17.000
case7        1.000       2.760      17.000
case8        1.000       2.870      21.000
case9        1.000       3.030      25.000
case10       1.000       3.920      29.000
case11       1.000       2.630      20.000
case12       1.000       3.320      23.000
case13       1.000       3.570      23.000
case14       1.000       3.260      25.000
case15       1.000       3.530      26.000
case16       1.000       2.740      19.000
case17       1.000       2.750      25.000
case18       1.000       2.830      19.000
case19       1.000       3.120      23.000       1.000
case20       1.000       3.160      25.000       1.000
case21       1.000       2.060      22.000       1.000
case22       1.000       3.620      28.000       1.000
case23       1.000       2.890      14.000       1.000
case24       1.000       3.510      26.000       1.000
case25       1.000       3.540      24.000       1.000
case26       1.000       2.830      27.000       1.000
case27       1.000       3.390      17.000       1.000
case28       1.000       2.670      24.000       1.000
case29       1.000       3.650      21.000       1.000
case30       1.000       4.000      23.000       1.000
case31       1.000       3.100      21.000       1.000
case32       1.000       2.390      19.000       1.000
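
For readers who want to follow along outside GAMS, the same data preparation can be sketched in plain Python. This is only an illustration and not part of the GAMS model: the names `ROWS`, `y` and `X` are my own, and the numbers are copied from the data table above.

```python
# Spector-Mazzeo data as listed in the post: (GPA, TUCE, PSI, GRADE) per case.
ROWS = [
    (2.66, 20, 0, 0), (2.89, 22, 0, 0), (3.28, 24, 0, 0), (2.92, 12, 0, 0),
    (4.00, 21, 0, 1), (2.86, 17, 0, 0), (2.76, 17, 0, 0), (2.87, 21, 0, 0),
    (3.03, 25, 0, 0), (3.92, 29, 0, 1), (2.63, 20, 0, 0), (3.32, 23, 0, 0),
    (3.57, 23, 0, 0), (3.26, 25, 0, 1), (3.53, 26, 0, 0), (2.74, 19, 0, 0),
    (2.75, 25, 0, 0), (2.83, 19, 0, 0), (3.12, 23, 1, 0), (3.16, 25, 1, 1),
    (2.06, 22, 1, 0), (3.62, 28, 1, 1), (2.89, 14, 1, 0), (3.51, 26, 1, 0),
    (3.54, 24, 1, 1), (2.83, 27, 1, 1), (3.39, 17, 1, 1), (2.67, 24, 1, 0),
    (3.65, 21, 1, 1), (4.00, 23, 1, 1), (3.10, 21, 1, 0), (2.39, 19, 1, 1),
]

# Dependent variable y and design matrix X with a leading column of ones.
# Column order of X: constant, gpa, tuce, psi.
y = [float(grade) for (_, _, _, grade) in ROWS]
X = [[1.0, gpa, float(tuce), float(psi)] for (gpa, tuce, psi, _) in ROWS]

# Same sanity checks as in the GAMS model: GRADE and PSI must be binary.
if not all(v in (0.0, 1.0) for v in y):
    raise ValueError("GRADE should be binary")
if not all(row[3] in (0.0, 1.0) for row in X):
    raise ValueError("PSI should be binary")
```

Unlike the GAMS display, which suppresses zeros, the Python lists store the zeros for `psi` explicitly.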



The dependent variable grade is binary. To analyze models with this property, techniques like logit and probit are used. The independent variable psi is also binary, but that does not change much in the analysis.

  

Exercise


In this post, I want to reproduce some of the numbers from table 17.1 in [1]:



In this table, APE stands for Average Partial Effects, defined as: \[APE=\frac{1}{n}\sum_{i=1}^{n} f(\pmb{x}_i'\hat{\pmb{\beta}})\hat{\pmb{\beta}}\] For a discrete explanatory variable (PSI in our case), a different formula is used: \[APE=\frac{1}{n}\sum_{i=1}^{n} \left[ F(\pmb{x1}_i'\hat{\pmb{\beta}})-F(\pmb{x0}_i'\hat{\pmb{\beta}})\right]\] where \(\pmb{x1}_i\) is the i-th observation with \(PSI=1\) imputed, and, similarly, \(\pmb{x0}_i\) is the i-th observation with \(PSI=0\) imputed. For quite a while, I did not completely understand the notation used in the footnote of table 17.1. Here I use a different notation with some additional verbal explanation. In case of doubt, the precise implementation of this formula is in the GAMS model in the appendix.

Note that \(f(x)\) indicates the density and \(F(x)\) the cumulative distribution function. 
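
To make the two APE formulas concrete, here is a small Python sketch for the logit case. It is an illustration only: the function names are mine, the regressors are copied from the data table, and the coefficients are the rounded logit estimates from the final results table of this post, so the outcome agrees with the reported APE column only up to rounding.

```python
import math

# Regressors (GPA, TUCE, PSI) per case, copied from the data table.
REGRESSORS = [
    (2.66, 20, 0), (2.89, 22, 0), (3.28, 24, 0), (2.92, 12, 0),
    (4.00, 21, 0), (2.86, 17, 0), (2.76, 17, 0), (2.87, 21, 0),
    (3.03, 25, 0), (3.92, 29, 0), (2.63, 20, 0), (3.32, 23, 0),
    (3.57, 23, 0), (3.26, 25, 0), (3.53, 26, 0), (2.74, 19, 0),
    (2.75, 25, 0), (2.83, 19, 0), (3.12, 23, 1), (3.16, 25, 1),
    (2.06, 22, 1), (3.62, 28, 1), (2.89, 14, 1), (3.51, 26, 1),
    (3.54, 24, 1), (2.83, 27, 1), (3.39, 17, 1), (2.67, 24, 1),
    (3.65, 21, 1), (4.00, 23, 1), (3.10, 21, 1), (2.39, 19, 1),
]
X = [[1.0, gpa, float(tuce), float(psi)] for (gpa, tuce, psi) in REGRESSORS]

# Rounded logit estimates from the results table: constant, gpa, tuce, psi.
BETA_LOGIT = [-13.021, 2.826, 0.095, 2.379]


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_pdf(z):
    p = logistic_cdf(z)
    return p * (1.0 - p)


def xb(row, beta):
    return sum(b * v for b, v in zip(beta, row))


def ape_continuous(X, beta, j, pdf):
    """Mean of f(x_i'b) times coefficient j."""
    return beta[j] * sum(pdf(xb(row, beta)) for row in X) / len(X)


def ape_binary(X, beta, j, cdf):
    """Mean of F(x1_i'b) - F(x0_i'b), with column j set to 1 and to 0."""
    total = 0.0
    for row in X:
        row1 = row[:j] + [1.0] + row[j + 1:]
        row0 = row[:j] + [0.0] + row[j + 1:]
        total += cdf(xb(row1, beta)) - cdf(xb(row0, beta))
    return total / len(X)


ape_gpa = ape_continuous(X, BETA_LOGIT, 1, logistic_pdf)
ape_tuce = ape_continuous(X, BETA_LOGIT, 2, logistic_pdf)
ape_psi = ape_binary(X, BETA_LOGIT, 3, logistic_cdf)
```

The binary version imputes PSI for every observation, regardless of its observed value, which is what the `xb2` parameter does in the GAMS model.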


OLS: Ordinary Least Squares


Almost always, before looking into more complex models, it is a good idea to perform some standard linear regressions. I'll discuss a few ways of doing a least-squares fit. This is a bit of a side track, but I know that this is quite useful for many users. Least-squares is an important modeling concept.

The first optimization model is how I usually formulate least squares models in GAMS:

QP Model 1
\[\begin{align}\min&\sum_i \color{darkred}e_i^2 \\& \color{darkblue}y_i = \sum_j \color{darkblue}X_{i,j} \color{darkred}\beta_j + \color{darkred}e_i &&\forall i \end{align}\]

These are easy models, and they can be solved with any QP or NLP solver. We can substitute out \(\color{darkred}e_i\), and a more standard unconstrained QP follows:

Unconstrained QP Model 2
\[\min\>\sum_i \left(\color{darkblue}y_i - \sum_j \color{darkblue}X_{i,j} \color{darkred}\beta_j\right)^2 \]


This model is also often written in matrix notation: \[\min\>\|\color{darkblue}y-\color{darkblue}X\color{darkred}\beta\|_2^2\]


A method based on solving the so-called normal equations is as follows:

OLS Normal Equations
\[(\color{darkblue}X'\color{darkblue}X)\color{darkred}\beta = \color{darkblue}X'\color{darkblue}y\]

In GAMS, systems of equations can be solved as CNS (Constrained Nonlinear System) or MCP (Mixed Complementarity Problem), or using an LP/NLP with a dummy objective. Note that this method is numerically not very stable: forming \((\color{darkblue}X'\color{darkblue}X)\) squares the condition number of \(\color{darkblue}X\), and some very large numbers may be created.



A final approach is to use a regression subroutine. These don't do any optimization, but typically use a QR or SVD decomposition. For instance, Python has several least squares functions (e.g., numpy.linalg.lstsq, sklearn.linear_model.LinearRegression, statsmodels.regression.linear_model.OLS). GAMS has an interface to the numpy.linalg.lstsq function. I believe this is using a divide-and-conquer SVD algorithm.

This last approach is recommended for numerically more challenging cases. For well-behaved problems, all methods should work. If there are constraints on the coefficients, the optimization models are obvious choices. 
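
As a cross-check outside GAMS, the normal equations can also be solved in a few lines of plain Python. This is a sketch under my own assumptions: only the standard library is used, the helper `solve` is a hypothetical hand-written Gaussian elimination (not a library routine), and the data is copied from the data table. For this small, well-behaved data set the stability concerns mentioned above do not bite.

```python
# Spector-Mazzeo data as listed in the post: (GPA, TUCE, PSI, GRADE) per case.
ROWS = [
    (2.66, 20, 0, 0), (2.89, 22, 0, 0), (3.28, 24, 0, 0), (2.92, 12, 0, 0),
    (4.00, 21, 0, 1), (2.86, 17, 0, 0), (2.76, 17, 0, 0), (2.87, 21, 0, 0),
    (3.03, 25, 0, 0), (3.92, 29, 0, 1), (2.63, 20, 0, 0), (3.32, 23, 0, 0),
    (3.57, 23, 0, 0), (3.26, 25, 0, 1), (3.53, 26, 0, 0), (2.74, 19, 0, 0),
    (2.75, 25, 0, 0), (2.83, 19, 0, 0), (3.12, 23, 1, 0), (3.16, 25, 1, 1),
    (2.06, 22, 1, 0), (3.62, 28, 1, 1), (2.89, 14, 1, 0), (3.51, 26, 1, 0),
    (3.54, 24, 1, 1), (2.83, 27, 1, 1), (3.39, 17, 1, 1), (2.67, 24, 1, 0),
    (3.65, 21, 1, 1), (4.00, 23, 1, 1), (3.10, 21, 1, 0), (2.39, 19, 1, 1),
]
y = [float(grade) for (_, _, _, grade) in ROWS]
X = [[1.0, gpa, float(tuce), float(psi)] for (gpa, tuce, psi, _) in ROWS]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(M[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (M[k][n] - tail) / M[k][k]
    return z


k = len(X[0])
# Form X'X and X'y, then solve (X'X) beta = X'y.
XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
beta_ols = solve(XtX, Xty)

# At the solution the residuals are orthogonal to every column of X.
resid = [yi - sum(b * v for b, v in zip(beta_ols, r)) for r, yi in zip(X, y)]
orth = [sum(r[a] * e for r, e in zip(X, resid)) for a in range(k)]
```

The entries of `beta_ols` (constant, gpa, tuce, psi) can be compared with the OLS columns in the results table below.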

For our data set, we see:

 
----    356 PARAMETER results  results from estimation methods

            OLS(QP1)    OLS(QP2)   OLS(NRML)     OLS(py)
               coeff       coeff       coeff       coeff

constant      -1.498      -1.498      -1.498      -1.498
gpa            0.464       0.464       0.464       0.464
tuce           0.010       0.010       0.010       0.010
psi            0.379       0.379       0.379       0.379


The interpretation is that we basically ignore that the dependent variable grade is a binary response. We just consider it as a standard, continuous response variable. The signs of the coefficients for gpa, tuce and psi make sense.

The APE (Average Partial Effect) for OLS is not so interesting: in a linear model, the partial effects are simply the coefficients themselves.

Logit and Probit Models


The logit and probit models deal explicitly with \(\color{darkblue}y\) being binary. They are closely related. The underlying probability models are:

\[\mathrm{Prob}(Y=1|\mathbf{x}) = \begin{cases} \Phi(\mathbf{x}'\boldsymbol{\beta}) & \text{for the probit model} \\ \Lambda(\mathbf{x}'\boldsymbol{\beta}) & \text{for the logit model}\end{cases}\] where \(\Phi(.)\) and \(\Lambda(.)\) are the standard normal and logistic cumulative distribution functions.

In the text below, I use \[{\color{darkblue}{\bf x}}_i'{\color{darkred}\beta}\equiv\sum_j \color{darkblue}X_{i,j}\color{darkred}\beta_j\]

The standard way to estimate the coefficients is to form a log-likelihood function and maximize that. The likelihood function is \[\color{darkred}L=\prod_i F({\color{darkblue}{\bf x}}_i'{\color{darkred}\beta})^{\color{darkblue}y_i}\cdot [1-F({\color{darkblue}{\bf x}}_i'{\color{darkred}\beta})]^{1-\color{darkblue}y_i}\] So, the log-likelihood function is: \[\ln \color{darkred}L = \sum_i \left\{\color{darkblue}y_i \ln F({\color{darkblue}{\bf x}}_i'{\color{darkred}\beta}) + (1-\color{darkblue}y_i)\ln [1-F({\color{darkblue}{\bf x}}_i'{\color{darkred}\beta})]  \right\}\]

For the logit model, we have \[F(x) = \Lambda(x) = \frac{\exp(x)}{1+\exp(x)}\] Another way of writing this is \[F(x) = \frac{1}{1+\exp(-x)}\] The density function is \[f(x) = \Lambda(x)\left[1-\Lambda(x)\right]\] This is also often written as \[f(x)=\frac{\exp(-x)}{(1+\exp(-x))^2}=\frac{\exp(x)}{(1+\exp(x))^2}\] Plugging \(F\) into the log-likelihood and simplifying a bit, we end up with


Logit Log Likelihood
\[\max \ln \color{darkred}L = \sum_i \left\{\color{darkblue}y_i {\color{darkblue}{\bf x}}_i'{\color{darkred}\beta} - \ln\left[1+\exp({\color{darkblue}{\bf x}}_i'{\color{darkred}\beta})\right]\right\}\]
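
The simplification step can be spelled out as follows. Writing \(z_i = {\bf x}_i'\beta\) as shorthand (my notation, not used elsewhere in this post), the two logarithms in the general log-likelihood become \[\ln \Lambda(z_i) = z_i - \ln\left[1+\exp(z_i)\right] \qquad \ln\left[1-\Lambda(z_i)\right] = -\ln\left[1+\exp(z_i)\right]\] so each term of the sum reduces to \[y_i\left\{z_i - \ln\left[1+\exp(z_i)\right]\right\} - (1-y_i)\ln\left[1+\exp(z_i)\right] = y_i z_i - \ln\left[1+\exp(z_i)\right]\]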


The first-order conditions of this form the following system of non-linear equations:

Logit First-Order Conditions
\[\sum_i \left[ \left(\color{darkblue}y_i-\frac{\exp({\color{darkblue}{\bf x}}_i'{\color{darkred}\beta})}{1+\exp({\color{darkblue}{\bf x}}_i'{\color{darkred}\beta})} \right) \color{darkblue}X_{i,j}\right] = 0 \>\>\>\forall j \]
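
These first-order conditions can also be solved with a plain Newton iteration. The sketch below is my own Python illustration, not the CNS approach used in the GAMS model: it uses the gradient from the formula above and the matrix \(X'WX\) with \(W=\mathrm{diag}(p_i(1-p_i))\), which is the negative of the Hessian of the log-likelihood. The helper names `solve` and `logit_newton` are hypothetical, and the data is copied from the data table.

```python
import math

# Spector-Mazzeo data as listed in the post: (GPA, TUCE, PSI, GRADE) per case.
ROWS = [
    (2.66, 20, 0, 0), (2.89, 22, 0, 0), (3.28, 24, 0, 0), (2.92, 12, 0, 0),
    (4.00, 21, 0, 1), (2.86, 17, 0, 0), (2.76, 17, 0, 0), (2.87, 21, 0, 0),
    (3.03, 25, 0, 0), (3.92, 29, 0, 1), (2.63, 20, 0, 0), (3.32, 23, 0, 0),
    (3.57, 23, 0, 0), (3.26, 25, 0, 1), (3.53, 26, 0, 0), (2.74, 19, 0, 0),
    (2.75, 25, 0, 0), (2.83, 19, 0, 0), (3.12, 23, 1, 0), (3.16, 25, 1, 1),
    (2.06, 22, 1, 0), (3.62, 28, 1, 1), (2.89, 14, 1, 0), (3.51, 26, 1, 0),
    (3.54, 24, 1, 1), (2.83, 27, 1, 1), (3.39, 17, 1, 1), (2.67, 24, 1, 0),
    (3.65, 21, 1, 1), (4.00, 23, 1, 1), (3.10, 21, 1, 0), (2.39, 19, 1, 1),
]
y = [float(grade) for (_, _, _, grade) in ROWS]
X = [[1.0, gpa, float(tuce), float(psi)] for (gpa, tuce, psi, _) in ROWS]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(M[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (M[k][n] - tail) / M[k][k]
    return z


def gradient(X, y, beta):
    """Left-hand side of the first-order conditions, one entry per coefficient."""
    p = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, r)))) for r in X]
    return [sum((yi - pi) * r[j] for r, yi, pi in zip(X, y, p))
            for j in range(len(beta))]


def logit_newton(X, y, max_iter=50, tol=1e-10):
    """Newton's method on the logit first-order conditions, starting at zero."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, r)))) for r in X]
        g = gradient(X, y, beta)
        # X'WX with W = diag(p_i (1 - p_i)); this is minus the Hessian.
        XWX = [[sum(pi * (1.0 - pi) * r[a] * r[b] for r, pi in zip(X, p))
                for b in range(k)] for a in range(k)]
        step = solve(XWX, g)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


beta_logit = logit_newton(X, y)
```

Starting from zero mirrors the "reset levels" step in the GAMS model; `beta_logit` (constant, gpa, tuce, psi) can be compared with the LOGIT columns of the final results table.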


The probit model can be stated directly as:

Probit Log Likelihood
\[\max \ln\color{darkred}L = \sum_i \left\{\color{darkblue}y_i \ln \Phi({\color{darkblue}{\bf{x}}}_i'{\color{darkred}{\bf{\beta}}}) + (1-\color{darkblue}y_i) \ln(1- \Phi({\color{darkblue}{\bf{x}}}_i'{\color{darkred}{\bf{\beta}}})) \right\}\]

where \(\Phi(.)\) is the standard normal cumulative distribution function, available in GAMS as the function errorf. There is not much we can simplify here.
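
A note on naming, since it can cause confusion: as I understand the GAMS documentation, errorf(x) is the integral of the standard normal density from minus infinity to x, i.e. \(\Phi(x)\), and not the mathematical error function erf. The two are related by \(\Phi(x) = \frac{1}{2}\left[1+\mathrm{erf}(x/\sqrt{2})\right]\). The short Python sketch below (standard library only, the function name is mine) checks this identity numerically against `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def std_normal_cdf(z):
    """Standard normal CDF expressed through the mathematical error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Largest deviation from the library implementation on a few test points.
points = [-3.0, -1.0, -0.25, 0.0, 0.5, 1.96, 4.0]
max_diff = max(abs(std_normal_cdf(z) - NormalDist().cdf(z)) for z in points)
```

So a Python translation of the probit objective would use `std_normal_cdf` (or `NormalDist().cdf`) where the GAMS model uses errorf.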

In these models, we repeat the expression \({\bf{x}}_i'{\bf{\beta}}\). Usually, I am inclined to avoid this, at the cost of an extra variable and equality constraint. In this case, it is most likely better just to keep the duplication: the summation length is small, and the cost of adding a constraint can be rather big.

I also reproduced the more exotic complementary log log model. This uses the distribution \[F(x) = 1-\exp[-\exp(x)]\] and \[f(x)=\exp(x)\exp[-\exp(x)]\] I could not find much information on the Gompertz model (except that the name is sometimes used for the CLogLog model). The Gompertz columns in table 17.1 from [1] remain a bit of a mystery to me.




Note that the CLogLog density is not symmetric, while the probit (normal) and logit densities are.
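
The symmetry claim is easy to check numerically. Below is a small Python sketch (my own function names, standard library only) that evaluates the three densities given earlier at \(x\) and \(-x\).

```python
import math


def logit_pdf(x):
    p = 1.0 / (1.0 + math.exp(-x))
    return p * (1.0 - p)


def probit_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cloglog_pdf(x):
    return math.exp(x) * math.exp(-math.exp(x))


x0 = 1.5
# Differences f(x0) - f(-x0): zero for a density that is symmetric around 0.
asym_logit = logit_pdf(x0) - logit_pdf(-x0)
asym_probit = probit_pdf(x0) - probit_pdf(-x0)
asym_cloglog = cloglog_pdf(x0) - cloglog_pdf(-x0)
```

The first two differences vanish up to floating-point noise, while the CLogLog density puts clearly more mass at \(-1.5\) than at \(1.5\).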


The final results table looks like:


----    497 PARAMETER results  replicate Greene table 17.1

               OLS(QP1)    OLS(QP2)   OLS(NRML)     OLS(py)      LOGIT1
                  coeff       coeff       coeff       coeff       coeff

constant         -1.498      -1.498      -1.498      -1.498     -13.021
gpa               0.464       0.464       0.464       0.464       2.826
tuce              0.010       0.010       0.010       0.010       0.095
psi               0.379       0.379       0.379       0.379       2.379

            +    LOGIT2       LOGIT      PROBIT      PROBIT     CLOGLOG
                  coeff         APE       coeff         APE       coeff

constant        -13.021                  -7.452                 -10.031
gpa               2.826       0.363       1.626       0.361       2.294
tuce              0.095       0.012       0.052       0.011       0.041
psi               2.379       0.358       1.426       0.374       1.562
mean f(x'b)                   0.128                   0.222

            +   CLOGLOG
                    APE

gpa               0.413
tuce              0.007
psi               0.312
mean f(x'b)       0.180


Conclusion


I find that reproducing results using a GAMS model is often a good exercise. There may be easier-to-use canned routines for this in different statistical packages, but here we can verify the original mathematical formulation.
 

References



  1. William Greene, Econometric Analysis, 8th edition, 2017. Chapter 17, Binary Outcomes and Discrete Choices.
  2. Spector, Lee C., and Michael Mazzeo. “Probit Analysis and Economic Education.” The Journal of Economic Education 11, no. 2 (1980): 37–44.

Appendix: GAMS model


$ontext

  Logit, Probit Estimation
  ------------------------

  Erwin Kalvelagen, Amsterdam Optimization

  Data:
     http://pages.stern.nyu.edu/~wgreene/Text/tables/TableF21-1.txt

$offtext



option qcp=cplex;


*-----------------------------------------------------------
* raw data from Greene
*-----------------------------------------------------------

sets
  i  'records'       /case1*case32/
  p0 'all variables' /constant,grade,gpa,tuce,psi/
;

table data(i,p0) 'raw data'
           GPA      TUCE    PSI    GRADE
case1      2.66      20      0        0
case2      2.89      22      0        0
case3      3.28      24      0        0
case4      2.92      12      0        0
case5      4.00      21      0        1
case6      2.86      17      0        0
case7      2.76      17      0        0
case8      2.87      21      0        0
case9      3.03      25      0        0
case10      3.92      29      0        1
case11      2.63      20      0        0
case12      3.32      23      0        0
case13      3.57      23      0        0
case14      3.26      25      0        1
case15      3.53      26      0        0
case16      2.74      19      0        0
case17      2.75      25      0        0
case18      2.83      19      0        0
case19      3.12      23      1        0
case20      3.16      25      1        1
case21      2.06      22      1        0
case22      3.62      28      1        1
case23      2.89      14      1        0
case24      3.51      26      1        0
case25      3.54      24      1        1
case26      2.83      27      1        1
case27      3.39      17      1        1
case28      2.67      24      1        0
case29      3.65      21      1        1
case30      4.00      23      1        1
case31      3.10      21      1        0
case32      2.39      19      1        1
;

display data;


*-----------------------------------------------------------
* extract data
* form y, x
*-----------------------------------------------------------

set p(p0) 'independent variables'/constant,gpa,tuce,psi/;

parameters
  y(i)   'dependent variable (grade)'
  X(i,p) 'independent variables'
;

y(i)   = data(i,'grade');
x(i,p) = data(i,p);
x(i,'constant') = 1;
display y,x;

* check the assumption that y(i) is binary
abort$sum(i$(y(i)<>0 and y(i)<>1),1) "GRADE should be binary";
* the APE calculation relies on psi being binary
abort$sum(i$(x(i,'psi')<>0 and x(i,'psi')<>1),1) "PSI should be binary";


*-----------------------------------------------------------
* solve OLS as QP
*-----------------------------------------------------------

parameter results(*,*,*) 'replicate Greene table 17.1 ';
option results:3:1:2;

variable
  sse       'sum of squared errors'
  coeff(p)  'estimated coefficients'
  e(i)      'error term'
;
equation
  obj       'objective'
  fit(i)    'linear fit'
;

obj..    sse =e= sum(i, sqr(e(i)));
fit(i).. y(i) =e= sum(p, coeff(p)*x(i,p)) + e(i);

model ols /obj,fit/;
solve ols using qcp minimizing sse;

results(p,'OLS(QP1)','coeff') = coeff.l(p);
display results;

*-----------------------------------------------------------
* solve OLS as QP (alternative formulation)
*-----------------------------------------------------------

equation unconobj 'unconstrained objective';
unconobj.. sse =e= sum(i, sqr(y(i)-sum(p, coeff(p)*x(i,p))));

model ols2 /unconobj/;
solve ols2 using qcp minimizing sse;

results(p,'OLS(QP2)','coeff') = coeff.l(p);

display results;

*-----------------------------------------------------------
* solve OLS as a system of linear equations
*
* solve the normal equations
*
* (X'X) b = X'y
*
*-----------------------------------------------------------


alias(p,pp);
parameter xx(p,pp) "inner product (X'X)";
xx(p,pp) = sum(i, x(i,p)*x(i,pp));

equation normal(p) 'normal equations';

normal(p).. sum(pp, xx(p,pp)*coeff(pp)) =e= sum(i, x(i,p)*y(i));

model ols3 /normal/;
solve ols3 using cns;

results(p,'OLS(NRML)','coeff') = coeff.l(p);

display results;

*-----------------------------------------------------------
* solve OLS using python/numpy
*-----------------------------------------------------------

parameter theta(p) 'estimated coefficients';
$libinclude linalg ols i p x y theta

results(p,'OLS(py)','coeff') = theta(p);

display results;

*-----------------------------------------------------------
* Logit model 1 : optimization
*-----------------------------------------------------------

variable lnL 'log likelihood';

equation LogitObj 'log likelihood for Logit model';

LogitObj.. lnL =e= sum(i, y(i)*sum(p, coeff(p)*x(i,p))
                          - log[1 + exp(sum(p, coeff(p)*x(i,p)))]
                      );

model logit1 /LogitObj/;
* reset levels (no cheating)
coeff.l(pp)=0;
solve logit1 using nlp maximizing lnL;

results(p,'LOGIT1','coeff') = coeff.l(p);
display results;

*-----------------------------------------------------------
* Logit model 2 : system of equations
*-----------------------------------------------------------

alias(p,pp);
equation LogitFirstOrder 'Logit first order conditions';

LogitFirstOrder(p)..
  sum(i, {y(i)-exp(sum(pp, coeff(pp)*x(i,pp)))/
               [1+exp(sum(pp, coeff(pp)*x(i,pp)))]}*x(i,p)) =e= 0;

model logit2 /LogitFirstOrder/;
* reset levels
coeff.l(pp)=0;
solve logit2 using cns;

results(p,'LOGIT2','coeff') = coeff.l(p);
display results;


*-----------------------------------------------------------
* Logit APE
*-----------------------------------------------------------

set pnoc(p) 'p except const';
pnoc(p) = not sameas(p,'constant');

parameter xb(i) 'Xb (intermediate expression)';
Xb(i) = sum(p, coeff.l(p)*x(i,p));

* continuous formula for all regressors; the psi entry is overwritten below
results(pnoc,'LOGIT','APE') =
  sum(i, exp(Xb(i))/sqr(1+exp(Xb(i)))*coeff.l(pnoc))/card(i);

parameter xb2(i,*) 'Xb2 (Xb with PSI=0 and PSI=1)';
xb2(i,'PSI=0') = sum(p$(not sameas(p,'PSI')), coeff.l(p)*x(i,p));
xb2(i,'PSI=1') = xb2(i,'PSI=0')+coeff.l('PSI');

parameter plogis(i,*) 'logistic distribution';
plogis(i,'PSI=1') = 1/(1+exp(-xb2(i,'PSI=1')));
plogis(i,'PSI=0') = 1/(1+exp(-xb2(i,'PSI=0')));

results('PSI','LOGIT','APE') =
  sum(i,plogis(i,'PSI=1')-plogis(i,'PSI=0'))/card(i);

results("mean f(x'b)",'LOGIT','APE') =
  sum(i, exp(Xb(i))/sqr(1+exp(Xb(i))))/card(i);

display plogis,Xb2;

display results;


*-----------------------------------------------------------
* Probit model
*-----------------------------------------------------------


equation probitObj 'log likelihood for Probit model';
probitObj..
   lnL =e= sum(i, y(i)*log(errorf(sum(p, coeff(p)*x(i,p)))) +
                  (1-y(i))*log(1-errorf(sum(p, coeff(p)*x(i,p)))));

model probit /probitObj/;
* reset levels
coeff.l(pp)=0;
solve probit maximizing lnL using nlp;

results(p,'PROBIT','coeff') = coeff.l(p);

results(pnoc,'PROBIT','APE') =
  sum(i, 1/sqrt(2*pi) * exp(-0.5*sqr(sum(p, coeff.l(p)*x(i,p)))) * coeff.l(pnoc))/card(i);

* same as for logit APE
xb2(i,'PSI=0') = sum(p$(not sameas(p,'PSI')), coeff.l(p)*x(i,p));
xb2(i,'PSI=1') = xb2(i,'PSI=0')+coeff.l('PSI');

results('PSI','PROBIT','APE') = sum(i, errorf(xb2(i,'PSI=1'))-errorf(xb2(i,'PSI=0')))/card(i);

results("mean f(x'b)",'PROBIT','APE') =
  sum(i, 1/sqrt(2*pi) * exp(-0.5*sqr(sum(p, coeff.l(p)*x(i,p)))))/card(i);

display results;


*-----------------------------------------------------------
* Complementary log log model
*-----------------------------------------------------------

equation compLogLogObj 'log likelihood for comp log log model';
compLogLogObj..
   lnL =e= sum(i, y(i)*log(1-exp(-exp(sum(p, coeff(p)*x(i,p))))) +
                  (1-y(i))*log(exp(-exp(sum(p, coeff(p)*x(i,p))))));

model comploglog /compLogLogObj/;
* reset levels
coeff.l(pp)=0;
solve comploglog maximizing lnL using nlp;

results(p,'CLOGLOG','coeff') = coeff.l(p);

Xb(i) = sum(p, coeff.l(p)*x(i,p));

results(pnoc,'CLOGLOG','APE') =
  sum(i, exp(Xb(i))*exp(-exp(Xb(i))) * coeff.l(pnoc))/card(i);

* same as for logit APE
xb2(i,'PSI=0') = sum(p$(not sameas(p,'PSI')), coeff.l(p)*x(i,p));
xb2(i,'PSI=1') = xb2(i,'PSI=0')+coeff.l('PSI');

results('PSI','CLOGLOG','APE') = sum(i, exp(-exp(xb2(i,'PSI=0')))-exp(-exp(xb2(i,'PSI=1'))))/card(i);

results("mean f(x'b)",'CLOGLOG','APE') =
  sum(i, exp(Xb(i))*exp(-exp(Xb(i))))/card(i);

display results;



