Subset selection in regression

The following is a well-known application of branch-and-bound: select the k best independent variables to include in a regression. Often "best" means the smallest residual sum of squares. When we use a slightly different objective, we can let the solver decide on k instead of fixing k in advance.

The paper:

shows the detailed MIQP model:

We can easily reproduce some of their results in GAMS.

First we load the data into our familiar y and X arrays:

Next we solve a full OLS model to get an estimate of the error variance. For this we need the level of the SSQ variable from the solution:

Finally we run the MIQP on the Cp criterion:

Note that n is a constant, so we could drop it from the objective and still find the same optimal solutions. We keep it so that the objective can be interpreted as Mallows' Cp.
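The two steps above (estimate the variance from the full model, then minimize Cp) can be sketched outside GAMS. The Python snippet below is only an illustration and not the MIQP model from the paper. It uses brute-force enumeration instead of branch-and-bound, a small made-up data set rather than the housing data, and it assumes the common definition Cp = SSQ/sigma2 + 2k - n, where k counts an always-included intercept. All function and variable names are hypothetical.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):
        s = sum(M[c][k] * x[k] for k in range(c + 1, n))
        x[c] = (M[c][n] - s) / M[c][c]
    return x

def ssq(X, y, cols):
    """Residual sum of squares of OLS on an intercept plus the columns in cols."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    m = len(Z[0])
    # Normal equations (Z'Z) beta = Z'y; adequate for a tiny illustrative example.
    ZtZ = [[sum(z[a] * z[c] for z in Z) for c in range(m)] for a in range(m)]
    Zty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(m)]
    beta = solve(ZtZ, Zty)
    return sum((yi - sum(bj * zj for bj, zj in zip(beta, z))) ** 2
               for z, yi in zip(Z, y))

# Made-up data (assumption): y depends on column 0 only, plus small fixed noise.
x0 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
x2 = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, 0.1, -0.05, -0.1, 0.0]
X = [list(r) for r in zip(x0, x1, x2)]
y = [2 + 3 * a + e for a, e in zip(x0, noise)]

n, p = len(y), len(X[0])
full = tuple(range(p))

# Step 1: variance estimate from the full OLS model (intercept + p regressors).
sigma2 = ssq(X, y, full) / (n - p - 1)

# Step 2: Cp for a subset; the intercept is counted as a parameter (assumption).
def cp(cols):
    return ssq(X, y, cols) / sigma2 + 2 * (len(cols) + 1) - n

subsets = [c for k in range(p + 1) for c in combinations(range(p), k)]
best = min(subsets, key=cp)
print("best subset:", best, "Cp:", cp(best))
```

With this definition the full model always has Cp equal to its number of parameters (here p + 1), which is a handy sanity check. Enumeration is only workable for a handful of columns; for larger problems the MIQP formulation lets the solver prune most subsets.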

Many of the data sets used in the paper contain columns with categorical data. In regression models these are typically handled by a set of dummy variables. When using stepwise regression or a method like MIQP we probably should include (or exclude) a whole block of dummy variables (corresponding to the same categorical variable) instead of considering them individually for inclusion. That makes the MIQP model a bit more complicated to express (but easier to solve).

The housing data set used above had just a single dummy variable. In this case there is no problem.

For problems with real categorical data (with more than two levels), we want to adjust the model. A categorical variable with nlevels levels is represented by nlevels-1 dummy variables. These dummies belong to the same "block", as they refer to the same categorical variable. We can adapt the MIQP model in two ways:

Keep all j binary variables z and add equality constraints for the z’s belonging to the same block.
Generate only z(k) binary variables and bound all the beta’s belonging to the same block by the same z(k). I.e. we need to map from beta(j) to z(k).
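In symbols, the two variants might look as follows. This is a sketch under assumptions: $B_k$ denotes the set of columns $j$ belonging to block $k$ (a continuous regressor forms a block of size one), $M$ is the big-M constant, and the notation is mine rather than the paper's.

```latex
% Variant 1: one binary per column, tied together within a block
-M z_j \le \beta_j \le M z_j \quad \forall j, \qquad
z_j = z_{j'} \quad \forall j, j' \in B_k, \qquad
z_j \in \{0,1\}

% Variant 2: one binary per block, shared by all columns in the block
-M z_k \le \beta_j \le M z_k \quad \forall k,\ \forall j \in B_k, \qquad
z_k \in \{0,1\}
```

In the second variant the number of selected regressors is no longer $\sum_j z_j$ but $\sum_k |B_k| \, z_k$, assuming every dummy in a selected block counts as one parameter in the Cp objective.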

As the paper notes, when the solver supports indicator constraints (Cplex, Xpress, SCIP), we can use implications instead, so we don't have to use a big-M constant.
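Written as an implication, the link between a binary variable and its coefficient becomes the following (a sketch in my own notation, with $z_j$ the binary inclusion variable for $\beta_j$):

```latex
z_j = 0 \;\Longrightarrow\; \beta_j = 0 \qquad \forall j
```

This replaces the pair of big-M inequalities $-M z_j \le \beta_j \le M z_j$ and leaves $\beta_j$ unrestricted when $z_j = 1$.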

GAMS does not have language facilities to handle this (it allows indicator constraints to be specified in a Cplex option file, but I feel uncomfortable when an option file, which is meant for tolerances and such things, changes the meaning of the model in such a substantial way). AMPL does this better by having implications in the language itself.

I don’t really mind using big-M values. As these are essentially bounds on the decision variables, in practice we should be able to come up with a reasonable value.
