In this post an interesting problem is posed. I tried to suggest an optimization model, as opposed to a heuristic algorithm in Matlab. Judging by the zero points my answer received: without much success. I still think the MIP model is rather elegant. Here is the problem:
We have a large matrix with 10 columns and around 250k rows. The values are –1 and +1. We need to select N rows such that, when we sum each column over the selected rows, the average of the absolute values of those column sums is minimized.
We can formulate this as a very large MIP model with about 250k binary variables indicating whether a row is selected.
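In this direct form the model would look something like the following sketch (the notation is mine: \(a_{i,j}\) is the data matrix and \(\delta_i=1\) means row \(i\) is selected):

\[
\begin{aligned}
\min\;& \frac{1}{10}\sum_{j=1}^{10} \Bigl|\, \sum_{i} a_{i,j}\,\delta_i \,\Bigr| \\
\text{s.t.}\;& \sum_{i} \delta_i = N \\
& \delta_i \in \{0,1\}
\end{aligned}
\]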
There is however a data trick we can use. Observe that each cell has only two possible values, so there are just 2^10 = 1024 distinct possible rows. Our data is then essentially a fixed 1024 × 10 matrix plus a 1024-vector of counts indicating how many rows of each type we saw.
Now we only need 1024 integer variables. The upper bound on each integer variable is the corresponding count. Actually we can tighten that a little: the bound is the minimum of the count and N. The complete MIP model can look like:
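A sketch of such a model (notation is mine: \(c_i\) is the count for row pattern \(i\), \(x_i\) is how many rows of pattern \(i\) we select, and \(s^+_j, s^-_j\) are the split column sums):

\[
\begin{aligned}
\min\;& \sum_{j} \bigl(s^+_j + s^-_j\bigr) \\
\text{s.t.}\;& \sum_{i} a_{i,j}\,x_i = s^+_j - s^-_j \quad \forall j \\
& \sum_{i} x_i = N \\
& x_i \in \{0,1,\dots,\min(c_i,N)\} \\
& s^+_j,\; s^-_j \ge 0
\end{aligned}
\]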
We use a traditional variable-splitting technique to model the absolute values: each column sum is written as the difference of two nonnegative variables, and because we minimize their sum, at an optimal solution at least one of the two is zero, so their sum equals the absolute value. Note also that instead of the average or mean we just minimize the sum; this is equivalent since the number of columns is a fixed constant.
To test this formulation we first need to generate this 1,024-row matrix. In GAMS this can be done as follows:
set
  i   /n1*n1024/
  j   /j1*j10/
  val /'-1','+1'/
  ps(i,j,val) /system.PowerSetRight/
;
parameter a(i,j);
a(i,j)$ps(i,j,'+1') =  1;
a(i,j)$ps(i,j,'-1') = -1;
This will generate the complete 1,024 × 10 matrix of ±1 patterns.
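For completeness, here is a sketch of how the rest of the model could be written in GAMS. The identifiers cnt, sp, sn, selrows, the randomly generated test counts and the value of N are my own choices for illustration; in practice cnt(i) would be obtained by tallying the 250k data rows.

* test counts per pattern (assumption: random values standing in for the real tallies)
parameter cnt(i) 'number of data rows with pattern i';
cnt(i) = round(uniform(50,450));

scalar N 'number of rows to select' /100000/;

integer variable  x(i) 'number of rows of pattern i to select';
positive variables
   sp(j) 'positive part of column sum'
   sn(j) 'negative part of column sum';
variable z 'objective';

* upper bound: we cannot select more rows of a pattern than we have,
* and never more than N in total
x.up(i) = min(cnt(i), N);

equations
   obj       'minimize sum of absolute column sums'
   colsum(j) 'column sums via variable splitting'
   selrows   'select exactly N rows';

obj..       z =e= sum(j, sp(j) + sn(j));
colsum(j).. sum(i, a(i,j)*x(i)) =e= sp(j) - sn(j);
selrows..   sum(i, x(i)) =e= N;

model selectrows /all/;
option optcr = 0;
solve selectrows using mip minimizing z;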
For some large values of N the MIP solver will solve this problem surprisingly fast. Here is a log with CBC:
COIN-OR Branch and Cut (CBC Library 2.8)
Calling CBC main solution routine...
Solved to optimality.
Very impressive results (other good MIP solvers perform similarly).