Yet Another Math Programming Consultant

Finding a cluster of closest points


Problem statement


From [1]:

Consider \(n=100\) points in a 12-dimensional space. Find \(m=8\) points such that they are as close to each other as possible.


Models


The problem can be stated as a simple MIQP (Mixed Integer Quadratic Programming) model.

 
MIQP Model
\[\begin{align} \min & \sum_{i\lt j} \color{darkred}x_i \cdot \color{darkred}x_j \cdot \color{darkblue}{\mathit{dist}}_{i,j} \\ & \sum_i \color{darkred}x_i = \color{darkblue}m \\ & \color{darkred}x_i \in \{0,1\} \end{align}\]


This is a simple model. The main wrinkle is that we want to make sure that we only count each distance once. For this reason, we only consider distances with \(i \lt j\). Of course, we can exploit this also in calculating the distance matrix \( \color{darkblue}{\mathit{dist}}_{i,j}\), and make this a strictly upper-triangular matrix.
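As an aside (the post's own code is the GAMS listing in the appendix), the strictly upper-triangular distance matrix can be sketched in a few lines of Python; the function name here is purely illustrative:

```python
import math

def upper_triangular_distances(coords):
    """Pairwise Euclidean distances, stored only for i < j,
    so each distance is counted exactly once."""
    n = len(coords)
    return {(i, j): math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)}

# two of the points from the small data set below
d = upper_triangular_distances([(0.172, 0.843), (0.550, 0.301)])
print(d)   # only the (0, 1) entry is stored
```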

As the \( \color{darkred}x\) variables are binary, we can easily linearize this problem:


MIP Model
\[\begin{align} \min & \sum_{i\lt j} \color{darkred}y_{i,j} \cdot \color{darkblue}{\mathit{dist}}_{i,j} \\ & \color{darkred}y_{i,j} \ge \color{darkred}x_i + \color{darkred}x_j - 1 && \forall i \lt j\\ & \sum_i \color{darkred}x_i = \color{darkblue}m \\ & \color{darkred}x_i \in \{0,1\} \\ &\color{darkred}y_{i,j} \in [0,1] \end{align}\]


The inequality implements the implication \[\color{darkred}x_i= 1 \textbf{ and } \color{darkred}x_j = 1 \Rightarrow \color{darkred}y_{i,j} = 1 \] The variables \(\color{darkred}y_{i,j}\) can be binary, or they can be relaxed to be continuous between 0 and 1.

Finally, we can also consider a slightly different problem: instead of minimizing the sum of the distances between the selected points, we can minimize the maximum distance within this group of selected points. The model can look like:
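A quick sanity check (in Python, just to illustrate why the relaxation works) confirms that for binary values the lower bound \(x_i+x_j-1\), clipped at zero, coincides with the product \(x_i x_j\). Since the distances are nonnegative, a minimizing solver pushes each \(y_{i,j}\) down onto this bound:

```python
from itertools import product

# For binary x_i and x_j, max(0, x_i + x_j - 1) equals the product x_i * x_j.
# With nonnegative distances, minimization drives y[i,j] to this lower bound,
# so the continuous y variables end up binary anyway.
for xi, xj in product((0, 1), repeat=2):
    assert max(0, xi + xj - 1) == xi * xj
print("linearization reproduces the product for all binary combinations")
```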

Minmax Model
\[\begin{align} \min\> & \color{darkred}z \\ & \color{darkred}z \ge \color{darkred}y_{i,j} \cdot \color{darkblue}{\mathit{dist}}_{i,j} && \forall i \lt j \\ & \color{darkred}y_{i,j} \ge \color{darkred}x_i + \color{darkred}x_j - 1 && \forall i \lt j\\ & \sum_i \color{darkred}x_i = \color{darkblue}m \\ & \color{darkred}x_i \in \{0,1\} \\ &\color{darkred}y_{i,j} \in [0,1] \end{align}\]

We again use our linearization here.
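For very small instances, both objectives can be checked by brute-force enumeration of all \(\binom{n}{m}\) subsets. This Python sketch (illustrative only; with \(n=100, m=8\) there are about \(1.9\cdot 10^{11}\) subsets, which is exactly why we use MIP models instead) makes the difference between the sum and minmax objectives concrete:

```python
from itertools import combinations
import math

def solve_by_enumeration(coords, m, objective="sum"):
    """Exact solution by trying every m-subset (tiny instances only).
    objective: 'sum' minimizes the total pairwise distance,
               'max' minimizes the largest pairwise distance."""
    best_sel, best_val = None, float("inf")
    for sel in combinations(range(len(coords)), m):
        pair_dists = [math.dist(coords[i], coords[j])
                      for i, j in combinations(sel, 2)]
        val = sum(pair_dists) if objective == "sum" else max(pair_dists)
        if val < best_val:
            best_sel, best_val = sel, val
    return best_sel, best_val

# a tiny example: three points near the origin, two far away
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (6.0, 6.0)]
print(solve_by_enumeration(pts, 3, "sum")[0])   # (0, 1, 2)
print(solve_by_enumeration(pts, 3, "max")[0])   # (0, 1, 2)
```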

Finally, I also tried a very simple greedy heuristic:
  1. Select the two points that are closest to each other.
  2. Select a new unselected point that is closest to our already selected points.
  3. Repeat step 2 until we have selected 8 points.
We can expect our optimization models to do a bit better than this simplistic approach. 
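The three steps above can be sketched in Python as follows (the post's actual implementation is the GAMS code in the appendix; the function name is illustrative):

```python
from itertools import combinations
import math

def greedy_cluster(coords, m):
    """Greedy heuristic: start with the closest pair, then repeatedly add
    the unselected point that is closest to any selected point."""
    def d(i, j):
        return math.dist(coords[i], coords[j])

    # step 1: the two points closest to each other
    i0, j0 = min(combinations(range(len(coords)), 2), key=lambda p: d(*p))
    selected = {i0, j0}
    # steps 2 and 3: grow the cluster one point at a time until |selected| = m
    while len(selected) < m:
        rest = [i for i in range(len(coords)) if i not in selected]
        selected.add(min(rest, key=lambda i: min(d(i, j) for j in selected)))
    return sorted(selected)

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (6.0, 6.0)]
print(greedy_cluster(pts, 3))   # [0, 1, 2]
```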

In the methods here, we did not make any assumptions about the distance metric being used; we just needed the distance matrix. This means you can use Euclidean, Manhattan, or other metrics. In addition, you can apply some normalization or weighting before calculating the distances, which may be useful if the features have different units. These models also do not change whether one uses low- or high-dimensional data.
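To make this metric flexibility concrete, here is a small Python sketch where the metric is a plug-in function (names are illustrative):

```python
import math

def distance_matrix(coords, metric):
    """Strictly upper-triangular distance matrix under an arbitrary metric."""
    n = len(coords)
    return {(i, j): metric(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

pts = [(0.0, 0.0), (1.0, 1.0)]
print(distance_matrix(pts, math.dist)[0, 1])   # Euclidean: 1.414...
print(distance_matrix(pts, manhattan)[0, 1])   # Manhattan: 2.0
```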

There are alternative models one could envision, such as finding the smallest enclosing sphere or box. These methods do not make things simpler and will likely not improve upon our models. 

Small data set


Let's start with selecting \(m=8\) points from  \(n=50\), using random 2d coordinates.


----     10 PARAMETER coord  coordinates

                     x           y

point1           0.172       0.843
point2           0.550       0.301
point3           0.292       0.224
point4           0.350       0.856
point5           0.067       0.500
point6           0.998       0.579
point7           0.991       0.762
point8           0.131       0.640
point9           0.160       0.250
point10          0.669       0.435
point11          0.360       0.351
point12          0.131       0.150
point13          0.589       0.831
point14          0.231       0.666
point15          0.776       0.304
point16          0.110       0.502
point17          0.160       0.872
point18          0.265       0.286
point19          0.594       0.723
point20          0.628       0.464
point21          0.413       0.118
point22          0.314       0.047
point23          0.339       0.182
point24          0.646       0.561
point25          0.770       0.298
point26          0.661       0.756
point27          0.627       0.284
point28          0.086       0.103
point29          0.641       0.545
point30          0.032       0.792
point31          0.073       0.176
point32          0.526       0.750
point33          0.178       0.034
point34          0.585       0.621
point35          0.389       0.359
point36          0.243       0.246
point37          0.131       0.933
point38          0.380       0.783
point39          0.300       0.125
point40          0.749       0.069
point41          0.202       0.005
point42          0.270       0.500
point43          0.151       0.174
point44          0.331       0.317
point45          0.322       0.964
point46          0.994       0.370
point47          0.373       0.772
point48          0.397       0.913
point49          0.120       0.735
point50          0.055       0.576


The MIQP and MIP models give the same solution, but (as expected) the MINMAX solution is slightly different. Here are the results with Cplex:

 
             HEURISTIC        MIQP         MIP      MINMAX

point2           1.000
point3                       1.000       1.000       1.000
point9                                               1.000
point10          1.000
point11                      1.000       1.000
point12                                              1.000
point15          1.000
point18                      1.000       1.000       1.000
point20          1.000
point23                      1.000       1.000       1.000
point24          1.000
point25          1.000
point27          1.000
point29          1.000
point35                      1.000       1.000
point36                      1.000       1.000       1.000
point39                      1.000       1.000       1.000
point43                                              1.000
point44                      1.000       1.000
status                     Optimal     Optimal     Optimal
obj                          3.447       3.447       0.210
sum              4.997       3.447       3.447       3.522
max              0.291       0.250       0.250       0.210
time                        33.937       2.125       0.562


When we plot the results, we see that our optimization models do much better than the heuristic, and that the two optimal solutions have quite an overlap in the selected points.

Large data set 


Here we select \(m=8\) points from  \(n=100\), using random 12d coordinates. Using Gurobi on a faster machine we see:


----    112 PARAMETER results

             HEURISTIC        MIQP         MIP      MINMAX

point5     1.000  1.000
point8     1.000  1.000
point17    1.000  1.000
point19    1.000  1.000
point24    1.000
point35    1.000  1.000
point38    1.000  1.000
point39    1.000
point42    1.000  1.000
point43    1.000
point45          1.000       1.000       1.000
point51    1.000
point56    1.000
point76    1.000
point81    1.000  1.000
point89    1.000  1.000
point94          1.000       1.000       1.000       1.000
point97    1.000
status                   TimeLimit     Optimal     Optimal
obj                         26.230      26.230       1.147
sum             30.731      26.230      26.230      27.621
max              1.586       1.357       1.357       1.147
time                      1015.984      49.000       9.594
gap%                        23.379


The MIQP model hit a time limit of 1000 seconds but actually found the optimal solution. It just did not have the time to prove it. The overlap between the MIQP/MIP models and the MINMAX model is very small (just one shared point). I expected more overlap.

Conclusions


The MINMAX model seems to solve faster than the other versions. This is a bit surprising: minmax objectives are often slower than direct sums. It always pays off to experiment with different formulations, and this is a great example of that.

These models are independent of the dimensionality of the data and the distance metric being used.

References


  1. Given a set of points or vectors, find the set of N points that are closest to each other, https://stackoverflow.com/questions/64542442/given-a-set-of-points-or-vectors-find-the-set-of-n-points-that-are-closest-to-e


Appendix: GAMS code


Some interesting features:
  • There are not that many models where I can use xor operators. Here it is used in the Greedy Heuristic where we want to consider points \(i\) and \(j\) where one of them is in the cluster and one outside. 
  • Macros are used to prevent repetitive code in the reporting of results.
  • Acronyms are used in reporting. 
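The xor filter can be mimicked in Python to show what it selects: pairs with exactly one endpoint inside the current cluster (a toy illustration, not the GAMS code itself):

```python
# The GAMS condition  x.l(i) xor x.l(j)  picks pairs with exactly one
# endpoint inside the current cluster; in Python the same filter is the
# boolean xor of two membership tests.
selected = {1, 3}
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)
         if (i in selected) ^ (j in selected)]
print(pairs)   # [(0, 1), (0, 3), (1, 2), (1, 4), (2, 3), (3, 4)]
```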


$ontext

  find cluster of m=8 points closest to each other

  compare:
      simple greedy heuristic (HEURISTIC)
      quadratic model (MIQP)
      linearized model (MIP)
      minmax model (MINMAX)

$offtext

set
  c 'coordinates' /x,y/
  i 'points' /point1*point50/
;

scalar m 'number of points to select' /8/;

parameter coord(i,c) 'random coordinates';
coord(i,c) = uniform(0,1);
display coord;

alias(i,j);
set ij(i,j) 'upper triangular structure';
ij(i,j) = ord(i) < ord(j);

parameter dist(i,j) 'euclidean distances';
dist(ij(i,j)) = sqrt(sum(c, sqr(coord(i,c)-coord(j,c))));

binary variables x(i) 'selected points';
positive variable y(i,j) 'x(i)*x(j), relaxed to be continuous';
y.up(ij) = 1;

variable z 'objective';

equations
   obj1   'quadratic objective'
   obj2   'linearized objective'
   obj3   'minmax objective'
   select 'number of selected points'
   bound  'implication x(i)=x(j)=1 ==> y(i,j)=1'
;

obj1..  z =e= sum(ij(i,j), x(i)*x(j)*dist(ij));
obj2..  z =e= sum(ij, y(ij)*dist(ij));
obj3(ij)..  z =g= y(ij)*dist(ij);
select..  sum(i, x(i)) =e= m;
bound(ij(i,j))..  y(i,j) =g= x(i) + x(j) - 1;

model m1 /obj1,select/;
model m2 /obj2,select,bound/;
model m3 /obj3,select,bound/;
option optcr=0, mip=cplex, miqcp=cplex, threads=8;

*------------------------------------------------
* reporting macros
*------------------------------------------------

parameter results(*,*);

set dummy 'ordering for output' /status,obj,sum,max,time,'gap%'/;

acronym TimeLimit;
acronym Optimal;

* macros for reporting
$macro sumdist    sum(ij(i,j), x.l(i)*x.l(j)*dist(ij))
$macro maxdist    smax(ij(i,j), x.l(i)*x.l(j)*dist(ij))
$macro report(m,label)  \
    x.l(i) = round(x.l(i));  \
    results(i,label) = x.l(i); \
    results('status',label)$(m.solvestat=1) = Optimal; \
    results('status',label)$(m.solvestat=3) = TimeLimit; \
    results('obj',label) = z.l; \
    results('sum',label) = sumdist; \
    results('max',label) = maxdist; \
    results('time',label) = m.resusd; \
    results('gap%',label)$(m.solvestat=3) = 100*abs(m.objest - m.objval)/abs(m.objest);

*------------------------------------------------
* heuristic
*------------------------------------------------

* step 1 : select points 1 and 2 by minimizing distance
scalar dmin;
dmin = smin(ij, dist(ij));
loop(ij(i,j)$(dist(ij)=dmin),
   x.l(i) = 1;
   x.l(j) = 1;
   break;
);

* add points 3..m, closest to points selected earlier
scalar k;
for(k = 3 to m,
   dmin = smin(ij(i,j)$(x.l(i) xor x.l(j)), dist(ij));
   loop(ij(i,j)$(dist(ij)=dmin),
      x.l(i) = 1;
      x.l(j) = 1;
      break;
   );
);

results(i,'HEURISTIC') = x.l(i);
results('sum','HEURISTIC') = sumdist;
results('max','HEURISTIC') = maxdist;

*------------------------------------------------
* run models m1 through m3
*------------------------------------------------

solve m1 minimizing z using miqcp;
report(m1,'MIQP')

solve m2 minimizing z using mip;
report(m2,'MIP')

solve m3 minimizing z using mip;
report(m3,'MINMAX')

display results;


