Channel: Yet Another Math Programming Consultant

k out of n with smallest bandwidth


The problem from [1], slightly generalized, is easy to describe:

Given a data vector \(\color{darkblue}v_i\) of length \(n\), select \(k\lt n\) elements that are closest to each other.


Of course, we can state the problem as a formal optimization model:  


Optimization Model
\[\begin{align}\min\>&\color{darkred}U-\color{darkred}L \\ & \color{darkred}\delta_i=1 \Rightarrow \color{darkblue}v_i \in [\color{darkred}L,\color{darkred}U] \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \end{align}\]
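For tiny instances, the meaning of this model can be checked by brute-force enumeration of all \(k\)-subsets. A minimal Python sketch (function name is mine, not from the post):

```python
from itertools import combinations

def best_k_window_bruteforce(v, k):
    """Enumerate all k-subsets of v and return the smallest range max-min.

    Exponential in n, so only usable to sanity-check tiny instances."""
    return min(max(s) - min(s) for s in combinations(v, k))

# tiny example: the 3 closest values are 9.0, 9.5, 9.9
v = [4.0, 9.5, 1.2, 9.9, 5.0, 9.0]
print(best_k_window_bruteforce(v, 3))
```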


We can model this as a standard Mixed-Integer Programming (MIP) problem as follows:


MIP Model
\[\begin{align}\min\>& \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \end{align}\]


Here \(\color{darkblue}M\) is a large enough, but as small as possible, constant. A reasonable value is: \[\color{darkblue}M := \max_i \color{darkblue}v_i - \min_i \color{darkblue}v_i\]
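This choice of \(\color{darkblue}M\) is indeed large enough: using the bounds \(\color{darkred}L \le \max_i \color{darkblue}v_i\) and \(\color{darkred}U \ge \min_i \color{darkblue}v_i\), we have \[\begin{align} & \color{darkred}L - \color{darkblue}M \le \max_i \color{darkblue}v_i - \color{darkblue}M = \min_i \color{darkblue}v_i \le \color{darkblue}v_i \\ & \color{darkred}U + \color{darkblue}M \ge \min_i \color{darkblue}v_i + \color{darkblue}M = \max_i \color{darkblue}v_i \ge \color{darkblue}v_i \end{align}\] so for \(\color{darkred}\delta_i=0\) both big-M constraints hold automatically, while for \(\color{darkred}\delta_i=1\) they reduce to \(\color{darkblue}v_i \in [\color{darkred}L,\color{darkred}U]\).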


Data


To experiment, I chose \(n=1000, k=100\) and generated random data \[\color{darkblue}v_i \sim U(-100,100)\] With 100 out of 1000 uniform points on an interval of length 200, evenly spread points would span roughly \(200\cdot k/n=20\), so we would expect the optimal range to be a bit better than 20. Indeed the optimal solution has an objective of 15.923.


Performance


"k out of n" models can be difficult. Indeed the performance of this default model is not great.

We can improve the performance significantly by changing the model just slightly:

Improved MIP Model
\[\begin{align}\min\>& \color{darkred}R \\ & \color{darkred}R = \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \\ & \color{darkred}R \ge 0\end{align}\]


----    114 PARAMETER results

                    MIP    IMPROVED

points         1000.000    1000.000
vars           1003.000    1003.000
  discr        1000.000    1000.000
equs           2002.000    2002.000
status          Optimal     Optimal
obj              15.923      15.923
time             75.203       2.578
nodes         53387.000    6445.000
iterations   355004.000   88794.000


The effect of our change is rather dramatic. I have seen this before, so this is not just an accidental result. I hoped the presolver would be smart enough to deduce this, but alas, we need to worry about this ourselves. 


Algorithm


A totally different approach is as follows:

  1. Sort the vector \(\color{darkblue}v\).
  2. Calculate the ranges \(r_j := \color{darkblue}v_{j+k-1} - \color{darkblue}v_j \) for \(j=1,\dots,n-k+1\). 
  3. Pick the best \(r_j\).
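The steps above rely on the observation that, after sorting, the \(k\) closest values are always \(k\) consecutive values, so an \(O(n \log n)\) sort plus a linear scan suffices. A standalone Python sketch of this (outside GAMS; function name is mine):

```python
import random

def best_k_window(v, k):
    """Sort v, then scan all windows of k consecutive sorted values.

    The k closest values are consecutive in sorted order, so the best
    window gives the optimal range. Returns (range, selected values)."""
    s = sorted(v)
    n = len(s)
    # range of each window [j, j+k-1] in the sorted order
    ranges = [(s[j + k - 1] - s[j], j) for j in range(n - k + 1)]
    best_range, best_j = min(ranges)
    return best_range, s[best_j : best_j + k]

# instance of the same shape as in the post (different random stream)
random.seed(1)
v = [random.uniform(-100, 100) for _ in range(1000)]
r, selected = best_k_window(v, 100)
print(r, len(selected))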

Sorting algorithm



When I implement this, I see:

----    170 PARAMETER results

                    MIP    IMPROVED     SORTING

points         1000.000    1000.000    1000.000
vars           1003.000    1003.000
  discr        1000.000    1000.000
equs           2002.000    2002.000
status          Optimal     Optimal
obj              15.923      15.923      15.923
time             75.203       2.578       0.085
nodes         53387.000    6445.000
iterations   355004.000   88794.000



Conclusion


The MIP models for this problem are interesting: minor changes in the model can improve the solution time by a large amount. However, an algorithm based on sorting is much faster than solving a discrete optimization problem.

References



Appendix: GAMS code


$ontext

  Given a data vector v(i) of length n, find k elements (k<n)
  that are closest to each other.

$offtext


*-------------------------------------------------------
* problem size
*-------------------------------------------------------

set i 'length of data vector'/i1*i1000/;

scalar k 'number of selected items'/100/;

*-------------------------------------------------------
* random data
*-------------------------------------------------------

parameter v(i) 'random data';
v(i) = uniform(-100,100);
display v,k;

*-------------------------------------------------------
* derived data
*-------------------------------------------------------

scalars
   vmin
   vmax
   M 'big-M'
;
vmin = smin(i,v(i));
vmax = smax(i,v(i));
M = vmax-vmin;


*-------------------------------------------------------
* reporting macros
*-------------------------------------------------------

acronym TimeLimit;
acronym Optimal;
acronym Error;

parameter results(*,*);
$macro report(m,label)  \
    results('points',label) = card(i); \
    results('vars',label) = m.numVar; \
    results('  discr',label) = m.numDVar; \
    results('equs',label) = m.numEqu; \
    results('status',label) = Error; \
    results('status',label)$(m.solvestat=1) = Optimal; \
    results('status',label)$(m.solvestat=3) = TimeLimit; \
    results('obj',label) = m.objval; \
    results('time',label) = m.resusd; \
    results('nodes',label) = m.nodusd; \
    results('iterations',label) = m.iterusd; \
    results('gap%',label)$(m.solvestat=3) = 100*abs(m.objest - m.objval)/abs(m.objest); \
    display results;

*-------------------------------------------------------
* MIP Model
*
* Note: R will be substituted out. So the solver
* will see U-L as objective function.
*-------------------------------------------------------


variables
   L    'lower bound on selected values'
   U    'upper bound on selected values'
   R    'U-L'
;

U.lo = vmin; U.up = vmax;
L.lo = vmin; L.up = vmax;

binary variables delta(i) 'selection of data points';

equations
    obj          'objective'
    vminbound(i) 'delta(i)=1 ==> v(i)>=L'
    vmaxbound(i) 'delta(i)=1 ==> v(i)<=U'
    select       'k to be selected'
;

obj.. R =e= U-L;

vmaxbound(i).. v(i) =l= U + (1-delta(i))*M;
vminbound(i).. v(i) =g= L - (1-delta(i))*M;

select.. sum(i, delta(i)) =e= k;

model m1 /all/;
options threads=0;
solve m1 minimizing R using mip;

display delta.l,U.l,L.l,R.l;
report(m1,"MIP")

*-------------------------------------------------------
* Add bound R>=0
*
* R will no longer be substituted out.
*-------------------------------------------------------

* the next one is important for performance
R.lo = 0;

solve m1 minimizing R using mip;
report(m1,"IMPROVED")

*-------------------------------------------------------
* Algorithm
*-------------------------------------------------------

* in case we have zeros
v(i) = EPS+v(i);

set psol(i) 'Python solution';

scalar time;
time = timeElapsed;

embeddedCode Python:

#
# get GAMS data
#
k = int(list(gams.get("k"))[0])
v = list(gams.get("v"))

#
# sort v by value
#
v.sort(key = lambda x: x[1])

#
# find best range
#
n = len(v)
bestr = 1.0e10
bestj = 1e10
for j in range(0,n-k+1):
    r = v[j+k-1][1]-v[j][1]
    if r < bestr:
        bestr = r
        bestj = j
print(f"Best range:{bestr} at j={bestj}")

#
# report solution back as subset
#
sol = []
for j in range(bestj,bestj+k):
    sol.append(v[j][0])
gams.set("psol",sol)

endEmbeddedCode psol

time = timeElapsed-time;


display psol;
results('obj','SORTING') = smax(psol,v[psol])-smin(psol,v[psol]);
results('time','SORTING') = time;
results('points','SORTING') = card(i);
display results;

