Problem description
This problem is based on the post [1]. We have \(k\) salespersons and \(n\) destinations (both with their locations). The sales reps supply hotels in each destination, and each destination has a number of hotels. The problem is to cluster the destination nodes together such that:
- The number of destinations per cluster is not unreasonably unequal. This is done by applying lower and upper bounds on the number of points in a cluster.
- Similarly, we want to put limits on the total number of hotels in a cluster.
Finally, we want to assign sales representatives to clusters or sales regions.
Random data
Let's generate some random data. Here we have 50 locations and 5 salespeople.
*Figure: Random data.*
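The post does not show how the data were generated; here is a minimal Python sketch (the coordinate range and the hotel counts per destination are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(123)

n, k = 50, 5                       # 50 destinations, 5 salespersons
p = rng.uniform(0, 100, (n, 2))    # random (x, y) locations
hotels = rng.integers(1, 11, n)    # hotel count per destination, in 1..10
```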
Unconstrained Clustering
As we have 5 sales reps, I created here 5 clusters using the kmeans function in R.
*Figure: Results of unconstrained kmeans clustering.*
For this random example things look ok. But in some cases you may end up with very small or very large clusters (size measured as the number of points in the cluster). Note that with plain kmeans we simply dropped the constraints on the cluster sizes.
A different issue is that \(k\)-means clustering is a (stochastic) heuristic: running it again may lead to a different solution.
*Figure: Solution after running kmeans again.*
This alternative solution illustrates some imbalances: the blue cluster on the right has just 4 points, while the clusters on the left have 15 points. We can stabilize the kmeans clusters by using a multi-start approach.
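The post uses R's kmeans; an equivalent sketch with scikit-learn (the data here are made up), where `n_init` plays the role of R's `nstart` for the multi-start approach:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
p = rng.uniform(0, 100, (50, 2))   # made-up destination locations

# single start: the result depends on the random initialization
km1 = KMeans(n_clusters=5, n_init=1, random_state=0).fit(p)

# multi-start: keep the best of 100 runs (R: kmeans(p, 5, nstart=100))
km100 = KMeans(n_clusters=5, n_init=100, random_state=0).fit(p)

# inertia_ is the within-cluster sum of squares (WCSS);
# the best-of-100 run can only be as good or better
print(km1.inertia_, km100.inertia_)
```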
If we accept the unconstrained clustering solution, we can assign the salespeople to their nearest cluster centers, e.g. by using an assignment model.
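With equal numbers of reps and clusters this is just a small linear assignment problem; a sketch using scipy (the rep and center locations below are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
reps = rng.uniform(0, 100, (5, 2))     # salesperson locations (assumed)
centers = rng.uniform(0, 100, (5, 2))  # cluster centers from k-means

# cost[i, j] = Euclidean distance from rep i to center j
cost = np.linalg.norm(reps[:, None, :] - centers[None, :, :], axis=2)

# optimal one-to-one assignment minimizing total travel distance
row, col = linear_sum_assignment(cost)
```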
Clustering as mathematical programming problem
In theory, we can formulate the clustering problem as a mathematical programming problem. If this works, we have made progress: we can later add (linear) constraints at will.
An MINLP formulation can be stated as follows:
We introduce assignment variables: \[x_{i,k} = \begin{cases} 1 & \text{if point $i$ is assigned to cluster $k$}\\ 0 & \text{otherwise}\end{cases} \] In addition we have some continuous variables indicating the center of each cluster: \[\bar{x}_{k,j} \in [L,U] \>\> \text{the coordinates $j$ of the center of cluster $k$}\] In our example we have 2-dimensional coordinates, so \(j \in \{x,y\}\). With this we can write:
**MINLP Formulation of Clustering Problem**

\[\begin{align} \min\> & \sum_{i,j,k} \color{darkred} x_{i,k} \left( \bar{\color{darkred} x}_{k,j} - \color{darkblue} p_{i,j} \right)^2 \\ & \sum_k \color{darkred} x_{i,k} = 1 && \forall i\\ & \color{darkred}x_{i,k} \in \{0,1\} \\ & \bar{\color{darkred}x}_{k,j} \in [L,U] \end{align}\]
Basically, we minimize the within-cluster sum of squares (WCSS) [3]. This objective is not quadratic (the binary \(x_{i,k}\) multiplied by the quadratic term gives cubic terms), so quadratic solvers will not be able to solve this and we need to rely on general-purpose MINLP solvers. However, we can reformulate the above model into a convex MIQCP.
**MIQCP Formulation of Clustering Problem**

\[\begin{align} \min\> & \sum_{i,k} \color{darkred} d_{i,k} \\ & \color{darkred} d_{i,k} \ge \sum_{j} \left( \bar{\color{darkred} x}_{k,j} - \color{darkblue} p_{i,j} \right)^2 - \color{darkblue}M (1-\color{darkred}x_{i,k}) && \forall i,k\\ & \sum_k \color{darkred} x_{i,k} = 1 && \forall i\\ & \color{darkred}x_{i,k} \in \{0,1\} \\ & \bar{\color{darkred}x}_{k,j} \in [L,U] \\ & \color{darkred}d_{i,k} \ge 0 \end{align}\]
If your solver supports indicator constraints, we can write the big-M constraint simply as: \[x_{i,k}=1 \Rightarrow d_{i,k} \ge \sum_j \left( \bar{x}_{k,j} - p_{i,j} \right)^2\]
Note that in these models I do not calculate the center as the mean values \[ \bar{x}_{k,j} = \frac{\sum_i x_{i,k} p_{i,j}}{\sum_i x_{i,k}}\] but just leave it to the model to find the best location of the center \(\bar{x}_{k,j}\).
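This freedom is harmless: for a fixed assignment, the squared-distance objective is minimized exactly at the cluster mean, so the model's optimal centers coincide with the means anyway. A quick numeric check (Python, with made-up points):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, (20, 2))   # points of one cluster (assumed data)

mean = pts.mean(axis=0)

def wcss(center):
    """Sum of squared distances from all points to a given center."""
    return ((pts - center) ** 2).sum()

# perturbing the center away from the mean never improves the objective
for _ in range(100):
    c = mean + rng.normal(0, 5, 2)
    assert wcss(mean) <= wcss(c) + 1e-9
```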
When I try this, well, we don't have anything good to report.
| Model | Solver | Time (s) | Objective | Gap | Notes |
|---|---|---|---|---|---|
| \(k\)-means | R kmeans | 1 | 11906.61 | na | nstart=100 |
| MINLP | Baron | 3600 | 11980.2320 | na | lower bound = 0 |
| MINLP + ordering | Baron | 3600 | 11906.6104 | na | lower bound = 0 |
| MIQCP | Cplex | 3600 | 20538.8830 | 93% | 4 threads |
| MIQCP + ordering | Cplex | 3600 | 20034.9799 | 82% | 4 threads |
Notes
- kmeans finds the best solution in just 1 second. We use 100 repetitions as the method is sensitive to the initial configuration.
- For Baron and Cplex, I used a time limit of one hour.
- Baron has a lower bound (best possible objective) of zero, so no meaningful gap is available.
- We can add an ordering constraint to remove symmetry. The constraint says the centers are ordered by their \(x\) coordinate: \[ \bar{x}_{k,x} \le \bar{x}_{k+1,x}\] i.e. the first cluster is the leftmost one. This constraint seems to help.
- At least Baron is able to find the same solution as the \(k\)-means algorithm quite quickly. Basically all the time is spent trying to tighten the bounds (without much luck).
- This performance is rather depressing.
Alternatives
Not giving up yet. Instead of minimizing the within-cluster sum of squares, we can also minimize the sum of squared distances over all pairs of points that end up in the same cluster:
**MIQP Formulation of Clustering Problem**

\[\begin{align} \min\> & \sum_{i \lt i',k} \color{darkblue} {\mathit{dist}}^2_{i,i'} \color{darkred} x_{i,k} \color{darkred} x_{i',k} \\ & \sum_k \color{darkred} x_{i,k} = 1 && \forall i\\ & \color{darkred}x_{i,k} \in \{0,1\} \end{align}\]
This also allows us to form a linear model:
**MIP Formulation of Clustering Problem**

\[\begin{align} \min\> & \sum_{i \lt i',k} \color{darkblue} {\mathit{dist}}^2_{i,i'} \color{darkred} y_{i,i',k} \\ & \color{darkred} y_{i,i',k} \ge \color{darkred} x_{i,k} + \color{darkred} x_{i',k} -1 && \forall i\lt i',k\\ & \sum_k \color{darkred} x_{i,k} = 1 && \forall i\\ & \color{darkred}x_{i,k} \in \{0,1\} \\ & \color{darkred} y_{i,i',k} \in \{0,1\} \end{align}\]
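A useful sanity check on this pairwise objective: within a cluster of \(n_k\) points, the sum of squared pairwise distances equals \(n_k\) times that cluster's WCSS. So the pairwise models agree with the WCSS models only up to a cluster-size weighting; with unequal cluster sizes they can prefer different solutions. A quick numeric verification (Python, random points):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, (12, 2))   # one cluster of 12 points (made up)

# pairwise objective: sum of squared distances over all pairs i < i'
diff = pts[:, None, :] - pts[None, :, :]
pairwise = (diff ** 2).sum() / 2     # ordered pairs counted twice

# WCSS of the same cluster (distances to the cluster mean)
wcss = ((pts - pts.mean(axis=0)) ** 2).sum()

# identity: sum of squared pairwise distances = n_k * WCSS
assert np.isclose(pairwise, len(pts) * wcss)
```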
References
1. Implement solver (linear programming) in R, https://stackoverflow.com/questions/56493996/implement-solver-linear-programming-in-r
2. Madhushini Narayana Prasad and Grani A. Hanasusanto, Improved Conic Reformulations for K-means Clustering, SIAM J. Optim., 28(4), 3105-3126, 2018.
3. \(K\)-means clustering, https://en.wikipedia.org/wiki/K-means_clustering