Problem description
Consider the vector \[v = (1, 3, 6, 4, 7, 9, 6, 2, 3) \] We want to find the index \(k\) such that \[z = \left| \sum_{i \lt k} v_i - \sum_{i \ge k} v_i \right|\] is minimized. I.e. the sum left of the break point is as close as possible to the sum right of it.
This is easily programmed in code. However, the question is, can we do this in the form of an optimization problem? [1]
Model formulation
The trick is to have a binary variable \(\delta_i \in \{0,1\}\) associated with each observation \(v_i\), \(i=1,\dots, n\). The interpretation is: \[ \delta_i = \begin{cases} 0 & \text{if $v_i$ belongs to the first subset}\\ 1 & \text{if $v_i$ belongs to the second subset}\end{cases}\] To make sure all \(\delta_i = 0\) are to the left and all all \(\delta_i = 1\) are to the right of the break point, we require \[ \delta_{i+1} \ge \delta_{i}\] This implies we start with \(\delta_i = 0\) and when we see \(\delta_i = 1\) it will never go back to zero afterwards. The high-level model can look like:
| Integer-programming model |
|---|
| \[\begin{align}\min\>& \color{darkred} z= |\color{darkred}y_{\mathit{left}} - \color{darkred}y_{\mathit{right}}| \\ &\color{darkred}y_{\mathit{left}} = \sum_i (1-\color{darkred}\delta_i) \color{darkblue}v_i \\ &\color{darkred}y_{\mathit{right}} = \sum_i \color{darkred}\delta_i \color{darkblue}v_i \\ & \color{darkred}k = 1 + \sum_i (1-\color{darkred}\delta_i) \\ & \color{darkred}\delta_{i+1} \ge \color{darkred}\delta_{i} \\ & \color{darkred}\delta_{i} \in \{0,1\} \end{align}\] |
Notes:
- The expression \( \sum_i (1-\delta_i) v_i\), sums up \(v_i\) where \(\delta_i=0\), while \( \sum_i \delta_i v_i\), adds up \(v_i\) where \(\delta_i=1\).
- Make sure the ordering constraint does not go over the boundary: the constraint \(\delta_{i+1} \ge \delta_{i}\) should only hold for \(i=1,\dots,n-1\).
- Obviously, we can also write: \(\delta_{i-1} \le \delta_{i}\).
- The absolute value can be linearized by formulating it as \[-z \le y_{\mathit{left}} - y_{\mathit{right}} \le z\]
- It is noted, that there are no big-M constraints in the model. As we shall see later, this model solved very quickly.
The results look like:
---- 50 PARAMETER results
i1 i2 i3 i4 i5 i6 i7 i8 i9
v 136479623
delta 1111
---- 50 VARIABLE y.L sum of partitions
left 21.000, right 20.000
---- 50 VARIABLE k.L = 6 breakpoint
VARIABLE z.L = 1 objective
The zeros are not printed.
Of course, the model works for larger sets, and in cases where \(v_i\) are floating-point numbers. They may also contain negative numbers. An example is:
---- 51 PARAMETER v data
i1 -65.651, i2 68.653, i3 10.075, i4 -39.772, i5 -41.558, i6 -55.189, i7 -30.034
i8 71.254, i9 -86.577, i10 0.042, i11 99.624, i12 15.747, i13 98.227, i14 52.450
i15 -73.862, i16 27.944, i17 -68.096, i18 -49.984, i19 33.786, i20 -12.929, i21 -28.060
i22 -29.712, i23 -73.702, i24 -69.980, i25 17.823, i26 66.179, i27 -53.837, i28 33.147
i29 55.172, i30 -39.268, i31 -77.902, i32 0.477, i33 -67.965, i34 74.492, i35 -46.977
i36 -42.837, i37 18.791, i38 44.544, i39 25.650, i40 -7.240, i41 -17.339, i42 -76.461
i43 -37.158, i44 -90.690, i45 -32.290, i46 -63.580, i47 29.145, i48 12.149, i49 53.992
i50 -40.439, i51 32.221, i52 51.164, i53 25.489, i54 -43.227, i55 -82.715, i56 -79.497
i57 28.250, i58 9.062, i59 -93.695, i60 58.472, i61 -85.447, i62 -64.868, i63 5.127
i64 50.042, i65 -64.375, i66 -93.172, i67 17.026, i68 24.246, i69 -22.128, i70 -28.257
i71 -51.393, i72 -50.716, i73 -73.899, i74 86.690, i75 -24.012, i76 56.680, i77 -39.993
i78 -74.903, i79 49.775, i80 -86.154, i81 -59.597, i82 -98.987, i83 -46.077, i84 -0.030
i85 -69.743, i86 -65.166, i87 -33.872, i88 -36.619, i89 -35.583, i90 92.795, i91 98.720
i92 -26.019, i93 -25.422, i94 54.396, i95 -20.663, i96 82.619, i97 -76.084, i98 47.096
i99 -88.916, i100 15.260
---- 51 VARIABLE delta.L ordered binary variables
i62 1, i63 1, i64 1, i65 1, i66 1, i67 1, i68 1, i69 1, i70 1, i71 1, i72 1
i73 1, i74 1, i75 1, i76 1, i77 1, i78 1, i79 1, i80 1, i81 1, i82 1, i83 1
i84 1, i85 1, i86 1, i87 1, i88 1, i89 1, i90 1, i91 1, i92 1, i93 1, i94 1
i95 1, i96 1, i97 1, i98 1, i99 1, i100 1
---- 51 VARIABLE y.L sum of partitions
left -689.647, right -676.178
---- 51 VARIABLE k.L = 62 breakpoint
VARIABLE z.L = 13.469 objective
This model with \(n=100\) solves instantaneously. For \(n=1,000\) this becomes a second or so.
The ordering constraint is a useful construct to have in your toolbox. This is used in other contexts as well. It can be used to order solutions for better display and interpretation. It can also be used to reduce symmetry in certain models.
References
- Optimize with indexing in linear programming, https://stackoverflow.com/questions/62432731/optimize-with-indexing-in-linear-programming