In [1], I discussed some models for a small two-line regression problem. Here I want to dive into a more challenging problem.
Data generation process
I started with the following five lines: \[\begin{align}&f_1(x)=200-3x\\&f_2(x)=150-x\\&f_3(x)=10\\&f_4(x)=-50+2x\\&f_5(x)=-120\end{align}\] To generate points, I used \[\begin{align}&x_i \sim U(0,100)\\ & k_i \sim U\{1,5\} \\ & \epsilon_i \sim N(0,15)\\ & y_i = a_{k_i} + b_{k_i} \cdot x_i + \epsilon_i \end{align}\]
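This generation process can be sketched in a few lines of Python with numpy (the original code is not shown in this post, so the names and the reading of \(N(0,15)\) as mean 0, standard deviation 15 are my assumptions):

```python
import numpy as np

# Sketch of the data generation process above. The 15 in N(0,15) is
# interpreted as the standard deviation (an assumption).
rng = np.random.default_rng(123)

a = np.array([200.0, 150.0, 10.0, -50.0, -120.0])  # intercepts of f1..f5
b = np.array([-3.0, -1.0, 0.0, 2.0, 0.0])          # slopes of f1..f5
n, K = 100, 5

x = rng.uniform(0.0, 100.0, n)   # x_i ~ U(0,100)
k = rng.integers(0, K, n)        # k_i ~ U{1,5} (0-based indices here)
eps = rng.normal(0.0, 15.0, n)   # eps_i ~ N(0,15)
y = a[k] + b[k] * x + eps        # y_i = a_{k_i} + b_{k_i}*x_i + eps_i
```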
This corresponds to the following picture:
Now, we drop the lines and just keep the points. Together with \(K=5\) (the number of lines), this is all the data we give to the model:
---- 32 PARAMETER xdata
i1 5.141, i2 0.601, i3 40.123, i4 51.988, i5 62.888, i6 22.575, i7 39.612
i8 27.601, i9 15.237, i10 93.632, i11 42.266, i12 13.466, i13 38.606, i14 37.463
i15 26.848, i16 94.837, i17 18.894, i18 29.751, i19 7.455, i20 40.135, i21 10.169
i22 38.389, i23 32.409, i24 19.213, i25 11.237, i26 59.656, i27 51.145, i28 4.507
i29 78.310, i30 94.575, i31 59.646, i32 60.734, i33 36.251, i34 59.407, i35 67.985
i36 50.659, i37 15.925, i38 65.689, i39 52.388, i40 12.440, i41 98.672, i42 22.812
i43 67.565, i44 77.678, i45 93.245, i46 20.124, i47 29.714, i48 19.723, i49 24.635
i50 64.648, i51 73.497, i52 8.544, i53 15.035, i54 43.419, i55 18.694, i56 69.269
i57 76.297, i58 15.481, i59 38.938, i60 69.543, i61 84.581, i62 61.272, i63 97.597
i64 2.689, i65 18.745, i66 8.712, i67 54.040, i68 12.686, i69 73.400, i70 11.323
i71 48.835, i72 79.560, i73 49.205, i74 53.356, i75 1.062, i76 54.387, i77 45.113
i78 97.533, i79 18.385, i80 16.353, i81 2.463, i82 17.782, i83 6.132, i84 1.664
i85 83.565, i86 60.166, i87 2.702, i88 19.609, i89 95.071, i90 33.554, i91 59.426
i92 25.919, i93 64.063, i94 15.525, i95 46.002, i96 39.334, i97 80.546, i98 54.099
i99 39.072, i100 55.782
---- 32 PARAMETER ydata
i1 192.922, i2 -122.916, i3 21.992, i4 70.273, i5 91.422, i6 109.528
i7 129.230, i8 -140.463, i9 180.013, i10 24.614, i11 -108.546, i12 10.211
i13 -148.679, i14 24.443, i15 122.773, i16 164.505, i17 164.359, i18 132.833
i19 -19.868, i20 15.210, i21 155.782, i22 127.528, i23 122.108, i24 149.691
i25 -0.143, i26 -120.337, i27 94.868, i28 -39.217, i29 96.683, i30 46.506
i31 35.514, i32 22.589, i33 97.811, i34 -113.523, i35 88.707, i36 94.006
i37 9.232, i38 73.173, i39 26.600, i40 0.429, i41 3.943, i42 119.583
i43 101.666, i44 -38.179, i45 39.630, i46 137.980, i47 4.808, i48 18.422
i49 -10.888, i50 83.361, i51 65.726, i52 -39.174, i53 0.179, i54 126.818
i55 139.561, i56 -18.520, i57 105.170, i58 8.206, i59 95.491, i60 87.668
i61 -64.973, i62 12.739, i63 -13.237, i64 -70.810, i65 136.727, i66 178.416
i67 7.676, i68 -51.541, i69 78.491, i70 143.387, i71 115.522, i72 66.261
i73 52.631, i74 -100.755, i75 151.703, i76 47.462, i77 127.759, i78 -65.191
i79 -6.051, i80 137.610, i81 130.662, i82 128.182, i83 148.801, i84 -23.712
i85 -64.504, i86 27.111, i87 150.200, i88 171.403, i89 72.107, i90 -105.643
i91 -112.767, i92 112.218, i93 102.355, i94 5.007, i95 86.645, i96 -127.749
i97 -30.270, i98 79.430, i99 78.474, i100 -7.158
The questions we want to address are:
- Can we use an optimization model to find the best fit for five different lines?
- Do the regression lines we find this way correspond to the original functions used in the data generation process?
The model has to make two decisions:
- Assign points to lines.
- Fitting: find the coefficients (constant term and slope) of each line.
I expected this to be not too difficult. That turned out to be wrong.
We have \(5\times 100=500\) binary variables to model the assignment. That may seem small for a good MIP solver, but remember that \(2^{500}\) is a big number [3]:
Note: a better measure of size would be \(5^{100}\) which is a bit less: 7888609052210118054117285652827862296732064351090230047702789306640625 (70 decimal digits).
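That digit count is easy to verify with Python's arbitrary-precision integers:

```python
# number of ways to assign each of 100 points to one of 5 lines
n_assignments = 5 ** 100
print(len(str(n_assignments)))  # 70
```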
Solution process
To make things a bit easier than with a least-squares objective, I used a linear MIP model based on LAD regression [2]:
| Linear LAD Model |
|---|
| \[\begin{align} \min\>&\sum_i \color{darkred}z_i\\ & -\color{darkred}z_i \le \color{darkred}r_i \le \color{darkred}z_i &&\forall i && \text{absolute value}\\ & \color{darkred}r_i = \color{darkblue}y_i - \color{darkred}\alpha_{0,k} - \color{darkred}\alpha_{1,k}\cdot\color{darkblue}x_i +\color{darkred}s_{i,k} &&\forall i,k && \text{fit (possibly relaxed)}\\ & -\color{darkblue}M \cdot(1-\color{darkred}\delta_{i,k}) \le \color{darkred}s_{i,k} \le \color{darkblue}M \cdot(1-\color{darkred}\delta_{i,k}) && \forall i,k && \text{big-M constraint}\\ & \sum_k\color{darkred}\delta_{i,k} = 1 && \forall i && \text{pick one regression line}\\ & \color{darkred}\alpha_{0,k} \le \color{darkred}\alpha_{0,k+1} && \forall k \>\text{except last}&&\text{order by constant term} \\ & \color{darkred}\delta_{i,k} \in \{0,1\} \end{align} \] |
The big-M constraints implement the implication \[\color{darkred}\delta_{i,k}=1\Rightarrow \color{darkred}s_{i,k}=0 \]i.e., if \(\color{darkred}\delta_{i,k}=1\) then point \(i\) is used in fitting line \(k\). This model could not be solved to proven optimality: the best bound started out very weak (at zero) and moved very slowly. In the end, I used a heuristic to find a good solution and then let the MIP model improve upon it via MIPSTART.
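To make the formulation concrete, here is a minimal sketch of this MIP in Python using scipy.optimize.milp (not how the original model was implemented; the variable layout, the tiny noise-free two-line test instance, and the value of \(M\) are my choices). With noise-free data the optimal objective is 0:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Tiny two-line instance with noise-free data, so the MIP solves quickly
# and the optimal objective is exactly 0.
rng = np.random.default_rng(1)
K, n, M = 2, 8, 500.0
a_true, b_true = np.array([0.0, 100.0]), np.array([2.0, -1.0])
x = rng.uniform(0.0, 100.0, n)
kt = rng.integers(0, K, n)
y = a_true[kt] + b_true[kt] * x

# variable layout: z(n) | r(n) | a0(K) | a1(K) | s(n*K) | delta(n*K)
nv = 2 * n + 2 * K + 2 * n * K
iz, ir, ia0, ia1 = 0, n, 2 * n, 2 * n + K
isv, idl = 2 * n + 2 * K, 2 * n + 2 * K + n * K

rows = []
def row(coefs, lo, hi):
    arow = np.zeros(nv)
    for j, v in coefs.items():
        arow[j] = v
    rows.append((arow, lo, hi))

for i in range(n):
    row({ir + i: 1, iz + i: -1}, -np.inf, 0)            #  r_i <= z_i
    row({ir + i: -1, iz + i: -1}, -np.inf, 0)           # -r_i <= z_i
    row({idl + i * K + k: 1 for k in range(K)}, 1, 1)   # pick one line
    for k in range(K):
        # fit: r_i = y_i - a0_k - a1_k*x_i + s_ik
        row({ir + i: 1, ia0 + k: 1, ia1 + k: x[i], isv + i * K + k: -1},
            y[i], y[i])
        # big-M: -M(1-delta_ik) <= s_ik <= M(1-delta_ik)
        row({isv + i * K + k: 1, idl + i * K + k: M}, -np.inf, M)
        row({isv + i * K + k: -1, idl + i * K + k: M}, -np.inf, M)
for k in range(K - 1):
    row({ia0 + k: 1, ia0 + k + 1: -1}, -np.inf, 0)      # order by intercept

A = np.vstack([r[0] for r in rows])
con = LinearConstraint(A, [r[1] for r in rows], [r[2] for r in rows])
c = np.zeros(nv)
c[iz:iz + n] = 1.0                                      # minimize sum z_i
lb, ub = np.full(nv, -np.inf), np.full(nv, np.inf)
lb[iz:iz + n] = 0.0                                     # z_i >= 0
lb[idl:], ub[idl:] = 0.0, 1.0                           # delta binary
integrality = np.zeros(nv)
integrality[idl:] = 1.0

res = milp(c, constraints=con, bounds=Bounds(lb, ub), integrality=integrality)
print(res.status, round(res.fun, 4))
```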
The heuristic looked like:
1. If the time or iteration limit is reached: stop.
2. Generate a random assignment. Evaluate the objective using an L1-norm regression procedure (one for each of the five lines).
3. For each point \(i\), check whether assigning it to a different line improves the objective.
4. If step 3 found an improvement, repeat it; otherwise go to step 1.
This only looks at one point at a time, so it will not deliver optimal solutions. Hence we restart with a fresh random assignment once we can no longer improve the current solution. In step 3 I initially use a cheap test: keep the lines fixed and only compare residuals, without rerunning the regressions. A more expensive test is to compare the residuals after rerunning the regressions. This expensive test is only used once the cheap test no longer produces reassignments.
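A Python reconstruction of this heuristic (my sketch, not the original code): the LAD fit is solved as a small LP, and lines with fewer than two points contribute zero residuals by convention.

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(x, y):
    """LAD fit as an LP: min sum z_i s.t. -z_i <= y_i - a0 - a1*x_i <= z_i."""
    n = len(x)
    c = np.r_[0.0, 0.0, np.ones(n)]          # variables: a0, a1, z_1..z_n
    A = np.zeros((2 * n, n + 2))
    A[:n, 0], A[:n, 1], A[:n, 2:] = -1.0, -x, -np.eye(n)  # y-a0-a1*x <= z
    A[n:, 0], A[n:, 1], A[n:, 2:] = 1.0, x, -np.eye(n)    # a0+a1*x-y <= z
    b = np.r_[-y, y]
    bounds = [(None, None)] * 2 + [(0, None)] * n
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.x[0], res.x[1], res.fun

def fit_all(assign, x, y, K):
    """Fit each line; lines with < 2 points contribute zero (convention)."""
    coef, total = np.zeros((K, 2)), 0.0
    for k in range(K):
        m = assign == k
        if m.sum() >= 2:
            a0, a1, obj = l1_fit(x[m], y[m])
            coef[k], total = (a0, a1), total + obj
    return total, coef

def heuristic(x, y, K, restarts=2, seed=0):
    rng = np.random.default_rng(seed)
    best_obj, best_assign = np.inf, None
    for _ in range(restarts):
        assign = rng.integers(0, K, len(x))      # step 2: random start
        for _ in range(30):                      # step 3, cheap version:
            _, coef = fit_all(assign, x, y, K)   # keep the lines fixed and
            resid = np.abs(y[:, None] - coef[:, 0] - coef[:, 1] * x[:, None])
            new = resid.argmin(axis=1)           # move each point to its
            if np.array_equal(new, assign):      # closest line
                break
            assign = new
        obj, _ = fit_all(assign, x, y, K)
        if obj < best_obj:
            best_obj, best_assign = obj, assign  # keep the best restart
    return best_obj, best_assign

# small demo on data from two noise-free lines
rng = np.random.default_rng(7)
x = rng.uniform(0, 100, 30)
kt = rng.integers(0, 2, 30)
y = np.where(kt == 0, 2.0 * x, 100.0 - x)
best_obj, best_assign = heuristic(x, y, K=2, restarts=3, seed=1)
print(round(best_obj, 2))
```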
The results were:
- Heuristic: obj=957.1482
- MIP improvement: obj=946.787
The picture below is from this solution.
The final LAD estimates are:
---- 473 PARAMETER estestimates
              a0        a1
line1   -123.012     0.160
line2    -66.229     2.246
line3     -0.078     0.041
line4    155.979    -1.123
line5    211.432    -3.213
We can compare these against the coefficients used in the data generation process. Because of the ordering constraint, the lines are sorted by constant term: line1 corresponds to \(f_5\) (\(-120\)), line2 to \(f_4\) (\(-50+2x\)), line3 to \(f_3\) (\(10\)), line4 to \(f_2\) (\(150-x\)) and line5 to \(f_1\) (\(200-3x\)).
Sidenote: checking the heuristic
Remember, our heuristic tries to move single points to another line. When it finishes, it delivers a solution where no such move is profitable. We can check the heuristic by solving a slightly different version of the MIP model: restrict the number of changes in the assignment variable \(\color{darkred}\delta_{i,k}\) relative to the heuristic solution. Note that moving one point flips two entries of \(\color{darkred}\delta_{i,k}\) (one from 1 to 0 and one from 0 to 1), so allowing a single move means allowing two changed entries. If this model stops without finding an improvement, we gain some confidence in the solution our heuristic produces. The modeling involved is as follows: \[\begin{align}&\color{darkred}d_{i,k} \ge \color{darkred}\delta_{i,k}-\color{darkblue}\delta^0_{i,k} \\ & \color{darkred}d_{i,k} \ge \color{darkblue}\delta^0_{i,k}-\color{darkred}\delta_{i,k} \\ & \sum_{i,k}\color{darkred}d_{i,k}\le 2 \\ & \color{darkred}d_{i,k}\in [0,1]\end{align}\] Here \(\color{darkblue}\delta^0_{i,k}\) is the solution found by the heuristic, and \(\color{darkred}d_{i,k}=1\) if we change an assignment relative to that solution.
GA heuristic
I also tried the GA package in R. This heuristic only allows real-valued or binary variables, so I used a real-valued vector of length 100 (one entry per point) with bounds \([1,5.9999]\). The fitness function used this algorithm:
- Use the floor function to convert the reals to integers between 1 and 5.
- Perform an L1 regression on each line.
- The objective is minus the sum of all absolute residuals.
fitness <- function(rx) {
  # rx are n=100 real numbers in [1, K+0.9999]
  ix <- floor(rx)   # ix are n=100 integers in [1, K]
  obj <- 0
  for (k in 1:K) {
    cnt <- length(ix[ix == k])
    if (cnt > 2) {
      r <- l1fit(x[ix == k], y[ix == k])   # L1 (LAD) regression
      obj <- obj + sum(abs(r$residuals))
    }
  }
  return(-obj)
}
The simplicity of this approach is very attractive. It is also easy to change the fitness function to work with a least-squares fit.
Note that sometimes the GA generated an assignment with fewer than two points on a line. In that case we don't run a regression (it would fail), but take the residuals for that line to be zero. With exactly two points assigned to a line, we can also skip the regression: a line through two points fits them exactly, so the residuals are again zero.
Conclusions
- This problem was much more challenging than I expected.
- In [1] I used a variant of Chebyshev regression (minimize the maximum absolute deviation). That is much easier to solve, but leaves quite some wiggle room for the lines (no unique solution).
- The MIP model based on LAD regression has real trouble, but in the end it was a good tool for improving the heuristic solution. Sometimes you have to be happy with just good solutions. Looking at the picture, the final solution matches the original lines quite well.
- I'll keep thinking about this a bit to see if I can come up with a better way to do this.
- Is there a way to find proven optimal solutions for this problem? Maybe even for the least-squares problem? I suspect the combinatorics are just difficult here.
References
1. A strange regression problem: multi-line fitting, http://yetanothermathprogrammingconsultant.blogspot.com/2021/02/a-strange-regression-problem.html
2. Linear Programming and LAD regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
3. \(2^{500}\), https://www.wolframalpha.com/input/?i=2%5E500