
Maximum Correlation, Global MINLP vs GA

In [1], the following question is posed:


Suppose we have data like:

worker  weight  height  weight2  height2
1       120     60      125      60
2       152     55      156      66
3       222     55      100      20

For each worker we can pick either (weight, height) or (weight2, height2). The problem is to choose one of these observation pairs for each row such that the correlation between weight and height is maximized.


Data 


First, let's invent some data that we can use for some experiments. 
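The post does not show how the data was generated. A hypothetical GAMS fragment that produces data of this flavor could look as follows; the distributions and coefficients are my own guesses, not the author's actual scheme:

* hypothetical data generation (my guess at the scheme, not the original code)
set i 'cases' /i1*i25/;
parameters height1(i), weight1(i), height2(i), weight2(i);

* heights in inches, weights in pounds; weight depends linearly on height plus noise
height1(i) = normal(68, 4);
weight1(i) = 60 + 1.4*height1(i) + normal(0, 4);
* the second observation is a perturbed version of the first
height2(i) = height1(i) + normal(0, 2);
weight2(i) = 60 + 1.4*height2(i) + normal(0, 4);

display height1, weight1, height2, weight2;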

----     27 PARAMETER data  

        height1     weight1     height2     weight2

i1    67.433285  168.262871   67.445523  163.692389
i2    70.638374  174.437750   68.649190  160.084811
i3    71.317794  159.909672   69.503911  164.720010
i4    59.850261  145.704159   61.175728  142.708300
i5    65.341938  155.586984   68.483909  165.564991
i6    64.142009  154.335001   68.568683  166.169507
i7    67.030368  158.768813   65.780803  153.721717
i8    73.672863  175.126951   73.236515  164.704340
i9    65.203516  157.593587   63.279277  149.784500
i10   69.001848  160.063428   68.786656  162.278007
i11   64.455422  159.039195   63.930208  152.827710
i12   70.719334  164.885704   69.666096  157.356595
i13   65.688428  151.223468   63.614565  150.071072
i14   66.569252  160.978671   70.533320  160.722483
i15   78.417676  172.298652   80.070076  172.695207
i16   65.396154  158.234709   67.404942  158.310596
i17   62.504967  150.899428   61.000439  154.094647
i18   62.122630  150.024298   63.634554  153.644324
i19   70.598400  165.086523   72.999194  166.771223
i20   74.935107  170.820610   76.622182  169.013550
i21   63.233956  154.331546   60.372876  149.152520
i22   72.550105  173.961915   76.748649  167.462369
i23   74.086553  168.190867   75.433331  171.773607
i24   65.379648  163.577697   65.717553  160.134888
i25   64.003038  155.357607   67.301426  158.713710


The optimal solution, i.e., the selection with the highest correlation coefficient between height and weight, is:


----     92 PARAMETER result  optimal selected observations

        height1     weight1     height2     weight2

i1                             67.445523  163.692389
i2                             68.649190  160.084811
i3                             69.503911  164.720010
i4    59.850261  145.704159
i5    65.341938  155.586984
i6    64.142009  154.335001
i7    67.030368  158.768813
i8                             73.236515  164.704340
i9                             63.279277  149.784500
i10   69.001848  160.063428
i11                            63.930208  152.827710
i12   70.719334  164.885704
i13                            63.614565  150.071072
i14                            70.533320  160.722483
i15   78.417676  172.298652
i16                            67.404942  158.310596
i17   62.504967  150.899428
i18   62.122630  150.024298
i19                            72.999194  166.771223
i20   74.935107  170.820610
i21                            60.372876  149.152520
i22                            76.748649  167.462369
i23                            75.433331  171.773607
i24                            65.717553  160.134888
i25                            67.301426  158.713710

Below we shall see how we arrive at this conclusion.

MINLP Model


A high-level model is simply:


MINLP Model
\[ \begin{align}\max\> &\mathbf{cor}(\color{darkred}h,\color{darkred}w) \\ & \color{darkred}h_i = \color{darkblue}{\mathit{height1}}_i\cdot(1-\color{darkred}x_i)+ \color{darkblue}{\mathit{height2}}_i\cdot\color{darkred}x_i\\ & \color{darkred}w_i = \color{darkblue}{\mathit{weight1}}_i\cdot(1-\color{darkred}x_i)+ \color{darkblue}{\mathit{weight2}}_i\cdot\color{darkred}x_i \\ & \color{darkred}x_i \in \{0,1\} \end{align}\]


Here \(\mathbf{cor}(h,w)\) denotes the (Pearson) correlation between the vectors \(h\) and \(w\). Note that height and weight are positively correlated, so maximizing makes sense. Of course, GAMS does not know about correlations, so we implement this model as:

 
Implementation of MINLP Model
\[ \begin{align}\max\> & \color{darkred}z = \frac{\displaystyle\sum_i (\color{darkred}h_i-\bar{\color{darkred}h})(\color{darkred}w_i-\bar{\color{darkred}w})}{\sqrt{\displaystyle\sum_i(\color{darkred}h_i-\bar{\color{darkred}h})^2}\cdot \sqrt{\displaystyle\sum_i(\color{darkred}w_i-\bar{\color{darkred}w})^2}} \\ & \color{darkred}h_i = \color{darkblue}{\mathit{height1}}_i\cdot(1-\color{darkred}x_i)+ \color{darkblue}{\mathit{height2}}_i\cdot\color{darkred}x_i \\ & \color{darkred}w_i = \color{darkblue}{\mathit{weight1}}_i\cdot(1-\color{darkred}x_i)+ \color{darkblue}{\mathit{weight2}}_i\cdot\color{darkred}x_i \\ & \bar{\color{darkred}h} = \frac{1}{n}\displaystyle\sum_i \color{darkred}h_i \\ & \bar{\color{darkred}w} = \frac{1}{n}\displaystyle\sum_i \color{darkred}w_i \\ & \color{darkred}x_i \in \{0,1\} \end{align}\]
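The post displays the math but not the GAMS source. A minimal sketch of this implementation could look as follows (identifier names and option settings are my own choices; set i and the data parameters are as declared in the data fragment above):

scalar n 'number of cases';
n = card(i);

binary variable x(i) 'select observation 1 or 2';
variables
   h(i)  'selected height'
   w(i)  'selected weight'
   hbar  'mean height'
   wbar  'mean weight'
   z     'correlation (objective)';

equations defh(i), defw(i), defhbar, defwbar, defz;

defh(i)..  h(i) =e= height1(i)*(1-x(i)) + height2(i)*x(i);
defw(i)..  w(i) =e= weight1(i)*(1-x(i)) + weight2(i)*x(i);
defhbar..  hbar =e= sum(i, h(i))/n;
defwbar..  wbar =e= sum(i, w(i))/n;
* note: this division is exactly what the reformulation below removes
defz..     z =e= sum(i, (h(i)-hbar)*(w(i)-wbar)) /
                 ( sqrt(sum(i, sqr(h(i)-hbar))) * sqrt(sum(i, sqr(w(i)-wbar))) );

model maxcor /all/;
option minlp = baron, optcr = 0;
solve maxcor maximizing z using minlp;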



We need to be careful when using divisions. I typically like to reformulate divisions as multiplications: \[\begin{align} \max \>&  \color{darkred}z\\ &\color{darkred}z\cdot \sqrt{\displaystyle\sum_i(\color{darkred}h_i-\bar{\color{darkred}h})^2}\cdot \sqrt{\displaystyle\sum_i(\color{darkred}w_i-\bar{\color{darkred}w})^2} = \displaystyle\sum_i (\color{darkred}h_i-\bar{\color{darkred}h})(\color{darkred}w_i-\bar{\color{darkred}w})\end{align}\]

The advantage of this formulation is that it protects against division by zero. A disadvantage is that the non-linearities are moved into the constraints. Many NLP solvers are happier when the constraints are linear and only the objective is non-linear. My advice: experiment with formulations.
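In GAMS terms, the multiplied-out version of the objective-defining equation could be written as (a sketch, with the same identifiers as in the fragment above):

* division-free variant of defz
defz..  z * sqrt(sum(i, sqr(h(i)-hbar))) * sqrt(sum(i, sqr(w(i)-wbar)))
           =e= sum(i, (h(i)-hbar)*(w(i)-wbar));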


When we solve this model with GAMS/Baron, we see:


----     79 VARIABLE x.L  select 1 or 2

i1 1.000000, i2 1.000000, i3 1.000000, i8 1.000000, i9 1.000000, i11 1.000000, i13 1.000000
i14 1.000000, i16 1.000000, i19 1.000000, i21 1.000000, i22 1.000000, i23 1.000000, i24 1.000000
i25 1.000000


----     79 VARIABLE z.L                   =     0.956452  objective

----     83 PARAMETER corr

all1 0.868691, all2 0.894532, optimal 0.956452


The parameter corr shows the correlations for three cases:
  • all \(x_i=0\), i.e., we compare height1 vs weight1,
  • all \(x_i=1\), i.e., we compare height2 vs weight2,
  • an optimal solution for \(x\).
At least for this data set, cherry-picking the data can improve the correlation coefficient significantly.
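For reference, the corr parameter can be assembled as follows (a sketch; the parameter and scalar names are mine, and n, z are from the fragments above):

parameter corr(*) 'correlations for different selections';
scalars hb, wb;

* all x(i)=0: height1 vs weight1
hb = sum(i, height1(i))/n;
wb = sum(i, weight1(i))/n;
corr('all1') = sum(i, (height1(i)-hb)*(weight1(i)-wb)) /
               ( sqrt(sum(i, sqr(height1(i)-hb))) * sqrt(sum(i, sqr(weight1(i)-wb))) );

* all x(i)=1: height2 vs weight2
hb = sum(i, height2(i))/n;
wb = sum(i, weight2(i))/n;
corr('all2') = sum(i, (height2(i)-hb)*(weight2(i)-wb)) /
               ( sqrt(sum(i, sqr(height2(i)-hb))) * sqrt(sum(i, sqr(weight2(i)-wb))) );

* optimal selection: objective value from the MINLP solve
corr('optimal') = z.l;
display corr;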

Nonconvex MIQCP

 
Gurobi can solve non-convex quadratic models quite efficiently. To try this out, I reformulated the MINLP model into a MIQCP (Mixed-Integer Quadratically Constrained Programming) model. We need to add some variables and constraints to make this happen; see the sketch below.
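The post does not show the reformulated model. One way to make all non-linearities quadratic is to introduce auxiliary variables for the two norms and their product; this is my reconstruction (the log below reports 7 quadratic constraints, so the actual model differed in some details):

* auxiliary variables: sh, sw are the two norms, u their product
variables sh, sw, u;
sh.lo = 0.01;  sw.lo = 0.01;

equations defsh, defsw, defu, defzq;

defsh..  sqr(sh) =e= sum(i, sqr(h(i)-hbar));
defsw..  sqr(sw) =e= sum(i, sqr(w(i)-wbar));
defu..   u   =e= sh*sw;
* bilinear version of the correlation definition
defzq..  z*u =e= sum(i, (h(i)-hbar)*(w(i)-wbar));

All constraints are now linear or quadratic/bilinear, so Gurobi's non-convex MIQCP algorithm applies. The final model is still quite small, but I encountered severe problems: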

Gurobi Optimizer version 9.0.1 build v9.0.1rc0 (win64)
Optimize a model with 52 rows, 84 columns and 152 nonzeros
Model fingerprint: 0xab3fb310
Model has 7 quadratic constraints
Variable types: 59 continuous, 25 integer (25 binary)
Coefficient statistics:
Matrix range [1e-02, 1e+01]
QMatrix range [1e+00, 2e+01]
QLMatrix range [1e+00, 1e+00]
Objective range [1e+00, 1e+00]
Bounds range [1e+00, 2e+02]
RHS range [6e+01, 2e+02]
Presolve removed 50 rows and 50 columns
Presolve time: 0.00s
Presolved: 179 rows, 88 columns, 637 nonzeros
Presolved model has 7 bilinear constraint(s)
Variable types: 63 continuous, 25 integer (25 binary)

Root relaxation: unbounded, 0 iterations, 0.00 seconds

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     2  postponed    0                 -        -      -      -    0s
 71283 64203  postponed  891                 -        -      -    0.0    5s
134459 127181 postponed 1772                 -        -      -    0.0   10s
214323 207261 postponed 1772                 -        -      -    0.0   15s
292995 285775 postponed 2630                 -        -      -    0.0   20s
372859 365589 postponed 3534                 -        -      -    0.0   25s
452723 445507 postponed 4414                 -        -      -    0.0   30s
531395 524129 postponed 4414                 -        -      -    0.0   35s
611259 604451 postponed 5145                 -        -      -    0.0   40s
691123 683851 postponed 5296                 -        -      -    0.0   45s
770987 763725 postponed 6168                 -        -      -    0.0   50s
849659 842391 postponed 6177                 -        -      -    0.0   55s
929523 922367 postponed 7058                 -        -      -    0.0   60s
...
42595883 42589647 postponed 9111             -        -      -    0.0 2695s
42674555 42668119 postponed 9402             -        -      -    0.0 2700s
42754419 42747901 postponed 8827             -        -      -    0.0 2705s
42833091 42826421 postponed 9649             -        -      -    0.0 2710s
42910571 42903743 postponed 9221             -        -      -    0.0 2715s
 

This was really bad: not even a feasible solution after 2,700 seconds. Compare this to Baron, which could solve the MINLP model in about 300 seconds (on a much slower machine).

Genetic Algorithm


It is interesting to see how a simple meta-heuristic would do on this problem. Here are the results using the ga function from the GA package [2]. This was actually quite easy to implement.


> df <- read.table(text="
+ id height1 weight1 height2 weight2
+ i1 67.433285 168.262871 67.445523 163.692389
+ i2 70.638374 174.437750 68.649190 160.084811
+ i3 71.317794 159.909672 69.503911 164.720010
+ i4 59.850261 145.704159 61.175728 142.708300
+ i5 65.341938 155.586984 68.483909 165.564991
+ i6 64.142009 154.335001 68.568683 166.169507
+ i7 67.030368 158.768813 65.780803 153.721717
+ i8 73.672863 175.126951 73.236515 164.704340
+ i9 65.203516 157.593587 63.279277 149.784500
+ i10 69.001848 160.063428 68.786656 162.278007
+ i11 64.455422 159.039195 63.930208 152.827710
+ i12 70.719334 164.885704 69.666096 157.356595
+ i13 65.688428 151.223468 63.614565 150.071072
+ i14 66.569252 160.978671 70.533320 160.722483
+ i15 78.417676 172.298652 80.070076 172.695207
+ i16 65.396154 158.234709 67.404942 158.310596
+ i17 62.504967 150.899428 61.000439 154.094647
+ i18 62.122630 150.024298 63.634554 153.644324
+ i19 70.598400 165.086523 72.999194 166.771223
+ i20 74.935107 170.820610 76.622182 169.013550
+ i21 63.233956 154.331546 60.372876 149.152520
+ i22 72.550105 173.961915 76.748649 167.462369
+ i23 74.086553 168.190867 75.433331 171.773607
+ i24 65.379648 163.577697 65.717553 160.134888
+ i25 64.003038 155.357607 67.301426 158.713710
+ "
, header=T)
>
>#
># print obvious cases
>#
> cor(df$weight1,df$height1)
[1] 0.8686908
> cor(df$weight2,df$height2)
[1] 0.894532
>
>#
># fitness function
>#
>f<-function(x) {
+ w <- df$weight1*(1-x) + df$weight2*x
+ h <- df$height1*(1-x) + df$height2*x
+ cor(w,h)
+ }
>
> library(GA)
> res <- ga(type=c("binary"),fitness=f,nBits=25,seed=123)
GA |iter=1|Mean=0.8709318|Best=0.9237155
GA |iter=2|Mean=0.8742004|Best=0.9237155
GA |iter=3|Mean=0.8736450|Best=0.9237155
GA |iter=4|Mean=0.8742228|Best=0.9384788
GA |iter=5|Mean=0.8746517|Best=0.9384788
GA |iter=6|Mean=0.8792048|Best=0.9486227
GA |iter=7|Mean=0.8844841|Best=0.9486227
GA |iter=8|Mean=0.8816874|Best=0.9486227
GA |iter=9|Mean=0.8805522|Best=0.9486227
GA |iter=10|Mean=0.8820974|Best=0.9486227
GA |iter=11|Mean=0.8859074|Best=0.9486227
GA |iter=12|Mean=0.8956467|Best=0.9486227
GA |iter=13|Mean=0.8989140|Best=0.9486227
GA |iter=14|Mean=0.9069327|Best=0.9486227
GA |iter=15|Mean=0.9078787|Best=0.9486227
GA |iter=16|Mean=0.9069163|Best=0.9489443
GA |iter=17|Mean=0.9104712|Best=0.9489443
GA |iter=18|Mean=0.9169900|Best=0.9489443
GA |iter=19|Mean=0.9175285|Best=0.9489443
GA |iter=20|Mean=0.9207076|Best=0.9489443
GA |iter=21|Mean=0.9210288|Best=0.9489443
GA |iter=22|Mean=0.9206928|Best=0.9489443
GA |iter=23|Mean=0.9210399|Best=0.9489443
GA |iter=24|Mean=0.9208985|Best=0.9489443
GA |iter=25|Mean=0.9183778|Best=0.9511446
GA |iter=26|Mean=0.9217391|Best=0.9511446
GA |iter=27|Mean=0.9274271|Best=0.9522764
GA |iter=28|Mean=0.9271156|Best=0.9522764
GA |iter=29|Mean=0.9275347|Best=0.9522764
GA |iter=30|Mean=0.9278315|Best=0.9522764
GA |iter=31|Mean=0.9300289|Best=0.9522764
GA |iter=32|Mean=0.9306409|Best=0.9528777
GA |iter=33|Mean=0.9309087|Best=0.9528777
GA |iter=34|Mean=0.9327691|Best=0.9528777
GA |iter=35|Mean=0.9309344|Best=0.9549574
GA |iter=36|Mean=0.9341977|Best=0.9549574
GA |iter=37|Mean=0.9374437|Best=0.9559043
GA |iter=38|Mean=0.9394410|Best=0.9559043
GA |iter=39|Mean=0.9405482|Best=0.9559043
GA |iter=40|Mean=0.9432749|Best=0.9564515
GA |iter=41|Mean=0.9441814|Best=0.9564515
GA |iter=42|Mean=0.9458232|Best=0.9564515
GA |iter=43|Mean=0.9469625|Best=0.9564515
GA |iter=44|Mean=0.9462313|Best=0.9564515
GA |iter=45|Mean=0.9449716|Best=0.9564515
GA |iter=46|Mean=0.9444071|Best=0.9564515
GA |iter=47|Mean=0.9437149|Best=0.9564515
GA |iter=48|Mean=0.9446355|Best=0.9564515
GA |iter=49|Mean=0.9455424|Best=0.9564515
GA |iter=50|Mean=0.9456497|Best=0.9564515
GA |iter=51|Mean=0.9461382|Best=0.9564515
GA |iter=52|Mean=0.9444960|Best=0.9564515
GA |iter=53|Mean=0.9434671|Best=0.9564515
GA |iter=54|Mean=0.9451851|Best=0.9564515
GA |iter=55|Mean=0.9481903|Best=0.9564515
GA |iter=56|Mean=0.9477778|Best=0.9564515
GA |iter=57|Mean=0.9481829|Best=0.9564515
GA |iter=58|Mean=0.9490952|Best=0.9564515
GA |iter=59|Mean=0.9505670|Best=0.9564515
GA |iter=60|Mean=0.9499329|Best=0.9564515
GA |iter=61|Mean=0.9509299|Best=0.9564515
GA |iter=62|Mean=0.9505341|Best=0.9564515
GA |iter=63|Mean=0.9519624|Best=0.9564515
GA |iter=64|Mean=0.9518618|Best=0.9564515
GA |iter=65|Mean=0.9523598|Best=0.9564515
GA |iter=66|Mean=0.9516766|Best=0.9564515
GA |iter=67|Mean=0.9521926|Best=0.9564515
GA |iter=68|Mean=0.9524419|Best=0.9564515
GA |iter=69|Mean=0.9532865|Best=0.9564515
GA |iter=70|Mean=0.9535871|Best=0.9564515
GA |iter=71|Mean=0.9536049|Best=0.9564515
GA |iter=72|Mean=0.9534035|Best=0.9564515
GA |iter=73|Mean=0.9532859|Best=0.9564515
GA |iter=74|Mean=0.9521064|Best=0.9564515
GA |iter=75|Mean=0.9534997|Best=0.9564515
GA |iter=76|Mean=0.9539987|Best=0.9564515
GA |iter=77|Mean=0.9536670|Best=0.9564515
GA |iter=78|Mean=0.9526224|Best=0.9564515
GA |iter=79|Mean=0.9531871|Best=0.9564515
GA |iter=80|Mean=0.9527495|Best=0.9564515
GA |iter=81|Mean=0.9526061|Best=0.9564515
GA |iter=82|Mean=0.9525577|Best=0.9564515
GA |iter=83|Mean=0.9525084|Best=0.9564515
GA |iter=84|Mean=0.9519052|Best=0.9564515
GA |iter=85|Mean=0.9518549|Best=0.9564515
GA |iter=86|Mean=0.9511299|Best=0.9564515
GA |iter=87|Mean=0.9505129|Best=0.9564515
GA |iter=88|Mean=0.9518203|Best=0.9564515
GA |iter=89|Mean=0.9537234|Best=0.9564515
GA |iter=90|Mean=0.9531017|Best=0.9564515
GA |iter=91|Mean=0.9514525|Best=0.9564515
GA |iter=92|Mean=0.9505517|Best=0.9564515
GA |iter=93|Mean=0.9524752|Best=0.9564515
GA |iter=94|Mean=0.9533879|Best=0.9564515
GA |iter=95|Mean=0.9519166|Best=0.9564515
GA |iter=96|Mean=0.9524416|Best=0.9564515
GA |iter=97|Mean=0.9526676|Best=0.9564515
GA |iter=98|Mean=0.9523745|Best=0.9564515
GA |iter=99|Mean=0.9523710|Best=0.9564515
GA |iter=100|Mean=0.9519255|Best=0.9564515
> res@solution
     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25
[1,]  1  1  1  0  0  0  0  1  1   0   1   0   1   1   0   1   0   0   1   0   1   1   1   1   1
> res@fitnessValue
[1] 0.9564515

Comparing with our proven optimal Baron solution, we see that the ga function found the optimal solution. Of course, we would not have known this had we not solved the model with a global solver like Baron.


References

[1] …
[2] Luca Scrucca, GA: A Package for Genetic Algorithms in R, Journal of Statistical Software, 53(4), 2013.
