
SDP Model: imputing a covariance matrix

Missing values are a well-known problem in statistics. The simplest approach is just to delete all data cases that have missing values. Another approach is to repair things by filling in reasonable values. This is called imputation. Imputation strategies can be very sophisticated (and complex).

Statistical tools often have direct support for representing missing values: e.g., the R language has NA (not available), and GAMS also has NA. Python has no explicit support for missing values; by convention, the special floating-point value NaN (Not a Number) is used to indicate missing floating-point values. Note that the numpy library has some facilities for dealing with missing data, but they are not really like R's NA [2].
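A minimal illustration of this convention (plain numpy; nothing here is specific to the model below): NaN can be detected with np.isnan, and it silently propagates through arithmetic unless explicitly skipped.

import numpy as np

x = np.array([1.0, np.nan, 3.0])
print(np.isnan(x))     # [False  True False]: detect the missing entry
print(x.sum())         # nan: NaN propagates through arithmetic
print(np.nansum(x))    # 4.0: np.nansum skips the missing value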

In [1] a semi-definite programming (SDP) model is proposed to deal with a covariance matrix with some missing values by imputation. The constraint to be added is that the covariance matrix should remain positive semi-definite (PSD). In theory a covariance matrix is always PSD, but in practice it can happen that it is not. The resulting model is stated as:

Impute missing values in Covariance Matrix (from [1])
\[\begin{align} \text{minimize}\>& 0\\ \text{subject to}\>&\color{darkred}{\Sigma}_{i,j} = \widetilde{\color{darkblue}{\Sigma}}_{i,j} && (i,j)\notin \color{darkblue}M\\ & \color{darkred}{\Sigma} \succeq 0 \end{align} \]

Here \(\widetilde{\Sigma}\) is the covariance matrix with missing data in locations \((i,j)\in M\). The variable \(\Sigma\) is the new covariance matrix with the missing data filled in, such that \(\Sigma\) is positive semi-definite. This last condition is denoted by \(\Sigma \succeq 0\). In this model there is no real objective, as indicated by minimizing zero.

CVXPY implementation


There is no code provided for this model in [1], so let me give it a try. CVXPY does not have good support for things like \(\forall (i,j) \notin M\). I can see two approaches:

  • Expand the constraint into scalar form. Essentially, a DIY approach.
  • Use a binary data matrix \(M_{i,j} \in \{0,1\}\) indicating the missing values and write \[(e\cdot e^T-M) \circ \Sigma = \widetilde{\Sigma}_0\] where \(\circ\) is elementwise multiplication (a.k.a. Hadamard product), \(e\) is a column vector of ones of appropriate size, and \(\widetilde{\Sigma}_0\) is \(\widetilde{\Sigma}\) but with NaN's replaced by zeros.

In addition, let's add a regularizing objective: minimize the sum of squares of \(\Sigma_{i,j}\).

The Python code for these two models is:


import numpy as np
import pandas as pd
import cvxpy as cp

#------------ data ----------------

cov = np.array([
[ 0.300457, -0.158889, 0.080241, -0.143750, 0.072844, -0.032968, 0.077836, 0.049272],
[-0.158889, 0.399624, np.nan, 0.109056, 0.082858, -0.045462, -0.124045, -0.132096],
[ 0.080241, np.nan, np.nan, -0.031902, -0.081455, 0.098212, 0.243131, 0.120404],
[-0.143750, 0.109056, -0.031902, 0.386109, -0.058051, 0.060246, 0.082420, 0.125786],
[ 0.072844, 0.082858, -0.081455, -0.058051, np.nan, np.nan, -0.119530, -0.054881],
[-0.032968, -0.045462, 0.098212, 0.060246, np.nan, 0.400641, 0.051103, 0.007308],
[ 0.077836, -0.124045, 0.243131, 0.082420, -0.119530, 0.051103, 0.543407, 0.121709],
[ 0.049272, -0.132096, 0.120404, 0.125786, -0.054881, 0.007308, 0.121709, 0.481395]
])
print("Covariance data with NaNs")
print(pd.DataFrame(cov))

M = 1*np.isnan(cov)
print("M (indicator for missing values)")
print(pd.DataFrame(M))

dim = np.shape(cov)
n = dim[0]

#----------- model 1 -----------------

Sigma = cp.Variable(dim, symmetric=True)   # symmetric matrix variable

prob = cp.Problem(
    cp.Minimize(cp.sum_squares(Sigma)),    # regularizing objective
    # equality constraints fix the known entries; the missing ones stay free
    [Sigma[i,j] == cov[i,j] for i in range(n) for j in range(n) if M[i,j]==0] +
    [Sigma >> 0])                          # Sigma must be positive semi-definite
prob.solve(solver=cp.SCS, verbose=True)

print("Status:", prob.status)
print("Objective:", prob.value)
print(pd.DataFrame(Sigma.value))

#----------- model 2 -----------------

e = np.ones((n,1))                       # column vector of ones
cov0 = np.nan_to_num(cov, copy=True)     # NaNs replaced by zeros

prob2 = cp.Problem(
    # cp.Minimize(cp.trace(Sigma.T@Sigma))   <-- not recognized as convex
    cp.Minimize(cp.norm(Sigma, "fro")**2),   # squared Frobenius norm instead
    [cp.multiply(e@e.T - M, Sigma) == cov0,  # elementwise product fixes known entries
     Sigma >> 0])
prob2.solve(solver=cp.SCS, verbose=True)

print("Status:", prob2.status)
print("Objective:", prob2.value)
print(pd.DataFrame(Sigma.value))


Notes:

  • Model 1 has a (long) list of scalar constraints. The objective is \[\min\>\sum_{i,j} \Sigma_{i,j}^2\] Sorry for the possible confusion between the symbols for summation and covariance.
  • CVXPY uses the notation Sigma >> 0 to indicate \(\Sigma \succeq 0\) (i.e. \(\Sigma\) should be positive semi-definite).
  • We added the condition that \(\Sigma\) should be symmetric in the variable statement. This seems to be needed: without it, the solver may return a non-symmetric matrix. I suspect that in that case the matrix \(0.5(\Sigma+\Sigma^T)\), rather than \(\Sigma\) itself, is required to be positive semi-definite.
  • Model 2 is an attempt to use matrix notation. The objective can be stated as \[\min\>\mathbf{tr}(\Sigma^T\Sigma)\] but that is not recognized as being convex. As an alternative I used the (squared) Frobenius norm: \[||A||_F =\sqrt{ \sum_{i,j} a_{i,j}^2}\]
  • The function np.nan_to_num converts NaN values to zeros.
  • The function cp.multiply performs elementwise multiplication (as opposed to matrix multiplication).
  • I don't think we can easily pass only the upper-triangular part of the covariance matrix to the solver. For large problems this would save some effort (CPU time and memory).
  • In a traditional optimization model we would have just \(|M|\) decision variables (corresponding to the missing values). Here, in the scalar model, we have \(n^2\) variables and \(n^2-|M|\) constraints. A sketch of such a compact formulation is shown below.
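As a sketch of what this compact formulation could look like in CVXPY (my own untested construction: the names miss, x, Sigma3 are mine, it reuses cov0 and M from model 2, and it relies on CVXPY's documented behavior that expr >> 0 constrains the symmetric part of an expression):

# hypothetical compact model: one variable per missing upper-triangular entry
miss = [(i,j) for i in range(n) for j in range(i,n) if M[i,j]]
x = cp.Variable(len(miss))

Sigma3 = cp.Constant(cov0)         # start from the known entries (NaNs already zeroed)
for k,(i,j) in enumerate(miss):
    E = np.zeros((n,n))
    E[i,j] = 1
    E[j,i] = 1                     # keep the expression symmetric
    Sigma3 = Sigma3 + x[k]*E       # affine expression in the |M| variables

prob3 = cp.Problem(cp.Minimize(cp.sum_squares(Sigma3)), [Sigma3 >> 0])
prob3.solve(solver=cp.SCS)
print(pd.DataFrame(Sigma3.value))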


The results for the two models are:


Covariance data with NaNs
          0         1         2         3         4         5         6         7
0  0.300457 -0.158889  0.080241 -0.143750  0.072844 -0.032968  0.077836  0.049272
1 -0.158889  0.399624       NaN  0.109056  0.082858 -0.045462 -0.124045 -0.132096
2  0.080241       NaN       NaN -0.031902 -0.081455  0.098212  0.243131  0.120404
3 -0.143750  0.109056 -0.031902  0.386109 -0.058051  0.060246  0.082420  0.125786
4  0.072844  0.082858 -0.081455 -0.058051       NaN       NaN -0.119530 -0.054881
5 -0.032968 -0.045462  0.098212  0.060246       NaN  0.400641  0.051103  0.007308
6  0.077836 -0.124045  0.243131  0.082420 -0.119530  0.051103  0.543407  0.121709
7  0.049272 -0.132096  0.120404  0.125786 -0.054881  0.007308  0.121709  0.481395
M (indicator for missing values)
   0  1  2  3  4  5  6  7
0  0  0  0  0  0  0  0  0
1  0  0  1  0  0  0  0  0
2  0  1  1  0  0  0  0  0
3  0  0  0  0  0  0  0  0
4  0  0  0  0  1  1  0  0
5  0  0  0  0  1  0  0  0
6  0  0  0  0  0  0  0  0
7  0  0  0  0  0  0  0  0
----------------------------------------------------------------------------
SCS v2.1.1 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 160
eps = 1.00e-04, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 37, constraints m = 160
Cones: primal zero / dual free vars: 58
soc vars: 66, soc blks: 1
sd vars: 36, sd blks: 1
Setup time: 9.69e-03s
----------------------------------------------------------------------------
Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0| 4.05e+19  7.57e+19  1.00e+00 -3.12e+19  1.92e+20  1.21e+20  1.53e-02
    40| 2.74e-10  1.01e-09  4.51e-11  1.71e+00  1.71e+00  1.96e-17  1.88e-02
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 1.89e-02s
Lin-sys: nnz in L factor: 357, avg solve time: 1.54e-06s
Cones: avg projection time: 1.52e-04s
Acceleration: avg step time: 1.66e-05s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 4.7898e-17, dist(y, K*) = 1.5753e-09, s'y/|s||y| = 3.7338e-12
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 2.7439e-10
dual res: |A'y + c|_2 / (1 + |c|_2) = 1.0103e-09
rel gap: |c'x + b'y| / (1 + |c'x| + |b'y|) = 4.5078e-11
----------------------------------------------------------------------------
c'x = 1.7145, -b'y = 1.7145
============================================================================
Status: optimal
Objective: 1.714544257213233
          0         1         2         3         4         5         6         7
0  0.300457 -0.158889  0.080241 -0.143750  0.072844 -0.032968  0.077836  0.049272
1 -0.158889  0.399624 -0.084196  0.109056  0.082858 -0.045462 -0.124045 -0.132096
2  0.080241 -0.084196  0.198446 -0.031902 -0.081455  0.098212  0.243131  0.120404
3 -0.143750  0.109056 -0.031902  0.386109 -0.058051  0.060246  0.082420  0.125786
4  0.072844  0.082858 -0.081455 -0.058051  0.135981 -0.041927 -0.119530 -0.054881
5 -0.032968 -0.045462  0.098212  0.060246 -0.041927  0.400641  0.051103  0.007308
6  0.077836 -0.124045  0.243131  0.082420 -0.119530  0.051103  0.543407  0.121709
7  0.049272 -0.132096  0.120404  0.125786 -0.054881  0.007308  0.121709  0.481395
----------------------------------------------------------------------------
SCS v2.1.1 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 162
eps = 1.00e-04, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 38, constraints m = 168
Cones: primal zero / dual free vars: 64
soc vars: 68, soc blks: 2
sd vars: 36, sd blks: 1
Setup time: 1.02e-02s
----------------------------------------------------------------------------
Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0| 3.67e+19  5.42e+19  1.00e+00 -2.45e+19  1.28e+20  1.04e+20  9.84e-03
    40| 5.85e-10  1.47e-09  7.31e-10  1.71e+00  1.71e+00  8.09e-17  1.29e-02
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 1.31e-02s
Lin-sys: nnz in L factor: 368, avg solve time: 2.56e-06s
Cones: avg projection time: 3.03e-05s
Acceleration: avg step time: 2.45e-05s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 4.4409e-16, dist(y, K*) = 1.5216e-09, s'y/|s||y| = 4.2866e-12
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 5.8496e-10
dual res: |A'y + c|_2 / (1 + |c|_2) = 1.4729e-09
rel gap: |c'x + b'y| / (1 + |c'x| + |b'y|) = 7.3074e-10
----------------------------------------------------------------------------
c'x = 1.7145, -b'y = 1.7145
============================================================================
Status: optimal
Objective: 1.714544261472336
          0         1         2         3         4         5         6         7
0  0.300457 -0.158889  0.080241 -0.143750  0.072844 -0.032968  0.077836  0.049272
1 -0.158889  0.399624 -0.084196  0.109056  0.082858 -0.045462 -0.124045 -0.132096
2  0.080241 -0.084196  0.198446 -0.031902 -0.081455  0.098212  0.243131  0.120404
3 -0.143750  0.109056 -0.031902  0.386109 -0.058051  0.060246  0.082420  0.125786
4  0.072844  0.082858 -0.081455 -0.058051  0.135981 -0.041927 -0.119530 -0.054881
5 -0.032968 -0.045462  0.098212  0.060246 -0.041927  0.400641  0.051103  0.007308
6  0.077836 -0.124045  0.243131  0.082420 -0.119530  0.051103  0.543407  0.121709
7  0.049272 -0.132096  0.120404  0.125786 -0.054881  0.007308  0.121709  0.481395

As a sanity check we can confirm that the eigenvalues of the solution matrix are non-negative (one eigenvalue is essentially zero, so \(\Sigma\) sits on the boundary of the PSD cone):


w,v = np.linalg.eig(Sigma.value)
print(w)

[9.46355900e-01 6.34465779e-01 2.35993549e-10 5.30366506e-02
 1.69999646e-01 2.29670882e-01 4.36623248e-01 3.75907704e-01]


Practice


I don't think this is a practical way of dealing with missing values. First of all, missing values in the original data will propagate into the covariance matrix: a single NA in the data leads to lots of NAs in the covariance matrix.


----     28 PARAMETER cov  effect of a single NA in the data

            j1          j2          j3          j4          j5          j6          j7          j8

j1    0.300457   -0.158889          NA   -0.143750    0.072844   -0.032968    0.077836    0.049272
j2   -0.158889    0.399624          NA    0.109056    0.082858   -0.045462   -0.124045   -0.132096
j3          NA          NA          NA          NA          NA          NA          NA          NA
j4   -0.143750    0.109056          NA    0.386109   -0.058051    0.060246    0.082420    0.125786
j5    0.072844    0.082858          NA   -0.058051    0.354627   -0.129507   -0.119530   -0.054881
j6   -0.032968   -0.045462          NA    0.060246   -0.129507    0.400641    0.051103    0.007308
j7    0.077836   -0.124045          NA    0.082420   -0.119530    0.051103    0.543407    0.121709
j8    0.049272   -0.132096          NA    0.125786   -0.054881    0.007308    0.121709    0.481395


This propagation is the result of applying the standard formula for the covariance: \[cov_{j,k} = \frac{1}{N-1} \sum_i (x_{i,j}-\mu_j)(x_{i,k}-\mu_k) \] This is of course difficult to fix at the level of the covariance matrix: too much damage has been done.
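We can see this propagation in a few lines of numpy (a sketch with synthetic data; the array a, its dimensions, and the seed are my own choices, not the data set behind the table above):

import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=(25,8))    # 25 observations of 8 variables
a[2,3] = np.nan                # a single missing observation

C = np.cov(a, rowvar=False)    # the mean of column 3 is NaN, so every product
print(np.isnan(C).sum())       # involving it is NaN: 15 entries of C are NaN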

A second problem with our SDP model is that we are not staying close to reasonable values for the missing entries: the model only looks at the PSD constraint.

Basically we need to look at the original data.

A simple remedy is just to throw away the records with NAs. If you have lots of data and relatively few NAs in the data, this is a reasonable approach. However, there is a trick we can use: instead of throwing away a whole row of observations when it contains an NA, we inspect pairs of columns \((j,k)\) individually. For the two columns \(j\) and \(k\), drop the rows with NAs in either column and then calculate the covariance \(cov_{j,k}\). Repeat for all combinations \((j,k)\) with \(j \lt k\). R has this built-in:


> cov(a)
              [,1]         [,2]         [,3]         [,4]          [,5]          [,6]         [,7]          [,8]
[1,]  0.3269524261 -0.022335922  0.024062915  0.026460677 -0.0003735916  0.0021383397 0.0544727640 -0.0008417817
[2,] -0.0223359222  0.313444259  0.036135413  0.027115454 -0.0045955942  0.0286659334 0.0558610843  0.0222590384
[3,]  0.0240629148  0.036135413  0.308443182  0.003663338  0.0014232064 -0.0158431246 0.0308769925 -0.0177244600
[4,]  0.0264606771  0.027115454  0.003663338  0.322801448  0.0057221934  0.0175051722 0.0152804438  0.0034349411
[5,] -0.0003735916 -0.004595594  0.001423206  0.005722193  0.2920368646 -0.0069213567 0.0227153919  0.0163823701
[6,]  0.0021383397  0.028665933 -0.015843125  0.017505172 -0.0069213567  0.3095935603 0.0009359271  0.0506571760
[7,]  0.0544727640  0.055861084  0.030876993  0.015280444  0.0227153919  0.0009359271 0.3635080311  0.0322080200
[8,] -0.0008417817  0.022259038 -0.017724460  0.003434941  0.0163823701  0.0506571760 0.0322080200  0.2992700098
> a[2,3]=NA
> cov(a)
              [,1]         [,2] [,3]         [,4]          [,5]          [,6]         [,7]          [,8]
[1,]  0.3269524261 -0.022335922   NA  0.026460677 -0.0003735916  0.0021383397 0.0544727640 -0.0008417817
[2,] -0.0223359222  0.313444259   NA  0.027115454 -0.0045955942  0.0286659334 0.0558610843  0.0222590384
[3,]            NA           NA   NA           NA            NA            NA           NA            NA
[4,]  0.0264606771  0.027115454   NA  0.322801448  0.0057221934  0.0175051722 0.0152804438  0.0034349411
[5,] -0.0003735916 -0.004595594   NA  0.005722193  0.2920368646 -0.0069213567 0.0227153919  0.0163823701
[6,]  0.0021383397  0.028665933   NA  0.017505172 -0.0069213567  0.3095935603 0.0009359271  0.0506571760
[7,]  0.0544727640  0.055861084   NA  0.015280444  0.0227153919  0.0009359271 0.3635080311  0.0322080200
[8,] -0.0008417817  0.022259038   NA  0.003434941  0.0163823701  0.0506571760 0.0322080200  0.2992700098
> cov(a,use="pairwise")
              [,1]         [,2]         [,3]         [,4]          [,5]          [,6]         [,7]          [,8]
[1,]  0.3269524261 -0.022335922  0.024077969  0.026460677 -0.0003735916  0.0021383397 0.0544727640 -0.0008417817
[2,] -0.0223359222  0.313444259  0.036895996  0.027115454 -0.0045955942  0.0286659334 0.0558610843  0.0222590384
[3,]  0.0240779693  0.036895996  0.311573754  0.003377392  0.0013694087 -0.0162231609 0.0310202082 -0.0180617800
[4,]  0.0264606771  0.027115454  0.003377392  0.322801448  0.0057221934  0.0175051722 0.0152804438  0.0034349411
[5,] -0.0003735916 -0.004595594  0.001369409  0.005722193  0.2920368646 -0.0069213567 0.0227153919  0.0163823701
[6,]  0.0021383397  0.028665933 -0.016223161  0.017505172 -0.0069213567  0.3095935603 0.0009359271  0.0506571760
[7,]  0.0544727640  0.055861084  0.031020208  0.015280444  0.0227153919  0.0009359271 0.3635080311  0.0322080200
[8,] -0.0008417817  0.022259038 -0.018061780  0.003434941  0.0163823701  0.0506571760 0.0322080200  0.2992700098
>
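
For Python users, pandas provides the same pairwise behavior out of the box: DataFrame.cov computes pairwise covariances of the columns, excluding NA values. A minimal sketch, reusing the synthetic array a with a single NaN from the numpy example above:

import pandas as pd

df = pd.DataFrame(a)              # data with a single NaN
C2 = df.cov()                     # pairwise covariances; NaNs dropped per column pair
print(C2.isna().sum().sum())      # 0: no NaNs left in the covariance matrix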


The disadvantage of pairwise covariances is that the resulting matrix may fail to be positive semi-definite (this can happen even in exact arithmetic, not just through rounding). We can repair this with R's nearPD function. Essentially, this performs an eigen-decomposition, replaces the negative eigenvalues by positive ones, and then reassembles the covariance matrix (just matrix multiplications). See [3] for more discussion of non-PSD covariance matrices in portfolio models.
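A bare-bones Python version of this eigenvalue repair could look as follows. This is only a sketch of the idea (the real nearPD in R's Matrix package uses Higham's alternating-projections algorithm and has more options); the function name near_psd is mine:

import numpy as np

def near_psd(C, eps=0.0):
    # eigen-decompose, clip the negative eigenvalues, reassemble
    w, V = np.linalg.eigh(C)      # C is assumed symmetric
    w = np.maximum(w, eps)        # replace negative eigenvalues by eps
    return V @ np.diag(w) @ V.T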

Conclusion


The model presented in [1] is interesting: it is not quite obvious how to implement it in CVXPY (and the code below the example in [1] is not directly related). However, it should be mentioned that better methods are available to address the underlying problem: how to handle missing values in a covariance matrix.

References


  1. Semidefinite program, https://www.cvxpy.org/examples/basic/sdp.html
  2. Missing Data Functionality in NumPy, https://docs.scipy.org/doc/numpy-1.10.1/neps/missing-data.html
  3. Covariance matrix not positive definite in portfolio models, https://yetanothermathprogrammingconsultant.blogspot.com/2018/04/covariance-matrix-not-positive-definite.html
