Rudolf Sponsel, Erlangen, translated by Agnes Mehl, Fürth
It is shown that the analysis of collinearity can be carried out using the PESO analysis by HAIN (1994) (Pivotized Erhard Schmidt Orthonormalization) and the numerical stability analysis (SPONSEL 1994). Furthermore, it is shown how the indefinite correlation matrix by SPEARMAN & HART, which produces highly pathological multiple and partial correlation coefficients, can be "cured" using the centroid method by THURSTONE. Finally, it is shown that the multicollinearity, i.e. the multiple laws contained in this correlation matrix, unfortunately disappears after a successful "centroid therapy". Thus it is also shown that no reliable statement can be made about indefinite correlation matrices. Numerous, even historically important, correlation matrices are affected by this, e.g. the "Primaries..." by THURSTONE (Sponsel 1994, Empirical Correlation Matrices report, 1910-1993).
Indefinite correlation matrices, i.e. matrices that have lost their positive definiteness, produce absurd multiple and partial correlation coefficients. This can mostly be traced back to major methodological mistakes (Sponsel 1994). Particularly severe cases are indicated by larger negative eigenvalues (rule of thumb: > |.01|). If, on the other hand, the negative eigenvalues are "small" (rule of thumb: in the region of the third digit after the decimal point), the disorder is most probably a consequence of collinearity, i.e. the representation of a linear law combined with the rounding errors that are unavoidable in concrete numerical computation. One can no longer compute responsibly and reasonably with indefinite correlation matrices. Thus the question arises whether and how such matrices can be "cured". An effective procedure is presented by KNOL & TEN BERGE (1989). We now want to demonstrate the efficiency of the centroid method by THURSTONE on the SPEARMAN & HART matrix from 1913. We also show how an analysis of collinearity can be carried out using the PESO analysis by HAIN (1994) (Pivotized Erhard Schmidt Orthonormalization) and the analysis of numerical stability (SPONSEL 1994).
2. Explanations of the criteria of the matrix analysis
Samp_Or_MD_NumS_Condit_Determ_HaInRatio_R_OutIn_K_Norm_C_Norm
Samp =: sample size
Md =: missing data information: -1 =: unknown
NumS =: valuation of numerical stability (& possibly the number of negative eigenvalues). So far, the following rule-of-thumb valuations exist:
+    numerically stable
+?   borderline, tendency to be rather numerically stable
-?   borderline, tendency to be rather numerically instable
-    numerically instable
--Z  indefinite with Z negative eigenvalues
Even if only one negative eigenvalue is present, the matrix is indefinite and derailments of any kind are possible. The matrix has turned "psychotic", so to speak: no value can be trusted anymore, everything is possible. Such a state has to be avoided at all costs, or immediately reversed or "treated" before any further calculation can be done.
Condit =: condition number, i.e. largest absolute eigenvalue / smallest absolute eigenvalue. For matrices of order < 10, a Condit > 30 indicates numeric instability; for order < 20, a Condit > 50 does.
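As an illustration (with a small made-up correlation matrix, not data from this paper), the condition number criterion can be computed directly from the eigenvalues:

```python
import numpy as np

# Made-up 3x3 correlation matrix with a near-collinear pair (r12 = .99).
R = np.array([
    [1.00, 0.99, 0.30],
    [0.99, 1.00, 0.28],
    [0.30, 0.28, 1.00],
])

eig = np.linalg.eigvalsh(R)                      # eigenvalues, ascending
condit = np.abs(eig).max() / np.abs(eig).min()   # condition number as defined above
print(eig.min() > 0)   # still positive definite
print(condit > 30)     # but Condit > 30: numerically instable by the rule of thumb
```

The matrix is still positive definite, yet the rule of thumb already flags it as instable; this is exactly the situation where collinearity hides behind a formally valid matrix.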
Determ =: determinant. The determinant represents the absolute value of the volume of the n-dimensional parallelotope (multidimensional object). The smaller it is, the smaller the volume of the space. A small volume can be caused by a single small vector, corresponding to a small angle. This is the critical case. A small determinant, however, can also result quite "normally" from the "natural" calculation process without expressing numeric instability. A valuation based on the absolute value of the determinant alone is therefore not reasonable.
HaInRatio =: HADAMARD number of the inverse. The HADAMARD number of the inverse indicates the ratio of the actual determinant of the inverse to its theoretically maximum value for the given coefficient matrix. According to the rule of thumb by FADDEJEW & FADDEJEWA, an inverse determinant is considered small if its ratio is 1 : 50 000 or less, i.e. the HaInRatio is < .00002.
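A hedged sketch of how such a ratio can be computed (the exact normalization in the original software may differ; here the theoretical maximum is taken as the Hadamard bound, the product of the Euclidean row norms):

```python
import numpy as np

def hadamard_ratio(A):
    # |det(A)| divided by its theoretical maximum, the Hadamard bound
    # (product of the Euclidean row norms). By Hadamard's inequality
    # this ratio always lies between 0 and 1.
    return abs(np.linalg.det(A)) / np.prod(np.linalg.norm(A, axis=1))

# Made-up example matrix (not the Spearman & Hart data):
R = np.array([[1.0, 0.9, 0.8],
              [0.9, 1.0, 0.7],
              [0.8, 0.7, 1.0]])
ratio_inv = hadamard_ratio(np.linalg.inv(R))
print(0 < ratio_inv <= 1)
```

For an ill-conditioned matrix this ratio of the inverse collapses toward zero, which is what the < .00002 rule of thumb detects.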
R_OutIn =: LES input-output ratio (SPONSEL 1994). The input-output ratio indicates by how much the output changes if the input is changed by one unit in the third digit after the decimal point. Theoretically the value ranges from 0 to ...
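The idea can be sketched as follows (a hypothetical illustration, not the original LES software): solve a linear system, perturb the right-hand side by one unit in the third decimal, and compare the change in the solution:

```python
import numpy as np

# Made-up, nearly singular coefficient matrix.
A = np.array([[1.00, 0.99],
              [0.99, 1.00]])
b = np.array([1.0, 1.0])

x  = np.linalg.solve(A, b)
b2 = b + np.array([0.001, -0.001])    # change of one unit in the 3rd decimal
x2 = np.linalg.solve(A, b2)

out_in_ratio = np.abs(x2 - x).max() / 0.001
print(out_in_ratio)   # ~100: the output changes about 100 times as much as the input
```

An input-output ratio far above 1, as in the analyses below, means that rounding noise in the data is massively amplified in the results.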
K_Norm =: smallest PESO norm of the correlation matrix (HAIN 1994). The smallest reduced norm ("shortest" norm, "flattest" angle) of the correlation matrix is a measure of the degree of collinearity. The smaller the value of the K_Norm, the stronger the degree of collinearity. The product of all reduced norms equals the absolute value of the determinant. Thus a single small K_Norm is sufficient to bring the volume close to 0 (equivalent to the function of small eigenvalues). PESO for the correlation matrix is adjusted such that for all K_Norms < 0.01 the number of relations (collinearities) is printed out in brackets. The square root of the K_Norm gives an upper bound for the largest correlation coefficient.
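The central property used here, that the product of the reduced norms equals the absolute value of the determinant, can be illustrated with a plain (unpivotized) Gram-Schmidt pass over the columns; this is only a sketch of the idea, not HAIN's PESO program:

```python
import numpy as np

def reduced_norms(A):
    """Gram-Schmidt 'reduced norms' of the columns of A.
    (Sketch of the idea behind PESO; column pivoting is omitted here.)"""
    A = A.astype(float).copy()
    n = A.shape[1]
    norms = []
    for j in range(n):
        for i in range(j):
            # Remove the projection onto the already orthonormalized columns.
            A[:, j] -= (A[:, i] @ A[:, j]) * A[:, i]
        nj = np.linalg.norm(A[:, j])   # the reduced norm of column j
        norms.append(nj)
        A[:, j] /= nj                  # orthonormalize for the next steps
    return np.array(norms)

# Made-up example matrix:
R = np.array([[1.0, 0.9, 0.3],
              [0.9, 1.0, 0.2],
              [0.3, 0.2, 1.0]])
rn = reduced_norms(R)
# The product of the reduced norms equals |det R|:
print(np.isclose(np.prod(rn), abs(np.linalg.det(R))))
```

This is the QR factorization in disguise: the reduced norms are the diagonal of the triangular factor, whose product is |det R|.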
C_Norm =: smallest PESO norm of the CHOLESKY matrix (HAIN 1994). The C_Norm represents the smallest reduced norm of the CHOLESKY matrix. The importance of the CHOLESKY decomposition rests, among other things, on its isometry to the raw scores. The smallest C_Norm indicates the smallest angle present among the centered, standardized raw scores. The square of the smallest C_Norm is greater than or equal to the smallest K_Norm. As an empirical rule of thumb: (C_Norm^2)/(2...5) ~ K_Norm. Furthermore: r(multiple) = SQR(1 - C_Norm^2). Thus multiple correlation coefficients can be determined directly from the C_Norm. The relation or collinearity can be expressed by the smallest CHOLESKY norm, and hence also by the well-known and usual multiple correlation coefficient. A C_Norm < .31 can serve as a critical boundary at which collinearity starts to become more significant.
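The formula r(multiple) = SQR(1 - C_Norm^2) can be checked for the last variable of a small made-up matrix, since the last Cholesky diagonal element d of a correlation matrix satisfies d^2 = 1 - r(multiple)^2:

```python
import numpy as np

# Made-up positive definite correlation matrix:
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

L = np.linalg.cholesky(R)
d_last = L[-1, -1]                      # Cholesky diagonal of the last variable
r_mult = np.sqrt(1 - d_last**2)         # r(3.rest) = sqrt(1 - d^2)

# Cross-check via the inverse: r(i.rest) = sqrt(1 - 1/inv[i, i]).
r_check = np.sqrt(1 - 1 / np.linalg.inv(R)[-1, -1])
print(np.isclose(r_mult, r_check))
```

The Cholesky diagonal element is the residual standard deviation of a variable given its predecessors, which is why a small C_Norm translates directly into a large multiple correlation.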
Eigenvalues =: the practically most important and most useful criterion for collinearity in correlation matrices is eigenvalues close to 0 (< .10).
Table 1: Original Correlation Matrix Spearman & Hart 1913
Original input data with 2-digit accuracy and read with 2-digit accuracy (for control, here the analysed original matrix):
       1    2    3    4    5    6    7    8    9   10   11   12   13
 1     1  .77  .67  .6   .69  .57  .57  .5   .52  .48  .38  .2   .16
 2   .77    1  .74  .61  .66  .59  .53  .29  .52  .16  .62  .31  .07
 3   .67  .74    1  .52  .72  .45  .61  .34  .52  .14  .22  .19  .23
 4   .6   .61  .52    1  .44  .76  .47  .67  .4   .29  .13  .57 -.13
 5   .69  .66  .72  .44    1  .51  .65  .4   .34  .47  .23  .19  .01
 6   .57  .59  .45  .76  .51    1  .41  .45  .47  .25  .03  .26  .11
 7   .57  .53  .61  .47  .65  .41    1  .45  .47  .08  .26 -.05  .22
 8   .5   .29  .34  .67  .4   .45  .45    1  .34  .16  .08  .05 -.05
 9   .52  .52  .52  .4   .34  .47  .47  .34    1 -.07 -.01  .01 -.13
10   .48  .16  .14  .29  .47  .25  .08  .16 -.07    1  .26  .06  .19
11   .38  .62  .22  .13  .23  .03  .26  .08 -.01  .26    1  .16  .29
12   .2   .31  .19  .57  .19  .26 -.05  .05  .01  .06  .16    1  .05
13   .16  .07  .23 -.13  .01  .11  .22 -.05 -.13  .19  .29  .05    1
Table 2: Matrix Analysis Criteria

Or  MD  NumS  Condit  Determ        HaInRatio  R_OutIn  K_Norm   C_Norm
13  -1  --1   733.3   -.0000167538  2.21D-12   394.2    5D-3(1)  -1(-1)
Highest inverse negative diagonal value = -.021075044
thus multiple r(5.rest) = 6.960566542 (!)
and there are 2 multiple r > 1 (!)
 i   Eigenvalue  Cholesky
 1.  5.63691     1
 2.  1.61837     .638
 3.  1.33718     .654
 4.  1.09919     .7636
 5.  .87991      .6319
 6.  .75463      .6054
 7.  .60908      .712
 8.  .41475      .6371
 9.  .32742      .7243
10.  .2103       .5212
11.  .18194      -.1899
12.  7.69D-3     -.2362
13.  -.07735     -.2838
The matrix is not positive definite. The Cholesky decomposition is not successful (for detailed information, Cholesky's diagonal values are presented).
Detailed standard matrix analysis of the Spearman & Hart correlation matrix (1913).
3. Discussion according to criteria of the original matrix
The negative determinant indicates an indefinite and badly derailed matrix. The condition number - largest : smallest eigenvalue - shows, at 733, a high value. The LES analysis, rounding up and down in the third digit after the decimal point, shows an input-output ratio of 394, i.e. a change of the input leads to a change of the output by a factor of 394. The HADAMARD number of the inverse, at 2.21*10^-12, indicates a very small ratio. The negative eigenvalue, at -.07735, is very large and at first sight leaves little hope for therapy. However, to my surprise it turned out that a "centroid therapy" according to the centroid method by THURSTONE was successful despite the large negative eigenvalue. Probably the indefiniteness is not due to pure collinearity; rather, it has to be suspected that the "correlation matrix" is damaged in multiple ways: missing data, averaging procedures, "correction for attenuation", perhaps no product-moment coefficients? The highest multiple correlation coefficient, r5.rest = 6.96 (!), is completely derailed - a consequence of the loss of positive definiteness.
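How an indefinite matrix "derails" the multiple correlation formula can be reproduced on a small made-up pseudo-correlation matrix (not the Spearman & Hart data); the standard formula is r(i.rest) = sqrt(1 - 1/inv[i,i]), and a negative inverse diagonal value drives the result above 1:

```python
import numpy as np

# Made-up indefinite "correlation" matrix:
R = np.array([[1.0,  0.9,  0.9],
              [0.9,  1.0, -0.9],
              [0.9, -0.9,  1.0]])
print(np.linalg.eigvalsh(R).min() < 0)   # indefinite: a negative eigenvalue

inv = np.linalg.inv(R)
print(inv[0, 0] < 0)                     # negative inverse diagonal value
# The usual formula r(i.rest) = sqrt(1 - 1/inv[i, i]) now derails:
r1 = np.sqrt(1 - 1 / inv[0, 0])
print(r1 > 1)                            # an impossible "multiple correlation" > 1
```

This is exactly the mechanism behind the r5.rest = 6.96 above: the formula itself is sound, but the indefinite matrix feeds it a negative residual variance.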
4. Which variables are responsible for the collinearity?
The position of an eigenvalue does not allow any conclusion as to which variable constitutes the collinearity. This can easily be checked by exchanging rows and columns of the correlation matrix and observing that the eigenvalues stay the same. The PESO analysis (Pivotized Erhard Schmidt Orthonormalization) developed by Dr. HAIN (1994), however, does provide this information.
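That the eigenvalues are invariant under a simultaneous reordering of rows and columns is easy to verify (toy matrix, not from the paper):

```python
import numpy as np

R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
p = [2, 0, 1]                         # any reordering of the variables
Rp = R[np.ix_(p, p)]                  # permute rows and columns together
print(np.allclose(np.sort(np.linalg.eigvalsh(R)),
                  np.sort(np.linalg.eigvalsh(Rp))))
```

Since P R P^T is a similarity transformation for a permutation matrix P, the spectrum cannot carry information about which variable is involved; a column-pivoting method like PESO can.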
Table 3: PESO Analysis of the Correlation Matrix

Var  RN (Reduced Norm)      ON (Original Norm)     Ratio RN/ON
  1  2.1186080323290788     2.1186080323290788     1.0000
 13  1.0830630275311198     1.1414902533297001     0.9488
 12  1.0688990982453775     1.2796874610809581     0.8352
 10  .89566410275237469     1.3368993970637328     0.6699
 11  .87263435136430641     1.3781509343086692     0.6331
  8  .72829442742515084     1.6174671548075548     0.4502
  6  .5198933852503035      1.8568252464596398     0.2799
  5  .50466376705114477     1.9727899014746841     0.2558
  9  .38749960154371265     1.6328502677286861     0.2373
  7  .33804356960980048     1.8269373267907007     0.1850
  3  .27300931252094519     1.9755758642348769     0.1381
  4  .11747153111497607     2.0115416961068529     0.0583
  2  .010887210195374477    2.1077713335900289     0.0051

Products:  1.6753770848438221D-5   844.16208542332548   1.9846627961307499D-8
Remark: As can be seen, the product of the reduced norms equals the absolute value of the determinant. The product of the ratios equals the HADAMARD condition number (not mentioned above). Regarding the chosen boundary in PESO (equivalent to >= a multiple correlation coefficient of .99499), PESO finds "relations" (the term HAIN uses for near-collinearities), here one of them:
Table 4: Relation

Var  Coefficient        Var  Coefficient
  1  -0.3146527856        8   0.1316158518
  2   1.0000000000        9   0.0183634921
  3  -0.4556784876       10   0.1700240126
  4   0.1663107489       11  -0.5511173581
  5   0.0554053195       12  -0.0898030079
  6  -0.4653680932       13   0.3019404255
  7  -0.0053392459
Practical proof: Multiplying the original matrix by this vector results, up to the chosen boundary, in almost-zero. The absolute values of the relation coefficients indicate something about the contribution of the respective variable to the linear dependency. One could interpret the smallest absolute values as suggestions for elimination. Eliminating variables 7, 9 and 12, the following matrix results:
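This "practical proof" can be mimicked with a toy data set in which one variable is an exact sum of two others; the relation vector (here taken as the eigenvector of the near-zero eigenvalue, a stand-in for the PESO relation) multiplied into R gives almost-zero:

```python
import numpy as np

# Toy data where x3 = x1 + x2 exactly, so the correlation matrix
# carries an exact linear relation.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
x2 = rng.standard_normal(200)
X = np.column_stack([x1, x2, x1 + x2])
R = np.corrcoef(X, rowvar=False)

# The relation shows up as an eigenvector with eigenvalue ~0;
# multiplying R by it gives (almost) the zero vector:
w, V = np.linalg.eigh(R)
v = V[:, 0]                        # eigenvector of the smallest eigenvalue
print(np.allclose(R @ v, 0, atol=1e-8))
```

The size of each component of v plays the same diagnostic role as the Table 4 coefficients: it measures the variable's share in the linear dependency.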
Table 5: Reduced Matrix (variables 7, 9, 12 eliminated)

       1    2    3    4    5    6    8   10   11   13   Multiple correlation
 1     1  .77  .67  .6   .69  .57  .5   .48  .38  .16   r1.rest  = .99353
 2   .77    1  .74  .61  .66  .59  .29  .16  .62  .07   r2.rest  = .99906
 3   .67  .74    1  .52  .72  .45  .34  .14  .22  .23   r3.rest  = .97033
 4   .6   .61  .52    1  .44  .76  .67  .29  .13 -.13   r4.rest  = .992
 5   .69  .66  .72  .44    1  .51  .4   .47  .23  .01   r5.rest  = .98908
 6   .57  .59  .45  .76  .51    1  .45  .25  .03  .11   r6.rest  = .9774
 8   .5   .29  .34  .67  .4   .45    1  .16  .08 -.05   r8.rest  = .99137
10   .48  .16  .14  .29  .47  .25  .16    1  .26  .19   r10.rest = .99288
11   .38  .62  .22  .13  .23  .03  .08  .26    1  .29   r11.rest = .99609
13   .16  .07  .23 -.13  .01  .11 -.05  .19  .29    1   r13.rest = .93259
Result: The elimination of 7, 9 and 12 yields a matrix that is just barely positive definite again, but still contains high collinearity. As can be seen, 6 out of the 10 multiple correlation coefficients are already > .99. Clearly this matrix has to be ill-conditioned, as the matrix analysis also shows:
Table 6: Matrix Analysis Criteria of the Reduced Matrix

Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm   C_Norm
10  -1  -     4645    .000005694  1.28D-18   330.4    1D-3(1)  .043(1)
We now try to "cure" the indefinite matrix using the centroid method by THURSTONE, i.e. to eliminate the negative eigenvalues, and to check afterwards which collinearities remain.
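For comparison, a much simpler cure, clipping negative eigenvalues to zero and rescaling to unit diagonal (in the spirit of least-squares approaches such as KNOL & TEN BERGE's, though not their exact algorithm), can be sketched as follows; the matrix shown is a made-up indefinite example, not the Spearman & Hart matrix:

```python
import numpy as np

def clip_cure(R, eps=0.0):
    # Replace negative eigenvalues by eps, rebuild, and rescale to unit diagonal.
    w, V = np.linalg.eigh(R)
    w = np.maximum(w, eps)
    R2 = V @ np.diag(w) @ V.T
    d = np.sqrt(np.diag(R2))
    R2 = R2 / np.outer(d, d)          # congruence: preserves positive semidefiniteness
    np.fill_diagonal(R2, 1.0)
    return R2

# Made-up indefinite example:
R = np.array([[1.0,  0.95, -0.95],
              [0.95, 1.0,   0.95],
              [-0.95, 0.95, 1.0]])
print(np.linalg.eigvalsh(R).min() < 0)       # indefinite before the cure
R2 = clip_cure(R)
print(np.linalg.eigvalsh(R2).min() >= -1e-10)  # positive semidefinite afterwards
```

Like the centroid therapy below, any such cure changes the off-diagonal values, so the cured matrix is an approximation, not the original data.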
5. "Centroid Therapy" according to THURSTONE
The method of principal components is not applicable with negative eigenvalues, since in the reals no square roots of negative values can be obtained. Thus the centroid factor analysis by THURSTONE is carried out using the complete number of variables, in this case 13, and from the 13 factors the correlation matrix is calculated back with the main diagonal elements set to 1. Naturally this is only possible if the residuals, and of course also the main diagonal elements that are essential for a correlation matrix, have small values. Let us look at this first:
Table 7: 13. Residual Matrix

-.0016  .0114 -.0200 -.0169  .0098 -.0123  .0093  .0171  .0092 -.0069 -.0159  .0047  .0133
 .0114 -.0020  .0341 -.0002 -.0168  .0206 -.0246 -.0054  .0014 -.0275  .0087  .0055 -.0326
-.0200  .0341 -.0035  .0087 -.0045 -.0437 -.0084 -.0210  .0138 -.0304  .0059 -.0192  .0395
-.0169 -.0002  .0087 -.0174 -.0443  .0384  .0032  .0533 -.0030  .0549 -.0327 -.0052 -.0719
 .0098 -.0168 -.0045 -.0443 -.0154  .0201  .0242  .0347 -.0133  .0233  .0198  .0309 -.0278
-.0123  .0206 -.0437  .0384  .0201 -.0038 -.0418 -.0134  .0383 -.0100 -.0363 -.0183 -.0070
 .0093 -.0246 -.0084  .0032  .0242 -.0418 -.0042 -.0264 -.0029 -.0263  .0209 -.0310  .0418
 .0171 -.0054 -.0210  .0533  .0347 -.0134 -.0264 -.0163 -.0007 -.0018  .0047 -.0280  .0180
 .0092  .0014  .0138 -.0030 -.0133  .0383 -.0029 -.0007 -.0021 -.0079 -.0013  .0099 -.0143
-.0069 -.0275 -.0304  .0549  .0233 -.0100 -.0263 -.0018 -.0079 -.0103  .0254 -.0347  .0155
-.0159  .0087  .0059 -.0327  .0198 -.0363  .0209  .0047 -.0013  .0254 -.0083  .0091  .0347
 .0047  .0055 -.0192 -.0052  .0309 -.0183 -.0310 -.0280  .0099 -.0347  .0091 -.0076  .0242
 .0133 -.0326  .0395 -.0719 -.0278 -.0070  .0418  .0180 -.0143  .0155  .0347  .0242 -.0132
Now we perform a standard matrix analysis and discover, to our complete surprise, that the "centroid therapy" is successful despite the large negative eigenvalue of ~-.07 in the original matrix: the matrix has at least regained its positive definiteness. It is still ill-conditioned, but no longer as badly as before. The decisive advantage now, however, is gaining a clear picture of the original collinearity structure.
Table 8: Centroid-Cured Matrix
Original input data with 2-digit accuracy and read with 2-digit accuracy (for control, here the analyzed matrix):

       1    2    3    4    5    6    7    8    9   10   11   12   13
 1     1  .76  .69  .62  .68  .58  .56  .48  .51  .49  .4   .2   .15
 2   .76    1  .71  .61  .68  .57  .55  .3   .52  .19  .61  .3   .1
 3   .69  .71    1  .51  .72  .49  .62  .36  .51  .17  .21  .21  .19
 4   .62  .61  .51    1  .48  .72  .47  .62  .4   .24  .16  .58 -.06
 5   .68  .68  .72  .48    1  .49  .63  .37  .35  .45  .21  .16  .04
 6   .58  .57  .49  .72  .49    1  .45  .46  .43  .26  .07  .28  .12
 7   .56  .55  .62  .47  .63  .45    1  .48  .47  .11  .24 -.02  .18
 8   .48  .3   .36  .62  .37  .46  .48    1  .34  .16  .08  .08 -.07
 9   .51  .52  .51  .4   .35  .43  .47  .34    1 -.06 -.01  0   -.12
10   .49  .19  .17  .24  .45  .26  .11  .16 -.06    1  .23  .09  .17
11   .4   .61  .21  .16  .21  .07  .24  .08 -.01  .23    1  .15  .26
12   .2   .3   .21  .58  .16  .28 -.02  .08  0    .09  .15    1  .03
13   .15  .1   .19 -.06  .04  .12  .18 -.07 -.12  .17  .26  .03    1
Table 9: Centroid-Cured Matrix Analysis Criteria

Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
13  -1  -     148.8   .00007525  0.0000007  52       .026(0)  .266(0)
 i   Eigenvalue  Cholesky
 1.  5.64638     1
 2.  1.55102     .6499
 3.  1.3104      .6651
 4.  1.04562     .7543
 5.  .87813      .634
 6.  .76073      .6654
 7.  .58635      .7248
 8.  .41204      .7076
 9.  .31787      .7784
10.  .18913      .6723
11.  .15321      .4617
12.  .11117      .6337
13.  .03795      .8032
Cholesky decomposition successful, thus the matrix is (semi) positive definite.
Table 10: Multiple Correlations of the Centroid-Cured Matrix

r1.rest = .89232    r5.rest = .91033    r9.rest  = .77994
r2.rest = .96400    r6.rest = .81196    r10.rest = .80758
r3.rest = .84951    r7.rest = .82456    r11.rest = .89483
r4.rest = .93215    r8.rest = .79282    r12.rest = .78257
                                        r13.rest = .59575
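As a check, the definiteness and the multiple correlations can be recomputed from the cured matrix as printed in Table 8 (our own computation on the 2-digit values, so the last digits may differ slightly from Table 10, which was presumably computed at higher internal precision):

```python
import numpy as np

# The centroid-cured matrix with 2-digit accuracy (Table 8).
R = np.array([
    [1,   .76, .69, .62, .68, .58, .56, .48, .51, .49, .4,  .2,  .15],
    [.76, 1,   .71, .61, .68, .57, .55, .3,  .52, .19, .61, .3,  .1 ],
    [.69, .71, 1,   .51, .72, .49, .62, .36, .51, .17, .21, .21, .19],
    [.62, .61, .51, 1,   .48, .72, .47, .62, .4,  .24, .16, .58, -.06],
    [.68, .68, .72, .48, 1,   .49, .63, .37, .35, .45, .21, .16, .04],
    [.58, .57, .49, .72, .49, 1,   .45, .46, .43, .26, .07, .28, .12],
    [.56, .55, .62, .47, .63, .45, 1,   .48, .47, .11, .24, -.02, .18],
    [.48, .3,  .36, .62, .37, .46, .48, 1,   .34, .16, .08, .08, -.07],
    [.51, .52, .51, .4,  .35, .43, .47, .34, 1,  -.06, -.01, 0,  -.12],
    [.49, .19, .17, .24, .45, .26, .11, .16, -.06, 1,  .23, .09, .17],
    [.4,  .61, .21, .16, .21, .07, .24, .08, -.01, .23, 1,  .15, .26],
    [.2,  .3,  .21, .58, .16, .28, -.02, .08, 0,  .09, .15, 1,   .03],
    [.15, .1,  .19, -.06, .04, .12, .18, -.07, -.12, .17, .26, .03, 1],
])

# Positive definite, and all r(i.rest) = sqrt(1 - 1/inv[i,i]) back below 1:
print(np.linalg.eigvalsh(R).min() > 0)
r_mult = np.sqrt(1 - 1 / np.diag(np.linalg.inv(R)))
print((r_mult < 1).all())
```

In contrast to the original matrix, every multiple correlation is now a legitimate value in [0, 1).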
6. Results of the Centroid Therapy
The analysis of the multiple correlation coefficients of the "centroid-cured" matrix indicates very clearly that the strong collinearity structures of the original indefinite matrix have disappeared. This means, without doubt, that the collinearity-structure hypotheses put forward above cannot be confirmed. This underlines the importance of therapy methods, and how much indefinite correlation matrices may simulate relations that do not exist at all. The importance of the centroid factor analysis by THURSTONE as a method of therapy for indefinite correlation matrices could be shown here: possibly it is not (always) as good as the method by KNOL & TEN BERGE, but it is very simple and quick.