Rudolf Sponsel, translated by Dipl. Psych. Agnes Mehl
Survey: __ 0. Summary __ 1. Ill-conditioned correlation matrices __ 2. The two natures of collinearity: constructive and destructive implications __ 3. Analysis of 38 Spearman matrices __ 4. Literature __
0. Summary
We briefly introduce the problem of ill-conditioned correlation matrices and collinearity, together with its destructive and constructive aspects. Some important criteria for the analysis of numeric stability and instability are explained. We then report on the correlation matrices of Spearman (and co-authors Hart and Holzinger): altogether 38 correlation matrices were included, 30 of them genuinely different ones (order 5-14, thus rather small matrices). 8 of the 30 (27%) contain misprints in the upper and lower triangle. Of the 30 genuinely different matrices, 7 (23%) are indefinite with 1-3 negative eigenvalues clearly different from 0, 3 (10%) are clearly ill-conditioned, 3 (10%) are borderline, and 17 (57%) are numerically stable. In at least 4 of the 30 cases (13%) "corrections for attenuation" were recorded, and in at least 3 cases (10%) a pooling of correlation coefficients, which in part is not disclosed. The error rate is at odds with SPEARMAN's otherwise precise and systematic style. Compared to the 17.9% rate of indefinite matrices in our main study (Sponsel 1994), Spearman's rate of 23% is clearly higher.
1. Ill-Conditioned Matrices
Systems are called ill-conditioned if small changes in the input data lead to large changes in the output data. Linear equation systems and correlation matrices are, by their very nature, often ill-conditioned, i.e. they are mostly very unstable. In practice this means that the coefficients can no longer be trusted, even before any significance testing. In a historical analysis of 769 correlation matrices from 1910 to 1993, Sponsel (1994) found 47.5% numerically unstable correlation matrices; 17.9% of them were indefinite, thus having lost their positive definiteness and producing mathematically absurd results such as multiple and partial correlation coefficients larger than one.
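As a minimal numeric sketch of ill-conditioning (an invented 2x2 illustration, not from the original analysis): changing a single coefficient of a nearly singular correlation matrix by one unit in the third decimal roughly halves the solution of the associated linear equation system.

```python
import numpy as np

# Two correlation matrices differing by .001 in one coefficient
R1 = np.array([[1.0, 0.999],
               [0.999, 1.0]])
R2 = np.array([[1.0, 0.998],
               [0.998, 1.0]])
b = np.array([1.0, 0.0])

x1 = np.linalg.solve(R1, b)   # approx [ 500.25, -499.75]
x2 = np.linalg.solve(R2, b)   # approx [ 250.25, -249.75]
# An input change of .001 changed the output by about 250 units.
```

The closer the off-diagonal coefficient gets to 1, the more violent the reaction of the output to tiny input changes.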
Example: multiple correlation coefficients of the matrix by Spearman and Hart 1913

r1.rest = .9495       r5.rest = 6.9606       r9.rest  = .7154       r13.rest = .9175
r2.rest = .9922       r6.rest = .9665        r10.rest = .757
r3.rest = .9669       r7.rest = imaginary    r11.rest = .9755
r4.rest = 1.3566      r8.rest = .7307        r12.rest = imaginary
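How such derailed values arise can be reproduced with the standard identity r_i.rest^2 = 1 - 1/(R^-1)_ii. The 3x3 matrix below is an invented illustration of an indefinite case, not the Spearman-Hart matrix itself:

```python
import numpy as np

def multiple_correlations(R):
    """r_i.rest = sqrt(1 - 1/(R^-1)_ii) for every variable i."""
    d = np.diag(np.linalg.inv(R))
    # complex result signals an 'imaginary' multiple correlation
    return np.sqrt((1.0 - 1.0 / d).astype(complex))

# A positive definite example: all r_i.rest lie below 1
ok = np.array([[1.0, .5, .5], [.5, 1.0, .5], [.5, .5, 1.0]])

# An indefinite "correlation" matrix (one negative eigenvalue):
# its multiple correlations come out larger than 1 - mathematically absurd
bad = np.array([[1.0, .9, .2], [.9, 1.0, .9], [.2, .9, 1.0]])

print(multiple_correlations(ok).real)   # all about .577
print(multiple_correlations(bad).real)  # first entry larger than 1
```

The derailment is not a computational error: once the matrix has lost positive definiteness, the identity itself produces values outside [0, 1].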
For numeric stability there are a number of criteria. Among the most effective are: the smallest eigenvalue; the various condition measures (largest absolute eigenvalue : smallest absolute eigenvalue, and the HADAMARD condition, best applied to the inverse (FADDEJEW & FADDEJEWA 1973)); and the reduced norms according to the Pivotized Erhard Schmidt Orthonormalization (PESO analysis by HAIN 1994).
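The first two criteria can be sketched directly; for the Hadamard ratio of the inverse, the computation below is one plausible reading (determinant relative to the product of the Euclidean row norms), assumed rather than taken from this text:

```python
import numpy as np

def stability_criteria(R):
    ev = np.linalg.eigvalsh(R)                    # eigenvalues, ascending
    smallest = ev[0]                              # smallest eigenvalue
    condit = np.abs(ev).max() / np.abs(ev).min()  # largest/smallest |eigenvalue|
    Rinv = np.linalg.inv(R)
    # Hadamard ratio of the inverse: determinant relative to its theoretical
    # maximum, the product of the row norms (assumed reading of HaInRatio)
    ha_in = np.linalg.det(Rinv) / np.prod(np.linalg.norm(Rinv, axis=1))
    return smallest, condit, ha_in

# A nearly collinear 2x2 matrix already shows a large condition number
print(stability_criteria(np.array([[1.0, .99], [.99, 1.0]])))
```

For the identity matrix all three values equal 1; the more collinear the matrix, the smaller the first and third values and the larger the second.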
2. The two natures of collinearity: constructive and destructive implications
Correlation matrices reach their maximum numeric instability when the matrix is singular. Then the determinant is 0 and at least one eigenvalue also equals zero. Such a matrix contains at least one collinearity. Looked at differently: the matrix contains mathematically redundant information, i.e. a functional relation, and at least one variable is redundant. From the standpoint of the theory of science, a collinearity implies a law. This reflection unfortunately never moved into the centre of research interest, owing to the predominance of factor analysis. If at least one eigenvalue is close to 0, one gets at least one near-collinearity.

In empirical and numerical practice, exactly singular or collinear correlation matrices hardly ever occur; one always finds only approximate singularity or near-collinearity. Product-moment (Pearson) correlation matrices have to be positive definite (HAIN 1994). However, since real numeric calculation involves rounding after only a few digits, collinearity in combination with rounding errors may lead to the loss of positive definiteness of a correlation matrix. This is a simple case and easy to "cure". One recognizes the "simplicity" by the negative eigenvalues being very small (rule of thumb: 3rd digit after the decimal point).

The situation is more difficult if major methodological mistakes were made, e.g. computing tetrachoric instead of Pearson coefficients (thus possibly violating the condition of normal distribution, as with THURSTONE's "Primaries..." 1938), coefficients being "corrected for attenuation" or treated with some other strange correction formula, or coefficients based on different sample sizes, as in meta-analyses or with the faulty missing-data solution of pairwise elimination. One recognizes the problem case by the size of the negative eigenvalue (rule of thumb: eigenvalue >= 2nd digit after the decimal point), whereby completely derailed multiple and partial correlation coefficients can occur. Such correlation matrices can no longer be interpreted.
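The loss of definiteness through rounding alone is easy to reproduce (an invented minimal example, not one of Spearman's matrices): let x3 = x1 + x2 with r12 = .5, so the true 3x3 correlation matrix is exactly singular; reporting its coefficients with only two digits already produces a negative eigenvalue.

```python
import numpy as np

r = np.sqrt(0.75)                 # r13 = r23 when x3 = x1 + x2 and r12 = .5
exact = np.array([[1.0, 0.5, r],
                  [0.5, 1.0, r],
                  [r,   r,   1.0]])
rounded = np.round(exact, 2)      # coefficients as they would be published

print(np.linalg.eigvalsh(exact).min())    # ~0: the matrix is singular
print(np.linalg.eigvalsh(rounded).min())  # negative: indefinite after rounding
```

The negative eigenvalue here is small (third decimal), matching the "simple, easy to cure" case described above.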
3. Analysis and Report of 38 Spearman matrices
3.1. Explanation of the abbreviations of the evaluation
Samp_Or_MD_NumS_Condit_Determ_HaInRatio_R_OutIn_K_Norm_C_Norm
Samp =: sample size
MD =: missing data information; -1 =: unknown
NumS =: rating (and, where applicable, the number of negative eigenvalues). So far the following rule-of-thumb ratings exist:
   +    numerically stable
   +?   borderline, tending to be numerically stable
   -?   borderline, tending to be numerically unstable
   -    numerically unstable
   --Z  indefinite with Z negative eigenvalues
Even if only one negative eigenvalue is present, the matrix is indefinite and derailments of any kind are possible. The matrix has turned "psychotic", so to speak: no value can be trusted any more, everything is possible. Such a state has to be avoided at all costs, or reversed, i.e. "treated" immediately, before any further calculation is done.
Condit =: largest absolute eigenvalue / smallest absolute eigenvalue. For order < 10, Condit > 30 indicates numeric instability; for order < 20, Condit > 50 does.
Determ =: determinant. The determinant represents the absolute value of the volume of the n-dimensional parallelotope (multidimensional object). The smaller it is, the smaller is the volume of the spanned space. A small volume can be caused by a single short vector, i.e. a small angle; this is the critical case. A small determinant, however, can also result quite "normally" from the "natural" calculation process without expressing numeric instability. A rating based on the absolute value of the determinant alone is therefore not reasonable.
HaInRatio =: HADAMARD number of the inverse. The HADAMARD number of the inverse indicates what ratio the actual determinant of the inverse bears to its theoretically maximum value, given the coefficient matrix. According to the rule of thumb of FADDEJEW & FADDEJEWA, an inverse determinant is considered small if its ratio is 1 : 50 000 or less, thus HaInRatio < .00002.
R_OutIn =: LES input-output ratio (SPONSEL 1994). The input-output ratio indicates by how much the output changes if the input is changed by one unit in the third digit after the decimal point. Theoretically the value ranges from 0 to ...
K_Norm =: smallest PESO norm of the correlation matrix (HAIN 1994). The smallest reduced norm ("shortest" norm, "flattest" angle) of the correlation matrix is a measure of the degree of collinearity: the smaller the K_Norm, the stronger the collinearity. The product of all reduced norms equals the absolute value of the determinant; thus a single small K_Norm is enough to bring the volume close to 0 (equivalent to the effect of small eigenvalues). PESO for the correlation matrix is set up so that for all K_Norms < 0.01 the number of relations (collinearities) is printed in brackets. The square root of the K_Norm gives an upper bound for the largest correlation coefficient.
C_Norm =: smallest PESO norm of the CHOLESKY matrix (HAIN 1994). The C_Norm is the smallest reduced norm of the CHOLESKY matrix. The importance of the CHOLESKY decomposition rests, among other things, on its isometry to the raw scores: the smallest C_Norm indicates the smallest angle among the centered, standardized raw scores. The square of the smallest C_Norm is greater than or equal to the smallest K_Norm. An empirical rule of thumb is (C_Norm^2)/(2...5) ~ K_Norm. Furthermore, r(multiple) = SQRT(1 - C_Norm^2) holds, so multiple correlation coefficients can be determined directly from the C_Norm. The relation, or collinearity, can thus be expressed by the smallest CHOLESKY norm and hence also by the well-known and usual multiple correlation coefficient. A C_Norm < .31 can serve as a critical boundary at which a collinearity starts to become more significant.
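The relation between the C_Norm and the multiple correlation can be sketched as follows (a hypothetical 3x3 matrix; note that the plain Cholesky diagonal relates each variable only to its predecessors, and PESO's pivoting strategy is not reproduced here):

```python
import numpy as np

R = np.array([[1.0, 0.6, 0.4],
              [0.6, 1.0, 0.5],
              [0.4, 0.5, 1.0]])

L = np.linalg.cholesky(R)         # exists only if R is positive definite
c_last = L[-1, -1]                # reduced norm of the last variable

# r(multiple) = SQRT(1 - C_Norm^2): multiple correlation of the last
# variable on all preceding ones
r_mult = np.sqrt(1.0 - c_last**2)

# Cross-check against the inverse-based identity r_i.rest^2 = 1 - 1/(R^-1)_ii
r_check = np.sqrt(1.0 - 1.0 / np.diag(np.linalg.inv(R))[-1])
print(r_mult, r_check)            # both about .515
```

This also shows why a very small C_Norm signals strong collinearity: the multiple correlation of the corresponding variable then approaches 1.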
3.2 Report Analysis Spearman's Correlation Matrices
SPEARMAN, C. (GB: University College London), HART,
B. (G) "GENERAL ABILITY, ITS EXISTENCE AND NATURE"
The British Journal of Psychology, V, 1912-1913
Detailed Collinearity Analysis And Therapy Of
The Indefinite Correlation-Matrix By SPEARMAN & HART (1913).
(G1) p.54, Table I
Samp  Or  MD  NumS  Condit  Determ        HaInRatio  R_OutIn  K_Norm   C_Norm
-1    13  -1  --1   733.3   -.0000167538  2.21D-12   394.2    5D-3(1)  -1(-1)
(G2) p.62, Table III "Coefficients of Bonser, boys and girls pooled together"
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
-1    5   -1  +     4.68    .40507469  .4136093   .8       .479(0)  .816(0)
SPEARMAN, C. (GB: University College London), HOLZINGER, K. (USA) "NOTE ON THE SAMPLING ERROR OF TETRAD DIFFERENCES", The British Journal of Psychology, 16, 1925/26, p.87, Table I (N=50) -> SPEARMAN, C. (A7)
(1) "'GENERAL INTELLIGENCE', OBJECTIVELY DETERMINED AND MEASURED", The American Journal of Psychology, 15, 1904, p.275.
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
-1    6   -1  +?    29.5    .0174353774  .0689510   3.7      .086(0)  .431(0)
(2) "THE THEORY OF TWO FACTORS" The Psychological Review 21, 1914,
(2a) p.102 Table I 'The SIMPSON-THORNDIKE Correlations ('Raw')'
above diagonal r11.8 = .34
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm   C_Norm
-1    14  -1  --2   485     .000001053  2.1D-9     326      .008(1)  -1(-1)
(2b) p.102 Table I 'The SIMPSON-THORNDIKE Correlations ('Raw')'.
below diagonal r8.11 = .54
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm   C_Norm
-1    14  -1  --2   1149    .000000470  6.87D-13   4479     3D-3(1)  -1(-1)
Remark on (3a,3b):
The correlations r5 and r8 of the crossing-out tests from Table I were pooled and combined as correlation r5 in Table III. In contrast to (A5), this unusual procedure was at least mentioned. Misprints in r13.7, r7.13, r5.9, r9.5. See (SIM) abilities according to (A4b).
(3a) p.112, Table III 'The SIMPSON-THORNDIKE Correlations After Pooling Together The Two Tests Of Cancellation' (above main diagonal r7.13=.27; r5(p.112) = (r5+r8)/2 (p.102)).
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm   C_Norm
-1    13  -1  --1   908.7   -.00000124  6.35D-11   1096.6   4D-3(1)  -1(-1)
(3b) p.112, Table III (below main diagonal r13.7=.29; r5(p.112) = (r5+r8)/2 (p.102)).
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
-1    13  -1  --1   963.6   -.000001251  3.43D-11   1324.9   3D-3(1)  -1(-1)
(S) "THE SUB-STRUCTURE OF THE MIND", The British Journal of Psychology, Vol.18, Part 3, 1928, N=40,
(S1) p.253: Table of correlations with n=40. Obtained by tossing
as described. The correlations are arranged in best 'hierarchical' order.
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm   C_Norm
40    10  -1  +     31.6    .001893539  .0087655   1        .138(0)  .563(0)
(S2) p.253: Table of inter-columnar correlations obtained
from the table preceding.
Samp  Or  MD  NumS  Condit  Determ        HaInRatio  R_OutIn  K_Norm   C_Norm
10    10  0   --1   532.1   -.0001934245  7.09D-13   5454.1   9D-3(1)  -1(-1)
"THE ABILITIES OF MAN - THEIR NATURE AND MEASUREMENT" AMS Press New York 1970 reprint 2.ed. 1932 (first 1926/27)
(A1) p.74
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
-1    5   -1  +     14      .1907942394  .1426199   2.6      .157(0)  .532(0)
(A2a) p.141 (Data from McDONNEL, Biometrika 1901, N=3000); print error r3.6, r6.3; SP141A7O.K07: r3.6=.353.
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm  C_Norm
3000  7   -1  +     33.3    .0121928243  .0254251   .8       .08(0)  .411(0)
(A2b) p.141 SP141A7U.K07 r6.3=.363.
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm  C_Norm
3000  7   -1  +     33.4    .0121471758  .0254271   .9       .08(0)  .411(0)
(A3) p.143 SP143A8.K08 (Data from GATES, A., Journ. Educ. Research, 1924, p.341; print error r8.6, r8.7).
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm  C_Norm
115   8   -1  --1   81.1    -.00236261  .0003480   66.4     .04(0)  -1(-1)
(A4a) p.144 r1.4=.579 (Data from DOLL, N=477).
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
477   6   -1  -?    40.2    .0123232675  .0199045   3.5      .077(0)  .419(0)
(A4b) p.144 r4.1=.580 (Data from DOLL, N=477)
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
477   6   -1  -?    40.3    .0123128172  .0198543   3.5      .077(0)  .419(0)
Remark referring to p.145: There are different sources and different versions of the SIMPSON-THORNDIKE correlation matrix:
a) SIMPSON-THORNDIKE original (not available to me)
b) SPEARMAN "Theory Of Two Factors", Psych. Rev. 21, 1914, p.102, Table I.
c) SPEARMAN "The Abilities Of Man", 1st ed. 1927 (p.145), according to -> PAWLIK 1968, p.106.
d) SPEARMAN "The Abilities Of Man", 2nd ed. 1932, reprinted 1970.
The correlation matrices b), c) and d) are different.
r8.11 (in c) =.34 see also (1),(2a), (3b)
r8.11 (in b) =.54
r8.11 (in d) =.34
Thus the numeric criteria values of (c,d) and (b) differ essentially:

In (c,d; r8.11 = .34):
   Ratio max. range out-/input  = 326
   Condition number HEVA/LEVA   = 485
In (b; r8.11 = .54):
   Ratio max. range out-/input  = 4479
   Condition number HEVA/LEVA   = 1149
(A5) p.147 (Data from BROWN, W. Brit.J.Psych.1910 p.309)
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
66    8   -1  +     7.7     .2088387535  .1645218   2.1      .348(0)  .738(0)
Remark on (A5)
SPEARMAN's statements (1932) on a correlation matrix by William BROWN do not correspond with BROWN's original statements. After longer reflection and analysis I found that SPEARMAN simply pooled some of BROWN's correlation coefficients without explaining this in any way. The reconstruction was further complicated by the arrangement. However, it is remarkable that the matrix is numerically stable despite the pooling.
(A6) p.147 (Data (N=757) from BONSER, Brit. J. Psych. 1912, p.62).
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
757   5   -1  +     4.68    .40507469  .4136093   .8       .479(0)  .816(0)
(A7) p.148 (Data (N=50) from HOLZINGER)
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
50    9   -1  +     17.56   .04321785  .0304050   .9       .239(0)  .667(0)
(A8a) p.149 (Data, N=149, from MAGSON Brit. J. Psych. Mon. Suppl.9,
1926), r1.7=.45, r2.5=.50.
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
149   7   -1  +     8.89    .10356768  .1976976   .6       .305(0)  .719(0)
(A8b) p.149 r7.1=.48, r5.2=.28.
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm  C_Norm
149   7   -1  +     9.23    .10814521  .1930725   .7       .29(0)  .704(0)
(A9) p.152 (Data from BALDWIN)
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
-1    6   -1  -     136.9   .0001661343  .0023895   5.2      .025(0)  .262(0)
(A10a) p.153 N=2599, r4.5=.331
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
2599  7   -1  +     9.3     .1108558579  .2207524   .5       .298(0)  .712(0)
(A10b) p.153 N=2599, r5.4=.337
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
2599  7   -1  +     9.3     .1106713125  .2213055   .5       .298(0)  .712(0)
(A11) p.156
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
-1    8   -1  -?    50.9    .00030370  .0211751   1.7      .066(0)  .409(0)
(A12) p.171
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm   C_Norm
-1    5   -1  -     157.8   .004854174  .0004100   175.1    .017(0)  .196(0)
(A13) p.218 "78 Normal Children (Corrected For Attenuation)"
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
78    12  -1  -     96.9    .00008493  .0000471   21.1     .037(0)  .322(0)
(A14) p.218 "22 Defective Children (Corrected For Attenuation)"
Samp  Or  MD  NumS  Condit  Determ        HaInRatio  R_OutIn  K_Norm   C_Norm
22    12  -1  --3   815.5   -.0000000036  2.85D-9    426.7    5D-3(1)  -1(-1)
(A15) Data, N=200, from COLLAR, Brit. J. Psych.
(A15a) r4.5=.517
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
200   6   -1  +     21.2    .0366516723  .0343203   1.7      .122(0)  .493(0)
(A15b) r5.4=.255
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
200   6   -1  +?    26.8    .0348545063  .0272194   5.8      .093(0)  .436(0)
(A16) p.296, 4.K04, N=77
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm  C_Norm
77    4   -1  +     5.9     .322311345  .3110141   .6       .38(0)  .753(0)
(A17) p.301 N=47
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
47    9   -1  +     11.3    .1337945815  .0357932   4.4      .295(0)  .695(0)
(A18) p.314
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
-1    4   -1  +     20.8    .1618624096  .0513551   5.8      .105(0)  .443(0)
(A19a) p.315 r3.6=-.10 r5.7= -.02
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
-1    7   -1  +     6.9     .3680140648  .0591439   28.5     .382(0)  .755(0)
(A19b) p.315 r6.3= .10 r7.5= -.32
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm  C_Norm
-1    7   -1  +     5.5     .3679044677  .1085426   21.4     .45(0)  .795(0)
(A20) p.325
Samp  Or  MD  NumS  Condit  Determ       HaInRatio  R_OutIn  K_Norm   C_Norm
80    8   -1  +     7.6     .2312178632  .1175382   2.1      .383(0)  .762(0)
(A21) p.346 (corrected for attenuation)
Samp  Or  MD  NumS  Condit  Determ      HaInRatio  R_OutIn  K_Norm   C_Norm
140   6   -1  --2   131.4   .002747316  .0003696   73       .024(0)  -1(-1)
(A22) p.347 "The following is the table for the students, after
correcting for attenuation and eliminating the influence on g (by Yule's
formula see p.156)":
Samp  Or  MD  NumS  Condit  Determ     HaInRatio  R_OutIn  K_Norm   C_Norm
-1    8   -1  --2   232.7   .00204695  .0000197   115.2    .016(0)  -1(-1)