Is Psychometric g a Myth?
Laskar Galileo Indonesia
As an online discussion about IQ or general intelligence grows longer, the
probability of someone linking to statistician Cosma Shalizi’s essay “g, a
Statistical Myth” approaches 1. Usually the link is accompanied by an assertion to the effect
that Shalizi offers a definitive refutation of the concept of general mental
ability, or psychometric g.
In this post, I will show that Shalizi’s case against g appears strong only
because he misstates several key facts and because he omits all the best
evidence that the other side has offered in support of g. His case
hinges on three clearly erroneous arguments on which I will concentrate.
Contents
I. Positive manifold
II. Shalizi’s first error
III. Shalizi’s second error
IV. Shalizi’s third error
V. Conclusions
References
I. Positive manifold
Shalizi writes that when all tests in a test battery are positively correlated with
each other, factor analysis will necessarily yield a general factor. He is
correct about this. All subtests of any given IQ battery are positively
correlated, and subjecting an IQ correlation matrix to factor analysis will
produce a first factor on which all subtests are positively loaded. For
example, the 29 subtests of the revised 1989 edition of the Woodcock-Johnson IQ
test are correlated in the following manner:
[Figure: correlation matrix of the 29 WJ-R subtests]
All the subtest intercorrelations are positive, ranging from a low of 0.046 (Memory
for Words – Visual Closure) to a high of 0.728 (Quantitative Concepts – Applied
Problems). (See Woodcock 1990 for a description of the tests.)
This is the reason why we talk about general intelligence or general cognitive
ability: individuals who get a high score on one cognitive test tend to do so
on all kinds of tests regardless of test content or type (e.g., verbal, numerical,
spatial, or memory tests), while those who do badly on one type of cognitive test
usually do badly on all tests.
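The claim that an all-positive correlation matrix necessarily yields a first factor with all-positive loadings can be checked numerically. The sketch below is only an illustration: it uses the first principal component as a stand-in for a proper factor analysis, and the 4-test matrix `R` is made up for the example, not taken from the WJ-R.

```python
import numpy as np

def first_component_loadings(corr):
    """Loadings of each variable on the first principal component of `corr`."""
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues come back in ascending order
    v = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    if v.sum() < 0:                          # the sign of an eigenvector is arbitrary
        v = -v
    return v * np.sqrt(eigvals[-1])

# Hypothetical correlation matrix with all-positive (some very small) entries.
R = np.array([
    [1.00, 0.05, 0.30, 0.20],
    [0.05, 1.00, 0.10, 0.40],
    [0.30, 0.10, 1.00, 0.15],
    [0.20, 0.40, 0.15, 1.00],
])

loadings = first_component_loadings(R)
print(loadings)  # every entry is positive
```

Every loading comes out positive even though some of the correlations are close to zero. This is the mathematical point Shalizi gets right; the dispute is over why the correlations are positive in the first place.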
This phenomenon of positive correlations among all tests, often called the “positive
manifold”, is routinely found among all collections of cognitive ability tests,
and it is one of the most replicated findings in the social and behavioral
sciences. The correlation between a given pair of ability tests is a function
of the shared common factor variance (g and other factors) and imperfect
test reliabilities (the higher the reliabilities, the higher the correlation).
All cognitive tests load on g to a smaller or greater degree, so all
tests covary at least through the g factor, if not other factors.
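To make the role of reliability concrete, here is a minimal sketch of the standard attenuation relationship for two tests that, by assumption, share nothing but g. The function name and the example numbers are hypothetical.

```python
import math

def observed_correlation(g_loading_x, g_loading_y, rel_x, rel_y):
    """Expected observed correlation between two tests that share only g.

    g_loading_*: loadings of the error-free (true) scores on g.
    rel_*: test reliabilities, between 0 and 1.
    """
    true_r = g_loading_x * g_loading_y        # correlation implied by the shared factor
    return true_r * math.sqrt(rel_x * rel_y)  # attenuation due to measurement error

# Same g loadings, different reliabilities (illustrative values only).
print(observed_correlation(0.8, 0.7, 0.9, 0.9))
print(observed_correlation(0.8, 0.7, 0.6, 0.6))
```

With the loadings held fixed, the less reliable pair of tests shows the lower observed correlation.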
John B. Carroll factor-analyzed the WJ-R matrix presented above, using confirmatory
analysis to successfully fit a ten-factor model (g and nine narrower
factors) to the data (Carroll 2003):
Loadings on the g factor range from a low of 0.279 (Visual Closure) to a high of
0.783 (Applied Problems). The g factor accounts for 59 percent of the
common factor variance, while the other nine factors together account for 41
percent. This is a routine finding in factor analyses of IQ tests: the g
factor explains more variance than the other factors put together. (Note that
in addition to the common factor variance, there is always some variance
specific to each subtest as well as variance due to random measurement error.)
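For readers unfamiliar with how such percentages are obtained, the following sketch shows one common way to compute each factor's share of the common variance, assuming an orthogonal solution in which every subtest loads on g and on narrower factors. The loading matrix is invented for illustration and has nothing to do with Carroll's actual WJ-R loadings.

```python
import numpy as np

# Hypothetical loadings: rows are subtests, column 0 is g,
# columns 1 and 2 are two narrower factors assumed orthogonal to g.
loadings = np.array([
    [0.70, 0.40, 0.00],
    [0.60, 0.50, 0.00],
    [0.50, 0.00, 0.45],
    [0.30, 0.00, 0.60],
])

def common_variance_shares(loading_matrix):
    """Each factor's share of the total common (factor-explained) variance."""
    sum_sq = (loading_matrix ** 2).sum(axis=0)  # sum of squared loadings per factor
    return sum_sq / sum_sq.sum()

shares = common_variance_shares(loadings)
print(shares)  # first entry is the share attributable to g
```

In this toy example g accounts for a little over half of the common variance, more than the two narrower factors combined.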
Against the backdrop of results like the above, Shalizi makes the following claims:
The correlations among the components in an intelligence test, and between tests
themselves, are all positive, because that’s how we design tests. […] So
making up tests so that they’re positively correlated and discovering they have
a dominant factor is just like putting together a list of big square numbers
and discovering that none of them is prime — it’s necessary side-effect of the
construction, nothing more.
[…]
What psychologists sometimes call the “positive manifold” condition is enough, in
and of itself, to guarantee that there will appear to be a general factor.
Since intelligence tests are made to correlate with each other, it follows trivially
that there must appear to be a general factor of intelligence. This is true
whether or not there really is a single variable which explains test scores or
not.
[…]
By this point, I’d guess it’s impossible for something to become accepted as an
“intelligence test” if it doesn’t correlate well with the Weschler [sic] and
its kin, no matter how much intelligence, in the ordinary sense, it requires,
but, as we saw with the first simulated factor analysis example, that makes it
inevitable that the leading factor fits well.
Shalizi’s thesis is that the positive manifold is an artifact of test construction and
that full-scale scores from different IQ batteries correlate only because they
are designed to do that. It follows from this argument that if a test maker
decided to disregard the g factor and construct a battery for assessing
several independent abilities, the result would be a test with many zero or
negative correlations among its subtests. Moreover, such a test would not
correlate highly with traditional tests, at least not positively. Shalizi
alleges that there are tests that measure intelligence “in the ordinary sense”
yet are uncorrelated with traditional tests, but unfortunately he does not
give any examples.
Inadvertent positive manifolds
There are in fact many cognitive test batteries designed without regard to g,
so we can put Shalizi’s allegations to the test. The Woodcock-Johnson test
discussed above is a case in point. Carroll, when reanalyzing data from the
test’s standardization sample, pointed out that its technical manual “reveals a
studious neglect of the role of any kind of general factor in the WJ-R.” This
dismissive stance towards g is also reflected in Richard Woodcock’s
article about the test’s theoretical background (Woodcock 1990). (Yes, the Woodcock-Johnson test
was developed by a guy named Dick Woodcock, together with his assistant
Johnson. You can’t make this up.) The WJ-R was developed based on the idea that
the g factor is a statistical artifact with no psychological relevance.
Nevertheless, all of its subtests are intercorrelated and, when factor
analyzed, it reveals a general factor that is no less prominent than those of
more traditional IQ tests. According to the WJ-R technical manual, test results
are to be interpreted at the level of nine broad abilities (such as Visual
Processing and Quantitative Ability), not any general ability. Similarly, the
manual reports factor analyses based only on the nine factors. But when Carroll
reanalyzed the data, allowing for loadings on a higher-order g factor in
addition to the nine factors, it turned out that most of the tests in the WJ-R
have their highest loadings on the g factor, not on the less general
(“broad”) factors which they were specifically designed to measure.
While the WJ-R is not meant to be a test of g, it does provide a measure of
“broad cognitive ability”, which correlates at 0.65 and 0.64 with the
Stanford-Binet and Wechsler full-scale scores, respectively (Kamphaus 2005, p. 335). Typically, correlations
between full-scale scores from different IQ tests are around 0.8. The WJ-R
broad cognitive ability scores are probably less g-loaded than those of
other tests, because they are based on unweighted sums of scores on subtests
selected solely on the basis of their content diversity; hence the lower
correlations, I believe. In any case, the WJ-R is certainly not uncorrelated
with traditional tests. (The WJ-III, which is the newest edition of the test,
now recognizes the g factor.)
The WJ-R serves as a forthright refutation of Shalizi’s claim that the positive
manifold and inter-battery correlations emerge by design rather than because
all cognitive abilities naturally intercorrelate. But perhaps the WJ-R is just
a giant fluke, or perhaps its 29 tests correlate as a carryover from the
previous edition of the test which had several of the same tests but was not
based on anti-g ideas. Are there other examples of psychometricians
accidentally creating strongly g-loaded tests against their best
intentions? In fact, there is a long history of such inadvertent confirmations
of the ubiquity of the g factor. This goes back at least to the 1930s
and Louis Thurstone’s research on “primary mental abilities”.
Thurstone and Guilford
In a famous study published in 1938, Thurstone, one of the great psychometricians,
claimed to have developed a test of seven independent mental abilities (verbal
comprehension, word fluency, number facility, spatial visualization,
associative memory, perceptual speed, and reasoning; see Thurstone 1938). However, the g men
quickly responded, with Charles Spearman and Hans Eysenck publishing papers (Spearman 1939, Eysenck 1939) showing that Thurstone’s
independent abilities were not independent, indicating that his data were
compatible with Spearman’s g model. (Later in his career, Thurstone came
to accept that perhaps intelligence could best be conceptualized as a hierarchy
topped by g.)
The idea of non-correlated abilities was taken to its extreme by J.P. Guilford who
postulated that there are as many as 160 different cognitive abilities. This
made him very popular among educationalists because his theory suggested that
everybody could be intelligent in some way. Guilford’s belief in a highly
multidimensional intelligence was influenced by his large-scale studies of
Southern California university students whose abilities were indeed not always
correlated. In 1964, he reported (Guilford 1964) that his research showed that up
to a fourth of correlations between diverse intelligence tests were not
different from zero. However, this conclusion was based on bad psychometrics. Alliger 1988 reanalyzed Guilford’s data and
showed that when you correct for artifacts such as range restriction (the
subjects were generally university students), the reported correlations are
uniformly positive.
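To give a sense of how much range restriction can shrink a correlation, here is a minimal sketch of one textbook correction, Thorndike’s Case II formula for direct range restriction. I am not claiming this is the exact procedure Alliger used; the function name and the input values are hypothetical.

```python
import math

def correct_range_restriction(r, sd_ratio):
    """Thorndike's Case II correction for direct range restriction.

    r: correlation observed in the restricted sample.
    sd_ratio: SD in the unrestricted population divided by SD in the
              restricted sample (1.0 means no restriction).
    """
    return r * sd_ratio / math.sqrt(1 + r * r * (sd_ratio ** 2 - 1))

# Illustrative values: a weak correlation in a sample whose spread of
# ability is two thirds of the population spread.
print(correct_range_restriction(0.20, 1.5))
```

Note that this formula only magnifies a correlation and cannot change its sign, so an observed correlation of exactly zero stays at zero. Corrections of this kind matter for small observed correlations that are depressed by a homogeneous sample.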
British Ability Scales
Psychometricians have not been discouraged by past failures to discover abilities that are
independent of the general factor. They keep constructing tests that supposedly
take the measurement of intelligence beyond g.
For example, the British Ability Scales was carefully developed in the 1970s and
1980s to measure a wide variety of cognitive abilities, but when the published
battery was analyzed (Elliott 1986), the results were quite
disappointing:
Considering the relatively large size of the test battery […] the solutions have
yielded perhaps a surprisingly small number of common factors. As would be
expected from any cognitive test battery, there is a substantial general
factor. After that, there does not seem to be much common variance left […]
What, then, are we to make of the results of these analyses? Do they mean that we are
back to square one, as it were, and that after 60 years of research we have
turned full circle and are back with the theories of Spearman? Certainly, for
this sample and range of cognitive measures, there is little evidence that
strong primary factors, such as those postulated by many test theorists over
the years, have accounted for any substantial proportion of the common variance
of the British Ability Scales. This is despite the fact that the scales sample
a wide range of psychological functions, and deliberately include tests with
purely verbal and purely visual tasks, tests of fluid and crystallized mental
abilities, tests of scholastic attainment, tests of complex mental functioning
such as in the reasoning scales and tests of lower order abilities as in the
Recall of Digits scale.
CAS
An even better example is the CAS battery. It is based on the PASS theory (which draws heavily on the ideas
of Soviet psychologist A.R. Luria, a favorite of Shalizi’s), which disavows g
and asserts that intelligence consists of four processes called Planning,
Attention-Arousal, Simultaneous, and Successive. The CAS was designed to assess
these four processes.
However, Keith et al. 2001 did a joint confirmatory factor
analysis of the CAS together with the WJ-III battery, concluding that not only
does the CAS not measure the constructs it was designed to measure, but that
notwithstanding the test makers’ aversion to g, the g factor
derived from the CAS is large and statistically indistinguishable from the g
factor of the WJ-III. The CAS therefore appears to be the opposite of what it
was supposed to be: an excellent test of the “non-existent” g and a poor
test of the supposedly real non-g abilities it was painstakingly
designed to measure.
Triarchic intelligence
A particularly amusing confirmation of the positive manifold resulted from Robert
Sternberg’s attempts at developing measures of non-g abilities.
Sternberg introduced his “triarchic” theory of intelligence in the 1980s and
has tirelessly promoted it ever since while at every turn denigrating the
proponents of g as troglodytes. He claims that g represents a
rather narrow domain of analytic or academic intelligence which is more or less
uncorrelated with the often much more important creative and practical forms of
intelligence. He created a test battery to test these different intellectual
domains. It turned out that the three “independent” abilities were highly
intercorrelated, which Sternberg absurdly put down to common-method
variance.
A reanalysis of Sternberg’s data by Nathan Brody (Brody 2003a) showed that not only were the
three abilities highly correlated with each other and with Raven’s IQ test, but
also that the abilities did not exhibit the postulated differential validities
(e.g., measures of creative ability and analytic ability were equally good
predictors of measures of creativity, and analytic ability was a better
predictor of practical outcomes than practical ability), and in general the
test had little predictive validity independently of g. (Sternberg, true
to his style, refused to admit that these results had any implications for the
validity of his triarchic theory, prompting the exasperated Brody to publish an
acerbic reply called “What Sternberg should have concluded” [Brody 2003b].)