Monday, December 15, 2014

Is Psychometric g a Myth?
by Agrend Wisnu Kusuma (wisnuxpert@yahoo.com)
Laskar Galileo Indonesia

As an online discussion about IQ or general intelligence grows longer, the probability of someone linking to statistician Cosma Shalizi’s essay g, a Statistical Myth approaches 1. Usually the link is accompanied by an assertion to the effect that Shalizi offers a definitive refutation of the concept of general mental ability, or psychometric g.
In this post, I will show that Shalizi’s case against g appears strong only because he misstates several key facts and because he omits all the best evidence that the other side has offered in support of g. His case hinges on three clearly erroneous arguments on which I will concentrate.
I. The positive manifold
Shalizi writes that when all tests in a test battery are positively correlated with each other, factor analysis will necessarily yield a general factor. He is correct about this. All subtests of any given IQ battery are positively correlated, and subjecting an IQ correlation matrix to factor analysis will produce a first factor on which all subtests are positively loaded. The 29 subtests of the revised 1989 edition of the Woodcock-Johnson IQ test (WJ-R) provide an example.
All the subtest intercorrelations are positive, ranging from a low of 0.046 (Memory for Words – Visual Closure) to a high of 0.728 (Quantitative Concepts – Applied Problems). (See Woodcock 1990 for a description of the tests.) This is the reason why we talk about general intelligence or general cognitive ability: individuals who get a high score on one cognitive test tend to do so on all kinds of tests regardless of test content or type (e.g., verbal, numerical, spatial, or memory tests), while those who do badly on one type of cognitive test usually do badly on all of them.
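To make the point concrete, here is a minimal Python sketch (the loadings are made up, not the WJ-R values): whenever a correlation matrix is all-positive, its first factor has uniformly positive loadings and accounts for a large share of the variance.

    # Minimal sketch with hypothetical loadings (not the WJ-R data): an all-positive
    # correlation matrix yields a first factor on which every test loads positively.
    import numpy as np

    rng = np.random.default_rng(0)
    loadings = rng.uniform(0.3, 0.8, size=8)        # hypothetical g loadings of 8 subtests
    R = np.outer(loadings, loadings)                 # implied correlations under a one-factor model
    np.fill_diagonal(R, 1.0)

    eigvals, eigvecs = np.linalg.eigh(R)             # eigenvalues in ascending order
    first_factor = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    first_factor *= np.sign(first_factor.sum())      # fix the arbitrary sign

    print(np.round(first_factor, 2))                 # all loadings positive
    print(f"first factor: {eigvals[-1] / R.trace():.0%} of total variance")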
This phenomenon of positive correlations among all tests, often called the “positive manifold”, is routinely found in all collections of cognitive ability tests, and it is one of the most replicated findings in the social and behavioral sciences. The correlation between a given pair of ability tests is a function of the shared common factor variance (g and other factors) and of the tests’ reliabilities (the higher the reliabilities, the higher the correlation). All cognitive tests load on g to a greater or lesser degree, so all tests covary at least through the g factor, if not through other factors as well.
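As a toy illustration of that relationship (all numbers hypothetical): under a single common factor model, the true-score correlation between two tests is the product of their g loadings, and unreliability shrinks the observed value further.

    # Toy numbers only. Under a one-factor model the correlation between the
    # error-free true scores is the product of the two tests' g loadings;
    # measurement error attenuates the observed correlation.
    g_x, g_y = 0.70, 0.60            # hypothetical g loadings of the true scores
    rel_x, rel_y = 0.85, 0.80        # hypothetical reliabilities

    r_true = g_x * g_y                                # 0.42
    r_observed = r_true * (rel_x * rel_y) ** 0.5      # about 0.35
    print(r_true, round(r_observed, 2))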
John B. Carroll factor-analyzed the WJ-R matrix described above, using confirmatory factor analysis to fit a ten-factor model (g plus nine narrower factors) to the data (Carroll 2003).
Loadings on the g factor range from a low of 0.279 (Visual Closure) to a high of 0.783 (Applied Problems). The g factor accounts for 59 percent of the common factor variance, while the other nine factors together account for 41 percent. This is a routine finding in factor analyses of IQ tests: the g factor explains more variance than the other factors put together. (Note that in addition to the common factor variance, there is always some variance specific to each subtest as well as variance due to random measurement error.)
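For readers who want to see where a figure like “59 percent of the common factor variance” comes from, here is a sketch with made-up loadings: the variance credited to each factor is simply the sum of its squared loadings.

    # Made-up loading matrix (not Carroll's solution): rows are subtests,
    # columns are [g, broad factor 1, broad factor 2].
    import numpy as np

    L = np.array([
        [0.75, 0.40, 0.00],
        [0.70, 0.35, 0.00],
        [0.55, 0.00, 0.45],
        [0.50, 0.00, 0.50],
    ])
    common_variance = (L ** 2).sum(axis=0)            # variance credited to each factor
    g_share = common_variance[0] / common_variance.sum()
    print(f"g accounts for {g_share:.0%} of the common factor variance")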
II. Shalizi’s first error
Against the backdrop of results like the above, Shalizi makes the following claims:
The correlations among the components in an intelligence test, and between tests themselves, are all positive, because that’s how we design tests. […] So making up tests so that they’re positively correlated and discovering they have a dominant factor is just like putting together a list of big square numbers and discovering that none of them is prime — it’s necessary side-effect of the construction, nothing more.
[…]
What psychologists sometimes call the “positive manifold” condition is enough, in and of itself, to guarantee that there will appear to be a general factor. Since intelligence tests are made to correlate with each other, it follows trivially that there must appear to be a general factor of intelligence. This is true whether or not there really is a single variable which explains test scores or not.
[…]
By this point, I’d guess it’s impossible for something to become accepted as an “intelligence test” if it doesn’t correlate well with the Weschler [sic] and its kin, no matter how much intelligence, in the ordinary sense, it requires, but, as we saw with the first simulated factor analysis example, that makes it inevitable that the leading factor fits well.
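Shalizi’s “simulated factor analysis example” is worth seeing in code. The sketch below is my own illustration in the spirit of his argument (and of Thomson’s sampling model), with all parameters chosen arbitrarily: no single underlying variable exists, each test simply draws on a random subset of many independent abilities, and yet a large, all-positive first factor appears.

    # Illustration only (parameters arbitrary): each of 12 "tests" sums a random
    # half of 300 independent abilities. There is no general ability in the
    # generating process, but the correlations are all positive and the first
    # factor is dominant.
    import numpy as np

    rng = np.random.default_rng(1)
    n_people, n_abilities, n_tests = 5000, 300, 12

    abilities = rng.standard_normal((n_people, n_abilities))
    masks = rng.random((n_tests, n_abilities)) < 0.5      # which abilities each test taps
    scores = abilities @ masks.T.astype(float)

    R = np.corrcoef(scores, rowvar=False)
    off_diagonal = R[~np.eye(n_tests, dtype=bool)]
    eigvals = np.linalg.eigvalsh(R)
    print(f"smallest correlation: {off_diagonal.min():.2f}")
    print(f"first factor: {eigvals[-1] / n_tests:.0%} of total variance")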
Shalizi’s thesis is that the positive manifold is an artifact of test construction and that full-scale scores from different IQ batteries correlate only because they are designed to do that. It follows from this argument that if a test maker decided to disregard the g factor and construct a battery for assessing several independent abilities, the result would be a test with many zero or negative correlations among its subtests. Moreover, such a test would not correlate highly with traditional tests, at least not positively. Shalizi alleges that there are tests that measure intelligence “in the ordinary sense” yet are uncorrelated with traditional tests, but unfortunately he does not give any examples.
Inadvertent positive manifolds
There are in fact many cognitive test batteries designed without regard to g, so we can put Shalizi’s allegations to the test. The Woodcock-Johnson test discussed above is a case in point. Carroll, when reanalyzing data from the test’s standardization sample, pointed out that its technical manual “reveals a studious neglect of the role of any kind of general factor in the WJ-R.” This dismissive stance towards g is also reflected in Richard Woodcock’s article about the test’s theoretical background (Woodcock 1990). (Yes, the Woodcock-Johnson test was developed by a guy named Dick Woodcock, together with his assistant Johnson. You can’t make this up.) The WJ-R was developed based on the idea that the g factor is a statistical artifact with no psychological relevance. Nevertheless, all of its subtests are intercorrelated and, when factor analyzed, it reveals a general factor that is no less prominent than those of more traditional IQ tests. According to the WJ-R technical manual, test results are to be interpreted at the level of nine broad abilities (such as Visual Processing and Quantitative Ability), not any general ability. Similarly, the manual reports factor analyses based only on the nine factors. But when Carroll reanalyzed the data, allowing for loadings on a higher-order g factor in addition to the nine factors, it turned out that most of the tests in the WJ-R have their highest loadings on the g factor, not on the less general (“broad”) factors which they were specifically designed to measure.
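The arithmetic behind a higher-order solution like Carroll’s is worth spelling out, since it explains how a test can end up loading more on g than on the broad factor it was written for. In a Schmid-Leiman style decomposition, a subtest’s g loading is the product of its loading on its broad factor and that factor’s loading on g; the numbers below are hypothetical, not Carroll’s estimates.

    # Schmid-Leiman logic with toy numbers (not Carroll's estimates).
    broad_on_g = 0.80                 # hypothetical loading of a broad factor on g
    subtest_on_broad = 0.70           # hypothetical loading of a subtest on that broad factor

    g_loading = subtest_on_broad * broad_on_g                          # 0.56
    residual_broad = subtest_on_broad * (1 - broad_on_g ** 2) ** 0.5   # 0.42
    print(g_loading, round(residual_broad, 2))   # the g loading exceeds the residualized broad loading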
While the WJ-R is not meant to be a test of g, it does provide a measure of “broad cognitive ability”, which correlates at 0.65 and 0.64 with the Stanford-Binet and Wechsler full-scale scores, respectively (Kamphaus 2005, p. 335). Typically, correlations between full-scale scores from different IQ tests are around 0.8. The WJ-R broad cognitive ability scores are probably less g-loaded than those of other tests, because they are based on unweighted sums of scores on subtests selected solely on the basis of their content diversity; hence the lower correlations, I believe. In any case, the WJ-R is certainly not uncorrelated with traditional tests. (The WJ-III, which is the newest edition of the test, now recognizes the g factor.)
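A bit of hypothetical arithmetic shows why a less g-loaded composite should correlate less with other batteries. If two full-scale composites share only g, their expected correlation is roughly the product of their g loadings; the loadings below are invented purely for illustration.

    # Invented g loadings, only to illustrate the logic in the paragraph above.
    wechsler_g = 0.90          # hypothetical g loading of one battery's composite
    binet_g = 0.90             # hypothetical g loading of another battery's composite
    wjr_broad_g = 0.72         # hypothetical, lower because of unweighted subtest sums

    print(round(wechsler_g * binet_g, 2))     # 0.81, like the typical inter-battery correlation
    print(round(wjr_broad_g * binet_g, 2))    # 0.65, in the range reported for the WJ-R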
The WJ-R serves as a forthright refutation of Shalizi’s claim that the positive manifold and inter-battery correlations emerge by design rather than because all cognitive abilities naturally intercorrelate. But perhaps the WJ-R is just a giant fluke, or perhaps its 29 tests correlate as a carryover from the previous edition of the test which had several of the same tests but was not based on anti-g ideas. Are there other examples of psychometricians accidentally creating strongly g-loaded tests against their best intentions? In fact, there is a long history of such inadvertent confirmations of the ubiquity of the g factor. This goes back at least to the 1930s and Louis Thurstone’s research on “primary mental abilities”.
Thurstone and Guilford
In a famous study published in 1938, Thurstone, one of the great psychometricians, claimed to have developed a test of seven independent mental abilities (verbal comprehension, word fluency, number facility, spatial visualization, associative memory, perceptual speed, and reasoning; see Thurstone 1938). However, the g men quickly responded, with Charles Spearman and Hans Eysenck publishing papers (Spearman 1939, Eysenck 1939) showing that Thurstone’s independent abilities were not independent, indicating that his data were compatible with Spearman’s g model. (Later in his career, Thurstone came to accept that perhaps intelligence could best be conceptualized as a hierarchy topped by g.)
The idea of non-correlated abilities was taken to its extreme by J.P. Guilford who postulated that there are as many as 160 different cognitive abilities. This made him very popular among educationalists because his theory suggested that everybody could be intelligent in some way. Guilford’s belief in a highly multidimensional intelligence was influenced by his large-scale studies of Southern California university students whose abilities were indeed not always correlated. In 1964, he reported (Guilford 1964) that his research showed that up to a fourth of correlations between diverse intelligence tests were not different from zero. However, this conclusion was based on bad psychometrics. Alliger 1988 reanalyzed Guilford’s data and showed that when you correct for artifacts such as range restriction (the subjects were generally university students), the reported correlations are uniformly positive.
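For the curious, the standard correction for direct range restriction (Thorndike’s Case II formula) looks like this. The numbers are hypothetical, and the point is only to show how a near-zero correlation in a highly selected student sample can correspond to a clearly positive one in the general population, not to reproduce Alliger’s analysis.

    # Thorndike's Case II correction for direct range restriction (hypothetical numbers).
    def correct_range_restriction(r_restricted: float, sd_ratio: float) -> float:
        """sd_ratio = unrestricted SD / restricted SD of the selection variable."""
        num = r_restricted * sd_ratio
        den = (1.0 - r_restricted ** 2 + (r_restricted * sd_ratio) ** 2) ** 0.5
        return num / den

    # r = 0.15 among selected students, population SD 2.5x the sample SD:
    print(round(correct_range_restriction(0.15, 2.5), 2))   # about 0.35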
British Ability Scales
Psychometricians have not been discouraged by past failures to discover abilities that are independent of the general factor. They keep constructing tests that supposedly take the measurement of intelligence beyond g.
For example, the British Ability Scales was carefully developed in the 1970s and 1980s to measure a wide variety of cognitive abilities, but when the published battery was analyzed (Elliott 1986), the results were quite disappointing:
Considering the relatively large size of the test battery […] the solutions have yielded perhaps a surprisingly small number of common factors. As would be expected from any cognitive test battery, there is a substantial general factor. After that, there does not seem to be much common variance left […]
What, then, are we to make of the results of these analyses? Do they mean that we are back to square one, as it were, and that after 60 years of research we have turned full circle and are back with the theories of Spearman? Certainly, for this sample and range of cognitive measures, there is little evidence that strong primary factors, such as those postulated by many test theorists over the years, have accounted for any substantial proportion of the common variance of the British Ability Scales. This is despite the fact that the scales sample a wide range of psychological functions, and deliberately include tests with purely verbal and purely visual tasks, tests of fluid and crystallized mental abilities, tests of scholastic attainment, tests of complex mental functioning such as in the reasoning scales and tests of lower order abilities as in the Recall of Digits scale.
CAS
An even better example is the CAS battery. It is based on the PASS theory (which draws heavily on the ideas of Soviet psychologist A.R. Luria, a favorite of Shalizi’s), which disavows g and asserts that intelligence consists of four processes called Planning, Attention-Arousal, Simultaneous, and Successive. The CAS was designed to assess these four processes.
However, Keith et al. 2001 did a joint confirmatory factor analysis of the CAS together with the WJ-III battery, concluding that not only does the CAS not measure the constructs it was designed to measure, but that notwithstanding the test makers’ aversion to g, the g factor derived from the CAS is large and statistically indistinguishable from the g factor of the WJ-III. The CAS therefore appears to be the opposite of what it was supposed to be: an excellent test of the “non-existent” g and a poor test of the supposedly real non-g abilities it was painstakingly designed to measure.
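The logic of that “statistically indistinguishable” claim is a nested-model comparison: fit the joint model once with the two batteries’ g factors free to correlate and once with the correlation fixed at 1.0, then test the change in fit. This is not Keith et al.’s code, and the fit statistics below are made up; it only sketches the test.

    # Hypothetical fit statistics, illustrating a chi-square difference test between
    # a model with the CAS-g/WJ-III-g correlation estimated freely and one with it fixed at 1.0.
    from scipy.stats import chi2

    chisq_free, df_free = 850.0, 510       # hypothetical
    chisq_fixed, df_fixed = 851.2, 511     # hypothetical

    delta_chisq = chisq_fixed - chisq_free
    delta_df = df_fixed - df_free
    p = chi2.sf(delta_chisq, delta_df)
    print(f"delta chi2 = {delta_chisq:.1f} on {delta_df} df, p = {p:.2f}")   # non-significant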
Triarchic intelligence
A particularly amusing confirmation of the positive manifold resulted from Robert Sternberg’s attempts at developing measures of non-g abilities. Sternberg introduced his “triarchic” theory of intelligence in the 1980s and has tirelessly promoted it ever since while at every turn denigrating the proponents of g as troglodytes. He claims that g represents a rather narrow domain of analytic or academic intelligence which is more or less uncorrelated with the often much more important creative and practical forms of intelligence. He created a test battery to assess these different intellectual domains. It turned out that the three “independent” abilities were highly intercorrelated, which Sternberg absurdly put down to common-method variance.
A reanalysis of Sternberg’s data by Nathan Brody (Brody 2003a) showed that not only were the three abilities highly correlated with each other and with Raven’s IQ test, but also that the abilities did not exhibit the postulated differential validities (e.g., measures of creative ability and analytic ability were equally good predictors of measures of creativity, and analytic ability was a better predictor of practical outcomes than practical ability), and in general the test had little predictive validity independently of g. (Sternberg, true to his style, refused to admit that these results had any implications for the validity of his triarchic theory, prompting the exasperated Brody to publish an acerbic reply called “What Sternberg should have concluded” [Brody 2003b].)
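For what it is worth, the kind of incremental-validity check Brody ran is easy to illustrate with simulated data (the data below are simulated by me, not Sternberg’s or Brody’s): regress the outcome on g, then add the “practical” score and see how little the explained variance moves.

    # Simulated data, for illustration only: the outcome depends on g alone, and the
    # "practical" score is itself g-loaded, so it adds essentially no incremental validity.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000
    g = rng.standard_normal(n)
    practical = 0.8 * g + 0.6 * rng.standard_normal(n)
    outcome = 0.5 * g + rng.standard_normal(n)

    def r_squared(predictors, y):
        X = np.column_stack([np.ones(len(y)), predictors])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - resid.var() / y.var()

    print(round(r_squared(g, outcome), 3))                                   # g alone
    print(round(r_squared(np.column_stack([g, practical]), outcome), 3))     # g + practical: barely higher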
