Chapter 1
General Introduction
Test theory is an abbreviated expression for theory of psychological tests and measurements, which can in turn be abbreviated back to psychometric theory (psychological measurement). Test theory provides a general collection of techniques for evaluating the development and use, in assessment, of specific psychological tests. It has the same relation to the practice of testing as statistical and experimental design principles have to actual programs of experimental research in behavioral science. There are courses and corresponding textbooks available on assessment, that is, the evaluation of characteristics of individuals through the use of tests, interviews, and so on.1 There are also more specialized courses and textbooks on tests used to measure educational abilities and achievements, tests used to assess personality characteristics, and tests developed for the measurement of social attitudes. Test theory is both more general and much narrower in scope, just as the topic of experimental design and statistical analysis is more general and narrower than any survey of actual experimental studies in an area of psychology.
The objectives of this chapter are, first, to indicate the kinds of problem that motivate study of the topic known as test theory; second, to give a general orientation to it, including an outline of key historical developments; and, third, to give, as far as possible, a preview of the chapters to follow, with suggestions about the approaches that can be taken to them.
As a way to get started, let us consider three practical problems of measurement. These serve as concrete examples of the types of question answered by test theory:
1. A teacher constructs 20 items for a mathematics test. The answer to each item can be checked very easily as pass or fail. The teacher gives the examination to a number of students, and adds up the number of passes to make a score for mathematics for each of them. The teacher wonders about several problems: (a) Should there be one score for mathematics or two scores, one for items that on the face of it are about geometry, and one for items that appear to be about algebra? (b) And what about items that require both geometry and algebra? (c) Are all the items good measures of mathematical ability or are some better than others? (d) If one score is sufficient, how accurate is it as a measure of mathematics knowledge? (e) Are 20 items sufficient to give a reasonably accurate determination of each student's knowledge? Should more be used? Could fewer have been used, with a saving of time and effort? (f) If different items had been written, would they have measured the same thing? Equally well? In particular, can two tests be made, with different items, whose scores are completely interchangeable? Perhaps the teacher would like to put the items in a computer and have the students respond at the keyboard. A computer program could decide which items each student should be tested on. (g) Are students at the lower end of the scale measured as accurately as students in the middle or at the high end? If some students score 0 or 20, the test seems respectively too difficult or too easy for them, so perhaps easier and/or more difficult items should be added? (h) Are the items free from bias, when given to students of different backgrounds? Could some students have irrelevant problems with certain items because of differences in their background and experience? How would we know?
2. A clinical psychologist writes a set of items such as:
(Readers are invited to consider how they would write more items "of the same kind.") The intention here is to use a score obtained from a subject's responses to these items to determine whether the subject suffers from a neurotic disorder. The clinical psychologist wonders about the following: Should there be one score for neurosis, or should there be several for, say, phobias, endogenous anxiety, ...? How accurate are the score(s) as measures of neurotic conditions? And further questions follow along the lines of (a) through (h) for the first problem.
3. A survey researcher wants to study attitudes to gun control. She starts writing a series of items, with a typical survey format, such as:
(The reader is invited to consider writing a few more items "of the same kind.") The survey researcher wonders: Is there a good way to score the items, separately or together? [We notice that item (c) seems to measure in the opposite direction from (a) and (b), and we might believe that "strongly agree" should carry more "weight" than just "agree."] Do the item scores add up to make a general score for approval of gun control, or do different items measure different aspects of the question? Again, how accurately is the attitude measured? How many items are needed to cover the attitude and measure it well enough? And so on.
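The scoring question raised in the third problem can be made concrete with a small sketch. It is only an illustration under stated assumptions: the five response labels, the integer weights 1 to 5, and the function names are hypothetical, and the sketch simply assumes that item (c) is keyed in the opposite direction from (a) and (b). Whether such weights are appropriate is exactly the kind of question test theory is meant to address.

```python
# Assumed 5-point response format; labels and integer weights are illustrative only.
WEIGHTS = {
    "strongly disagree": 1,
    "disagree": 2,
    "undecided": 3,
    "agree": 4,
    "strongly agree": 5,
}
MIN_WEIGHT = min(WEIGHTS.values())
MAX_WEIGHT = max(WEIGHTS.values())


def score_item(response, reverse_keyed=False):
    """Return the weight for one response, flipped if the item is reverse-keyed."""
    weight = WEIGHTS[response]
    if reverse_keyed:
        # Reflect the weight about the scale midpoint: 1 <-> 5, 2 <-> 4, 3 stays 3.
        weight = MIN_WEIGHT + MAX_WEIGHT - weight
    return weight


def total_score(responses, reverse_keyed_items):
    """Sum the item scores for one respondent (a dict of item label -> response)."""
    return sum(
        score_item(response, item in reverse_keyed_items)
        for item, response in responses.items()
    )


# Hypothetical respondent; item "c" is assumed to run in the opposite direction.
respondent = {"a": "strongly agree", "b": "agree", "c": "disagree"}
print(total_score(respondent, reverse_keyed_items={"c"}))  # 5 + 4 + 4 = 13
```

With this weighting, "strongly agree" counts one point more than "agree," and disagreeing with the reverse-keyed item contributes in the same direction as agreeing with the others. Summing the item scores is itself an assumption, since it presupposes that the items measure a single attitude.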
Test theory consists of the use of mathematical concepts that have been developed in order to refine questions such as these into more precise forms and to provide answers to them. The student may need an immediate assurance that it is possible to acquire knowledge of the essential concepts of test theory, and the skill to apply these concepts to practical problems, without understanding all of the mathematical technicalities (foundations and proofs). But it is necessary to recognize from the beginning that test theory is essentially applied mathematics, overlapping with statistics.
The development of test theory has generally been motivated by the need to solve problems in psychology, including educational psychology and educational measurement. It has largely taken place at the hands of psychologists, who in many cases had to struggle to acquire the mathematics and, in particular, the statistical knowledge they needed to solve their problems. Until recent decades it was generally difficult (although there have been notable exceptions) to get mathematicians to recognize that problems of great urgency in psychological research could be interesting, and not without challenge, to the mathematicians. This fact has some good and some bad consequences for the student entering this field of work. On the good side, a great deal of the theory that has been developed was originally expressed in fairly straightforward formulas. By now, mathematicians have taught the field to express these concepts much more formally, rigorously, pedantically, and incomprehensibly, but behavioral science students and researchers can work on a principle of not demanding a higher level of formality and rigor in test theory than is needed for the comprehension and application of a particular piece of theory. We do not need to set an unnecessarily high level of aspiration in mathematical credentials. On the bad side, the development of psychometric theory has tended to be a piecemeal and rather confused process. In some ways, the field has been slow to recognize mathematical and conceptual equivalences among different developments.
Another problem with the development of test theory is that a major part of it took place before the computer revolution of the late 1950s. Yet much of psychometric theory was and is far more computationally intensive than most of the comparable developments in statistics in the earlier era. Consequently, the pioneers were forced to invent short-cut numerical devices, which sometimes made the theory itself look rather crude to mathematicians. Some of these devices have not been removed to a museum of psychometric theory, but remain in operation alongside more efficient computer methods. Sometimes these older methods provide the main defaults in computer packages. This can be very confusing for the user.
So that these last remarks can take a more concrete form, we turn in the next section to a sketchy contextual and historical introduction, mainly referring to key theoretical developments. The following section describes the approach taken in this book, as well as various ways the text can be used by different readers. The chapter concludes with an outline of the contents. Like all such outlines, this can be understood better on a second reading, after the book itself has been not merely read but worked through.
CONTEXTUAL AND HISTORICAL
It has often been remarked that psychology has a long past and a short history, traceable as past to ancient Greece at least, and as history to its splitting off from philosophy about the middle of the 19th century. Similarly, if we include educational testing within the field of psychological measurement, the practice of testing has a past traceable to ancient China, but psychometric theory has a history that begins about the mid-19th century in the psychophysical laboratory. Within that short history, there are some key developments that form the foundations of modern test theory. Chronological order is not used in this sketch. Instead, several branches of development are outlined.
In 1904, Charles Spearman published two seminal papers, which included alternative analyses of the same data.2 The first paper showed how to recognize, from test data, that the tests measure just one psychological attribute in common: a "common factor." The second showed how to estimate the amount of error in test scores. To do this, it was supposed that what the tests measure in common is a "true score," each being subject to "error of measurement." Over decades, by further elaborations, the first paper gave rise to common factor theory (chaps. 6 and 9), and the other gave rise to classical true-score theory (chaps. 5 and 7). These theories have tended to be treated as separate and unrelated branches of psychometrics, but they need not be.
Spearman thought of his work as supporting a psychological theory: that cognitive performances depend on a unitary psychological function, general intelligence. In his initial work he used ordinary examination results in academic subjects as measures of intelligence.
The very first functioning intelligence test was produced by Alfred Binet and Victor Henri in 1895, with an improved version following in 1905 from Binet and Simon. This test was based on the simple but effective device of choosing items for which the percentage of correct answers increased with (chronological) age, and identifying a "mental age" as the chronological age of subjects who typically passed those items. In 1914, Stern introduced the concept of an intelligence quotient, defined as the ratio of mental age to chronological age multiplied by 100. Lewis Terman developed the Stanford-Binet tests of intelligence out of Binet's work, and used Stern's age-based IQ. David Wechsler developed intelligence tests in which the mean and standard deviation of the group of examinees used to develop the test gave an IQ based on individual deviation from others of the same age. He chose a mean of 100 and a standard deviation of 15 for the resulting deviation IQ, which makes it appear comparable to the age-based IQ of Stern. This deviation IQ points toward the need to consider the choice of scale of a psychological test.3
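The two definitions of IQ just described can be compared in a short sketch. The function and parameter names are hypothetical, and the deviation IQ is computed here as a simple linear rescaling of a standardized score, which is one straightforward reading of the description above rather than the actual norming procedure of any published test.

```python
def ratio_iq(mental_age, chronological_age):
    """Age-based IQ: mental age divided by chronological age, times 100."""
    return 100.0 * mental_age / chronological_age


def deviation_iq(raw_score, group_mean, group_sd, iq_mean=100.0, iq_sd=15.0):
    """Deviation IQ: standardize against same-age examinees, then rescale.

    group_mean and group_sd are the mean and standard deviation of raw scores
    in the reference group of the same age (assumed known here).
    """
    z = (raw_score - group_mean) / group_sd
    return iq_mean + iq_sd * z


# A child of 8 who passes the items typical of 10-year-olds.
print(ratio_iq(mental_age=10, chronological_age=8))  # 125.0

# An examinee one standard deviation above the mean of the age group.
print(deviation_iq(raw_score=60, group_mean=50, group_sd=10))  # 115.0
```

Both functions return 100 for an average examinee, which is why the two scales appear comparable, but a given distance from 100 has a different meaning on each. The choice of 100 and 15 in the second function is a convention of scale, not a property of the attribute measured.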
Following Spearman's initial work, psychologists made increasingly careful attempts to develop items considered to measure intelligence, with related attempts to define the meaning of the concept more precisely. These attempts provided the main context for the development of test theory. A major question for several decades was whether intelligence was a unitary function, or whether it was necessary to recognize a number of distinct scholastic aptitudes: verbal versus nonverbal intelligence, or major "group factors" of intelligence. L. L. Thurstone in the 1930s elaborated Spearman's model into a "multiple factor" model.4 This was conceived as a method of data analysis by which the psychologist could discover how many distinct kinds of ability "exist," and what their nature is. This ambitious program required the development of increasingly elaborate mathematical and numerical devices. As already mentioned, these were complicated by the need to reduce computational effort when human operators, not electronic computers, had to perform these functions. Work in the framework given by Thurstone tended to eliminate any notion of a general intelligence or scholastic aptitude in favor of an ever-increasing number of specialized but related aptitudes: numerical ability, verbal ability, spatial ability, and so on.
The prototype of personality self-report questionnaires was developed by Woodworth, about 1920, derived from psychiatric descriptions of symptoms of neurotic patients. The four items in example 2 can be taken as representative of Woodworth's Personal Data Sheet, as it was called, and of other personality inventories developed since. Perhaps the most widely used self-report personality inventory up to the time of writing is the Minnesota Multiphasic Personality Inventory (MMPI). Items were chosen from a large initial collection on the basis of contrasting responses of psychiatrically diagnosed groups of subjects (such as depressives, hysterics, psychopaths, paranoids, schizophrenics) to create subtests measuring tendencies toward these pathologies.5
Eventually, factor analysis methods were applied to personality tests. A degree of consensus seems to have emerged, through the work of Eysenck, Norman, and others,6 that there are probably five main personality traits, with corresponding scales to measure them. These are: