Test Theory
eBook - ePub

Test Theory

A Unified Treatment

  1. 498 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

This book introduces the reader to the main quantitative concepts, methods, and computational techniques needed for the development, evaluation, and application of tests in the behavioral/social sciences, including educational tests. Two empirical examples are carried throughout to illustrate alternative methods. Other data sets are used for special illustrations. Self-contained programs for confirmatory and exploratory factor analysis are available on the Web.

The book is intended for students of psychology, particularly educational psychology, as well as social science students interested in how tests are constructed and used. Prerequisites include a course on statistics.

The programs and data files for this book can be downloaded from www.psypress.com/test-theory/

Test Theory by Roderick P. McDonald. Subject area: Psychology, Research & Methodology in Psychology.
Chapter 1
General Introduction
Test theory is an abbreviated expression for theory of psychological tests and measurements, which can in turn be abbreviated back to psychometric theory (psychological measurement). Test theory provides a general collection of techniques for evaluating the development and use, in assessment, of specific psychological tests. It has the same relation to the practice of testing as statistical and experimental design principles have to actual programs of experimental research in behavioral science. There are courses and corresponding textbooks available on assessment—the evaluation of characteristics of individuals through the use of tests, interviews, and so on. There are also more specialized courses and textbooks—on tests used to measure educational abilities and achievements, tests used to assess personality characteristics, and tests developed for the measurement of social attitudes. Test theory is both more general and much narrower in scope, just as the topic of experimental design and statistical analysis is more general and narrower than any survey of actual experimental studies in an area of psychology.
The objectives of this chapter are, first, to indicate the kinds of problem that motivate study of the topic known as test theory; second, to give a general orientation to it, including an outline of key historical developments; and, third, to give, as far as possible, a preview of the chapters to follow, with suggestions about the approaches that can be taken to them.
As a way to get started, let us consider three practical problems of measurement. These serve as concrete examples of the types of question answered by test theory:
1. A teacher constructs 20 items for a mathematics test. The answer to each item can be checked very easily as pass or fail. The teacher gives the examination to a number of students, and adds up the number of passes to make a score for mathematics for each of them. The teacher wonders about several problems:
(a) Should there be one score for mathematics or two scores, one for items that on the face of it are about geometry, and one for items that appear to be about algebra?
(b) And what about items that require both geometry and algebra?
(c) Are all the items good measures of mathematical ability or are some better than others?
(d) If one score is sufficient, how accurate is it as a measure of mathematics knowledge?
(e) Are 20 items sufficient to give a reasonably accurate determination of each student’s knowledge? Should more be used? Could fewer have been used, with a saving of time and effort?
(f) If different items had been written, would they have measured the same thing? Equally well? In particular, can two tests be made, with different items, whose scores are completely interchangeable? Perhaps the teacher would like to put the items in a computer and have the students respond at the keyboard. A computer program could decide which items each student should be tested on.
(g) Are students at the lower end of the scale measured as accurately as students in the middle or at the high end? If some students score 0 or 20, the test seems respectively too difficult or too easy for them, so perhaps easier and/or more difficult items should be added?
(h) Are the items free from bias, when given to students of different backgrounds? Could some students have irrelevant problems with certain items because of differences in their background and experience? How would we know?
2. A clinical psychologist writes a set of items such as:
I have difficulty sleeping: True/False
I am afraid of heights: True/False
I get tired easily: True/False
I often have bad dreams: True/False …
(Readers are invited to consider how they would write more items “of the same kind.”) The intention here is to use a score obtained from a subject’s responses to these items to determine whether the subject suffers from a neurotic disorder. The clinical psychologist wonders about the following: Should there be one score for neurosis, or should there be several for, say, phobias, endogenous anxiety, …? How accurate are the score(s) as measures of neurotic conditions? And further questions follow along the lines of (a) through (h) for the first problem.
3. A survey researcher wants to study attitudes to gun control. She starts writing a series of items, with a typical survey format, such as:
(a) Assault weapons do not belong in private hands—

Strongly agree (SA)
Agree (A)
Neither agree nor disagree (nAD)
Disagree (D)
Strongly disagree (SD)
(b) All hand-guns should be licensed—

SA A nAD D SD
(c) Government interference with the right to bear arms is an infringement of liberties—

SA A nAD D SD
(The reader is invited to consider writing a few more items “of the same kind.”) The survey researcher wonders: Is there a good way to score the items, separately or together? [We notice that item (c) seems to measure in the opposite direction from (a) and (b), and we might believe that “strongly agree” should carry more “weight” than just “agree.”] Do the item scores add up to make a general score for approval of gun control, or do different items measure different aspects of the question? Again, how accurately is the attitude measured? How many items are needed to cover the attitude and measure it well enough? And so on.
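The scoring question raised for the survey items can be made concrete with a small sketch. It assumes equally spaced integer codes (SD = 1 through SA = 5), reverses the coding for item (c) because it is worded against gun control, and takes a simple sum as the total. The codes, the sum, and all function and variable names are illustrative assumptions, not a scoring rule given in the text.

```python
# Integer codes for the five response categories (an assumed, equally spaced coding).
CODES = {"SD": 1, "D": 2, "nAD": 3, "A": 4, "SA": 5}

# Items worded against gun control are reverse-keyed; here only item (c).
REVERSE_KEYED = {"c"}


def score_item(item, response):
    """Return the item score, flipping the scale for reverse-keyed items."""
    code = CODES[response]
    if item in REVERSE_KEYED:
        # Maps 1 <-> 5 and 2 <-> 4; the midpoint 3 is unchanged.
        code = min(CODES.values()) + max(CODES.values()) - code
    return code


def total_score(responses):
    """Sum the item scores for one respondent (responses maps item -> category)."""
    return sum(score_item(item, resp) for item, resp in responses.items())


# A respondent who favours gun control on every item.
respondent = {"a": "SA", "b": "A", "c": "SD"}
print(total_score(respondent))  # 5 + 4 + 5 = 14
```

Whether equal spacing is appropriate, and whether the three item scores should be summed at all, are exactly the questions the researcher is asking; the sketch only shows what the simplest answer looks like.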
Test theory consists of the use of mathematical concepts that have been developed in order to refine questions such as these into more precise forms and to provide answers to them. The student may need an immediate assurance that it is possible to acquire knowledge of the essential concepts of test theory, and the skill to apply these concepts to practical problems, without understanding all of the mathematical technicalities—foundations and proofs. But it is necessary to recognize from the beginning that test theory is essentially applied mathematics, overlapping with statistics.
The development of test theory has generally been motivated by the need to solve problems in psychology, including educational psychology and educational measurement. It has largely taken place at the hands of psychologists, who in many cases had to struggle to acquire the mathematics and, in particular, the statistical knowledge they needed to solve their problems. Until recent decades it was generally difficult (although there have been notable exceptions) to get mathematicians to recognize that problems of great urgency in psychological research could be interesting, and not without challenge, to the mathematicians. This fact has some good and some bad consequences for the student entering this field of work. On the good side, a great deal of the theory that has been developed was originally expressed in fairly straightforward formulas. By now, mathematicians have taught the field to express these concepts much more formally, rigorously, pedantically, and incomprehensibly, but behavioral science students and researchers can work on a principle of not demanding a higher level of formality and rigor in test theory than is needed for the comprehension and application of a particular piece of theory. We do not need to set an unnecessarily high level of aspiration in mathematical credentials. On the bad side, the development of psychometric theory has tended to be a piecemeal and rather confused process. In some ways, the field has been slow to recognize mathematical and conceptual equivalences among different developments.
Another problem with the development of test theory is that a major part of it took place before the computer revolution of the late 1950s. Yet much of psychometric theory was and is far more computationally intensive than most of the comparable developments in statistics in the earlier era. Consequently, the pioneers were forced to invent short-cut numerical devices, which sometimes made the theory itself look rather crude to mathematicians. Some of these devices have not been removed to a museum of psychometric theory, but remain in operation alongside more efficient computer methods. Sometimes these older methods provide the main defaults in computer packages. This can be very confusing for the user.
So that these last remarks can take a more concrete form, we turn in the next section to a sketchy contextual and historical introduction, mainly referring to key theoretical developments. The following section describes the approach taken in this book, as well as various ways the text can be used by different readers. The chapter concludes with an outline of the contents. Like all such outlines, this can be understood better on a second reading, after the book itself has been not merely read but worked through.
CONTEXTUAL AND HISTORICAL
It has often been remarked that psychology has a long past and a short history, traceable as past to ancient Greece at least, and as history to its splitting off from philosophy about the middle of the 19th century. Similarly, if we include educational testing within the field of psychological measurement, the practice of testing has a past traceable to ancient China, but psychometric theory has a history that begins about the mid-19th century in the psychophysical laboratory. Within that short history, there are some key developments that form the foundations of modern test theory. Chronological order is not used in this sketch. Instead, several branches of development are outlined.
In 1904, Charles Spearman published two seminal papers, which included alternative analyses of the same data. The first paper showed how to recognize, from test data, that the tests measure just one psychological attribute in common—a “common factor.” The second showed how to estimate the amount of error in test scores. To do this, it was supposed that what the tests measure in common is a “true score,” each being subject to “error of measurement.” Over decades, by further elaborations, the first paper gave rise to common factor theory (chaps. 6 and 9), and the other gave rise to classical true-score theory (chaps. 5 and 7). These theories have tended to be treated as separate and unrelated branches of psychometrics, but they need not be.
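The idea of a true score subject to error of measurement can be illustrated with a short simulation. The sketch below assumes the usual textbook formulation, observed score = true score + error, with errors independent of the true score and of each other, and it assumes normally distributed scores. Under those assumptions the correlation between two forms that measure the same true score should approximate the ratio of true-score variance to observed-score variance. The function names and the numerical values are hypothetical, chosen only for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_two_forms(n, sd_true, sd_error, seed=0):
    """Simulate two forms sharing one true score, each with its own independent error."""
    rng = random.Random(seed)
    true = [rng.gauss(0, sd_true) for _ in range(n)]
    form1 = [t + rng.gauss(0, sd_error) for t in true]
    form2 = [t + rng.gauss(0, sd_error) for t in true]
    return form1, form2


sd_true, sd_error = 2.0, 1.0
# Ratio of true-score variance to observed-score variance under the assumed model.
theoretical = sd_true**2 / (sd_true**2 + sd_error**2)
form1, form2 = simulate_two_forms(20000, sd_true, sd_error)
print(theoretical, round(pearson(form1, form2), 2))
```

With a large simulated sample the printed correlation should land close to the theoretical ratio; with the small samples typical of real testing it can differ noticeably.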
Spearman thought of his work as supporting a psychological theory—that cognitive performances depend on a unitary psychological function, general intelligence. In his initial work he used ordinary examination results in academic subjects as measures of intelligence.
The very first functioning intelligence test was produced by Alfred Binet and Victor Henri in 1895, with an improved version following in 1905 from Binet and Simon. This test was based on the simple but effective device of choosing items for which the percentage of correct answers increased with (chronological) age, and identifying a “mental age” as the chronological age of subjects who typically passed those items. In 1914, Stern introduced the concept of an intelligence quotient, defined as the ratio of mental age to chronological age multiplied by 100. Lewis Terman developed the Stanford-Binet tests of intelligence out of Binet’s work, and used Stern’s age-based IQ. David Wechsler developed intelligence tests in which the mean and standard deviation of the group of examinees used to develop the test gave an IQ based on individual deviation from others of the same age. He chose a mean of 100 and a standard deviation of 15 for the resulting deviation IQ, which makes it appear comparable to the age-based IQ of Stern. This deviation IQ points toward the need to consider the choice of scale of a psychological test.
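The two IQ definitions just described reduce to simple arithmetic. The sketch below follows the ratio definition as stated, and for the deviation IQ it assumes a linear rescaling of a raw score using the mean and standard deviation of the same-age reference group. That linear form, the function names, and the example numbers are assumptions made for illustration.

```python
def ratio_iq(mental_age, chronological_age):
    """Age-based IQ: mental age divided by chronological age, times 100."""
    return 100 * mental_age / chronological_age


def deviation_iq(score, group_mean, group_sd, mean=100, sd=15):
    """Deviation IQ: rescale the standardized raw score to the chosen mean and SD."""
    z = (score - group_mean) / group_sd  # deviation from same-age peers, in SD units
    return mean + sd * z


# A child of 8 who passes the items typical of 10-year-olds.
print(ratio_iq(10, 8))  # 125.0

# A raw score of 62 in an age group with mean 50 and standard deviation 8.
print(deviation_iq(62, 50, 8))  # 100 + 15 * 1.5 = 122.5
```

The two numbers share a mean of 100 but are built differently: the first is a ratio of ages, the second a position within an age group, which is why they only appear comparable.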
Following Spearman’s initial work, psychologists made increasingly careful attempts to develop items considered to measure intelligence, with related attempts to define the meaning of the concept more precisely. These attempts provided the main context for the development of test theory. A major question for several decades was whether intelligence was a unitary function, or whether it was necessary to recognize a number of distinct scholastic aptitudes—verbal versus nonverbal intelligence, or major “group factors” of intelligence. L. L. Thurstone in the 1930s elaborated Spearman’s model into a “multiple factor” model. This was conceived as a method of data analysis by which the psychologist could discover how many distinct kinds of ability “exist,” and what their nature is. This ambitious program required the development of increasingly elaborate mathematical and numerical devices. As already mentioned, these were complicated by the need to reduce computational effort when human operators, not electronic computers, had to perform these functions. Work in the framework given by Thurstone tended to eliminate any notion of a general intelligence or scholastic aptitude in favor of an ever-increasing number of specialized but related aptitudes—numerical ability, verbal ability, spatial ability, and so on.
The prototype of personality self-report questionnaires was developed by Woodworth, about 1920, derived from psychiatric descriptions of symptoms of neurotic patients. The four items in example 2 can be taken as representative of Woodworth’s Personal Data Sheet, as it was called, and of other personality inventories developed since. Perhaps the most widely used self-report personality inventory up to the time of writing is the Minnesota Multiphasic Personality Inventory (MMPI). Items were chosen from a large initial collection on the basis of contrasting responses of psychiatrically diagnosed groups of subjects—such as depressives, hysterics, psychopaths, paranoids, schizophrenics—to create subtests measuring tendencies toward these pathologies.
Eventually, factor analysis methods were applied to personality tests. A degree of consensus seems to have emerged, through the work of Eysenck, Norman, and others, that there are probably five main personality traits, with corresponding scales to measure them. These are:
1. Emotional stab...

Table of contents

  Cover
  Halftitle
  Title
  Copyright
  Contents
  Preface
  1. General Introduction
  2. Items and Item Scores
  3. Item and Test Statistics
  4. The Concept of a Scale
  5. Reliability Theory for Total Test Scores
  6. Test Homogeneity, Reliability, and Generalizability
  7. Reliability—Applications
  8. Prediction and Multiple Regression
  9. The Common Factor Model
  10. Validity
  11. Classical Item Analysis
  12. Item Response Models
  13. Properties of Item Response Models
  14. Multidimensional Item Response Models
  15. Comparing Populations
  16. Alternate Forms and the Problem of Equating
  17. An Introduction to Structural Equation Modeling
  18. Some Scaling Theory
  19. Retrospective
  Appendix A. Some Rules for Expected Values
  Glossary
  References
  Author Index
  Subject Index