Handbook of Item Response Theory

Name: Handbook of Item Response Theory
ISBN: 9781315356907

Volume 1: Models

Wim J. van der Linden,

595 pages
English

About this book

Drawing on the work of internationally acclaimed experts in the field, Handbook of Item Response Theory, Volume One: Models presents all major item response models. This first volume in a three-volume set covers many model developments that have occurred in item response theory (IRT) during the last 20 years. It describes models for different response formats or response processes, the need of deeper parameterization due to a multilevel or hierarchical structure of the response data, and other extensions and insights.

In Volume One, all chapters have a common format with each chapter focusing on one family of models or modeling approach. An introductory section in every chapter includes some history of the model and a motivation of its relevance. Subsequent sections present the model more formally, treat the estimation of its parameters, show how to evaluate its fit to empirical data, illustrate the use of the model through an empirical example, and discuss further applications and remaining research issues.

Publisher

Chapman and Hall/CRC

Year

2016

Print ISBN

9780367220013

eBook ISBN

9781315356907

Topic

Psychology

Subtopic

Probability & Statistics

Introduction

Wim J. van der Linden

CONTENTS

1.1 Alfred Binet

1.2 Louis Thurstone

1.3 Frederic Lord and George Rasch

1.4 Later Contributions

1.5 Unifying Features of IRT

References

The foundations of item response theory (IRT) and classical test theory (CTT) were laid in two journal articles published only one year apart. In 1904, Charles Spearman published his article “The proof and measurement of association between two things,” in which he introduced the idea of decomposing an observed test score into a true score and a random error, generally adopted as the basic assumption of what soon became known as the CTT model for a fixed test taker. The role of Alfred Binet as the founding father of IRT is less obvious. One reason may be the relative inaccessibility of his 1905 article “Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux,” published with his coworker Théodore Simon in L’Année Psychologique. But a more decisive factor might be the overshadowing of his pioneering statistical work by something the later psychological literature treated as more important: the launch of the first standardized psychological test. Spearman’s work soon stimulated others to elaborate on his ideas, producing such contributions as the definition of a standard error of measurement, methods to estimate test reliability, and results on the impact of test lengthening on reliability. But, with the important exception of Thurstone’s original work in the 1920s, Binet’s contributions hardly had any immediate follow-up.
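As a brief illustration of the CTT ideas mentioned above, the decomposition and the results it led to can be written as follows. The symbols (X, T, E, the reliability coefficient, and the lengthening factor k) are conventional notation assumed here for illustration; they do not appear in this chapter.

```latex
% Assumed notation: X observed score, T true score, E random error.
% For a fixed test taker, T is a constant and the error has expectation zero.
X = T + E, \qquad \mathcal{E}(E) = 0

% Across a population of test takers, true scores and errors are uncorrelated,
% so the observed-score variance splits into two parts.
\sigma_X^2 = \sigma_T^2 + \sigma_E^2

% Reliability: the proportion of observed-score variance due to true scores.
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}

% Standard error of measurement.
\sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}}

% Spearman-Brown formula: reliability of a test lengthened by a factor k
% using parallel parts.
\rho_{kk'} = \frac{k\,\rho_{XX'}}{1 + (k - 1)\,\rho_{XX'}}
```

For example, under these assumptions, doubling a test with reliability 0.60 gives 2(0.60)/(1 + 0.60) = 0.75.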

1.1 Alfred Binet

What exactly did Binet set in motion? In order to understand his pioneering contributions, we have to view them against the backdrop of his predecessors and contemporaries, such as Fechner, Wundt, Ebbinghaus, Quetelet, and Galton. These scientists worked in areas that had already experienced considerable growth, including anthropometrics, with its measurement of the human body, and psychophysics, which had successfully studied the relationships between the strength of physical stimuli such as light and sound and the human sensations of them.

Binet was fully aware of these developments and did appreciate them. But the problem he faced was fundamentally different. The city of Paris had asked him to develop a test that would enable its schools to differentiate between students that were mentally retarded, with the purpose of assigning them to special education, and those that were just unmotivated. Binet had already thought deeply about the measurement of intelligence and he was painfully aware of its problems. In 1898, he wrote (Revue Psychologique; cf. Wolf, 1973, p. 149):

There is no difficulty in measurement as long as it is a question of experiments on…tactile, visual, or auditory sensations. But if it is a question of measuring the keenness of intelligence, where is the method to be found to measure the richness of intelligence, the sureness of judgment, the subtlety of mind?

Unlike the earlier anthropometricians and psychophysicists with their measurement and manipulation of simple physical quantities, Binet realized he had to measure a rather complex variable that could be assumed to be “out there” but to which we have no direct access. In short, something we now refer to as a latent variable.

Binet’s solution was innovative in several respects. First, he designed a large variety of tasks supposed to be indicative of the major mental functions, such as memory, reasoning, judgment, and abstraction, believed to be included in intelligence. The variety was assumed to cover the “richness of intelligence” in his above quote. Second, he used these tasks in what he became primarily known for: a fully standardized test. Everything in it, the testing materials, administration, and scoring rules, was carefully protocolled. As a result, each proctor independently administering the test had to produce exactly the same results for the same students. But although Binet was the first to do so, the idea of standardization was not original at all. It was entirely in agreement with the new methodological tradition of the psychological experiment with its standardization and randomization, which had held psychology fully in its grip since Wundt opened his laboratory in Leipzig in 1879. Binet had been in communication with Wundt and had visited his laboratory.

Third, Binet wanted to scale his test items but realized there was no natural scale for the measurement of intelligence. His solution was as simple as it was ingenious; he chose to use the chronological age of his students to determine scale values for his items. During a pretest, he tried out all items with samples of students from each of the age groups 3–11 and assigned to each item, as its scale value, the chronological age of the group in which 75% of the students appeared to answer it correctly. These scale values were then used to estimate the mental age at which each student actually performed. (Six years later, William Stern proposed to use the ratio of mental and chronological age as intelligence quotient [IQ]. A few more years later, Lewis Terman introduced the convention of multiplying this IQ by 100. Ever since, the mean IQ for a population has invariably been set at 100.)
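The age-based scaling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Binet's actual protocol: the function names, the pretest proportions, and the choice to take the youngest age group that reaches the 75% criterion are all assumptions made for the example.

```python
from typing import Dict, Optional


def binet_scale_value(prop_correct_by_age: Dict[int, float],
                      criterion: float = 0.75) -> Optional[int]:
    """Return the youngest age group whose proportion correct reaches the criterion.

    Assumption: when several age groups pass the criterion, the youngest one
    is taken as the item's scale value. Returns None if no group reaches it.
    """
    for age in sorted(prop_correct_by_age):
        if prop_correct_by_age[age] >= criterion:
            return age
    return None


def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Stern's ratio of mental to chronological age, multiplied by 100 (Terman)."""
    return 100.0 * mental_age / chronological_age


# Hypothetical pretest results for one item: proportion correct per age group.
pretest = {3: 0.10, 4: 0.25, 5: 0.50, 6: 0.80, 7: 0.95}
item_scale_value = binet_scale_value(pretest)  # age group 6 is the first to reach 75%
```

A student aged 8 who performs at a mental age of 10 would, under this definition, obtain `ratio_iq(10, 8)`, that is, an IQ of 125.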

This author believes that we should honor Binet primarily for the introduction of his idea of scaling. He dared to measure something that did not exist as a directly observable or manifest variable, nevertheless felt the necessity to map both his items and students on a single scale for it, estimated the scale values of his items using empirical data collected in a pretest, and then scored his students using the scale values of the items—exactly the same practice as followed by any user of IRT today. In fact, Binet’s trust in the scaling of his test items was so strong that he avoided the naïve idea of the need for a standardized test to administer exactly the same items to each of his students. Instead, the items were selected adaptively. The protocol of his intelligence test included a set of rules that precisely prescribed how the selection of the items had to move up and down along the scale as a function of the student’s achievements. We had to wait for nearly a century before Binet’s idea of adaptive testing became generally accepted as the benchmark of efficient testing!

In Binet’s work, several notions were buried which a modern statistician would have picked up immediately. For instance, his method of scaling involved the necessity to estimate population distributions. And with his intuitive ideas about the efficiency of testing, he entered an area now generally known as optimal design in statistics. But, remarkably, Spearman did not recognize the importance of Binet’s work. He remained faithful to his linear model with a true score and measurement error and became the father of factor analysis. (And probably needed quite a bit of his time for a fight with Karl Pearson over his correction of the product–moment correlation coefficient for unreliability of measurement.)

1.2 Louis Thurstone

The one who definitely did recognize the importance of Binet’s work was Louis Thurstone. In 1925, he introduced a method of scaling with the intention to remove the necessity to use age as a manifest substitute for the latent intelligence variable. He did so by exactly reversing what Binet had treated as given and what as to be estimated: Binet used the chronological age of his students as a given quantity and subsequently estimated the unknown shape of the empirical curves of the proportion of correct answers as a function of age for each of his items to identify their scale values. Thurstone, on the other hand, assumed an unknown, latent scale for the items but did impose a known shape on these curves: that of the cumulative normal distribution function, using their estimated location parameters as scale values for the items. As a result, he effectively disengaged intelligence from age, giving it its own scale (and fixing an unforeseen consequence of Terman’s definition of IQ, namely, an automatic decrease of it with chronological age). Figure 1.1 shows the location of a number of Binet’s test items on Thurstone’s new scale, with the zero and unit fixed at the mean and standard deviation of the normalized distribution of the scale values for the 3.5-year-old children in his dataset (Thurstone, 1925). Thurstone also showed how his assumption of normality could be checked empirically, and thus basically practiced the idea of a statistical model as a hypothesis to be tested for its fit to reality. A few years later, he expanded the domain of measurement by showing how something as hitherto vague as an attitude could be measured. The only two things necessary to accomplish this were his new method of scaling and a set of agree–disagree responses to a collection of statements evaluating the object of the attitude. Figure 1.2 shows a few of his response functions for the items in a scale for the measurement of the attitude dimension of pacifism–militarism.

FIGURE 1.1
Thurstone’s scaling of the Binet test items. (Reproduced from Thurstone, L. L. 1925. Journal of Educational Psychology, 16, 433–451.)

FIGURE 1.2
A few items from Thurstone’s scale for pacifism–militarism. (Reproduced from Thurstone, L. L. 1928. American Journal of Sociology, 33, 529–554.)
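To make the idea of a curve with a known shape and an estimated location concrete, the following Python sketch evaluates a cumulative normal response curve and inverts it to recover a location from a single proportion correct. It is only an illustration of the shape assumption, not a reconstruction of Thurstone's scaling procedure; the function names and the `spread` parameter are assumptions introduced for the example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_ogive(scale_point: float, location: float, spread: float = 1.0) -> float:
    """Proportion correct at a point on the latent scale under a cumulative normal curve."""
    return _STD_NORMAL.cdf((scale_point - location) / spread)


def location_from_proportion(p_correct: float, scale_point: float,
                             spread: float = 1.0) -> float:
    """Invert the curve: the location implied by observing p_correct at scale_point.

    p_correct must lie strictly between 0 and 1; inv_cdf raises an error otherwise.
    """
    return scale_point - spread * _STD_NORMAL.inv_cdf(p_correct)


# At its own location, an item is answered correctly half of the time.
half = normal_ogive(scale_point=1.0, location=1.0)

# Round trip with hypothetical values: generate a proportion, then recover the location.
p = normal_ogive(scale_point=0.3, location=-0.4, spread=1.2)
recovered = location_from_proportion(p, scale_point=0.3, spread=1.2)
```

The location is the point on the latent scale at which the curve crosses 50%, which is what allows it to serve as a scale value for the item independently of age.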

In spite of his inventiveness and keen statistical insights, Thurstone’s work was plagued by one consistent aberration—the confusion between the use of the normal ogive as the mathematical shape of a response function and as the distribution function for the score in some population. His confusion was no doubt due to the general idea spread by Quetelet that the “true” distribution of data properly collected for any unrestricted population had to be normal (Stigler, 1986, Chapter 5). This platonic belief was so strong in his days that when an observed distribution did not look normal, it was just normalized to uncover the “true” scale for the test scores, which automatically was supposed to have “equal measurement units.”

In hindsight, it looks as if Thurstone wanted to reconcile this hypothesis of normality with the well-established use of the normal ogive as response function in psychophysics. In any case, the impact of psychophysical scaling on the early history of test theory is unmistakable. As already noted, psychophysicists study the relationships between the strength of physical stimuli and the psychological sensations they evoke in human subjects. Their main method was a fully standardized psychological experiment in which a stimulus of varying strength was administered, for instance, a flash of light of varying clarity or duration, and the subjects were asked to report whether or not they had perceived it. As the probability of perception obviously is a monotonically increasing function of the strength of the stimulus, the choice of a response function with a shape identical to that of the normal distribution function appeared obvious. Thurstone made the same choice but for a latent rather than a manifest variable, with the statistical challenges of having to fit it just to binary response data and the need to evaluate its goodness of fit.

Mosier (1940, 1941) was quite explicit in his description of such parallels between psychophysics and psychometrics. In his 1940 publication, he provided a table with the different terms used to describe analogous quantities in the two disciplines. Others using the same normal-ogive model as response functions for test items at about the same time include Ferguson (1942), Lawley (1943), Richardson (1936), and Tucker (1946). These authors differed from Thurstone, however, in that they fell back completely upon the psychophysical method, using the normal-ogive function just as a nonlinear regression model for the response to the individual items on the observed score on the entire test. Ferguson even used it to regress an external, dichotomous success criterion on the observed scores on several test forms with different difficulties.

1.3 Frederic Lord and George Rasch

The first not to suffer in any way from the confusion between distribution functions and their use as response functions were Lord (1952) and Rasch (1960). In his two-parameter normal-ogive model, Lord directly formulated the normal-ogive function as a mathematical model for the probability of a correct response given the unknown ability θ measured by it. He also used the model to describe the bivariate distributions of observed item scores and the latent ability, explore the limiting frequency distributions of number-correct scores on large tests, and derive the bivariate distributions of number-correct scores on two tests measuring the same ability.
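In the notation that later became standard, the two-parameter normal-ogive model can be written as below. Only θ appears in the text above; the response variable U_i and the item parameters a_i and b_i (usually called discrimination and difficulty) are conventional labels assumed here for illustration.

```latex
% Assumed notation: U_i is the scored response to item i (1 = correct),
% a_i > 0 the discrimination parameter, b_i the difficulty parameter,
% and \Phi the standard normal distribution function.
P(U_i = 1 \mid \theta) = \Phi\bigl(a_i(\theta - b_i)\bigr)
  = \int_{-\infty}^{a_i(\theta - b_i)} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz
```

Under this form, a test taker whose ability equals b_i answers item i correctly with probability 0.5, and a larger a_i makes the curve steeper around that point.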

Rasch was more rigorous in his approach and introduced his model as an attempt to change the paradigm for socia...

Cover
Half title
Title Page
Copyright Page
Contents
Contents for Statistical Tools
Contents for Applications
Preface
Contributors
1. Introduction
Section I Dichotomous Models
Section II Nominal and Ordinal Models
Section III Multidimensional and Multicomponent Models
Section IV Models for Response Times
Section V Nonparametric Models
Section VI Models for Nonmonotone Items
Section VII Hierarchical Response Models
Section VIII Generalized Modeling Approaches
Index