1 Introduction
The reader may have some familiarity with item response theory (IRT) models, which are used in many testing programs. Such models, their properties, and their uses in test development, scaling, equating, and so on, have been described in detail in many articles, books, and research reports (Baker, 1992; de Ayala, 2009; Embretson & Reise, 2000; Fischer & Molenaar, 1995; Hambleton & Swaminathan, 1985; Kolen & Brennan, 2014; Lord, 1980; Lord & Novick, 1968; McDonald, 1999; Mislevy & Stocking, 1987; Rao & Sinharay, 2007; Rasch, 1960; van der Linden, 2016–2017; van der Linden & Hambleton, 1997; Yen & Fitzpatrick, 2006; the special issue of the Journal of Educational Measurement, Volume 14, Number 2, 1977).
The primary purpose of this work is to describe several commonly used IRT models and their use in various aspects of testing. The objective is to present the basic concepts underlying the models and their use in a manner that requires the least amount of mathematical sophistication without compromising accuracy. Hence, other than some coverage in Appendices A and B, no derivations are given and there is little in-depth discussion of the underlying mathematical and statistical concepts. I do include many graphics because I believe that users will come to understand the models and their use by studying how changes in the values of the parameters affect how the items function, and how interpreting graphics can help measurement professionals understand the results of analyses of item response data and build better tests. The very few sections in which I mention mathematical considerations are marked and may safely be omitted on first reading. The interested reader is referred to original and secondary sources for more advanced coverage of the models. In this work, I describe the IRT models currently most commonly used in operational testing programs, how they work, and how they may be used in analyzing test data. Initially, I provide some description of classical test theory (CTT) models and methods, both because they are often used in analyzing data from item tryouts prior to selection for an operational test or survey instrument and because CTT analyses of operational data often precede IRT analyses, providing the user with easily acquired information useful in quality control and initial evaluations of the items.
Thus, this book is designed to help students of psychometrics and testing professionals in the field better understand IRT models and how to use them in their work. Although models and their underlying assumptions are presented and discussed to help the reader understand the models and their use, the primary focus is on understanding and using the results of IRT analyses in testing programs and surveys.
Background and Terminology
For many years, test data were analyzed and studied, and results were reported, using CTT methodology, as presented, for example, by Crocker and Algina (1986), Gulliksen (1987), Lord and Novick (1968, Parts 1 and 2), and Nitko (1983). This was at least partially due to limitations of computer software and hardware. When IRT methods were first developed in the late 1940s and early 1950s (Carlson & von Davier, 2013), computer hardware and software were not sufficiently developed to carry out the complex computations necessary to use IRT efficiently. In addition, many testing programs exclusively used items that were scored on a two-point (dichotomous) scale, such as correct or incorrect, as compared to items scored on a multiple-point (polytomous) scale. As discussed in Chapters 4 and 5, IRT methodology provides better information about polytomously scored items than does CTT. However, the computer algorithms for polytomous items are more complex than those for dichotomous items and were not developed until the 1980s.
IRT models and procedures relate the test taker's proficiency in what is measured on a test or survey instrument to the probability of each possible response to an item on the instrument. Although I use the terms tests and test takers throughout this work, it is important to recognize that the methodology discussed herein also may be applied to data from survey instruments and questionnaires. On such instruments, the dichotomously scored item response data may be based on responses such as yes or no, agree or disagree, and so on. In recent decades, testing professionals have had an increased interest in IRT scaling of various constructed response tasks, essays, performance tests, portfolios, questionnaires, simulations, and so on, scored on a three- or higher-point scale. In response to this interest, there has been a concomitant further development of IRT models and software designed for polytomous item response data.
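As a minimal illustration of the relationship just described, the sketch below assumes a two-parameter logistic form for a single dichotomous item. The function name `prob_correct` and the parameter names `a` (discrimination) and `b` (difficulty) are assumptions made for this example only and are not notation taken from this chapter; the numerical values are hypothetical.

```python
import math


def prob_correct(theta, a=1.0, b=0.0):
    """Probability of a score of 1 on a dichotomous item.

    Assumes a two-parameter logistic form: theta is the test taker's
    proficiency, a the item discrimination, b the item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item (a=1.2, b=0.5) evaluated at three proficiency values.
# For a dichotomous item, the two response probabilities sum to one.
for theta in (-2.0, 0.0, 2.0):
    p1 = prob_correct(theta, a=1.2, b=0.5)
    print(f"theta={theta:+.1f}  P(score=1)={p1:.3f}  P(score=0)={1 - p1:.3f}")
```

Under this assumed form, the probability of a correct response rises smoothly with proficiency and equals one half when the proficiency equals the item difficulty.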
Some polytomous IRT models have been in existence since Rasch's (1960, 1961) seminal work on IRT models. Additional developments have been made, notably by Andersen (1972), Andrich (1978, 1982, 1988), Bock (1972), Masters (1982), Muraki (1992), Samejima (1969, 1972), and Yen (described in Yen & Fitzpatrick, 2006). Modern software allows for analysis of test data from tests comprising both dichotomous and polytomous items.
The author believes that it is very important for measurement professionals to understand a basic fact before proceeding. This fact is that the result of any test administration is a set of responses of test takers to the test items, and these responses are not characteristics only of the items or only of the test takers but are really characteristics of the interaction between the test takers and the items, in the presence of a set of administration conditions. This is true whether the data are analyzed using CTT or IRT methodology. This may seem a small point, but it is an important one. Testing professionals often look at the results of data analyses as ways of answering questions about the quality of the items. Although the data can be used directly to study the quality of the items, one should always keep in mind, when such a study is conducted, the specific test-taker population being assessed and the conditions under which the test was administered and scored and a reporting scale was developed. And, without making explicit assumptions, generalizations beyond the context of those populations and conditions should not be made from such analyses. Further along these lines, for polytomously scored items that are scored either by human scorers or by computerized scoring algorithms, there are additional interactions involved in arriving at the item responses. These include interactions between scorers and scoring rules (rubrics), and between scorers (whether human or computerized) and the responses, as well as a three-way interaction between scorers, rubrics, and item responses. Some current IRT model analyses of test items can provide more information about interactions between test takers and test items, as well as the other interactions, than that typically reported under CTT analyses. Polytomously scored items can also provide more information than dichotomously scored items assessing the same content; Samejima (1969), for example, stated:
it may be that the more profound the items, the less information we get about an examinee's ability, so far as the answers are evaluated dichotomously, i.e., success or failure. In this instance we shall be able to get more information if we modify the items so that we may evaluate their responses in a more graded way, without changing the qualities of the items.
I encourage testing professionals to think about test items, and responses of test takers to the items, in terms of a test taker's thinking processes when interacting with the task presented in the item. When evaluating items by studying item response data, knowing such things as age levels of test takers, language and cognitive development at that age level, and other characteristics will help those professionals understand and interpret the results of analyses.
Although, as I pointed out earlier, the term test item does not adequately describe all the tasks to which the models discussed in this work can be applied, that term will be used throughout for the sake of brevity. Similarly, not all measurement instruments are really tests: some are survey instruments or questionnaires on which responses may be such things as various degrees of agreement with a set of statements, which may be in either dichotomous or polytomous form. The methodology described in this work can be and has been used with such instruments as well as with tests such as those used in education, psychology, and other fields, including licensure and certification testing in many fields. The reader should remember that writing prompts, performance tasks, and so on, are included in this definition of item.
Another matter of terminology also needs some attention: dichotomous and polytomous. Strictly speaking, an item from an assessment instrument need not inherently yield two-point, three-point, and so on, scores. More correctly, the responses, or protocols, representing the interactions of test takers with an item may be scored using some rubric that yields either dichotomous or polytomous values. Again, in the interests of brevity, however, the writer uses dichotomous and polytomous as modifiers of item, item response, item response models, and other relevant measurement terms.
In a similar vein, I use the term proficiency to represent whatever trait, attitude, ability, and so on, the set of items is assumed to be measuring. The singular form of the word is being used here, indicating that I am dealing, initially, with a single dimension of proficiency. The models discussed in the first five chapters all deal with unidimensional IRT models. It should be noted, however, that there are IRT models available, and being used, that deal with multidimensional proficiencies. Some of these are covered in Chapters 6 and 7. New models, and modifications of existing ones, and related software, are continually being developed, so the reader, once basic understanding of IRT has been acquired, should keep informed by reading the current psychometric literature, and attending related presentations at professional meetings, to maintain awareness of new possibilities for analysis of testing data. Attending meetings allows one to meet and interact with others in the field; these are additional learning opportunities.
Finally, the term scaling is used to represent procedures used to estimate parameters of items and to construct score scales on which results will be reported. These procedures are used to study and infer relationships of item scores to the underlying proficiency variable that is the object of the testing. The underlying proficiency is an unobservable variable often referred to as a latent trait. As a matter of fact, IRT was initially called latent trait theory. The term calibration is also used for procedures for estimating parameters of the items. Furthermore, the development of scales covering variables designed in a context of measuring changes over time, or across grade levels, is also referred to as scaling, but usually as vertical scaling.
Contents of the Following Chapters
The second chapter provides a brief discussion of CTT and an overview of dichotomous IRT models and some comparisons between these classes of models and their assumptions. This is followed by detailed explanations of IRT concepts, designed to help the reader better understand and use the results of IRT analyses. Certain terminology and concepts that generalize to the polytomous case are emphasized in forms designed to help the reader understand the similarities and differences between these two classes of models. The models themselves are formulated in such a way as to aid in this understanding. This will help ease the user's transition from the use of dichotomous to polytomous IRT models covered in later chapters.
The third chapter is devoted to illustrations of actual analyses of item response data, first by CTT methodology and then by IRT methodology. The data used in the illustrations were created using simulated item responses. This allows for comparison of the results of analyses with the item parameters and test taker proficiencies used to generate the data. When analyzing operational test data, of course, the parameters and proficiencies are unknown, and the purpose of analyses is to estimate them. Two datasets are used; one involving very few items and test takers because that allows for more detailed illustrations, the other a much larger dataset that is more appropriate for IRT analyses.
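The logic of working with simulated data can be sketched as follows. This is only an illustration under stated assumptions, not the procedure or the datasets used in Chapter 3: it assumes a two-parameter logistic form for dichotomous items, and the function name `simulate_responses`, the three item parameter pairs, the sample size, and the random seeds are all hypothetical. Because the generating values are known, a summary computed from the simulated responses (here the CTT proportion correct for each item) can be compared with them.

```python
import math
import random


def simulate_responses(thetas, items, seed=0):
    """Simulate 0/1 item scores, one row per test taker.

    items is a list of (a, b) pairs: discrimination and difficulty
    under an assumed two-parameter logistic form.
    """
    rng = random.Random(seed)
    data = []
    for theta in thetas:
        row = []
        for a, b in items:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            # Score 1 with probability p, otherwise 0.
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Hypothetical generating values: 2000 proficiencies drawn from a standard
# normal distribution, and an easy, a medium, and a hard item.
rng = random.Random(1)
true_thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
true_items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
responses = simulate_responses(true_thetas, true_items, seed=2)

# CTT proportion correct for each item, to compare with the known difficulties.
n = len(responses)
p_values = [sum(row[j] for row in responses) / n for j in range(len(true_items))]
print(p_values)
```

With operational data the generating values are unknown, so this kind of direct check is possible only in a simulation.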
In the fourth chapter I describe, in detail, the generalized partial credit (GPC; Muraki, 1992) model, also known as the two-parameter partial credit (2PPC; developed by Wendy Yen in 1991, as described in Yen & Fitzpatrick, 2006, p. 117) model. For simplicity, I refer to this model simply as the GPC model in most places. This model was selected because, in the writer's experience, it is currently the most commonly used model in testing programs that use polytomous models. An alternative, the graded response (GR; Samejima, 1969) model, is also discussed, as well as the similarities and differences between these models.
Chapter 5 parallels Chapter 3 in that it provides illustrations and descriptions of IRT analysis procedures, including those for tests comprising both dichotomous and polytomous items. Some computer programs that are available for scaling a mixture of dichotomously and polytomously scored items are described and illustrations are provided, again using simulation data. In that chapter, I also describe procedures that test development and other measurement professionals may find useful in assessing item response data to which the GPC model has been fit. The model may not always fit the data in the form specified for initial calibration. These procedures can provide suggestions for modifications to items, or the scoring rubric, that may improve the fit. They may also provide suggestions for modifying the model or selecting a different model.
In Chapter 6, I introduce and describe multidimensional item response theory (MIRT) models that have been proposed and used in the analysis of item response data. Some computer programs that may be used to analyze data, assuming these models are appropriate, are also referred to. These models are much more complex than those for unidimensional test data and the primary purpose of this chapter is to introduce and explain the models.
Chapter 7 continues the coverage of MIRT models with analyses of a two-dimensional simulated dataset. This chapter parallels the illustrations of dichotomous and polytomous unidimensional analyses in Chapters 3 and 5. I also analyze these data with a one-dimensional model for comparative purposes.
In Chapter 8, I briefly describe additional literature on more complex models that have been developed and used for analyzing item response data from operational testing programs. I do not present details of the models and their use; such details are beyond the scope of this book. The purpose of this chapter is simply to introduce these models to the readers so they become aware of the wide variety of applications of IRT that are available to them. I provide references in which those interested in the more complex models can acquire additi...