Item Generation for Test Development
eBook - ePub

  1. 444 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Since the mid-1980s, several laboratories around the world have been developing techniques for the operational use of tests derived from item generation. According to the experts, the major thrust of test development in the next decade will be the harnessing of item-generation technology to the production of computer-developed tests. This is expected to revolutionize the way in which tests are constructed and delivered.

This book is a compilation of the papers presented at a symposium held at ETS in Princeton, attended by the world's foremost experts in item-generation theory and practice. Its goal is to present the major applications of cognitive principles in the construction of ability, aptitude, and achievement tests. It is an intellectual contribution to test development that is unique, with great potential for changing the ways tests are generated. The intended market includes professional educators and psychologists interested in test generation.


Information

Editors: Sidney H. Irvine, Patrick C. Kyllonen
Publisher: Routledge
Year: 2013
Print ISBN: 9780805834413
I
PSYCHOMETRIC AND COGNITIVE THEORY OF ITEM GENERATION

1

The Foundations of Item Generation for Mass Testing

Sidney H. Irvine
University of Plymouth

The Scientific Basis of Item Generation

When Cronbach (1957) called for the unification of experimental and correlational universes of discourse in psychology, it was not a consummation that, even if devout and desirable, could occur immediately for tests and measurements. And even now, if a degree of confluence has been achieved in the concepts and operational definitions of item-generation theory, applications to test construction are widespread neither in the domains of test content, nor in the use that is made of theory by large-scale test constructors. Where it is used, however, item-generation theory provides a remarkably robust test-construction medium. To enable a perspective on the state of the art, historical and theoretical influences on the derivation of tests for initial screening of job applicants are outlined and then reviewed.

Origins

The origins of item-generation theory are, as in all new branches of science, more a matter of ostensive than precise definition. One could paraphrase Spearman on intelligence and declare that we do not yet know what item-generation theory is, only where it may be found. Ostensive definitions in published materials are available as historical landmarks; and one may readily fix the location of these in the following: Bartram (1987), Bejar (1986a, 1986b, 1986c), Carroll (1976, 1980, 1983, 1986, 1987), Christal (1984), Collis, Dann, Irvine, Tapsfield, and Wright (1995), Dann and Irvine (1986), Dennis (1993), Dennis, Collis, and Dann (1996), Embretson (1996), Goeters and Rathje (1992), Hornke and Habon (1986), Irvine, Dann, and Anderson (1990), Kyllonen and Christal (1989, 1990), and Mislevy, Wingersky, Irvine, and Dann (1991). Much of the research activity on item generation predates eventual publication by some years, but a new field of algorithm-based test construction was being charted in a number of geographically distant centres from about 1985. History will also relate that these early attempts at item generation (and also at predicting item difficulty from item elements) were the result of much original and creative work that took place in relative scientific isolation.

Theoretical Substrates

Within these sources are embedded not one grand design of overarching theory, but a number of theoretical substrates, representing the erstwhile two disciplines of psychology—one seeking main effects in controlled cognitive experiments, and the other looking for underlying domains and dimensions of abilities in correlation matrices varying in extent and robustness (Carroll, 1993). As far as mass testing movements are concerned, the major influences on the development of the operational British Army Recruit Battery have already been published in Irvine et al. (1990), and on the USAF CAM Experimental Battery in Kyllonen and Christal (1989, 1990). Nevertheless, the benefit of hindsight enables a sharper focus to modify and make more evident details brought to mind by selective attention.
There are at least three measurement paradigms that qualify how item generation has developed in the past and may yet grow in future. These have been described in detail elsewhere (Irvine, Dann, & Evans, 1987; Irvine, Dann, Evans, Dennis, Collis, Thacker, & Anderson, 1989) as R (Accuracy), L (Latency), and D (Dynamic or Change) Models. In the interests of brevity and clarity, their influence on test construction methods is summarized below.

R-Models. Those who favor accuracy or R-Models mark items as right and wrong and may use classical test theory, in which true score variance and error variance are the two determinants of reliability—a notion that began with Spearman. Alternatively, item response theory is employed, a technology dating from 1960 and largely associated with Educational Testing Service (ETS) and Fred Lord (1980). Although many primary sources could be cited to do justice to all those who have contributed to the refinement of R-Models through the use of classical or modern test theory, a balanced contemporary overview can be seen in Crocker and Algina (1987) and in Wainer and Messick (1983). The elaboration of R-Models through item response theory has developed as a method of shortening tests, of replacing or replenishing item-banks annually produced by experts, and of equating one test form with another. They are part of the large item-bank generation and replenishment process that is used in the SAT test models developed at ETS, in the Graduate Record Examination, and in the Armed Services Vocational Aptitude Battery (ASVAB) used in the assessment of applicants to all arms of the United States military. The assumptions and practices of item response theory applied to right and wrong answers are at the heart of early and not altogether successful attempts to introduce computerised adaptive testing.
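
As a purely illustrative aside (the chapter gives no formulas), the item response theory referred to here can be sketched with the three-parameter logistic model, in which the probability of answering an item correctly depends on ability and on the item's discrimination, difficulty, and guessing parameters; all parameter values below are invented:

```python
import math

def three_pl_probability(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic
    (3PL) model: discrimination a, difficulty b, pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate difficulty (b = 0.5), good discrimination
# (a = 1.2), and a one-in-five guessing floor (c = 0.2).
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(three_pl_probability(theta, a=1.2, b=0.5, c=0.2), 3))
```
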
L-Models. L-Models use time, or latency, to distinguish fast from slow performance. Moreover, they have relied on differences from baseline times, slopes of times when task difficulty has increased, and intercepts when latency has been made to increase over homogeneous items by their external manipulation. Much has been written of reaction-time paradigms, and attempts have been made to determine stage processes by decomposing gross times into estimates of stage times within individuals. The experimental literature dealing with main effects in cognitive tasks directly related to ability formation testifies to their universality. For example, Chase (1969), Clark and Chase (1972), Evans (1982), and Miller and McKean (1964) provide specific latency performance models for deductive reasoning tasks. The use of these methods to generate test scores for individuals suffers, nevertheless, from the inadequacy of procedures for estimating individual differences in abilities from them (cf. Lohman, 1994). Such individual scores as may be generated become even more problematic when structural relations among stage processing measures are sought by correlational methods. Latency measures within a fixed time interval are invariably experimentally dependent upon each other (Sternberg, 1977), making traditional validation by intercorrelation and latent-trait methods at best risky and at worst tendentious. It is hardly surprising that Lohman (1994) describes attempts to produce a unified theory of measurement derived from latency studies of process stages as a qualified failure. Nevertheless, attempts to grapple with a model for individual differences in latencies have a long history (Dennis & Evans, 1996; Furneaux, 1952; Restle & Davis, 1962; White, 1982; Wright, 1997).
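
To make the slope-and-intercept language concrete, here is a minimal sketch (with invented data, not drawn from the studies cited) of fitting one examinee's mean latency against task difficulty by ordinary least squares; the intercept estimates baseline processing time and the slope the cost of each added unit of difficulty:

```python
def fit_latency_line(difficulty, latency_ms):
    """Ordinary least-squares fit of mean latency on task difficulty."""
    n = len(difficulty)
    mean_x = sum(difficulty) / n
    mean_y = sum(latency_ms) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(difficulty, latency_ms))
    sxx = sum((x - mean_x) ** 2 for x in difficulty)
    slope = sxy / sxx                    # milliseconds added per difficulty step
    intercept = mean_y - slope * mean_x  # estimated baseline processing time
    return intercept, slope

# One hypothetical examinee's mean latencies (ms) at four difficulty levels.
print(fit_latency_line([1, 2, 3, 4], [620.0, 710.0, 805.0, 930.0]))
```
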
D-Models. Dynamic, learning, or D-Models involve the repeated measurement of individuals while they are learning either the task they are performing, or some other task whose outcome the task being measured is expected to predict. While they may be constructed either from scores for accuracy (R) or latency (L), D-Models operate most effectively when they require some asymptotic level of performance in the predictor, or the criterion, or both. Even if elegant mathematical models that precisely allocate main effects have been in place for some time (see, e.g., Neimark & Estes [1967] on stimulus sampling theory), they have had little or no lasting influence on the measurement of individual differences. Change scores are difficult to use in regression equations unless they are highly reliable. The need to preserve serial independence of items that are generated is nevertheless paramount in the exercise of test theory (Royer, 1971). What the subject may be predicted to learn, or, operationally just as salient, be prevented from learning during the test greatly influences the choice of item-generation algorithms.
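
As a sketch only, and assuming a simple exponential approach to asymptote (the chapter does not commit to any particular learning function), a D-Model might describe expected accuracy rising with practice toward a ceiling; every parameter value here is invented:

```python
import math

def learning_curve(trial, initial=0.4, asymptote=0.9, rate=0.35):
    """Expected proportion correct on a given practice trial, rising
    exponentially from an initial level toward an asymptote."""
    return asymptote - (asymptote - initial) * math.exp(-rate * trial)

# Invented parameters: accuracy climbs from .40 toward a .90 ceiling.
for trial in (0, 3, 6, 9):
    print(trial, round(learning_curve(trial), 3))
```
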

The Fourth Estate

On the whole, the work that went into creating a large-scale operational model from tests that were wholly item generative, and thereby guaranteed a new test for every applicant, deliberately collected data that would reveal aspects of these three models. In the outfiles that were generated for each subject, the item order, item characteristics, latencies for each stage of item delivery and response, and the correct and the actual response were all collected. Nevertheless, test scores were invariably constructed around R-Models adjusted for guessing, without totally resolving the question of speed–accuracy trade-off (Dennis & Evans, 1996). To enable the outfiles to be created, the nature of the changes in testing brought about by the microcomputer had to be understood and invoked as principles.
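
The chapter does not state the battery's exact scoring rule, so the following is only the textbook correction for guessing that an R-Model score "adjusted for guessing" typically implies; the number of response options per item is an assumption for the example:

```python
def corrected_score(num_right, num_wrong, options_per_item):
    """Number-right score with the standard guessing penalty: R - W / (k - 1)."""
    return num_right - num_wrong / (options_per_item - 1)

# Hypothetical outfile totals: 30 right, 10 wrong on five-option items.
print(corrected_score(30, 10, 5))  # 27.5
```
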
Traditionally, large-scale testing was and still is carried out in groups, using paper-and-pencil tests as the medium. This technology restricted the range of operational variables that could be constructed and defined ability theory in a very constrained fashion. Indeed, item response theory was a function of that delivery system because of the need to equate annual paper-and-pencil aptitude test forms that were never quite parallel in the hands of item-construction teams. Much of the early literature on computer-based testing is preoccupied with transferring old paper-and-pencil tests to computers to see if they will produce the same results. This apart, other research teams concerned with the promise of computer-adaptive testing used the computer to administer individually tailored tests.
For progress to be made, the microcomputer could not become an expensive means of continuing paper-and-pencil test conventions into the millennium. Moreover, the capacity of microcomputers to shape the future of measurement was realised much earlier than scientists' capacity to deliver the necessary changes. At least two independent major reports outlined where decision-making functions could be left to algorithms in the machine (Bunderson, Inouye, & Olsen, 1988; Dennis & Evans, 1989). Today, computer-based testing can be said to have increased the boundaries of theory to such an extent that the limits to mental measurement require a new form of boundary specification. Here, confined to the obvious, tests in computers have been defined, in the sense of providing a key to understanding their new operational dimensions, by hardware, by software, and by knowledge-based systems created for score production according to preconceived paradigms.
The most important context for item generation is the microcomputer itself: not its operating system, but the way in which the ergonomics of test delivery and subject response may serve to shape the tests themselves. The microcomputer can be a variable, or more exactly can be made to define a number of quasi-independent variables (called radicals) that will constrain and alter the nature of the mental process to be measured. These variables include display mode, information sequence, and response mode. Table 1.1 summarises these.
Table 1.1
Microcomputer Display and Response Variables Affecting Test Scores
Display       Information    Response
Static        Modality       Keyboard
Moving        Amount         Console
Sequential    Order          Touch Screen
Interactive   Pace           Mouse
              Quality        Voice
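
Purely as an illustration of how the Table 1.1 variables might be treated as radicals by an item-generation program (the class and value names below are assumptions for the sketch, not the chapter's implementation), crossing the three columns enumerates the space of deliverable item configurations:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ItemTemplate:
    display: str      # Static, Moving, Sequential, Interactive
    information: str  # Modality, Amount, Order, Pace, Quality
    response: str     # Keyboard, Console, Touch Screen, Mouse, Voice

DISPLAYS = ("Static", "Moving", "Sequential", "Interactive")
INFORMATION = ("Modality", "Amount", "Order", "Pace", "Quality")
RESPONSES = ("Keyboard", "Console", "Touch Screen", "Mouse", "Voice")

# Crossing the radicals enumerates the design space of deliverable item types.
templates = [ItemTemplate(d, i, r) for d, i, r in product(DISPLAYS, INFORMATION, RESPONSES)]
print(len(templates))  # 100 candidate configurations
```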

Display Modes. Visual displays can be made to vary from a completely static representation of an item (as if it were transferred without change from paper to screen) to a wholly dynamic item where parts actually move (as in an arcade game environment where movement is a precondition). Old-fashioned apparatus tests (using puzzles, cards, beads, jig-saw pieces) administered to individuals also introduce movement and manipulation of the apparatus as sources of variation, for example in the so-called Non-verbal, but more accurately Performance, scales of omnibus intelligence tests administered individually. Interactive modes allow changes in displays depending on responses. This can be seen in adaptive paper-and-pencil questionnaires that require people to go on to answer different questions, depending on whether the answer is yes or no to any one question. Note that the information categories in particular may vary within each one. Quality refers to the amount of degradation on the screen. Order, pace, and amount of information are no...

Table of contents

  1. Cover
  2. Half Title
  3. Item Generation for Test Development
  4. Copyright
  5. Contents
  6. Foreword
  7. Item Generation for Test Development: An Introduction
  8. Acknowledgments
  9. Prologue and Epilogue: Remembering Samuel J. Messick
  10. PART I: PSYCHOMETRIC AND COGNITIVE THEORY OF ITEM GENERATION
  11. PART II: CONSTRUCT-ORIENTED APPROACHES TO ITEM GENERATION
  12. PART III: FROM THEORY TO IMPLEMENTATION
  13. PART IV: APPLICATIONS OF ITEM-GENERATIVE PRINCIPLES
  14. Author Index
  15. Subject Index
  16. About the Editor