Part I
Initial Considerations for Automatic Item Generation

Chapter 1
Automatic Item Generation: An Introduction
Mark J. Gierl and Thomas M. Haladyna
A major motivation for this book is to improve the assessment of student learning, regardless of whether the context is K-12 education, higher education, professional education, or training. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) described assessment in this way:
Any systematic method of obtaining information from tests and other sources, used to draw inferences about characteristics of people, objects, or programs. (p. 112)
A test or exam is one of the most important sources of information for the assessment of student learning. Our book is devoted to improving the assessment of student learning through the design, development, and validation of test items of superior quality.
Although automatic item generation (AIG) has a relatively short history, it holds much promise for exciting, innovative, and valuable technologies for item development and validation. The chapters in this book provide a coordinated and comprehensive account of the current state of this emerging science. Two concepts of value to readers of this volume are the content and cognitive demand of achievement constructs. Every test item has a content designation and an intended cognitive demand. The content of an achievement construct is generally considered as existing in one of two types of domains. The first type of domain consists of using knowledge, skills, and strategies in complex ways. A review of national, state, and school district content standards for reading, writing, speaking, and listening, and for mathematical and scientific problem solving provides good examples of this first type of domain. The second type of domain focuses on a single cognitive ability. Abilities are complex mental structures that grow slowly (Lohman, 1993; Messick, 1984; Sternberg, 1998); a domain of this type therefore represents one of these cognitive abilities. For credential testing in the professions, the domain consists of tasks performed in that profession (Raymond & Neustel, 2006). Kane (2006a, 2006b) refers to these tasks as existing in a target domain, and test items are intended to model the tasks found in the target domain. References to knowledge, skills, and abilities encompass either of these two types of constructs as they exist in current measurement theory and practice.
Cognitive demand refers to the mental complexity involved in performing a task. The task might be a test item where the learner selects among choices or creates a response to an item, question, or command. One critic referred to the classification of cognitive demand as a conceptual "swamp," owing to the profusion of terms used to describe various types of higher-level thinking (Lewis & Smith, 1993). More recently, Haladyna and Rodriguez (in press) listed 25 different terms signifying higher-level thinking. For our purposes, no taxonomy of higher-level thinking has been validated or widely accepted on scientific grounds, and none is advocated here. However, several contributors to this book describe useful methods for uncovering ways to measure complex cognitive behaviors via AIG.
Our introductory chapter has two main sections. The first section presents a context for change in educational measurement that features AIG. The second section provides a brief summary of the chapters in this book and highlights their interrelationships.
A Context for Automatic Item Generation
The Greek philosopher Heraclitus (c. 535–475 BC) provided some foresight into the state of 21st-century educational measurement when he claimed that the only constant was change. Educational measurement is evolving rapidly, driven by interdisciplinary forces stemming from the fusion of the cognitive sciences, statistical theories of test scores, professional education and certification, educational psychology, operations research, educational technology, and computing science. These interdisciplinary forces are also creating exciting new opportunities for both theoretical and practical change. Although many different examples could be cited, the state of change is most clearly apparent in the areas of computer-based testing, test design, and cognitive diagnostic assessment. These three examples are noteworthy as they relate to the topics described in this book: changes in computerized testing, test design, and diagnostic testing will directly affect the principles and practices that guide the design and development of test items.
Example #1: Computer-Based Testing
Computer-based testing, our first example, is dramatically changing educational measurement research and practice: current test administration procedures are merging with the growing popularity of digital media and the explosion in internet use to create the foundation for new types of tests and testing resources. As a historical development, the transition from paper- to computer-based testing has been occurring for some time. Considerable groundwork for this transition can be traced to the early research, development, and implementation efforts focused on computerizing and adaptively administering the Armed Services Vocational Aptitude Battery, beginning in the 1960s (see Sands, Waters, & McBride, 1997). A computer-adaptive test is a paperless test administered by computer, using a testing model that selects and administers items, scores the examinee's responses, and updates the examinee's ability estimate after each item is administered. This process of selecting new items based on the examinee's responses to previously administered items continues until a stopping rule is satisfied, at which point there is considerable confidence in the accuracy of the score. The pioneers and early proponents of computer-adaptive testing were motivated by the potential benefits of this testing approach, which included shorter tests without a loss of measurement precision, enhanced score reliability (particularly for low- and high-ability examinees), improved test security, testing on demand, and immediate test scoring and reporting. The introduction and rapid expansion of the internet have enabled many recent innovations in computerized testing. Examples include computer-adaptive multistage testing (Luecht, 1998; Luecht & Nungester, 1998; see also Luecht, this volume), linear on-the-fly testing (Folk & Smith, 2002), testlet-based computer-adaptive testing (Wainer & Kiely, 1987; Wainer & Lewis, 1990), and computerized mastery testing (Lewis & Sheehan, 1990). Many educational tests that were once given in a paper format are now administered by computer via the internet. Well-known examples include the Graduate Management Admission Test, the Graduate Record Examination, the Test of English as a Foreign Language, the Medical Council of Canada Qualifying Exam Part I, and the American Institute of Certified Public Accountants Uniform CPA Examination. Education Week's 2009 Technology Counts also reported that almost half the U.S. states now administer some form of internet-based computerized educational test.
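To make the adaptive process described above more concrete, the sketch below implements a minimal computer-adaptive test loop in Python. It is an illustration only, not the procedure used by any of the programs cited above: the two-parameter logistic (2PL) item model, the simulated item bank, the maximum-information selection rule, the EAP ability update, and the standard-error stopping rule are all assumptions chosen for simplicity; operational programs add exposure control, content balancing, and far more elaborate scoring machinery.

import math
import random

# Illustrative 2PL item parameters (a = discrimination, b = difficulty).
# These values are invented for the sketch, not drawn from any real bank.
ITEM_BANK = [{"a": random.uniform(0.8, 2.0), "b": random.uniform(-2.5, 2.5)}
             for _ in range(200)]

THETA_GRID = [x / 10.0 for x in range(-40, 41)]  # quadrature points from -4 to 4


def p_correct(theta, item):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))


def information(theta, item):
    """Fisher information contributed by an item at ability theta."""
    p = p_correct(theta, item)
    return item["a"] ** 2 * p * (1.0 - p)


def eap_estimate(responses):
    """EAP ability estimate and posterior SD, assuming a standard normal prior."""
    posterior = []
    for theta in THETA_GRID:
        weight = math.exp(-0.5 * theta ** 2)  # normal prior kernel
        for item, score in responses:
            p = p_correct(theta, item)
            weight *= p if score == 1 else (1.0 - p)
        posterior.append(weight)
    total = sum(posterior)
    mean = sum(t * w for t, w in zip(THETA_GRID, posterior)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(THETA_GRID, posterior)) / total
    return mean, math.sqrt(var)


def administer_cat(true_theta, max_items=40, se_target=0.30):
    """Select items by maximum information, update theta, stop on SE or length."""
    available = list(ITEM_BANK)
    responses = []
    theta, se = 0.0, float("inf")
    while available and len(responses) < max_items and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        item = max(available, key=lambda it: information(theta, it))
        available.remove(item)
        # Simulate the examinee's response from the (known) true ability.
        score = 1 if random.random() < p_correct(true_theta, item) else 0
        responses.append((item, score))
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)


if __name__ == "__main__":
    est, se, n = administer_cat(true_theta=1.0)
    print(f"Estimated theta = {est:.2f} (SE = {se:.2f}) after {n} items")

Running the sketch shows the essential adaptive behavior: the test typically terminates well before the maximum length once the posterior standard error falls below the target, which is the mechanism behind the "shorter tests without a loss of measurement precision" benefit noted above.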
Internet-based computerized tests offer many advantages to students and educators compared with more traditional paper-based tests. For instance, computers enable the development of innovative item types and alternative item formats (Sireci & Zenisky, 2006; Zenisky & Sireci, 2002); items on computer-based tests can be scored immediately, thereby providing examinees with instant feedback (Drasgow & Mattern, 2006); and computers permit continuous testing and testing on demand (van der Linden & Glas, 2010). Possibly the most important advantage, however, is that computer-based testing allows testing agencies to integrate test items with digital media and thereby substantially improve the measurement of complex thinking (Bartram, 2006; Zenisky & Sireci, 2002).
The advent of computer-based internet testing has also raised new challenges, particularly in the area of item development. Large numbers of items are needed to build the item banks necessary for computerized testing because items are continuously administered and, therefore, exposed. As a result, these item banks need frequent replenishing in order to minimize item exposure and maintain test security. Unfortunately, traditional item development requires content experts to use test specifications and item-writing guides to author each item, and this process is very expensive. Rudner (2010) estimated that the cost of developing one operational item using the current approach in a high-stakes testing program can range from $1,500 to $2,000. It is not hard to see that, at this price, the cost of developing a large item bank becomes prohibitive. Breithaupt, Ariel, and Hare (2010) recently claimed that a high-stakes 40-item computer-adaptive test with two administrations per year would require, at minimum, a bank containing 2,000 items. Combined with Rudner's per-item estimate, this requirement would translate into a cost ranging from $3,000,000 to $4,000,000 for the item bank alone. Part of this cost stems from the need to hire subject-matter experts to develop test items; when large numbers of items are required, more subject-matter experts are needed. Another part of this cost is rooted in quality-control outcomes. Because the cognitive structure of items is seldom validated and the determinants of item difficulty are poorly understood, all new test items must be field tested prior to operational use so that their psychometric properties can be documented. Many field-tested items do not perform as intended and must be either revised or discarded, which further contributes to the cost of item development. Haladyna (1994) stated, for example, that as many as 40% of expertly created items fail to perform as intended during field testing. In short, agencies that adopt computer-based testing face the daunting task of creating thousands of new and expensive items for their testing programs. To help address this task, the principles and practices that guide AIG are presented in this book as an alternative method for producing operational test items.
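The arithmetic behind these figures is straightforward, and the short sketch below makes it explicit. The per-item cost, bank size, and field-test failure rate are the estimates cited above; the assumption that failed items are discarded outright (rather than revised) is a simplification added here for illustration.

import math

# Back-of-the-envelope item bank costs, using the figures cited in the text.
COST_PER_ITEM = (1_500, 2_000)   # Rudner (2010) estimate, in USD
BANK_SIZE = 2_000                # Breithaupt, Ariel, and Hare (2010) minimum
FIELD_TEST_FAILURE_RATE = 0.40   # Haladyna (1994) upper bound

# Direct development cost for a bank of 2,000 operational items.
low, high = (BANK_SIZE * c for c in COST_PER_ITEM)
print(f"Bank development cost: ${low:,} to ${high:,}")  # $3,000,000 to $4,000,000

# If up to 40% of authored items fail field testing and are simply discarded
# (a simplifying assumption), substantially more items must be written to
# yield 2,000 usable ones.
authored = math.ceil(BANK_SIZE / (1 - FIELD_TEST_FAILURE_RATE))
print(f"Items authored to yield {BANK_SIZE:,} survivors: about {authored:,}")

Under these assumptions, roughly 3,334 items would have to be authored to end up with a 2,000-item bank, which is why field-test attrition compounds the per-item cost so sharply.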
Example #2: Test Design
Although the procedures and practices for test design and for developing items for traditional paper-and-pencil testing are well established (see Downing & Haladyna, 2006; Haladyna, 2004; Schmeiser & Welch, 2006), advances in computer technology are fostering new approaches to test design (Drasgow, Luecht, & Bennett, 2006; Leighton & Gierl, 2007a; Mislevy, 2006). Prominent new approaches that differ from more traditional ones are emerging, including the cognitive design system (Embretson, 1998), evidence-centered design (Mislevy, Steinberg, & Almond, 2003; Mislevy & Riconscente, 2006), and assessment engineering (Luecht, 2006, 2007, 2011). Although these new approaches differ in important ways from one another, they are united by the view that the science of educational assessment should guide test design, development, administration, scoring, and reporting practices. We highlight the key features of one of these approaches, assessment engineering (AE; Luecht, Chapter 5, this volume). AE is an innovative approach to measurement practice in which engineering-based principles and technology-enhanced processes direct the design and development of tests as well as the analysis, scoring, and reporting of test results. This design approach begins by defining the construct of interest using specific, empirically derived construct maps and cognitively based evidence models. These maps and models outline the knowledge and skills that examinees need to master in order to perform tasks or solve problems in the domain of interest. Next, task models are created to produce replicable assessment resources. A task model specifies a class of tasks by outlining the shared knowledge and skills required to solve any task in that class. Templates are then created to produce test items with predictable difficulty that measure the content for a specific task model. Finally, a statistical model is used for examinee response d...