I
PSYCHOMETRIC AND COGNITIVE
THEORY OF ITEM GENERATION
1
The Foundations of Item Generation for Mass Testing
Sidney H. Irvine
University of Plymouth
The Scientific Basis of Item Generation
When Cronbach (1957) called for the unification of experimental and correlational universes of discourse in psychology, it was not a consummation that, even if devout and desirable, could occur immediately for tests and measurements. And even now, if a degree of confluence has been achieved in the concepts and operational definitions of item-generation theory, applications to test construction are widespread neither in the domains of test content nor in the use that large-scale test constructors make of theory. Where it is used, however, item-generation theory provides a remarkably robust test-construction medium. To enable a perspective on the state of the art, historical and theoretical influences on the derivation of tests for initial screening of job applicants are outlined and then reviewed.
Origins
The origins of item-generation theory are, as in all new branches of science, more a matter of ostensive than precise definition. One could paraphrase Spearman on intelligence and declare that we do not yet know what item-generation theory is, only where it may be found. Ostensive definitions in published materials are available as historical landmarks; and one may readily fix the location of these in the following: Bartram (1987), Bejar (1986a, 1986b, 1986c), Carroll (1976, 1980, 1983, 1986, 1987), Christal (1984), Collis, Dann, Irvine, Tapsfield, and Wright (1995), Dann and Irvine (1986), Dennis (1993), Dennis, Collis, and Dann (1996), Embretson (1996), Goeters and Rathje (1992), Hornke and Habon (1986), Irvine, Dann, and Anderson (1990), Kyllonen and Christal (1989, 1990), and Mislevy, Wingersky, Irvine, and Dann (1991). Much of the research activity on item generation predates eventual publication by some years, but a new field of algorithm-based test construction was being charted in a number of geographically distant centres from about 1985. History will also relate that these early attempts at item generation (and also at predicting item difficulty from item elements) were the result of much original and creative work that took place in relative scientific isolation.
Theoretical Substrates
Within these sources are embedded not one grand design of overarching theory, but a number of theoretical substrates, representing the erstwhile two disciplines of psychology—one seeking main effects in controlled cognitive experiments, and the other looking for underlying domains and dimensions of abilities in correlation matrices varying in extent and robustness (Carroll, 1993). As far as mass testing movements are concerned, the major influences on the development of the operational British Army Recruit Battery have already been published in Irvine et al. (1990), and on the USAF CAM Experimental Battery in Kyllonen and Christal (1989, 1990). Nevertheless, the benefit of hindsight allows a sharper focus on details that selective attention at the time left less evident.
There are at least three measurement paradigms that qualify how item generation has developed in the past and may yet grow in future. These have been described in detail elsewhere (Irvine, Dann, & Evans, 1987; Irvine, Dann, Evans, Dennis, Collis, Thacker, & Anderson, 1989) as R (Accuracy), L (Latency), and D (Dynamic or Change) Models. In the interests of brevity and clarity, their influence on test construction methods is summarized below.
R-Models. Those who favor accuracy or R-Models mark items as right and wrong and may use classical test theory, in which true score variance and error variance are the two determinants of reliability—a notion that began with Spearman. Alternatively, item response theory is employed, a technology dating from 1960 and largely associated with Educational Testing Service (ETS) and Fred Lord (1980). Although many primary sources could be cited to do justice to all those who have contributed to the refinement of R-Models through the use of classical or modern test theory, a balanced contemporary overview can be seen in Crocker and Algina (1987) and in Wainer and Messick (1983). The elaboration of R-Models through item response theory has developed as a method of shortening tests, of replacing or replenishing item-banks annually produced by experts, and of equating one test form with another. These methods are part of the large-scale item-bank generation process used in the SAT test models developed at ETS, in the Graduate Record Examination, and in the Armed Services Vocational Aptitude Battery (ASVAB) used in the assessment of applicants to all arms of the United States military. The assumptions and practices of item response theory applied to right and wrong answers are at the heart of early and not altogether successful attempts to introduce computerised adaptive testing.
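To make the R-Model machinery concrete, the following is a minimal sketch, not drawn from the chapter itself, of the three-parameter logistic item response function that underlies item-banking and adaptive testing programmes of the kind described above; the parameter values are illustrative only.

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the three-parameter logistic
    (3PL) model: a = discrimination, b = difficulty, c = guessing asymptote."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative item (a=1.2, b=0.5, c=0.2) answered by examinees of
# low, average, and high latent ability theta.
for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  P(correct)={p_correct_3pl(theta, 1.2, 0.5, 0.2):.3f}")
```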
L-Models. L-Models use time, or latency, to distinguish fast from slow performance. Moreover, they have relied on differences from baseline times, on slopes of latencies as task difficulty increases, and on intercepts when latency over homogeneous items has been made to increase by external manipulation. Much has been written of reaction-time paradigms, and attempts have been made to determine stage processes by decomposing gross times into estimates of stage times within individuals. The experimental literature dealing with main effects in cognitive tasks directly related to ability formation testifies to their universality. For example, Chase (1969), Clark and Chase (1972), Evans (1982), and Miller and McKean (1964) provide specific latency performance models for deductive reasoning tasks. The use of these methods to generate test scores for individuals suffers, nevertheless, from the inadequacy of procedures for estimating individual differences in abilities from them (cf. Lohman, 1994). Such individual scores as may be generated become even more problematic when structural relations among stage processing measures are sought by correlational methods. Latency measures within a fixed time interval are invariably experimentally dependent upon each other (Sternberg, 1977), making traditional validation by intercorrelation and latent-trait methods at best risky and at worst tendentious. It is hardly surprising that Lohman (1994) describes attempts to produce a unified theory of measurement derived from latency studies of process stages as a qualified failure. Nevertheless, attempts to grapple with a model for individual differences in latencies have a long history (Dennis & Evans, 1996; Furneaux, 1952; Restle & Davis, 1962; White, 1982; Wright, 1997).
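As an illustration of how an L-Model score might be extracted for a single examinee, the sketch below fits an ordinary least-squares line of latency on task difficulty, yielding the intercept (baseline speed) and slope (cost of added processing) referred to above. The data and the use of plain least squares are assumptions for illustration, not the procedure of any particular study cited here.

```python
from statistics import mean

def latency_slope_intercept(difficulty, latency):
    """Ordinary least-squares fit of response latency on task difficulty
    for one examinee, returning (intercept, slope)."""
    d_bar, t_bar = mean(difficulty), mean(latency)
    sxx = sum((d - d_bar) ** 2 for d in difficulty)
    sxy = sum((d - d_bar) * (t - t_bar) for d, t in zip(difficulty, latency))
    slope = sxy / sxx
    intercept = t_bar - slope * d_bar
    return intercept, slope

# Hypothetical latencies (ms) for items at four difficulty levels.
print(latency_slope_intercept([1, 2, 3, 4], [620, 710, 830, 905]))
```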
D-Models. Dynamic, learning, or D-Models involve the repeated measurement of individuals while they are learning either the task they are performing, or some other task whose outcome the task being measured is expected to predict. While they may be constructed either from scores for accuracy (R) or latency (L), D-Models operate most effectively when they require some asymptotic level of performance in the predictor, or the criterion, or both. Even if elegant mathematical models that precisely allocate main effects have been in place for some time (see, e.g., Neimark & Estes [1967] on stimulus sampling theory), they have had little or no lasting influence on the measurement of individual differences. Change scores are difficult to use in regression equations unless they are highly reliable. The need to preserve serial independence of items that are generated is nevertheless paramount in the exercise of test theory (Royer, 1971). What the subject may be predicted to learn during the test, or, just as salient operationally, be prevented from learning, greatly influences the choice of item-generation algorithms.
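The difficulty with change scores noted above can be made explicit with the classical formula for the reliability of a difference score. The sketch below uses invented figures to show how a high pretest-posttest correlation erodes the reliability of the gain even when both measures are themselves reliable.

```python
def difference_score_reliability(r_xx, r_yy, r_xy, sd_x, sd_y):
    """Classical reliability of a change (difference) score D = Y - X,
    given the reliabilities of X and Y, their correlation, and their
    standard deviations."""
    num = sd_x**2 * r_xx + sd_y**2 * r_yy - 2 * r_xy * sd_x * sd_y
    den = sd_x**2 + sd_y**2 - 2 * r_xy * sd_x * sd_y
    return num / den

# Two equally reliable measures (0.85) that correlate 0.70 yield a
# far less reliable gain score (0.50).
print(round(difference_score_reliability(0.85, 0.85, 0.70, 10, 10), 2))
```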
The Fourth Estate
On the whole, the work that went into creating a large-scale operational model from tests that were wholly item generative, and thereby guaranteed a new test for every applicant, deliberately collected data that would reveal aspects of these three models. In the outfiles that were generated for each subject, the item order, item characteristics, latencies for each stage of item delivery and response, and both the correct and the actual response were all recorded. Nevertheless, test scores were invariably constructed around R-Models adjusted for guessing, without totally resolving the question of speed–accuracy trade-off (Dennis & Evans, 1996). To enable the outfiles to be created, the nature of the changes in testing brought about by the microcomputer had to be understood and invoked as principles.
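For readers unfamiliar with the adjustment mentioned above, the classical correction-for-guessing formula subtracts a fraction of the wrong answers from the rights. This is the textbook form; whether the operational battery applied exactly this correction is not specified here.

```python
def guessing_corrected_score(n_right: int, n_wrong: int, n_options: int) -> float:
    """Classical correction for guessing: rights minus wrongs scaled by the
    number of distractors. Omitted items are neither rewarded nor penalised."""
    return n_right - n_wrong / (n_options - 1)

# 30 right and 10 wrong on five-option items gives a corrected score of 27.5.
print(guessing_corrected_score(30, 10, 5))
```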
Traditionally, large-scale testing was and still is carried out in groups, using paper-and-pencil tests as the medium. This technology restricted the range of operational variables that could be constructed, and it defined ability theory in a very constrained fashion. Indeed, item response theory was itself a function of that delivery system, because of the need to equate annual paper-and-pencil aptitude test forms that were never quite parallel in the hands of item-construction teams. Much of the early literature on computer-based testing is preoccupied with transferring old paper-and-pencil tests to computers to see if they will produce the same results. This apart, other research teams concerned with the promise of computer-adaptive testing used the computer to administer individually tailored tests.
For progress to be made, the microcomputer could not become an expensive means of continuing paper-and-pencil test conventions into the millennium. Moreover, the capacity of microcomputers to shape the future of measurement was realised much earlier than scientists' capacity to deliver the necessary changes. At least two independent major reports outlined where decision-making functions could be left to algorithms in the machine (Bunderson, Inouye, & Olsen, 1988; Dennis & Evans, 1989). Today, computer-based testing can be said to have extended the boundaries of theory to such an extent that the limits of mental measurement require a new form of boundary specification. Here, confined to the obvious, tests delivered by computer are defined, in the sense of providing a key to understanding their new operational dimensions, by hardware, by software, and by knowledge-based systems created for score production according to preconceived paradigms.
The most important context for item generation is the microcomputer itself: not its operating system, but how the ergonomics of test delivery and subject response may serve to shape the tests themselves. The microcomputer can be a variable, or more exactly can be made to define a number of quasi-independent variables (which are called radicals) that will constrain and alter the nature of the mental process to be measured. These variables include display mode, information sequence, and response mode. Table 1.1 summarises these; a sketch of how such radicals might be encoded follows the table.
Table 1.1
Microcomputer Display and Response Variables Affecting Test Scores
| Display | Information | Response |
| --- | --- | --- |
| Static | Modality | Keyboard |
| Moving | Amount | Console |
| Sequential | Order | Touch Screen |
| Interactive | Pace | Mouse |
|  | Quality | Voice |
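Purely as a hypothetical sketch, and not a description of any operational generator, the radicals of Table 1.1 could be encoded as a small data structure from which an item-generation program samples its delivery conditions; the class and field names below are invented for illustration.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ItemSpecification:
    """One cell of the delivery-variable space in Table 1.1. Each field
    is a 'radical' that can alter the process being measured."""
    display: str      # Static, Moving, Sequential, Interactive
    information: str  # Modality, Amount, Order, Pace, Quality
    response: str     # Keyboard, Console, Touch Screen, Mouse, Voice

displays = ["Static", "Moving", "Sequential", "Interactive"]
information = ["Modality", "Amount", "Order", "Pace", "Quality"]
responses = ["Keyboard", "Console", "Touch Screen", "Mouse", "Voice"]

# Enumerate the full design space a generator could draw conditions from.
design_space = [ItemSpecification(d, i, r)
                for d, i, r in product(displays, information, responses)]
print(len(design_space))  # 100 combinations
```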
Display Modes. Visual displays can be made to vary from a completely static representation of an item (as if it were transferred without change from paper to screen) to a wholly dynamic item where parts actually move (as in an arcade game environment where movement is a precondition). Old-fashioned apparatus tests (using puzzles, cards, beads, jig-saw pieces) administered to individuals also introduce movement and manipulation of the apparatus as sources of variation, for example in the so-called Non-verbal, but more accurately Performance, scales of omnibus intelligence tests administered individually. Interactive modes allow changes in displays depending on responses. This can be seen in adaptive paper-and-pencil questionnaires that require people to go on to answer different questions, depending on whether the answer is yes or no to any one question. Note that the information categories in particular may each vary internally. Quality refers to the amount of degradation on the screen. Order, pace, and amount of information are no...