Chapter 1
Introduction
Susan F. Chipman
U.S. Office of Naval Research
Paul D. Nichols
University of Wisconsin, Milwaukee
Robert L. Brennan
American College Testing
Coinciding developments in testing and in cognitive science have produced a moment of potential revolution in the theoretical underpinnings of psychological testing. In testing, practice and theory have led us to a rather peculiar state of affairs. An elaborate and refined mathematical apparatus has been developed for selecting well-behaved test items, for assembling them into well-behaved tests, and for converting examinees' performance into well-scaled measures of "ability." But the process by which test items are created is rather ad hoc, more a matter of art than of science. That is, the selection of test content is not as well founded as most people believe, expect, or hope. Furthermore, traditional tests are well behaved for ranking and comparing examinees, for grading, and for predicting who will do well in some future activity. Typically, they do not provide useful diagnostic information about specific content that should be studied or taught in order to improve performance. Today, many want testing to be an integral part of instructional activity, helping to guide teachers and students to the eventual attainment of substantive educational goals.
In cognitive psychology and cognitive science, research over the past two or three decades has yielded a much improved understanding of the fundamental psychological nature of the knowledge and cognitive skills that we hope to measure in psychological testing. These cognitive theories of knowledge and skill have reached sufficient maturity that it is now reasonable to look to them to provide a sound theoretical foundation for assessment, particularly for the content of assessments. This fact, combined with discontent over current testing practices, has inspired efforts to bring testing and cognitive theory together to create a new theoretical framework for psychological testing: a framework developed for diagnosing learners' differences rather than for ranking learners based on their differences.
The chapters of this volume present initial accomplishments of the effort to bring testing and cognitive theory together. Contributors originate from both of the relevant research communities: cognitive research and psychometric theory. Some chapters represent collaborations between representatives of the two communities; others are efforts to reach out in the direction of the other enterprise. The gap to be bridged is quite wide. Even a rather superficial examination of the problem reveals a major issue. The models of knowledge and skill that have emerged from cognitive research take a form that is fundamentally different from the model of knowledge that underlies psychometric test theory. That theory assumes that knowledge can be represented in terms of one or at most a few dimensions, whereas modern cognitive theory typically represents knowledge in networks: either networks of conceptual relationships or the transition networks of production systems. Mathematically, these are very different models. There are other significant differences between the two contributing communities. Typically, cognitive researchers have been concerned with describing the nature of knowledge in general, viewing individual differences primarily as a source of noise. In contrast, psychometricians have aimed to characterize individuals with sufficient precision that decisions about the fate and opportunities of individuals can be considered well justified. The strain of reconciling these different views of the world is evident in the chapters of this volume. These chapters fall into three major groups, which we have called student modeling, conceptual networks, and psychometric attributes.
Student Modeling
One stream of modern cognitive science research has emphasized the analysis and modeling of individual problem-solving performance and is therefore particularly promising as a source of new assessment approaches. This research began with the pioneering efforts of Newell and Simon (1972), and has matured into various production system models of cognition, including John Anderson's evolving ACT (adaptive control of thought) series (Anderson, 1993). Such models have also become the foundation of artificially intelligent tutoring systems that contain student models. These student models epitomize cognitive diagnosis: They provide very detailed assessments of student competence at all points during instruction. These assessments are used to guide the selection of the next instructional actions. Although student models are cognitive diagnoses, their pragmatic functional role in the workings of tutoring systems has meant that they were often initially constructed in a rather ad hoc manner. Similarly, the selection of instructional actions does not necessarily demand high precision; the cost of errors is rather low. Thus, student models provide a promising starting point for cognitive diagnosis, but they need more rigorous principles and more careful examination of their quality as assessments. The chapters in this volume that describe diagnostic assessment within the context of intelligent tutoring attempt to develop assessments using psychologically and statistically defensible methods. The assessments must respond to questions like these: What is the appropriate level of detail to represent performance for the purpose of diagnosis? What statistical approaches are most useful for making inferences about the procedural knowledge underlying performance? How do you take into account the learning that is, after all, the goal of the tutor?
The second chapter in this volume, by Corbett, Anderson, & O'Brien, explores the quality of student modeling in the most mature intelligent tutoring system, the ACT programming tutor developed by Anderson (1993) and many of his associates. This tutor depends on an extremely detailed cognitive model of the programming skill that is being taught, a model that is stated as a production system. A production system is a set of condition-action rules that are called productions. When the conditions of a production are met, its action is taken, resulting in a new state of affairs that may in turn meet the conditions of further production rules. Flexible, dynamic generation of behavior results. This cognitive action takes place within an internal arena of limited capacity that is called working memory. Working memory may contain information supplied by perception of the external environment or specifications of externally observable problem-solving acts, but the operation of many productions is internal to the cognitive system and unobservable. Production system models of cognition provide a very fine level of description, so that the operation of many production rules will be involved in the solution of even rather simple problems. There are said to be hundreds of productions in the ACT programming tutor for each language taught. Consequently, inferring whether or not a student has learned a given production rule from observed performance on problems is a nontrivial matter. It is made more complex by the fact that learning is assumed to be continuing with the presentation of each additional problem. In the ACT programming tutor, estimates of student mastery of each production rule are maintained and updated after each relevant problem-solving experience, using Bayesian techniques.
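The flavor of such an update can be conveyed with a small sketch. The following is a minimal, illustrative version of a knowledge-tracing style Bayesian update for a single production rule; the parameter names and values are assumptions chosen for illustration, not those of the ACT programming tutor itself.

```python
# A minimal sketch of a knowledge-tracing style Bayesian update for a
# single production rule. Parameter names and values are illustrative
# assumptions, not those of the ACT programming tutor.

def update_mastery(p_known, correct, p_guess=0.2, p_slip=0.1, p_learn=0.3):
    """Return the updated probability that the production is known,
    after observing one relevant problem-solving step."""
    if correct:
        # A correct step arises from knowing the rule (and not slipping)
        # or from not knowing it but guessing.
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        # An incorrect step arises from a slip or from not knowing the rule.
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # Allow for learning on this opportunity, whatever the outcome.
    return posterior + (1 - posterior) * p_learn

p = 0.5  # prior probability that the production is already known
for outcome in (True, False, True, True):
    p = update_mastery(p, outcome)
    print(round(p, 3))
```

Run over a sequence of problem-solving steps, an update of this kind maintains a running mastery estimate for each production, which is what allows the tutor to decide when a rule can be considered learned.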
Within the tutor, student actions are tracked step by step to determine whether or not they lie on a possible correct solution path. If not, students are given instructional guidance to correct what they are doing. This is a rather special circumstance. Therefore, Corbett et al. focus on whether the student model that works well for that purpose is also a good predictor of student programming performance when students are not receiving this instantaneous support. Several outcomes of their research are worthy of note. First of all, detailed examination of the learning curves for individual productions did not always reveal the smooth improvement expected. This suggested that errors had been made in the way some productions were specified: failures to make important distinctions. As a result, the specific cognitive theory was revised, increasing the number of productions by 67%. Given the large amount of prior research that went into this cognitive model, this illustrates the difficulty of arriving at a truly good model at the production system level. In addition, Corbett et al. discovered that productions could not all be treated uniformly in the predictive performance and learning model if one wanted truly good prediction: The parameters describing individual productions (initial state of learning, acquisition rate, guessing, and slips) had to be allowed to vary. Similarly, they discovered that the estimates of rule mastery did not fully characterize individual student performance. Even when students were estimated to have mastered all rules, there were individual differences in performance that could be predicted by the number of errors students had made on the way to estimated mastery. Individual differences in the likelihood of production learning, in the likelihood of making slips, and in retention or forgetting all seemed to promise improvements in the ability to predict final unaided performance.
In the next chapter, Mislevy focuses on the Bayesian inference techniques that can be used to update student models in a system like the ACT programming tutor or in other cognitive diagnosis situations. He gives us some appreciation of the computational complexity involved in applying Bayesian methods to a cognitive model with hundreds of rules, describing the techniques that are being developed to deal with such large problems. Although his discussion applies to production system models, it is more general. The same techniques can be applied to other types of models that postulate underlying causes for the observed behavior. Mislevy does point out, however, that the production system models are a particularly strong form of model: one that can actually perform the skill.
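To make the inferential step concrete, here is a minimal sketch of posterior updating in a very small network; the structure and all probabilities are assumptions chosen for illustration. A prerequisite skill K1 influences a target skill K2, and one observed response depends on K2. The brute-force enumeration used here is exactly what becomes intractable when the network grows to hundreds of interdependent variables, which motivates the techniques Mislevy describes.

```python
# A minimal sketch of posterior updating in a very small Bayesian
# network; the structure and all probabilities are illustrative
# assumptions. K1 is a prerequisite skill, K2 a target skill that
# depends on it, and one observed response depends on K2.

from itertools import product

p_k1 = 0.6  # prior probability of the prerequisite skill
p_k2_given_k1 = {True: 0.8, False: 0.2}
p_correct_given_k2 = {True: 0.9, False: 0.25}

def joint(k1, k2, correct=True):
    """Joint probability of one (k1, k2, response) configuration."""
    pk1 = p_k1 if k1 else 1 - p_k1
    pk2 = p_k2_given_k1[k1] if k2 else 1 - p_k2_given_k1[k1]
    pc = p_correct_given_k2[k2] if correct else 1 - p_correct_given_k2[k2]
    return pk1 * pk2 * pc

# Condition on observing one correct response: enumerate all skill
# configurations, then normalize. The observation updates K2 directly
# and K1 indirectly, through the network.
norm = sum(joint(k1, k2) for k1, k2 in product((True, False), repeat=2))
post_k1 = sum(joint(True, k2) for k2 in (True, False)) / norm
post_k2 = sum(joint(k1, True) for k1 in (True, False)) / norm
print(f"P(K1 | correct) = {post_k1:.3f}, P(K2 | correct) = {post_k2:.3f}")
```

The indirect revision of K1 from evidence that bears directly only on K2 is the essential behavior of these networks, and it is what makes them attractive for cognitive diagnosis.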
The chapter by Gitomer, Steinberg, and Mislevy reports on the application of these techniques within another type of intelligent tutor, a tutor of troubleshooting skill in a particular maintenance-training application, aircraft hydraulic systems. Gitomer et al. give us some appreciation of the research into the nature of the skill being trained that is necessary in order to either train or assess skills by these methods. In addition to the specific research that went into the building of this system, there is a substantial background of research on diagnosis skills, and several prior intelligent tutoring systems have been built for maintenance training. The tutor described by Gitomer et al. differs significantly from the ACT programming tutor because it includes a representation of a type of knowledge that would not ordinarily be described by productions: knowledge of the system that is to be maintained. In addition, they make a distinction, which they attribute to Wenger (1987), between the epistemic and individual levels of cognitive diagnosis. The epistemic level is concerned with particular knowledge states of learners, such as the mastery of particular productions. The individual level is concerned with broader assertions about learners, such as their general strategy preference or general level of ability. The student model of the hydraulics tutor represents performance at the broader level of curricular goals rather than the more detailed level of productions. This more abstract representation is not a runnable model, but it does predict the likelihood of specific actions. Because this student model can predict both specific actions and curricular goals, it can inform instructional adaptation within the tutor as well as decisions about general competency. Gitomer et al. apply Bayesian inference techniques to model student performance in their tutor. They attempt to represent in their student inference network the conceptual interdependencies between different knowledge components that were captured by the cognitive task analysis. Thus, a strategic action leads to local updating of the probabilities for strategic knowledge and also to network-wide updating of the probabilities for procedural knowledge and system knowledge.
In the chapter by Draney, Pirolli, and Wilson, we revisit an intelligent tutor of programming skill, the LISP Tutor, an earlier version of the ACT programming tutor, but with a perspective more strongly influenced by the psychometric tradition. They propose a model that is somewhat more comprehensive than the student model approaches in the previous chapters. It can represent experimental treatment effects as well as individual differences in skill. Draney et al. treat the productions as if they were items in an item response theory analysis, yielding both scales of production difficulty and measures of student ability (programming proficiency). This is a summary process rather than an online diagnosis, but they do incorporate a parameter to represent learning over the various opportunities to practice a production. Their learning model differs from the one assumed by Corbett et al., and they argue for the superiority of their choice.
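In schematic form, such a model resembles a Rasch model augmented with a practice term; the notation below is assumed here for illustration and is not necessarily that of their chapter:

$$P(X_{ijt} = 1) = \frac{\exp(\theta_i - \beta_j + \gamma\, t_{ij})}{1 + \exp(\theta_i - \beta_j + \gamma\, t_{ij})}$$

where $X_{ijt}$ indicates whether student $i$ applies production $j$ correctly at its $t$th opportunity, $\theta_i$ is the student's proficiency, $\beta_j$ is the production's difficulty, $\gamma$ is a learning rate, and $t_{ij}$ counts the prior opportunities student $i$ has had to practice production $j$. In a model of this kind, every correct application contributes evidence about $\theta_i$, while repeated practice shifts the effective difficulty of a production downward.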
The approach taken by Draney et al. shows both disadvantages and advantages, as compared to other approaches in this volume. Unlike the student models of the two tutoring systems, this measurement model does not provide dynamic estimates of the probability that an individual student has mastered a particular production at a particular time; for that purpose, these authors are also exploring the use of a Bayesian network incorporating their parameter estimates. Also, the analyses done by Corbett et al. suggest that student "ability" may need to be represented by more than a single parameter interpreted as programming proficiency. On the other hand, the measurement model does separate effects in the interaction between the learner and the learning environment: One can examine the expected effects of instructional treatments on the measured difficulty of individual productions, not just some overall effect. Also, it seems likely that the measures of production difficulty it generates could provide a more parsimonious account of the differences among productions than the four-parameter representation used by Corbett et al.
The next two chapters are also much more in the tradition of the snapshot assessment of competence. Polk, VanLehn, and Kalp present a fundamentally new data analysis method with potential applications far beyond cognitive diagnosis, the fitting of symbolic parameter models. It is the discrete, symbolic equivalent of curve fitting. In the context of cognitive diagnosis, the symbolic parameters are production system rules, correct or defective, that students use to solve problems in a domain of interest, such as logical syllogisms or subtraction problems. As already noted, such production systems form a very strong model that can actually be run on the problems to determine how examinees should act to solve the problem. ASPM, the program that Polk et al. have developed, uses this fact to compute the best-fitting set of rules to account for the performance of each individual subject, ideally yielding a complete individual cognitive diagnosis, comprising known correct rules, rules not known, and misconceptions or malrules. This is a totally deterministic approach, and it obviously depends on having a very high-quality cognitive model. Of course, by identifying areas in which there are failures of fit, the method can be used to help refine a cognitive model. Still, a completely deterministic approach is rather extreme. As yet, ASPM does not have stochastic aspects that would provide for accidental slips in examinee performance or that would enable one to decide whether a diagnosis with less than perfect fit should be considered good enough. This is an important direction for future developments of ASPM.
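The idea of fitting symbolic parameters can be conveyed with a deliberately tiny sketch, not ASPM itself: run each candidate rule on the problems the student answered and keep the rule whose predictions best match the observed answers. The subtraction rules and data below are toy assumptions for illustration, and for brevity the sketch chooses among single whole procedures rather than sets of fine-grained productions.

```python
# A deliberately tiny sketch in the spirit of symbolic parameter
# fitting (not ASPM itself): run each candidate rule on the problems
# the student answered and keep the rule whose predictions best match
# the observed answers. The rules and data are toy assumptions.

def correct_rule(a, b):
    """The correct subtraction procedure."""
    return a - b

def smaller_from_larger(a, b):
    """A classic malrule: in each column, subtract the smaller digit
    from the larger, ignoring borrowing."""
    tens = abs(a // 10 - b // 10)
    ones = abs(a % 10 - b % 10)
    return 10 * tens + ones

candidate_rules = {
    "correct": correct_rule,
    "smaller-from-larger": smaller_from_larger,
}

problems = [(52, 17), (84, 29), (75, 43), (60, 38)]
observed = [45, 65, 32, 38]  # answers from a student with the malrule

def fit(rule):
    """Count how many observed answers the rule reproduces."""
    return sum(rule(a, b) == ans for (a, b), ans in zip(problems, observed))

best = max(candidate_rules, key=lambda name: fit(candidate_rules[name]))
print(best, "matches", fit(candidate_rules[best]), "of", len(problems), "answers")
```

Because the candidate rules are runnable models, a perfect fit amounts to a complete diagnosis of the procedure the student is following; the deterministic character of the approach is visible in the fact that a single slip by the student would break the fit.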
The chapter by Martin and VanLehn considers cognitive diagnosis in another domain that has been the focus of a considerable body of cognitive science research: physics problem solving in mechanics. Like the Corbett et al. chapter, this one combines a production system model of physics problem-solving skill with Bayesian inference techniques to build a student model on the basis of problem-solving performance. In this case, however, an intelligent tutor with a student model does not yet exist, and Martin and VanLehn are attempting to build on the full range of cognitive research techniques that have been applied to understand the growth of expertise in physics problem solving. They offer a suite of assessment tools that provide information on problem representation and solution processes. These tools include a "poor man's eye-tracker," which allows the system to gather information about the process by which a student examines example problems; a task that asks students to classify problems into groups; and another task involving more atypical, nonquantitative conceptual physics problems of the type that often reveal serious misconceptions.
In their applications of Bayesian inference techniques, Martin and VanLehn have also encountered computational problems because of the size and complexity of the inference problem, and they have developed some new approaches to make it tractable. As they confront the problem of integrating all of the information from the various assessment tasks into a single comprehensive cognitive diagnosis, they are encountering a limitation in the current state of cognitive theory: It does not yet provide an account of the way in which the procedural knowledge that is represented by productions is related to or integrated with more conceptual forms of knowledge. Finally, like Gitomer et al., Martin and VanLehn have attempted to provide diagnostic information at several levels of detail to support decision making for instructional planning, competence certification, and instructional evaluation. They do this by providing a flexible capability to aggregate information upward from the detailed production level assessment.
The two remaining contributions in this section emphasize the research in cognitive model development that must be done to provide the foundations of cognitive diagnosis. The chapter by Biswas and Goldman is a progress report on such an effort that proved particularly challenging. Their chosen domain, expertise in electronic circuit design, is not one that had a substantial body of prior research to build upon. Furthermore, design problem solving in general is a rather open-ended, creative enterprise that is at the very frontier of research in cognitive modeling. DuBois and Shalin, in contrast, have articulated the general schema of routine problem-solving expertise that cognitive science research has provided. They have successfully used it to guide their exploration and analysis of expertise in a somewhat unusual domain: Marines' land navigation skills. It will be interesting to see how successfully this approach can be applied to a variety of different domains. Certainly, there is nothing in it that is specific to skills like land navigation, but there may be limitations in the range of skills to which it is readily applied, limitations having to do with the accessibility of relevant processes to introspection. These researchers have since moved on to study an electronics skill that should also be tractable.
The kind of analysis that DuBois and Shalin have done could be the first step in developing a detailed production system model. Instead, however, they have used it to develop relatively conventional test items that systematically probe all the critical aspects of a cognitive skill, including those that are frequently neglected by conventional test developers. Conventional training and testing tend to emphasize the action parts of cognitive skills, neglecting knowledge of the conditions that should trigger the actions. This is true both at the lowest levels and at the higher level that we might call strategic selection of methods. Because their approach does not require the development of a fully detailed production system cognitive model, DuBois and Shalin present a more practical alternative that is probably suited for immediate application to a broad range of assessment problems. Yet, it is an approach that is well founded in the general insights provided by cognitive science research.
Conceptual Networks
A second form of knowledge representation that is widely used in cognitive science is the conceptual or semantic network. A great deal of our kn...