Validating Inferences From National Assessment of Educational Progress Achievement-Level Reporting
Robert L. Linn
Center for Research on Evaluation, Standards, and Student Testing, University of Colorado at Boulder
Requests for reprints should be sent to Robert L. Linn, Campus Box 249, University of Colorado, Boulder, CO 80309–0249. E-mail: [email protected]

The validity of interpretations of National Assessment of Educational Progress (NAEP) achievement levels is evaluated by focusing on evidence regarding 3 types of discrepancies: (a) discrepancies between standards implied by judgments of different types of items (e.g., multiple choice vs. short answer or dichotomously scored vs. extended response tasks scored using multipoint rubrics), (b) discrepancies between descriptions of achievement levels with their associated exemplar items and the location of cut scores on the scale, and (c) discrepancies between the assessments and content standards. Large discrepancies of all 3 types raise serious questions about some of the more expansive inferences that have been made in reporting NAEP results in terms of achievement levels. It is argued that the evidence reviewed provides a strong case for making more modest inferences and interpretations of achievement levels than have frequently been made.
There is broad professional consensus that it is the uses and interpretations or inferences made from specific uses of assessment results that are validated. As stated in the Test Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education [AERA, APA, & NCME], 1985), for example, "[t]he inferences regarding specific uses of tests are validated, not the test itself" (p. 9). In a similar vein, Messick (1989) stated: "What is validated is not the test or observation device as such but the inferences derived from test scores or other indicators – inferences about score meaning or interpretation and about the implication for action that the interpretation entails" (p. 13).
I begin with this presumably familiar observation about what is validated for two reasons. First, it makes obvious the importance of attending to the specific uses and interpretations of assessment results when validity is considered. Second, it provides the basis for focusing attention in validation work on performance standards used to report and interpret assessment results.
My remarks are framed in terms of a particular assessment program, the National Assessment of Educational Progress (NAEP). Many of the observations are applicable, however, to other large-scale assessment programs that use performance standards as a way of reporting and interpreting assessment results and that emphasize aggregate results rather than the performance of individual students. Additional issues arise in assessment programs that use performance standards to interpret the performance of and make decisions about individual students. The discussion here, however, is limited to the use of performance standards to report and interpret the performance of groups of students.
NAEP ACHIEVEMENT LEVELS
The National Assessment Governing Board (NAGB) began its effort to report results in terms of performance standards with the 1990 mathematics assessment. The decision to develop and use performance standards to report NAEP results was made by the then newly established governing board. The performance standards that were established by NAGB were named "achievement levels."
The decision to report NAEP results in terms of achievement levels was based on the NAGB's interpretation of the legislation that reauthorized NAEP. Among other responsibilities, the legislation assigned NAGB the responsibility of "identifying appropriate achievement goals" (Augustus F. Hawkins-Robert T. Stafford Elementary and Secondary School Improvement Amendments, 1988). As was noted by the National Academy of Education (NAE) Panel on the Evaluation of the NAEP Trial State Assessment (Shepard, Glaser, Linn, & Bohrnstedt, 1993), the Board
might have responded in different ways. Given the emerging consensus for establishing national education standards, the fact that the Education Summit was silent on who should set standards, and the fact that NAEP was the only national assessment of achievement based on defensible samples, NAGB interpreted the authorizing legislation as a mandate to set performance standards, which it named "achievement levels," for NAEP. (p. xviii)
Because of the potential importance of the achievement levels in the context of the press for national standards, the 1990 achievement levels were subjected to several evaluations (e.g., Linn, Koretz, Baker, & Burstein, 1991; Stufflebeam, Jaeger, & Scriven, 1991; U.S. General Accounting Office, 1993). NAGB was responsive to many of the criticisms of the evaluators and undertook another, more extensive standard setting effort for the 1992 mathematics and reading assessments. Evaluations of the 1992 effort (e.g., Burstein et al., 1993, 1995–1996; Shepard, 1995; Shepard et al., 1993), however, were again quite critical. The NAE panel, for example, concluded that the achievement levels may reduce rather than enhance the validity of interpretations of NAEP results. Not all agreed. Indeed, there were strong defenders of the 1992 mathematics achievement levels, the process used to set them, and the interpretations that they were intended to support (e.g., American College Testing, 1993; Cizek, 1993; Kane, 1993).
There is no need to review in any detail the controversy that has surrounded the achievement levels since they were first introduced on a trial basis for the 1990 mathematics assessment. There are two points worth making in this context, however. First, the controversy, at least in part, led to a conference on standard setting for large-scale assessments that was held in October 1994 under the joint sponsorship of NAGB and the National Center for Education Statistics (NCES). Although the conference did not resolve the controversy, several of the articles, which are now available in the conference proceedings (NAGB & NCES, 1995), addressed validation issues for performance standards and contributed to the formulation of validity research needs discussed later. Second, much of the criticism revolved around a few critical validity questions that should continue to demand attention in any serious validation effort. I will emphasize three of those questions.
The three validity questions that I consider central to the controversy all involve discrepancies: (a) discrepancies between standards implied by judgments of different types of items (e.g., multiple choice vs. short answer or dichotomously scored vs. extended response tasks scored using multipoint rubrics), (b) discrepancies between descriptions of achievement levels with their associated exemplar items and the location of cut scores on the scale, and (c) discrepancies between the assessments and content standards. Large discrepancies between the levels at which judges would set standards when reviewing multiple-choice items, short-answer items, and extended-response tasks raise serious questions about the conceptual coherence of the judgments. Discrepancies between descriptions of achievement levels and the location of the cut score create a mismatch between what students with scores in the range of the scale corresponding to a given achievement level are said to be able to do and what they actually did on the assessment. Finally, a misalignment between performance and content standards on the one hand and assessments and cut scores on the other undermines the construct validity of the intended interpretations of the assessment results. Although the latter discrepancy is most commonly illustrated in the area of mathematics, where comparisons are made between the assessment and the standards published in Curriculum and Evaluation Standards for School Mathematics (National Council of Teachers of Mathematics, 1989), the validity concern is obviously a much broader one and should be addressed in developing a validity argument (Kane, 1992) in any subject matter area.
VALIDITY FRAMEWORK
Much of the debate about achievement levels has focused on the standard setting method. This is hardly surprising given that the NAE panel recommended against the use of "the Angoff method or any other item-judgment method to set achievement levels" because the panel concluded that such methods are "fundamentally flawed" (Shepard et al., 1993, p. 132). Although the method used clearly makes a difference in the outcome – a point that no one seems to dispute1 – focus on method too often does not come to grips with the fundamental questions of validity of associated interpretations. Instead, the point is made that there is no "right" answer. Mehrens (1995), for example, stated, "Standards are judgmental; there is no right answer as to where a standard should be set" (p. 254). Zieky (1995) made the same point: "We have learned that there is NO 'true' standard that the application of the right method, in the right way, with enough people, will find" (p. 29). I agree with the conclusions as stated by Mehrens and Zieky, but it should not be assumed, as it seems some may have, that this implies either (a) that there is therefore no need to obtain evidence to support the validity of uses and interpretations of the standards, or (b) that one method is as good as another.
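Because the Angoff item-judgment method figures so prominently in this debate, a brief illustration of its basic logic may be helpful. The sketch below is not drawn from the NAEP achievement-level setting itself; the judges, ratings, and five-item pool are hypothetical, and operational applications involve far more items, multiple judgment rounds, and feedback to judges.

```python
# Hypothetical illustration of an Angoff-style cut score computation.
# Each judge estimates, for every item, the probability that a minimally
# qualified ("borderline") examinee would answer the item correctly.

ratings = {
    "judge_1": [0.60, 0.45, 0.80, 0.30, 0.70],
    "judge_2": [0.55, 0.50, 0.75, 0.35, 0.65],
    "judge_3": [0.65, 0.40, 0.85, 0.25, 0.75],
}

# A judge's implied cut score is the expected raw score for a borderline
# examinee: the sum of that judge's item probabilities.
judge_cuts = {judge: sum(probs) for judge, probs in ratings.items()}

# The panel's cut score is the mean of the judges' expected scores.
cut_score = sum(judge_cuts.values()) / len(judge_cuts)
print(judge_cuts, round(cut_score, 2))
```

On this simplified account, the standard is simply the average of the judges' expected number-correct scores for a borderline examinee; disagreement among judges, and sensitivity of the ratings to item type, are precisely the features that the discrepancy analyses discussed here are designed to expose.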
The quality of methods can be distinguished on many grounds. Brennan (1995), for example, suggested replicability of results as a criterion for evaluating methods and noted how generalizability analyses with items, judges, and occasions as facets can be used in evaluating the quality of a method. Similarly, Mehrens (1995) relied on the criteria of ease of use and psychometric properties of the standard (mainly interjudge consistency) in arguing for the Angoff method as his preferred method. Two of the three criteria used by Kane (1995) to compare "task-centered approaches" (e.g., the Angoff method) with "examinee-centered approaches" were also of a similar nature: "practical feasibility in the context" and "technical considerations in the context" (p. 130).
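To make the replicability criterion concrete, the following sketch shows one very reduced form of such an analysis: treating judges as a random facet and comparing the panel cut score and its standard error across two judging occasions. The numbers and the two-occasion design are invented for illustration; a full generalizability study along the lines Brennan describes would estimate variance components for items, judges, and occasions jointly rather than rely on the simple summaries used here.

```python
import statistics

# Hypothetical judge-level cut scores (on the reporting scale) from two
# judging occasions; the values are invented for illustration only.
occasion_1 = [148.2, 151.0, 145.5, 150.3, 147.8]
occasion_2 = [149.0, 152.4, 144.9, 151.1, 146.5]

def panel_cut(judge_cuts):
    # The panel's cut score is the mean of the judges' individual cut scores.
    return statistics.mean(judge_cuts)

def se_of_cut(judge_cuts):
    # Standard error of the panel cut score, treating the judges as a random
    # sample from a larger pool of potential judges.
    return statistics.stdev(judge_cuts) / len(judge_cuts) ** 0.5

for label, cuts in (("occasion 1", occasion_1), ("occasion 2", occasion_2)):
    print(label, round(panel_cut(cuts), 1), "SE =", round(se_of_cut(cuts), 2))
```

Stable panel cut scores and small standard errors across occasions would speak to the replicability of a method, but, as argued below, such evidence bears on consistency rather than on the validity of the interpretations the standards are meant to support.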
There is a substantial literature on methods of setting standards. There is considerable information available to guide the selection and training of judges, the use of multiple judgment rounds, and the introduction of impact data. Comparative studies show that different methods yield different standards (e.g., see Mehrens, 1995). There is much less information about the psychological demands that different judgment procedures place on judges or about the degree to which different methods are likely to differ in the validity of interpretations of results.
Method differences in interjudge consistency or in other components of a generalizability study (e.g., occasions or subsets of items judged) are certainly relevant to an overall evaluation of validity, but they do not address core validity issues. Just as the most reliable test need not be the one that supports the most valid inferences about students, the method that yields the most replicable standards or the standards with the highest interjudge consistency need not produce the standards that yield the most valid interpretations of student achievement.
A common criticism of psychometricians is that although we all give lip service to the doctrine that "validity is the most important consideration in test evaluation" (AERA, APA, & NCME, 1985, p. 9), we too often act in other ways, giving more attention to the easier jobs of evaluating and enhancing reliability at the expense of the more difficult jobs of evaluating and enhancing validity (e.g., see Gipps, 1994, p. 76). This criticism also seems to fit much of the work on the development and evaluation of performance standards. Too little attention has been given to the interpretations associated with standards and the validity of those interpretations in comparison to the amount of attention given to technique, interjudge consistency, and lack of consistency across methods.
There are, of course, exceptions. Kane's (1995) first criterion for comparing task-centered and examinee-centered standard setting methods, for example, concerned the consistency of the methods with the "model of achievement" underlying the design of the assessment. Kane argued that "all aspects of an assessment program, including test development, scoring, standard setting, and reporting of results, should be consistent with the intended interpretation of the results" (p. 121). He concluded that holistic models of learning and achievement are more compatible with examinee-centered methods whereas analytic models are more compatible with task-centered methods. Kane indicated that such an analysis was not conclusive with regard to NAEP because both models have adherents in the NAEP context, and there is some reason to believe that NAEP may be in a state of transition from a dominant analytic model to greater emphasis on a holistic model. The key point for present purposes, however, is that such an analysis usefully focuses attention on the achievement construct that the assessment is intended to measure. This seems the proper place to begin a serious consideration of validity of standards-based interpretations of assessment results.
The previously mentioned discrepancies in the location of the standard implied by judgments of different types of items are relevant in this regard. Item type represents method variance that is a potential source of invalidity in the location of cut scores in relation to the performance standard constructs. The discrepancies in cut scores as a function of item type reported by Shepard et al. (1993) were quite large in both reading and mathematics at all three grade levels and all three achievement levels. The largest of the discrepancies between cut scores based on different item types at a given grade and achievement level were larger than the differences between cut scores across grades or achievement levels when the same item type was being judged. For example, at Grade 4 in mathematics, the basic-level cut score based on...