Validating Inferences From National Assessment of Educational Progress Achievement-Level Reporting
Robert L. Linn
Center for Research on Evaluation, Standards, and Student Testing, University of Colorado at Boulder
Requests for reprints should be sent to Robert L. Linn, Campus Box 249, University of Colorado, Boulder, CO 80309–0249. E-mail: [email protected]

The validity of interpretations of National Assessment of Educational Progress (NAEP) achievement levels is evaluated by focusing on evidence regarding 3 types of discrepancies: (a) discrepancies between standards implied by judgments of different types of items (e.g., multiple choice vs. short answer or dichotomously scored vs. extended response tasks scored using multipoint rubrics), (b) discrepancies between descriptions of achievement levels with their associated exemplar items and the location of cut scores on the scale, and (c) discrepancies between the assessments and content standards. Large discrepancies of all 3 types raise serious questions about some of the more expansive inferences that have been made in reporting NAEP results in terms of achievement levels. It is argued that the evidence reviewed provides a strong case for making more modest inferences and interpretations of achievement levels than have frequently been made.
There is broad professional consensus that it is the uses and interpretations or inferences made from specific uses of assessment results that are validated. As stated in the Test Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education [AERA, APA, & NCME], 1985), for example, "[t]he inferences regarding specific uses of tests are validated, not the test itself" (p. 9). In a similar vein, Messick (1989) stated: "What is validated is not the test or observation device as such but the inferences derived from test scores or other indicators – inferences about score meaning or interpretation and about the implication for action that the interpretation entails" (p. 13).
I begin with this presumably familiar observation about what is validated for two reasons. First, it makes obvious the importance of attending to the specific uses and interpretations of assessment results when validity is considered. Second, it provides the basis for focusing attention in validation work on performance standards used to report and interpret assessment results.
My remarks are framed in terms of a particular assessment program, the National Assessment of Educational Progress (NAEP). Many of the observations are applicable, however, to other large-scale assessment programs that use performance standards as a way of reporting and interpreting assessment results and that emphasize aggregate results rather than the performance of individual students. Additional issues arise in assessment programs that use performance standards to interpret the performance of and make decisions about individual students. The discussion here, however, is limited to the use of performance standards to report and interpret the performance of groups of students.
NAEP ACHIEVEMENT LEVELS
The National Assessment Governing Board (NAGB) began its effort to report results in terms of performance standards with the 1990 mathematics assessment. The decision to develop and use performance standards to report NAEP results was made by the then newly established governing board. The performance standards that were established by NAGB were named "achievement levels."
The decision to report NAEP results in terms of achievement levels was based on the NAGB's interpretation of the legislation that reauthorized NAEP. Among other responsibilities, the legislation assigned NAGB the responsibility of "identifying appropriate achievement goals" (Augustus F. Hawkins-Robert T. Stafford Elementary and Secondary School Improvement Amendments, 1988). As was noted by the National Academy of Education (NAE) Panel on the Evaluation of the NAEP Trial State Assessment (Shepard, Glaser, Linn, & Bohrnstedt, 1993), the Board
might have responded in different ways. Given the emerging consensus for establishing national education standards, the fact that the Education Summit was silent on who should set standards, and the fact that NAEP was the only national assessment of achievement based on defensible samples, NAGB interpreted the authorizing legislation as a mandate to set performance standards, which it named "achievement levels," for NAEP. (p. xviii)
Because of the potential importance of the achievement levels in the context of the press for national standards, the 1990 achievement levels were subjected to several evaluations (e.g., Linn, Koretz, Baker, & Burstein, 1991; Stufflebeam, Jaeger, & Scriven, 1991; U.S. General Accounting Office, 1993). NAGB was responsive to many of the criticisms of the evaluators and undertook another, more extensive standard setting effort for the 1992 mathematics and reading assessments. Evaluations of the 1992 effort (e.g., Burstein et al., 1993, 1995–1996; Shepard, 1995; Shepard et al., 1993), however, were again quite critical. The NAE panel, for example, concluded that the achievement levels may reduce rather than enhance the validity of interpretations of NAEP results. Not all agreed. Indeed, there were strong defenders of the 1992 mathematics achievement levels, the process used to set them, and the interpretations that they were intended to support (e.g., American College Testing, 1993; Cizek, 1993; Kane, 1993).
There is no need to review in any detail the controversy that has surrounded the achievement levels since they were first introduced on a trial basis for the 1990 mathematics assessment. There are two points worth making in this context, however. First, the controversy, at least in part, led to a conference on standard setting for large-scale assessments that was held in October 1994 under the joint sponsorship of NAGB and the National Center for Education Statistics (NCES). Although the conference did not resolve the controversy, several of the articles, which are now available in the conference proceedings (NAGB & NCES, 1995), addressed validation issues for performance standards and contributed to the formulation of validity research needs discussed later. Second, much of the criticism revolved around a few critical validity questions that should continue to demand attention in any serious validation effort. I will emphasize three of those questions.
The three validity questions that I consider central to the controversy all involve discrepancies: (a) discrepancies between standards implied by judgments of different types of items (e.g., multiple choice vs. short answer or dichotomously scored vs. extended response tasks scored using multipoint rubrics), (b) discrepancies between descriptions of achievement levels with their associated exemplar items and the location of cut scores on the scale, and (c) discrepancies between the assessments and content standards. Large discrepancies between the levels at which judges would set standards when reviewing multiple-choice items, short-answer items, and extended-response tasks raise serious questions about the conceptual coherence of the judgments. Discrepancies between descriptions of achievement levels and the location of the cut score create a mismatch between what students with scores in the range of the scale corresponding to a given achievement level are said to be able to do and what they actually did on the assessment. Finally, a misalignment between performance and content standards on the one hand and assessments and cut scores on the other undermines the construct validity of the intended interpretations of the assessment results. Although the latter discrepancy is most commonly illustrated in the area of mathematics, where comparisons are made between the assessment and the standards published in Curriculum and Evaluation Standards for School Mathematics (National Council of Teachers of Mathematics, 1989), the validity concern is obviously a much broader one and should be addressed in developing a validity argument (Kane, 1992) in any subject matter area.
VALIDITY FRAMEWORK
Much of the debate about achievement levels has focused on the standard setting method. This is hardly surprising given that the NAE panel recommended against the use of "the Angoff method or any other item-judgment method to set achievement levels" because the panel concluded that such methods are "fundamentally flawed" (Shepard et al., 1993, p. 132). Although the method used clearly makes a difference in the outcome – a point that no one seems to dispute1 – focus on method too often does not come to grips with the fundamental questions of validity of associated interpretations. Instead, the point is made that there is no "right" answer. Mehrens (1995), for example, stated, "Standards are judgmental; there is no right answer as to where a standard should be set" (p. 254). Zieky (1995) made the same point: "We have learned that there is NO 'true' standard that the application of the right method, in the right way, with enough people, will find" (p. 29). I agree with the conclusions as stated by Mehrens and Zieky, but it should not be assumed, as it seems some may have, that this implies either (a) that there is therefore no need to obtain evidence to support the validity of uses and interpretations of the standards, or (b) that one method is as good as another.
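Because the Angoff item-judgment method figures so prominently in this debate, a brief illustration of its basic logic may be helpful. The sketch below is not drawn from the NAEP achievement-level setting itself; the judges, ratings, and five-item pool are hypothetical, and operational applications involve far more items, multiple judgment rounds, and feedback to judges.

```python
# Hypothetical illustration of an Angoff-style cut score computation.
# Each judge estimates, for every item, the probability that a minimally
# qualified ("borderline") examinee would answer the item correctly.

ratings = {
    "judge_1": [0.60, 0.45, 0.80, 0.30, 0.70],
    "judge_2": [0.55, 0.50, 0.75, 0.35, 0.65],
    "judge_3": [0.65, 0.40, 0.85, 0.25, 0.75],
}

# A judge's implied cut score is the expected raw score for a borderline
# examinee: the sum of that judge's item probabilities.
judge_cuts = {judge: sum(probs) for judge, probs in ratings.items()}

# The panel's cut score is the mean of the judges' expected scores.
cut_score = sum(judge_cuts.values()) / len(judge_cuts)
print(judge_cuts, round(cut_score, 2))
```

On this simplified account, the standard is simply the average of the judges' expected number-correct scores for a borderline examinee; disagreement among judges, and sensitivity of the ratings to item type, are precisely the features that the discrepancy analyses discussed here are designed to expose.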
The quality of methods can be distinguished on many grounds. Brennan (1995), for example, suggested replicability of results as a criterion for evaluating methods and noted how generalizability analyses with items, judges, and occasions as facets can be used in evaluating the quality of a method. Similarly, Mehrens (1995) relied on the criteria of ease of use and psychometric properties of the standard (mainly interjudge consistency) in arguing for the Angoff method as his preferred method. Two of the three criteria used by Kane (1995) to compare "task-centered approaches" (e.g., the Angoff method) with "examinee-centered approaches" were also of a similar nature: "practical feasibility in the context" and "technical considerations in the context" (p. 130).
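To make the replicability criterion concrete, the following sketch shows one very reduced form of such an analysis: treating judges as a random facet and comparing the panel cut score and its standard error across two judging occasions. The numbers and the two-occasion design are invented for illustration; a full generalizability study along the lines Brennan describes would estimate variance components for items, judges, and occasions jointly rather than rely on the simple summaries used here.

```python
import statistics

# Hypothetical judge-level cut scores (on the reporting scale) from two
# judging occasions; the values are invented for illustration only.
occasion_1 = [148.2, 151.0, 145.5, 150.3, 147.8]
occasion_2 = [149.0, 152.4, 144.9, 151.1, 146.5]

def panel_cut(judge_cuts):
    # The panel's cut score is the mean of the judges' individual cut scores.
    return statistics.mean(judge_cuts)

def se_of_cut(judge_cuts):
    # Standard error of the panel cut score, treating the judges as a random
    # sample from a larger pool of potential judges.
    return statistics.stdev(judge_cuts) / len(judge_cuts) ** 0.5

for label, cuts in (("occasion 1", occasion_1), ("occasion 2", occasion_2)):
    print(label, round(panel_cut(cuts), 1), "SE =", round(se_of_cut(cuts), 2))
```

Stable panel cut scores and small standard errors across occasions would speak to the replicability of a method, but, as argued below, such evidence bears on consistency rather than on the validity of the interpretations the standards are meant to support.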
There is a substantial literature on methods of setting standards. There is considerable information available to guide the selection and training of judges, the use of multiple judgment rounds, and the introduction of impact data. Comparative studies show that different methods yield different standards (e.g., see Mehrens, 1995). There is much less information about the psychological demands that different judgment procedures place on judges or about the degree to which different methods are likely to differ in the validity of interpretations of results.
Method differences in interjudge consistency or in other components of a generalizability study (e.g., occasions or subsets of items judged) are certainly relevant to an overall evaluation of validity, but they do not address core validity issues. Just as the most reliable test need not be the one that supports the most valid inferences about students, the method that yields the most replicable standards or the standards with the highest interjudge consistency need not produce the standards that yield the most valid interpretations of student achievement.
A common criticism of psychometricians is that although we all give lip service to the doctrine that "validity is the most important consideration in test evaluation" (AERA, APA, & NCME, 1985, p. 9), we too often act in other ways, giving more attention to the easier jobs of evaluating and enhancing reliability at the expense of the more difficult jobs of evaluating and enhancing validity (e.g., see Gipps, 1994, p. 76). This criticism also seems to fit much of the work on the development and evaluation of performance standards. Too little attention has been given to the interpretations associated with standards and the validity of those interpretations in comparison to the amount of attention given to technique, interjudge consistency, and lack of consistency across methods.
There are, of course, exceptions. Kane's (1995) first criterion for comparing task-centered and examinee-centered standard setting methods, for example, concerned the consistency of the methods with the "model of achievement" underlying the design of the assessment. Kane argued that "all aspects of an assessment program, including test development, scoring, standard setting, and reporting of results, should be consistent with the intended interpretation of the results" (p. 121). He concluded that holistic models of learning and achievement are more compatible with examinee-centered methods whereas analytic models are more compatible with task-centered methods. Kane indicated that such an analysis was not conclusive with regard to NAEP because both models have adherents in the NAEP context, and there is some reason to believe that NAEP may be in a state of transition from a dominant analytic model to greater emphasis on a holistic model. The key point for present purposes, however, is that such an analysis usefully focuses attention on the achievement construct that the assessment is intended to measure. This seems the proper place to begin a serious consideration of validity of standards-based interpretations of assessment results.
The previously mentioned discrepancies in the location of the standard implied by judgments of different types of items are relevant in this regard. Item type represents method variance that is a potential source of invalidity in the location of cut scores in relation to the performance standard constructs. The discrepancies in cut scores as a function of item type reported by Shepard et al. (1993) were quite large in both reading and mathematics at all three grade levels and all three achievement levels. The largest of the discrepancies between cut scores based on different item types at a given grade and achievement level were larger than the differences between cut scores across grades or achievement levels when the same item type was being judged. For example, at Grade 4 in mathematics, the basic-level cut score based on...