Large-scale Assessment Programs for All Students

Validity, Technical Adequacy, and Implementation

About this book

The need for a comprehensive volume that reviews both the processes and issues involved in developing, administering, and validating large-scale assessment programs has never been greater. These programs are used for many purposes, including instructional program evaluation, promotion, certification, graduation, and accountability. One of the greatest problems we face is how to deal with special needs and bilingual populations. Examining these processes and issues is the mission of this book. It is organized into the following five sections: Introduction, Validity Issues, Technical Issues, Implementation Issues, and Epilogue. Each chapter follows a common structure: an overview of critical issues, a review of relevant research, descriptions of current assessment methodologies, and recommendations for future research and practice.

Written by nationally recognized scholars, Large-Scale Assessment Programs for All Students: Validity, Technical Adequacy, and Implementation will appeal to anyone seriously involved in large-scale testing, including educators, policymakers, testing company personnel, and researchers in education, psychology, and public policy.


Information

Authors: Gerald Tindal, Thomas M. Haladyna
Publisher: Routledge
Year: 2012
Print ISBN: 9781138866645
Pages: 536
Chapter 1
Large-Scale Assessments for All Students: Issues and Options
Gerald Tindal
University of Oregon
Standardized, large-scale assessment clearly has proliferated in the past 10 to 20 years and is prominent in all 50 states. These testing programs are increasingly complex, covering a number of subject areas and testing formats, including multiple-choice tests and various types of performance assessments. Although the public press may appear to be negative on testing, Phelps (1998), in an extensive summary of 70 surveys conducted over the past 30 years, reported that “the majorities in favor of more testing, more high stakes testing, or higher stakes in testing have been large, often very large, and fairly consistent over the years, across the polls and surveys and even across respondent groups” (p. 14). Assessment has proliferated and been supported not only in the United States but also throughout the world, and a clear trend exists toward more, not less, large-scale testing: “Twenty-seven countries show a net increase in testing, while only three show a decrease. Fifty-nine testing programs have been added while only four have been dropped” (Phelps, 2000). In summary, large-scale testing has been on the rise and is supported by the public.
Such support, however, raises the question: Have educational programs improved as a result of this proliferation of large-scale tests? What matters is not simply whether the public has positive perceptions of schools. Even improvement in test scores does not necessarily mean that schools have improved, because valid inferences about what that improvement means require both a nomological net of variables and a logic for relating them. Rather, the critical issue is whether schools are effective institutions in teaching students the skill and knowledge needed to be successful in post-school environments. It is essential that students have declarative knowledge in the cultural electives and imperatives (Reynolds, Wang, & Walberg, 1987) to understand how the social (economic and political) and physical (geographic) worlds operate. Students also need to be skilled in using that declarative knowledge to solve problems. They need to know when and how to use information, reflecting conditional and procedural knowledge, respectively (Alexander, Schallert, & Hare, 1991).
This focus on declarative as well as conditional and procedural knowledge has been incorporated into the measurement field through the increasing use of both selected and constructed responses. Students continue to take multiple-choice tests—an example of a selected response format—as part of most large-scale assessment programs. However, they also are being asked to solve open-ended problems and construct a response in which both the process and product are evaluated. In summary, our measurement systems have become more complex, and our understanding of them has deepened to include validation in reference to inferences or interpretations as well as use.
This chapter provides a preview of these two major issues currently being addressed in educational measurement. First, with the most recent Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), the validation process is focused on constructs and decisions. Although content and criterion validity (predictive and concurrent) continue to be important components in understanding our measurement systems, the clear focus is on the inferences we make from the outcomes of our measurements. Second, with an emphasis on decision making, the validation process is more anchored to how we use measurement systems in educational practices. Changing practices, however, requires a different look at the role of measurement. Rather than measurement being used simply in a documentary, static manner within program evaluations, the argument is made that more quasi-experimental designs be used to better understand our measurement systems and hence the decisions we make from them.
Construct Validity: Evidential and Consequential Components
With Messick’s (1988, 1994) work serving as an influential guide in the emerging conception of validity, we first need to consider our interpretations. Both the evidential and consequential basis of interpretations from large-scale testing programs must be considered (Messick, 1989). In some cases, the research on the evidential basis has addressed components of the testing program, letting us understand specific aspects of how large-scale tests should be constructed or implemented. For example, Haladyna, Hess, Osborn-Popp, and Dugan (1996) studied local versus external raters in a state writing assessment program, and Linn, Betebenner, and Wheeler (1998) studied student choice of problems in a state math testing program. The former study reflected potential bias from raters, whereas the latter highlighted bias from students’ choice of problems to solve. Both studies reflect exemplary research on construct validity and a potential source of systematic error (bias) that can influence our interpretations.
In both studies, however, only one component of the state testing program was being analyzed. The focus of validity was on the construct being assessed in relation to the bias involved in the assessment. In contrast to this attention to specific practices, other research has focused on an entire statewide assessment program and the impact of large-scale testing programs. This growing body of research has focused on consequential validity, particularly unintended consequences. For example, according to Strong and Saxton’s (1996) evaluation of the Kentucky assessment program in reading, a substantial percentage of students performed relatively (and uniquely) low on the Kentucky Instructional Results Informational System (KIRIS), scoring at the novice or apprentice levels, but not on the American College Test (ACT). They concluded that “the KIRIS test fails to identify a significant percentage of students who read quite well according to the ACT…. Approximately 64% of students who score at the Apprentice level will achieve average and above on the ACT” (pp. 104–105). The question then moves to using the ACT as a measure of outcome success. In terms of validating large-scale testing programs in this manner, such an emphasis on concurrent validity results in an endless spiral with no eventual resolution.
A different problem emerged in a comprehensive review of the Maryland School Performance Assessment Program (MSPAP) by Yen and Ferrara (1997). Although they reported extensively on various design and psychometric components supporting MSPAP, they found a school-by-form interaction. With some schools and with certain forms, the results were uniquely different from those of other schools and forms. Therefore, they recommended the use of two to three forms for measuring school performance. They also noted that the “instructional validity is a weakness for MSPAP as evidenced by current low performance and resulting low teacher morale and by legal vulnerability when high-stakes consequences (described earlier) are enforced” (pp. 79–80). Yet with no attention to classroom variables, this line of research also becomes an endless documentation of problematic consequences that is of little help in understanding relationships to measure or change.
Such research on the impact of large-scale testing certainly has been a concern both in the measurement field and in practice (see Educational Measurement: Issues and Practice, Vol. 17, No. 2). This issue, however, is not new, but may have become more critical as large-scale assessments apply to more and different purposes and with higher stakes attached to them. Clearly, we have seen more negative impacts being reported. For example, Haas, Haladyna, and Nolen (1989) reported on the negative consequences of using the results from the Iowa Test of Basic Skills (ITBS) to evaluate teachers, administrators, and schools. Smith (1991) also documented the same kinds of negative effects from testing, such as (a) generation of shame, embarrassment, and guilt; (b) questions of validity that lead to dissonance and alienation; (c) creation of emotional impact and anxiety in students; (d) reduction of time available for teaching; (e) resultant narrowing of the curriculum; and (f) deskilling of teaching that results from teaching in a testlike manner. In a survey on the effects of standardized tests on teaching, Herman and Golan (1993) found similar negative effects: teachers felt pressure to improve student scores, administrators focused on test preparation, instructional planning was affected (in both content and format), and more student time was spent preparing for testing.
Yet Herman and Golan (1993) also noted some very positive results. Teachers continued giving attention to nontested instructional areas, schools gave attention to instructional renewal, and teachers were both positive about their work environment and felt responsible for student performance. They concluded that the findings depend on “whether or not one views the standards and skills embodied in mandated tests as educationally valid and significant” (p. 24). As mentioned earlier, there is growing evidence that much of the public does view them in this way. Certainly, this kind of work on the components and impacts of large-scale assessment programs must continue. Specific construct validation studies need to be conducted at all levels, whether we are measuring specific skill areas to make inferences about students’ proficiencies or entire systems to document accountability and perceptions of impact. At the system level, perceptions and impact need to be documented for all stakeholders in our large-scale assessment programs. Parents and teachers are obviously critical participants. As we move to high-stakes accountability systems, administrators also are key individuals. With the increasing use of standards, critical stakeholders extend further into our economic and political systems, including business leaders and state directors of testing. However, we also need to understand the relationship between that which we measure on the large-scale tests and that which we manipulate at the individual level, whether student or teacher. In the next section, the focus is on changing practices in the classroom. We need to use measurement systems more dynamically in our research and educational practices. This research on measurement needs to be technically adequate and the practice systemically related to classrooms.
Technical Issues in Using Measurement in the Classroom
To begin using our measurement systems in classrooms, we first need to operationalize performance tasks and then define the domain they represent. Although performance tasks are likely to be related to classroom instruction, they also are related to many other things. As Messick (1994) described them, performance assessments are purportedly both authentic and direct, a validity claim that needs evidence. For him, “authenticity and directness map, respectively, into two familiar tenets of construct validity, that is, minimal construct under-representation and minimal construct-irrelevant variance” (p. 14). When constructs are underrepresented, testing is not sensitive enough to ascertain the full range of skill and knowledge exhibited in a performance. For example, writing tests that require students to edit grammar and syntax may be viewed as underrepresenting the construct of writing. In contrast, when constructs are overrepresented in a test, the opposite problem occurs: more is being measured than intended. Many problem-solving tasks in content areas (math, science, and social sciences) exemplify this problem because they require writing as an access skill.
These two features of validity represent trade-offs “between breadth and depth in coping with domain coverage and limits of generalizability” (p. 13), with coverage referring to both content and process and generalizability referring to all dimensions of the context as well as the domain. In these examples, selected response writing tests may suffer from a lack of depth and limit the generalizations that can be made when making inferences about a student’s writing (composition) skills. Performance tasks in content areas may be providing more depth in problem solving, but at a cost of being too broad in response dimensions and too narrow in content coverage. This again limits the inferences that can be made of a student’s skill and knowledge.
The measurement field has tended to focus primarily on construct-irrelevant variance of performance assessments through generalizability studies, with an emphasis on making inferences from a student’s obtained score to the student’s universe score (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson, Webb, & Rowley, 1989). That is, assuming the obtained score is only that which is attained at that time and under those specific circumstances (both of which contain error), what would a student’s score be when considering all such possible attempts and circumstances? To answer this question, variance generally is partitioned into error and true score and can be attributed to the person (test taker), markers (judges), tasks, or scales in determining the reliability of a response. When performance assessments result in great variance, they can lead to erroneous generalizations. In the end, generalizations are made from a specific performance to a complex universe of performances, incorporating all possible combinations of these components in which unreliability arises from tasks, occasions, raters, or any combination.
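To make that partitioning concrete, the following is a minimal sketch of the variance decomposition for a fully crossed persons-by-tasks-by-raters (p × t × r) design, written in standard generalizability-theory notation (after Cronbach et al., 1972, and Shavelson et al., 1989) rather than drawn from this chapter; the choice of facets and the sample sizes n_t and n_r are illustrative assumptions.

% Total observed-score variance for a fully crossed p x t x r design
% (persons, tasks, raters), decomposed into its components:
\sigma^2(X_{ptr}) = \sigma^2_{p} + \sigma^2_{t} + \sigma^2_{r}
                  + \sigma^2_{pt} + \sigma^2_{pr} + \sigma^2_{tr} + \sigma^2_{ptr,e}

% Relative error for a decision study that samples n_t tasks and n_r raters;
% only components that interact with persons contribute:
\sigma^2_{\delta} = \frac{\sigma^2_{pt}}{n_t} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{ptr,e}}{n_t\, n_r}

% Generalizability coefficient (the analogue of a reliability coefficient):
E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{\delta}}

Read this way, a performance assessment whose task-related components (the pt and ptr,e terms) are large will yield a low generalizability coefficient unless many tasks are sampled, which is one way of seeing why generalization over tasks becomes the weak link discussed below.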
Kane, Crooks, and Cohen (1999) put forth one of the most elegant strategies for addressing the validation process. They essentially proposed establishing a sequence of propositions in the form of a bridge with three links, which refer to scoring, generalization, and extrapolation:
1. A connection must be established from observations (of a performance) to observed scores (as part of administered tasks and as rated by judges). In this link, threats to validity must be countered that arise from the conditions of administration and contexts of assessments, as well as from the adequacy of the judgments that are rendered. This connection refers to scoring.
2. Assuming the first link can be made adequately, another connection needs to be made from the observed score to a universe of scores (as sampled by various representative tasks from which there are many). The most significant problem is the generalizability over tasks, which they noted “tends to be the weak link in performance assessments and therefore deserves extra attention” (p. 16). This connection refers to generalization.
3. Finally, assuming the prior link from sampled to possible tasks is adequate, a connection must be made from this range of tasks to a target score that reflects the broad construct for making inferences. This connection refers to extrapolation.
In their summary of the validation process, they argued that equal attention needs to be given to precision (generalization across tasks) and fidelity (extrapolation) for assessments to be useful in making decisions: “If any inference fails, the argument fails” (p. 16). Kane et al. (1999) suggested simulations and standardized tasks as a strategy for better controlling task variance without fully compromising the (extrapolation) inferences made to the universe.
The Symbiotic Relationship Between Teaching and Learning
Although we may have the logic in validating performance measures, we certainly do not yet have adequate information on how to integrate such information into the classroom. We have little empirical research to guide teachers or testers in the process of preparing students to learn and perform. For example, Shepard, Flexer, Hiebert, Marion, Mayfield, and Weston (1996) reported that classroom performance assessments had little influence on achievement. In a fairly elaborate research design, two groups were compared: a participant group that focused on performance assessments and a control group that had no such focus. At the end of 1 year, however, they found that the participating teachers had not really implemented the treatment until late in the year and into the next year. Yet what was the treatment, and how much of it was needed?
The central question is how to relate teaching to learning. This issue implies a focus on teaching, not just making inferences from our measures or ascertaining perceptions of impact. We must shift our focus to include both teaching and improved student learning because, in the absence of well-defined instruction, the field is not better off: Replication an...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright
  5. Contents
  6. Preface
  7. 1. Large-Scale Assessments for All Students: Issues and Options
  8. Part I: Validity Issues
  9. Part II: Technical Issues
  10. Part III: Implementation Issues
  11. Part IV: Epilogue
  12. Author Index
  13. Subject Index