Chapter 1
Summative evaluation of standards-based curricula
When new curricula are developed and published, the claims that they will improve student achievement are typically based on market research and/or on the authors’ own claims following one or more trials of the materials in classrooms. The validity of such claims for diverse sites is often questionable. What is needed is a summative evaluation of any particular curriculum. Such an evaluation involves large-scale studies that gather data to determine whether the newly created product is ready for large-scale use. Because these studies are expensive and difficult to carry out, examples of them are unfortunately rare.1
In 1992 the National Science Foundation (NSF) funded several projects to develop new sets of instructional materials that reflected the reform vision of school mathematics espoused by the National Council of Teachers of Mathematics (NCTM, 1989). As the development of these materials was nearing completion, government agencies and educational researchers began to call for evidence that these were “research-based instructional materials” and could be used by other schools to improve student achievement.2 Their intent was to ask the developers, or publishers, of such materials to go beyond the conventional “market research” toward more “scientific research”3 to support any claims of increased student performance.
The National Research Council’s review of the many studies related to the development of standards-based curricula found many reasonable design studies but no large-scale summative evaluations of any of the new curricula (National Research Council, 2004). The chapters in this book portray an example of a summative evaluation of a “standards-based” curriculum for middle schools, Mathematics in Context, and describe the complexity and difficulties of conducting such research. Furthermore, this study demonstrates the kinds of assessments that can be meaningfully done amidst the complexities of “real world” implementations.
To understand the pressures for well-conducted summative evaluations of the standards-based curricula, we have chosen to focus on:
- summative evaluation as an aspect of curriculum research,
- the fact that randomized experiments do not adequately address the contextual complexity of the instructional dynamics in classrooms, and
- the potential of structural modeling as the basis for capturing the complexity of classroom instruction in such investigations.
Summative evaluation as an aspect of curriculum research
The evolution of research methods in education during the past quarter century has been discussed in several reports (e.g., Lagemann, 2000; Lagemann & Shulman, 1999; Shavelson & Towne, 2002), and specifically in mathematics education (e.g., Romberg, 1992; Schoenfeld, 1994, 2001). Researchers who have studied the development and use of new products, such as curricula, have used several different methods for gathering information and making judgments based on those data. For example, four general types of evaluations have been described in the literature: needs assessment, formative evaluation, summative evaluation, and illuminative evaluation (e.g., Romberg, 1992). All such evaluations involve gathering data to determine the usability of the product in educational settings, and as such should be considered as aspects of curriculum research.4
It should also be understood that federal-level insistence on information about the impact of new programs on student achievement is not new. Such calls began with the burst of reform programs associated with the mid-1960s Great Society initiatives in the United States. In areas as diverse as bilingual education, career education, compensatory programs, reading, or mathematics, little expertise in evaluation existed in the very agencies responsible for carrying out program evaluations. In fact, the initial training institute on program evaluation was held at the University of Illinois in 1963 under the direction of Lee Cronbach (Romberg, 1988).
Clements (2007) argues that summative evaluations should use a broad set of instruments to assess the impact of the implementation on participating children, teachers, program administrators, and parents, as well as to document the fidelity of the implementation and the effects of the curriculum across diverse contexts … Ideally, because no set of experimental variables is complete or appropriate for each situation, qualitative inquiries [should] supplement these analyses.
(p. 53)
Thus, in summative evaluations the derived information comes from both quantitative and qualitative sources collected in several contexts to answer questions such as the following:
- What is the impact of the use of the new program on student achievement?
- How is the impact of instruction using the new curriculum different from that of conventional instruction on student performance?
- What variables associated with classroom instruction account for variation in student performance?
The problem when attempting to answer such questions, as many authors have pointed out, is that schooling is a complex and dynamic social enterprise that does not fit the standard research methods prevalent in many other fields (Brown, 1992). In particular, the use of randomized experiments adapted from agricultural research simply cannot cope with the dynamics of classroom research.
Randomized experiments and the contextual complexity of classroom instruction
Driven by the spectacular success of experimental methods in agriculture and medicine, naïve policy makers have argued for the use of the “gold standard” of randomized controlled trials in summative evaluations of the new reform curricula so that “by harnessing the logical, conceptual, and computational power of mathematics and statistics, dubious notions about political and social dilemmas might be replaced with carefully reasoned and dispassionately tested scientific inferences” (DeNardo, 1998, p. 125). While the desire for reasoned, empirically based inferences is understandable, the belief that this can only be done via randomized experiments is not warranted.
The power of experimental methodology is based on three key assumptions:
- treatment effects are additive,
- treatment effects are constant, and
- there is no interference between different experimental units (Cox, 1958).
If these assumptions are reasonable, then strong inferences are possible. Before explaining these assumptions and discussing their consequences, three terms (treatments, effects, and experimental units) must be understood.5
Treatment. The objective of many agricultural experiments is to compare the yields of a number of plant varieties, fertilizers, or soil characteristics. The term “treatment”, for example, might refer to the use of fertilizers. For curriculum studies, “treatment” would translate to use of a particular curriculum.
Effects. The term “effects” refers to the yield or end product of a treatment. The number of bushels of corn in an agricultural study is an example of a yield. In curricular research this translates to assessment of performance at the termination of the “treatment”. This is now most often accomplished by developing tests based on the goals of the instructional treatment.
Experimental unit. In agriculture the term “experimental unit” refers to the soil plots or “the smallest division of the experimental material such that any two units may receive different treatments in the actual experiment” (Cox, 1958, p. 2). In curricular summative evaluations, the “experimental unit” should be the students and teachers in a classroom or school using the new curricula.
Treatment effects are additive
The mathematics of determining experimental effects is based on Equation 1 (Cox, 1958, p. 14).
y = u + t    (1)
Equation 1 says that the total quantitative effect or yield (y) after a treatment can be broken down into two sub-quantities: u, a quantity depending only on the particular experimental unit, and t, a quantity depending only on the treatment used. There are two immediate consequences of this assumption. The first is the quantifiability of the three terms y, u, and t. While it is true that one mark of a mature science is the possession of sophisticated measurement instruments and techniques, we must admit that at present in education, we are not able to quantify with any validity or accuracy many terms in an educational setting. The second consequence of the additive assumption is that treatment effects can be added or subtracted algebraically in order to remove the quantity depending upon the experimental unit. For example, one can estimate differences between two treatment effects. If measurements are taken after two different treatments, they can be subtracted. The following set of equations shows this algebraic process.
y1 = u + t1    (Measurement after Treatment 1)    (2)
y2 = u + t2    (Measurement after Treatment 2)    (3)
y1 − y2 = t1 − t2    (4)
Equation 4 says that the difference between the final measurements after Treatments 1 and 2 (y1 − y2) can be considered to also be the difference between the two effects due only to Treatments 1 and 2 (t1 − t2). Note, however, that this equation is correct only if the effects depending on the unit (the values of u) are equal for different units. This condition is assumed to hold when there is no systematic bias that differentiates the experimental units. Control of bias is accomplished by random assignment of experimental units to the alternate treatments.
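The role of random assignment in this argument can be illustrated with a small simulation. The sketch below is not part of the study: the function name, the unit quantities, and the treatment values t1 = 5 and t2 = 2 are hypothetical, chosen only to show that under the additive model the difference in group means stays close to t1 − t2 when units are assigned at random, and drifts far from it when assignment is tied to the unit quantity u.

```python
import random
import statistics

def simulate(assign_randomly, n_units=2000, t1=5.0, t2=2.0, seed=0):
    """Return mean(y | Treatment 1) - mean(y | Treatment 2) under y = u + t."""
    rng = random.Random(seed)
    # u: quantity depending only on the experimental unit (hypothetical values)
    units = [rng.gauss(50.0, 10.0) for _ in range(n_units)]
    if assign_randomly:
        rng.shuffle(units)          # random assignment: no systematic bias
    else:
        units.sort(reverse=True)    # biased: units with the largest u get Treatment 1
    half = n_units // 2
    y1 = [u + t1 for u in units[:half]]   # measurements after Treatment 1
    y2 = [u + t2 for u in units[half:]]   # measurements after Treatment 2
    return statistics.mean(y1) - statistics.mean(y2)

random_diff = simulate(assign_randomly=True)
biased_diff = simulate(assign_randomly=False)
print(f"true t1 - t2:      {5.0 - 2.0:.2f}")
print(f"random assignment: {random_diff:.2f}")
print(f"biased assignment: {biased_diff:.2f}")
```

In the biased case the unit quantities no longer cancel on average, so the observed difference mixes the treatment effect with differences between the units themselves.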
In fact, it is this randomized alternate-treatment design for classroom research that many policy makers are calling for in summative evaluations of the standards-based curricula. One group receives an experimental treatment based on the new curricula, the other receives whatever had typically been done (the conventional treatment), and the comparison is made in terms of differences between treatment groups on some post-test. Unfortunately, randomly assigning comparable classrooms to alternate treatments is often not feasible for a variety of reasons, some of them ethical. Hence, without random assignment, any claim of important differences of yield between treatments is logically suspect.
If random assignment is not possible, quasi-experiments are sometimes appropriate. These involve “matching” experimental units on some characteristics and adjusting achievement scores to account for some of the remaining differences. However, as Campbell and Stanley (1963) so forcefully point out, there are several sources of potential invalidity in this strategy, such as history, maturation, instrumentation, test-treatment interaction, and so forth. Investigators of classrooms simply cannot blindly use differences in scores as estimates of treatment effects.
Constancy of treatment effects
In an attempt to increase the generalizability of the findings, most researchers replicate the instructional procedure by using the basic program in two or more settings. The constancy assumption says that the treatment effect does not change when the treatment is given to two different units. Algebraically, this assumption allows Equation 5 to represent the measured effect after a specific treatment on one unit (u1), and Equation 6 to represent the measured effect of the same treatment upon a different unit (u2).
y1 = u1 + t    (5)
y2 = u2 + t    (6)
Since the treatment effect is assumed to be the same for these two different units (e.g., the same fertilizer is used with both units), the subtraction of Equation 6 from Equation 5 states a very important consequence of this assumption (see Equation 7).
y1 − y2 = u1 − u2    (7)
Equation 7 states that if two different experimental units (u1 and u2) receive the same treatment, then the difference in measured effects (y1 − y2) is due only to the difference between the units. With this assumption, when one compares the end-of-treatment measures, one is also comparing the differences between the experimental units. Note, following this argument, that if u1 and u2 were identical units, the difference in measured effects should be zero. However, if aspects of instructional procedures are adapted for different classes using the same curriculum, one would actually anticipate different treatment effects. In fact, instruction in one classroom is likely to differ considerably from instruction in another. Such differences are both natural and beneficial. Instructional events are not mechanistic routines to be blindly followed. Real events grow, change, and develop as the human beings involved in the event interact. In fact, it is the actual patterns of interactions, rather than the intended treatments, that are the important features of classroom instruction for policy makers, teachers, and other researchers.
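A short numerical illustration may make the consequence of dropping the constancy assumption concrete. The classroom quantities and effect sizes below are invented for illustration and do not come from the study; the sketch only works through the arithmetic of the additive model for two units.

```python
def measured_yield(unit_quantity, treatment_effect):
    """Additive model: y = u + t."""
    return unit_quantity + treatment_effect

# Hypothetical unit quantities for two classrooms
u1, u2 = 62.0, 55.0

# Case 1: constant treatment effect, the same t for both units
t = 4.0
diff_constant = measured_yield(u1, t) - measured_yield(u2, t)

# Case 2: teachers adapt the curriculum, so the realized effect differs by classroom
t_class1, t_class2 = 6.0, 1.5
diff_adapted = measured_yield(u1, t_class1) - measured_yield(u2, t_class2)

print("u1 - u2:                    ", u1 - u2)
print("y1 - y2, constant treatment:", diff_constant)
print("y1 - y2, adapted treatment: ", diff_adapted)
```

With a constant effect the difference in measurements equals u1 − u2. Once the realized effect varies by classroom, the same difference also contains the gap between the two effects, and the unit difference and the treatment difference can no longer be separated from the measurements alone.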
Lack of interference of experimental units
This assumption says that when more than one experimental unit is used, there is no interference or interaction between the units. This assumption is particularly important if statistical analysis is to be made of the observations, since the statistics are based on an assumption of independence of observations (or unit measurements). If the experimental units really were independent classes in different schools, then one might argue lack of interference between the classes. The assumption becomes problematic, however, when one uses individual students as the experimental unit. As students are exposed to new material, we expect them to assimilate those new ideas into their own personal meanings or ideational scaffolding. We expect the same instructional event to have different effects on different students. Some will assimilate and use lots of new information in one way; others may generate quite different kinds of new information and relationships. Researchers now generally agree that unless the influence of individual difference variables is considered, predicted outcomes of instructional events will be masked by within-treatment variation. Persons indeed do differ in how they respond to the same information or the same instructional procedures. Thus, the assumption that treatment effects are constant is simply false in most classroom circumstances.
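The reliance of the statistics on independent observations, noted above, can be sketched with a simulation. Everything in the sketch is an assumption made for illustration: the function name, the class size of 25, the student-level standard deviation of 10, and the idea of modeling classroom interaction as a single component shared by all students in a class. It compares the variance of class means with the value sigma^2 / n that would hold if students were independent.

```python
import random
import statistics

def class_mean_variance(shared_sd, n_classes=3000, class_size=25,
                        student_sd=10.0, seed=1):
    """Empirical variance of class means when students share a class-level component."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_classes):
        shared = rng.gauss(0.0, shared_sd)  # component common to the whole class
        scores = [shared + rng.gauss(0.0, student_sd) for _ in range(class_size)]
        means.append(statistics.mean(scores))
    return statistics.variance(means)

# Variance of a class mean if the 25 students were independent: sigma^2 / n
independent_prediction = 10.0 ** 2 / 25
var_no_sharing = class_mean_variance(shared_sd=0.0)
var_with_sharing = class_mean_variance(shared_sd=5.0)

print(f"predicted under independence: {independent_prediction:.2f}")
print(f"simulated, no shared effect:  {var_no_sharing:.2f}")
print(f"simulated, shared effect:     {var_with_sharing:.2f}")
```

When students in the same class influence one another, class means vary more than the independence formula predicts, so an analysis that treats each student as an independent unit would understate the uncertainty in its estimates.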
Also, such interference between units is the essential interaction between human beings one expects in classes. In fact, investigators have typically assumed that the treatment effect for a class is simply an aggregate of individual effects (individual students...