Chapter 1
The Changing Landscape of Teacher Evaluation
Both the rhetoric and substance of teacher evaluation have changed dramatically over the last few years, due in part to a number of commentaries that have made strong claims regarding the inadequacies of traditional teacher evaluation systems. For example, Toch and Rothman (2008) said of traditional evaluation practices that they are "superficial, capricious, and often don't even directly address the quality of instruction, much less measure students' learning" (p. 1). Similarly, Weisberg, Sexton, Mulhern, and Keeling (2009) explained that teacher evaluation systems have traditionally failed to provide accurate and credible information about the effectiveness of individual teachers' instructional performance. A 2012 report from the Bill & Melinda Gates Foundation entitled Gathering Feedback for Teaching summarized the failings of teacher evaluation systems in the following way:
The nation's collective failure to invest in high-quality professional feedback to teachers is inconsistent with decades of research reporting large disparities in student learning gains in different teachers' classrooms (even within the same schools). The quality of instruction matters. And our schools pay too little attention to it. (p. 3)
Examples of similar sentiments abound in current discussions of teacher evaluation reform (e.g., Kelley, 2012; Strong, 2011).
Evidence for the Need for Change
Claims like those cited above have credible evidence supporting them. One can make a case that evidence impugning teacher evaluation started to accrue in the 1980s as a result of a study conducted by the RAND group entitled Teacher Evaluation: A Study of Effective Practices (Wise, Darling-Hammond, McLaughlin, & Bernstein, 1984). Along with their general finding that teacher evaluation systems were not specific enough to increase teachers' pedagogical skills, the researchers noted that teachers were the biggest critics of their current, narrative evaluation systems and the strongest proponents of a more specific and rigorous approach: "In their view, narrative evaluation provided insufficient information about the standards and criteria against which teachers were evaluated and resulted in inconsistent ratings among schools" (Wise et al., 1984, p. 16). Since this study first appeared, evidence of the inadequacies of teacher evaluation systems and commentary on that evidence have been mounting in the research and theoretical literature (e.g., McGreal, 1983; Glatthorn, 1984; Glickman, 1985; Danielson, 1996).
Without question, two reports, both of which we cited previously, catapulted the topic of inadequacies of teacher evaluation into the limelight: Rush to Judgment (Toch & Rothman, 2008) and The Widget Effect (Weisberg et al., 2009). Rush to Judgment detailed a study that found that 87 percent of the 600 schools in the Chicago school system did not give a single "unsatisfactory" rating to any of their teachers, even though more than 10 percent of those schools had been classified as failing educationally. In total, only 0.3 percent of all teachers in the system were rated as "unsatisfactory." By contrast, 93 percent of the city's 25,000 teachers received "excellent" or "superior" ratings.
The Widget Effect derives its name from the fact that teacher evaluation systems have traditionally not discriminated between effective and ineffective teachers:
The Widget Effect describes the tendency of school districts to assume classroom effectiveness is the same from teacher to teacher…. In its denial of individual [teacher] strengths and weaknesses, it is deeply disrespectful to teachers; in its indifference to instructional effectiveness, it gambles with the lives of students. (Weisberg et al., 2009, p. 4)
The authors of The Widget Effect found that, in a district with 34,889 tenured teachers, only 0.4 percent received the lowest rating, whereas 68.75 percent received the highest rating. These findings and others were publicized in the popular 2010 movie Waiting for ‘Superman.’ This movie, along with a veritable flood of commentaries on local and national news shows, brought the issue of teacher evaluation into sharp relief.
By the end of the first decade of the new century, the inadequacies of teacher evaluation systems were well known and a matter of public discussion. This enhanced level of public awareness, along with federal legislation, placed educator evaluation in the spotlight.
The Federal Impetus for Evaluation Reform
On July 24, 2009, President Barack Obama and Secretary of Education Arne Duncan announced the $4.35 billion education initiative Race to the Top (RTT). Designed to spur nationwide education reform in K–12 schools, the grant program was a major component of the American Recovery and Reinvestment Act of 2009. The program offered states significant funding if they were willing to overhaul their teacher evaluation systems. To compete, states had to agree to implement new systems that would weight student learning gains as part of teachers' yearly evaluation scores and had to implement performance-based standards for teachers and principals. The U.S. Department of Education's A Blueprint for Reform (2010) stated: "We will elevate the teaching profession to focus on recognizing, encouraging, and rewarding excellence. We are calling on states and districts to develop and implement systems of teacher and principal evaluation and support, and to identify effective and highly effective teachers on the basis of student growth and other factors" (p. 4). The report went on to explain: "Grantees must be able to differentiate among teachers and principals on the basis of their students' growth and other measures, and must use this information to differentiate, as applicable, credentialing, professional development, and retention and advancement decisions, and to reward highly effective teachers and principals in high-need schools" (p. 16).
In addition to stimulating the discussion about teacher evaluation, RTT legislation generated substantive and concrete change. A Center for American Progress report released in March 2012 noted that "Overall, we found that although a lot of work remains to be done, RTT has sparked significant school reform efforts and shows that significant policy changes are possible" (Boser, 2012, p. 3). The author went on to say:
We suffer under no illusion that a single competitive grant program will sustain a total revamping of the nation's education system. Nor do we believe that a program like RTT will be implemented exactly as it was imagined—one of the goals of the program was to figure out what works when it comes to education reform. Yet two things have become abundantly clear. There's a lot that still needs to be done when it comes to Race to the Top, and many states still have some of the hardest work in front of them. But it's also clear that a program like Race to the Top holds a great deal of promise and can spark school reform efforts and show that important substantive changes to our education system can be successful. (p. 5)
Currently, the two major changes being implemented in teacher evaluation are directly traceable to RTT legislation: (1) use of measures of student growth as indicators of teacher effectiveness, and (2) more rigor in measuring the pedagogical skills of teachers. Both of these initiatives come with complex issues in tow.
Issues with Measuring Student Growth
As we have seen, including measures of students' growth in teacher evaluation systems is not only a popular idea, but an explicit part of RTT legislation. There is an intuitive appeal to using such measures and some literature supporting this practice. For example, a report from the Manhattan Institute for Policy Research (Winters, 2012) noted:
On this last point, modern statistical tools present a promising avenue for reform. These measures, used in tandem with traditional subjective measures of teacher quality, could help administrators make better-informed decisions about which teachers should receive tenure and which should be denied it. Statistical evaluations can also be used to identify experienced teachers who are performing poorly, with an objectivity that reduces the risk of a teacher being persecuted by an administrator. (p. 2)
The report further explained that growth measures "can be a useful piece of a comprehensive evaluation system. Claims that it is unreliable should be rejected. [Value-added measures], when combined with other evaluation methods and well-designed policies, can and should be part of a reformed system that improves teacher quality and thus gives America's public school pupils a better start in life" (p. 7). Similar conclusions were reported in a study by the National Bureau of Economic Research (Chetty, Friedman, & Rockoff, 2011):
Students assigned to … teachers [with high value-added scores] are more likely to attend college, attend higher-ranked colleges, earn higher salaries, live in higher [socioeconomic status] neighborhoods, and save more for retirement. They are also less likely to have children as teenagers. Teachers have large impacts in all grades from 4 to 8. On average, a one standard deviation improvement in teacher [value-added scores] in a single grade raises earnings by about 1% at age 28. (p. 2)
The term commonly used to describe measures of student growth is value-added measure (VAM). In layman's terms, a VAM is a measure of how much a student has learned since some designated point in time (e.g., the beginning of the school year). State-level tests are typically used to compute VAM scores for each student, and the average VAM score for a teacher's class is used as a measure of the teacher's impact on students. An assumption underlying the use of VAMs is that teachers whose students have higher VAM scores are doing a better job than teachers whose students have lower scores. As intuitively logical as this might seem, many researchers and theorists strongly object to using VAMs as a component of teacher evaluation. For example, Darling-Hammond, Amrein-Beardsley, Haertel, and Rothstein (2012) articulated a comprehensive critique of the assumptions underlying the use of VAMs. They began by noting:
Using VAMs for individual teacher evaluation is based on the belief that measured achievement gains for a specific teacher's students reflect that teacher's "effectiveness." This attribution, however, assumes that student learning is measured well by a given test, is influenced by the teacher alone, and is independent from the growth of classmates and other aspects of the classroom context. None of these assumptions is well supported by current evidence. (p. 8)
The authors then listed three criticisms of VAMs that they claimed rendered them inappropriate as high-stakes measures of teacher effectiveness:
Criticism #1: VAMs of teacher effectiveness are inconsistent. Research indicates that a teacher's VAM score can change rather dramatically from year to year. For example, Darling-Hammond and colleagues cited a study by Newton, Darling-Hammond, Haertel, and Thomas (2010) that examined VAM data from five school districts. The researchers found that of the teachers who scored in the bottom 20 percent of rankings one year, only 20 to 30 percent scored in the bottom 20 percent the next year, while 25 to 45 percent moved to the top part of the distribution. These changes might have little or nothing to do with an increase or decrease in teacher competence but a great deal to do with differences in students from year to year.
Criticism #2: VAM scores differ significantly when different methods are used to compute them and when different tests are used. Equations used to compute VAMs can take a variety of forms, which we discuss in greater detail in Chapter 2. For now, let's simply say that equations used to compute VAMs can differ in the variables they use to predict student achievement and in the weights given to those variables. For example, one type of VAM equation might rely heavily on measures of student achievement in prior years, whereas another type might not. Darling-Hammond and colleagues cited studies indicating that different equations can produce rather dramatically different teacher rankings: "For example, when researchers used a different model to recalculate the value-added scores for teachers published in the Los Angeles Times in 2011, they found that from 40% to 55% of teachers would get noticeably different scores" (p. 9). In other words, teacher rankings can change based on the type of VAM equation used.
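For readers comfortable with a bit of code, this point can be made concrete with a deliberately simplified sketch. All scores below are hypothetical, and both models are far cruder than any operational VAM equation, but they show how two defensible computations can rank the same two teachers in opposite orders:

```python
# Illustrative only: two toy "value-added" models applied to the same
# hypothetical students. Each student is a (pre-test, post-test) pair.

def simple_gain_vam(students):
    """Model 1: a teacher's VAM is the class's average gain score."""
    return sum(post - pre for pre, post in students) / len(students)

def regression_vam(students, all_students):
    """Model 2: fit a least-squares line predicting the post-test from the
    pre-test across ALL students, then score a teacher by the class's
    average residual (actual minus predicted achievement)."""
    n = len(all_students)
    mean_pre = sum(pre for pre, _ in all_students) / n
    mean_post = sum(post for _, post in all_students) / n
    slope = (sum((pre - mean_pre) * (post - mean_post)
                 for pre, post in all_students)
             / sum((pre - mean_pre) ** 2 for pre, _ in all_students))
    intercept = mean_post - slope * mean_pre
    residuals = [post - (intercept + slope * pre) for pre, post in students]
    return sum(residuals) / len(residuals)

teacher_a = [(90, 94), (85, 90), (80, 86)]  # higher-achieving intake
teacher_b = [(50, 57), (55, 61), (60, 66)]  # lower-achieving intake
everyone = teacher_a + teacher_b

# Model 1 ranks Teacher B higher (larger average gains)...
print(simple_gain_vam(teacher_a), simple_gain_vam(teacher_b))
# ...but Model 2 ranks Teacher A higher (larger average residuals).
print(regression_vam(teacher_a, everyone), regression_vam(teacher_b, everyone))
```

Nothing about either model is unreasonable on its face; the reversal comes entirely from how each equation adjusts (or fails to adjust) for students' starting points, which is precisely the sensitivity the critics describe.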
Additionally, tests that purportedly measure the same content can produce different VAM scores (Bill & Melinda Gates Foundation, 2011; Lockwood et al., 2007). If, for example, two different tests of mathematics achievement are used within a district, teacher rankings based on these two different measures could vary considerably. Darling-Hammond and colleagues noted that "[t]his raises concerns about measurement error and … the effects of emphasizing ‘teaching to the test’ at the expense of other kinds of learning, especially given the narrowness of most tests in the United States" (p. 9).
Criticism #3: Ratings based on VAMs can't disentangle the many influences on student progress. Darling-Hammond and colleagues concluded that teacher effectiveness "is not a stable enough construct to be uniquely identified even under ideal conditions" (p. 11). For example, a teacher might be very effective with one group of students but not with another. To illustrate, the authors cited the example of an 8th grade science teacher with low VAM scores who exchanged classes with a 6th grade science teacher who had high VAM scores, under the assumption that the 6th grade teacher would be able to produce better learning with the 8th grade teacher's students. Instead, the 8th grade teacher started to receive high VAM scores with the 6th grade students and the 6th grade teacher started to receive low VAM scores with the 8th grade students. Darling-Hammond and colleagues noted: "This example of two teachers whose value-added ratings flip-flopped when they exchanged assignments is an example of a phenomenon found in other studies that document a larger association between the class taught and value-added ratings than the individual teacher effect itself" (p. 12).
Issues with Measuring Teachers' Pedagogical Skills
In Chapter 3, we consider effective techniques for measuring teacher pedagogical skill. Here, we briefly introduce the topic and place it in the context of research on teacher effectiveness.
Over the years, the research has been consistent regarding the powerful effects teachers can have on their students' achievement. Many large-scale studies have provided evidence to this end. Three have been particularly influential. The first study, conducted in the mid-1990s, involved five subject areas (mathematics, reading, language arts, social studies, and science) and some 60,000 students across grades 3 through 5 (Wright, Horn, & Sanders, 1997). The authors' overall conclusion was as follows:
The results of this study well document that the most important factor affecting student learning is the teacher. In addition, the results show wide variation in effectiveness among teachers. The immediate and clear implication of this finding is that seemingly more can be done to improve education by improving the effectiveness of teachers than by any other single factor. Effective teachers appear to be effective with students of all achievement levels regardless of the levels of heterogeneity in their classes [emphasis in original]. If the teacher is ineffective, students under that teacher's tutelage will achieve inadequate progress academically, regardless of how similar or different they are regarding their academic achievement. (Wright et al., 1997, p. 63)
The second study, conducted in the early 2000s (Nye, Konstantopoulos, & Hedges, 2004), involved 79 elementary schools in 42 school districts in Tennessee. It is noteworthy in that it also involved random assignment of students to classes and controlled for factors such as students' previous achievement, socioeconomic status, ethnicity, and gender, as well as class size and whether or not an aide was present in class. The study authors reported:
These findings would suggest that the difference in achievement gains between having a 25th percentile teacher (a not so effective teacher) and a 75th percentile teacher (an effective teacher) is over one-third of a standard deviation (0.35) in reading and almost half a standard deviation (0.48) in mathematics. Similarly, the difference in achievement gains between having a 50th percentile teacher (an average teacher) and a 90th percentile teacher (a very effective teacher) is about one-third of a standard deviation (0.33) in reading and somewhat smaller than half a standard deviation (0.46) in mathematics…. These effects are certainly large enough effects to have policy significance. (Nye et al., 2004, p. 253)
The third study was designed to determine whether teacher effects in elementary grades persist over multiple years (Konstantopoulos & Chung, 2011). After examining data from over 2,500 students across multiple grades, the authors concluded:
In sum, the results of this study are robust and consistently show that teachers matter in early grades. The effects of teachers persist through the sixth grade for all achievement tests. In addition, the cumulative teacher effects were substantial and highlighted the importance of having effective teachers for multiple years in elementary grades. (Konstantopoulos & Chung, 2011, p. 384)
For well over a decade, the research has consistently demonstrated that an individual classroom teacher can have a powerful, positive effect on the learning of his or her students. To dramatize the research findings over the years, Strong (2011) cited the extensive research and commentary of the economist Eric Hanushek (Hanushek, 1971, 1992, 1996, 1997, 2003, 2010; Hanushek, Kain, & Rivkin, 2004; Hanushek & Rivkin, 2006; Hanushek, Rivkin, Rothstein, & Podgursky, 2004). Basing his conclusions on Hanushek's work, Strong noted "the economic value of having a higher-quality teacher, such that a teacher who is significantly above aver...