ARRIVING AT ACCOUNTABILITY
How Did We Come to Be Where We Are?
The nation is in the midst of an accountability addiction. No Child Left Behind (NCLB, 2001) provided a taste of current-day accountability, and although it was not successful, the need to document progress in education has made a lasting impression. Today, high-stakes decisions are being made about teachers and students on the basis of test scores. In the 2013–2014 school year under Move on When Reading, Arizona will retain any third grader who falls far below the designated reading standard (Arizona Department of Education, 2012). Using 2012 test scores, an estimated 3,330 students will not progress to fourth grade. Fourteen other states have adopted similar laws (Expect More Arizona, 2012). In just 2 years, the District of Columbia Public Schools (DCPS) has fired 423 teachers for low performance under its new teacher evaluation program, IMPACT, a program that costs the city $7 million a year (DCPS, 2011; Dillon, 2011). In 2012, 45% of teachers eligible for tenure in New York City were denied, as compared to 3% in 2007 (Baker, 2012). As can be imagined, reactions to these new developments have been varied and contentious. On the one hand, there has been considerable teacher resistance to these measures; consider the Chicago teacher strike in September of 2012 (Davey, 2012). On the other hand, there has been commentary that the consequences have been insufficient and that evaluation systems have identified too few teachers as unsatisfactory (Badertscher, 2013). Regardless of how these new changes are being perceived, they are not entirely "new." What appears to be a recent addiction to accountability has actually been brewing for some time.
Throughout history, teachers and students have been graded using a variety of methods. In the most recent federal education initiative, Race to the Top (RttT), teachers are evaluated primarily on their ability to demonstrate student achievement gains and, to a lesser extent, on their ratings on observational measures of teaching. This trend has become increasingly popular with new statistical techniques that assess the "value" a teacher "adds" to student learning (also known as value-added modeling; see Harris, 2011, for detailed information about value-added techniques). A recent report using such modeling indicated that teachers have a lasting effect on students: students who receive 1 year of instruction with a highly effective teacher are more likely to go to college, less likely to become teenage mothers, and more likely to earn higher incomes, with an average increase in lifetime earnings of $50,000 (Chetty, Friedman, & Rockoff, 2011). Some, of course, have challenged these claims, but the report nonetheless received significant media coverage (e.g., PBS, New York Times, Harvard Magazine, Education Week, CNN). It also brought value-added modeling, and the importance of teachers and teacher evaluation, into the homes of those directly or even indirectly interested in schooling in America. The evolution leading up to current approaches to evaluation is complex, changing, and not nearly as modern as perceived (as we will demonstrate, teacher evaluation and assessment methods often appear, disappear, and then return again as newly discovered). In this chapter, we provide a modest historical account of the rise of accountability through the lens of teacher and student evaluation, demonstrating what methods have been and can be used, why such methods have gained popularity, and whether such methods will yield promising results.
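To make the intuition behind value-added modeling concrete, the following is a purely illustrative sketch, not the models used in the studies cited above (which include many covariates, multiple years of data, and statistical adjustments such as shrinkage). It uses invented data and treats a teacher's "value added" as the average amount by which their students' end-of-year scores exceed what a simple prior-score regression predicts.

```python
# Hypothetical, invented records: (teacher, prior_score, current_score).
records = [
    ("A", 60, 68), ("A", 70, 80), ("A", 80, 88),
    ("B", 60, 61), ("B", 70, 72), ("B", 80, 79),
]

# Step 1: fit current = a + b * prior by ordinary least squares
# over all students, pooled across teachers.
n = len(records)
mean_x = sum(r[1] for r in records) / n
mean_y = sum(r[2] for r in records) / n
b = (sum((r[1] - mean_x) * (r[2] - mean_y) for r in records)
     / sum((r[1] - mean_x) ** 2 for r in records))
a = mean_y - b * mean_x

# Step 2: a teacher's value-added estimate is the mean residual
# (actual minus predicted score) across that teacher's students.
value_added = {}
for teacher in sorted({r[0] for r in records}):
    resid = [r[2] - (a + b * r[1]) for r in records if r[0] == teacher]
    value_added[teacher] = round(sum(resid) / len(resid), 2)

print(value_added)  # Teacher A's students beat predictions; B's fall short.
```

In this toy example teacher A's students score about 4 points above prediction on average and teacher B's about 4 points below, which is the kind of contrast a value-added estimate is meant to surface.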
History of Student Testing
The assessment of student learning outcomes dates back long before the time of teacher evaluation, but rather than treating student learning as a product of teachers' instruction, it was often viewed as a reflection of school progress and achievement. In the section below, we document how student achievement eventually came to be used as a way of holding others accountable for student learning and as a measure of teacher effectiveness.
The Growing Popularity of Student Assessment
Tests are not new. The earliest, most complex, and best-known forms of testing can be traced back to the Chinese, Greeks, and Romans (Odell, 1928). Evidence of student assessment in the United States, often in the form of oral and written examinations, dates as far back as the 1820s (Garrison, 2009; Reese, 2007); however, in these early forms of assessment students usually answered different questions (Odell, 1928). It became clear that some students were given easier questions than others, which called into question whether the assessments were fair (of course, issues of fairness have remained, as seen in recent examinations of standardized test questions (Collins, 2012) and cheating scandals (Gabriel, 2010)). Hence, the need for objective measures of student learning quickly emerged. For example, in the Boston Public Schools, committees were called upon to produce "as fair an examination as possible; to give the same advantages to all; to prevent leading questions; to carry away … positive information, in black and white; to ascertain with certainty what the scholars did not know, as well as what they did know …" (Caldwell & Courtis, 1924, p. 26). As illustrated here, a common early belief and practice was that students were responsible for their learning and were held back if unable to demonstrate it (Ravitch, 2002). A few voices, schools, and school systems opposed examinations, noting the possible negative effects of testing. Odell (1928) summarizes these concerns:
I. Examinations are injurious to the health of those taking them, causing overstrain, nervousness, worry, and other undesirable physical and mental results.
II. The content covered by examination questions does not agree with the recognized objectives of education, but instead encourages cramming, mere factual memorizing and acquiring items of information rather than careful and continuous study, reasoning, and other higher thought processes.
III. Examinations too often become objectives in themselves, the pupils believing that the chief purpose of study is to pass examinations rather than to master the subject or to gain mental power. This objection is more or less similar to the one stated immediately above, but still is probably different enough to warrant separate statement and consideration. At least it has been so considered by unfavorable critics.
IV. Examinations encourage bluffing and cheating. This occurs both because of the premium, which they place on doing so successfully, and because of the frequently prevailing conditions which make bluffing relatively easy and cheating comparatively safe.
V. Examinations develop habits of careless use of English and poor handwriting. This results because they emphasize writing a large amount as rapidly as possible and thus lead to the neglect of good form.
VI. The time devoted to examinations can be more profitably used otherwise, for more study, recitation, review, and so forth.
VII. The results of instruction in the field of education are intangible and cannot be measured as can production in industry or agriculture, physical growth, heat, light, and many other products of human or other activity.
VIII. Examinations are unnecessary. Capable instructors handling classes, which are not too large, are able to rate the work of their pupils without employing examinations. (p. 10)
The Atlanta cheating scandal (Severson, 2010) is just one example demonstrating that these critics were thinking well ahead of their time (see Nichols & Berliner, 2007, for more examples).
Protest over examinations was overshadowed by Horace Mann's Common School movement of the 1840s, a movement that called for common standards and curriculum (Mulvey, Cooper, & Maloney, 2010).1 This interest has been renewed in the Common Core State Standards, adopted by states beginning in 2010 (www.corestandards.org). Common curriculum and standards meant a common form of assessment of student learning. The purpose of these examinations was twofold: (a) to provide greater supervision of schools by the state; and (b) to equalize testing conditions, reduce error and bias, and hence obtain more accurate measurement (Cremin, 1975; Garrison, 2009). Mann believed these tests would assess teaching; more specifically, if students "answer from the book accurately and readily, but fail in those cases which involve relations and applications of principles, the dishonor must settle upon the heads of the teachers" (as cited in Caldwell & Courtis, 1924, p. 244). Testing gained momentum, standardized tests were born, and accountability, as we know it today, made its first mark in history. According to Odell (1928), "A standardized test in the most limited sense is any test which has been given to a large enough number of pupils of a given age, grade, or other homogeneous group so that the results are fairly adequate indications of what achievements are actually being attained by such pupils in general" (p. 8). In Boston, students ages 13 and 14 were given written and verbal examinations in six areas of curriculum. It is unclear if and how teachers used such results. Schools, however, were ranked and results reviewed with harsh criticism. The committees provided recommendations for improving what was perceived as dismal student performance (Caldwell & Courtis, 1924); regardless of Horace Mann's approach, teachers were a limited part of this improvement equation.
Hence, ideas of holding teachers and schools accountable were beginning to emerge, although consequences were merely suggestions and clearly less severe than they are today.
The early and mid-1800s mark the beginning of adopting tests to assess large numbers of students. End-of-grade and exit examinations soon followed suit. In 1864, the Board of Regents of the State of New York passed an ordinance requiring students to take an evaluation at the end of each academic term; the results determined whether students passed to the next grade. In 1878, high school exit exams were administered in New York State (Folts, 1996). Ironically, these policies are a return to an earlier belief that students should be held accountable for their own learning.
The 1900s were characterized by the swift and widespread adoption of standardized testing. Intelligence testing made its mark in the early 1900s by way of the Stanford-Binet Intelligence Scale (Terman, 1916). At this point in time, standardized intelligence tests were used primarily to diagnose and place children appropriately, but they were not yet widely used.
The use of standardized tests at a mass level began during World War I. The U.S. Army had adopted intelligence tests (e.g., Army Alpha and Beta) to evaluate recruits, assign duties, and select suitable officers (Cronbach, 1975; McGuire, 1994). Army Alpha was modified and released in 1926 under the name Scholastic Aptitude Test (SAT, later known as the Scholastic Assessment Test; Fuess, 1950). The development of a single test for college admissions addressed the struggles schools were having in preparing students for a number of college-specific admission tests (Ravitch, 2002). The development and popularization of the ACT and SAT were important in the growing use and application of student testing in higher education and K-12 settings. That the SAT could be adopted on a massive scale was affirmed by its original iteration, the Army Alpha Examination (Cronbach, 1975). William Learned and E. L. Thorndike tested a number of college students during the sa...