CHAPTER
1
Washback or Backwash: A Review of the Impact of Testing on Teaching and Learning
Liying Cheng
Andy Curtis
Queen’s University
Washback or backwash, a term now commonly used in applied linguistics, refers to the influence of testing on teaching and learning (Alderson & Wall, 1993), and has become an increasingly prevalent and prominent phenomenon in education: “what is assessed becomes what is valued, which becomes what is taught” (McEwen, 1995a, p. 42). There seem to be at least two major types or areas of washback or backwash studies: those relating to traditional, multiple-choice, large-scale tests, which are perceived to have had mainly negative influences on the quality of teaching and learning (Madaus & Kellaghan, 1992; Nolen, Haladyna, & Haas, 1992; Shepard, 1990), and those studies where a specific test or examination has been modified and improved upon (e.g., performance-based assessment), in order to exert a positive influence on teaching and learning (Linn & Herman, 1997; Sanders & Horn, 1995). The second type of studies has shown, however, positive, negative, or no influence on teaching and learning. Furthermore, many of those studies have turned to focus on understanding the mechanism of how washback or backwash is used to change teaching and learning (Cheng, 1998a; Wall, 1999).
WASHBACK: THE DEFINITION AND ORIGIN
Although washback is a term commonly used in applied linguistics today, it is rarely found in dictionaries. However, the word backwash can be found in certain dictionaries and is defined as “the unwelcome repercussions of some social action” by the New Webster’s Comprehensive Dictionary, and “unpleasant after-effects of an event or situation” by the Collins Cobuild Dictionary. The negative connotations of these two definitions are interesting, as they inadvertently touch on some of the negative responses and reactions to the relationships between teaching and testing, which we explore in more detail shortly.
Washback (Alderson & Wall, 1993) or backwash (Biggs, 1995, 1996) here refers to the influence of testing on teaching and learning. The concept is rooted in the notion that tests or examinations can and should drive teaching, and hence learning, and is also referred to as measurement-driven instruction (Popham, 1987). In order to achieve this goal, a “match” or an overlap between the content and format of the test or the examination and the content and format of the curriculum (or “curriculum surrogate” such as the textbook) is encouraged. This is referred to as curriculum alignment by Shepard (1990, 1991b, 1992, 1993). Although the idea of alignment (matching the test and the curriculum) has been described by some as “unethical,” and as threatening the validity of the test (Haladyna, Nolen, & Haas, 1991, p. 4; Widen, O’Shea, & Pye, 1997), such alignment is evident in a number of countries, for example, Hong Kong (see Cheng, 1998a; Stecher, Barron, Chun, Krop, & Ross, 2000). This alignment, in which a new or revised examination is introduced into the education system with the aim of improving teaching and learning, is referred to as systemic validity by Frederiksen and Collins (1989), consequential validity by Messick (1989, 1992, 1994, 1996), and test impact by Bachman and Palmer (1996) and Baker (1991).
Wall (1997) distinguished between test impact and test washback in terms of the scope of the effects. According to Wall, impact refers to “. . . any of the effects that a test may have on individuals, policies or practices, within the classroom, the school, the educational system or society as a whole” (see Stecher, Chun, & Barron, chap. 4, this volume), whereas washback (or backwash) is defined as “the effects of tests on teaching and learning” (Wall, 1997, p. 291).
Although different terms are preferred by different researchers, they all refer to different facets of the same phenomenon: the influence of testing on teaching and learning. The authors of this chapter have chosen to use the term washback, as it is the most commonly used in the field of applied linguistics.
The study of washback has resulted in recent developments in language testing, and measurement-driven reform of instruction in general education. Research in language testing has centered on whether and how we assess the specific characteristics of a given group of test takers and whether and how we can incorporate such information into the ways in which we design language tests. One of the most important theoretical developments in language testing in the past 30 years has been the realization that a language test score represents a complex of multiple influences. Language test scores cannot be interpreted simplistically as an indicator of the particular language ability we think we are measuring. The scores are also affected by the characteristics and contents of the test tasks, the characteristics of the test takers, the strategies test takers employ in attempting to complete the test tasks, as well as the inferences we draw from the test results. These factors undoubtedly interact with each other.
Nearly 20 years ago, Alderson (1986) identified washback as a distinct (and at that time emerging) area within language testing, to which we needed to turn our attention. Alderson (1986) discussed the “potentially powerful influence of tests” (p. 104) and argued for innovations in the language curriculum through innovations in language testing (also see Wall, 1996, 1997, 2000). At around the same time, Davies (1985) was asking whether tests should necessarily follow the curriculum, and suggested that perhaps tests ought to lead and influence the curriculum. Morrow (1986) extended the use of washback to include the notion of washback validity, which describes the relationship between testing, and teaching and learning (p. 6). Morrow also claimed that “. . . in essence, an examination of washback validity would take testing researchers into the classroom in order to observe the effects of their tests in action” (p. 6). This has important implications for test validity.
Looking back, we can see that examinations have often been used as a means of control, and have been with us for a long time: a thousand years or more, if we include their use in Imperial China to select the highest officials of the land (Arnove, Altbach, & Kelly, 1992; Hu, 1984; Lai, 1970). Those examinations were probably the first civil service examinations ever developed. To avoid corruption, all essays in the Imperial Examination were marked anonymously, and the Emperor personally supervised the final stage of the examination. Although the goal of the examination was to select civil servants, its washback effect was to establish and control an educational program, as prospective mandarins set out to prepare themselves for the examination that would decide not only their personal fate but also influence the future of the Empire (Spolsky, 1995a, 1995b).
The use of examinations to select for education and employment has also existed for a long time. Examinations were seen by some societies as ways to encourage the development of talent, to upgrade the performance of schools and colleges, and to counter, to some degree, nepotism, favoritism, and even outright corruption in the allocation of scarce opportunities (Bray & Steward, 1998; Eckstein & Noah, 1992). If the initial spread of examinations can be traced back to such motives, the very same reasons appear to be as powerful today as they ever were. Linn (2000) classified the use of tests and assessments as key elements in relation to five waves of educational reform over the past 50 years: their tracking and selecting role in the 1950s; their program accountability role in the 1960s; minimum competency testing in the 1970s; school and district accountability in the 1980s; and the standards-based accountability systems in the 1990s (p. 4). Furthermore, it is clear that tests and assessments are continuing to play a crucial and critical role in education into the new millennium.
In spite of this long and well-established place in educational history, the use of tests has constantly been subject to criticism. Nevertheless, tests continue to occupy a leading place in the educational policies and practices of a great many countries (see Baker, 1991; Calder, 1997; Cannell, 1987; Cheng, 1997, 1998a; Heyneman, 1987; Heyneman & Ransom, 1990; James, 2000; Kellaghan & Greaney, 1992; Li, 1990; Macintosh, 1986; Runte, 1998; Shohamy, 1993a; Shohamy, Donitsa-Schmidt, & Ferman, 1996; Widen et al., 1997; Yang, 1999; and chapters in Part II of this volume). These researchers, and others, have over many years documented the impact of testing on school and classroom practices, and on the personal and professional lives and experiences of principals, teachers, students, and other educational stakeholders.
Aware of the power of tests, policymakers in many parts of the world continue to use them to manipulate their local educational systems, to control curricula and to impose (or promote) new textbooks and new teaching methods. Testing and assessment are “the darling of the policy-makers” (Madaus, 1985a, 1985b) despite the fact that they have been the focus of controversy for as long as they have existed. One reason for their longevity in the face of such criticism is that tests are viewed as the primary tools through which changes in the educational system can be introduced without having to change other educational components such as teacher training or curricula. Shohamy (1992) originally noted that “this phenomenon [washback] is the result of the strong authority of external testing and the major impact it has on the lives of test takers” (p. 513). Later Shohamy et al. (1996; see also Stiggins & Faires-Conklin, 1992) expanded on this position thus:
the power and authority of tests enable policy-makers to use them as effective tools for controlling educational systems and prescribing the behavior of those who are affected by their results: administrators, teachers and students. School-wide exams are used by principals and administrators to enforce learning, while in classrooms, tests and quizzes are used by teachers to impose discipline and to motivate learning. (p. 299)
One example of these beliefs about the legislative power and authority of tests was seen in 1994 in Canada, where a consortium of provincial ministers of education instituted a system of national achievement testing in the areas of reading, language arts, and science (Council of Ministers of Education, Canada, 1994). Most of the provinces now require students to pass centrally set school-leaving examinations as a condition of school graduation (Anderson, Muir, Bateson, Blackmore, & Rogers, 1990; Lock, 2001; Runte, 1998; Widen, O’Shea, & Pye, 1997).
Petrie (1987) concluded that “it would not be too much of an exaggeration to say that evaluation and testing have become the engine for implementing educational policy” (p. 175). The extent to which this is true depends on the different contexts, as shown by those explored in this volume, but a number of recurring themes do emerge. Examinations of various kinds have been used for a very long time for many different purposes in many different places. There is a set of relationships, planned and unplanned, positive and negative, between teaching and testing. These two facts mean that, although washback has only been identified relatively recently, it is likely that washback effects have been occurring for an equally long time. It is also likely that these teaching–testing relationships will become closer and more complex in the future. It is therefore essential that the education community work together to understand and evaluate the effects of the use of testing on all of the interconnected aspects of teaching and learning within different education systems.
WASHBACK: POSITIVE, NEGATIVE, NEITHER OR BOTH?
Movement in a particular direction is an inherent part of the use of the washback metaphor to describe teaching–testing relationships. For example, Pearson (1988) stated that “public examinations influence the attitudes, behaviors, and motivation of teachers, learners and parents, and, because examinations often come at the end of a course, this influence is seen working in a backward direction, hence the term ‘washback’ ” (p. 98). However, like Davies (1985), Pearson believed that the direction in which washback actually works must be forward (i.e., testing leading teaching and learning).
The potentially bidirectional nature of washback has been recognized by, for example, Messick (1996), who defined washback as the “extent to which a test influences language teachers and learners to do things they would not necessarily otherwise do that promote or inhibit [emphasis added] language learning” (p. 241, as cited in Alderson & Wall, 1993, p. 117). Wall and Alderson also noted that “tests can be powerful determiners, both positively and negatively, [emphasis added] of what happens in classrooms” (Alderson & Wall, 1993, p. 117; Wall & Alderson, 1993, p. 41).
Messick (1996) went on to comment that some proponents have even maintained that a test’s validity should be appraised by the degree to which it manifests positive or negative washback, which is similar to Frederiksen and Collins’ (1989) notion of systemic validity.
Underpinning the notion of direction is the issue of what it is that is being directed. Biggs (1995) used the term backwash (p. 12) to refer to the fact that testing drives not only the curriculum, but also the teaching methods and students’ approaches to learning (Crooks, 1988; Frederiksen, 1984; Frederiksen & Collins, 1989). However, Spolsky (1994) believed that “backwash is better applied only to accidental side-effects of examinations, and not to those effects intended when the first purpose of the examination is control of the curriculum” (p. 55). In an empirical study of an intended public examination change on classroom teaching in Hong Kong, Cheng (1997, 1998a) combined movement and motive, defining washback as “an intended direction and function of curriculum change, by means of a change of public examinations, on aspects of teaching and learning” (Cheng, 1997, p. 36). As Cheng’s study showed, when a public examination is used as a vehicle for an intended curriculum change, unintended and accidental side effects can also occur, that is, both negative and positive influence, as such change involves elaborate and extensive webs of interwoven causes and effects.
Whether the effect of testing is deemed to be positive or negative should also depend on who it is that actually conducts the investigation within a particular education context, as well as where (the school or university contexts), when (the time and duration of using such assessment practices), why (the rationale), and how (the different approaches used by different participants within the context).
If the potentially bidirectional nature of washback is accepted, and movement in a positive direction is accepted as the aim, the question then becomes methodological, that is, how to bring about this positive movement. After considering several definitions of washback, Bailey (1996) concluded that more empirical research needed to be carried out in order to document its exact nature and mechanisms, while also identifying “concerns about what constitutes both positive and negative washback, as well as about how to promote the former and inhibit the latter” (p. 259).
According to Messick (1996), “for optimal positive washback there should be little, if any, difference between activities involved in learning the language and activities involved in preparing for the test” (pp. 241–242). However, the lack of simple, one-to-one relationships in such complex systems was highlighted by Messick (1996): “A poor test may be associated with positive effects and a good test with negative effects because of other things that are done or not done in the education system” (p. 242). In terms of complexity and validity, Alderson and Wall (1993) argued that washback is “likely to be a complex phenomenon which cannot be related directly to a test’s validity” (p. 116). The washback effect should, therefore, refer to the effects of the test itself on aspects of teaching and learning.
The fact that there are so many other forces operating within any education context, which also contribute to or ensure the washback effect on teaching and learning, has been demonstrated in several washback studies (e.g., Anderson et al., 1990; Cheng, 1998b, 1999; Herman, 1992; Madaus, 1988; Smith, 1991a, 1991b; Wall, 2000; Watanabe, 1996a; Widen et al., 1997). The key issue here is how those forces within a particular educational context can be teased out to understand the effects of testing in that environment, and how confident we can be in formulating hypotheses and drawing conclusions about the nature and the scope of the effects within broader educational contexts.
Negative Washback
Tests in general, and perhaps language tests in particular, are often criticized for their negative influence on teaching, so-called “negative washback,” which has long been identified as a potential problem. For example, nearly 50 years ago, Vernon (1956) claimed that teachers tended to ignore subjects and activities that did not contribute directly to passing the exam, and that examinations “distort the curriculum” (p. 166). Wiseman (1961) believed that paid coaching classes, which were intended for preparing students for exams, were not a good use of the time, because students were practicing exam techniques rather than language learning activities (p. 159), and Davies (1968) believed that testing devices had become teaching devices; that teaching and learning was effectively being directed to past examination papers, making the educational experience narrow and uninteresting (p. 125).
More recently, Alderson and Wall (1993) referred to negative washback as the undesirable effect on teaching and learning of a particular test deemed to be “poor” (p. 5). Alderson and Wall’s “poor” here means “something that the teacher or learner does not wish to teach or learn.” The tests may well fail to reflect the learning principles or the course objectives to which they are supposedly related. In reality, teachers and learners may end up teaching and learning toward the test, regardless of whether or not they support the test or fully understand its rationale or aims.
In general education, Fish (1988) found that teachers reacted negatively to pressure created by public displays of classroom scores, and also found that relatively inexperienced teachers felt greater anxiety and accountability pressure than experienced teachers, showing the influence of factors such as age and experience. Noble and Smith (1994a) also found that high-stakes testing could affect teachers directly and negatively (p. 3), and that “teaching test-taking skills and drilling on multiple-choice worksheets is likely to boost the scores but unlikely to promote general understanding” (1994b, p. 6). From an extensive qualitative study of the role of external testing in elementary schools in the United States, Smith (1991b) listed a number of damaging effects, as the “testing progr...