Chapter One
The computer learner corpus: a versatile new source of data for SLA research
Sylviane Granger
1 Corpus linguistics and English studies
Since making its first appearance in the 1960s, the computer corpus has infiltrated all fields of language-related research, from lexicography to literary criticism through artificial intelligence and language teaching. This widespread use of the computer corpus has led to the development of a new discipline which has come to be called 'corpus linguistics', a term which refers not just to a new computer-based methodology, but as Leech (1992: 106) puts it, to a 'new research enterprise', a new way of thinking about language, which is challenging some of our most deeply-rooted ideas about language. With its focus on performance (rather than competence), description (rather than universals) and quantitative as well as qualitative analysis, it can be seen as contrasting sharply with the Chomskyan approach and indeed is presented as such by Leech (1992: 107). The two approaches are not mutually exclusive however. Comparing the respective merits of corpus linguistics and what he ironically calls 'armchair linguistics', Fillmore (1992: 35) comes to the conclusion that 'the two kinds of linguists need each other. Or better, that the two kinds of linguists, wherever possible, should exist in the same body.'
The computer plays a central role in corpus linguistics. A first major advantage of computerization is that it liberates language analysts 'from drudgery and empowers [them] to focus their creative energies on doing what machines cannot do' (Rundell and Stock 1992: 14). More fundamental, however, is the heuristic power of automated linguistic analysis, i.e. its power to uncover totally new facts about language. It is this aspect, rather than 'the mirroring of intuitive categories of description' (Sinclair 1986: 202), that is the most novel and exciting contribution of corpus linguistics.
English is undoubtedly the language which has been analysed most from a corpus linguistics perspective. Indeed the first computer corpus to be compiled was the Brown corpus, a corpus of American English. Since then English corpora have grown and diversified. At the time, the 1 million words contained in the Brown and LOB corpora were considered to be perfectly ample for research purposes, but they now appear microscopic in comparison to the 100 million words of the British National Corpus or the 200 million words of the Bank of English. This growth in corpus size over the years has been accompanied by a huge diversification of corpus types to cover a wide range of varieties: diachronic, stylistic (spoken vs. written; general vs. technical) and regional (British, American, Australian, Indian, etc.) (for a recent survey of English corpora, see McEnery and Wilson 1996).
Until very recently, however, no attempt had been made to collect corpora of learner English, a strange omission given the number of people who speak English as a foreign language throughout the world. It was not until the early 1990s that academics, EFL specialists and publishing houses alike began to recognize the theoretical and practical potential of computer learner corpora, and several projects were launched, among which the following three figure prominently: the International Corpus of Learner English (ICLE), a corpus of learner English from several mother tongue backgrounds and the result of international academic collaboration; the Longman Learners' Corpus (LLC), which also contains learner English from several mother tongue backgrounds; and the Hong Kong University of Science and Technology (HKUST) Learner Corpus, which is made up of the English of Chinese learners.
2 Learner corpus data and SLA research
2.1 Empirical data in SLA research
The main goal of Second Language Acquisition (SLA) research is to uncover the principles that govern the process of learning a foreign / second language. As this process is mental and therefore not directly observable, it has to be accessed via the product, i.e. learner performance data. Ellis (1994: 670) distinguishes three main data types: (1) language use data, which 'reflect learners' attempts to use the L2 in either comprehension or production'; (2) metalingual judgements, which tap learners' intuitions about the L2, for instance by asking them to judge the grammaticality of sentences; and (3) self-report data, which explore learners' strategies via questionnaires or think-aloud tasks. Language use data is said to be 'natural' if no control is exerted on the learners' performance and 'elicited' if it results from a controlled experiment.
Current SLA research is mainly based on introspective data (i.e. Ellis's types 2 and 3) and language use data of the elicited type. Researchers have tended to avoid natural language use data for a variety of reasons. One has to do with the infrequency of some language features, i.e. the fact that 'certain properties happen to occur very rarely or not at all unless specifically elicited' (Yip 1995: 9). Secondly, as variables affecting language use are not controlled, the effect of these variables cannot be investigated systematically. Finally, natural language use data fails to reveal the entire linguistic repertoire of learners because 'they [learners] will use only those aspects in which they have the most confidence. They will avoid the troublesome aspects through circumlocution or some other device' (Larsen-Freeman and Long 1991: 26).
Introspective and elicited data also have their limitations, however, and their validity, particularly that of elicited data, has been put into question. The artificiality of an experimental language situation may lead learners to produce language which differs widely from the type of language they would use naturally. Also, because of the constraints of experimental elicitation, SLA specialists regularly rely on a very narrow empirical base, often no more than a handful of informants, something which severely restricts the generalizability of the results. There is clearly a need for more, and better quality, data and this is particularly acute in the case of natural language data. In this context, learner corpora which, as will be shown in the following section, answer most of the criticisms levelled at natural language use data, are a valuable addition to current SLA data sources. Undeniably however, all types of SLA data have their strengths and weaknesses and one can but agree with Ellis (1994: 676) that 'Good research is research that makes use of multiple sources of data.'
2.2 Contribution of learner corpora to SLA research
The ancestor of the learner corpus can be traced back to the Error Analysis (EA) era. However, learner corpora in those days bore little resemblance to current ones. First, they were usually very small, sometimes no more than 2,000 words from a dozen or so learners. Some corpora, such as the one used in the Danish PIF (Project in Foreign Language Pedagogy) project (see Faerch et al. 1984) were much bigger, though how much bigger is difficult to know as the exact size of the early learner corpora was generally not mentioned. This was quite simply because the compilers usually had no idea themselves. As the corpora were not computerized, counting the number of words had to be done manually, an impossible task if the corpus was relatively big. At best, it would sometimes have been possible to make a rough estimate of the size on the basis of the number of informants used and the average length of their assignments.
A further limitation is the heterogeneity of the learner data. In this connection, Ellis (1994: 49) comments that, in collecting samples of learner language, EA researchers have not paid enough attention to the variety of factors that can influence learner output, with the result that 'EA studies are difficult to interpret and almost impossible to replicate'. Results of EA studies and in fact a number of SLA studies have been inconclusive, and on occasion contradictory, because these factors have not been attended to. In his book on transfer, Odlin (1989: 151) notes 'considerable variation in the number of subjects, in the backgrounds of the subjects, and in the empirical data, which come from tape-recorded samples of speech, from student writing, from various types of tests, and from other sources' and concludes that 'improvements in data gathering would be highly desirable'.
Yet another weakness of many early learner corpora is that they were not really exploited as corpora in their own right, but merely served as depositories of errors, only to be discarded after the relevant errors had been extracted from them. EA researchers focused on decontextualized errors and disregarded the rest of the learner's performance. As a result, they 'were denied access to the whole picture' (Larsen-Freeman and Long 1991: 61) and failed to capture phenomena such as avoidance, which does not lead to errors, but to under-representation of words or structures in L2 use (Van Els et al. 1984: 63).
Current learner corpora stand in sharp contrast to these early proto-corpora. For one thing, they are much bigger and therefore lend themselves to the analysis of most language features, including infrequent ones, thereby answering one of the criticisms levelled at natural language use data (see section 2.1). Secondly, there is a tendency for compilers of the current computer learner corpora (CLCs), learning from the mistakes of the past, to adopt much stricter design criteria, thus allowing for investigations of the different variables affecting learner output. Last but not least, they are computerized. As a consequence, large amounts of data can be submitted to a whole range of linguistic software tools, thus providing a quantitative approach to learner language, a hitherto largely unexplored area. Comparing the frequency of words/structures in learner and native corpora makes it possible to study phenomena such as avoidance which were never addressed in the era of EA. Unlike previous error corpora, CLCs give us access not only to errors but to learners' total interlanguage.
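The frequency comparison described above can be sketched in a few lines of code. The snippet below is only an illustrative toy, not an analysis of any actual learner corpus: the two token lists are invented stand-ins for a learner sample and a native sample, and the function name `relative_freq` is our own. The underlying idea, however, is the standard one: normalize raw counts to a common base (here, occurrences per 1,000 tokens) so that corpora of different sizes can be compared, with a learner-to-native ratio well above 1 suggesting overuse and well below 1 suggesting underuse or avoidance.

```python
from collections import Counter

def relative_freq(tokens, item):
    """Frequency of `item` per 1,000 tokens, so corpora of
    different sizes can be compared on a common base."""
    return Counter(tokens)[item] / len(tokens) * 1000

# Invented toy samples standing in for a learner corpus and a native
# corpus; a real study would draw on ICLE-sized collections.
learner = "i think that this is very very important i think".split()
native = "this is a very important point which one might arguably contest".split()

item = "very"
lf = relative_freq(learner, item)  # learner relative frequency
nf = relative_freq(native, item)   # native relative frequency

# Ratio >> 1 hints at overuse; << 1 at underuse/avoidance.
print(f"{item}: learner {lf:.1f} vs native {nf:.1f} per 1,000 tokens")
```

On these toy samples the learner figure is higher than the native one, mimicking the overuse patterns that corpus comparison is designed to surface; with real data the same normalization step is what makes a 20,000-word learner corpus comparable to a 100-million-word native corpus.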
2.3 Learner corpus data and ELT
The fact that CLCs are a fairly recent development does not mean that there was no previous link between corpus linguistics and the ELT world. Over the last few years, native English corpora have increasingly been used in ELT materials design. It was Collins Cobuild who set this trend and their pioneering dictionary project gave rise to a whole range of EFL tools based on authentic data. Underlying the approach was the firm belief that better descriptions of authentic native English would lead to better EFL tools and indeed, studies which have compared materials based on authentic data with traditional intuition-based materials have found this to be true. In the field of vocabulary, for example, Ljung (1991) has found that traditional textbooks tend to over-represent concrete words to the detriment of abstract and societal terms and therefore fail to prepare students for a variety of tasks, such as reading quality newspapers and report-writing. The conclusion is clear: textbooks are more useful when they are based on authentic native English.
However much of an advance they were, native corpora cannot ensure fully effective EFL learning and teaching, mainly because they contain no indication of the degree of difficulty of words and structures for learners. It is paradoxical that although it is claimed that ELT materials should be based on solid, corpus-based descriptions of native English, materials designers are content with a very fuzzy, intuitive, non-corpus based view of the needs of an archetypal learner. There is no doubt that the efficiency of EFL tools could be improved if materials designers had access not only to authentic native data but also to authentic learner data, with the NS (native speaker) data giving information about what is typical in English, and the NNS (non-native speaker) data highlighting what is difficult for learners in general and for specific groups of learners. As a result, a new generation of CLC-informed EFL tools is beginning to emerge. Milton's (Chapter 14, this volume) Electronic Language Learning and Production Environment is an electronic pedagogical tool which specifically addresses errors and patterns of over- and underuse typical of Cantonese learners of English, as attested by the HKUST Learner Corpus. In the lexicographical field, the Longman Essential Activator is the first...