1
Introduction
The field of electronic text analysis has been expanding rapidly over the past few decades. This is partly due to advances in information technology and software development, but also to the growing interest in using electronic resources to complement more traditional approaches to the analysis of language and literature. The improved accessibility of computers has added to the increasing popularity of electronic text analysis, especially in higher education. The development of principled collections of electronic texts, also called corpora, has allowed a systematic exploration of recurring patterns in language in use, and this has become one of the main areas of enquiry in the emerging field referred to as corpus linguistics.
With courses and modules in corpus linguistics and computer-aided language analysis currently offered in many university departments across the country, there is also a growing emphasis on integrating electronic tools and resources into the analysis of literary works. At the same time, electronic text analysis is increasingly being used as a tool in a range of applied contexts, for example in language teaching or the study of language and ideology. These areas of investigation make use of a range of methodologies that were originally developed in corpus linguistics with the aim of enhancing language description.
This book combines the description of a range of approaches and methodologies in this field with a discussion of a number of areas of language study in which electronic text analysis is being used, often by way of complementing more traditional, analytical approaches. The main aim throughout the book is to introduce key ideas and methodologies and to illustrate these, where appropriate, through attested examples of language data. The book is primarily intended for the non-expert user who wishes to draw on some of the methodologies developed in the field of corpus linguistics for the purpose of analysing electronic texts.
Electronic text analysis: corpus linguistics by another name?
There are a number of terms that describe traditions and methodologies of computer-aided language research. They include, amongst others, corpus linguistics, Natural Language Processing (NLP) and Humanities Computing. The differences between these approaches lie in their overall research goals, the types of texts that they draw on, and the way in which the texts are analysed. While the methodologies described in this book are derived mainly from the corpus linguistic tradition, they are also applied to problems and texts that are not normally at the heart of this tradition. The term electronic text analysis has been adopted to reflect the different priorities, in terms of data sources and research processes, when we compare corpus linguistics as a tradition with other areas of computer-aided language research. As such, the term electronic text analysis has been chosen for its inclusive and broad meaning, relating to the analysis of any digitised text or text collection.
Research goals
To illustrate just some of the different orientations found in the diverse range of areas that use electronic text analysis, we will consider the examples of Natural Language Processing (NLP) and Humanities Computing in more detail. NLP is often geared towards developing models for particular applications, such as machine translation software. Sinclair (2004b) makes a useful distinction between description and application in this context. Language description here refers to the process of exploring corpus data with the aim of developing a better understanding of language in use, while an application refers to the deployment of language analysis tools with the aim of producing an output that has relevance outside of linguistics. Sinclair (2004b: 55) notes that the end users of language description are predominantly other linguists who are interested in empirical explorations of the way in which language is used. The end users of linguistic applications, on the other hand, are not necessarily linguists. They may simply be users of the developed application, such as a spell checker or a machine translation system that has been built on the basis of a textual resource. The research goal in this case is the successful development of an application rather than the comprehensive description of language in use. This distinction marks one of the differences in orientation between corpus linguistics and NLP.
Humanities Computing tends to be concerned with enhancing and documenting textual interpretations, often within a hermeneutic tradition. A number of specialist journals have emerged in this area, including Computers and the Humanities, and a substantial amount of research is devoted to making processes of textual interpretation more explicit to the research community by way of various types of documentation. Burnard (1999) highlights the need for this process:
All of the fields above analyse electronic, i.e. digitised, text(s) and use, where appropriate, software tools to do so.
Textual resources
One of the main differences between the various traditions in electronic text analysis lies in the nature of the textual resources and in the way in which they have been assembled to become an object of study. A corpus tends to be defined as a collection of texts which has been put together for linguistic research with the aim of making statements about a particular language variety. Biber et al. (1998: 4) point out in this context that a corpus-based approach 'utilises a large and principled collection of natural texts, known as a "corpus", as the basis for analysis'.
A single text might not be able to provide a balanced sample of any one language variety. The same applies to other texts that may exist in electronic format but have not been assembled to represent a principled sample of a language variety, such as an e-mail message, for example, or the world wide web. These can, of course, be assembled in a principled way and turned into a corpus for linguistic study. We will return to a discussion of the world wide web as a corpus in chapter two.
As far as the nature of the textual resource is concerned, there are core differences between naturally occurring discourse and discourse that has been produced under experimental conditions, and between large-scale and small-scale texts and text collections. Since people who work in corpus linguistics are often interested in the exploration of social phenomena, such as the relationship between patterns of usage and social context, naturally occurring discourse is required as the basis of any study. In order to extract patterns from this type of discourse, the textual resources need to be substantial in size. This point takes us to the next issue.
Types of analysis
The way in which the corpus linguist approaches a text is through secondary analysis of concordance lines and frequency information (see Sinclair 2004a: 189). The close reading and interpretation of a single text is not the primary concern of the corpus linguist; instead the core research activity is the extraction of language patterns through the analysis of suitably sorted instances of particular lexical items and phrases (see Sinclair 2004a). This is not necessarily the approach taken by the Natural Language Processing (NLP) researcher, nor the humanities researcher, who will, respectively, analyse texts in a way that facilitates the development of specific software applications or process textual information as part of an often multi-faceted framework for textual interpretation. As such, the humanities researcher might be very familiar with a particular novel that they study but still make use of frequency counts to gather further quantitative information about the text.
The term electronic text analysis has been chosen as a broad title because the types of analyses discussed in this book draw on elements of several different approaches, albeit with a strong bias towards corpus linguistic techniques. These include the analysis of single texts to facilitate literary interpretation (chapter five), the investigation of lexical items within a corpus to better understand how ideology is encoded in language (chapter six), the exploration of corpus data for English language teaching applications (chapter seven) and the close reading of extended stretches of naturally occurring discourse (chapter eight). However, the main focus of the book is on the way in which different methods in electronic text analysis can facilitate the study of language in a range of different contexts.
A brief background to techniques in electronic text analysis
Electronic text analysis can be used to organise textual data in a variety of ways, such as through the generation of frequency information or through the representation of individual words or phrases in a concordance format. Both of these techniques will be discussed in more detail in chapters three and four respectively; the sections below merely aim to provide a brief background.
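By way of a very rough illustration of the second of these techniques, the short Python sketch below prints each occurrence of a chosen word together with a fixed amount of context on either side, which is essentially what a concordance display does. The sketch is not taken from any of the software tools discussed in this book, and the sample text, the search word and the context width are invented purely for this example.

import re

def concordance(text, keyword, width=30):
    # Collect one line per occurrence of the keyword, with `width` characters
    # of context to the left and to the right, padded so the keyword lines up.
    lines = []
    for match in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()].rjust(width)
        right = text[match.end():match.end() + width].ljust(width)
        lines.append("%s  %s  %s" % (left, match.group(0), right))
    return lines

sample = ("A corpus tends to be defined as a collection of texts which has been "
          "put together for linguistic research. A corpus-based approach uses a "
          "large and principled collection of natural texts as the basis for analysis.")

for line in concordance(sample, "corpus"):
    print(line)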
Frequency lists
Many of the techniques used in the electronic analysis of texts originate from manual procedures of text analysis that were in use long before the advent of computer technology. Thorndike (1921), for example, gathered frequency information on individual words in a set of texts by manually counting each word form. His frequency list was based on a corpus of 4.5 million words from over 40 different sources and informed The Teacher's Word Book (Thorndike 1921), later superseded by The Teacher's Word Book of 30,000 Words (Thorndike and Lorge 1944), which was based on a corpus of over 18 million words in total.
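To give a sense of how this kind of counting is carried out automatically today, the following short Python sketch counts the word forms in a piece of running text and lists them from most to least frequent. The sample sentence is invented for the purpose of this illustration and has no connection with Thorndike's material.

import re
from collections import Counter

def frequency_list(text):
    # Split the text into lower-cased word forms and count each distinct form.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common()

sample = ("The field of electronic text analysis has been expanding rapidly, and "
          "electronic text analysis is increasingly used in a range of applied contexts.")

for word, count in frequency_list(sample):
    print("%5d  %s" % (count, word))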
This work, and other similar projects that were carried out during the early part of the 20th century, had a pedagogic purpose in that the results were used to inform language instruction. Thorndike's work la...