1
Corpus Approaches to Sociolinguistics
Introduction and Chapter Overviews
Eric Friginal and Mackenzie Bristow
Introduction
Sociolinguistics is the study of variation in language form and use that is associated with social, situational, attitudinal, temporal, and geographic influences (Friginal & Hardy, 2014). Many research studies in sociolinguistics have investigated why and how individuals across varying backgrounds speak and write differently. The characteristic features of spoken and written discourse have been described and interpreted at both the micro (individual) and macro (group) levels, producing a wide range of diverse, overlapping, and often fascinating results. Several studies have also explored the effects of various aspects of society, including societal expectations, cultural norms and traditions, and historical influences, on the way language is used.
Friginal and Hardy’s (2014) Corpus-Based Sociolinguistics: A Guide for Students (Routledge) serves as a companion book for studies collected in this volume. In this collection, “corpus-based sociolinguistics” is used as an umbrella term for empirical research studies of linguistic variation investigated using corpora and corpus tools. “Corpus-based” as a methodological approach is not differentiated from “corpus-driven” or “corpus-assisted,” two related terms that are also used in some chapters of this volume. For Friginal and Hardy, the social, situational, attitudinal and relational, temporal, and geographic factors that underscore everyday linguistic variation are, for the most part, broadly defined, especially in a comprehensive context of applied linguistic research. Sociolinguistics as a field of study may not necessarily focus on a definitive, singular cause of variation in speech and writing. In fact, the overlap between and among variables is given particular importance in many sociolinguistic investigations of language use, in order to further understand the unique interplay of factors that influence explicit and implicit linguistic variation. Personal, private, and cognitive individual factors can be examined, with the availability of data, alongside affiliations, role-relations, power, registers, and group dynamics. In quantitative sociolinguistics, these variables are measurable, and linguistic distributions and frequency data provide patterns and tendencies that are appropriate for qualitative interpretations. Overall, this methodological data-driven model has paved the way for the application of computational data extraction and text corpora in sociolinguistic studies. The merging of these approaches points to the important contribution of corpus linguistics in broader sociolinguistic research.
Corpus linguistics is a research approach that facilitates practical investigations of language variation and use, producing a range of reliable and generalizable linguistic data that can be extensively interpreted (Biber, Conrad, & Reppen, 1998). The corpus approach follows methodological innovations that allow scholars to ask “new” research questions on existing linguistic phenomena across many social situations. The findings these questions generate may produce information and perspectives on language variation that either complement or reject the assumptions made in traditional sociolinguistic investigations. In addition, corpora can provide a stronger argument for the view that language variation is systematic, yet fluid, and can be described using empirical, quantitative methods. This argument is important because sociolinguistic studies, following their deep roots in ethnographies and qualitative analyses, also require extensive technical, multifaceted data that help explain the interface between linguistic parameters existing within social groups.
Sociolinguistic studies explore two primary groups of variables: (1) linguistic variables, which focus on the presence of variation in language use, as evidenced by observable shifts or changes in how linguistic forms are utilized in speaking and writing; and (2) societal variables, which include the social, situational, attitudinal and relational, temporal, and geographic influences, and any combination of these influences, that potentially account for these linguistic shifts or changes. Thus, it is important to know how these two groups of variables are defined or operationalized in many sociolinguistic studies. Friginal and Hardy (2014) briefly define and describe these linguistic and societal variables:
Primary Linguistic Variables Investigated in Sociolinguistics
- Sounds, words, and grammatical features of a language, including a range of differences in the pronunciation of sounds, intonation of utterances, and the use of words and phrases (and also dysfluent markers of speech), and grammatical structures of language
- Discoursal features, including spoken and written characteristics of style, formality/informality of discourse, and textual structures (e.g., use of cohesive devices in writing; interruption, latching, or overlaps in face-to-face conversation)
- Pragmatic features, including spoken and written expressions of politeness in language, stance and hedging, the use of respect markers or cuss words, and features of agreements and disagreements in interactions
- Specific communicative features, including spoken and written manifestations of friendliness, affection, loyalty, or disgust; various speech acts (e.g., requests, commands, and declarations); pauses, backchannels, and greetings and leave-takings; and visual representations of attitude, political positions, and personal/group opinions and biases in print media
- Paralanguage features, including pitch and volume in speech and non-verbal elements of language such as silence, gasp, and laughter in conversations; paralanguage may also include the use of visuals (e.g., pictures, colors, signs, and signage), emoticons, or punctuation marks in writing
Primary Societal Variables Explored in Sociolinguistics
- Social: speaker/writer demographic information such as gender and sexuality, age, occupation, educational background, annual income, group networks (traditionally, social networks, not referring to the internet or social media applications such as Facebook, Twitter, or Instagram), social class, or social status
- Geographic: particular locations, geographic regions, and boundaries
- Situational: various communication contexts and registers; speech events such as conversation, interview, or broadcast
- Attitudinal and relational: speaker/writer perceptions and attitudes (including prejudice), identity and identity construction, power, relationships and roles, and solidarity
- Temporal: time periods (e.g., “real time” and “apparent time” studies), changes in societal and cultural perspectives over time, major historical events including influences from wars, natural calamities, and migration patterns over time
- Other societal variables: more specific personality and cognitive factors, sociological distinctions; “uncommon” or new/emerging societal variables particularly influenced by the internet; human–non-human and machine-mediated communication; technology-based variables (e.g., use of telecommunication devices, gaming devices, and gaming culture)
By examining the role of these societal variables on how language is “formed” and used, researchers can further illustrate, comprehend, and also deeply experience the reality that everyday language is remarkably varied and influenced by numerous factors. In sum, no one speaks the same way all the time, and individuals constantly exploit the nuances of the languages they speak and write for a wide variety of purposes. This recognition of variation in language use implies that everyone must see language as not just some kind of abstract object of study (Meyerhoff, 2011). Language is pragmatic, practical, evolving, and unique to individuals or groups of connected individuals. Concluding remarks and generalizations can be formalized about these variations and their practical implications, and, further, answers can be obtained from questions such as: How do these variations influence policies or attitudes? How could these patterns be taught in the classroom effectively? How do people address linguistic differences to make sure that their reactions are valid or constructive, especially as they try to define what is proper or correct language in contrast to improper or sub-standard language (does “sub-standard” language exist in the first place)? How have linguistic patterns changed over time? These and many other related questions could be answered by utilizing multiple (traditional) research approaches in sociolinguistics, and they are, arguably, best described and interpreted following a corpus linguistic research paradigm.
Sociolinguistics and Corpus Linguistics
The exploration of sociolinguistics using corpora and corpus tools is still a relatively new area of research compared to established ethnographic methods, emerging from variationist studies in the mid-1980s (e.g., especially from seminal works by Edward Finegan, Douglas Biber, and their contemporaries). As emphasized by Biber, Reppen, and Friginal (2010), corpus linguistics is not, in itself, a model of language (unlike sociolinguistics). This implies a potential misnomer in how the term corpus linguistics has been used and applied in many research studies over the years. What is clear is that corpus linguistics is primarily a methodological approach that can be defined or described according to the following considerations from Biber et al. (1998):
- It is empirical, analyzing the actual patterns of use in natural texts.
- It utilizes a large and principled collection of natural texts, known as a corpus (pl. corpora), as the basis for analysis.
- It makes extensive use of computers for analysis, employing both automatic and interactive techniques.
- It relies on the combination of quantitative and qualitative analytical techniques.
The descriptions may suggest that the corpus linguistic approach produces data and findings about variation in language that have much greater generalizability and validity than would otherwise be feasible and/or justifiable from other study designs. Research in corpus-based sociolinguistics, in general, may offer stronger support for the view that language variation is indeed systematic, with consistent patterns, and can be described using empirical, quantitative, and frequency-based methods (Biber, 1988). It is important to remember that, although corpora offer measurable descriptions of texts and social groups, the researcher and subsequent consumers of these studies must still interpret these corpus-based findings as accurately and consistently as possible. Extensive knowledge of the literature, related approaches, and awareness of the clear limitations of computational tools must always be in the foreground. Interpretive techniques honed by ethnographers and discourse analysts over the years are certainly invaluable. For example, as highlighted by Friginal and Hardy (2014), there is little importance in knowing that one gender group uses more passive voice constructions than another without being able to explore the functional reasons behind that difference in a particular context or communicative setting (e.g., in an academic or professional interaction; in telephone calls vs. face-to-face job interviews; or in narrative vs. expository texts). To summarize, corpus approaches can often be used in tandem with qualitative and discourse analytic methods, and corpora and frequency data can be statistically tested to figure out whether a consistent and significant pattern exists.
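The idea of statistically testing frequency data can be illustrated with a minimal Python sketch. The chapter does not name a particular test; the sketch assumes the two-term log-likelihood (G2) statistic that is commonly used to compare one feature across two subcorpora. The counts, corpus sizes, and group labels are hypothetical, invented only for illustration.

```python
import math


def per_thousand(count, corpus_size):
    """Normalize a raw count to a rate per 1,000 words."""
    return 1000.0 * count / corpus_size


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-term log-likelihood (G2) for one feature in two subcorpora.

    Expected counts assume the feature is spread evenly across both
    subcorpora in proportion to their sizes.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Hypothetical data: passive constructions in two speaker groups.
passives_a, words_a = 120, 50_000
passives_b, words_b = 80, 50_000

rate_a = per_thousand(passives_a, words_a)
rate_b = per_thousand(passives_b, words_b)
g2 = log_likelihood(passives_a, words_a, passives_b, words_b)

# 3.84 is the chi-square critical value for p < 0.05 with 1 degree of freedom.
significant = g2 > 3.84
print(f"rate A = {rate_a:.2f}, rate B = {rate_b:.2f}, G2 = {g2:.2f}")
```

A significant G2 value only indicates that the difference in frequency is unlikely to be due to chance; as the paragraph above stresses, the functional reasons behind the difference still have to be interpreted qualitatively.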
Sociolinguistic Research Questions, Corpus Design, and Corpus Representativeness
One of the key elements in Sinclair’s (2005) definition of a corpus is that the collection of texts is used to represent a language or language variety. In other words, corpora are created for the purpose of better understanding a particular type of language. Thus, a sample of texts that together can serve as a characteristic example of the target variety or target domain is needed. This description brings to light the concept of representativeness. Biber (1993) defines representativeness as “the extent to which a sample includes the full range of variability in a population” (p. 243). In a more general sense beyond corpus linguistics, representativeness refers to the idea that one can collect a smaller sample than the population as a whole, but that that smaller sample could show as much variability in the subset as in the overall population. Because a corpus should represent a particular language or variety of that language, corpus designers must be aware of the kinds of questions they would like to answer or think that others who use their corpora might ask. According to Biber, the representativeness of a corpus can be considered both contextually and linguistically. Contextually, a corpus of the target language or variety should include the full range of various registers or text types used. In other words, because the different situations in which a language is used affect the way that language is actually utilized across contexts, those different registers need to be included in order to fully understand the variety as a whole. Linguistically, a corpus can be said to be representative if it includes the full range of different lexical and grammatical features present in that language or variety (Friginal & Hardy, 2014).
Researchers involved in the study of sociolinguistic variation have also developed models of corpus design that emphasize representativeness and generalizability of corpora. Corpora are generally not created without particular research questions in mind. They are planned, collected, organized, and analyzed along lines that sociolinguists have had in mind from the inception of the idea to create them. Although it has also been traditional for some researchers to utilize publicly available corpora, this methodology may limit the extent to which the researcher can be familiar with the data and its existing social contexts. It also narrows the focus of the types of subsequent questions that can be asked. For example, if someone were interested in differences in writing by men and women, the corpus being used would have to record that variable so that the texts could be separated by it. The same would be true for any variable commonly associated with sociolinguistic research (e.g., age, socioeconomic status, geographic location/dialect, register).
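The dependence of research questions on recorded metadata can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any existing corpus: the records, the field names (`gender`, `age`), and the helper `split_by` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical corpus: each text carries only the metadata its compilers recorded.
texts = [
    {"id": "t1", "gender": "F", "age": 34, "text": "I was told the report was late."},
    {"id": "t2", "gender": "M", "age": 51, "text": "We finished the report on time."},
    {"id": "t3", "gender": "F", "age": 27, "text": "The results were checked twice."},
]


def split_by(records, variable):
    """Group texts by a societal variable; fail if the corpus never recorded it."""
    groups = defaultdict(list)
    for record in records:
        if variable not in record:
            raise KeyError(f"text {record['id']} has no '{variable}' metadata")
        groups[record[variable]].append(record)
    return dict(groups)


by_gender = split_by(texts, "gender")
for label, group in sorted(by_gender.items()):
    word_count = sum(len(r["text"].split()) for r in group)
    print(label, len(group), "texts,", word_count, "words")
```

Calling `split_by(texts, "region")` on this sample raises a `KeyError`, which mirrors the limitation described above: a comparison by geographic location is not possible if that variable was never part of the corpus design.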
The merging of corpus and sociolinguistic approaches in the past few years has begun to address important corpus design, collection, and representativeness components. For example, spoken texts from sociolinguistic interviews have also been carefully developed to capture at least some of the lexico/syntactic features of speech for various demographic comparisons. A “sociolinguistics corpus” collected by Tagliamonte (2006, 2008) was obtained from oral narratives to capture vernacular language and annotated for speakers’ demographic characteristics. The Linguistic Innovators Corpus or LIC (Kerswill, Cheshire, Fox, & Torgersen, 2008) (see Chapter 7 of this volume) also utilized sociolinguistic interviews collected from 100 working-class adolescents (who were college students) and 18 elderly speakers in two English boroughs, Hackney and Havering. The LIC has been used to test whether or not London is the center of linguistic innovation in southeastern England (i.e., a dialect study) (Gabrielatos, Torgersen, Hoffman, & Fox, 2010). Publicly available “mega-corpora” such as Google Lab’s Google Books Ngram Viewer and Mark Davies’s COCA and COHA (Corpus of Contemporary American English and Corpus of Historical American English) and many others from his BYU corpus site (www.corpus.byu.edu) (see Chapters 2, 3, and 4 of this volume) provide information suitable for temporal (both synchronic and diachronic) and register studies, contributing to the increasing amount of research that can directly describe sociolinguistic variation and change.
Corpus Approaches to Sociolinguistics
Friginal and Hardy (2014) argue that, although corpus-based sociolinguistics h...