Research in Corpus Linguistics

Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use, resulting in research findings that have much greater generalizability and validity than would otherwise be feasible. Corpus linguistics is not in itself a model of language. Rather, it can be regarded as primarily a methodological approach; it is empirical, analyzing the actual patterns of use in natural texts. It utilizes a large and principled collection of natural texts, known as a corpus, as the basis for analysis. At the same time, corpus linguistics is more than a methodological approach, because these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic research offers strong support for the view that language variation is systematic and can be described using empirical, quantitative methods.

1. Introduction

Corpus linguistics is a research approach that has developed over the past several decades to support empirical investigations of language variation and use, resulting in research findings that have much greater generalizability and validity than would otherwise be feasible. Corpus linguistics is not in itself a model of language. Rather, it can be regarded as primarily a methodological approach:

  • It is empirical, analyzing the actual patterns of use in natural texts.

  • It utilizes a large and principled collection of natural texts, known as a corpus, as the basis for analysis.

  • It makes extensive use of computers for analysis, employing both automatic and interactive techniques.

  • It depends on both quantitative and qualitative analytical techniques. (Biber, Conrad & Reppen, 1998: 4)

At the same time, corpus linguistics is more than a methodological approach, because these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic research offers strong support for the view that language variation is systematic and can be described using empirical, quantitative methods. Variation often involves complex patterns consisting of the interaction among several different linguistic parameters, but, in the end, it is systematic. Beyond this, the major contribution of corpus linguistics is to document the existence of linguistic constructs that are not recognized by current linguistic theories. Research of this type, referred to as a corpus-driven approach, identifies strong tendencies for words and grammatical constructions to pattern together in particular ways, whereas other theoretically possible combinations rarely occur.

A novice student of linguistics could be excused for believing that corpus linguistics evolved only recently, as a reaction against the standard practice of intuition-based linguistics. Introductory linguistics textbooks tend to present linguistic analysis (especially syntactic analysis) as it has been practiced over the past 50 years, employing the analyst's intuitions rather than being based on empirical analysis of natural texts. Against that background, it would be easy for a student to imagine that corpus linguistics developed only in the 1980s and 1990s, responding to the need to base linguistic descriptions on actual language use.

This view is far from accurate. In fact, intuition-based linguistics developed as a reaction to corpus-based linguistics. That is, the standard practice in linguistics up until the 1950s was to base language descriptions on analyses of collections of natural texts: precomputer corpora. Dictionaries have long been based on empirical analysis of word use in natural sentences. For example, Samuel Johnson's Dictionary of the English Language, published in 1755, was based on approximately 150,000 natural sentences recorded on slips of paper, to illustrate the natural usage of words. The Oxford English Dictionary, published in 1928, was based on approximately 5,000,000 citations from natural texts (totaling around 50 million words), compiled by over 2,000 volunteers over a 70-year period. (See the discussion in G. D. Kennedy, 1998: 14–15.) West's (1953) creation of the General Service List from a preelectronic corpus of newspapers was one of the first empirical vocabulary studies not motivated by the goal of creating a dictionary.

Grammars were also sometimes based on empirical analyses of natural text corpora before 1960. For example, Jespersen's grammars of English (1909–1949) used natural sentences from newspapers and novels to illustrate the various structures. An even more noteworthy example of this type is the work of C. C. Fries, who wrote two corpus-based grammars of American English. The first, published in 1940, had a focus on usage and social variation, based on a corpus of letters written to the government. The second is essentially a grammar of conversation: It was published in 1952, based on a 250,000-word corpus of telephone conversations. It includes authentic examples taken from the corpus and discussion of grammatical features that are especially characteristic of conversation (e.g., the words well, oh, now, and why when they initiate a “response utterance unit”; Fries, 1952: 101–102).

In the 1960s and 1970s, most research in linguistics shifted to intuition-based methods, arguing that language was a mental construct and that empirical analyses of corpora were not relevant for describing language competence. However, even during this period, some linguists continued the tradition of empirical linguistic analysis. For example, in the early 1960s, Randolph Quirk began the Survey of English Usage, a precomputer collection of 200 spoken and written texts (each around 5,000 words) that was subsequently used for descriptive grammars of English (e.g., Quirk et al., 1972). Functional linguists like Prince and Thompson also continued this descriptive tradition, arguing that (noncomputerized) collections of natural texts could be studied to identify systematic differences in the functional use of linguistic variants. For example, Prince (1978) compares the discourse functions of WH-clefts and IT-clefts in spoken and written texts. Thompson has been especially interested in the study of grammatical variation in conversation; for example, Thompson and Mulac (1991) analyzed factors influencing the retention versus omission of the complementizer that in conversation, whereas Fox and Thompson (1990) studied variation in the realization of relative clauses in conversation.

What changed in the 1980s was the widespread availability of large electronic corpora and the increasing availability of computational tools that facilitated the linguistic analysis of those corpora. Work on large electronic corpora began in the 1960s, when Kucera and Francis (1967) compiled the Brown Corpus (a one-million-word corpus of published AmE written texts). This was followed by a parallel corpus of BrE written texts: the LOB Corpus, published in the 1970s.

It was not until the 1980s, though, that major studies of language use based on large electronic corpora began to appear. Thus, in 1982, Francis and Kucera provide a frequency analysis of the words and grammatical part-of-speech categories found in the Brown Corpus, followed in 1989 by a similar analysis of the LOB Corpus (Johansson and Hofland, 1989). Book-length descriptive studies of linguistic features began to appear in this period (e.g., Granger, 1983, on passives; de Haan, 1989, on nominal postmodifiers) as did the first multidimensional studies of register variation (e.g., Biber, 1988). During this same period, English language learner dictionaries based on the analysis of large electronic corpora began to appear, such as the Collins CoBuild English Language Dictionary (1987) and the Longman Dictionary of Contemporary English (1987). Since that time, most descriptive studies of linguistic variation and use in English have been based on analysis of an electronic corpus, either a large standard corpus (such as the British National Corpus) or a small corpus designed for a specific study (e.g., a corpus of 20 biology research articles constructed for a genre analysis). Within applied linguistics, the subfields of English for specific purposes and English for academic purposes have been especially influenced by corpus research, so that nearly all articles published in these areas employ some kind of corpus analysis.

Studies in this research tradition have adopted the tools and techniques available from computer-based corpus linguistics, with its emphasis on the representativeness of the text collection, and its computational tools for investigating distributional patterns across registers and across discourse contexts in large text collections. The textbook treatments by Kennedy (1998), Biber, Conrad, and Reppen (1998), and McEnery, Xiao, and Tono (2006) provide good introductions to the methods used for these studies as well as surveys of previous research.

In the ensuing sections, we survey many of the most important linguistic studies over the past 25 years that have employed corpus analysis. These studies have been motivated by two major research goals (see Biber, Conrad, and Reppen, 1998: 5–8):

  1. To describe linguistic features, such as vocabulary, lexical combinations, or grammatical features. These studies focus on variation in the choice among related linguistic features (e.g., the simple past tense versus present perfect aspect) or on the discourse functions of a single linguistic feature.

  2. To describe the overall characteristics of a variety: a register or dialect. These studies provide relatively comprehensive linguistic descriptions of a single variety or of a set of related varieties.

Section 2, which follows, introduces studies of the first type, whereas section 3 surveys studies of the second type. Studies of both types have been undertaken for many of the world's languages. However, to limit the scope of the chapter, we survey only studies of English. Then, in section 4, we survey pedagogical applications of these descriptive corpus-based studies, discussing how classroom teaching and materials development have been influenced by the corpus revolution.

2. Descriptive Linguistic Studies

2.1. Corpus Studies with a Lexical Focus

Many of the earliest uses of corpora were designed to provide word lists ranked by frequency, comparing the most frequent words in different varieties. For example, Francis and Kucera (1982) and Johansson and Hofland (1989) catalog the most frequent words in the Brown and LOB Corpora, comparing word frequencies in the fiction versus nonfiction components of the corpora.
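The idea of a ranked frequency list can be illustrated with a minimal sketch. The two sample strings below are invented stand-ins for fiction and nonfiction subcorpora, and the function name frequency_list and the simple regular-expression tokenizer are assumptions made for illustration only; they are not the procedures used in the studies cited above.

```python
import re
from collections import Counter


def frequency_list(text, top_n=5):
    """Return the top_n (word, count) pairs, ranked by descending frequency."""
    # Crude tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


# Invented miniature samples (hypothetical, not drawn from any real corpus).
fiction = "She said she would go. He said nothing, and she left."
nonfiction = "The results of the study show that the rate of change is small."

fiction_ranked = frequency_list(fiction, top_n=3)
nonfiction_ranked = frequency_list(nonfiction, top_n=3)

print(fiction_ranked)
print(nonfiction_ranked)
```

With real corpora of unequal size, raw counts would normally be converted to rates (for example, per million words) before the two lists are compared.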

One of the major contributions of corpus-based lexical studies has been the insight that collocational associations are a central consideration for describing the meaning of a word. For example, the copular verbs turn, come, and go all have the same dictionary meaning: “to become, or to change to another state.” However, corpus research (Biber et al., 1999: 444–445) shows that these three verbs have very different collocational associations: The most common adjectives following turn are color terms, like black, brown, red, and white. The most common adjectives following come describe processes representing a change to a more dynamic condition, such as alive, awake, clean, loose, and unstuck. And in contrast to both other verbs, the most common adjectives following go are all negative: crazy, mad, and wrong. It is not clear whether differences like these should be regarded as part of the core connotational meaning of a word, but it seems uncontroversial that this kind of information is crucially important for language learners.
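A rough sketch of how such right-hand collocates might be counted is given below. The sample sentences are invented, and the function right_collocates is a hypothetical helper; a real study would lemmatize the verbs (turns, turned) and use part-of-speech tags to restrict the counts to adjectives, neither of which is attempted here.

```python
import re
from collections import Counter, defaultdict


def right_collocates(text, nodes):
    """Count the word immediately following each node word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = defaultdict(Counter)
    # Pair every token with its successor; sentence boundaries are ignored.
    for current, following in zip(tokens, tokens[1:]):
        if current in nodes:
            counts[current][following] += 1
    return counts


# Invented sample text (hypothetical, for illustration only).
sample = (
    "The leaves turn red. The milk will go bad. Things go wrong. "
    "The knot may come loose. The sky will turn black. Plans go wrong."
)

collocates = right_collocates(sample, {"turn", "come", "go"})
for verb in ("turn", "come", "go"):
    print(verb, collocates[verb].most_common())
```

Even in this toy sample the three verbs attract different sets of following words, which is the kind of distributional pattern the corpus findings describe at scale.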

There have been numerous corpus-based studies of collocation. Probably the best known is Sinclair (1991), who provides detailed descriptions of the collocations of decline, yield, and set in. Another excellent book-length introduction to the corpus-based study of collocation is by Partington (1998). For example, in chapter 2 of his book, Partington discusses the word sheer and its supposed synonyms pure, complete, and absolute, showing how these words are not at all interchangeable when considered from the perspective of their frequent collocates. Mahlberg (2005) provides a book-length treatment of general nouns in English (e.g., time, day, man, woman, people, thing, way), describing their meanings and use with respect to their collocational associations.

Most studies of collocation have disregarded register differences. One exception to this practice appears in a work by Biber, Conrad, and Reppen (1998: 43–53), which shows how the near-synonyms big, large, and great co-occur with very different sets of collocates (e.g., big enough versus large number versus great deal), and further shows how the collocational associations are very different in fiction versus academic writing. Other collocational studies taking a register perspective include those by Gledhill (2000) and Marco (2000), which both describe the functions of collocations in academic research writing.

Studies of collocation have in turn led to development of the notion of semantic prosody (Louw, 1993; Partington, 1998): the positive or negative connotations shared by the set of collocates that co-occur with a word. For example, the copular verb go (previously discussed) has a strong negative semantic prosody, whereas the copular verb come has a positive semantic prosody. Partington (1998: 66–67) discusses another example of this type: the verb commit, which has a strong negative semantic prosody, co-occurring with nouns like crime, suicide, and offenses. Similarly, Sinclair (1991: 74–75) notes that the nouns that co-occur as the subject of set in are mostly unpleasant states of affairs, such as rot, decay, malaise, despair, infection, disillusion, and so on. Studies have tended to focus on words with negative prosodies rather than positive prosodies. Other examples include cause (Stubbs, 1995), signs of (Stubbs, 2001: 458), and sit through (Hunston, 2002b: 60–62).

A related productive area of research has been the corpus-based (and corpus-driven) investigation of formulaic language in spoken and written registers. The methods and research goals of this line of research are quite different from the typical study of collocation. That is, studies of collocation have typically been case studies focused on a few particular words. These studies have typically disregarded register differences, and they have not attempted to generalize to the textual use of collocational combinations generally. In contrast, corpus studies of longer formulaic expressions are normally carried out in the context of a particular register or for the purposes of describing patterns of variation among multiple registers; in addition, the goals of these studies are to generalize about the use of formulaic language in the target registers rather than to present case studies restricted to one or two particular formulaic sequences. For example, Simpson (2004) and Simpson and Mendis (2003) describe the functions of idioms in academic spoken registers.

Many other studies have taken a corpus-driven approach to this research domain, identifying the sequences of words that are most common in different spoken and written registers (rather than starting with a set of formulaic expressions identified a priori based on their perceptual salience). These common word sequences, often referred to as lexical bundles, are usually not idiomatic and are not complete structures, but they are important building blocks of discourse. Thus, for example, Altenberg (1998) focuses on the recurrent word sequences in spoken English, whereas Biber et al. (1999, chapter 13) compare the lexical bundles in conversation and academic writing. Applying that framework, several studies have considered the types and functions of lexical bundles in additional registers: university classroom teaching and textbooks (Biber, Conrad, and Cortes, 2004; Nesi and Basturkmen, 2006), university student writing (Cortes, 2004), university institutional and advising registers (Biber and Barbieri, 2007), and political debate (Partington and Morley, 2004). N. Ellis et al. (2008) begin with a corpus analysis to identify a set of word sequences that are either frequent or that have strong collocational associations; they then test the psycholinguistic status of those sequences with respect to their perceptual salience and for their role in language production and comprehension (cf. Schmitt, Grandage, and Adolphs, 2004).
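The corpus-driven identification of recurrent word sequences can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any study cited above: the function lexical_bundles, the sequence length of four words, and the thresholds min_count and min_texts are all hypothetical choices, and the three sample texts are invented. Published studies typically set frequency cutoffs as rates per million words and require a sequence to occur across several texts.

```python
import re
from collections import Counter, defaultdict


def lexical_bundles(texts, n=4, min_count=2, min_texts=2):
    """Return n-word sequences that recur often enough, in enough texts."""
    total = Counter()               # overall frequency of each sequence
    text_range = defaultdict(set)   # which texts each sequence occurs in
    for text_id, text in enumerate(texts):
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            total[gram] += 1
            text_range[gram].add(text_id)
    return {
        " ".join(gram): count
        for gram, count in total.items()
        if count >= min_count and len(text_range[gram]) >= min_texts
    }


# Invented conversational snippets (hypothetical, for illustration only).
texts = [
    "I don't know if you want to go there.",
    "Well, I don't know if that is right.",
    "Do you want to go now?",
]

bundles = lexical_bundles(texts)
print(bundles)
```

Note that the sequences are selected purely on frequency and dispersion, with no prior list of expressions, which is what distinguishes this approach from starting with idioms chosen for their perceptual salience.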

Corpus studies have shown that the types and functions of lexical bundles are very different among spoken and written registers (see, e.g., Biber, Conrad, and Cortes, 2004). First of all, there are generally more lexical bundles used in spoken registers than written registers. In terms of their structural characteristics, the bundles in speech tend to be composed of verb phrase and clause fragments, whereas the bundles in writing tend to be composed of noun phrase and prepositional phrase fragments. Those differences correspond to different discourse functions: The bundles in speech tend to be used for stance and discourse organizing functions, whereas the bundles in writing tend to have referential functions.

Of all subareas of applied linguistics, corpus research has probably had the greatest impact on lexical research and vocabulary studies. As previously noted, West (1953) created the General Service List of important vocabulary items based on analysis of a preelectronic corpus, and that list has been used in countless studies of vocabulary acquisition. One of the central concerns has been efforts to estimate the number of different words that a learner needs to know for different communicative purposes. Waring and Nation (1997) use corpus analysis to estimate the number of words needed to comprehend general written texts, whereas Coxhead (2000) analyzed a corpus of academic texts from several disciplines to develop a word list specifically for written academic language. Adolphs and Schmitt (2003) utilize analyses of spoken corpora to estimate the number of words required to understand conversational interactions.

Corpus research is similarly accepted as the standard practice in lexicography, so that all major ELT dictionaries are currently based on analysis of actual word use in large corpora (e.g., the Collins CoBuild English Language Dictionary [1987], the Longman Dictionary of Contemporary English [1987], and the Cambridge Advanced Learner's Dictionary [2005]). In sum, it would not be an overstatement to say that corpus research has revolutionized the way that lexicography, vocabulary acquisition, and word use in general are approached in linguistics.

2.2. Corpus Studies with a Grammatical Focus

Within descriptive linguistics, there have been numerous book-length studies over the past 20 years reporting corpus-based investigations of grammar and discourse: for example, Tottie (1991) on negation, Collins (1991) on clefts, Mair (1990) on infinitival complement clauses, Meyer (1992) on apposition, several books on nominal structures (e.g., de Haan, 1989; Geisler, 1995; Johansson, 1995), Mindt (1995) on modal verbs, Hunston and Francis (2000) on pattern grammar, Lindquist and Mair (2004) on grammaticalization, and Mair (2006) on recent grammatical change within American English and British English (in other words, during the twentieth century).

Most corpus-based grammatical studies take a register perspective. Many of these focus on the linguistic variants associated with a feature, using register differences as one factor to account for the patterns of linguistic variation. However, there are an even larger number of studies that have focused on the use of a particular linguistic feature in a single register; in this case, the goals of the study are to describe both the discourse functions of the linguistic feature as well as the target register itself. Studies of both types can be further subdivided according to the linguistic level of the target feature (e.g., grammatical class, dependent clause type). In addition, both types of studies include descriptions of synchronic patterns of use as well as descriptions of historical patterns of variation.

Corpus-based studies of linguistic features using register as a predictor have investigated linguistic variation from all grammatical levels, from simple part of speech categories to variation in the realization of syntactic phrase and clause types. These studies have shown that descriptions of grammatical variation and use are not valid for the language as a whole. Rather, characteristics of the textual environment interact with register differences so that strong patterns of use in one register often represent only weak patterns in other registers. The Longman Grammar of Spoken and Written English (Biber et al., 1999) and Cambridge Grammar of English (Carter and McCarthy, 2006) are comprehensive reference works with this goal, applying corpus-based analyses to show how any grammatical feature can be described for structural characteristics as well as patterns of use across spoken and written registers.

As previously noted, many corpus-based studies use register differences as a predictor of linguistic variation, whereas others study linguistic features in the context of a single register. Thus, for example, Tottie (1991) contrasts the choices between synthetic and analytic negation, as in

He could find no words to express his pain.


He couldn't find any words to express his pain.

Among other factors, Tottie shows that synthetic negation is strongly preferred in written rather than in spoken registers, whereas analytic negation is more commonly used in spoken registers. In contrast, Hyland (1998a) focuses on the single register of scientific research articles, describing variation in the use of hedges within that register.

As noted earlier, these studies have documented the use of lexico-grammatical features at all linguistic levels. Several studies analyze a single part-of-speech category, documenting the patterns of variation and use in particular registers. Studies taking the perspective of register variation include Barbieri (2005) on quotative verbs and Römer (2005a) on progressive verbs.

Several other studies describe linguistic variation within the context of a single spoken register, such as conversation. Quaglio and Biber (2006) survey the distinctive grammatical characteristics of conversation identified through corpus research, whereas other studies provide detailed descriptions of a particular feature in conversation. For example, McCarthy (2002) describes nonminimal response tokens; Aijmer (2002) provides a book-length description of discourse particles; Carter and McCarthy (2006) describe the discourse functions of the get passive; Tao and McCarthy (2001) focus on nonrestrictive which clauses; and Norrick (2008) describes the discourse functions of interjections in conversational narratives. Other studies of a single spoken register have focused on academic speech in university settings, based on analysis of the Michigan Corpus of Academic Spoken English (MICASE). For example, Fortanet (2004) focuses on the pronoun we in university lectures; Lindemann and Mauranen (2001) describe the use of just in academic speech; and Swales (2001) provides a detailed description of the discourse functions served by point and thing in university academic speech.

A much larger number of studies have described linguistic variation within the context of a particular written register, most often a type of academic writing. Many of these have focused on the kinds of verbs used in research writing (e.g., Thomas and Hawes, 1994), or the referring expressions in research articles (e.g., Hyland, 2001, on the use of self-mentions and Kuo, 1999, on the role relationships expressed by personal pronouns). Other studies deal with simple grammatical structures, but again most often within the context of academic writing. For example, Hyland (2002a) and Swales et al. (1998) describe variation in the use of imperatives and the expression of directives, whereas Hyland (2002b) and Marley (2002) focus on the use of questions in written registers.

The study of linguistic variation related to the expression of stance and modality has been especially popular in corpus-based research. Several of these studies compare the ways in which stance is expressed in spoken versus written registers. Biber and Finegan (1988) and Conrad and Biber (2001) focus on adverbial markers of stance in speech and writing, whereas Biber and Finegan (1989a, 1989b) and Biber et al. (1999, chapter 12) survey variation in the use of numerous grammatical stance devices (including modal verbs, stance adverbials, and stance complement clause constructions), again contrasting the patterns of use in spoken versus written registers. Biber (2006a, 2006b) and Keck and Biber (2004) take a similar approach but applied to university spoken and written registers.

Many other studies focus exclusively on the expression of stance and modality in written registers (usually academic writing). These include Vohla's (1999) study of modality in medical research writing, the studies of stance by Charles (2003, 2006, 2007) on academic writing from different disciplines, and several studies that focus on hedging in academic writing (e.g., Grabe and Kaplan, 1997; Hyland, 1996, 1998a; Salager, 1994). Related studies have been carried out under the rubric of evaluation, again usually focusing on academic writing (e.g., Hunston and Thompson, 2000; Hyland and Tse, 2005; Römer, 2005b; Stotesbury, 2003; Tucker, 2003; cf. Bednarek's (2006) study of evaluation in newspaper language). Fewer studies have described the linguistic devices used to express stance and evaluation in spoken registers; some of these have focused on conversation (e.g., McCarthy and Carter, 1997, 2004; Tao, 2007), whereas others have focused on academic spoken registers (e.g., Mauranen, 2003, 2004; Mauranen and Bondi, 2003; Swales and Burke, 2003).

Dependent clauses and more complex syntactic structures have also been the focus of numerous corpus-based studies that consider register differences. Several studies contrast the patterns of use in spoken and written registers: Collins (1991) on cleft constructions, de Haan (1989) on nominal postmodifiers, Geisler (1995) on relative infinitives, Johansson (1995) on relative pronoun choice, and Biber et al. (1999) on complement clause constructions. Other studies have focused on the use of a syntactic construction in a particular register, like the study of conditionals in medical discourse (G. Ferguson, 2001) or the study of extraposed constructions in university student writing (Hewings and Hewings, 2002).

All of the kinds of studies surveyed in the preceding paragraphs can be approached from a historical (or diachronic) perspective rather than a synchronic perspective, and numerous studies have taken that approach. For example, many of the papers in the edited volumes by Nevalainen and Kahlas-Tarkka (1997) and Kytö, Rydén, and Smitterberg (2006) incorporate register comparisons to describe historical change for linguistic features like existential clauses, adverbial clauses, and relative clauses. Biber and Clark (2002) contrast the kinds of noun modifiers common in academic versus popular written registers. Several historical studies of stance and modality have included analysis of register differences, such as Kytö (1991) on modal verbs in written and speech-based registers, Culpeper and Kytö (1999) on hedges in Early Modern English dialogues, Salager-Meyer and Defives (1998) on hedges in academic writing over the last two centuries, Fitzmaurice (2002b, 2003) on stance and politeness in early eighteenth-century letters, and Biber (2004) on historical change in the use of stance and modal features across a range of speech-based and written registers. A few studies have focused on recent (i.e., twentieth-century) historical change; for example, Hundt and Mair (1999) contrast the rapid grammatical change observed in “agile” registers (like newspaper writing) with the much slower pace of change observed in “uptight” registers like academic prose. Leech, Hundt, Mair, and Smith (in press) track historical change in the twentieth century using the register categories distinguished in the Brown/LOB family of corpora.

3. Descriptions of Varieties

3.1. Register Descriptions

The studies surveyed in the preceding section focus on a particular linguistic feature, using register to describe the use of that feature. In the present section, the analytical perspective is reversed: These studies focus on the overall description of a register, considering a suite of linguistic features that are characteristic of the register.

Many studies of this type describe spoken registers, including conversation (e.g., Biber, 2008; Carter and McCarthy, 1997, 2004; Quaglio and Biber, 2006; Biber and Conrad, in press: chapter 4), service encounters (e.g., McCarthy, 2000), call center interactions (Friginal, 2009a, 2009b), spoken business English (McCarthy and Handford, 2004), television dialogue (Quaglio, 2009; Rey, 2001), spoken media discourse (O'Keeffe, 2006), and spoken university registers like classroom teaching, office hours, and teacher-mentoring sessions (e.g., Biber, 2006a; Biber, Conrad, and Leech, 2002; Csomay, 2005; Reppen and Vásquez, 2007). Ädel and Reppen (2008) include several papers that use corpus analysis to describe different registers from academic, workplace, and television settings.

However, written registers have received considerably more attention than spoken registers. Academic prose has been the best described written register (see, e.g., Biber, 2006a; Biber, Connor, and Upton, 2007; Connor and Mauranen, 1999; Connor and Upton, 2004b; Conrad, 1996, 2001; Freddi, 2005; McKenna, 1997; Tognini-Bonelli and Del Lungo Camiciotti, 2005). But many other written registers have also been described using corpus-based analysis, including personal letters (e.g., Connor and Upton, 2003; Fitzmaurice, 2002a; Precht, 1998), written advertisements (e.g., Bruthiaux, 1994, 1996, 2005), newspaper discourse (e.g., Bednarek, 2006; Herring, 2003; Jucker, 1992), and fiction (e.g., Thompson and Sealey, 2007; Mahlberg, in press; Semino and Short, 2004). Electronic registers that have emerged over the past few decades, from e-mail communication to weblogs and texting, have been an especially interesting and productive area of research (see, e.g., Biber and Conrad, in press: chapter 7; Danet and Herring, 2003; Gains, 1999; Herring and Paolillo, 2006; Hundt, Nesselhauf, and Biewer, 2007; Morrow, 2006).

3.2. Multidimensional Analyses of Register Variation

Most of the studies previously listed have the primary goal of describing a single register. However, corpus analysis can also be used to describe the overall patterns of variation among a set of spoken and/or written registers. Perhaps the best known approach used for descriptions of this type is multidimensional (MD) analysis: a corpus-driven methodological approach that identifies the frequent linguistic co-occurrence patterns in a language, relying on inductive empirical/quantitative analysis (see, e.g., Biber, 1988, 1995; Biber and Conrad, in press: chapter 8). Frequency plays a central role in the analysis, because each dimension represents a constellation of linguistic features that frequently co-occur in texts. These dimensions of variation can be regarded as linguistic constructs not previously recognized by linguistic theory. Thus, MD analysis is a corpus-driven (as opposed to corpus-based) methodology, in that the linguistic constructs (the dimensions) emerge from analysis of linguistic co-occurrence patterns in the corpus. The set of co-occurring linguistic features that comprise each dimension is identified quantitatively. That is, based on the actual distributions of linguistic features in a large corpus of texts, statistical techniques (specifically, factor analysis) are used to identify the sets of linguistic features that frequently co-occur in texts.

The original MD analyses (Biber, 1986, 1988) investigated the relations among general spoken and written registers in English, based on analysis of the Lancaster-Oslo/Bergen (LOB) Corpus (15 written registers) and the London-Lund Corpus (6 spoken registers). Sixty-seven different linguistic features were analyzed computationally in each text of the corpus. Then, the co-occurrence patterns among those linguistic features were analyzed using factor analysis, identifying the underlying parameters of variation—in other words, the factors or dimensions.
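The first two steps described above (counting features in each text, then examining how those features co-occur across texts) can be illustrated with a minimal sketch. The feature names, counts, and the per-1,000-word normalization are hypothetical assumptions for illustration, and plain Pearson correlations stand in for the full factor analysis, which would take a correlation matrix of this kind as its input.

```python
from math import sqrt

def rate_per_1000(count, n_words):
    """Normalize a raw feature count to a rate per 1,000 words of text."""
    return 1000.0 * count / n_words

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of rates."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: word count and raw feature counts for four texts.
texts = [
    {"words": 2000, "contractions": 60, "first_person": 90, "nouns": 380},
    {"words": 2000, "contractions": 50, "first_person": 80, "nouns": 400},
    {"words": 2000, "contractions": 4, "first_person": 10, "nouns": 620},
    {"words": 2000, "contractions": 2, "first_person": 6, "nouns": 640},
]
features = ["contractions", "first_person", "nouns"]

# One list of normalized rates per feature, in text order.
rates = {f: [rate_per_1000(t[f], t["words"]) for t in texts] for f in features}

# Pairwise correlations: features that rise and fall together across texts
# are candidates for loading on the same dimension.
corr = {
    (a, b): pearson(rates[a], rates[b])
    for a in features
    for b in features
    if a < b
}
for (a, b), r in sorted(corr.items()):
    print(f"{a:>13} ~ {b:<13} r = {r:+.2f}")
```

In this toy data, contractions and first-person pronouns correlate positively with each other and negatively with nouns, which is the kind of complementary distribution that a factor analysis would group into a single dimension with positive and negative poles.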

In the 1988 MD analysis, the 67 linguistic features were reduced to 7 underlying dimensions. (The technical details of the factor analysis are given in Biber, 1988: chapters 4–5; see also Biber, 1995: chapter 5.) The dimensions are interpreted functionally, based on the assumption that linguistic co-occurrence reflects underlying communicative functions; that is, linguistic features occur together in texts because they serve related communicative functions. For example, table 38.1 lists the important co-occurring features for dimensions 1 and 2 from the 1988 MD analysis, together with the labels reflecting the functional interpretation.

Many subsequent studies have applied the 1988 dimensions of variation to study the linguistic characteristics of other more specialized registers and discourse domains (Conrad and Biber, 2001). Examples are listed after table 38.1, grouped into present-day and historical registers. However, other MD studies have undertaken new corpus-driven analyses to identify the distinctive sets of co-occurring linguistic features that appear in a particular discourse domain or in a language other than English. The following section surveys some of those studies.

Table 38.1. Summary of Dimensions 1 and 2 from the 1988 MD analysis of general English registers

Dimension 1: “Involved versus Informational Production”

Positive features: mental verbs, present tense verbs, contractions, possibility modals, first- and second-person pronouns, demonstrative pronouns, emphatics, hedges, causative subordination, WH clauses, that-clauses with that omitted, WH questions

Negative features: nouns, long words, high type/token ratio, prepositional phrases, attributive adjectives, passive verbs

Dimension 2: “Narrative Discourse”

Positive features: past tense verbs, perfect aspect verbs, communication verbs, third-person pronouns

Negative features: present tense verbs, attributive adjectives

Source: Biber, 1988.
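Applying an existing dimension to a new set of texts amounts to placing each text on the positive-versus-negative scale that the dimension defines. The chapter does not spell out the scoring procedure; the sketch below assumes one common convention, in which each feature rate is standardized across the corpus and a text's dimension score is the sum of its z-scores on the positive features minus the sum on the negative features. The text names, rates, and the reduced feature set are hypothetical.

```python
from statistics import mean, stdev

# A reduced, illustrative subset of the Dimension 1 features in Table 38.1.
POSITIVE = ["contractions", "first_person_pronouns"]
NEGATIVE = ["nouns", "prepositional_phrases"]

# Hypothetical rates per 1,000 words for four texts.
corpus = {
    "conversation_01": {"contractions": 30, "first_person_pronouns": 45,
                        "nouns": 190, "prepositional_phrases": 80},
    "letter_01": {"contractions": 20, "first_person_pronouns": 35,
                  "nouns": 220, "prepositional_phrases": 95},
    "press_01": {"contractions": 5, "first_person_pronouns": 10,
                 "nouns": 290, "prepositional_phrases": 125},
    "academic_01": {"contractions": 1, "first_person_pronouns": 4,
                    "nouns": 320, "prepositional_phrases": 140},
}

def z_scores(corpus, feature):
    """Standardize one feature across all texts (mean 0, sample SD 1)."""
    values = [rates[feature] for rates in corpus.values()]
    m, s = mean(values), stdev(values)
    return {text_id: (rates[feature] - m) / s for text_id, rates in corpus.items()}

def dimension_scores(corpus, positive, negative):
    """Sum of positive-feature z-scores minus sum of negative-feature z-scores."""
    z = {f: z_scores(corpus, f) for f in positive + negative}
    return {
        text_id: sum(z[f][text_id] for f in positive)
        - sum(z[f][text_id] for f in negative)
        for text_id in corpus
    }

scores = dimension_scores(corpus, POSITIVE, NEGATIVE)
for text_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{text_id:<16} {score:+.2f}")
```

With these toy rates the conversation text receives the highest ("involved") score and the academic text the lowest ("informational") score. Because every feature is standardized, the scores are relative to the corpus used for standardization rather than absolute.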

Present-Day Registers

  • Spoken and written university registers: Biber et al. (2002)

  • AmE versus BrE written registers: Biber (1987)

  • AmE versus BrE conversational registers: Helt (2001)

  • Student versus academic writing (biology, history): Conrad (1996, 2001)

  • I-M-R-D sections in medical research articles: Biber and Finegan (1994)

  • Direct mail letters: Connor and Upton (2003)

  • Discourse moves in non-profit grant proposals: Connor and Upton (2004b)

  • Oral proficiency interviews: Connor-Linton and Shohamy (2001)

  • Academic lectures: Csomay (2005)

  • Conversation versus TV dialogue: Quaglio (2009)

  • Female/male conversational style: Rey (2001); Biber and Burges (2000)

  • Author styles: Connor-Linton (2001); Biber and Finegan (1994)

Historical Registers

  • Written and speech-based registers, 1650-present: Biber and Finegan (1989a, 1997)

  • Medical research articles and scientific research articles, 1650-present: Atkinson (1992, 1996, 1999)

  • Nineteenth-century written registers: Geisler (2002)

3.2.1 Comparison of the Multidimensional Patterns across Discourse Domains and Languages

Numerous other studies have undertaken complete MD analyses, using factor analysis to identify the dimensions of variation operating in a particular discourse domain in English rather than applying the dimensions from the 1988 MD analysis (e.g., Biber, 1992, 2001, 2006a, 2008; Biber, Connor, and Upton, 2007; Biber and Jones, 2005; Biber and Kurjian, 2007; Friginal, 2006, 2009b; Kanoksilapatham, 2005, 2007; Reppen, 2001).

Given that each of these studies is based on a different corpus of texts, representing a different discourse domain, it is reasonable to expect that they would each identify a unique set of dimensions. This expectation is reinforced by the fact that the more recent studies have included additional linguistic features not used in earlier MD studies (e.g., semantic classes of nouns and verbs). However, despite these differences in design and research focus, there are certain striking similarities in the set of dimensions identified by these studies.

Most important, in nearly all of these studies, the first dimension identified by the factor analysis is associated with a literate, informational focus (e.g., nouns, prepositional phrases, attributive adjectives, longer words) versus an oral, involved focus (personal involvement/stance, interactivity, and/or real-time production features). For example, the MD studies of university spoken and written registers (Biber, 2006a), elementary school spoken and written registers (Reppen, 2001), and eighteenth-century written and speech-based registers (Biber, 2001) all identified a first dimension of this type. More surprisingly, a similar dimension has emerged even in MD studies that have focused exclusively on spoken registers, such as that of M. White (1994), which investigated register variation within the domain of job interviews, and of Biber (2008), which investigated register variation among the different types of conversation. A second parameter found in most MD analyses corresponds to narrative discourse, reflected by the co-occurrence of features like past tense, third-person pronouns, perfect aspect, and communication verbs (see, e.g., the Biber, 2006a, study of university registers; Biber, 2001, on eighteenth-century registers; and the Biber, 2008, study of conversation text types).

However, most of these studies have also identified some dimensions that are unique to the particular discourse domain. For example, Reppen's (1994) factor analysis identified a dimension of “other-directed idea justification” in elementary student registers. The study of university spoken and written registers (Biber, 2006a) identified two dimensions that are specialized to the university discourse domain: “Procedural versus content-focused discourse” and “academic stance.”

In sum, corpus-driven MD studies of English registers have uncovered both surprising similarities and notable differences in the underlying dimensions of variation. Two parameters seem to be fundamentally important, regardless of the discourse domain: a dimension associated with informational focus versus (inter)personal focus and a dimension associated with narrative discourse. At the same time, these MD studies have uncovered dimensions particular to the communicative functions and priorities of each different domain of use.

These same general patterns have emerged from MD studies of languages other than English, including Nukulaelae Tuvaluan (Besnier, 1988), Korean (Kim and Biber, 1994), Somali (Biber and Hared, 1992, 1994), Taiwanese (Jang, 1998), Spanish (Biber, Davies, Jones, and Tracy-Ventura, 2006; Biber and Tracy-Ventura, 2007; Parodi, 2007), and Dagbani (Purvis, 2008). Taken together, these studies provide the first comprehensive investigations of register variation in non-English languages.

Biber (1995) synthesizes several of these studies to investigate the extent to which the underlying dimensions of variation and the relations among registers are configured in similar ways across languages. These languages show striking similarities in their basic patterns of register variation, as reflected by the co-occurring linguistic features that define the dimensions of variation in each language, the functional considerations represented by those dimensions, and the linguistic/functional relations among analogous registers. For example, similar to the full MD analyses of English, these MD studies have all identified dimensions associated with informational versus (inter)personal purposes and with narrative discourse.

At the same time, each of these MD analyses has identified dimensions that are unique to a language, reflecting the particular communicative priorities of that language and culture. For example, the MD analysis of Somali identified a dimension interpreted as “distanced, directive interaction,” represented by optative clauses, first- and second-person pronouns, directional preverbal particles, and other case particles. Only one register is especially marked for the frequent use of these co-occurring features in Somali—personal letters. This dimension reflects the particular communicative priorities of personal letters in Somali, which are typically interactive as well as explicitly directive.

The cross-linguistic comparisons further show that languages as diverse as English and Somali have undergone similar patterns of historical evolution following the introduction of written registers. For example, specialist written registers in both languages have evolved over time to styles with an increasingly dense use of noun phrase modification. Historical shifts in the use of dependent clauses are also surprising: in both languages, certain types of clausal embedding—especially complement clauses—turn out to be associated with spoken registers rather than with written registers.

These synchronic and diachronic similarities raise the possibility of universals of register variation. Synchronically, such universals reflect the operation of underlying form/function associations tied to basic aspects of human communication; diachronically, such universals relate to the historical development of written registers in response to the pressures of modernization and language adaptation.

3.3. Corpus-Based Studies of Historical Registers

Corpus analysis has been especially important for historical descriptions of registers (see Biber and Conrad, in press: chapter 6). Multidimensional analysis has been used to document historical patterns of register variation (e.g., Atkinson, 1992, 1996, 1999; Biber, 2001; Biber and Finegan, 1989a, 1997; Geisler, 2002). However, there has been an even larger number of studies that provide a detailed description of a single historical register. A few MD studies have focused on a specific register, such as the study of historical change in fictional dialogue by Biber and Burges (2000) or the study of recent changes in television dialogue (Rey, 2001). But most of these studies provide detailed descriptions of the linguistic characteristics of a historical register. Several of these studies analyze spoken registers from earlier historical periods (e.g., Culpeper and Kytö, 2000, forthcoming; Kahlas-Tarkka and Rissanen, 2007; Kryk-Kastovsky, 2000, 2006; Kytö and Walker, 2003). The great majority, though, focus on written historical registers, such as letters (Fitzmaurice, 2002a; Nevala, 2004), medical recipes and herbals (Mäkinen, 2002; Taavitsainen, 2001), and medical and scientific writing (e.g., Taavitsainen and Pahta, 2000, 2004).

3.4. World Englishes and English as a Lingua Franca (ELF)

In general, sociolinguistics has been resistant to the application of corpus-based analyses, and so most studies of social and regional dialect variation continue to employ traditional methodologies. However, a few research projects have studied regional dialect variation from a corpus perspective. For the most part, these projects have been conducted in European universities (Freiburg, Helsinki, Newcastle) and have focused on British English dialects, resulting in the Newcastle Electronic Corpus of Tyneside English, the Helsinki Corpus of British English Dialects (see Ihalainen, 1990), and the Freiburg English Dialect Corpus (FRED; see Kortmann and Wagner, 2005; Anderwald and Wagner, 2005). We are aware of only one study to date that has applied a corpus approach to analyze American English regional dialects: Grieve's (2009) study of variation in a 50-million-word corpus of letters to the editor collected from 200 cities from across the United States.

In contrast, the linguistic study of global varieties of English—or “World Englishes”—is almost always carried out from a corpus perspective. The strengths of the corpus approach make it ideal for describing new varieties that have emerged as English adapts to changing circumstances of use and contact with local languages and cultures (see Breiteneder, 2008). Research efforts in this area have focused on two major subareas: the study of World Englishes (indigenous varieties of English) and the study of English as a Lingua Franca (ELF; English used by nonnative English speakers). (See J. Jenkins, 2006, for a full discussion of this topic.)

Corpus development efforts in the arena of World Englishes are best represented by the International Corpus of English (ICE) project. The ICE project is an attempt to construct comparable corpora for all varieties of English spoken around the world (see Greenbaum, 1988, 1990a, 1990b, 1990c, 1991, 1996; Greenbaum and Nelson, 1996). Each corpus in ICE ideally has the same design—in other words, a total size of one million words, with 500 texts of approximately 2,000 words each from the same registers (news, conversation, etc.). The texts in the corpus date from 1990 or later. The authors and speakers of the texts are aged 18 or over, are educated through the medium of English, and either were born in the target country or moved there at an early age (Nelson, 1996).
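The ICE design parameters described above lend themselves to a simple consistency check when assembling a component corpus. The sketch below is only an illustration: the metadata field names (`words`, `year`, `speaker_age`) and the 10 percent tolerance used to interpret "approximately 2,000 words" are assumptions, not part of the ICE specification.

```python
# Design targets taken from the ICE description above.
TARGET_TEXTS = 500
WORDS_PER_TEXT = 2000
EARLIEST_YEAR = 1990
MIN_AGE = 18

def target_corpus_size(n_texts=TARGET_TEXTS, words_per_text=WORDS_PER_TEXT):
    """Total size implied by the design: number of texts times text length."""
    return n_texts * words_per_text

def check_text(meta, tolerance=0.10):
    """Return a list of design problems for one text (empty if none).

    `meta` uses hypothetical field names; `tolerance` is an assumed
    reading of "approximately 2,000 words".
    """
    problems = []
    low = WORDS_PER_TEXT * (1 - tolerance)
    high = WORDS_PER_TEXT * (1 + tolerance)
    if not low <= meta["words"] <= high:
        problems.append("length outside tolerance")
    if meta["year"] < EARLIEST_YEAR:
        problems.append("text predates 1990")
    if meta["speaker_age"] < MIN_AGE:
        problems.append("speaker under 18")
    return problems

ok_text = {"words": 2040, "year": 1994, "speaker_age": 32}
bad_text = {"words": 1200, "year": 1987, "speaker_age": 16}

print(target_corpus_size())
print(check_text(ok_text))
print(check_text(bad_text))
```

Holding these parameters constant across components is what makes frequency comparisons between national varieties meaningful without further reweighting.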

As part of the ICE project or other related efforts, individual corpora have been constructed for many of the varieties of English used around the world. These include corpora for the “inner-circle” varieties of English (e.g., for Australia, Canada, Great Britain, New Zealand, the United States; see ice/) as well as corpora for numerous other varieties of English spoken around the world, such as Caribbean English, East African English, Fiji English, Filipino English, Hong Kong English, Indian English, Jamaican English, Nigerian English, Singaporean English, and Xhosa English (see, e.g., Banjo, 1996; Bolt and Kingsley, 1996; Bolton, 2000; Burridge and Kortmann, 2008; Friginal, 2009b; Holmes, 1996; Hundt, 1998, 2006; Hundt and Biewer, 2007; Kortmann, 2006; Mair, 1992; Mair and Sand, 1998; Ooi, 1997; Rogers, 2002, 2003; Sand, 1998, 1999; Schmied, 1990, 1994, 2004a, 2004b, 2005, 2006, 2007; Schmied and Hudson-Ettle, 1996; Tent and Mugler, 1996, 2004).

A parallel research effort has focused on English as a lingua franca (ELF). Two especially important projects in this area have been the Vienna Oxford International Corpus of English (VOICE; see Seidlhofer, 2006, 2007; Seidlhofer, Breiteneder, and Pitzl, 2006; Breiteneder et al., 2006) and the corpus of English as Lingua Franca in Academic Settings (ELFA corpus; see Mauranen, 2003, 2006, 2007).

4. Corpus Linguistics, Language Learning, and Language Pedagogy

Explorations into the pedagogical applications of corpus linguistics continue to match ongoing advancements in corpus-based technology and classroom research. Vocabulary acquisition and the mastery of grammar for language learners have traditionally been the preferred areas of investigation by many corpus researchers involved in the design and creation of language teaching materials (Conrad, 1999, 2000; Hinkel, 2002). However, in recent years, corpus tools have been utilized in the teaching of specific skills, particularly in genre-based writing (Hyland, 2004b; Swales, 2002) and speaking in various academic and professional contexts.

There are several points of intersection between corpus linguistics and directly applied issues that involve language teaching and learning. In the following sections, we address four of these:

  • The compilation and analysis of learner corpora

  • The use of corpora for language teaching and learning

  • Applications of corpus research in ESP/EAP

  • The extent to which corpus findings can be integrated into textbooks and other teaching materials

4.1. Learner Corpora

One major application of corpus methods has been in the construction of learner corpora and the analysis of those corpora to document differences across L1 backgrounds. The most important project of this type is the International Corpus of Learner English (ICLE), a collection of corpora produced by learners from several different language backgrounds (see, e.g., Granger, 1993, 1994, 1996, 1998a, 2003a, 2003b). Many studies have compared the patterns of use in learner corpora to those found in native-English corpora to document patterns of overuse or underuse by learners. Studies have focused on a wide range of grammatical features, such as passives, participle clauses, connectors, and so on (see Aarts and Granger, 1998; Granger, 1997a, 2004; Granger and Tyson, 1996; Granger, Hung, and Petch-Tyson, 2002). Many studies in this tradition have also focused on formulaic sequences and the lexico-grammatical patterns associated with different learner groups (see, e.g., Altenberg and Granger, 2001; De Cock, 1998; De Cock et al., 1998; Granger, 1998b; Meunier and Granger, 2008). Although most corpus studies of learner language have been based on the ICLE, there have also been major studies with similar research goals undertaken from other perspectives (e.g., Hinkel, 2002, 2003; Reder, Harris, and Setzler, 2003).
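Overuse and underuse are judged on normalized rather than raw frequencies, because learner and reference corpora usually differ in size. The sketch below is a minimal illustration with hypothetical counts. It computes rates per 10,000 words and, as an assumed significance measure that the chapter itself does not name, the log-likelihood statistic often used for comparing a frequency across two corpora.

```python
from math import log

def rate_per_10k(count, corpus_size):
    """Normalized frequency: occurrences per 10,000 words."""
    return 10_000 * count / corpus_size

def log_likelihood(count_a, size_a, count_b, size_b):
    """Log-likelihood for one item observed in two corpora.

    Expected counts assume the item is equally frequent in both corpora.
    A zero observed count contributes nothing to the sum.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            ll += observed * log(observed / expected)
    return 2 * ll

# Hypothetical counts for one connector in a learner and a native corpus.
learner_count, learner_size = 150, 100_000
native_count, native_size = 100, 200_000

learner_rate = rate_per_10k(learner_count, learner_size)
native_rate = rate_per_10k(native_count, native_size)
ratio = learner_rate / native_rate
ll = log_likelihood(learner_count, learner_size, native_count, native_size)

label = "overuse" if ratio > 1 else "underuse"
# 3.84 is the chi-square critical value for p < 0.05 with 1 degree of freedom.
print(f"learner {learner_rate:.1f} vs native {native_rate:.1f} per 10k "
      f"({label}, ratio {ratio:.2f}, LL {ll:.2f}, significant: {ll > 3.84})")
```

Note that the raw counts alone (150 versus 100) would understate the difference here, since the learner corpus in this toy example is half the size of the reference corpus.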

4.2. Corpora for Language Teaching and Learning

An even larger number of studies address the use of corpora for language teaching, introducing the approaches and discussing potential pedagogical benefits. These include numerous book-length treatments (e.g., Aston, 2001a; Aston, Bernardini, and Stewart, 2004; Botley, McEnery, and Wilson, 2000; Burnard and McEnery, 2000; Ghadessy et al., 2001; Lewandowska-Tomaszczyk, 2003, 2004; McEnery and Wilson, 1997; Mukherjee and Rohrbach, 2006; O'Keeffe, McCarthy, and Carter, 2007; Sinclair, 2004; Thomas and Short, 1996; Tribble and Jones, 1997; Wichmann, Fligelstone, McEnery, and Knowles, 1997) as well as an even larger number of journal articles and book chapters (e.g., Alderson, 1996; Aston, 1995, 1997, 2001b; Barbieri and Eckhardt, 2007; Braun, 2005; Brodine, 2001; Donley and Reppen, 2001; Fligelstone, 1993; Huckin and Coady, 1999; Kaltenböck and Mehlmauer-Larcher, 2005; Leech, 1997, 2000; McCarthy and Carter, 2001; McEnery and Wilson, 1993, 1997, 2001; Meunier, 2002; Milton, 1998; Mindt, 1996; Mudraya, 2006; Murphy, 1996; O'Keeffe and Farr, 2003; Partington, 2001; Salsbury and Crummer, 2008; Shirato and Stapleton, 2007; Thompson and Tribble, 2001; Tribble, 2001; Yoon and Hirvela, 2004; Zorzi, 2001).

One especially common topic of these studies is the use of concordancing activities in the classroom, especially for inductive, data-driven learning (in addition to many of the studies previously cited, see Cobb, 1997; Flowerdew, 2001; Gaskell and Cobb, 2004; Gavioli, 1997, 2001; Johns, 1994, 1997; Nesselhauf, 2003; Qiao and Sussex, 2001; Sinclair, 2003; Stevens, 1993; Todd, 2001; Wichmann, 1995). For instance, Cobb (1997) and Horst, Cobb, and Nicolae (2005) report specific learning gains in the transfer of vocabulary knowledge of language learners that are attributable to the use of concordance programs and corpus-based tools. Similar studies by Chan and Liou (2005), Charles (2005), and Friginal (2006) illustrate how web-based concordancing instruction and the use of concordancers in editing laboratory reports significantly help students' learning and use of verb-noun collocations, reporting verbs, passive and active sentence structures, and linking adverbials. Most participants in these studies see the use of concordancers as helpful. Innovative corpus tools that aid in the introduction of new words, collocations, and lexical bundles help learners to improve their awareness of word meanings and of the uses of words in various contexts. In addition, hands-on concordancing aids in successful learning of new academic vocabulary and enhances students' performance in activities and on tests (Altenberg and Granger, 2001; McCarthy and Carter, 2002; Nesselhauf, 2005).
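For readers unfamiliar with concordancing, the core operation is a keyword-in-context (KWIC) display: every occurrence of a search word is shown with a fixed span of words on either side, so that learners can inspect recurring patterns. The following is a minimal sketch, assuming whitespace tokenization and an invented example sentence; the concordance programs used in the studies cited above offer far richer search and sorting options.

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for each hit of `node`.

    Matching is case-insensitive; `width` is the number of tokens of
    context kept on each side.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented example text; real classroom use would draw on a corpus.
text = ("Students make progress when they make a decision to revise "
        "and teachers make comments on drafts")
tokens = text.split()

hits = kwic(tokens, "make", width=3)
for left, node, right in hits:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>25}  [{node}]  {right}")
```

Aligning the search word in a central column is what lets learners scan vertically for collocates such as "make progress" or "make a decision".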

Other studies focus more on the unexpected research findings that result from corpus investigations, discussing how such findings often indicate that we should be using radically different pedagogical approaches and different teaching materials than those traditionally used for language teaching (see, e.g., Carter and McCarthy, 1995; Conrad, 1999, 2000; Henry and Roseberry, 2001; Hughes and McCarthy, 1998; Hunston, 2002b; Hunston and Francis, 1998; Liu, 2003; Nesselhauf, 2003). For example, Biber and Reppen (2002) present corpus findings that identify the most common verbs in English conversation and then survey ESL grammar books to show that most of them fail to illustrate the use of those verbs.

4.3. Corpora and ESP/EAP

Research in the subfields of English for specific purposes (ESP) and English for academic purposes (EAP) has become almost entirely corpus based over the past 10 to 20 years. For example, a survey of articles in any recent issue of English for Specific Purposes or the Journal of English for Academic Purposes shows that recent linguistic descriptions of special/academic varieties in English are almost always based on corpus analysis.

Similarly, corpus approaches have become commonplace for ESP/EAP pedagogy. For example, Gilquin, Granger, and Paquot (2007), Hyland (2004b), Flowerdew (2005), and Gavioli (2005) all acknowledge the invaluable contribution of corpus approaches in the teaching of ESP/EAP, especially in increasing learners' awareness of the textual features of the target language. Yoon and Hirvela (2004) and Lee and Swales (2006) explore the use of corpora and corpus tools in EAP courses. For example, Lee and Swales piloted an innovative 13-week course in corpus-informed EAP, in which students were able to compare their writing with the linguistic patterns in a corpus of professional, published academic papers. These studies indicate that the corpus approach to academic writing facilitates the development of writing skills and contributes to learners' increased confidence; a majority of the participants in these studies reported that they would recommend corpus-informed writing classes to other foreign students.

4.4. Corpus-Informed Language Textbooks

In contrast to the extremely large number of books and research papers that advocate the application of corpus approaches for language teaching, there are surprisingly few language textbooks that are based on corpus research. ELT dictionaries, which have been based on corpus research since the 1980s, are the major exception here (see sections 1 and 2). However, publishers have been more reluctant to break with tradition in ELT textbooks for vocabulary and grammar.

There are a few notable exceptions to this generalization. In some cases, textbooks have been shaped by corpus analysis, even though this influence is not acknowledged on the book cover or in the introduction. Such books include the series Vocabulary in Use (McCarthy and O'Dell, 2001, 2004, 2005) and Natural Grammar (Thornbury, 2004). In more recent years, though, publishers have become more willing to market ESL textbooks that are directly shaped by the results of corpus research. For example, the four-level EFL/ESL Touchstone series by McCarthy, McCarten, and Sandiford (2006) is advertised as drawing on “the Cambridge International Corpus … to build a syllabus based on how people actually use English” (back cover). Vocabulary books like those by Schmitt and Schmitt (2005) and Huntley (2006) are corpus based in two major respects:

  1. They teach the words on the “Academic Word List”: a list of the most common vocabulary items that occur in a large corpus of written academic texts (see Coxhead, 2000, previously discussed in section 2.1).

  2. They provide practice in the typical “collocations” of those words, derived from further corpus analysis.
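The second point, deriving typical collocations from a corpus, can be sketched as counting the words that occur within a small window around a target word. The example sentence, the window size, and the stopword list below are hypothetical, and raw co-occurrence counts are used for simplicity; published collocation lists typically rely on larger corpora and on association measures rather than raw counts.

```python
from collections import Counter

def collocates(tokens, node, window=2, stopwords=frozenset()):
    """Count words occurring within `window` tokens of each `node` hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                # Skip the node itself and any function words.
                if j != i and tokens[j] not in stopwords:
                    counts[tokens[j]] += 1
    return counts

# Invented mini-corpus standing in for a corpus of academic writing.
text = ("we conduct statistical analysis of the data then conduct further "
        "analysis of the results and report statistical analysis in detail")
tokens = text.split()
function_words = frozenset({"of", "the", "in", "and", "then", "we"})

counts = collocates(tokens, "analysis", window=2, stopwords=function_words)
for word, freq in counts.most_common(3):
    print(f"{word:<12} {freq}")
```

A textbook exercise built on output of this kind would then present the strongest collocates (here, for instance, "statistical analysis" and "conduct ... analysis") as patterns for learners to practice.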

Corpus-based EAP curricula are widely used throughout Europe and Asia, but they are usually based on locally created materials rather than on a major textbook. One exception to this is the corpus-informed textbook on chemistry research writing by Robinson, Stoller, Costanza-Robinson, and Jones (2008). This book is actually targeted for all students of chemistry, because native speakers of English encounter many of the same challenges in learning advanced disciplinary writing skills as do language learners.

It is possible to make a distinction between corpus-informed textbooks and corpus-based textbooks: The former incorporate natural examples taken from a corpus, whereas in the latter, decisions about inclusion/exclusion of topics and the sequence of topics are made based on the results of prior corpus analysis. In many cases, a corpus-based book will present linguistic patterns of use that would not have even been acknowledged in a traditional textbook. The vocabulary books by Schmitt and Schmitt (2005) and Huntley (2006) are corpus based in this sense. The grammar book by Thornbury (2004) also seems to be corpus based in this sense, although there is nothing in the book introduction that acknowledges the role of corpus analysis.

Two recent books provide corpus-based introductions to English grammar for advanced students training to become language teachers: The Longman Student Grammar of Spoken and Written English (and the accompanying workbook; Biber, Conrad, and Leech, 2002; Conrad, Biber, and Leech, 2002) and the Teacher's Grammar of English (Cowan, 2008). Finally, Conrad and Biber (in press) identifies 50 of the most important and surprising corpus research findings from the Longman Grammar of Spoken and Written English, presenting those as grammar units for ESL/EFL students.

5. Future Directions

The present chapter has surveyed the extensive body of research using corpus analysis to describe the patterns of language use in English (and other languages). In addition, there is no shortage of studies that advocate the application of corpus approaches for language teaching. However, as described in the last section, there has been much less effort given to the actual implementation of corpus research findings to develop teaching materials, especially textbooks that can provide the basis for a curriculum. At present, however, there are several such books in the works, and we anticipate that this state of affairs will change dramatically over the next few years.

One specific area that is currently receiving attention is the analysis of spoken corpora annotated for prosody in addition to lexico-grammatical information. Interestingly, the very first large spoken corpus of English—the London-Lund Corpus—included detailed coding to reflect pitch, length, and pausing phenomena (see Svartvik, 1990). However, this information was mostly disregarded in linguistic analyses of that corpus. More recently, though, spoken corpora are being analyzed to document systematic patterns of discourse intonation. Cheng, Greaves, and Warren's (2008; cf. Warren, 2004) study of the Hong Kong Corpus of Spoken English is one notable example of this type. Similarly, the C-ORAL-ROM project (Cresti and Moneglia, 2005) is a major research effort to develop acoustically analyzed spoken corpora for Italian, French, Spanish, and Portuguese.

Finally, multimodal annotation of spoken interactions should be another important area for future research (see, e.g., Gu, 2002, 2007). In addition to enhanced prosodic and acoustic transcriptions of spoken corpora, these projects link video recordings to nonlinguistic features that play a crucial role in communication, such as facial expressions, hand gestures, and body position (see, e.g., Carter and Adolphs, 2008; Dahlmann and Adolphs, in press; Knight and Adolphs, 2008). Studies like these indicate that the strengths of corpus analysis can be extended to include aspects of communication beyond the analysis of the lexico-grammatical fabric of spoken and written texts.