Abstract and Keywords
This chapter gives an introduction to the Oxford Handbook of Corpus Phonology. It gives short summaries of each of the chapters contained in its four parts. Part I contains contributions on general issues in phonological corpus compilation, annotation, analysis, storage and dissemination. The chapters of Part II show how corpus-based methods may enrich and improve research within different subfields of phonology such as phonetics, prosody, segmental phonology, diachrony, first language acquisition, and second language acquisition. Part III presents relevant tools and methods, while Part IV describes a number of leading corpora in the field of phonology.
Corpus phonology is a new interdisciplinary field of research that has only begun to emerge during the last few years. It has grown out of the need for modern phonological research to be embedded within a larger framework of social, cognitive, and biological science, and combines methods and theoretical approaches from phonology, both diachronic and synchronic, phonetics, corpus linguistics, speech technology, information technology and computer science, mathematics, and statistics. In the past, phonological research comprised predominantly descriptive methods, but while new methods such as experimentation, acoustic-perceptual and aerodynamic modelling, as well as psycholinguistic and statistical methods, have recently been introduced, the employment of purpose-built corpora in phonological research is still in its infancy.
With the increasing number of phonological corpora being compiled all over the world, the need arises for the international research community to exchange ideas and find a consensus on fundamental issues such as corpus annotation, analysis, and dissemination as well as corpus data formats and archiving. The time seems right for the development of standards for phonological corpus compilation and especially corpus annotation and metadata. It is the aim of this Handbook to address these issues. It offers guidelines and proposes international standards for the compilation, annotation, and analysis of phonological corpora. This includes state-of-the-art practices in data collection and exploitation, theoretical advances in corpus design, best practice guidelines for corpus annotation, and the description of various tools for corpus annotation, exploitation, and dissemination. It also comprises chapters on phonological findings based on corpus analyses, including studies in fields as diverse as the phonology–phonetics interface, language variation, and language acquisition. Moreover, an overview is provided of a large number of existing phonological corpora and tools for corpus compilation, annotation, and exploitation.
The Handbook is structured in four parts. The first part, ‘Phonological Corpora: Design, Compilation, and Exploitation’, contains contributions on general issues in phonological corpus compilation, annotation, analysis, storage, and dissemination. In chapter 2, Ulrike Gut and Holger Voormann describe the basic processes of phonological corpus design including data compilation, data selection, and data annotation as well as corpus storage, sustainability, and reuse. They address fundamental questions and decisions that compilers of a phonological corpus are inevitably faced with, such as questions of corpus representativeness and size, raw data selection, and corpus sharing. On the basis of these reflections, the authors propose a methodology for corpus creation. Many of the issues raised in Chapter 2 are developed in the next three chapters. Chapter 3 is concerned with corpus-based data collection. In it, Bruce Birch discusses some key issues such as control over primary data, context, and contextual variation, and the observer’s paradox. He further classifies various widespread data collection techniques in terms of the amount and type of control they exert over the production of speech, and gives a comprehensive overview of data collection techniques for purposes of phonological research. The task of phonological corpus annotation is described in chapter 4. Elisabeth Delais-Roussarie and Brechtje Post first discuss some theoretical issues that arise in the transcription and annotation of speech such as segmentation and the assignment of labels. Furthermore, they provide a comprehensive overview and evaluation of the various systems that are in use for the annotation of segmental and suprasegmental information in the speech signal. Chapter 5 provides an overview of the state of the art in automatic phonetic transcription of corpora. After introducing the most relevant methodological issues in this area, Helmer Strik and Catia Cucchiarini describe and evaluate the different techniques of (semi-)automatic phonetic corpus transcription that can be applied, depending on what kind of data and annotations are available to corpus compilers.
The next couple of chapters are concerned with the exploitation and archiving of phonological corpora. In chapter 6, Hermann Moisl presents statistical methods for analysing phonological corpora, focusing in particular on cluster analysis. Illustrating his account with the Newcastle Electronic Corpus of Tyneside English (which is presented in Part IV), he describes and discusses in a detailed way the process and benefits of applying the technique of clustering to phonological corpus data. Chapter 7 is concerned with corpus archiving and dissemination. Peter Wittenburg, Paul Trilsbeek, and Florian Wittenburg discuss how the traditional model of corpus archiving and dissemination is changing dramatically, with digital innovations opening up new possibilities. They examine various preservation requirements that need to be met, and illustrate the use of advanced infrastructures for data accessibility, archiving and dissemination.
The last two chapters of Part I are concerned with the concept of metadata and data formats. In chapter 8, Daan Broeder and Dieter van Uytvanck describe some of the major metadata sets in use for the compilation of corpora including OLAC, TEI, IMDI, and CMDI. They further give some practical advice on what metadata schema to use or how to design one’s own if required. Finally, Chapter 9 addresses basic issues that are important for corpus compilers with regard to the choice of data format. Laurent Romary and Andreas Witt argue for providing the research community with a set of standardized formats that allow a high reuse rate of phonological corpora as well as better interoperability across tools used to produce or exploit them. They describe some basic concepts related to the representation of annotated linguistic content, and offer some proposals for the annotation of spoken corpus data.
The second part of this Handbook, ‘Applications’, is devoted to how speech corpora can be put to use. Each chapter considers how corpus-based methods may enrich and improve research within different subfields of phonology such as phonetics, prosody, segmental phonology, diachrony, first language acquisition, and second language acquisition. These topics and perspectives should by no means be regarded as exhaustive; they are but a few examples of many possible ones that are intended to show the usefulness of corpus-based methods. In chapter 10, Elisabeth Delais-Roussarie and Hiyon Yoo take as their starting point the various data and methods commonly used for research in phonetics and phonology. This leads them to a definition of what can be considered (1) a corpus and (2) a corpus-based approach to the two disciplines. The rest of the chapter is devoted to post-lexical phonology and prosody, such as liaison in French, suprasegmental phenomena such as phrasing or intonation, and the use of corpora in phonetic research. The topic of chapter 11, written by Hanne Gram Simonsen and Gjert Kristoffersen, is segmental phonology from a variationist point of view. An ongoing change where a formerly laminal /s/ is turned into apical /ʂ/ before /l/ in Oslo Norwegian is shown to be governed by a complex set of phonological and morphological constraints that could not have been identified without recourse to corpus-based methods. The corpora used in their analysis are described in Chapter 25 in this volume.
Chapter 12 takes up again one of the topics of Chapter 10 in greater detail: French liaison. Based on the PFC corpus (see Chapter 24, this volume), Jacques Durand shows how recourse to corpora has contributed to a better understanding of perhaps one of the most thoroughly analysed phenomena in the phonology of French. Durand argues that previous analyses are to a certain extent flawed because they are based on data which are too scarce, occasionally spurious, and often uncritically adopted from previous treatments. The PFC corpus has helped to put the analysis on firmer empirical ground and to chart which areas are relatively stable across speakers and which are variable. The topic of chapter 13, written by Yvan Rose, is phonological development in children. Following a discussion of issues that are central to research in phonological development, the chapter describes some solutions, with an emphasis on the recently proposed PhonBank initiative (see also chapter 19 of this volume) within the larger CHILDES project. Finally, chapter 14 is concerned with second language acquisition. Here, Ulrike Gut shows how research on the acquisition and structure of L2 phonetics and phonology can profit from the analysis of phonological corpora of second language learner speech. A second objective of this chapter is to discuss how corpora can support the creation of teaching materials and teaching curricula, and how they can be employed in classroom teaching and learning of phonology.
Part III of the Handbook concerns ‘Tools and Methods’. A number of tools, systems, or methods in this section have become standard in the field and are used in a large number of research projects. Thus, chapter 15 by Han Sloetjes provides an overview of ELAN, a stand-alone tool developed at the Max Planck Institute for Psycholinguistics in Nijmegen in the Netherlands. ELAN is a generic multimedia annotation tool which is not restricted to the analysis of spoken language, since it is also applied in sign language research, gesture research, and language documentation, to name just a few. It offers powerful descriptive strategies, since it supports time-aligned multilevel transcriptions and permits annotations to reference other annotations, allowing for the creation of annotation tree structures.
By contrast, EMU, presented in chapter 16 by Tina John and Lasse Bombien, is a database system for the specific analysis of speech, consisting of a collection of software tools for the creation, manipulation, and analysis of speech databases. EMU includes an interactive labeller which can display spectrograms and speech waveforms, and which allows the creation of hierarchical as well as sequential labels for a speech utterance. A central concern of the EMU project is the statistical analysis of speech corpora. To this end, EMU interfaces with the R environment for statistical computing. Like EMU, Praat, devised by Paul Boersma and David Weenink at the University of Amsterdam, is a computer program for analysing, synthesizing, and manipulating speech and other sounds, and for creating publication-quality graphics. A speech corpus typically consists of a set of sound files, each of which is paired with an annotation file, and metadata information. Paul Boersma’s introduction to Praat in chapter 17 demonstrates that the strengths of this tool lie in the acoustic analysis of the individual sounds, in the annotation of these sounds, and in browsing multiple sound and annotation files across the corpus. Moreover, corpus-wide acoustic analyses, leading to tables ready for statistical analysis, can be performed with the Praat scripting language, which is thoroughly described and illustrated by Caren Brinckmann in chapter 18. As stressed by this author, building a speech corpus and exploiting it to answer phonetic and phonological research questions is a very time-consuming process. Many of the necessary steps in the corpus-building process and the analysis stage can be facilitated by scripting. Caren Brinckmann demonstrates how scripts can be employed to support orthographic transcription, phonetic and prosodic annotation, querying, analysis, and preparation for distribution.
The contribution by Yvan Rose and Brian MacWhinney (chapter 19) is centred on the PhonBank project. The authors provide a description of the tools available through the PhonBank initiative for corpus-based research on phonological development as well as for data sharing. PhonBank is one of ten subcomponents of a larger database of spoken language corpora called TalkBank. Other areas in TalkBank include AphasiaBank, BilingBank, CABank, CHILDES, ClassBank, DementiaBank, GestureBank, Tutoring, and TBIBank. All of the TalkBank corpora use the CHAT data transcription format, which enables a thorough analysis with the CLAN programs (Computerized Language ANalysis). The PhonBank corpus is unique in that it can be analysed both with the CLAN programs and also with an additional program, called Phon, which is designed specifically for phonological analysis. The authors provide an introduction to Phon and then widen the discussion to methodological issues relevant to software-assisted approaches to phonological development, and to phonology, more generally.
In chapter 20, Thomas Schmidt and Kai Wörner provide an overview of EXMARaLDA. This is a system for creating, managing, and analysing digital corpora of spoken language which has been developed at the University of Hamburg since 2000. From the outset, EXMARaLDA was planned to serve a variety of purposes and user communities. Today, the system is used, among other things, for corpus development in pragmatics and conversation analysis, in dialectology, in studies of multimodality, and for the curation of legacy corpora systems of its kind. This chapter foregrounds the use of EXMARaLDA for corpus phonology within the wider study of spoken interactions. As part of the overview, three corpora are presented (a phonological corpus, a discourse corpus and a dialect corpus), all constructed with the help of EXMARaLDA.
Chapter 21 by Michael Kipp is devoted to ANVIL, a highly generic video annotation research tool. Like ELAN (cf. chapter 15), ANVIL supports three activities which are central to contemporary research on language interaction: the systematic annotation of audiovisual media (coding), the management of the resulting data in a corpus, and various forms of statistical analysis. In addition, ANVIL allows for audio, video, and 3D motion-capture data. The chapter provides an in-depth introduction to ANVIL’s underlying concepts, which is especially important when comparing it to alternative tools, several of which are described in other chapters of this volume. It also tries to highlight some of the more advanced features (like track types, spatial coding, and manual generation) that can significantly increase the efficiency and robustness of the coding process.
In line with its title, ‘Corpora’, Part IV of this Handbook aims at presenting a number of leading corpora in the field of phonology. Even a book of this size cannot remotely hope to reference all the worthwhile projects currently available. Our aim has therefore been a more modest one: that of giving an overview of some well-known speech corpora exemplifying the methods and techniques discussed in earlier parts of the book and covering different countries, different languages, different linguistic levels (from the segmental to the prosodic), and different perspectives (e.g. dialectology, sociolinguistics, and first and second language acquisition).
In chapter 23, Francis Nolan and Brechtje Post provide an overview of the IViE Corpus of spoken English. IViE stands for ‘Intonational Variation in English’ and refers to a collection of audio recordings of young adult speakers of urban varieties of English in the British Isles made between 1997 and 2002. These recordings were devised to facilitate the systematic investigation of intonational variation in the British Isles, and have served as a model for similar studies in other parts of the world. This chapter sets out by describing the reasoning behind the choices made in designing the corpus, and surveys some of the research applications in which recordings from IViE have been used. Chapter 24 is devoted to an ongoing programme concerning spoken French which was set up in the late 1990s: the PFC Programme (Phonologie du Français Contemporain: usages, variétés et structure), which is one of the largest databases of spoken French of its kind. In their contribution, Jacques Durand, Bernard Laks, and Chantal Lyche attempt to show the advantages of a uniform type of data collection, transcription, and coding which has led to the construction of an interactive website integrating advanced search and analysis tools and allowing for the systematic comparison of varieties of French throughout the world. They also emphasize that, while the core of the programme has been phonological (and initially mainly segmental), the database permits applications ranging from speech recognition to syntax and discourse, a point made by many other contributors to this volume.
In chapter 25, Kristin Hagen and Hanne Gram Simonsen provide a description of two speech corpora hosted by the University of Oslo: NoTa-Oslo and TAUS (see also Chapter 11, where research based on these corpora is reported). Both corpora are based on recordings of spontaneous speech from Oslo residents, NoTa-Oslo speech recorded in 2005–2006 and TAUS speech recorded in 1972–1973. These two corpora permit a thorough synchronic and diachronic investigation of speech from Oslo and its immediate surroundings, which can be seen as representative of Urban East Norwegian speech. In both cases, the web search interface is relatively simple to use, and the transcriptions are linked to audio files (for both NoTa-Oslo and TAUS) and video files (for NoTa-Oslo). NoTa-Oslo and TAUS are both multi-purpose corpora, designed to support research in different fields, such as phonology, morphology, syntax, semantics, discourse, dialectology, sociolinguistics, lexicography, and language technology. This makes the corpora very useful for most purposes, but it also means that they cannot immediately meet the demands of every research task. The authors show how NoTa-Oslo and TAUS have been used for phonological research, but also discuss some of the limitations entailed by the types of interaction at the core of the corpus, a discussion most instructive for researchers wishing to embark on large-scale socio-phonological projects.
Chapter 26 by Ulrike Gut turns to another area where phonological corpora are proving indispensable: that of second language acquisition. The LeaP corpus was collected in Germany at the University of Bielefeld between 2001 and 2003 as part of the LeaP (Learning Prosody in a Foreign Language) project. The aim has been to investigate the acquisition of prosody by second language learners of German and English with a special focus on stress, intonation, and speech rhythm as well as the influencing factors on the acquisition process and outcome. The LeaP corpus comprises spoken language produced by 46 learners of English and 55 learners of German as well as recordings with 4 native speakers of English and 7 native speakers of German. This chapter is particularly useful in providing a detailed discussion of methods concerning the compilation of a corpus designed for studying the acquisition of prosody: selection of speakers, recordings, types of speech, transcription issues, annotation procedures, data formats, and assessment of annotator reliability.
In chapter 27 by Joan C. Beal, Karen P. Corrigan, Adam J. Mearns, and Hermann Moisl, the focus is on the Diachronic Electronic Corpus of Tyneside English (DECTE), and particularly annotation practices and dissemination strategies. The first stage in the development of the Diachronic Electronic Corpus of Tyneside English (DECTE) was the construction of the Newcastle Electronic Corpus of Tyneside English (NECTE) between 2000 and 2005. NECTE is what is called a legacy corpus based on data collected for two sociolinguistic surveys conducted on Tyneside, northeast England, in c.1969–1971 and 1994, respectively. The authors concentrate in particular on transcription issues relevant for addressing research questions in phonetics/phonology, and on the nature of and rationale for the text-encoding systems adopted in the corpus construction phase. They also offer some discussion of the dissemination strategy employed since completion of the first stage of the corpus in 2005. The exploitation of NECTE for phonetic/phonological analysis is described in Moisl’s chapter in Part I of this Handbook. Insofar as the researchers behind NECTE have been pioneers in the construction of a unique electronic corpus of vernacular English which was aligned, tagged for parts of speech, and fully compliant with international standards for encoding text, the continuing work on the subcorpora now included within DECTE is of interest to all projects having to deal with recordings and metadata stretching back in time.
Interestingly, the following chapter on LANCHART by Frans Gregersen, Marie Maegaard, and Nicolai Pharao (chapter 28) focuses on similar issues concerning Danish. The authors give an outline of the corpus work done at the LANCHART Centre of the Danish National Research Foundation. The Centre has performed re-recordings of a number of informants from earlier studies of Danish speech, thus making it possible to study variation and change in real time. The chapter deals with the methodological problems posed by such a diachronic perspective in terms of data collection, annotation, and interpretation. Gregersen, Maegaard, and Pharao then focus on three significant examples: the geographical pattern of the (əð) variable, the accommodation to a moving target constituted by the raising of (æ) to [ɛ], and finally the covariation of three phonetic variables and one grammatical variable (the generic pronoun) in a single interview.
Chapter 29, written by Marc van Oostendorp, is devoted to phonological and phonetic databases at the Meertens Institute in Amsterdam. This centre was founded in 1930 under the name ‘Dialect Bureau’ (Dialectenbureau) before being named in 1979 after its first director, P. J. Meertens. Originally, the institute had as its primary goal the documentation of the traditional dialects as well as folk culture of the Netherlands. In the course of time, this focus has broadened in several ways. From a linguistic standpoint, the Institute has widened its scope to topics other than the traditional dialects. Currently it comprises two departments, one of Dutch Ethnology and one of Variation Linguistics. Although the documentation of dialects has made significant progress, considerable effort has recently gone into digitizing material and putting it online. Van Oostendorp’s contribution seeks to describe the two most important databases on Dutch dialects which are available at the Meertens Institute: the Goeman–Taeldeman–Van Reenen Database and Soundbites. He concludes by presenting new research areas at the Meertens Institute and by pointing out some desiderata concerning them.
Chapter 30 is concerned with the VALIBEL speech database. Anne Catherine Simon, Philippe Hambye, and Michel Francard present the ‘speech bank’ which has been developed since 1989 at the Centre de recherche Valibel of the Catholic University of Louvain (Belgium). This speech database, which is one of the largest banks of spoken French in the world, is not a homogeneous corpus but rather a compilation of corpora, collected with a wide range of linguistic applications in mind and integrated into a system allowing for various kinds of investigation. The authors give a thorough description of the database, with special attention to the features that are relevant for research in phonology. Although the first aim of VALIBEL was not to build up a reference corpus of spoken French, but to collect data in order to provide a sociolinguistic description of the varieties of French spoken in Belgium, Simon, Hambye, and Francard show how the continuing gathering of data for various research projects has finally resulted in the creation of a large and controlled database, highly relevant for research in a number of fields, including phonetics and phonology.
In chapter 31, Janet Fletcher and Lesley Stirling focus on prosody and discourse in the Australian Map Task corpus. The Australian Map Task corpus is part of the Australian National Database of Spoken Language (ANDOSL), which was collected in the 1990s for use in general speech science and speech technology research in Australia. It is closely modelled on the HCRC Map Task, which was designed in the 1990s by a team of British researchers to elicit spoken interaction typical of everyday talk in a controlled laboratory environment. Versions of this task have been used successfully to develop or test models of intonation and prosody in a wide range of languages including several varieties of English (as illustrated by Nolan and Post in Chapter 23). The authors show how the Australian Map Task has proved to be a useful tool with which to examine different prosodic features of spoken interactive discourse. While the intonational system of Australian English shares many features with other varieties of English, tune usage and tune interpretation are argued to remain variety-specific, with the Map Task proving to be a rich source of information on this question. The studies summarized in this contribution also illustrate the flexibility of Map Task data in permitting correlations of both micro-level discourse units such as dialogue acts and larger discourse segments such as Common Ground Units, with intonational and prosodic features of Australian English. The chapter includes a detailed discussion of annotation and analytical techniques for the study of prosody, thus complementing the contribution of Nolan and Post at the beginning of this part of the Handbook.
The Handbook concludes with a chapter by Jane S. Tsay describing a phonological corpus of L1 acquisition of Taiwan Southern Min. In her contribution, Tsay outlines the data collection, transcription, and annotations for the Taiwanese Child Language Corpus (TAICORP), including a brief description of computer programs developed for specific phonological analyses. TAICORP is a corpus of spontaneous speech between young children growing up in Taiwanese-speaking families and their carers. The target language, Taiwanese, is a variety of Southern Min Chinese spoken in Taiwan. (Taiwanese and Southern Min are used interchangeably by the author.) Tsay shows how a well-designed phonological corpus such as TAICORP can be used to throw light on many issues beyond phonology such as the acquisition of syntax (syntactic categories, causatives, classifiers) and of pragmatic features. From a phonological point of view, much of the literature on child language acquisition has focused primarily on universal innate patterns (e.g. markedness constraints within Optimality Theory), but many contemporary studies have also argued that frequency factors are highly relevant and indeed more crucial than markedness. Only corpora such as TAICORP can allow investigators to test competing hypotheses in this area. As is argued in most chapters of this volume, the construction of corpora cannot be divorced from theory construction and evaluation.
The idea for this Handbook was born during an ESF-funded workshop on phonological corpora held in Amsterdam in 2006. We would like to thank all participants of this event and of the summer school on Corpus Phonology held at Augsburg University in 2008 for their discussions, comments, and commitment to this emerging discipline of corpus phonology. Our thanks also go to Eva Fischer and Paula Skosples, who assisted us in the editing process of this Handbook. We hope that it will be of interest to researchers from a wide range of linguistic fields including phonology, both synchronic and diachronic, phonetics, language variation, dialectology, first and second language acquisition, and sociolinguistics.