Prosodic Phenomena: Stress, Tone, and Intonation
Abstract and Keywords
Prosodic phenomena such as stress, tone, and intonation have been the focus of much developmental research as well as theoretical work in phonology. This review presents an overview of research that explores the relationship between the development of prosodic phenomena and linguistic models of phonological structure, particularly, metrical stress theory and autosegmental phonology. The review surveys what is currently known about the developmental course of stress, tone, and intonation in infants and children, introduces research that investigates the role of organizational principles of phonological structure in the acquisition of these prosodic phenomena, and discusses the evidence and arguments for this approach toward understanding phonological acquisition.
This chapter provides an overview of research on the development of prosodic phonology, or the phonological organization beyond segments, particularly at or above the level of the word. The focus will be on three prosodic phenomena that have received much attention in developmental linguistics: namely, stress, tone, and intonation. To avoid overlap in coverage, this chapter will not discuss how these prosodic features are related to other areas of learning, such as speech and word segmentation, phonological processes, or morpho-phonological acquisition. For these issues, the reader is referred to the relevant portions of Chapter 3 by Zamuner and Kharlamov, Chapter 4 by Goad, and Chapter 7 by Tessier.
For each of these prosodic phenomena, the chapter first describes what is known about its course of development, including perceptual precursors in newborns and very young infants, and the subsequent emergence of general and language-specific properties. Next, it presents the outcomes of research that attempts to interpret the developmental observations within the frameworks of metrical stress theory and autosegmental phonology, models that have been central to theoretical research on prosodic phonology in the past few decades. This is followed by a critical assessment of the evidence and arguments for this approach to understanding the development of prosodic phenomena. The chapter concludes with suggestions for future directions.
5.2 Acquisition of Stress
5.2.1 Development of Stress
For the purpose of this chapter, stress is understood as a lexically assigned property of a syllable that renders the syllable a potential position of prominence (Hayes 1995; Sluijter 1995; Ladd 2008). Working from this definition, stress is a structural notion that does not necessarily translate to actual phonetic prominence; the realization of the latter depends on various factors, such as whether the stressed syllable is in or out of the focus position of the utterance. Stress is also a separate matter from the presence and shape of particular pitch patterns (e.g. a high pitch), whose association with a stressed syllable is dictated by the intonational system of the language. Despite the lack of isomorphic phonetic characteristics of stress, an underlyingly stressed syllable can be differentiated from an unstressed one as being phonetically salient in some contexts, and it is the acoustic signal of such prominence that the learner must be using to learn the structure and function of stress.
There is much evidence that sensitivity to acoustic differences associated with stress is already present in very young infants. A study using the high-amplitude sucking paradigm shows that newborns can discriminate natural samples of Italian disyllables and trisyllables differing in stress position (e.g. /ˈmama/ versus /maˈma/, /ˈtacala/ versus /taˈcala/) (Sansavini et al. 1997).1 Similarly, English-exposed 1-month-olds can detect a change in synthesized disyllables with different stress patterns such as /ˈbada/ versus /baˈda/ (Spring and Dale 1977; Jusczyk and Thompson 1978). As the stressed syllables in these studies differed from the unstressed ones in either duration (Sansavini et al. 1997; Spring and Dale 1977) or a combination of duration, amplitude, and fundamental frequency (Jusczyk and Thompson 1978), the results indicate that neonates are able to differentiate syllables based at least on duration, which is one of the key phonetic correlates of stress in mature systems (Sluijter and van Heuven 1995; Gussenhoven 2004; Kochanski et al. 2005).
One of the first signs of language-specific stress development we see in infants is their recognition of the predominant stress pattern of the ambient language. In English, stress frequently falls on the initial syllable of a word (Cutler and Carter 1987), a tendency that is particularly strong in infant-directed speech, in which stress can coincide with the beginning of a word as much as 95 percent of the time (Kelly and Martin 1994). This characteristic of English stress is picked up by infants some time between 7 and 9 months. Experiments using the head-turn preference procedure show that 9-month-old infants exposed to English (but not 6- or 7-month-olds) prefer to listen to initially-stressed disyllables over finally-stressed disyllables (Jusczyk et al. 1993a; Echols et al. 1997). In German, another language that has an overall tendency toward initial stress, ERP experiments with the mismatch negativity paradigm have shown that a similar bias for initially-stressed disyllables may emerge as early as 4 to 5 months of age (Weber et al. 2004; Friederici et al. 2007a). The interpretation that such preferences reflect language-specific input rather than a universal bias toward initial stress is reinforced by the lack of comparable effects in infants learning languages without an extremely skewed distribution of stress patterns, such as Spanish and Catalan (Pons and Bosch 2007).
Around the same time, infants’ behaviors also begin to reflect cross-linguistic differences in the lexical contrastiveness of stress. While infants respond differently to initially- versus finally-stressed words if they are exposed to a language that uses stress contrastively in lexical items (e.g. English, Spanish, German), they do not if they are learning a language in which stress is not lexically contrastive (e.g. French) (Höhle et al. 2009; Skoruppa et al. 2009, 2011; Pons and Bosch 2010). However, there is evidence that infants learning the latter type of language, such as French, do retain sensitivity to acoustic correlates of stress; thus their failure to respond to stress differences imposed on words suggests not a decline in auditory abilities but a functional reorganization due to the non-lexical nature of stress in the language (Skoruppa et al. 2009).
During the first year, infants also gradually become capable of learning stress patterns specific to individual lexical items. In English, the first indication of this process appears in 7-month-olds, who can detect a stress shift in a novel word form they have been familiarized with (e.g. doˈpita → ˈdopita) (Curtin et al. 2005). Evidence that infants can link such a stress difference to a referential contrast emerges several months later. In experiments using novel word forms and unfamiliar objects, 12-month-olds can learn distinct word–object pairings, even when the word forms differ only in the syllables that are stressed (e.g. ˈbedoka versus beˈdoka, Curtin 2009, 2011), and 14-month-olds can learn novel pairings when the word forms differ in the position of the stressed syllable (e.g. ˈbedoka versus doˈbeka, Curtin 2010). Recognition of familiar words by English- or French-exposed 11-month-olds is slowed down when the stress is shifted to the wrong syllable (e.g. ˈbaby → baˈby), indicating that the representations of real words learned before 1 year of age already contain information associated with stress (Vihman et al. 2004).
While the evidence from perception and recognition experiments may suggest that much of stress acquisition is achieved during the first year, production data tell a different story. In spontaneous real-word production as well as non-word imitation, children older than 1 year produce many stress errors, which usually reflect the predominant patterns of the language (English: Kehoe 1997, 1998, 2001, Klein 1984; Dutch: Fikkert 1994, Lohuis-Weber and Zonneveld 1996; Spanish: Hochberg 1988a, 1988b). In English and Dutch, initial primary stress is often imposed on words that begin with a syllable with no stress or secondary stress, as illustrated by the examples in (1):
In slightly older children, the conflict between the predominant and more specific stress patterns can result in word productions with two locations of primary stress.
Such errors gradually disappear from children’s production during the third and fourth years. However, stress patterns involving morphologically complex words and higher levels of prosodic domains continue to undergo development in school-aged children. An example of morphologically conditioned stress patterns would be derivational affixes that induce a predictable stress shift, such as -ic and -ity, which place primary stress on the preceding syllable (e.g. ˈmetal → meˈtallic, ˈpersonal → persoˈnality). These stress shift patterns are not fully acquired by 7- to 9-year-olds (Jarmulowicz 2006). An example of a stress pattern that operates above the level of a simple word is the contrast between compound (a ˈhotˌdog) and phrasal (a ˌhot ˈdog) stress in English. The comprehension of this contrast does not approximate adult performance until children pass the age of 9 years (Atkinson-King 1973; Vogel and Raimy 2002). The later development of these aspects of stress is not surprising since their mastery requires not only a purely phonological understanding of stress but also an understanding of how stress interacts with morphology (e.g. affixation, compounding) and syntax (e.g. noun phrase structure).
5.2.2 Metrical Phonology and Stress Acquisition
From the review of the developmental process of stress in the previous section, it should be evident that infants and children do not simply learn stress on an item-by-item basis. They engage in some level of generalization, as indicated by the biases toward language-specific regular patterns in perception and production. This raises questions about the nature of the knowledge of stress in young learners as well as the learning mechanisms involved in the acquisition of stress. The infants’ developmental behaviors may emerge from the piecemeal learning of the distributional characteristics observed within numerous instances of individual stress patterns that the learners encounter. Alternatively, the behaviors may be a manifestation of some abstract structural principles that underlie any human-language stress system. A related issue is the extent to which the paths of learning are guided by a priori principles of stress organization. Successful convergence on the adult state may require certain constraints on the range of possible stress systems learners entertain, or it may be sufficiently accomplished through a process that integrates the input data without a predetermined learning space.
In addressing these questions, many researchers have examined the extent to which the development of stress involves the same structural organization of mature phonological systems proposed in metrical stress theory. A central tenet of metrical phonology is that stress reflects a hierarchical structure that governs the positions of prominence in phonological forms (Liberman and Prince 1977; Hayes 1995; Kager 2007). One common way of representing this underlying structure is through grids of beats that express temporal sequencing along the horizontal axis and prominence along the vertical axis, as in (3).
According to the representation in (3), syllables, the potential bearers of stress, are grouped into feet, although for English nouns the final syllable is left out of the grouping (i.e. it is “extrametrical”). Feet are composed of a strong position, the head (“x” at the foot level structure in (3)), and a weak position (“.” in (3)). In English, the head of the rightmost foot is the position of word-level prominence (i.e. primary stress). Feet in English are also “quantity sensitive.” That is, their formation takes the internal structure of the syllable into consideration. A syllable containing a coda or a long vowel (i.e. a “heavy” syllable, marked H in (3)) can form a foot on its own; otherwise (i.e. if it is “light,” marked L in (3)) it needs to be combined with another syllable to form a foot. Metrical representations capture many fundamental characteristics of stress systems in human language, such as the tendency for stress to occur on alternating syllables, and the cumulativeness by which one syllable in a word is singled out to carry the highest prominence.
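The footing procedure just described can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not an implementation from the metrical literature: the function name `parse_feet`, the right-to-left scanning direction, the decision to leave stray light syllables unfooted, and the weight strings assigned to the example words are all choices made here for the sake of the example.

```python
def parse_feet(weights, final_extrametrical=True):
    """Group syllables into left-headed, quantity-sensitive feet.

    weights: list of "H" (heavy) or "L" (light), one per syllable.
    Returns (feet, main_stress), where feet is a list of tuples of
    syllable indices and main_stress is the index of the head of the
    rightmost foot (None if no foot could be built).
    Assumption: syllables are scanned from right to left.
    """
    n = len(weights)
    # Leave the final syllable out of the parse, unless it is the only one.
    limit = n - 1 if final_extrametrical and n > 1 else n
    feet = []
    i = limit - 1
    while i >= 0:
        if weights[i] == "H":
            # A heavy syllable can form a foot on its own.
            feet.append((i,))
            i -= 1
        elif i >= 1 and weights[i - 1] == "L":
            # Two light syllables combine into one foot.
            feet.append((i - 1, i))
            i -= 2
        else:
            # A stray light syllable is left unfooted (an assumption).
            i -= 1
    feet.reverse()
    # The foot head is the first syllable of the foot (left-headed);
    # primary stress falls on the head of the rightmost foot.
    main_stress = feet[-1][0] if feet else None
    return feet, main_stress


# Assumed weight profiles: a.gen.da = L H L, A.me.ri.ca = L L L L.
print(parse_feet(list("LHL")))   # ([(1,)], 1)   -> aGENda
print(parse_feet(list("LLLL")))  # ([(1, 2)], 1) -> aMErica
```

Changing the scanning direction, the headedness of the foot, or the extrametricality setting in such a sketch corresponds to the dimensions of cross-linguistic variation that metrical theory recognizes.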
Metrical representations also allow us to describe cross-linguistic variation systematically. Languages can vary in the type of feet they have in terms of head direction (trochaic or left-headed versus iambic or right-headed) and quantity sensitivity. Within quantity-sensitive languages, some treat closed syllables as heavy while others do not. Languages also differ in the direction in which syllables are parsed into feet (left-to-right versus right-to-left), what unit (e.g. syllable, consonant) or position (final or initial) is extrametrical (if any), whether the rightmost or leftmost foot is the most prominent, and whether feet that do not satisfy their size requirements are still admitted if no other options are available.
On this account, the task of a language learner is to find out the specific pattern of metrical setup adopted by the target language. In a parameter-setting approach (Dresher and Kaye 1990; Dresher 1999), the dimensions of cross-linguistic differences are binary parameters (e.g. parsing direction) that can be set to different values (e.g. right-to-left or left-to-right). The child is modeled as a learner who sets the value of each parameter based on the available cues in the input. In order for the learning to settle on the correct parameter settings, some parameters have been hypothesized to have a default value (the value that is retained in the absence of evidence to the contrary) and the order in which the parameters are set has been prescribed such that the parameters whose values crucially dictate the settings of other parameters are fixed first. In a constraint-based approach, such as Optimality Theory (Prince and Smolensky 2004), stress acquisition has been modeled as algorithmic reranking of violable constraints such as Parse (“a syllable must be footed”) and Align-Feet-Right (“each foot must be aligned with the end of a word”). An example of such a model is Robust Interpretive Parsing/Constraint Demotion (RIP/CD; Tesar 1998; Tesar and Smolensky 2000). In RIP/CD, the stress pattern assigned by the current constraint ranking is compared with the attested pattern. Whenever a mismatch is observed, the algorithm recursively modifies the metrical grammar by demoting certain constraints in the ranking until no such mismatches are detected.2 For a fuller discussion of this and other approaches to modeling phonological acquisition with constraints, the reader is referred to Chapter 30 by Jarosz in this volume.
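The demotion step at the core of such a learner can be illustrated with a small sketch. This is a simplified rendering of error-driven constraint demotion for a single mismatch, not the full RIP/CD procedure: the interpretive parsing step that assigns hidden foot structure to the attested form is omitted, and the function name `demote` and the violation counts in the example are hypothetical.

```python
def demote(strata, winner, loser):
    """Apply one constraint-demotion step.

    strata: list of sets of constraint names, highest-ranked first.
    winner: violation counts (dict) of the attested form.
    loser:  violation counts (dict) of the form the current grammar prefers.
    Returns a new list of strata in which every constraint that prefers
    the loser sits below the highest constraint that prefers the winner.
    """
    names = [c for stratum in strata for c in stratum]
    prefers_winner = {c for c in names if winner.get(c, 0) < loser.get(c, 0)}
    prefers_loser = {c for c in names if winner.get(c, 0) > loser.get(c, 0)}
    if not prefers_winner:
        raise ValueError("no constraint prefers the attested form")
    # Index of the highest stratum containing a winner-preferring constraint.
    top = min(i for i, s in enumerate(strata) if s & prefers_winner)
    new = [set(s) for s in strata]
    if top + 1 == len(new):
        new.append(set())
    # Move undominated loser-preferring constraints just below that stratum.
    for i in range(top + 1):
        moved = new[i] & prefers_loser
        new[i] -= moved
        new[top + 1] |= moved
    return [s for s in new if s]


# Hypothetical mismatch: the attested form is fully footed but has one
# misaligned foot; the learner's current output leaves two syllables unfooted.
start = [{"Parse", "Align-Feet-Right"}]
attested = {"Parse": 0, "Align-Feet-Right": 1}
current = {"Parse": 2, "Align-Feet-Right": 0}
print(demote(start, attested, current))  # [{'Parse'}, {'Align-Feet-Right'}]
```

Applying the same step again to the resulting ranking leaves it unchanged, which is the sense in which learning stops once no mismatch with the attested pattern remains.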
5.2.3 Evidence for Metrical Organization in Stress Development
There are two important empirical questions regarding the proposed involvement of metrical phonology in the development of stress. First, to what extent do patterns in developmental data support the hypothesis that stress learning progresses on the basis of metrical structures? Second, does successful learning of stress necessarily require the structures prescribed by metrical theory and its allied learning models (i.e. parameter-setting, constraint reranking)? This section will address the first question.
In support of the empirical fit between developmental data and metrical phonology, several studies have shown that stages of stress development can be matched up with a systematic progression of metrical settings (Fikkert 1994; Demuth 1996; Kehoe and Stoel-Gammon 1997a; Kehoe 1998). For example, Fikkert (1994) bases her account of Dutch stress acquisition on the parameter-setting model of Dresher and Kaye (1990), and presents stage-wise analyses of production data in child Dutch. The stage during which weak–strong targets are produced as strong–weak forms (as illustrated in (1)) is explained as one in which the parameters are set to allow only one left-headed disyllabic foot. The next stage is characterized by level stress (as illustrated in (2)), which exemplifies a metrical stage that allows more than one foot (which is now a left-headed moraic foot), but lacks a setting for the main stress parameter that assigns primary stress.3 Crucially, the unset main stress parameter should result in phonetic forms with level stress, which are not attested in the input. The presence of such forms cannot be directly explained by appealing to convergence to a dominant stress pattern.
A related but slightly different type of argument uses the size and shape of early word production as a source of evidence for metrical organization of developmental patterns. One robust observation of children’s word production is that there is an initial stage where syllables are omitted from long target words in such a way that the resulting word production is limited to two syllables at most. Furthermore, the child form only admits the dominant prosodic pattern of one- or two-syllable words in the language (e.g. a strong(–weak) pattern in English). This pattern has been attested in a number of languages including English (Allen and Hawkins 1978; Echols and Newport 1992; Schwartz and Goffman 1995; Salidis and Johnson 1997), Catalan (Prieto 2006), Dutch (Fikkert 1994; Wijnen et al. 1994), Hungarian (Fee 1995), K’iche Maya (Pye 1992), and Japanese (Ota 2003b). This stage of development follows from the general developmental hypothesis in Optimality Theory that all markedness constraints are ranked above faithfulness constraints in the initial state (Smolensky 1996b; Davidson et al. 2004). Such a ranking scheme predicts that the faithfulness constraint that prohibits deletion of input materials (Max) should be outranked by markedness constraints that require every syllable to belong to a foot (Parse-σ), feet to be either disyllabic or bimoraic (FtBin) and every foot to be aligned with a prosodic word (Align-Ft) (Pater 1997). As demonstrated by the tableau in (4), the result is that no structure larger than a prosodic word consisting of a single binary foot can be an optimal output, hence the disyllabic maximality effect.4 Crucially, this argument rests on the assumption that children’s words have metrical constituents consisting of binary feet that group syllables into a hierarchical structure.
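The logic of this ranking argument can be made concrete with a small evaluation sketch. The candidate set, the violation counts, and the function name `optimal` below are illustrative assumptions for a trisyllabic weak–strong–weak target; they are not taken from the tableau in (4). The constraint label "Parse-Syl" stands in for Parse-σ.

```python
def optimal(candidates, ranking):
    """Return the candidate whose violation profile is best under a ranking.

    candidates: dict mapping a candidate label to its violation counts.
    ranking: list of constraint names, highest-ranked first.
    Profiles are compared constraint by constraint in ranking order, so a
    single violation of a higher constraint outweighs any number of
    violations of lower ones.
    """
    def profile(label):
        return tuple(candidates[label].get(c, 0) for c in ranking)
    return min(candidates, key=profile)


# Hypothetical candidates for a weak-strong-weak target such as "banana".
candidates = {
    "ba(NA.na)":   {"Parse-Syl": 1},              # faithful, one stray syllable
    "(ba)(NA.na)": {"FtBin": 1, "Align-Ft": 1},   # faithful, two feet
    "(NA.na)":     {"Max": 1},                    # initial syllable deleted
}

# Assumed initial state: markedness constraints outrank faithfulness (Max).
child = ["Parse-Syl", "FtBin", "Align-Ft", "Max"]
# A later state in which Max has come to dominate the markedness constraints.
later = ["Max", "FtBin", "Align-Ft", "Parse-Syl"]

print(optimal(candidates, child))  # (NA.na)
print(optimal(candidates, later))  # ba(NA.na)
```

Under the first ranking the truncated disyllable wins because it is the only candidate that satisfies all three markedness constraints, which is the maximality effect described above; promoting Max is enough to make the fully faithful trisyllable optimal.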
Finally, findings that infants can generalize the underlying stress patterns of nonsense words beyond the surface forms of the familiarization stimuli may be interpreted as evidence that stress acquisition involves abstract principles of metrical structures. For instance, Gerken (2004) and Gerken and Bollt (2008) exposed 9-month-olds to 3- and 5-syllable nonsense words derived from two artificial languages, and then tested their response to tokens that have surface stress forms that are different from those of the familiarization items but still in accordance with the stress assignment grammar of each language. Both artificial languages followed two metrical principles: the Weight-to-Stress principle (i.e. heavy syllables are stressed) and avoidance of stress clash (i.e. no two consecutive syllables are stressed), the latter of which took precedence over the former when their demands conflicted. They also assigned stress according to iterative parsing of syllables into disyllabic feet, albeit in two opposite directions. In Language 1, stress fell on alternating syllables beginning with the initial one but only when it did not interfere with the Weight-to-Stress principle or avoidance of stress clash. In Language 2, the alternating pattern started with the final syllable, again subject to the Weight-to-Stress principle and avoidance of stress clash. The crucial test items were the words in (5), where the syllables in capital letters were stressed.
The stress pattern in (5a) is consistent with the grammar of Language 1 but not with that of Language 2 (which would have generated “do TON re mi FA”). Conversely, (5b) is consistent with Language 2 but not with Language 1 (which would have generated “DO re mi TON fa”). The listening times for these test items differed depending on which language the infants were familiarized with. The effects were observed not only when the heavy stressed syllables had the same segmental composition in the training and the test items (e.g. TON; Gerken 2004), but also when they were different, as long as the training set contained more than two examples of heavy stressed syllables (e.g. BOM, KEER, SHUL; Gerken and Bollt 2008). These results suggest that 9-month-olds are able to generalize beyond the stress pattern in the individual nonsense words to new words and new stress patterns that reflect a metrical system.
However, it has also been argued that many of the observations cited in this section as evidence for metrical phonological development can also be simulated in neural networks (Gupta and Touretzky 1994; Shultz and Gerken 2005) or exemplar-based models (Daelemans et al. 1994; Eddington 2000) without any recourse to formal metrical mechanisms. Learning simulations carried out with these computational models resemble real developmental data in important ways. First, the distribution of errors produced by the simulator during the early stages of learning matches that in children’s word production. Eddington’s (2000) simulation of Spanish stress acquisition yielded the largest number of errors for antepenultimate-stressed target words, far fewer for final-stressed targets, and the fewest for penultimate targets, correctly predicting the order of error frequencies in Hochberg’s (1988a, 1988b) experiments with 3- and 4-year-old Spanish-speaking children. Second, these simulations often exhibit apparent patterns suggestive of metrical organization. In Daelemans et al. (1994), a simulator trained on Dutch words learned to favor stress assignment to heavy syllables, just as predicted by the Weight-to-Stress principle. In fact, the effect was more pronounced in a model using simple phonemic representations of syllables than in a model in which syllables were annotated for weight. Third, the simulations yield results that correspond with the markedness predictions of metrical theory. For example, Dresher and Kaye (1990) propose that the default (thus, unmarked) setting of foot parsing is iterative. The learner can reset the value of this parameter to the marked setting of non-iterative feet when absence of secondary stress is observed. Consistent with this prediction, the simulators in Gupta and Touretzky (1994) took longer to learn languages with non-iterative feet, even though no explicit bias against non-iterative systems was engineered into the models.
5.2.4 Metrical Theory and the Learnability of Stress
A second question related to the developmental evidence for the involvement of linguistic principles is whether the acquisition of stress necessarily requires a priori knowledge of metrical phonology in order for the learner to always arrive at the correct target system. Metrical theory limits the types of stress systems that are allowed in human language, and such restrictions on the learner’s hypothesis space may be necessary for successful learning to occur. One prediction that follows from this model is that stress patterns that are not licensed by principles of metrical theory should be unlearnable. This prediction has been tested by Gerken and Bollt (2008). Recall that in one experiment in this study, 9-month-olds were shown to generalize weight sensitivity when they were familiarized with words that favored placement of stress on closed (and hence “heavy” in metrical terms) syllables such as BOM, KEER, and SHUL. In another experiment, they familiarized 9-month-olds with a system in which stress was attracted to open syllables with a /t/ onset (TU, TO, TI). In this case, the infants did not generalize this pattern to novel items. By contrast, 7-month-olds who participated in the same experiment did learn the pattern. Given the standard assumption in metrical phonology (e.g. Hayes 1989) that syllable weight is only sensitive to rhyme structure (i.e. syllable minus the onset), this is a rather surprising result. The interpretation offered by Gerken and Bollt (2008) is that “constraints on generalization” are not inherent in the learners, but develop over time as infants become familiar with the input regularities in the ambient input, which in English includes the tendency for closed syllables to attract stress. There is an alternative interpretation of these outcomes, however. Recent research shows that, although rare, some languages do exhibit onset-based distinctions that can be equated with syllable weight (Gordon 2005).
Interestingly, in these languages, low sonority onsets (such as /t/, as in the Gerken–Bollt language) tend to attract more stress than high sonority onsets (such as /n/). The onset-based language in Gerken and Bollt (2008), then, might after all be a possible pattern in natural language, and metrical phonology may have to be revised to incorporate this pattern. The results can, therefore, be reinterpreted as a demonstration of 7-month-olds’ readiness to learn this metrical option after limited exposure because they are still not as committed to the specific ambient pattern (i.e. the rhyme-only syllable weight system of English) as 9-month-olds are.
Another way to investigate what successful acquisition of stress depends on is through computer simulations. The number and type of constraints or biases a simulation needs in order to succeed in learning a possible stress pattern, or to fail to learn an unattested (hence presumed impossible) pattern, can tell us what types of structure must be hard-wired into the learning model. For example, Hayes and Wilson (2008) demonstrate that a learning model using only structurally-adjacent (or local) information cannot succeed in acquiring non-local aspects of stress, such as the assignment of main stress to the rightmost stressed syllable, unless it is augmented by metrical representations of the kind illustrated in (3). The critical insight is that metrical representations make long-distance relationships, such as the positions of stressed syllables, local at some level of analysis. Also using model simulations, Pearl (2011) argues that even when learners adopt parametric metrical phonology, they cannot successfully converge on the target stress system through probabilistic learning unless there is some built-in bias in the learning process. In Pearl’s analysis, such a bias must guide the learner toward the input data that unambiguously lead to the correct system, a kind of data that turns out to be a small minority of the input.
Here again, the caveat raised for the connection between developmental data and metrical theory in the previous section may apply. As pointed out by Gupta and Touretzky (1994), the existence of formal architectures such as metrical principles cannot be proven by demonstrating that they turn an otherwise unlearnable attested system into a learnable one because degrees of learnability can also be changed by models that do not impose formal constraints on the learning space. Furthermore, there is a possibility that the child’s initial target system is not the same as the adult stress grammar (Pearl 2011). If so, then any demonstration of learnability or lack thereof based on the description of the adult system may not hold true.
In recent years, the issue of learnability has also been readdressed within a programmatic approach that links language acquisition with typology. Instead of asking what needs to be hard-wired into learners for them to successfully converge on the target system based on finite samples of the language, this approach asks whether basic characteristics of language (such as the range of attested stress systems) can be explained as a product of children’s “analytic bias” (Moreton 2008). Biases in data interpretation will limit the types of generalizations that children make over a sample of stress patterns, and the outcome of that learning will result in stress systems with common properties that reflect those biases. In an attempt to explore this bidirectional relationship between learning and language universals, Heinz (2009) examined the typology of stress systems in Gordon (2002), and noted that, with a few possible exceptions, all systems are “neighborhood distinct.” Informally put, this means that if words are construed as a string of steady states (i.e. the beginning, the end, and any position between two syllables) and transitions (i.e. types of syllables, such as stressed and unstressed), no two steady states have exactly the same transition pattern before and after them (for a technical definition of neighborhood distinctness, refer to Heinz 2009). By postulating a learner who only learns neighborhood distinct systems, we can explain why children can arrive at a stress system that is at least typologically possible when there are many more logically possible systems that account for the examples that they may encounter. In turn, the restriction on the learning can explain why stress systems in human language share some basic characteristics.
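The informal description of neighborhood distinctness can be turned into a simple check over a finite-state description of a stress system. The sketch below follows the informal wording above rather than the technical definition in Heinz (2009): it treats the "neighborhood" of a state as whether it is the start state, whether a word may end there, and which syllable types lead into and out of it. The function names and the two toy systems are hypothetical.

```python
def neighborhoods(states, start, finals, transitions):
    """Map each state to its neighborhood.

    transitions: list of (source, syllable_type, destination) triples.
    """
    hoods = {}
    for q in states:
        incoming = frozenset(sym for (_, sym, dst) in transitions if dst == q)
        outgoing = frozenset(sym for (src, sym, _) in transitions if src == q)
        hoods[q] = (q == start, q in finals, incoming, outgoing)
    return hoods


def is_neighborhood_distinct(states, start, finals, transitions):
    """True if no two states share the same neighborhood."""
    hoods = neighborhoods(states, start, finals, transitions)
    return len(set(hoods.values())) == len(hoods)


# Toy system 1: stress on every odd-numbered syllable ("S" stressed,
# "u" unstressed), with words of any length from one syllable up.
alternating = [(0, "S", 1), (1, "u", 2), (2, "S", 1)]
print(is_neighborhood_distinct([0, 1, 2], 0, {1, 2}, alternating))  # True

# Toy system 2: words of exactly three unstressed syllables. States 1 and 2
# are both non-initial, non-final, entered by "u" and exited by "u".
counting = [(0, "u", 1), (1, "u", 2), (2, "u", 3)]
print(is_neighborhood_distinct([0, 1, 2, 3], 0, {3}, counting))  # False
```

The second system requires counting syllables without any local cue that distinguishes one position from the next, which is the kind of pattern a learner restricted to neighborhood-distinct systems would never posit.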
5.3 Tone and Intonation
This section begins by discussing tone and intonation together, as the distinction between the two is a matter of linguistic function. In general terms, tone refers to the linguistic use of pitch in marking lexical items, and intonation refers to non-lexical use of pitch to indicate, for example, utterance level pragmatic distinctions (statement versus question) and phrase boundaries. All languages are known to have intonation, and some languages (about 60–70 percent of the world’s languages, according to Yip 2002) also have tonal marking of lexical items. Languages that are tonal can differ in how densely they specify the pitch configuration of lexical items, ranging from nearly every syllable (or mora), as in some so-called “lexical tone languages” such as Mandarin and Yoruba, to only one location per word, as in so-called “pitch accent languages” such as Japanese, Serbo-Croatian, and Swedish.
5.3.1 Sensitivity to Pitch as a Linguistic Phenomenon
The main phonetic feature of tone and intonation is pitch, a psychophysical correlate of fundamental frequency (F0). From birth, infants exhibit sensitivity to fundamental frequencies in nonlinguistic stimuli, and can discriminate pure tones that differ only in F0 (Wormith et al. 1975). But the perception of pitch cannot be straightforwardly equated with that of F0. On one hand, adult listeners perceive sounds that share the same F0 as having the same tonal height regardless of the composition of the harmonic overtones. On the other hand, if tonal complexes contain harmonics derived from a single F0, they are perceived as having the same pitch even when the F0 itself is missing from the signal. Research using operant conditioning has shown that both of these basic properties of pitch perception are present by 7 months (Clarkson and Clifton 1985; Montgomery and Clarkson 1997). By contrast, 4-month-olds are incapable of extracting pitch from the combination of overtones, indicating that the perception of pitch is still under development during the first several months (Bundy et al. 1982).
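The second property, often called the "missing fundamental," can be illustrated with an idealized calculation. The sketch assumes harmonics at exact integer frequencies in Hz and identifies the implied F0 with their greatest common divisor; this is a simplification for illustration only and is not a model of how listeners actually compute pitch. The function name `implied_f0` is hypothetical.

```python
from functools import reduce
from math import gcd


def implied_f0(harmonics_hz):
    """Return the fundamental implied by a set of integer-Hz harmonics.

    Idealization: the implied F0 is the largest frequency of which every
    component is an integer multiple.
    """
    return reduce(gcd, harmonics_hz)


# A complex tone that includes its 200 Hz fundamental.
print(implied_f0([200, 400, 600, 800]))  # 200
# The same complex with the 200 Hz component removed.
print(implied_f0([400, 600, 800]))       # 200
```

Both complexes imply the same fundamental, which corresponds to the observation that they are heard as having the same pitch even though the second contains no energy at 200 Hz.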
Sensitivity to pitch differences in early infancy has also been demonstrated with linguistic stimuli. Nazzi et al. (1998b) used the high-amplitude sucking procedure to show that newborns in France can discriminate two lists of disyllabic Japanese words differing in F0 contour (ascending versus descending). Experiments using synthesized speech stimuli show that by 1–2 months, infants can also discriminate contour differences within a syllable (Morse 1972; Kuhl and Miller 1982).
During the same period of development, there also appears to be a more general underlying neural change that differentiates responses to pitch differences that are linguistically relevant from those that are not. Using near-infrared spectroscopy with Japanese-learning infants, Sato et al. (2010) measured hemodynamic brain responses to falling versus rising pitch contours in pure tones and in words. At 4 months, responses were bilateral for both the pure tone and word form stimuli, but at 10 months, responses were stronger in the left hemisphere only when the infants heard the contours embedded in words.
5.3.2 Development of Tone
Some perceptual reorganization with respect to linguistic pitch occurs between 6 and 9 months of age. In experiments using the head-turn preference paradigm or the stimulus alternating preference procedure, infants learning English, French, or Mandarin all respond to the rising versus low tone in Thai at 6 months (Mattock and Burnham 2006; Mattock et al. 2008). But at 9 months, only the Mandarin-learning infants demonstrate sensitivity to this tonal difference. Similarly, 6- to 8-month-old Yoruba-learning infants attend more to F0 differences among monosyllables than do their English-learning counterparts (Harrison 2000). These findings suggest that infants exposed to lexical tone languages maintain a higher degree of sensitivity to certain types of pitch patterns in comparison to infants learning a language without lexical tone. However, the effect does not appear to be simply caused by a general typological difference in prosodic systems, as English-learning infants continue to show fairly good discrimination of other tonal differences (e.g. Thai rising versus falling contours). It is more likely that the perceptual difference arises from the phonetic details of the pitch contours that have linguistic functions in the ambient language. For example, the Thai contrast between rising and low tone has some resemblance to the Mandarin tone contrast between rising and low dipping, but such a contour difference may not play a major role in English intonation. Conversely, a difference similar to the Thai rising versus falling contrast signals the difference between rising and falling intonation in English.
A number of studies have been carried out on the production of tones in languages with lexical tone including Mandarin (Li and Thompson 1977; Clumeck 1980; Hua and Dodd 2000; Wong et al. 2005), Cantonese (Tse 1978; So and Dodd 1995), Taiwanese (Tsay 2001), and Sesotho (Demuth 1993, 1995a). Many of these, particularly those studying Asian languages, use transcribed data or adult-listener judgment of spontaneous word production to establish an order of acquisition in the tonal inventory by comparing the (impressionistic) accuracy levels of different tones. Although there is some consistency of order within each language, no apparent universal order of acquisition can be identified among comparable tones (for example, the high level tone is acquired first in Mandarin, but last in Thai). One general observation that can be made, however, is that tones that are variable in their surface realizations tend to be acquired later than those that are not subject to alternations. Thus, as pointed out by Clumeck (1980), the confusion between the rising tone and low dipping tone in Mandarin-learning children recorded as late as 3½ years can be attributed to the variable realizations of an underlying low dipping tone, which can surface as a rising tone when it follows another dipping tone, or a low level tone if it is elsewhere in a non-final position. In a similar vein, the so-called subject marker tone in Sesotho that invariably differentiates first/second person (no tone) from third person (high tone) is produced fairly accurately by the age of 2 years, while the phonetically similar tonal contrast that underlies verbal roots, highly variable due to its interaction with various phonological processes, is not fully acquired until the age of 3 years (Demuth 1995a).
The development of prosodic phonology in languages with a lexical pitch accent system has been studied primarily using spontaneous speech data from Japanese (Hallé et al. 1991; Ota 2003a) and Swedish (Engstrand et al. 1991; Kadin and Engstrand 2005; Ota 2006; Peters and Strömqvist 1996). These languages feature complex pitch contours that are made up of lexical and intonational components. Studies by Kadin and Engstrand (2005) and Ota (2006) show that the speech production of Swedish children exhibits both components by 18 months, but the complex contours of Japanese children before the age of 18 months sometimes lack a critical intonational feature, that is, phrase-initial lowering (Ota 2003a). In languages with a lexical pitch accent system, therefore, there is a certain degree of independence between the development of lexical pitch features and intonational pitch features.
5.3.3 Development of Intonation
By 4 to 5 months, infants begin to show evidence of sensitivity to intonational units in speech. In head-turn preference experiments, 4.5-month-olds prefer to listen to passages with pauses inserted at clausal boundaries rather than those with the pauses inserted in other places (Jusczyk et al. 1995). A similar preference for phrasal boundaries appears around 9 months (e.g. Jusczyk et al. 1992). Further evidence that infants have precocious sensitivity to the global prosodic well-formedness of utterances comes from findings showing that infants as young as 2 months remember a list of words better when the words are said as an intonational phrase rather than as a prosodically disconnected sequence (Mandel et al. 1994; Mandel et al. 1996).
A crucial functional feature of intonation is that variations in pitch patterns signal non-lexical differences. In an extensive review of the literature on early spontaneous production, Snow and Balog (2002) conclude that there is no clear evidence that children acquire the form–meaning/context mapping of intonational pitch patterns before the onset of word production. Much of what might be perceived as “intonation” before word production is essentially paralinguistic; in other words, it consists of modulations of pitch, amplitude, and/or speech rate that indicate the emotional states of the speaker, rather than linguistic contrasts. Suggestions have also been made that the pitch contours on early words may be lexically bound (Halliday 1975; Crystal 1979; Galligan 1987); that is, pitch patterns are learned as if they are part of the lexical property of the word. However, even these studies report productive non-lexical use of pitch from around 17 to 18 months. Children’s understanding of the non-lexical nature of intonational phonology has also been demonstrated in experimental work using the novel-word/novel-object pairing paradigm. For example, 2½-year-old English-learning children treat novel word forms as different words when they have different vowels, but not when they have different pitch contours (rise–fall versus low–fall) (Quam and Swingley 2010).
The range of structures and meanings signaled by intonation is quite wide, and not surprisingly, the development of the different functions of intonation is not uniform. Functions that approximate adult-like performance before school age include the use of intonation to differentiate illocutionary acts such as statement versus question (Patel and Grigos 2006), or to mark information structure such as newness (MacWhinney and Bates 1978; Wonnacott and Watson 2008), topic (Chen 2011), and contrastive focus (Hornby and Hass 1970; Wells et al. 2004; Müller et al. 2006; Chen 2007). Müller et al. (2006), for example, show that German 4-year-olds produce focus elements (“Peter bakes a cake” contrasted with “Eva wants to bake cookies”) with a higher F0 than non-focused elements. Not all aspects of intonation for information structure are acquired at the same pace, however. For instance, 4- to 5-year-old Dutch-speaking children are capable of producing adult-like contours for topic-marking, but not for focus-marking in the sentence-final position (Chen 2011).
In general, the use of intonation to demarcate the phrasal structure of sentences seems to be acquired later than the functions mentioned above. Comprehension studies show that English-speaking 4- to 5-year-olds fail to reliably use prosodic cues to disambiguate structures such as [Tap][the frog with the flower] versus [Tap the frog][with the flower] (Snedeker and Trueswell 2004). Although 5- and 7-year-olds understand the syntactic difference between [[pink and green] and white] versus [pink and [green and white]] indicated by intonational phrasing (Beach et al. 1996), studies of their ability to produce the same prosodic difference have yielded mixed results (Katz et al. 1996; Wells et al. 2004).
5.3.4 Evidence for Autosegmental Representation in Tone and Intonation Acquisition
There are two key properties of tone and intonation that may dictate the ways in which they are acquired. First, there is considerable evidence that the phonological elements behind tone and intonation are inherently independent of other phonological structures. Thus, lexical tones can be mobile (e.g. a particular tone can move from one segmental position to another), or stable (e.g. a particular tonal pattern can stay even when the associated segment or syllable is deleted), and participate in one-to-many or many-to-one association with other structures (e.g. a particular tone can be ‘spread’ over many segments or syllables) (Yip 2002). Second, because all tonal and intonational phonology has a single phonetic correlate (i.e. pitch), the mapping between the acoustic signal and the underlying phonology can be complex. A particular pitch configuration can not only be a marker of a lexical distinction, a phrase boundary, or an utterance type, but may also be a composite of all of them.
These properties of tone and intonation have been successfully modeled in autosegmental phonology, which postulates discrete tonal elements lined up along a separate tier from the rest of phonological representation, both for lexical tone (Leben 1975; Liberman 1975; Goldsmith 1976) and intonation (Pierrehumbert 1980; Beckman and Pierrehumbert 1986; Ladd 2008). The independence of the tonal tier from the segmental tier can be illustrated by the following examples from Mende. The three words shown in (6) differ in the number of syllables as well as the surface contour of pitch (see note 5). Nevertheless, autosegmental representations allow us to see that they have the same underlying tonal structure: H(igh)–L(ow).
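How a single H–L melody can surface differently on words of different lengths can be sketched in a few lines of code. The sketch assumes a linking convention that is common in autosegmental analyses but is not spelled out in the text above: tones are linked to syllables one to one from left to right, the last tone spreads over any leftover syllables, and any leftover tones dock onto the final syllable as a contour. The function name `associate` and the string encoding of tones are hypothetical.

```python
def associate(melody, n_syllables):
    """Link a tonal melody to syllables under the assumed convention.

    melody      -- string of tones on the tonal tier, e.g. "HL"
    n_syllables -- number of tone-bearing syllables in the word
    Returns one string per syllable; more than one letter means a contour.
    """
    if not melody or n_syllables < 1:
        raise ValueError("need at least one tone and one syllable")
    tones = list(melody)
    # Step 1: one-to-one, left-to-right linking
    linked = [[tone] for tone in tones[:n_syllables]]
    # Step 2: leftover syllables receive the last tone (spreading)
    while len(linked) < n_syllables:
        linked.append([tones[-1]])
    # Step 3: leftover tones dock onto the final syllable (contour)
    linked[-1].extend(tones[n_syllables:])
    return ["".join(syllable) for syllable in linked]

for n in (1, 2, 3):
    print(n, associate("HL", n))
```

With the melody held constant, a one-syllable word receives a falling contour, a two-syllable word a high followed by a low, and a three-syllable word a high followed by two lows. The surface patterns differ, yet the tonal tier is identical in all three cases.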
As an example of autosegmental analysis of the interaction between different types of tonal and intonational phonology, (7) shows the two ‘word accents’ in Stockholm Swedish, which exhibit variable contour realizations depending on whether or not they appear in a focus position (including their citation forms). This complex pattern can be explained as a combination of lexical pitch accents (either H*L or HL*, where the asterisk indicates the tone that is associated with the stressed syllable), an intonational phrasal accent (H), and an utterance-final L% tone lined up on the same tonal tier. The lexical pitch accents are always present, but the phrasal accent occurs only when the word is in a focus position (Bruce 1977, 1987).
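The compositional logic of this analysis can be written out as a minimal sketch. It simply concatenates the components named above on a single tier: a lexical accent, an optional phrasal H when the word is in focus, and an optional utterance-final L%. The assignment of H*L to Accent II follows from the citation-form analysis H*LHL% discussed later in this section, and HL* is assigned to Accent I by elimination. The function and parameter names are hypothetical, and the sketch says nothing about phonetic alignment or scaling.

```python
# Lexical pitch accents; the asterisk marks the tone linked to the stressed syllable.
LEXICAL_ACCENT = {
    "I": ["H", "L*"],    # HL* (assigned by elimination, see lead-in)
    "II": ["H*", "L"],   # H*L
}

def tonal_tier(accent, in_focus, utterance_final=True):
    """Compose the tonal string for a word under the stated assumptions."""
    tones = list(LEXICAL_ACCENT[accent])   # lexical accent: always present
    if in_focus:
        tones.append("H")                  # phrasal accent: only in focus
    if utterance_final:
        tones.append("L%")                 # utterance-final boundary tone
    return "".join(tones)

print(tonal_tier("II", in_focus=True))    # citation form of an Accent II word
print(tonal_tier("II", in_focus=False))
print(tonal_tier("I", in_focus=True))
```

The same lexical accent thus yields different tonal strings depending on focus, which is how the analysis accounts for variable contours without multiplying lexical categories.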
One can see how an understanding of pitch-related phonology in terms of atomic tonal units may assist the learning of tone and intonation. If quantitatively continuous F0 information is abstracted into strings of discrete units, it can constrain the possible phonological structures that can be postulated for the attested data (see note 6). Complex contours that may reflect patterns associated with a range of diverse functions including lexical contrasts, phrasal boundaries, or discourse semantics can be decomposed into units of mapping between tonal sequences and their functions. General patterns behind alternations in pitch patterns can be learned as autosegmental processes (e.g. spreading of tonal association, avoidance of identical adjacent tones). In this way, autosegmental phonology provides a plausible model of the acquisition of tone and intonation.
Whether the development of these phenomena actually involves autosegmental mechanisms is, of course, an empirical question. The literature offers several types of supporting evidence. The first type of evidence for autosegmental structure comes from various observations pointing to the separation of pitch patterns from segmentals. Infants’ sensitivity to non-native vowel (quality) contrasts typically declines before 6 months (Kuhl et al. 1992; Polka and Werker 1994). If tone is acquired as an inherent feature of vowels, sensitivity to non-native tonal contrasts should also begin to attenuate around the same time. However, results in Mattock and Burnham (2006) and Mattock et al. (2008) indicate that the analogous perceptual reorganization of tone takes place later, some time between 6 and 9 months. Such findings suggest that the perceptual development of tones is independent of that of vowel quality.
Some non-adult-like pitch patterns found in early production also indicate the separation of tonal and intonational phonology from segmental structures (Demuth 1993, 1995a; Ota 2003a). One such example in Demuth’s analysis of early Sesotho involves the application of the Obligatory Contour Principle (OCP; the ban on adjacent identical tones on the tonal tier). In Sesotho, when two underlying high tones become adjacent, one of them becomes a low tone. In autosegmental terms, delinking of a high tone occurs to satisfy the OCP, and the toneless tone-bearing unit receives a default low tone. The example in (8a) is an utterance produced by a 2½-year-old Sesotho speaker, who omitted the subject marker from the target structure (which is given in (8b)) (Demuth 1993: 297; see note 7). The omission would have made the high tone in /ná/ adjacent to the high tone in the first syllable of the verb stem (/bídíkìsà/). Instead, the stem-initial syllable was produced with a low tone (/bìdíkísà/), resulting in a structure that respects the OCP. Crucially, the tonal specification of a syllable in the adult model has changed in the child production, indicating the independence of tonal and segmental representations in the child’s phonological system.
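The two steps described above, delinking to satisfy the OCP followed by default low-tone insertion, can be sketched as a toy procedure. The sketch makes strong simplifying assumptions that go beyond the text: each tone-bearing unit carries at most one tone, every H counts as a separate autosegment, and when two high tones are adjacent it is the second that is delinked. It is not an attempt to reproduce the analysis of the utterance in (8), and the function name `apply_ocp` is hypothetical.

```python
def apply_ocp(underlying):
    """Toy OCP repair over a list of tone-bearing units.

    underlying -- list with "H" for a linked high tone and None for a
                  toneless unit (assumed encoding, one slot per unit)
    Returns the surface tones, with "L" supplied by default.
    """
    linked = []
    for tone in underlying:
        if tone == "H" and linked and linked[-1] == "H":
            # Adjacent identical tones: delink the second H (assumption)
            linked.append(None)
        else:
            linked.append(tone)
    # Toneless units, including delinked ones, receive a default low tone
    return [tone if tone is not None else "L" for tone in linked]

print(apply_ocp(["H", "H", None, None]))
print(apply_ocp(["H", None, "H", None]))
```

In the first call the two high tones are adjacent, so the second surfaces as low; in the second call they are separated by a toneless unit and both survive. The point of the illustration is that the repair refers only to the tonal tier and leaves the segmental string untouched.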
Occasionally, children come up with an idiosyncratic system of phonology that they appear to have spontaneously created. Such original language games or word plays offer unique evidence for autosegmental representations of tonal structures. Yue-Hashimoto (1980), for instance, discusses the case of a Mandarin-speaking child who productively engaged in a word game from the age of 2 years. The word game involved the application of a fixed tonal pattern to real words, as shown in (9) (see note 8). All disyllabic words received a HL pattern, regardless of the original tones (9a). Monosyllabic words were reduplicated and also forced into the HL template (9b), unless the nucleus contained two vowels, in which case the vowels were split into separate syllables with an LH pattern (9c). Although data like this may not generalize to ‘normal’ language development, they serve as a striking demonstration of children’s ability to manipulate tonal elements separately from segments.
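The template logic of this word game can be made explicit with a short sketch. It takes toneless romanized syllables as input, since the original tones are irrelevant to the outcome, and returns syllable and tone pairs. Several details are assumptions made for illustration rather than facts from the source: the vowel inventory used to detect a two-vowel nucleus, the way the onset and coda are distributed when a nucleus is split, and the input forms themselves, which are not taken from the examples in (9).

```python
VOWELS = set("aeiou")  # assumed vowel letters for toneless romanized input

def split_syllable(syllable):
    """Return (onset, vowels, coda) for one romanized syllable."""
    start = next(i for i, ch in enumerate(syllable) if ch in VOWELS)
    end = start
    while end < len(syllable) and syllable[end] in VOWELS:
        end += 1
    return syllable[:start], syllable[start:end], syllable[end:]

def word_game(syllables):
    """Apply the fixed tonal template to a word given as a list of syllables."""
    if len(syllables) == 2:
        # Disyllabic word: HL, regardless of the original tones
        return list(zip(syllables, "HL"))
    if len(syllables) == 1:
        onset, vowels, coda = split_syllable(syllables[0])
        if len(vowels) == 2:
            # Two-vowel nucleus: split into two syllables with LH
            return [(onset + vowels[0], "L"), (vowels[1] + coda, "H")]
        # Otherwise: reduplicate and apply HL
        return [(syllables[0], "H"), (syllables[0], "L")]
    raise ValueError("the source describes only one- and two-syllable words")

print(word_game(["ping", "guo"]))
print(word_game(["ma"]))
print(word_game(["mao"]))
```

Because the function never consults the lexical tones of its input, it captures the central observation: the child's template operates on the tonal tier alone, while the segmental material is either kept, copied, or split to provide two tone-bearing units.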
Another type of argument for autosegmental representations in early phonology relates to the idea that pitch movements in (adult) tonal and intonational phonology are best represented as sequences of discrete tonal elements rather than the shape and slopes of the contours. A corollary of this model is that pitch contours tend to have phonetically stable turning points that anchor the pitch movement. For example, the so-called ‘Accent II’ in citation forms of Swedish words (given in (7)) has been analyzed as having four underlying tones: H*LHL%. Bruce (1977, 1987) shows that the phonetic constants in such pitch configurations are the F0 of points that correspond to the L and H of the hypothesized underlying tonal structure (i.e. the black dots superimposed on the contour in (10)).
Using second-degree polynomials defined by high and low turning points, Ota (2006) examined the spontaneous speech of Swedish-speaking 16- to 18-month-olds, and found that more high–low–high turning point sequences can be identified in Accent II words than in other productions. Furthermore, the higher the F0 peak of the stressed syllable was, the larger the drop after the peak, demonstrating the relative stability of the low F0 point, a presumed phonetic realization of the L tone between the two H tones.
5.4 Summary and Future Directions
The purpose of this chapter was to review some key descriptive findings in the development of stress, tone, and intonation, and to discuss the extent to which the acquisition of these phenomena can be understood in light of the structural representations and formal organizational principles proposed within metrical theory and autosegmental theory.
There now exists a substantial body of descriptive work in this area that provides us with some understanding of how stress, tone, and intonation develop over time, from the pre-linguistic sensitivity shown by very young infants to late-acquired aspects of these prosodic properties. Further progress in this area is contingent on having better developmental data, both in terms of coverage and quality. There is a noticeable lack of work outside the usual stock of familiar languages (e.g. English, Dutch, German, French, Chinese, Japanese). Particularly problematic is the paucity of information on the acquisition of languages with an iambic system of stress or an African type tone system. Without access to developmental data from such systems, it is difficult to test the full range of typological predictions that follow from metrical approaches to prosodic acquisition.
The quality of developmental data also needs to improve. Much of the production data we currently have are based on transcriptions, which are not only difficult to verify, but also prone to adult-listener bias. This is a particularly serious concern for stress, which has a complex relationship with its phonetic correlates. As we do not fully understand how stress is acoustically signaled in early production, adult transcribers who listen for stress using the phonetic correlates of the adult language may miscode the data. The issue of phonetic realization also applies to laboratory-based developmental research. For instance, most experimental studies do not fully control for the acoustic dimensions that vary between the stressed and unstressed syllables in the stimuli. As a result, we cannot be sure what cues are used by the participant infants or children to determine the presence or position of stress (see note 9).
Turning now to more theoretical issues, it still remains an open question whether the acquisition of prosody is best understood within metrical and autosegmental phonology, despite the many attempts to make the case. With respect to the development of stress, we have seen that the affinity between the predictions of metrical theory and developmental data does not in itself warrant a causal relationship between them, since at least some of the developmental observations are also consistent with models that do not assume the abstract structures featured in metrical phonology. On the other hand, it is clear that these frameworks provide us with extremely useful representational devices to describe and characterize the developing system and its difference from the adult system. It is an added advantage of the metrical approach that the hypothesized developing stress systems can be subsumed within a general architecture of phonology that has been proposed to account for mature systems. The issues of learnability and its connection to metrical structure spawn empirical questions that can direct our exploration of the potentially inherent mechanisms (constraints or biases) involved in the development of stress. Furthermore, they afford an impetus to examine how language acquisition is related to the typological properties of phonological systems attested in human languages.
Similarly, several types of arguments can be put forward to support the idea that tone and intonation are acquired as sequences of tonal elements linearly organized independently from other phonological structures, as proposed in autosegmental theory. Unlike research on stress development, however, these claims have not been systematically pitted against developmental models that do not rely on pre-wired structural units. This is probably due to the fact that our understanding of tone and intonation per se (and hence their development) has generally lagged behind that of stress. But this state of affairs is rapidly changing with recent developments in prosodic modeling. Most of the work in this area explores ways to best model pitch contours, using, in some cases, the same discrete and static tonal categories in autosegmental phonology (e.g. ToBI; Silverman et al. 1992), but in other cases, more articulatorily or acoustically motivated targets or parameters (e.g. PENTA, Xu and Wang 2001; INTSINT, Hirst and Di Cristo 1998; Tilt, Taylor 2000). Computationally informed work is also emerging in the area of tonal and intonational acquisition, addressing non-trivial questions such as how tonal categories can be learned from continuous speech signals, and what acoustic information, types of learning functions, and potential structure in the hypothesis space may be necessary for tonal and intonational learning to succeed (Gauthier et al. 2007, 2009; Yu 2011). Both types of research are likely to shed new light on the representational requirements and learning mechanisms involved in the development of tone and intonation.
(1) A superscript vertical line is placed before the location of primary stress and a subscript vertical line before the location of secondary stress, if any.
(2) Tesar (2004b) proposes an augmentation (Inconsistency Detection) to this procedure, which allows the learner to resolve ambiguities in the data (i.e. when the attested pattern is consistent with more than one interpretation) by comparing a range of attested forms.
(3) One problem with this approach to modeling prosodic acquisition is that the putative “stages” of development in reality are not discrete as assumed in the models, but rather overlap in time. This issue has been addressed in a more recent simulation of the same Dutch data using a probabilistic, gradual learning model of constraint reranking (i.e. Boersma’s (1998) Gradual Learning Algorithm) (Curtin and Zuraw 2002).
(5) A circumflex indicates a falling tone, an acute accent a level high tone, and a grave accent a level low tone.
(6) How learners can translate the continuous and time-varying acoustic signal of pitch into discrete phonological units is not a trivial question. But recent computational work promises a solution (Yu 2011).
(7) In Demuth’s transcription, only high tones are marked (by an acute accent), and all other vowels are assumed to have low tone. For the sake of clarity, low-tone vowels have been marked here with a grave accent. 1sPN = first-person singular pronoun; 1sSM = first-person singular subject marker; PRS = present tense.
(8) Tones in these examples are marked with the standard number notation used for Asian languages: 35 = ‘high-rise,’ 55 = ‘high level,’ 11 = ‘low level.’