Probabilistic Linguistics
Abstract and Keywords
Linguistic phenomena at all levels display properties of continua and show markedly gradient behavior. While categorical approaches have focused on the endpoints of distributions of linguistic phenomena, probabilistic linguistics focuses on the gradient middle ground. This chapter discusses how grammatical formalisms can be extended with a probabilistic interpretation, in particular within the data-oriented parsing framework. The chapter considers some well-known probabilistic models and shows how these can deal with gradient phenomena in perception, production, acquisition and variation. Probabilistic linguistics proposes that frequencies are an inherent part of the human language system and that new expressions are constructed by generalizing over previously analyzed expressions. Relations with other models are discussed, as are the consequences of the probabilistic view for Universal Grammar.
Keywords: gradience, frequency, probability model, data-oriented parsing, constructions, grammaticality judgments, alternations, acquisition, productive units, universal grammar
28.1 Introduction
28.1.1 Categorical Versus Probabilistic Linguistics
Modern linguistic theory has evolved along the lines of the principle of categoricity: knowledge of language is characterized by a categorical system of grammar. Numbers play no role. This idealization has been fruitful for some time, but it underestimates human language capacities (Bresnan and Hay 2007). There is a growing realization that linguistic phenomena at all levels of representation, from phonological and morphological alternations to syntactic well-formedness judgments, display properties of continua and show markedly gradient behavior (see, for example, Wasow 2002; Gries 2005; Aarts 2007). While the study of gradience has a long history in the generative tradition (Chomsky 1955, 1961), recent approaches such as the Minimalist Program (Chomsky 1995) do not explicitly allow for gradience as part of the grammar (see Crocker and Keller 2006). Probabilistic linguistics, instead, aims to focus on this relatively unexplored gradient middle ground.
But why should we model gradience by probabilities rather than by ranked rules, fuzzy set theory, connectionism, or yet another approach? One of the strongest arguments in favor of using probabilities comes from the wealth of frequency effects that pervade gradience in language (Bybee and Hopper 2001; Ellis 2002; Jurafsky 2003). Frequent words and constructions are learned faster than infrequent ones (Goodman et al. 2008). Frequent combinations of phonemes, morphemes and structures are perceived as more grammatical, or well-formed, than infrequent combinations (Coleman and Pierrehumbert 1997; Manning 2003). We can best model these effects by making explicit the link between frequency and probability: probability theory provides tools for working not only with the frequency of events but also with the frequency of combinations of events. In computing the probability of a complex event, such as a syntactic structure, we may not observe the structure in a store of previous language data. Probability theory allows for computing the probability of a complex event by combining the probabilities of its subparts.
Probabilistic linguistics is not just about modeling gradient linguistic phenomena; it also makes a cognitive claim. Following Bod, Hay, and Jannedy (2003), Gahl and Garnsey (2004), Jaeger and Snider (2008), and others, it holds that probabilities are an inherent part of the human language system. Probabilistic linguistics proposes that the language processing system is set up in such a way that, whenever an instance of a linguistic structure is processed, it is seen as a piece of evidence that affects the structure’s probability distribution (Jaeger and Snider 2008).
While many linguists agree that there is a need to integrate probabilities into linguistics, the question is: Where? The answer in this chapter, as well as in other reviews, is: Everywhere. Probabilities are relevant at all levels of representation, from phonetics and syntax to semantics and discourse. Probabilities are operative in acquisition, perception, production, language change, language variation, language universals, and more. All evidence points to a probabilistic language faculty.
28.1.2 What Does it Mean to Enrich Linguistics with Statistics?
To dispel dogmatic slumbers it may be good to realize that the main business of probabilistic linguistics is not to collect frequencies of words, collocations, or transitional probabilities. There is still a misconception that probabilities can be recorded only over surface events (see Manning 2003 for a discussion). Instead, there is no barrier to calculating probabilities over hidden structure, such as phrase-structure trees, feature-structures, or predicate–argument structures. Probabilistic linguistics enriches linguistic theory with statistics by defining probabilities over complex linguistic entities, from phonological to semantic representations. Probabilistic linguistics does therefore not abandon all the progress made by linguistics thus far; on the contrary, it integrates this knowledge with a probabilistic perspective.
One of the earliest successes of a probabilistic enrichment of a grammatical formalism is the Probabilistic Context-Free Grammar or PCFG (Grenander 1967; Suppes 1970). A PCFG consists of the simplest possible juxtaposition of a context-free grammar and probability theory: each context-free rule is enriched with a probability of application, such that the probability of a successive application of rules resulting in derivation of a sentence is computed by the product of the probabilities of the rules involved. For a long time, PCFGs had a less than marginal status in linguistics, which was partly due to the focus on categorical approaches in generative linguistics but also to the lack of annotated linguistic corpora needed for learning rule probabilities. Only during the last fifteen years or so have PCFGs led to concrete progress in modeling gradient linguistic phenomena, such as garden path effects (Jurafsky 1996), ambiguity resolution (Klein and Manning 2003), acceptability judgments (Crocker and Keller 2006), and reading times (Levy 2008). This progress crucially depended on the availability of linguistically annotated data (see Abeillé 2003 for an overview).
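As a minimal sketch of the PCFG computation just described, the following Python fragment multiplies the probabilities of the rules applied in a derivation. The toy grammar, the rule probabilities, and the names `RULES` and `derivation_probability` are invented for illustration and are not taken from any of the studies cited.

```python
from fractions import Fraction
from math import prod

# Hypothetical toy PCFG: (left-hand side, right-hand side) -> probability.
# The probabilities are invented; rules sharing a left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): Fraction(1),
    ("VP", ("V", "NP")): Fraction(7, 10),
    ("VP", ("V",)): Fraction(3, 10),
    ("NP", ("she",)): Fraction(2, 5),
    ("NP", ("the", "dress")): Fraction(3, 5),
    ("V", ("saw",)): Fraction(1),
}


def derivation_probability(rules_used):
    """Product of the probabilities of the rules applied in a derivation."""
    return prod(RULES[rule] for rule in rules_used)


# One derivation of "she saw the dress" under the toy grammar.
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("she",)),
    ("VP", ("V", "NP")),
    ("V", ("saw",)),
    ("NP", ("the", "dress")),
]
print(derivation_probability(derivation))  # 1 * 2/5 * 7/10 * 1 * 3/5 = 21/125
```

Exact fractions are used here only so that the product can be checked by hand; a practical implementation would more likely sum log probabilities.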
Despite this success, the shortcomings of PCFGs are also well acknowledged: their productive units capture only local dependencies while most syntactic phenomena involve non-local dependencies (see Joshi 2004). Furthermore, PCFGs correspond to the class of context-free languages while natural languages are known to be beyond context-free (Huybregts 1984). Although PCFGs have been useful in accurately parsing Penn Treebank sentences (e.g., Charniak 1997; Collins 1999) their cognitive relevance is much disputed (e.g., Fong and Berwick 2008).
Yet, the approach of “stochasticizing” a grammatical formalism by enriching its grammatical units with probabilities has been applied in many other formalisms, such as Tree-Adjoining Grammar, Combinatory Categorial Grammar, Lexical-Functional Grammar, and Head-Driven Phrase-Structure Grammar (e.g., Riezler and Johnson 2001; Hockenmaier and Steedman 2002; Chiang 2003). However, these probabilistic enrichments implicitly assume that the units of grammar coincide with the units of production and comprehension. Proponents of Construction Grammar, Cognitive Grammar, and usage-based linguistics have long emphasized that larger and more complex units play a role in language production and perception, such as conventional phrases, constructions, and idiomatic expressions (e.g., Langacker 1987b; Kay and Fillmore 1999; Barlow and Kemmer 2000; Bybee 2006a). What is needed is to assign probabilities to larger units of production, to which we will come back in the following section.
Instead of enriching the units of a grammatical formalism with probabilities, it is also possible to focus on a specific gradient phenomenon, next single out the possible factors that determine that phenomenon, and finally combine these factors into a probability model such as logistic regression. Logistic regression models are functions of a set of factors that predict a binary outcome (Baayen 2006). These models have been increasingly employed to deal with gradience in language production, such as in genitive alternation, dative alternation, and the presence or absence of a complementizer (see Roland et al. 2005; Bresnan et al. 2007; Jaeger and Snider 2008). A logistic regression model permits simultaneous evaluation of all the factors in a model and assesses the strength of each factor relative to others. For example, in modeling ditransitive alternation between New Zealand and American English (e.g., in choosing between “You can’t give cheques to people” vs. “You can’t give people cheques”), Bresnan and Hay (2007) come up with a number of linguistic factors that may influence this syntactic choice, ranging from syntactic complexity, animacy, discourse accessibility and pronominality to semantic class. They next feed these factors to a logistic regression model, which indicates that NZ English speakers are more sensitive to animacy. Bresnan et al. (2007) furthermore show that their statistical model can correctly predict 94% of the production choices of the dative sentences in the 3-million-word Switchboard collection.
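To make the shape of such a model concrete, here is a small sketch of how a logistic regression combines factors into the probability of a binary choice. The factor names, weights, and intercept are hypothetical and chosen only for illustration; they are not the estimates reported by Bresnan and Hay (2007) or Bresnan et al. (2007).

```python
import math


def logistic(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, invented for illustration; a real model would
# estimate them from annotated corpus data.
WEIGHTS = {
    "recipient_animate": 1.2,
    "recipient_pronoun": 0.8,
    "theme_longer": 0.5,
}
INTERCEPT = -1.0


def p_double_object(features):
    """Probability of the double-object variant given binary factor values."""
    score = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return logistic(score)


p = p_double_object(
    {"recipient_animate": 1, "recipient_pronoun": 0, "theme_longer": 1}
)
print(round(p, 3))
```

Each weight expresses the strength of one factor relative to the others, which is what allows all factors to be evaluated simultaneously.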
The method of logistic regression is flexible enough that it can be used for modeling a wide variety of other gradient phenomena, from grammatical choices in children’s productions to syntactic persistence. However, logistic models require a set of predefined factors to begin with, rather than learning these factors from previous language experiences. Moreover, as with PCFGs, logistic models may have difficulties with global dependencies and larger units in language production. There is thus an important question whether these low-level models can be subsumed by a more general learning model.
Despite the differences of the statistical models discussed here, there is also a common view that emerges from these models and that may be summarized as follows: Knowledge of language is sensitive to distributions of previous language experiences. Whenever an expression is processed, it is seen as a piece of evidence that affects the probability distribution of language experiences. New expressions are constructed by probabilistically generalizing over previous expressions.
28.2 How Far Can Probabilistic Linguistics be Stretched?
An approach that takes the direct consequence of the view above is Data-Oriented Parsing or DOP (Bod 1992, 1998; Scha et al. 1999; Kaplan 1996; and others). This approach analyzes and produces new sentences by combining fragments from previously analyzed sentences stored in a “corpus”. Fragments can be of arbitrary size, ranging from simple context-free rules to entire trees, thereby allowing for both productivity and idiomaticity. The frequencies of occurrence of the fragments are used to compute the distribution of most probable analyses for a sentence (in perception), or the distribution of most probable sentences given a meaning to be conveyed (in production).
By allowing for all fragments, DOP subsumes other models as special cases, such as the aforementioned PCFGs (e.g., by limiting the fragments of trees to the smallest ones), as well as probabilistic lexicalized grammars (Charniak 1997) and probabilistic history-based grammars (Black et al. 1993). Carroll and Weir (2000) show that there is a subsumption lattice where PCFGs are at the bottom and DOP at the top. Moreover, DOP models can be developed for other linguistic representations, such as for HPSG’s feature structures (e.g., Neumann and Flickinger 2002), LFG’s functional structures (e.g., Arnold and Linardaki 2007), or TAG’s elementary trees (e.g., Hoogweg 2003). DOP thus proposes a general method for “stochasticizing” a grammatical formalism.
28.2.1 An Illustration of a Generalized Probabilistic Model for Phrase-Structure Trees
What does such a general model, which takes all fragments from previous data and lets frequencies decide, look like? In this section, we will illustrate a DOP model for syntactic surface constituent trees, although we could just as well have illustrated it for phonological, morphological, or other kinds of representations. Consider a corpus of only two sentences with their syntactic analyses given in Figure 28.1 (we leave out some categories to keep the example simple).
On the basis of this corpus, the (new) sentence She saw the dress with the telescope can for example be derived by combining two fragments from the corpus (which we shall call fragment trees or subtrees), as shown in Figure 28.2. Note that there is no explicit distinction between words and structure in the subtrees. The combination operation between subtrees will for our illustration be limited to label substitution (but see below for extensions). This operation, indicated as ◦, identifies the leftmost nonterminal leaf node of the first subtree with the root node of the second subtree, i.e., the second subtree is substituted on the leftmost nonterminal leaf node of the first subtree provided that their categories match.
Thus in Figure 28.2, the sentence She saw the dress with the telescope is interpreted analogously to the corpus sentence She saw the dog with the telescope: both sentences receive the same phrase structure where the prepositional phrase with the telescope is attached to the VP saw the dress.
We can also derive an alternative phrase structure for the test sentence, namely by combining three (rather than two) subtrees from Figure 28.1, as shown in Figure 28.3. We will write (t ◦ u) ◦ v as t ◦ u ◦ v with the convention that ◦ is left-associative.
In Figure 28.3, the sentence She saw the dress with the telescope is analyzed in a different way where the PP with the telescope is attached to the NP the dress, corresponding to a different meaning from the tree in Figure 28.2. Thus the sentence is ambiguous in that it can be derived in (at least) two different ways, which is analogous either to the first tree or to the second tree in Figure 28.1.
Note that an unlimited number of sentences can be generated by combining subtrees from the corpus in Figure 28.1, such as She saw the dress on the rack with the telescope and She saw the dress with the dog on the rack with the telescope, etc. Thus we obtain unlimited productivity by finite means. Note also that most sentences generated by DOP are highly ambiguous: many different analyses can be assigned to each sentence due to a combinatorial explosion of different prepositional-phrase attachments. Yet, most of the analyses are not plausible. They do not correspond to the interpretations humans perceive. Probabilistic linguistics proposes that it is the role of the probability model to select the most probable structure(s) for a certain utterance.
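The substitution operation used in this section can be sketched in a few lines of Python. Trees are represented here as nested tuples, with a one-element tuple such as `("NP",)` standing for an open nonterminal leaf. This representation, the function names, and the two example fragments are assumptions made for illustration; the fragments only approximate those of Figure 28.2, which is not reproduced here.

```python
def substitute(t, u):
    """Return t ◦ u: u is substituted on the leftmost nonterminal leaf of t.

    Raises ValueError if t has no open leaf or if the categories differ.
    """
    result, done = _substitute(t, u)
    if not done:
        raise ValueError("first subtree has no nonterminal leaf")
    return result


def _substitute(node, u):
    if isinstance(node, str):              # a word: nothing to substitute
        return node, False
    label, children = node[0], node[1:]
    if not children:                       # leftmost open nonterminal leaf
        if label != u[0]:
            raise ValueError(f"category mismatch: {label} vs {u[0]}")
        return u, True
    new_children, done = [], False
    for child in children:
        if not done:                       # substitute at most once
            child, done = _substitute(child, u)
        new_children.append(child)
    return (label, *new_children), done


def words(node):
    """The word string at the leaves of a tree, from left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]


# Hypothetical fragments approximating the derivation in Figure 28.2.
t = ("S", ("NP", "she"),
     ("VP", ("VP", ("V", "saw"), ("NP",)),
      ("PP", ("P", "with"), ("NP", "the", "telescope"))))
u = ("NP", "the", "dress")
print(" ".join(words(substitute(t, u))))
```

Because each call fills exactly one open leaf, a derivation t ◦ u ◦ v corresponds to nested calls `substitute(substitute(t, u), v)`, which mirrors the left-associative convention.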
28.2.2 How to Enrich a Grammatical Formalism with Probabilities
How can we enrich the DOP model above with probabilities? By having defined a method for combining subtrees from a corpus of previous trees into new trees, we effectively established a way to view a corpus as a tree generation process. This process becomes a statistical process if we take the frequency distributions of the subtrees into account. For every tree and every sentence we can compute the probability that it is generated by this statistical process. Before we go into the details of this computation, let us illustrate the generation process by means of an even simpler corpus. Suppose that our example corpus consists of the two phrase-structure trees in Figure 28.4.
To compute the frequencies of the subtrees in this corpus, we need to define the (multi)set of subtrees that can be extracted from the corpus trees, which is given in Figure 28.5. Some subtrees occur twice in Figure 28.5: a subtree may be extracted from different trees and even several times from a single tree if the same node configuration appears at different positions. (Note that, except for the frontier nodes, each node in a subtree has the same daughter nodes as the corresponding node in the tree from which the subtree is extracted.)
As explained above, by using the substitution operation, new sentence-analyses can be constructed by means of this subtree collection. For instance, an analysis for the sentence Mary likes Susan can be generated by combining the three subtrees in Figure 28.6 from the set in Figure 28.5.
For the following it is important to distinguish between a derivation and an analysis of a sentence. By a derivation of a sentence we mean a sequence of subtrees the first of which is labeled with S and for which the iterative application of the substitution operation produces the particular sentence. By an analysis of a sentence we mean the resulting parse tree of a derivation of the sentence. Then the probability of the derivation in Figure 28.6 is the joint probability of three statistical events:
(1) selecting the subtree _{s}[NP _{vp}[_{v}[likes] NP]] among the subtrees with root label S
(2) selecting the subtree _{np}[Mary] among the subtrees with root label NP
(3) selecting the subtree _{np}[Susan] among the subtrees with root label NP.
The probability of each event can be computed from the frequencies of the occurrences of the subtrees in the corpus. For instance, the probability of event (1) is computed by dividing the number of occurrences of the subtree _{s}[NP _{vp}[_{v}[likes] NP]] by the total number of occurrences of subtrees with root label S: 1/20.
In general, let |t| be the number of times subtree t occurs in the bag and r(t) be the root node category of t; then the probability assigned to t is
\[ P(t) = \frac{|t|}{\sum_{t' : r(t') = r(t)} |t'|} \]
Since in our statistical generation process each subtree selection is independent of the previous selections, the probability of a derivation is the product of the probabilities of the subtrees it involves. Thus, the probability of the derivation in Figure 28.6 is: 1/20 × 1/4 × 1/4 = 1/320. In general, the probability of a derivation t_{1} ◦ ... ◦ t_{n} is given by
\[ P(t_{1} \circ \cdots \circ t_{n}) = \prod_{i=1}^{n} P(t_{i}) \]
It should be stressed that the probability of an analysis or parse tree is not equal to the probability of a derivation producing it. There can be many different derivations resulting in the same parse tree. This “spurious ambiguity” may seem redundant from a linguistic point of view (and should not be confused with the “structural” ambiguity of a sentence). But from a statistical point of view, all derivations resulting in a certain parse tree contribute to the probability of that tree, such that no subtree that could possibly be of statistical interest is ignored.
For instance, the parse tree for Mary likes Susan derived in Figure 28.6 may also be derived as in Figure 28.7 or Figure 28.8. Thus, a parse tree can be generated by a large number of different derivations that involve different subtrees from the corpus. Each of these derivations has its own probability of being generated. For example, Table 28.1 shows the probabilities of the three example derivations given above.
The probability of a parse tree is the probability that it is produced by any of its derivations, also called the disjoint probability. That is, the probability of a parse tree T is the sum of the probabilities of its distinct derivations D:
\[ P(T) = \sum_{D \text{ derives } T} P(D) \]
Table 28.1 Probabilities of the derivations in Figures 28.6, 28.7, and 28.8
P(Fig. 28.6) | = 1/20 × 1/4 × 1/4 | = 1/320
P(Fig. 28.7) | = 1/20 × 1/4 × 1/2 | = 1/160
P(Fig. 28.8) | = 2/20 × 1/4 × 1/8 × 1/4 | = 1/1280
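The relative-frequency computation behind these derivation probabilities can be sketched as follows. Since Figure 28.4 is not reproduced here, the corpus below is an assumption: two trees, for John likes Mary and Peter hates Susan, chosen because they yield exactly the subtree counts quoted in the text (20 subtrees rooted in S, 4 in NP, 8 in VP, 2 in V). The three derivations are likewise reconstructions that match the products given for Figures 28.6 to 28.8, not copies of the figures. All function and variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import prod

# Assumed corpus (the original figure is not available in the text).
CORPUS = [
    ("S", ("NP", "John"), ("VP", ("V", "likes"), ("NP", "Mary"))),
    ("S", ("NP", "Peter"), ("VP", ("V", "hates"), ("NP", "Susan"))),
]


def fragments(node):
    """All subtrees rooted at node: each nonterminal child is either cut
    off (left as an open leaf) or expanded by one of its own subtrees."""
    options = []
    for child in node[1:]:
        if isinstance(child, str):
            options.append([child])                    # words stay attached
        else:
            options.append([(child[0],)] + fragments(child))
    return [(node[0], *combo) for combo in product(*options)]


def nodes(node):
    """Every nonterminal node of a tree, the root included."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from nodes(child)


# The bag (multiset) of subtrees and the totals per root category.
BAG = Counter(f for tree in CORPUS for n in nodes(tree) for f in fragments(n))
ROOT_TOTALS = Counter()
for frag, count in BAG.items():
    ROOT_TOTALS[frag[0]] += count


def p_subtree(t):
    """Relative frequency of t among the subtrees with the same root."""
    return Fraction(BAG[t], ROOT_TOTALS[t[0]])


def p_derivation(subtrees):
    """Product of the probabilities of the subtrees in a derivation."""
    return prod(p_subtree(t) for t in subtrees)


# Reconstructed derivations of "Mary likes Susan".
FIG_28_6 = [("S", ("NP",), ("VP", ("V", "likes"), ("NP",))),
            ("NP", "Mary"), ("NP", "Susan")]
FIG_28_7 = [("S", ("NP",), ("VP", ("V",), ("NP", "Susan"))),
            ("NP", "Mary"), ("V", "likes")]
FIG_28_8 = [("S", ("NP",), ("VP",)), ("NP", "Mary"),
            ("VP", ("V", "likes"), ("NP",)), ("NP", "Susan")]

for name, d in [("28.6", FIG_28_6), ("28.7", FIG_28_7), ("28.8", FIG_28_8)]:
    print(name, p_derivation(d))  # 1/320, 1/160, 1/1280
```

Under the assumed corpus, the subtree _{s}[NP VP] occurs twice in the bag, which is where the factor 2/20 in the third derivation comes from.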
Analogous to the probability of a parse tree, the probability of an utterance is the probability that it is yielded by any of its parse trees. This means that the probability of a word string W is the sum of the probabilities of its distinct parse trees T:
\[ P(W) = \sum_{T \text{ yields } W} P(T) \]
For the task of language comprehension, we are often interested in finding the most probable parse tree given an utterance (or its most probable meaning if we use a corpus in which the trees are enriched with logical forms), and for the task of language production we are usually interested in the most probable utterance given a certain meaning or logical form. The probability of a parse tree T given that it yields a word string W is computed by dividing the probability of T by the sum of the probabilities of all parses that yield W (i.e., the probability of W):
\[ P(T \mid W) = \frac{P(T)}{\sum_{T' \text{ yields } W} P(T')} \]
Since the sentence Mary likes Susan is unambiguous with respect to the corpus, the conditional probability of its parse tree is simply 1, by a vacuous application of the formula above. Of course a larger corpus might contain subtrees by which many different representations can be derived for a single sentence, and in that case the above formula for the conditional probability would provide a probabilistic ordering for them. For instance, suppose an example corpus contains the following trees given in Figure 28.9.
Two different parse trees can then be derived for the sentence John hates buzzing bees, given in Figure 28.10.
The DOP model will assign a lower probability to the tree 28.10 (a) since the sub-analysis 28.11 (a) of 28.10 (a) is not a corpus subtree and hence must be assembled from several smaller pieces (leading to a lower probability than when the sub-analysis was a corpus-subtree, since the probabilities of the pieces must be multiplied—remember that probabilities are numbers between 0 and 1). The sub-analysis 28.11 (b) of 28.10 (b) can also be assembled from smaller pieces, but it also appears as a corpus fragment. This means that 28.10 (b) has several more derivations than 28.10 (a), resulting in a higher total probability (as the probability of a tree is the sum of the probabilities of its derivations).
In general, there tends to be a preference in DOP for the parse tree that can be generated by the largest number of derivations. Since a parse tree which can (also) be generated by relatively large fragments has more derivations than a parse tree which can only be generated by relatively small fragments, there is also a preference for the parse tree that can be constructed out of the largest possible corpus fragments, and thus for the parse tree which is most similar to previously seen utterance-analyses (and note that the parse tree with the largest corpus subtrees also corresponds to the shortest derivation consisting of the fewest subtrees). The same kind of reasoning can be made for the probability of an utterance, i.e., there is a preference for the utterance (given a certain meaning or intention) that can be constructed out of the largest possible corpus fragments, thus being most similar to previously seen utterances. This is particularly important to explain the use of constructions and prefabricated word combinations by natural language users (as we will discuss in section 28.3).
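The way derivations add up to the probability of a parse tree can be illustrated with a short recursive computation. The sketch assumes the same two-tree corpus as a stand-in for Figure 28.4 (trees for John likes Mary and Peter hates Susan, which reproduce the subtree counts quoted in section 28.2.2); the tuple representation and all names are illustrative. Rather than listing derivations one by one, `p_tree` sums, for every fragment that fits at the root, the fragment's relative frequency times the probabilities of the parts left open.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import prod

# Assumed corpus, consistent with the subtree counts quoted in the text.
CORPUS = [
    ("S", ("NP", "John"), ("VP", ("V", "likes"), ("NP", "Mary"))),
    ("S", ("NP", "Peter"), ("VP", ("V", "hates"), ("NP", "Susan"))),
]


def fragments(node):
    """Yield (fragment, cut_nodes) for every fragment rooted at node.

    cut_nodes are the nodes of the full tree left open in the fragment.
    """
    options = []
    for child in node[1:]:
        if isinstance(child, str):
            options.append([(child, ())])              # word: always kept
        else:
            cut = ((child[0],), (child,))              # leave the child open
            options.append([cut] + list(fragments(child)))
    for combo in product(*options):
        frag = (node[0], *(part for part, _ in combo))
        cuts = tuple(n for _, ns in combo for n in ns)
        yield frag, cuts


def nodes(node):
    """Every nonterminal node of a tree, the root included."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from nodes(child)


BAG = Counter(f for tree in CORPUS for n in nodes(tree) for f, _ in fragments(n))
ROOT_TOTALS = Counter()
for frag, count in BAG.items():
    ROOT_TOTALS[frag[0]] += count


def p_tree(node):
    """Sum of the probabilities of all derivations of the tree at node."""
    total = Fraction(0)
    for frag, cuts in fragments(node):
        p_frag = Fraction(BAG[frag], ROOT_TOTALS[node[0]])  # 0 if unseen
        total += p_frag * prod(p_tree(m) for m in cuts)
    return total


TARGET = ("S", ("NP", "Mary"), ("VP", ("V", "likes"), ("NP", "Susan")))
print(p_tree(TARGET))  # 1/64 under the assumed corpus
```

Fragments that never occur in the corpus contribute zero, so a tree that matches large corpus fragments collects more non-zero terms in this sum than a tree that can only be assembled from small ones.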
The notion of probability may be viewed as a measure of the average similarity between a sentence and the exemplars in the corpus: it correlates with the number of corpus trees that share fragments with the sentence, and also with the size of these shared fragments. DOP is thus congenial to analogical approaches to language that also interpret new input analogously to previous linguistic data, such as Skousen (1989) and Daelemans and van den Bosch (2005).
The probability model explained above is one of the simplest probability models for DOP, better known as “DOP_{1}” (Bod 1998). DOP_{1} is “sound” in that the total probability mass of the sentences generated by the model is equal to one (Chi and Geman 1998; Bod 2009). However, DOP_{1} has an inconsistent estimator (Johnson 2002): it can be shown that the most probable trees do not converge to the correct trees when the corpus grows to infinity. More advanced DOP models do have a consistent estimator such as Zollmann and Sima’an (2005) or Bod (2006b). Yet, these models still use DOP_{1} as a backbone; for example, the DOP model in Bod (2006b) starts with the subtree frequencies as in DOP_{1} that are next iteratively trained on a set of annotated sentences by the Expectation-Maximization algorithm (Dempster et al. 1977). It is important to stress that the definitions for computing the probability of a derivation, a parse tree, and a sentence are independent of the way the subtree probabilities are derived, and remain the same for different linguistic formalisms (see Bod, Scha and Sima’an 2003 for more details).
28.2.3 DOP Models for Richer Grammatical Formalisms
There is a common misconception that probabilities only deal with frequencies of events. On the contrary, probability models can incorporate many other factors, such as recency, meaning, and discourse context (Bod 1999), and, in Bayesian terms, probabilities can represent degrees of belief (e.g., Tenenbaum et al. 2006). Probability models have long been used in sociolinguistics (Labov 1966) and language change (see Bod, Hay, and Jannedy 2003), and they can also be defined over other grammatical frameworks, from Optimality Theory (Boersma and Hayes 2001) to Principles and Parameters theory (Yang 2004).
In this subsection, we will give a very short summary of a DOP model for a richer formalism just to show how such models can be developed in principle. In Bod and Kaplan (1998) we proposed a DOP model for the linguistically sophisticated representations used in LFG theory (Kaplan and Bresnan 1982). LFG representations consist of constituent structures and functional structures in correspondence. While constituent structures are labeled with simplex syntactic categories, functional structures also contain grammatical categories for subject, predicate and object, as well as agreement features and semantic forms, like predicate–argument structures. Figure 28.12 gives an example of a very simple corpus containing two LFG-representations for the sentences Kim eats and John fell, each of which consists of a constituent structure (a tree), a functional structure (an attribute-value matrix), and a mapping between the two (a correspondence function, represented by arrows).
As before, we take all fragments from these representations as possible productive units, and let statistics decide, but without using the derivational mechanism and rule-system provided by LFG theory. In this so-called “LFG-DOP” model, fragments are connected subtrees whose nodes are in correspondence with sub-units of f-structures (Bod and Kaplan 1998). For example, the following combination of fragments from the corpus in Figure 28.12 represents a derivation for the new sentence Kim fell (Figure 28.13).
The probability model for LFG-DOP can be developed along the same lines as in section 28.2.2, by assigning relative frequencies to the fragments and using the same definitions for the probability of a derivation and an analysis (see Hearne and Sima’an 2003 for more sophisticated fragment-estimation methods). Bod and Kaplan (1998) show how an interestingly different notion of “grammaticality with respect to a corpus” arises from LFG-DOP, resulting in a model which is both robust, in that it can parse and rank ungrammatical input, and which offers a formal account of meta-linguistic judgments such as grammaticality at the same time. Way (1999) and Arnold and Linardaki (2007) provide linguistic evaluations of LFG-DOP, while Finn et al. (2006) propose a computationally efficient approximation of LFG-DOP.
28.3 What Can Data-Oriented Parsing Explain?
28.3.1 Constructions and Prefabs
DOP distinguishes itself from other probabilistic enrichments by taking into account constructions and units of arbitrary size. This allows the DOP approach to capture prefabs wherever they occur in the corpus (Manning and Schütze 1999: 446). For example, suppose that we want to produce a sentence corresponding to a meaning of asking someone’s age (which in LFG-DOP is represented by the PRED value in the functional structure; see Figure 28.12). There may be several sentences with such a meaning, like How old are you?, What age do you have?, or even How many years do you have? Yet the first sentence is more acceptable than the other ones in that it corresponds to the conventional way of asking someone’s age in English. This difference in acceptability is reflected by the different probabilities of these sentences in a representative corpus of English. While the probability of, for example, What age do you have? is likely to be small, since it will most likely not appear as a prefabricated unit in the corpus and has to be constructed out of smaller parts, the probability of How old are you? is likely to be high since it can also be constructed by one large unit. As we showed at the end of section 28.2.2, DOP’s probability model prefers sentences that can be constructed out of the largest possible parts from the corpus. (And even in the case that both sentences should occur in a representative corpus of English, How old are you? would have the highest frequency.) Thus DOP prefers sentences and sentence-analyses that consist as much as possible of prefabs rather than “open choices”.
28.3.2 Grammaticality Judgments
In Bod (2001), DOP was tested against English native speakers who had to decide as quickly as possible whether three-word (subject–verb–object) sentences were grammatical. The test sentences were selected from the British National Corpus (BNC) and consisted of both frequent sentences such as I like it and low-frequency sentences such as I keep it, as well as sentences that were artificially constructed by substituting a word by another roughly equally frequent word, such as I sleep it and I die it, of which the grammaticality is dubious. Also a number of ungrammatical pseudo-sentences were added. It turned out that frequent sentences are recognized more easily and quickly than infrequent sentences, even after controlling for plausibility, word frequency, word complexity and syntactic structure. Next, an implementation of DOP was used to parse the test sentences. Each fragment f was assigned a response latency based on its frequency freq(f) in the BNC, computed as 1/(1 + log freq(f)) (see Baayen et al. 1997). The latency of the total sentence was estimated as the sum of the latencies of the fragments. The resulting model matched very well with the experimentally obtained reaction times (up to a constant) but only if all fragments were taken into account. The match significantly deteriorated if two-word and three-word chunks were deleted.
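The latency estimate used in this experiment is simple enough to write down directly. The sketch below assumes the natural logarithm, since the base is not specified in the text, and uses invented fragment frequencies rather than BNC counts; the function names are illustrative.

```python
import math


def fragment_latency(freq):
    """Latency 1/(1 + log freq) for a fragment seen freq >= 1 times.

    The natural logarithm is an assumption; the text leaves the base open.
    """
    return 1.0 / (1.0 + math.log(freq))


def sentence_latency(fragment_freqs):
    """Sum of the latencies of the fragments used for a sentence."""
    return sum(fragment_latency(f) for f in fragment_freqs)


# Hypothetical fragment frequencies for a frequent and a rare sentence.
frequent = [1000, 500, 200]
rare = [10, 5, 2]
print(sentence_latency(frequent) < sentence_latency(rare))  # True
```

Because the latency falls as frequency rises, a sentence covered by frequent fragments receives a lower estimate than one assembled from rare fragments, in line with the reported reaction times.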
28.3.3 Disambiguation, Interpretation, and Translation
The experiments with grammaticality judgments described in section 28.3.2 trigger the hypothesis that “the accuracy of the model increases with increasing fragment size”, at least for grammaticality judgments. The hypothesis has now been corroborated also for modeling syntactic ambiguity (Bod 2001; Collins and Duffy 2001), translations from one language into another (Hearne and Way 2003), and the accuracy of semantic interpretation (Bod and Kaplan 2003). The hypothesis that the inclusion of larger productive units leads to better models has also been supported for languages other than English, that is, Dutch and French (Bod 1998; Cormons 1999), Hebrew (Sima’an et al. 2001), and Mandarin (Hearne and Way 2004). Furthermore, the hypothesis seems to be independent of the linguistic formalism: it was shown to be valid for LFG, HPSG, and TAG (see Bod, Scha and Sima’an 2003).
28.3.4 Syntactic Priming and Alternations
A possible challenge to DOP, and probabilistic linguistics in general, may seem to be the phenomenon of syntactic priming where it is the low-frequency rather than the high-frequency constructions that, when observed, have the highest chance of being primed. However, it should be kept in mind that the greatest change in a probability distribution is caused not by observing a high-frequency structure but by observing a low-frequency structure. Jaeger and Snider (2008) show that low-frequency constructions prime more as they result in a bigger change in the probability distribution, which in turn leads to an increased probability of reusing the same structure. Moreover, Snider (2008) develops a DOP model that integrates structural and lexical priming in language production. His model, coined DOP-LAST, is an extension of DOP_{1} with Exemplar Theory that can deal both with dative alternations and complex voice (active/passive) alternations.
28.3.5 Predicting the Productive Units
Although DOP starts from the assumption that any fragment can constitute a productive unit (and that large fragments are important), it can also make explicit predictions about the productive units that are actually used by humans in producing new sentences. Zuidema (2006) develops a DOP model that starts out with all subtrees, but that aims at finding the smallest set of productive units that explain the occurrences and co-occurrences in a corpus. Large subtrees only receive non-zero weights if they occur more frequently than can be expected on the basis of the weights of smaller subtrees. In this way, Zuidema is able to make predictions about multi-word units and constructions used in adult language such as “I’d like to X”, “from X to Y”, “What’s X doing?”, etc. Borensztajn et al. (2008) test Zuidema’s DOP model on child-produced utterances from the CHILDES database (MacWhinney 2000), where they split each corpus (for Eve, Sarah, and Adam) into three consecutive periods. It is found that the most likely productive units predicted by DOP closely correspond to the constructions found in empirical child-language studies by Tomasello (2003) and Lieven et al. (2003). In particular, Borensztajn et al. (2008) show that the DOP-derived productive units get more abstract with age (i.e., the number of open slots in the units increases across different periods). This corresponds to the empirical observation that children move from very concrete, item-based constructions (“holophrases”) to more abstract constructions with open positions. We will come back to DOP and language acquisition in the next section.
It often happens that the productive units predicted by DOP look counter-intuitive, such as a subtree that is lexicalized only with the subject-noun and the determiner of the object with all other lexical elements as open slots. Yet it turns out that there are constructions where the subject has scope on the object’s determiner, for instance in She sneezed her way to the allergist where the subject and the possessive determiner must be coreferential (*She sneezed his way to the allergist) (Goldberg, p.c.).
28.4 How Can Probabilistic Linguistics Deal with Language Acquisition?
Probabilistic linguistics, as discussed so far, does not say anything about how the first structures are learned. It deals with statistical enrichments of linguistic formalisms on the basis of a corpus of already given structures. There is thus an important question of how we can extend our probabilistic approach to the problem of language acquisition.
Previous probabilistic learning models have been mostly based on a principle attributed to Harris (1954): word sequences surrounded by equal or similar contexts are likely to form the same constituent (e.g., van Zaanen 2000; Clark 2001; Klein and Manning 2005). While this idea has been fruitful, it has mostly been limited to contiguous contexts. For example, the Constituent-Context Model (CCM) by Klein and Manning (2005) is said to take into account “all contiguous subsequences of a sentence” in learning constituents (Klein and Manning 2005: 1410). But this means that CCM neglects dependencies that are non-contiguous, such as between closest and to in “Show me the closest station to Union Square”. Such non-contiguous dependencies are ubiquitous in natural language, ranging from particle verbs and agreement to auxiliary inversion.
There is a growing realization that non-linear, structural contexts must be included into a model of language learning (e.g., Culicover and Novak 2003; Dennis 2005; Solan et al. 2005; Seginer 2007). Below we will discuss how such contexts can be integrated in a general DOP framework for language learning.
28.4.1 A DOP Model for Language Acquisition: U-DOP
We can extend DOP to language learning in a rather straightforward way, which is known as “Unsupervised DOP” or “U-DOP” (Bod 2006b, 2007a). If a language learner does not know which phrase-structure tree should be assigned to a sentence, he or she initially allows for all possible trees and lets linguistic experience decide which is the most likely one. As a first approximation we will limit the set of all possible trees to unlabeled binary trees. However, we can easily relax the binary restriction, and we will briefly come back to learning category labels in the next section. Conceptually, we can distinguish three learning phases under U-DOP:
(i) Assign all possible (unlabeled binary) trees to a set of given sentences
(ii) Divide the binary trees into all subtrees
(iii) Compute the most probable tree for each sentence.
The only prior knowledge assumed by U-DOP is the notion of tree and the concept of most probable tree. U-DOP thus inherits the rather agnostic approach of DOP. We do not constrain the units of learning beforehand but take all possible fragments and let the most probable tree decide.^{1} We will discuss below how such an approach generalizes over other learning models, but we will first explain U-DOP in some detail by describing each of the learning phases above separately.
(i) Assign all unlabeled binary trees to a set of sentences
Suppose that a hypothetical language learner hears the two sentences watch the dog and the dog barks. How could the learner figure out the appropriate tree structures for these sentences? U-DOP conjectures that a learner does so by allowing (initially) any fragment of the heard sentences to form a productive unit and by trying to reconstruct these sentences out of the most probable combinations.
The set of all unlabeled binary trees for the sentences watch the dog and the dog barks is given in Figure 28.14, which for convenience we shall again refer to as the “corpus”. Each node in each tree in the corpus is assigned the same category label X, since we do not (yet) know what label each phrase will receive. To keep our example simple, we do not assign labels to the words, but this can be done as well.
Although the number of possible binary trees for a sentence grows exponentially with sentence length, these binary trees can be efficiently represented in quadratic space by means of a “chart” or “shared parse forest”, which is a standard technique in computational linguistics (see, for example, Kay 1980; Manning and Schütze 1999). However, for explaining the conceptual working of U-DOP, we will exhaustively enumerate all trees, keeping in mind that the trees are usually stored by a compact parse forest.
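The growth in the number of trees, and the much smaller number of spans a chart has to store, can be made concrete with a short sketch. The nested-tuple representation and the function names are our own choices for illustration; the enumeration is exhaustive and does not itself implement a chart.

```python
from math import comb


def binary_trees(words):
    """All unlabeled binary bracketings of a word list, as nested tuples."""
    if len(words) == 1:
        return [words[0]]
    trees = []
    for split in range(1, len(words)):
        for left in binary_trees(words[:split]):
            for right in binary_trees(words[split:]):
                trees.append((left, right))
    return trees


def catalan(n):
    """The n-th Catalan number; a sentence of n + 1 words has this many binary trees."""
    return comb(2 * n, n) // (n + 1)


def num_spans(n):
    """Number of contiguous spans in a sentence of n words (one chart cell each)."""
    return n * (n + 1) // 2


for tree in binary_trees(["watch", "the", "dog"]):
    print(tree)

for n in (3, 5, 10):
    print(n, "words:", catalan(n - 1), "trees,", num_spans(n), "spans")
```

For the two three-word sentences of the example this yields two trees each, which is the four-tree corpus used below.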
(ii) Divide the binary trees into all subtrees
Figure 28.15 lists the subtrees that can be extracted from the trees in Figure 28.14. The first subtree in each row represents the whole sentence as a chunk, while the second and the third are “proper” subtrees.
Note that while most subtrees occur once, the subtree [the dog]_{X} occurs twice. The number of subtrees in a binary tree grows exponentially with sentence length, but there exists an efficient parsing algorithm that parses a sentence by means of all subtrees from a set of given trees. This algorithm converts a set of subtrees into a compact reduction which is linear in the number of tree nodes (Goodman 2003).
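The subtree counts can be reproduced with a small sketch. The four trees are written out by hand as nested tuples, the open slot is written as the string "X" (so we assume that no word is literally "X"), and the exhaustive enumeration is for illustration only; it is not Goodman's compact reduction.

```python
from collections import Counter
from itertools import product

# The four unlabeled binary trees of the toy corpus, written as nested tuples.
CORPUS = [
    ("watch", ("the", "dog")),
    (("watch", "the"), "dog"),
    ("the", ("dog", "barks")),
    (("the", "dog"), "barks"),
]


def fragments_rooted_at(node):
    """Subtrees rooted at an internal node: every internal child is either
    left as an open slot "X" or expanded further; words are always kept."""
    options = []
    for child in node:
        if isinstance(child, tuple):
            options.append(["X"] + fragments_rooted_at(child))
        else:
            options.append([child])
    return [tuple(choice) for choice in product(*options)]


def all_subtrees(tree):
    """Subtrees rooted at any internal node of the tree."""
    result = fragments_rooted_at(tree)
    for child in tree:
        if isinstance(child, tuple):
            result.extend(all_subtrees(child))
    return result


counts = Counter(s for tree in CORPUS for s in all_subtrees(tree))
print(sum(counts.values()))     # 12 subtree tokens in total
print(counts[("the", "dog")])   # 2
```

Each tree contributes three subtrees, so the corpus yields twelve subtree tokens, of which only [the dog] is counted twice.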
(iii) Compute the most probable tree for each sentence
From the subtrees in Figure 28.15, U-DOP can compute the most probable tree for the corpus sentences as well as for new sentences. Consider the corpus sentence the dog barks. On the basis of the subtrees in Figure 28.15, two phrase-structure trees can be generated by U-DOP for this sentence, shown in Figure 28.16. Each tree structure can be produced by two different derivations, either by trivially selecting the largest possible subtree from Figure 28.15 that spans the whole sentence or by combining two smaller subtrees.
Thus the sentence the dog barks can be trivially parsed by any of its fully spanning trees, which is a direct consequence of U-DOP’s property that subtrees of any size may play a role in language learning. This situation does not usually occur when structures for new sentences are learned.
U-DOP computes the most probable tree in the same way as the supervised version of DOP explained above. Since the subtree [the dog] is the only subtree that occurs more than once, we can informally predict that the most probable tree corresponds to the structure [[the dog] barks] where the dog is a constituent. This can also be shown formally by applying the probability definitions given in section 28.2. Thus the probability of the tree structure [the [dog barks]] is equal to the sum of the probabilities of its derivations in Figure 28.16. The probability of the first derivation consisting of the fully spanning tree is simply equal to the probability of selecting this tree from the space of all subtrees in Figure 28.15, which is 1/12. The probability of the second derivation of [the [dog barks]] in Figure 28.16 is equal to the product of the probabilities of selecting the two subtrees, which is 1/12 × 1/12 = 1/144. The total probability of the tree is the probability that it is generated by any of its derivations, which is the sum of the probabilities of the derivations:
P([the [dog barks]]) = 1/12 + (1/12 × 1/12) = 13/144
Similarly, we can compute the probability of the alternative tree structure [[the dog] barks], which follows from its derivations in Figure 28.16. Note that the only difference is the probability of the subtree [the dog] being 2/12 (as it occurs twice). The total probability of this tree structure is:
P([[the dog] barks]) = 1/12 + (1/12 × 2/12) = 14/144
Thus the second tree wins, although by just a little bit. We leave the computation of the conditional probabilities of each tree given the sentence the dog barks to the reader (these are computed as the probability of each tree divided by the sum of probabilities of all trees for the dog barks). The relative difference in probability is small because the derivation consisting of the entire tree takes a considerable part of the probability mass (1/12).
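The computation, including the conditional probabilities left to the reader, can be checked with a short sketch. It is an exhaustive illustration for this toy corpus only: trees are nested tuples, the open slot is the string "X", and all names are our own. Because every node carries the same label X, the relative frequency of a subtree is its count divided by the total number of subtrees.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

CORPUS = [
    ("watch", ("the", "dog")),
    (("watch", "the"), "dog"),
    ("the", ("dog", "barks")),
    (("the", "dog"), "barks"),
]


def expansions(node):
    """(fragment, cut_children) pairs for all fragments rooted at `node`;
    cut_children are the subtrees that were replaced by an open slot "X"."""
    options = []
    for child in node:
        if isinstance(child, tuple):
            options.append([("X", [child])] + expansions(child))
        else:
            options.append([(child, [])])
    result = []
    for choice in product(*options):
        fragment = tuple(part for part, _ in choice)
        cuts = [c for _, cut in choice for c in cut]
        result.append((fragment, cuts))
    return result


def internal_nodes(tree):
    yield tree
    for child in tree:
        if isinstance(child, tuple):
            yield from internal_nodes(child)


counts = Counter(frag for tree in CORPUS
                 for node in internal_nodes(tree)
                 for frag, _ in expansions(node))
TOTAL = sum(counts.values())  # 12 subtrees, all with root label X


def tree_probability(tree):
    """Sum over all derivations of the product of subtree probabilities."""
    total = Fraction(0)
    for fragment, cuts in expansions(tree):
        p = Fraction(counts[fragment], TOTAL)
        for cut in cuts:
            p *= tree_probability(cut)
        total += p
    return total


p_right = tree_probability(("the", ("dog", "barks")))   # 13/144
p_left = tree_probability((("the", "dog"), "barks"))    # 14/144, printed as 7/72
print(p_right, p_left)
print(p_right / (p_right + p_left), p_left / (p_right + p_left))  # 13/27 14/27
```

The conditional probabilities given the sentence the dog barks thus come out as 13/27 and 14/27.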
For the sake of simplicity, we only used trees without lexical categories in our illustration of U-DOP. But we can straightforwardly assign abstract labels X to the words as well. If we do so for the sentences in Figure 28.14, then one of the possible subtrees for the sentence watch the dog is given in Figure 28.17. This subtree has a discontiguous yield watch X dog, which we will therefore refer to as a discontiguous subtree.
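How a discontiguous subtree arises once words carry a label can be illustrated as follows. The encoding is hypothetical: each word is wrapped in a one-element tuple that stands for its X-labeled preterminal, and an open slot is written as the string "X".

```python
from itertools import product

# watch the dog, with every word under its own X-labeled preterminal.
TREE = (("watch",), (("the",), ("dog",)))


def fragments_rooted_at(node):
    """Subtrees rooted at `node`: each child node is cut to "X" or expanded."""
    options = []
    for child in node:
        if isinstance(child, tuple):
            options.append(["X"] + fragments_rooted_at(child))
        else:
            options.append([child])
    return [tuple(choice) for choice in product(*options)]


def fragment_yield(fragment):
    """Left-to-right sequence of words and open slots."""
    if isinstance(fragment, str):
        return [fragment]
    return [leaf for child in fragment for leaf in fragment_yield(child)]


def is_discontiguous(leaves):
    """True if an open slot occurs between two words."""
    words = [i for i, leaf in enumerate(leaves) if leaf != "X"]
    if len(words) < 2:
        return False
    return "X" in leaves[words[0]:words[-1]]


yields = [fragment_yield(f) for f in fragments_rooted_at(TREE)]
print([" ".join(y) for y in yields if is_discontiguous(y)])  # ['watch X dog']
```

Of the ten subtrees rooted at the top node, only the one with yield watch X dog has an open slot between two words.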
Discontiguous subtrees are important for covering a range of linguistic constructions, such as those given in italics in sentences (1)–(5):
(1) BA carried more people than cargo in 2005.
(2) What’s this scratch doing on the table?
(3) Don’t take him by surprise.
(4) Fraser put dollie nighty on.
(5) Most software companies in Vietnam are small-sized.
These constructions have been discussed at various places in the literature (e.g., Bod 1998; Goldberg 2006), and all of them are discontiguous. They range from idiomatic, multi-word units (e.g., (1)–(3)) and particle verbs (e.g., (4)) to regular syntactic agreement phenomena as in (5). The notion of subtree can easily capture the syntactic structure of these discontiguous constructions. For example, the construction more … than … in (1) may be represented by the subtree in Figure 28.18.
28.5 What Can U-DOP Learn?
28.5.1 Learning Discontiguous Phenomena
Discontiguous, structural dependencies play a role in virtually any facet of syntax. In Bod (2009), we show how U-DOP can learn particle verbs from child-directed speech in the Eve corpus (MacWhinney 2000), such as blow … up, take … away, put … on, etc. For example, from the four child-directed sentences below, U-DOP can derive the dependency between put and in:
(1)
(2)
(3)
(4)
These sentences suffice for U-DOP to learn the construction put X in. At sentence 3, U-DOP induced that can put it in and can put the stick in are generalized by can put X in. At sentence 4, U-DOP additionally derived that put X in can occur separately from can, resulting in an additional constituent boundary. Thus by initially leaving open all possible structures, U-DOP incrementally rules out incorrect structures until the construction put X in is learned. Once the correct particle verb construction is derived, the production of incorrect constructions is blocked by the probability model’s preference for reusing the largest possible units given a meaning to be conveyed (assuming a semantic DOP model as in Bonnema et al. 1997).
28.5.2 Child Language Development from Concrete to Abstract
Note that in the examples above (but there are many more examples—see Bod 2009), U-DOP follows a route from concrete constructions to more abstract constructions with open slots. These constructions initially correspond to “holophrases” after which they get more abstract resulting in the discontiguous phrasal verb. This is consonant with studies of child language acquisition (Peters 1983; Tomasello 2003) which indicate that children move from item-based constructions to constructions with open positions. The same development from concrete to abstract constructions has been quantitatively shown by Borensztajn et al. (2008) to hold for many other phenomena, including the use of the progressive, the use of auxiliaries, and do-support in questions and negations.
28.5.3 Learning Rule-Like Behavior Without Rules: The Case of Auxiliary Fronting
U-DOP can learn complex syntactic facets that are typically assumed to be governed by rules or constraints. Instead, in U-DOP/DOP, rule-like behavior can be a side effect of computing the most probable analysis. To show this we will discuss in some detail the phenomenon of auxiliary fronting. This phenomenon is often taken to support the well-known “Poverty of the Stimulus” argument and is called by Crain (1991) the “parade case of an innate constraint”. Let us start with the usual examples, which are the same as those used in Crain (1991), MacWhinney (2005), Clark and Eyraud (2006), and many others:
(5) The man is hungry.
If we turn sentence (5) into a (polar) interrogative, the auxiliary is is fronted, resulting in sentence (6).
(6) Is the man hungry?
A language learner might derive from these two sentences that the first occurring auxiliary is fronted. However, when the sentence also contains a relative clause with an auxiliary is, it should not be the first occurrence of is that is fronted but the one in the main clause:
(7) The man who is eating is hungry.
(8) Is the man who is eating hungry?
Many researchers have argued that there is no reason that children should favor the correct auxiliary fronting. Yet children do produce the correct sentences of the form (8) and rarely those of the form (9), even if they have not heard the correct form before (Crain and Nakayama 1987).^{2}
(9) *Is the man who eating is hungry?
According to the nativist view and the “poverty of the stimulus” argument, sentences of the type in (8) are so rare that children must have innately specified knowledge that allows them to learn this facet of language without ever having seen it (Crain and Nakayama 1987). On the other hand, it has been claimed that this type of sentence can be learned from experience alone (Lewis and Elman 2001; Reali and Christiansen 2005). We will not enter the controversy at this point (see Kam et al. 2005), but believe that both viewpoints overlook an alternative possibility, namely that auxiliary fronting need neither be innately specified nor be present in the input data in order to be learned. Instead, the phenomenon may be a side effect of computing the most probable sentence-structure without learning any explicit rule or constraint for this phenomenon.
The learning of auxiliary fronting can proceed when we have induced tree structures for the following two sentences:
(10)
(11)
Note that these sentences do not contain an example of complex fronting where the auxiliary should be fronted from the main clause rather than from the relative clause. The tree structures for (10) and (11) can be derived from exactly the same sentences as in Clark and Eyraud (2006):
(12)
(13)
(14)
(15)
It can be shown that the most probable trees for (10) and (11) computed by U-DOP from sentences (10)–(15) are those in Figure 28.19 (see Bod 2009 for details).
Given these trees, we can easily show that the most probable tree produces the correct auxiliary fronting. In order to produce the correct AUX-question, Is the man who is eating hungry, we only need to combine the following two subtrees in Figure 28.20 from the acquired structures in Figure 28.19 (note that the first subtree is discontiguous).^{3}
Instead, to produce the incorrect AUX-question *Is the man who eating is hungry? we would need to combine at least four subtrees from Figure 28.19, which are given in Figure 28.21.
The derivation in Figure 28.20 turns out to be the most likely one, thereby overruling the incorrect form produced in Figure 28.21. This may be intuitively understood as follows. We have already explained in section 28.4.2 that (U-)DOP’s probability model has a very strong preference for sentences and structures that can be constructed out of the largest corpus fragments. This means that sentences generated by a shorter derivation tend to be preferred over sentences that can only be generated by longer derivations.
We should keep in mind that the example above is limited to a couple of artificial sentences. It only shows that U-DOP/DOP can infer a complex auxiliary question from a simple auxiliary question and a complex declarative. But a language learner does not need to hear each time a new pair of sentences to produce a new auxiliary question—such as Is the girl alone? and The girl who is crying is alone in order to produce Is the girl who is crying alone? In Bod (2009), we show that U-DOP can also learn auxiliary fronting from the utterances in the Eve corpus (MacWhinney 2000), even though complex auxiliary fronting does not occur in that corpus. Furthermore, by sampling from the probability distribution of possible auxiliary sentences (rather than computing the most probable sentence as above), U-DOP can simulate many of the errors made by children as elicited in the experiments by Ambridge et al. (2008).
28.5.4 Learning Categories and Semantics?
Previous work has noted that category induction is an easier task than structure induction (Redington et al. 1998; Clark 2000; Klein and Manning 2005; Borensztajn and Zuidema 2007). The U-DOP approach can be generalized to category learning as follows. Initially assign all possible categories to every node in all possible trees (from a finite set of n abstract categories C_{1} … C_{n}) and let the most probable tree decide which trees correspond to the best category assignments (see Bod 2006b).
The unsupervised learning of semantic representations for sentences is still in its infancy. Although we can quite accurately learn meaning representations from a set of pre-annotated sentences (e.g., Bonnema et al. 1997; Bod 1999; Wong and Mooney 2007), the acquisition of semantic representations directly from child-directed speech is a largely unexplored field. Some progress has been made in (semi-)unsupervised learning of predicate–argument structure and compositional semantics (e.g., Alishahi and Stevenson 2008; Piantadosi et al. 2008). But it is fair to say that no satisfactory model for learning high-level semantic representations exists to date, neither in probabilistic nor in categorical linguistics.
28.5.5 Distinguishing Possible from Impossible Languages?
There is an important question of whether probabilistic models, in particular U-DOP, don’t learn too much. Can they learn impossible languages? Although this question has hardly been investigated so far, it is noteworthy that (U-)DOP’s probability model, with its preference for the shortest derivation, limits the set of possible languages that can be learned. For example, a language that inverts a word string, called “linear inversion”, will be ruled out by U-DOP. This is because linear inversion would lead to one of the longest derivations possible, since it can only be accomplished by decomposing the tree structures into the smallest subtrees for each single word, after which they must be reattached in the reverse order to the covering tree structure. Thus any structural operation leading to shorter derivations will win over this linear operation that tends to result in the longest possible derivation (at least for sentences longer than two words). While this property of U-DOP is promising, the relation between language typology and probabilistic learning models has still to be explored.
28.6 Relation to Other Models
As explained at the beginning of this chapter, probabilistic extensions can be created for virtually any linguistic theory or formalism. The underlying assumptions of (U-)DOP seem to be most congenial to Cognitive Grammar and Construction Grammar. DOP embraces the notion of “maximalist grammar” in cognitive and usage-based models, as coined by Langacker (1987b), and DOP is also consonant with Radical Construction Grammar where any exemplar or fragment is stored even if it is compositional (Croft 2001). However, we believe that both Cognitive Grammar and Construction Grammar suffer from a lack of formalization, especially in defining how constructions are combined and how they are learned (see Bod 2009 for a detailed criticism). At the same time, we have argued in Bod (2009) that DOP can be seen as a formalization and computational realization of Construction Grammar.
The link between DOP and exemplar/usage-based models is also straightforward. According to Exemplar Theory, stored linguistic tokens are the primitives of language that allow for production and perception as analogical generalizations over stored memories. Most exemplar models have been limited to phonetics (Johnson 1997; Bybee 2006a), but recent years have seen increasing interest in developing exemplar models for syntax, as evidenced by the papers in Gahl and Yu (2006). As Hay and Bresnan (2006) note, phonetic exemplar theory mainly deals with classification while syntactic exemplar theory (like DOP) focuses on compositionality. Schütze et al. (2007) aim to integrate the two approaches by extending similarity metrics from phonetic exemplar theory to syntax, which is congenial to the DOP-LAST model by Snider (2008).
Various linguists have argued that both rules and exemplars play a role in language, and have designed their linguistic theories accordingly (e.g., Sag and Wasow 1999; Jackendoff 2002; Culicover and Jackendoff 2005). The DOP model takes this idea one step further: It proposes that rules and exemplars are part of the same distribution and that both can be represented by fragment trees or subtrees. Rule-like behavior is then no more than a side effect of maximizing the probability from the frequencies of the subtrees. In language acquisition, U-DOP is most similar to Item-Based Learning (MacWhinney 1978; Tomasello 2003), especially in simulating the development from concrete item-based constructions to increasingly abstract constructions.
The first instantiation of DOP, DOP_{1} (see section 28.2), is formally equivalent to a Tree-Substitution Grammar or TSG. TSGs constitute a subclass of Tree-Adjoining Grammars or TAGs (Joshi 2004), and are equivalent to the class of TAGs when DOP_{1}’s substitution operation is extended with adjunction (Hoogweg 2003). DOP can be seen as a TAG grammar where the elementary trees correspond to the set of all fragment trees derived from a treebank. One may wonder whether the learning method by Zuidema (2006), where large subtrees only receive non-zero weights if they occur more frequently than can be expected from the weights of smaller subtrees (section 28.3.5), turns the redundant DOP model into a linguistically succinct TAG model. But this is not the case. Zuidema’s predictions of the productive units include redundant, overlapping fragments such as “I want X”, “I’d like to X”, “want from X to Y”, “What’s X doing?”, etc. Without allowing for redundancy we cannot model gradient acceptability judgments (section 28.3.2), as these judgments are based on possibly overlapping constructions.
Probabilistic linguistics, with its emphasis on redundant, usage-based data, is of course in clear opposition to theories that emphasize a minimal set of non-redundant rules, in particular the Minimalist Program (Chomsky 1995). Yet, even there we may observe a converging trend. In their well-known paper, Hauser et al. (2002) claim that the core language faculty comprises just recursion and nothing else. If we take this idea seriously, then U-DOP may be the first computational model that instantiates it. U-DOP’s trees encode the ultimate notion of recursion where every label can be recursively substituted for any other label. All else is statistics.
28.7 Conclusion
Probabilistic linguistics takes all linguistic evidence as positive evidence and lets statistics decide. It allows for accurate modeling of gradient phenomena in production and perception, and suggests that rule-like behavior is no more than a side effect of maximizing probability. Rules still appear in the scientific discourse but are not part of knowledge of language. According to this view, linguistic competence would consist not of a collection of succinctly represented generalizations that characterize a language; rather, competence may be nothing more than probabilistically organized memories of prior linguistic experiences.
We have seen that probabilistic models of language suggest that there is a single model for both language use and language acquisition. Yet these models need a definition of linguistic representation to start with, be it a phrase-structure tree or a functional attribute-value matrix. On this account, the central concern of linguistics would not be finding Universal Grammar but defining a Universal Representation for linguistic experiences that should apply to all languages. If there is anything innate in the human language faculty, it is this Universal Representation for linguistic experiences together with the capacity to take apart and recombine these experiences.
Notes:
Many of the ideas presented in this chapter emerged from joint work carried out during the last fifteen years with (in chronological order): Remko Scha, Khalil Sima’an, Ronald Kaplan, Menno van Zaanen, Boris Cormons, Andy Way, Jennifer Hay, Stefanie Jannedy, Willem Zuidema, Gideon Borensztajn, Dave Cochran, Stefan Frank, and others. I am very grateful to all of them. Of course, any remaining errors in this chapter are entirely my responsibility. This work was partly funded by the Netherlands Organization for Scientific Research (NWO).
(^{1}) In Bod (2009), we also propose a further extension of U-DOP which takes into account the shortest derivation as well.
(^{2}) Crain and Nakayama (1987) found that children never produced the incorrect form (9). But in a more detailed experiment on eliciting auxiliary fronting questions from children, Ambridge et al. (2008) found that the correct form was produced 26.7% of the time, the incorrect form in (9) was produced 4.55% of the time, and auxiliary doubling errors were produced 14.02% of the time. The other produced questions corresponded to shorter forms of the questions, unclassified errors, and other excluded responses.
(^{3}) We are implicitly assuming a DOP model which computes the most probable sentence given a certain meaning to be conveyed, such as in Bonnema et al. (1997) and Bod (1998).