PRINTED FROM OXFORD HANDBOOKS ONLINE. © Oxford University Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy and Legal Notice).


Machine Learning

Abstract and Keywords

This article introduces the type of symbolic machine learning in which decision trees, rules, or case-based classifiers are induced from supervised training examples. It describes the representation of knowledge assumed by each of these approaches and reviews basic algorithms for inducing such representations from annotated training examples and using the acquired knowledge to classify future instances. Machine learning is the study of computational systems that improve performance on some task with experience. Most machine learning methods concern the task of categorizing examples described by a set of features. These techniques can be applied to learn knowledge required for a variety of problems in computational linguistics ranging from part-of-speech tagging and syntactic parsing to word-sense disambiguation and anaphora resolution. Finally, this article reviews the applications to a variety of these problems, such as morphology, part-of-speech tagging, word-sense disambiguation, syntactic parsing, semantic parsing, information extraction, and anaphora resolution.

Keywords: symbolic machine learning, supervised training examples, information extraction, syntactic parsing, semantic parsing, anaphora resolution, part-of-speech tagging


This chapter introduces symbolic machine learning in which decision trees, rules, or case-based classifiers are induced from supervised training examples. It describes the representation of knowledge assumed by each of these approaches and reviews basic algorithms for inducing such representations from annotated training examples and using the acquired knowledge to classify future instances. These techniques can be applied to learn knowledge required for a variety of problems in computational linguistics ranging from part-of-speech tagging and syntactic parsing to word-sense disambiguation and anaphora resolution. Applications to a variety of these problems are reviewed.

20.1 Introduction

Broadly interpreted, machine learning is the study of computational systems that improve performance on some task with experience (Mitchell 1997; Langley 1996). However, the term is frequently used to refer specifically to methods that represent learned knowledge in a declarative, symbolic form as opposed to more numerically oriented statistical or neural-network training methods (see Chapter 19). In particular, it concerns methods which represent learned knowledge in the form of interpretable decision trees, logical rules, and stored instances. Decision trees are (p. 377) classification functions represented as trees in which the nodes are attribute tests, the branches are attribute values, and the leaves are class labels. Rules are implications in either propositional or predicate logic used to draw deductive inferences from data. A variety of algorithms exist for inducing knowledge in both of these forms from training examples. In contrast, instance-based (case-based, memory-based) methods simply remember past training instances and make a decision about a new case based on its similarity to specific past examples. This chapter reviews basic methods for each of these three approaches to symbolic machine learning. Specifically, we review top-down induction of decision trees, rule induction (including inductive logic programming), and nearest-neighbour instance-based learning methods.

As described in previous chapters, understanding natural language requires a large amount of knowledge about morphology, syntax, semantics, and pragmatics as well as general knowledge about the world. Acquiring and encoding all of this knowledge is one of the fundamental impediments to developing effective and robust language processing systems. Like the statistical methods described in the previous chapter, machine learning methods offer the promise of automating the acquisition of this knowledge from annotated or unannotated language corpora. A potential advantage of symbolic learning methods over statistical methods is that the acquired knowledge is represented in a form that is more easily interpreted by human developers and more similar to representations used in manually developed systems. Such interpretable knowledge potentially allows for greater scientific insight into linguistic phenomena, improvement of learned knowledge through human editing, and easier integration with manually developed systems. Each of the machine learning methods we review has been applied to a variety of problems in computational linguistics including morphological generation and analysis, part-of-speech tagging, syntactic parsing, word-sense disambiguation, semantic analysis, information extraction, and anaphora resolution. We briefly survey some of these applications and summarize the current state of the art in the application of symbolic machine learning to computational linguistics.

20.2 Learning for Categorization

Most machine learning methods concern the task of categorizing examples described by a set of features. It is generally assumed that a fixed set of n discrete-valued or real-valued features, {f1, …, fn}, are used to describe examples, and that the task is to assign an example to one of m disjoint categories {c1, …, cm}. For example, consider the task of deciding which of the following three sense categories is the correct interpretation (p. 378) of the semantically ambiguous English noun interest given a full sentence in which it appears as context (see Chapter 13).

  1. c1: readiness to give attention

  2. c2: advantage, advancement, or favour

  3. c3: money paid for the use of money

The following might be a reasonable set of features for this problem:

  • W+i: the word appearing i positions after interest for i=1, 2, 3

  • W-i: the word appearing i positions before interest for i=1, 2, 3

  • Ki: a binary-valued feature for a selected keyword, for i = 1, …, k, where Ki is true if the ith keyword appears anywhere in the current sentence. Relevant keywords for interest might be attracted, expressed, payments, bank, etc.
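As a concrete sketch of how such features might be extracted (the tokenization and keyword list below are illustrative assumptions, not taken from the chapter):

```python
def extract_features(tokens, target="interest",
                     keywords=("attracted", "expressed", "payments", "bank")):
    """Build the W+i, W-i, and Ki features for one occurrence of `target`."""
    pos = tokens.index(target)
    feats = {}
    for i in (1, 2, 3):
        # Words i positions after/before the target; "_" marks a missing word.
        feats[f"W+{i}"] = tokens[pos + i] if pos + i < len(tokens) else "_"
        feats[f"W-{i}"] = tokens[pos - i] if pos - i >= 0 else "_"
    for kw in keywords:
        # Ki is true if the keyword appears anywhere in the sentence.
        feats[f"K_{kw}"] = kw in tokens
    return feats

feats = extract_features("john expressed a strong interest in computers".split())
# feats["W+1"] == "in", feats["W-1"] == "strong", feats["K_bank"] is False
```

A real system would, of course, normalize case and punctuation before extraction.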

A learning system is given a set of supervised training examples for which the correct category is given. For example:

  1. c1: John expressed a strong interest in computers.

  2. c2: Acme Bank charges very high interest.

  3. c3: War in East Timor is not in the interest of the nation.

[Fig. 20.1 Learning curves for disambiguating 'line']
(p. 379) In this case, the values of the relevant features must first be determined in a straightforward manner from the text of the sentence. From these examples, the system must produce a procedure for accurately categorizing future examples.

Categorization systems are typically evaluated on the accuracy of their predictions as measured by the percentage of examples that are correctly classified. Experiments for estimating this accuracy for a particular task are performed by randomly splitting a representative set of labelled examples into two sets, a training set used to induce a classifier, and an independent and disjoint test set used to measure its classification accuracy. Averages over multiple splits of the data into training and test sets provide more accurate estimates and give information on the variation in performance across training and test samples. Since labelling large amounts of training data can be a time-consuming task, it is also useful to look at learning curves in which the accuracy is measured repeatedly as the size of the training set is increased, providing information on how well the system generalizes from various amounts of training data. Fig. 20.1 shows sample learning curves for a variety of systems on a related task of semantically disambiguating the word 'line' into one of six possible senses (Mooney 1996). Mitchell (1997) provides a basic introduction to machine learning, including a discussion of experimental evaluation.
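The evaluation procedure just described can be sketched as follows (the `learner` interface and the majority-class baseline are my own assumptions for illustration):

```python
import random
from collections import Counter

def split_accuracy(examples, train_fraction, learner, trials=10, seed=0):
    """Average test-set accuracy over several random train/test splits.
    `learner(train)` must return a classifier: a function from x to a label."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        data = examples[:]
        rng.shuffle(data)                       # random split each trial
        cut = int(len(data) * train_fraction)
        train, test = data[:cut], data[cut:]
        classify = learner(train)
        correct = sum(classify(x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

def majority_class_learner(train):
    """Baseline: always predict the most common label in the training set."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label
```

Calling `split_accuracy` with increasing values of `train_fraction` traces out a learning curve of the kind shown in Fig. 20.1.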

20.3 Decision-Tree Induction

Decision trees are classification functions represented as trees in which the nodes are feature tests, the branches are feature values, and the leaves are class labels. An example is classified by starting at the root and recursively traversing the tree to a leaf by following the path dictated by its feature values. A sample tree for the interest problem is shown in Fig. 20.2. For simplicity, assume that all of the unseen extra branches for W+1 and W+2 are leaves labelled c1. This tree can be paraphrased as follows: If the word bank appears anywhere in the sentence assign sense c3; otherwise if the word following interest is rate, assign sense c3, but if the following word is of and the word two before is in (as in … in the interest of …), then assign sense c2; in all other cases assign sense c1.
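The paraphrased tree can be encoded and traversed directly; the nested-tuple representation and the handling of unlisted branches below are my own rendering, not the chapter's:

```python
# Each internal node is (feature, {value: subtree}); leaves are class labels.
# Branches not listed fall through to the default sense c1, as in the text.
tree = ("K_bank", {
    True: "c3",
    False: ("W+1", {
        "rate": "c3",
        "of": ("W-2", {"in": "c2"}),
    }),
})

def classify(tree, example, default="c1"):
    """Start at the root and follow the branch matching each tested feature."""
    while isinstance(tree, tuple):
        feature, branches = tree
        value = example.get(feature)
        if value not in branches:
            return default       # unseen branch: fall through to the default
        tree = branches[value]
    return tree
```

For instance, an example with `K_bank=False`, `W+1="of"`, and `W-2="in"` reaches the leaf for sense c2.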

[Fig. 20.2 Sample decision tree for disambiguating interest]

The goal of learning is to induce a decision tree that is consistent with the training data. Since there are many trees consistent with a given training set, most approaches follow 'Occam's razor' and try to induce the simplest tree according to some complexity measure, such as the number of leaves or the depth of the tree. Since computing a minimal tree according to such measures is an NP-hard problem (i.e. a computational problem for which there is no known efficient, polynomial-time algorithm; see also (p. 380) Chapter 9), most algorithms perform a fairly simple greedy search to efficiently find an approximately minimal tree. The standard approach is a divide-and-conquer algorithm that constructs the tree top down, first picking a feature for the root of the tree and then recursively creating subtrees for each value of the selected splitting feature. Pseudocode for such an algorithm is shown in Fig. 20.3.

[Fig. 20.3 Decision-tree induction algorithm]

The size of the constructed tree critically depends on the heuristic used to pick the splitting feature. A standard approach is to pick the feature that maximizes the expected reduction in the entropy, or disorder, of the data with respect to category (Quinlan 1986). The entropy of a set of data, S, with respect to category is defined as:

Entropy(S) = −Σ_{i=1}^{m} (|S_i|/|S|) log₂(|S_i|/|S|)    (20.1)

where S_i is the subset of S in category i (1 ≤ i ≤ m). The closer the data is to consisting purely of examples in a single category, the lower the entropy. A good splitting feature (p. 381) fractions the data into subsets with low entropy. This is because the closer the subsets are to containing examples in only a single category, the closer they are to terminating in a leaf, and the smaller the resulting subtree will be. Therefore, the best split is selected as the feature, f_i, that results in the greatest information gain, defined as:

Gain(S, f_i) = Entropy(S) − Σ_j (|S_ij|/|S|) Entropy(S_ij)    (20.2)

where j ranges over the possible values v_ij of f_i, and S_ij is the subset of S with value v_ij for feature f_i. The expected entropy of the resulting subsets is computed by weighting their entropies by their relative size |S_ij|/|S|.
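These computations can be sketched in a minimal ID3-style implementation (my own Python rendering of the standard algorithm, with examples assumed to be (feature-dict, label) pairs):

```python
from collections import Counter
from math import log2

def entropy(examples):
    """Entropy(S) = -sum_i (|S_i|/|S|) log2(|S_i|/|S|) over category subsets."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(examples, f):
    """Expected entropy reduction from splitting on feature f."""
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[f], []).append((x, y))
    expected = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy(examples) - expected

def induce_tree(examples, features):
    """Top-down divide and conquer: return a label or (feature, {value: subtree})."""
    labels = {y for _, y in examples}
    if len(labels) == 1:
        return labels.pop()                       # pure subset: make a leaf
    if not features:
        return Counter(y for _, y in examples).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(examples, f))
    branches = {}
    for x, y in examples:
        branches.setdefault(x[best], []).append((x, y))
    rest = [f for f in features if f != best]
    return (best, {v: induce_tree(s, rest) for v, s in branches.items()})
```

On a toy sample where feature "a" alone determines the category, the induced tree tests "a" at the root and nothing else.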

The resulting algorithm is computationally very efficient (linear in the number of examples) and in practice can quickly construct relatively small trees from large amounts of data. The basic algorithm has been enhanced to handle many practical problems that arise when processing real data, such as noisy data, missing feature values, and real-valued features (Quinlan 1993). Consequently, decision-tree methods are widely used in data mining applications where very large amounts of data need to be processed (Fayyad, Piatetsky-Shapiro, and Smyth 1996; see also Chapter 34). The most effective recent improvement to decision-tree algorithms has been methods for constructing multiple alternative decision trees from the same training data, and then classifying new instances based on a weighted vote of these multiple hypotheses (Quinlan 1993).

20.4 Rule Induction

[Fig. 20.4 Sample rules for disambiguating interest]

[Fig. 20.5 Rule-induction covering algorithm]

Classification functions can also be symbolically represented by a set of rules, or logical implications. This is equivalent to representing each category in disjunctive normal form (DNF), i.e. a disjunction of conjunctions of feature-value pairs, where each rule is a conjunction corresponding to a disjunct in the formula for a given category. For example, the decision tree in Fig. 20.2 can also be represented by the rules in Fig. 20.4, assuming that c1 is the default category that is assigned if none of the rules apply. (p. 382)

[Fig. 20.6 Top-down rule construction algorithm]

Decision trees can be translated into a set of rules by creating a separate rule for each path from the root to a leaf in the tree (Quinlan 1993). However, rules can also be directly induced from training data using a variety of algorithms (Mitchell 1997; Langley 1996). The general goal is to construct the smallest rule set (the one with the fewest symbols) that is consistent with the training data. Again, the problem of learning the minimally complex hypothesis is NP-hard, and therefore heuristic search is typically used to induce only an approximately minimal definition. The standard approach is to use a form of greedy set covering, where at each iteration a new rule is learned that attempts to cover the largest set of examples of a particular category without covering examples of other categories. These examples are then removed, and additional rules are learned to cover the remaining examples of the category. Pseudocode for this process is shown in Fig. 20.5, where ConstructRule(P, N, Features) attempts to learn a conjunction covering as many of the positive examples in P as possible without covering any of the negative examples in N.
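The covering loop can be sketched as follows (a propositional rendering of my own; rules are represented as dicts of feature=value tests, and `construct_rule` is passed in so any rule-construction strategy can be plugged in):

```python
def rule_covers(rule, example):
    """A rule is a conjunction of feature=value tests, stored as a dict."""
    return all(example.get(f) == v for f, v in rule.items())

def learn_rules(positives, negatives, features, construct_rule):
    """Greedy set covering: keep learning rules until every positive is covered."""
    rules, remaining = [], list(positives)
    while remaining:
        rule = construct_rule(remaining, negatives, features)
        covered = [x for x in remaining if rule_covers(rule, x)]
        if not covered:          # no progress: give up rather than loop forever
            break
        rules.append(rule)
        # Remove the newly covered positives and learn further rules for the rest.
        remaining = [x for x in remaining if not rule_covers(rule, x)]
    return rules
```

With a trivial most-specific `construct_rule` (return the full description of the first uncovered positive), the loop degenerates to memorizing one rule per example, which is why the quality of ConstructRule matters.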


(p. 383) There are two basic approaches to implementing ConstructRule. Top-down (general-to-specific) approaches start with the most general 'empty' rule (true → c_i), and repeatedly specialize it until it no longer covers any of the negative examples in N. Bottom-up (specific-to-general) approaches start with a very specific rule whose antecedent consists of the complete description of one of the positive examples in P, and repeatedly generalize it until it begins covering negative examples in N. Since top-down approaches tend to construct simpler (more general) rules, they are generally more popular. Fig. 20.6 presents a top-down algorithm based on the approach used in the FOIL system (Quinlan 1990). At each step, a new condition, f_i = v_ij, is added to the rule and the examples that fail to satisfy this condition are removed. The best specializing feature-value pair is selected based on preferring to retain as many positive examples as possible while removing as many negatives as possible. A gain heuristic analogous to the one used in decision trees can be defined as follows:

Gain(f_i = v_ij) = |P_ij| × (log₂(|P_ij|/(|P_ij| + |N_ij|)) − log₂(|P|/(|P| + |N|)))    (20.3)

where P_ij and N_ij are as defined in Fig. 20.6 (the subsets of P and N satisfying the condition f_i = v_ij). The first term, |P_ij|, encourages coverage of a large number of positives, and the second term encourages an increase in the percentage of covered examples that are positive (a decrease in the percentage of covered examples that are negative).
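A propositional sketch of this top-down construction (my own rendering, not FOIL itself; examples are feature dicts and candidate conditions are (feature, value) pairs):

```python
from math import log2

def foil_gain(p, n, p0, n0):
    """FOIL-style gain for a specialization covering p positives and n negatives,
    from a rule previously covering p0 positives and n0 negatives."""
    if p == 0:
        return float("-inf")     # a condition covering no positives is useless
    return p * (log2(p / (p + n)) - log2(p0 / (p0 + n0)))

def construct_rule(P, N, candidates):
    """Start from the empty rule; add the best feature=value test until no
    negatives remain covered. `candidates` is a list of (feature, value) pairs."""
    rule = {}
    P, N = list(P), list(N)
    while N:
        best, best_score = None, float("-inf")
        for f, v in candidates:
            if f in rule:
                continue
            p = sum(1 for x in P if x.get(f) == v)
            n = sum(1 for x in N if x.get(f) == v)
            score = foil_gain(p, n, len(P), len(N))
            if score > best_score:
                best, best_score = (f, v), score
        if best is None:
            break                # no condition helps; return what we have
        f, v = best
        rule[f] = v
        P = [x for x in P if x.get(f) == v]
        N = [x for x in N if x.get(f) == v]
    return rule
```

The first term of the gain rewards conditions that keep many positives, the second rewards increasing the purity of the covered set, exactly the trade-off described above.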

This and similar rule-learning algorithms have been demonstrated to efficiently induce small and accurate rule sets from large amounts of realistic data. Like decision-tree methods, rule-learning algorithms have also been enhanced to handle noisy data and real-valued features (Clark and Niblett 1989; Cohen 1995). More significantly, they also have been extended to learn rules in first-order predicate logic, a much richer representation language. Predicate logic allows for quantified variables and relations and can represent concepts that are not expressible using examples described as feature vectors. For example, the following rules, written in Prolog syntax (where the conclusion appears first), define the relational concept of an uncle:

uncle(X,Y):- brother(X,Z), parent(Z,Y).

uncle(X,Y):- husband(X,Z), sister(Z,W), parent(W,Y).

The goal of inductive logic programming (ILP) or relational learning is to infer rules of this sort given a database of background facts and logical definitions of other relations (Lavrac and Dzeroski 1994). For example, an ILP system can learn the above rules for uncle (the target predicate) given a set of positive and negative examples of uncle relationships and a set of facts for the relations parent, brother, sister, and husband (the background predicates) for the members of a given extended family, such as:

uncle(Tom, Frank), uncle(Bob, John), ¬uncle(Tom, Cindy), ¬uncle(Bob, Tom)

parent(Bob, Frank), parent(Cindy, Frank), parent(Alice, John), parent(Tom, John),

(p. 384) brother(Tom, Cindy), sister(Cindy, Tom), husband(Tom, Alice), husband(Bob, Cindy).

Alternatively, logical definitions for brother and sister could be supplied and these relations could be inferred from a more complete set of facts about only the ‘basic’ predicates: parent, spouse, and gender.

The rule construction algorithm in Fig. 20.6 is actually a simplification of the method used in the FOIL ILP system (Quinlan 1990). In the case of predicate logic, FOIL starts with an empty rule for the target predicate (P(X1, …, Xr) :- .) and repeatedly specializes it by adding conditions to the antecedent of the rule chosen from the space of all possible literals of the following forms:

  • Qi(V1, …, Vr)

  • not(Qi(V1, …, Vr))

  • Xi=Xj

  • not(Xi=Xj)

where Qi are the background predicates, Xi are the existing variables used in the current rule, and V1, …, Vr are a set of variables where at least one is an existing variable (one of the Xi) but the others can be newly introduced. A slight modification of the Gain heuristic in equation (20.3) is used to select the best literal.

ILP systems have been used to successfully acquire interesting and comprehensible rules for a number of realistic problems in engineering and molecular biology, such as determining the cancer-causing potential of various chemical structures (Bratko and Muggleton 1995). Unlike most methods which require ‘feature engineering’ to reformat examples into a fixed list of features, ILP methods can induce rules directly from unbounded data structures such as strings, stacks, and trees (which are easily represented in predicate logic). However, since they are searching a much larger space of possible rules in a more expressive language, they are computationally more demanding and therefore are currently restricted to processing a few thousand examples compared to the millions of examples that can be potentially handled by feature-based systems.

20.5 Instance-Based Categorization

Unlike most approaches to learning for categorization, instance-based learning methods (also called case-based or memory-based methods) do not construct an abstract function definition, but rather categorize new examples based on their similarity (p. 385) to one or more specific training examples (Aha, Kibler, and Albert 1991; Stanfill and Waltz 1986). Training generally requires just storing the training examples in a database, although it may also require indexing the examples to allow for efficient retrieval. Categorizing new test instances is performed by determining the closest examples in the database according to some distance metric.

For real-valued features, the standard approach is to use Euclidean distance, where the distance between two examples x and y is defined as:

d(x, y) = √( Σ_{i=1}^{n} (f_i(x) − f_i(y))² )    (20.4)

where f_i(x) is the value of the feature f_i for example x. For discrete-valued features, the difference (f_i(x) − f_i(y)) is generally defined to be 0 if the two examples have the same value for f_i and 1 otherwise (i.e. the Hamming distance). In order to compensate for differences in scale between different features, the values of all features are frequently rescaled to the interval [0, 1]. Intuitively, such distance measures are intended to measure the dissimilarity of two examples.

[Fig. 20.7 K nearest-neighbour categorization algorithm]

A standard algorithm for categorizing new instances is the k nearest-neighbour method (Cover and Hart 1967). The k closest examples to the test example according to the distance metric are found, and the example is assigned to the majority class for these examples. Pseudocode for this process is shown in Fig. 20.7. The reason for picking k examples instead of just the closest one is to make the method robust by basing decisions on more evidence than just one example, which could be noisy. To avoid ties, an odd value for k is normally used; typical values are 3 and 5.
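The method amounts to a few lines once a distance function is fixed (a sketch of my own; the Hamming distance and toy data are illustrative assumptions):

```python
from collections import Counter

def knn_classify(test_example, training, dist, k=3):
    """Assign the majority category among the k stored examples nearest to
    `test_example`. `training` is a list of (example, label) pairs."""
    nearest = sorted(training, key=lambda pair: dist(test_example, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy usage with Hamming distance over fixed-length feature tuples.
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
train = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "pos"),
         ((1, 1), "pos"), ((1, 1), "pos")]
```

Note that training here is pure storage; all the work happens at classification time, which is why indexing schemes for fast retrieval matter in practice.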

The basic nearest-neighbour method has been enhanced with techniques for weighting features in order to emphasize attributes that are most useful for categorization, and for selecting a subset of examples for storage in order to reduce the memory requirements of retaining all training examples (Stanfill and Waltz 1986; Aha, Kibler, and Albert 1991; Cost and Salzberg 1993).

(p. 386) 20.6 Applications to Computational Linguistics

Decision-tree learning, rule induction, and instance-based categorization have been applied to a range of problems in computational linguistics. This section surveys applications to a variety of problems in language processing starting with morphological and lexical problems and ending with discourse-level tasks.

20.6.1 Morphology

Symbolic learning has been applied to several problems in morphology (see Chapter 2). In particular, decision-tree and ILP methods have been applied to the problem of generating the past tense of an English verb, a task frequently studied in cognitive science and neural networks as a touchstone problem in language acquisition. In fact, there has been significant debate about whether or not rule learning is an adequate cognitive model of how children learn this task (Rumelhart and McClelland 1986; Pinker and Prince 1988; MacWhinney and Leinbach 1991). Typically, the problem is studied in its phonetic form, in which a string of phonemes for the present tense is mapped to a string of phonemes for the past tense. The problem is interesting since one must learn the regular transformation of adding -ed, as well as particular irregular patterns such as that illustrated by the examples sing→sang, ring→rang, and spring→sprang.

Decision-tree algorithms were applied to this task and found to significantly outperform previous neural-network models in terms of producing correct past-tense forms for independent test words (Ling and Marinov 1993; Ling 1994). In this study, verbs were restricted to fifteen phonemes encoded using the UNIBET ASCII standard, and fifteen separate trees were induced, one for producing each of the output phoneme positions using all fifteen of the input phonemes as features. Below is the encoding for the mapping act→acted, where an underscore is used to represent a blank.

Input:  & k t _ _ _ _ _ _ _ _ _ _ _ _
Output: & k t I d _ _ _ _ _ _ _ _ _ _

ILP rule-learning algorithms have also been applied to this problem and shown to outperform decision trees (Mooney and Califf 1995). In this case, a definition for the predicate Past(X, Y) was learned for mapping an unbounded list of UNIBET phonemes to a corresponding list for the past tense (e.g. Past([&, k, t], [&, k, t, I, d])) using a predicate for appending lists as part of the background. A definition was learned in the form of a decision list in which rules are ordered and the first rule that applies is chosen. This allows first checking for exceptional cases and falling through to a default rule if none apply. The ILP system learns a very concise and comprehensible definition for the past-tense transformation using this approach. Similar ILP methods (p. 387) have also been applied to learning morphology in other European languages (Manandhar, Dzeroski, and Erjavec 1998; Kazakov and Manandhar 1998).

20.6.2 Part-of-speech tagging

Tagging each word with its appropriate part of speech (POS) based on context is a useful first step in syntactic analysis (see Chapter 11). In addition to statistical methods that have been successfully applied to this task, decision-tree induction (Marquez, Padro, and Rodriguez 1999), rule induction (Brill 1995), and instance-based categorization (Daelemans et al. 1996) have also been successfully used to learn POS taggers.

The features used to determine the POS of a word generally include the POS tags in a window of two to three words on either side. Since during testing these tags must also be determined by the classifier, either only the previous tags are used, or an iterative procedure is used to repeatedly update all tags until convergence is reached. For known words, a dictionary provides a set of possible POS categories. For unknown words, all POS categories are possible, but additional morphological features, such as the last few characters of the word and whether or not it is capitalized, are typically used as additional input features. Using such techniques, symbolic learning systems can obtain high accuracies similar to those obtained by other POS tagging methods, i.e. in the range of 96–7 per cent.
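An illustrative sketch of the kind of feature window described (the exact feature names, sentinel symbols, and suffix length are my own assumptions, not those of any particular tagger):

```python
def tagging_features(words, left_tags, i, window=2):
    """Features for tagging words[i]: surrounding words, the POS tags already
    assigned to the left, and simple morphological cues for unknown words."""
    feats = {
        "word": words[i].lower(),
        "capitalized": words[i][0].isupper(),
        "suffix": words[i][-3:].lower(),   # crude stand-in for morphology
    }
    for d in range(1, window + 1):
        # Only left tags are available when tagging left to right.
        feats[f"tag-{d}"] = left_tags[i - d] if i - d >= 0 else "<s>"
        feats[f"word+{d}"] = words[i + d].lower() if i + d < len(words) else "</s>"
    return feats
```

A tagger would feed such feature dicts to any of the learners discussed earlier, then either tag strictly left to right or iterate until the tag assignments converge.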

20.6.3 Word-sense disambiguation

As illustrated by the interest problem introduced earlier, machine learning methods can be applied to determining the sense of an ambiguous word based on context (see Chapter 13). As also illustrated by this example, a variety of features can be used as helpful cues for this task. In particular, collocational features that specify words that appear in specific locations a few words before or after the ambiguous word are useful features, as are binary features indicating the presence of particular words anywhere in the current or previous sentence. Other potentially useful features include the parts of speech of nearby words, and general syntactic information, such as whether an ambiguous noun appears as a subject, direct object, or indirect object.

Instance-based methods have been applied to disambiguating a variety of words using a combination of all of these types of features (Ng and Lee 1996). A feature-weighted version of nearest neighbour was used to disambiguate 121 different nouns and 70 verbs chosen for being both frequent and highly ambiguous. Fine-grained senses from WordNet were used, resulting in an average of 7.8 senses for the nouns and 12 senses for the verbs. The training set consisted of 192,800 instances of these words found in text sampled from the Brown corpus and the Wall Street Journal and labelled with correct senses. Testing on an independent set of 14,139 examples from (p. 388) the Wall Street Journal gave an accuracy of 68.6 per cent compared to an accuracy of 63.7 per cent from choosing the most common sense, a standard baseline for comparison. Since WordNet is known for making fine sense distinctions, these results may seem somewhat low. For some easier problems the results were more impressive, such as disambiguating interest into one of six fairly fine-grained senses with an accuracy of 90 per cent.

Decision-tree and rule induction have also been applied to sense disambiguation. Fig. 20.1 shows results for disambiguating the word line into one of six senses using only binary features representing the presence or absence of all known words in the current and previous sentence. Tree learning (C4.5), rule learning (PFOIL), and nearest neighbour perform comparably on this task and somewhat worse than simple neural-network (Perceptron) and statistical (Naive Bayes) methods. A more recent project presents results on learning decision trees to disambiguate all content words in a financial corpus with an average accuracy of 77 per cent (Paliouras, Karkaletsis, and Spyropoulos 1999).

20.6.4 Syntactic parsing

Perhaps the best-studied problem in computational linguistics is the syntactic analysis of sentences (see Chapters 4, 12). In addition to statistical methods that have been successfully applied to this task, decision-tree induction (Magerman 1995; Hermjakob and Mooney 1997; Haruno, Shirai, and Ooyama 1999), rule induction (Brill 1993), and instance-based categorization (Cardie 1993; Argamon, Dagan, and Krymolowski 1998) have also been successfully employed to learn syntactic parsers.

One of the first learning methods applied to parsing the Wall Street Journal (WSJ) corpus of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) employed statistical decision trees (Magerman 1995). Using a set of features describing the local syntactic context, including the POS tags of nearby words and the node labels of neighbouring (previously constructed) constituents, decision trees were induced for determining the next parsing operation. Instead of growing the tree to completely fit the training data, pruning was used to create leaves for subsets that still contained a mixture of classes. These leaves were then labelled with class-probability distributions estimated from the subset of the training data reaching that leaf. During testing, the system performs a search for the highest probability parse, where the probability of a parse is estimated by the product of the probabilities of its individual parsing actions (as determined by the decision tree). After training on approximately 40,000 WSJ sentences and testing on 1,920 additional ones, the system obtained a labelled precision (percentage of constructed constituents that were correct) and labelled recall (percentage of correct constituents that were found) of 84 per cent.

(p. 389) 20.6.5 Semantic parsing

Learning methods have also been applied to mapping sentences directly into logical form (see Chapter 5) by inducing a parser from training examples consisting of sentences paired with semantic representations. Below is a sample training pair from an application involving English queries about a database of US geography:

What is the capital of the state with the highest population?

answer(C, (capital(S, C), largest(P, (state(S), population(S, P))))).

Unfortunately, since constructing useful semantic representations for sentences is very difficult unless restricted to a fairly specific application, there is a noticeable lack of large corpora annotated with detailed semantic representations.

However, ILP has been used to induce domain-specific semantic parsers written in Prolog from examples of natural language questions paired with logical Prolog queries (Zelle and Mooney 1996; Ng and Zelle 1997). In this project, parser induction is treated as a problem of learning rules to control the actions of a generic shift-reduce parser. During parsing, the current context is maintained in a stack and a buffer containing the remaining input. When parsing is complete, the stack contains the representation of the input sentence. There are three types of operators used to construct logical forms. One is the introduction onto the stack of a predicate needed in the sentence representation due to the appearance of a word or phrase at the front of the input buffer. A second type of operator unifies variables appearing in different stack items. Finally, an item on the stack may be embedded as an argument of another one. ILP is used to learn conditions under which each of these operators should be applied, using the complete stack and input buffer as context, so that the resulting parser deterministically produces the correct semantic output for all of the training examples.

This technique has been used to induce natural language interfaces to several database-query systems, such as the US geography application illustrated above. In one experiment using a corpus of 250 queries annotated with logical form, after training on 225 examples, the system was able to answer an average of 70 per cent of novel queries correctly compared to 57 per cent for an interface developed by a human programmer. Similar results were obtained for semantic parsing of other languages after translating the corpus into Spanish, Turkish, and Japanese (Thompson and Mooney 1999).

20.6.6 Information extraction


Fig. 20.8 Information extraction example

Information extraction is a form of shallow text processing that locates a specified set of relevant items in a natural language document (see Chapter 30). Fig. 20.8 shows an example of extracting values for a set of labelled slots from a job announcement posted to an Internet newsgroup. Information extraction systems require significant domain-specific knowledge and are time-consuming and difficult to build by hand, making them a good application for machine learning.


Fig. 20.9 Sample learned information-extraction rule

A number of rule-induction methods have recently been applied to learning patterns for information extraction (Soderland 1999; Freitag 1998; Califf and Mooney 1999). Given training examples of texts paired with filled templates, such as that shown in Fig. 20.8, these systems learn pattern-matching rules for extracting the appropriate slot fillers from text. Some systems assume that the text has been preprocessed by a POS tagger or a syntactic parser; others use only patterns based on unprocessed text. Fig. 20.9 shows a sample rule constructed for extracting the transaction amount from a newswire article about a corporate acquisition (Califf and Mooney 1999). This rule extracts the value undisclosed from phrases such as sold to the bank for an undisclosed amount or paid Honeywell an undisclosed price. The pre-filler pattern consists of two pattern elements: (1) a word whose POS is noun or proper noun, and (2) a list of at most two unconstrained words. The filler pattern requires the word undisclosed tagged as an adjective. The post-filler pattern requires a word in the WordNet semantic category price.
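To make the pre-filler/filler/post-filler structure concrete, here is a toy matcher for the rule just described. This is a hypothetical sketch, not the actual implementation: tokens are hand-tagged (word, POS) pairs, and the WordNet semantic-class lookup is stubbed with a small invented set.

```python
# Toy matcher for a three-part extraction rule over POS-tagged tokens.

PRICE_WORDS = {"amount", "price", "sum"}   # stand-in for WordNet class 'price'

def match_rule(tokens):
    """Return the filler if the pre-, filler, and post- patterns all match."""
    for i in range(len(tokens)):
        # Pre-filler element 1: a noun or proper noun.
        if tokens[i][1] not in ("nn", "nnp"):
            continue
        # Pre-filler element 2: a list of at most two unconstrained words.
        for gap in range(0, 3):
            j = i + 1 + gap                 # index of the candidate filler
            if j + 1 >= len(tokens):
                break
            word, pos = tokens[j]
            # Filler: the word 'undisclosed' tagged as an adjective.
            if word == "undisclosed" and pos == "jj":
                # Post-filler: a word in the semantic class 'price'.
                if tokens[j + 1][0] in PRICE_WORDS:
                    return word
    return None

sentence = [("sold", "vbd"), ("to", "to"), ("the", "dt"), ("bank", "nn"),
            ("for", "in"), ("an", "dt"), ("undisclosed", "jj"), ("amount", "nn")]
print(match_rule(sentence))   # undisclosed
```

The rule-induction systems cited above learn constraints like these (on words, POS tags, and semantic classes) automatically by generalizing from the filled templates in the training data.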

Such systems have acquired extraction rules for a variety of domains, including apartment ads, university web pages, seminar announcements, terrorist news stories, and job announcements. After training on a couple of hundred examples, such systems are generally able to learn rules as accurate as those resulting from a time-consuming human development effort. The standard metrics for evaluating information extraction are precision, the percentage of extracted fillers that are correct, and recall, the percentage of correct fillers that are successfully extracted. On most tasks that have been studied, current systems typically achieve precision in the mid-80 per cent range and recall in the mid-60 per cent range.
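A small worked example of the two metrics, using invented numbers: suppose a learner proposed 10 fillers, 8 of them correct, against an answer key containing 12 correct fillers.

```python
# Precision and recall as defined above, on hypothetical counts.
extracted_correct = 8
extracted_total = 10     # fillers the system proposed
key_total = 12           # correct fillers in the answer key

precision = extracted_correct / extracted_total   # fraction of extractions that are right
recall = extracted_correct / key_total            # fraction of the key that was found
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# precision = 80%, recall = 67%
```

The two metrics trade off against each other: a system can raise precision by extracting only when very confident, at the cost of recall, and vice versa.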

20.6.7 Anaphora resolution

Resolving anaphora, or identifying multiple phrases that refer to the same entity, is another difficult language processing problem (see Chapter 14). Anaphora resolution can be treated as a categorization problem by classifying pairs of phrases as either coreferring or not. Given a corpus of texts tagged with coreferring phrases, positive examples can be generated as all coreferring phrase pairs and negative examples as all phrase pairs within the same document that are not marked as coreferring. Both decision-tree (Aone and Bennett 1995; McCarthy and Lehnert 1995) and instance-based methods (Cardie 1992) have been successfully applied to resolving various types of anaphora.
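The reduction from tagged texts to classification examples described above can be sketched directly. In this minimal illustration (the phrases and entity ids are invented), every same-document pair of phrases becomes an example, labelled positive exactly when the two phrases share a gold entity id.

```python
# Turn a coreference-annotated document into pairwise training examples.
from itertools import combinations

# Phrases paired with gold entity ids: phrases 0 and 2 corefer.
phrases = [("Mr. Smith", "e1"), ("the company", "e2"), ("he", "e1")]

def make_pairs(phrases):
    """All unordered phrase pairs, labelled True iff they corefer."""
    examples = []
    for (p1, e1), (p2, e2) in combinations(phrases, 2):
        examples.append(((p1, p2), e1 == e2))
    return examples

for pair, label in make_pairs(phrases):
    print(pair, "positive" if label else "negative")
```

Note that this scheme produces far more negative than positive examples on realistic documents, which is why practical systems often sample or filter the negative pairs.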

In particular, decision-tree induction has been used to construct systems for general noun-phrase coreference resolution. Examples are described using features of the individual phrases, such as the semantic and syntactic category of the head noun, as well as features describing the relationship between the two phrases, such as whether the first phrase precedes the second and whether the semantic class of the first phrase subsumes that of the second. In one experiment (Aone and Bennett 1995), after training on 1,971 anaphors from 295 texts and testing on 1,359 anaphors from an additional 200 texts, the learned decision tree obtained a precision (percentage of coreferences found that were correct) of 86.7 per cent and a recall (percentage of true coreferences that were found) of 69.7 per cent. These results were superior to those obtained using a previous, hand-built coreference procedure (precision 72.9 per cent, recall 66.5 per cent).

Soon, Ng, and Lim (2001) apply a C5-based decision-tree algorithm to coreference resolution of noun phrases in unrestricted text. The coreference resolution module is part of a larger system that also performs sentence segmentation, tokenization, morphological analysis, part-of-speech tagging, noun-phrase identification, named entity recognition, and semantic class determination (via WordNet). The feature vectors used for training and evaluation on MUC-6 and MUC-7 data consist of twelve features of NP pairs, such as distance, string match, number agreement, semantic class agreement, and proper names. The authors also use cross-validation to select the learning parameters.
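A reduced sketch of this kind of pairwise feature vector, covering only a few of the twelve features; the dictionary representation and the example values are invented for illustration, not taken from the cited system.

```python
# Compute a few pairwise coreference features for two noun phrases,
# each represented as a dict of precomputed attributes.

def features(np_i, np_j):
    """Pairwise features for candidate antecedent np_i and anaphor np_j."""
    return {
        "distance":     np_j["position"] - np_i["position"],   # sentences apart
        "string_match": np_i["text"].lower() == np_j["text"].lower(),
        "number_agree": np_i["number"] == np_j["number"],
        "class_agree":  np_i["class"] == np_j["class"],
    }

a = {"text": "the chairman", "position": 0, "number": "sg", "class": "person"}
b = {"text": "he",           "position": 1, "number": "sg", "class": "person"}
print(features(a, b))
```

Each such vector, together with its coreferring/non-coreferring label, becomes one training example for the decision-tree learner.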

Further Reading and Relevant Resources

Introductory textbooks on machine learning include Mitchell (1997) and Langley (1996). Also useful is the collection of early papers assembled by Shavlik and Dietterich (1990). A special issue of the Machine Learning journal on natural language learning is introduced by Cardie and Mooney (1999). An overview of the increasingly popular method of memory-based learning can be found in Daelemans et al. (2001). Several websites maintain useful machine learning pointers; also see the Association for Computational Linguistics' special interest group on Natural Language Learning.


References

Aha, D. W., D. F. Kibler, and M. K. Albert. 1991. ‘Instance-based learning algorithms’. Machine Learning, 6(1), 37–66.

Aone, C. and S. W. Bennett. 1995. ‘Evaluating automated and manual acquisition of anaphora resolution strategies’. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL ′95) (Cambridge, Mass.), 122–9.

Argamon, S., I. Dagan, and Y. Krymolowski. 1998. ‘A memory-based approach to learning shallow natural language patterns’. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and COLING-98 (ACL/COLING ′98) (Montreal), 67–73.

Bratko, I. and S. Muggleton. 1995. ‘Applications of inductive logic programming’. Communications of the Association for Computing Machinery, 38(11), 65–70.

Brill, E. 1993. ‘Automatic grammar induction and parsing free text: a transformation-based approach’. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL ′93) (Columbus, Oh.), 259–65.

—— 1995. ‘Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging’. Computational Linguistics, 21(4), 543–65.

Califf, M. E. and R. J. Mooney. 1999. ‘Relational learning of pattern-match rules for information extraction’. Proceedings of the 16th National Conference on Artificial Intelligence (AAAI ′99) (Orlando, Fla.), 328–34.

Cardie, C. 1992. ‘Learning to disambiguate relative pronouns’. Proceedings of the 10th National Conference on Artificial Intelligence (AAAI ′92) (San Jose, Calif.), 38–43.

—— 1993. ‘A case-based approach to knowledge acquisition for domain-specific sentence analysis’. Proceedings of the 11th National Conference on Artificial Intelligence (AAAI ′93) (Washington), 798–803.

—— and R. J. Mooney. 1999. ‘Machine learning and natural language (introduction to special issue on natural language learning)’. Machine Learning, 34, 5–9.

Clark, P. and T. Niblett. 1989. ‘The CN2 induction algorithm’. Machine Learning, 3, 261–84.

Cohen, W. W. 1995. ‘Fast effective rule induction’. Proceedings of the 12th International Conference on Machine Learning (ICML ′95) (San Francisco), 115–23.

Cost, S. and S. Salzberg. 1993. ‘A weighted nearest neighbor algorithm for learning with symbolic features’. Machine Learning, 10(1), 57–78.

Cover, T. M. and P. E. Hart. 1967. ‘Nearest neighbor pattern classification’. IEEE Transactions on Information Theory, 13, 21–7.

Daelemans, W., J. Zavrel, P. Berck, and S. Gillis. 1996. ‘MBT: a memory-based part of speech tagger-generator’. Proceedings of the 4th Workshop on Very Large Corpora (Copenhagen), 14–27.

—— K. van der Sloot, and A. van den Bosch. 2001. TiMBL: Tilburg Memory Based Learner, version 4.0, Reference Guide. ILK Technical Report 01-04. http://ilk.kub.nl/papers/

Fayyad, U. M., G. Piatetsky-Shapiro, and P. Smyth. 1996. ‘From data mining to knowledge discovery’. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining. Cambridge, Mass.: MIT Press, 1–34.

Freitag, D. 1998. ‘Toward general-purpose learning for information extraction’. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and COLING-98 (ACL/COLING ′98) (Montreal), 404–8.

Haruno, M., S. Shirai, and Y. Ooyama. 1999. ‘Using decision trees to construct a practical parser’. Machine Learning, 34, 131–50.

Hermjakob, U. and R. J. Mooney. 1997. ‘Learning parse and translation decisions from examples with rich context’. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL ′97) (Madrid), 482–9.

Kazakov, D. and S. Manandhar. 1998. ‘A hybrid approach to word segmentation’. Inductive Logic Programming: Proceedings of the 8th International Conference (ILP ′98) (Madison, Wis.), 125–34.

Langley, P. 1996. Elements of Machine Learning. San Francisco, Calif.: Morgan Kaufmann.

Lavrac, N. and S. Dzeroski. 1994. Inductive Logic Programming: Techniques and Applications. New York: Ellis Horwood.

Ling, C. X. 1994. ‘Learning the past tense of English verbs: the symbolic pattern associator vs. connectionist models’. Journal of Artificial Intelligence Research, 1, 209–29.

—— and M. Marinov. 1993. ‘Answering the connectionist challenge: a symbolic model of learning the past tense of English verbs’. Cognition, 49(3), 235–90.

McCarthy, J. and W. Lehnert. 1995. ‘Using decision trees for coreference resolution’. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI ′95) (Montreal), 1050–5.

MacWhinney, B. and J. Leinbach. 1991. ‘Implementations are not conceptualizations: revising the verb model’. Cognition, 40, 291–6.

Magerman, D. M. 1995. ‘Statistical decision-tree models for parsing’. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL ′95) (Cambridge, Mass.), 276–83.

Manandhar, S., S. Dzeroski, and T. Erjavec. 1998. ‘Learning multilingual morphology with CLOG’. Inductive Logic Programming: Proceedings of the 8th International Conference (ILP ′98) (Madison, Wis.), 135–44.

Marcus, M., B. Santorini, and M. Marcinkiewicz. 1993. ‘Building a large annotated corpus of English: the Penn treebank’. Computational Linguistics, 19(2), 313–30.

Marquez, L., L. Padro, and H. Rodriguez. 1999. ‘A machine learning approach to POS tagging’. Machine Learning, 39(1), 59–91.

Mitchell, T. 1997. Machine Learning. New York: McGraw-Hill.

Mooney, R. J. 1996. ‘Comparative experiments on disambiguating word senses: an illustration of the role of bias in machine learning’. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ′96) (Philadelphia), 82–91.

—— and M. E. Califf. 1995. ‘Induction of first-order decision lists: results on learning the past tense of English verbs’. Journal of Artificial Intelligence Research, 3, 1–24.

Ng, H. T. and H. B. Lee. 1996. ‘Integrating multiple knowledge sources to disambiguate word sense: an exemplar-based approach’. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL ′96) (Santa Cruz, Calif.), 40–7.

—— and J. Zelle. 1997. ‘Corpus-based approaches to semantic interpretation in natural language processing’. AI Magazine, 18(4), 45–64.

Paliouras, G., V. Karkaletsis, and C. D. Spyropoulos. 1999. ‘Learning rules for large vocabulary word sense disambiguation’. Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI ′99) (Stockholm), 674–9.

Pinker, S. and A. Prince. 1988. ‘On language and connectionism: analysis of a parallel distributed model of language acquisition’. In S. Pinker and J. A. Mehler (eds.), Connections and Symbols. Cambridge, Mass.: MIT Press, 73–193.

Quinlan, J. R. 1986. ‘Induction of decision trees’. Machine Learning, 1(1), 81–106.

—— 1990. ‘Learning logical definitions from relations’. Machine Learning, 5(3), 239–66.

—— 1993. C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann.

—— 1996. ‘Bagging, boosting, and C4.5’. Proceedings of the 13th National Conference on Artificial Intelligence (AAAI ′96) (Portland, Ore.), 725–30.

Rumelhart, D. E. and J. L. McClelland. 1986. ‘On learning the past tense of English verbs’. In D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing, vol. ii. Cambridge, Mass.: MIT Press, 216–71.

Shavlik, J. W. and T. G. Dietterich (eds.). 1990. Readings in Machine Learning. San Mateo, Calif.: Morgan Kaufmann.

Soderland, S. 1999. ‘Learning information extraction rules for semi-structured and free text’. Machine Learning, 34, 233–72.

Soon, W. M., H. T. Ng, and C. Y. Lim. 2001. ‘A machine learning approach to coreference resolution of noun phrases’. Computational Linguistics, 27(4), 521–44.

Stanfill, C. and D. L. Waltz. 1986. ‘Toward memory-based reasoning’. Communications of the Association for Computing Machinery, 29, 1213–28.

Thompson, C. A. and R. J. Mooney. 1999. ‘Automatic construction of semantic lexicons for learning natural language interfaces’. Proceedings of the 16th National Conference on Artificial Intelligence (AAAI ′99) (Orlando, Fla.), 487–93.

Zelle, J. M. and R. J. Mooney. 1996. ‘Learning to parse database queries using inductive logic programming’. Proceedings of the 13th National Conference on Artificial Intelligence (AAAI ′96) (Portland, Ore.), 1050–5.