PRINTED FROM OXFORD HANDBOOKS ONLINE ( (c) Oxford University Press, 2015. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy).


Abstract and Keywords

This chapter introduces a conceptual framework for the evaluation of natural language processing systems. It characterizes evaluation in terms of four dimensions: intrinsic versus extrinsic evaluation, stand-alone systems versus components, manual versus automated methods, and laboratory versus real-world conditions. A comparative overview of evaluation methods in major areas of NLP is provided, covering distinct applications such as information extraction, machine translation, automatic summarization, and natural language generation. The discussion of these applications emphasizes commonalities across evaluation methods. Next, evaluation of particular component technologies is discussed, addressing coreference, word sense disambiguation and semantic role labelling, and finally referring-expression generation. The chapter concludes with a brief assessment of the status of evaluation in NLP.

Keywords: intrinsic evaluation, extrinsic evaluation, black-box evaluation, glass-box evaluation, component evaluation, inter-annotator reliability, informativeness, quality, adequacy, fluency

1 Introduction

NLP researchers today usually study naturally occurring corpora where variation is constrained by factors such as the goals of the language users, the modality (text versus speech), the intended audience (a single person, a specific group), the genre, subject matter, and so on. Evaluation is critical for establishing to what degree the results from an NLP system generalize within and across corpora. Additionally, evaluation plays an essential role in defining benchmark data sets and appropriate metrics for NLP applications of all sorts. Without these, comparison of systems and approaches is impossible.

The following section gives a broad overview of four dimensions of evaluation: intrinsic versus extrinsic evaluation, stand-alone system versus component evaluation, evaluation with manual versus automatically computed metrics, and real-world versus laboratory evaluations. We loosely organize the chapter around these dimensions, and conclude with a brief summary of open issues and suggested further reading.

2 Four Dimensions of Evaluation

  1. (i) Intrinsic evaluation tests how well a system meets its objectives, and extrinsic evaluation rates the system (e.g. in terms of efficiency and acceptability) in its operational context, which includes the people using the system (Sparck Jones and Galliers 1996). Extrinsic assessments provide a better sense of the system’s practical utility, and can potentially provide developers of individual components with feedback on utility-based factors. Evaluation of a component technology (see below) is often intrinsic, but can be extrinsic in an ablation study, where a system is operationally evaluated with and without specific components.

  2. (ii) Evaluation of a stand-alone application addresses a specific NLP task, as opposed to a component technology. A stand-alone application involves mapping from language or data input so as to produce linguistic or non-linguistic data output, where the mapping constitutes a particular application; examples include machine translation, information extraction, spelling correction, and automatic summarization. A component technology maps from one level of representation to another, without the mapping constituting a distinct application on its own. Examples of the latter include parsing, word-sense disambiguation, coreference, sentence planning for generation, etc. The stand-alone application of information extraction can include, if desired, parsing or coreference as components. Note that even in the case of a component technology, the system may have internal modules that can, if desired, be exposed for evaluation along with the system as a whole. This type of assessment, called a glass-box evaluation, is distinguished from a black-box evaluation, where the input and output of the system as a whole are evaluated without access to internal components. Both stand-alone applications and component technologies can be assessed in either a glass-box or a black-box fashion, with the latter being somewhat easier to implement though usually less insightful.

  3. (iii) Evaluation can use manual assessments or automatic metrics. For example, the widely used EAGLES methodology (EAGLES 1995), which is similar to the product evaluations found in consumer reports, assesses the functionality and usability of the system in terms of human judgements. EAGLES is based on ISO standards for ‘quality characteristics to be used in the evaluation of software products’. Humans judge the system based on checklists of critical features for different functional properties of the system. For MT, checklist items may include tools to aid users in submitting input in a variety of formats, to pre-edit and post-edit text, and to support portability, e.g. tools to extend the system’s linguistic coverage, handling of different language pairs, extensibility to a new genre of text, etc.

    Checklist judgements are subjective, but are more or less domain-independent. Creating and then evaluating with a checklist depends on human judgement, and requires time and effort. Objective evaluation depends on metrics that are both reliable and valid; ideally, they can be applied automatically, allowing more rapid turnaround for developers. A reliable metric discriminates among competitors with consistency across evaluation settings, and a valid one measures what it is supposed to (Krippendorff 1980; Sparck Jones and Galliers 1996). Much discussion of the evaluation of coreference (see below) addresses its validity. Often, an automatic metric is based on comparing system performance against a benchmark, human-annotated corpus that constitutes a gold standard. If the automatic measurements (i.e. their scores) have a strong correlation with the scores produced by humans, the automatic method can substitute for human judgement. However, when a system needs to be judged just once, human judgements may be preferable.

    Conducting system evaluations using measures of performance requires a basic knowledge of experimental design (Kirk 1968) and the methodology of testing for statistical significance (Cohen 1969; Siegel and Castellan 1988). The system must be tested on inputs that neither it nor the system developer has seen before; thus, the test corpus must be blind, i.e. disjoint from the development corpus (that the developer can inspect while designing the system) and the training corpus (that the machine learning system may have used). In general, as a maturing system evolves through multiple versions, it is useful to perform regression testing to assess changes in accuracy over the blind test set.

  4. (iv) Evaluation can occur in a real-world context, such as a usability test of a fielded dialogue system, or a laboratory evaluation that controls for evaluation parameters (Turunen et al. 2006). A laboratory evaluation can be intrinsic or extrinsic. Real-world evaluations involve the use of real or realistic users in actual settings, where it may be difficult to control key parameters. Laboratory evaluation constitutes a necessary first step; as a system becomes more mature, it can be deployed and tested in a real-world application. However, since real-world applications are hard to study in isolation, our discussion below is confined to laboratory evaluation.
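The significance testing mentioned under dimension (iii) can be illustrated with a very simple nonparametric procedure. The sketch below applies an exact two-sided sign test to paired per-item scores from two systems run on the same blind test set. The function name and the choice of the sign test are assumptions made for illustration; other tests may suit a given evaluation better.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Exact two-sided sign test on paired per-item scores (ties are dropped)."""
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # the systems never differ, so there is no evidence either way
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided if each system were
    # equally likely to win any given item.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-document scores for two systems on the same eight test items.
system_a = [1, 1, 1, 1, 1, 1, 1, 1]
system_b = [0, 0, 0, 0, 0, 0, 0, 0]
print(sign_test_p_value(system_a, system_b))
```

Because the test only looks at which system wins each item, it ignores the size of the differences; that makes it conservative but easy to apply to any per-item metric.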

3 Intrinsic versus Extrinsic Evaluation

3.1 Intrinsic evaluation paradigms

The type of intrinsic evaluation to apply depends on the mapping that the NLP system performs. This mapping can be among natural language utterances, as in the case of MT and summarization, or from natural language utterances to particular data representations (as in the case of information extraction), or vice versa (in the case of NL generation). Or else, one data representation can map to another; for example, semantic role labelling maps from syntactic parse trees to predicate–argument structures with argument role labels.

In most intrinsic evaluations, the corpus of documents that constitutes the gold standard is annotated based on a set of guidelines. To verify annotation quality, a subsample is usually annotated by multiple annotators and assessed using measures of inter-annotator reliability, e.g. an agreement coefficient such as kappa (Cohen 1960), or one of the many related metrics discussed in Artstein and Poesio (2008). (For more details, see Chapter 19, Corpus Annotation.)
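As a concrete illustration of an agreement coefficient, the following minimal sketch computes Cohen's kappa for two annotators who labelled the same items. Kappa compares the observed agreement with the agreement expected by chance, where chance is estimated from each annotator's own label distribution. The function name and the toy labels are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators independently pick the same category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical part-of-speech labels from two annotators for four tokens.
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "V", "V", "V"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.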

This ‘best-practice’ methodology is not quite a science, and the iterative development of guidelines can be expensive, with the timeline for preparing a gold standard being highly variable. Finally, the evaluation results can be specific to the particular genre or corpus used for system development. For example, a statistical part-of-speech tagger trained and evaluated on a newswire corpus may not do very well at tagging questions.

3.2 Extrinsic evaluation paradigms

Extrinsic evaluations involve testing a component or stand-alone application in terms of some other task. Thus, a component such as a coreference module may be assessed for its impact on the overall task of information extraction; or an application such as machine translation or automatic summarization may be assessed in terms of how accurately people can answer questions based on reading the translations or summaries. As the architecture of a system grows more complex, the influence of the component on the overall system becomes harder to characterize, even when exploring that influence using ablation.

Extrinsic evaluations can also be relatively indirect, as in measuring the influence of a system on the larger work environment, such as the use of MT on the overall workflow of an organization, or the impact of a particular type of information extraction capability on people’s use of search engines. Usually, the stakeholders of such evaluations are people concerned with impact on real users in laboratory or operational settings, or with possible commercial impact. Acceptance by users and actual deployment of a particular technology can be as important an indicator of success as performance metrics; in some cases, extrinsic evaluation may occur after transitioning the technology. This can also involve optimizing system performance or simplifying its functionality.

4 Evaluation of Stand-alone NLP Application Areas

4.1 Information extraction

Information extraction (see Chapter 35) extracts entities, relations, and events from natural language, and maps them to a structured representation such as frames or database tables. Measures for these tasks include Precision/Recall and slot error rate. Precision is the number of correctly detected instances over the total number of instances detected. Recall is the number of correctly detected instances over the number that should have been detected. Their harmonic mean, the F-measure (i.e. F1-measure), summarizes the trade-off between the two. Error rate is defined as the number of insertions (false alarms) + deletions (missed instances) + substitutions, divided by the total number of true instances.
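The measures just defined can be written down directly. The sketch below assumes the counts have already been obtained by aligning system output with the gold standard; the function names and the example counts are hypothetical. F1 is computed as the harmonic mean of precision and recall.

```python
def precision_recall_f1(correct, detected, expected):
    """Precision, recall, and F1 from raw counts.

    correct  -- instances the system detected correctly
    detected -- all instances the system detected
    expected -- all instances in the gold standard
    """
    precision = correct / detected if detected else 0.0
    recall = correct / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def slot_error_rate(insertions, deletions, substitutions, true_instances):
    """(Insertions + deletions + substitutions) over the number of true instances."""
    if true_instances <= 0:
        raise ValueError("true_instances must be positive")
    return (insertions + deletions + substitutions) / true_instances


# Hypothetical run: 10 instances detected, 8 of them correct, 16 in the gold standard.
print(precision_recall_f1(correct=8, detected=10, expected=16))
# The same run has 2 false alarms and 8 misses, with no substitutions.
print(slot_error_rate(insertions=2, deletions=8, substitutions=0, true_instances=16))
```

Note that the error rate, unlike precision and recall, is not bounded by 1: a system that produces many false alarms can have more errors than there are true instances.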

In the classic Named Entity evaluation for the Message Understanding Conferences (MUCs) (Grishman and Sundheim 1996; Hirschman 1998), systems were scored for precision and recall against a gold-standard corpus identifying proper names of persons, organizations, and locations, along with certain numerical expressions such as dates and money. The Automatic Content Evaluation conferences (ACE) (Doddington et al. 2004) extended this paradigm, with gold-standard corpora for identifying entities (e.g. people, organizations, locations) when referred to by nominals and pronouns, as well as proper names, and for identifying specified semantic relations. The latter include relations between people (e.g. employer), between people and organizations (e.g. CEO), part–whole relations between organizations (e.g. subsidiary of), and locations of entities. Finally, there are events such as births, marriages, attacks, and so on, along with entities that are participants in those events.

One evaluation innovation in ACE is a ‘value’ metric that subtracts from the perfect score (100%) the percentage of missing instances and the percentage of false alarms, while weighting each data element based on application interest. The latter helps tune the evaluation to application needs (say one in which accurate person-name coreference is more important than getting all locations correct), at the expense of transparency (or intuitive understandability) in the eventual value score. An issue with the ACE evaluations, however, is the difficulty of the task for humans. For example, Ji and Grishman (2008) found inter-annotator agreement on ACE’2005 English data to be only 40.3 F-measure for identifying and classifying event triggers (i.e. identifying the main word which expresses an event occurrence, along with its event type) and 50.6 F-measure on event argument identification (i.e. correctly identifying mentions of entities that are participants in the event). Simplifying the task might remedy the low agreement, in addition to potentially shortening the annotation guidelines and making the scoring easier to interpret. The ACE challenge, common for programmes that seek to extend the scope and relevance of NLP, is thus to achieve a balance between the sophistication of the task and the ability to evaluate it in a reliable and meaningful fashion.

4.2 Machine translation

There are two dimensions in terms of which systems that produce natural language output can be evaluated. Quality (also called fluency) is the extent to which the text is well-formed, understandable, and coherent. Informativeness is the extent to which a text preserves information content. The latter is called adequacy in the case of machine translation (see also Chapter 32) when the translation does not add any new information (though it is worth bearing in mind that due to mismatches and divergences across languages, new information may be required). An output judged to be of high quality may of course fail to preserve information, so both dimensions need to be measured.

Machine translation quality can be judged based on subjective grading. Because automatic style and grammar checkers (statistical or other) do not yield particularly insightful assessments, it is usually manual. Traditionally, such grading has been used to assess lapses in grammaticality, style, word choice, untranslated words, inappropriate rendering of proper names, and so on (ALPAC 1966; Nagao et al. 1985; Vilar et al. 2006). It is worth bearing in mind that quality measures often involve implicit task-related criteria, which can confound results (Sparck Jones and Galliers 1996).

Informativeness can be measured by comparing system output against the input, or against reference output. The former assesses whether a translation preserves the information in the source, without adding new information (ALPAC 1966). Reliable judgements against input are challenging due to lexical mismatches and syntactic divergence across languages. A problem with using reference output is that different experts can produce different equally informative translations. While judgements of relative informativeness of reference outputs can be carried out by a monolingual human, it is not clear how many reference outputs are enough, or what the unit should be. The shorter the reference segment (passage, sentence, clause, phrase), the less context there is for judging information content. Nevertheless, comparison against reference output has been a tradition in MT (e.g. Jordan et al. 1993; White 1995).

Comparing against reference output has also proven highly amenable to automated metrics. Given that a source text may have several possible reference translations, any such metric needs to take such multiplicity into account. Automated informativeness metrics here have included edit distance measures, n-gram-based comparison, and semantic comparison metrics. We now discuss these in turn.

Translation edit rate (TER) (Snover et al. 2006) is an automatic edit distance metric that computes the number of edits required to make a candidate translation identical to a reference translation; here, a sequence of movements that result in an entire phrase being moved are treated as a single error.
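The core of an edit-distance metric can be sketched as a word-level Levenshtein distance normalized by reference length. This is a simplification and not TER itself: it handles insertions, deletions, and substitutions against a single reference only, and it omits the phrase-shift operation described above, so a moved phrase is charged once per word rather than once in total. The function names are illustrative assumptions.

```python
def word_edit_distance(candidate, reference):
    """Minimum number of word insertions, deletions, and substitutions."""
    prev = list(range(len(reference) + 1))
    for i, cand_word in enumerate(candidate, start=1):
        curr = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if cand_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a candidate word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_rate_without_shifts(candidate, reference):
    """Edits per reference word for whitespace-tokenized strings (no shifts)."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(cand_tokens, ref_tokens) / len(ref_tokens)


# One missing word against a six-word reference.
print(edit_rate_without_shifts("the cat sat on mat", "the cat sat on the mat"))
```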

The automated BLEU (Bilingual Evaluation Understudy) metric (Papineni et al. 2002) is a modified precision metric that compares word n-grams in the system output against multiple reference translations, computing a precision score separately for n-grams of length 1 (i.e. comparing individual words), as well as higher-order n-grams (phrases of length 2, 3, etc.) that take word order into account. These different precision scores are then averaged (as a geometric mean) to give a pre-final score. Since an MT system can undergenerate, a somewhat ad hoc ‘brevity penalty’ is applied as a multiplier to the pre-final score so as to lower the score of candidate translations that are much shorter in length than reference translations.
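The computation just described can be sketched for a single tokenized sentence as follows. The sketch assumes uniform weights over n-gram orders 1 to 4, a geometric mean of the modified precisions, and a brevity penalty based on the reference whose length is closest to the candidate's. It applies no smoothing, so any n-gram order with no matches yields a score of zero. The function names are hypothetical, and evaluations normally compute BLEU over a whole test corpus rather than sentence by sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Candidate n-gram precision, clipping each count by its maximum reference count."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu_sketch(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of modified precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: one empty n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c = len(candidate)
    # Length of the reference closest in length to the candidate.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))
print(bleu_sketch("the cat is on the mat".split(), refs))
```

The clipping step is what makes the precision "modified": a candidate that repeats a common word cannot earn credit for it more often than it appears in any single reference.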

The NIST Open MT evaluations have relied on BLEU scores for all evaluations from 2005 through 2009, and follow Papineni et al. (2002) in using four reference translations. The impact of the number of reference translations was evaluated in Papineni et al. (2002), where BLEU scores using four reference translations were compared to those using one reference corpus, but from different translators. While the magnitude of BLEU scores was lower using the single reference corpus, the ranking of systems was the same.

The BLEU metric has fuelled considerable progress in evaluation of statistical MT systems, allowing such systems to be automatically trained by optimizing for a high BLEU score. As such, it provides an excellent example of the benefits and risks of good automated metrics. Callison-Burch et al. (2006) observe that statistical MT can optimize for high BLEU scores to achieve measurable performance improvements without necessarily achieving recognizably higher quality. Conversely, they point to SYSTRAN as an example of a translation system that does not use statistical methods, and that gets the highest scores from humans for fluency and adequacy in the 2005 OpenMT, but whose BLEU score is outranked by five other systems.

Semantic matching between a candidate and reference translation is difficult to automate because of the difficulties of parsing (especially ill-formed outputs) and semantic interpretation (especially the difficulty of aligning meaning elements across texts). However, one approach is to extend word matches to allow for classes of words that share the same word-stem, and in addition to allow synonyms to match, as in the METEOR metric (Banerjee and Lavie 2005; Denkowski and Lavie 2011).

There have been a large number of intrinsic MT evaluations. Here we focus on MetricsMATR (Przybocki et al. 2009), a large-scale evaluation of thirty-nine different automatic MT metrics carried out in 2008 by NIST. English machine translations of documents in Chinese, Arabic, and Farsi, along with their existing reference translations (four per translation) were assessed by NIST judges using subjective grading as to their informativeness (called ‘Adequacy’). Specifically, judges were asked to assess on a seven-point scale how much of the meaning of the reference translation was captured in the system translation, and in addition, whether or not the machine translation had ‘essentially the same meaning’ as the reference translation. The top fifteen metrics were able to correctly discriminate between machine and human translations of documents at least 90% of the time. However, the main finding was that the correlations of metrics with human judgement varied greatly depending on (i) the unit of analysis (segments of a document, whole documents, or entire systems across many documents), and (ii) whether one or more references were used. The top ten metrics (which included at least one of translation edit rate, n-gram matching, and semantic matching metrics) varied from moderate to high correlation with human judgement. No single metric consistently stood out. This suggests that a variety of metrics from these and other classes should continue to be explored.
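Correlation between an automatic metric and human judgement, as used in meta-evaluations of this kind, can be computed at whichever unit of analysis is chosen (segment, document, or system). The sketch below uses the Pearson coefficient; this is an assumption made for illustration, and a rank correlation may be more appropriate for ordinal scales. The scores shown are invented, not drawn from any evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Invented example: one metric score and one mean adequacy rating per system.
metric_scores = [0.21, 0.25, 0.30, 0.34]
human_adequacy = [3.1, 3.0, 4.2, 4.8]
print(pearson(metric_scores, human_adequacy))
```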

An example of extrinsic MT evaluation involves translation of instruction manuals, where it is possible to measure the efficiency of execution of translated instructions (e.g. Sinaico and Klare 1971). Accuracy in reading comprehension tests, where subjects read system (and also human) translations and then answer questions, has also been used (e.g. Orr and Small 1967; White 1995).

In sum, many issues in MT evaluation remain to be explored regarding metrics, human judgements of quality and informativeness, and the modelling and automation of semantic comparisons.

4.3 Automatic summarization

As in MT, evaluation of summarization systems (for more on text summarization, see Chapter 37) requires independent assessments of quality and informativeness. However, additional complications are that some loss of information is desirable, and in the case of multi-document summaries, there is a need to excise ‘redundant’ information that is repeated across documents. Further, a summary can be an extract, i.e. consisting entirely of material copied from the source document (or documents), or an abstract that includes wording not present in the source, as in the case of an opinion. In the latter case, especially when the abstract is not a paraphrase of the source, the informativeness judgements comparing the abstract against the rather different source or against other abstracts can present challenges that do not arise with extracts. Finally, summarization can be judged with respect to a set of users (or a topic, or a query), or as a ‘generic’ summary that is aimed at a broad audience.

Extractive summaries can inadvertently omit relevant context, thus leading to dangling anaphors and gaps in rhetorical structure or the lack of connection between topics. Overall coherence can be assessed by readability criteria. For example, Minel et al. (1997) had subjects grade readability of summaries based on dangling anaphors, failure to preserve the integrity of structured environments like lists or tables, ‘choppiness’ of the text, and so on. Abstracts, too, have been graded based on general readability criteria such as spelling and grammar, clear indication of the topic of the source document, impersonal style, conciseness, understandability, or acronyms being presented with expansions (Saggion and Lapalme 2000).

Informativeness has been measured using subjective grading or automated metrics to determine to what extent a summary covers information in the input document by means of an information extraction template (Paice and Jones 1993), a rhetorical structure for the text (Minel et al. 1997), or even a list of highlighted phrases in the input (Mani et al. 1998). Recently, Louis and Nenkova (2009) have demonstrated that an automatic metric, one that compares the distributions of words in the summary and the input, correlates well with human judgements of summary responsiveness (see below).

Comparison against reference summaries has relied on human-produced extracts and abstracts. Prior research has shown that human reference summaries can vary considerably (Rath et al. 1961; Salton et al. 1997); however, there is some evidence that they tend to agree more on the most important sentences to include (Marcu 1999), or on the most important semantic content to include. The latter idea has been explored in the Pyramid Method of Nenkova and Passonneau (2004), where human judgements are used to identify common concepts across reference summaries of the same length, ranking concepts by frequency.
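The frequency-based weighting behind the Pyramid Method can be sketched once the concepts have been identified by hand. The sketch assumes each summary is already represented as a set of concept labels, weights each concept by the number of reference summaries that express it, and scores a system summary as the weight it achieves divided by the best weight attainable with the same number of concepts. The names and the scoring details are simplifying assumptions, not a reproduction of the published procedure.

```python
from collections import Counter


def concept_weights(reference_concepts):
    """Weight each concept by how many reference summaries express it."""
    weights = Counter()
    for concepts in reference_concepts:
        weights.update(set(concepts))
    return weights


def pyramid_score(summary_concepts, weights):
    """Achieved concept weight over the best weight possible for the same concept count."""
    found = set(summary_concepts)
    observed = sum(weights[c] for c in found)  # unseen concepts weigh 0
    best = sum(sorted(weights.values(), reverse=True)[:len(found)])
    return observed / best if best else 0.0


# Hypothetical concept labels assigned by annotators to three reference summaries.
references = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
weights = concept_weights(references)
print(pyramid_score({"a", "c"}, weights))
```

In this toy case the summary expresses concepts with weights 3 and 1, whereas the best two-concept summary would reach 3 + 2, giving a score of 0.8.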

Automatic metrics here have used n-grams or semantic units. The BLEU-inspired automated ROUGE metric (Lin 2004) is a recall measure that compares word n-grams (or word-stems if desired) between system and reference summaries (n-grams that occur in more reference summaries are favoured), along with a bonus based on the brevity of the system summary. Semantic comparison using Basic Elements (BE) (Tratz and Hovy 2009) parses sentences and then extracts, by post-processing, head-dependent relationships, between a head of a syntactic phrase (e.g. noun, verb, preposition, etc.) and each of its arguments. Recent modifications to BE have included flexible matching of pronouns to names, synonym matches, abbreviation expansions, etc.
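The n-gram recall at the heart of ROUGE can be sketched as follows. The sketch pools counts over all reference summaries, so an n-gram found in several references contributes more, and it clips each match by the system summary's own count. It assumes pre-tokenized input and leaves out stemming and any length-based adjustment; the function names are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(system, references, n=2):
    """Share of reference n-grams (pooled over references) found in the system summary."""
    sys_counts = ngram_counts(system, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref, n)
        matched += sum(min(count, sys_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


system_summary = "the cat sat on the mat".split()
reference_summaries = ["the cat lay on the mat".split()]
print(rouge_n_recall(system_summary, reference_summaries, n=1))
print(rouge_n_recall(system_summary, reference_summaries, n=2))
```

Dividing by the reference n-gram count rather than the system n-gram count is what makes this a recall measure, in contrast to the precision orientation of BLEU.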

The Document Understanding Conference (DUC) evaluations of text summarization systems assessed generic and topic-focused summaries of English newspaper and newswire text. For DUC 2005, 2006, and 2007, four reference summaries were used for ROUGE scores and Pyramid scores. In Nenkova and Passonneau (2004) it was argued that scores stabilized given four reference summaries of the same length. The number of reference summaries needed for ROUGE was discussed in Lin (2004), where it was argued that while the number of reference summaries helped, the number of samples from each system mattered more. Criticisms of ROUGE include its inability to discriminate sufficiently between human and machine summaries (Conroy and Dang 2008). Various target sizes of summaries have been used (10–400 words), and both single- and multi-document summaries have been evaluated.

Since 2007, DUC and its successor, the Text Analysis Conference (TAC), have also evaluated update summaries, meaning summaries with new information on a given topic. Pyramid-based evaluations have also been conducted. In 2008, TAC also investigated query-focused summaries of opinions found in blogs. Summaries were manually judged for both content and readability, as well as on a five-point scale of how ‘responsive’ the summary was in satisfying the information need of the topic. (Such responsiveness measures are subject to many confounding factors, such as the lack of precise guidelines, and interactions between different aspects, such as informativeness and readability.) Automatic evaluations using ROUGE and BE were found to correlate well with human judgements of responsiveness, especially when the topics were specific. An additional n-gram metric that compares graphs of character n-grams (Giannakopoulos and Karkaletsis 2008) was found to correlate well with human responsiveness metrics.

As with MT, accuracy in reading comprehension tests has been used in extrinsic evaluations of summarization, e.g. (Morris et al. 1992). Relevance assessment (of documents to topics) has been used in extrinsic evaluations (Mani et al. 1998), in order to evaluate the effect of different summarization techniques on speed and accuracy of relevance assessment. More recently, Elhadad et al. (2005) have generated summaries of journal articles tailored to patients based on their medical records, and evaluated them by measuring the time it takes a medical expert to find information related to patient care.

Overall, summarization is still a field very much in search of evaluation measures that are valid for the nature of the compression involved, and metrics that can reliably discriminate among system summaries, or between system and human summaries.

4.4 Natural language generation

Natural language generation (see also Chapter 29) is often decomposed into a pipeline of three component functions: content determination (what to say), sentence planning, and surface realization (two aspects of how to say it). Sentence planning has been evaluated based on training and testing from a corpus of human-rated, machine-generated sentence plan trees (e.g. Stent et al. 2004) in the context of a spoken dialogue system. In comparison, the evaluation of surface realization, which maps from input semantic representations to the final surface form of a sentence, has been more developed. A typical evaluation technique used here is corpus regeneration: the source text is parsed to a semantic representation, to which the surface realization component is applied; the syntax of the generated text is then compared against that of the source text (e.g. Bangalore et al. 2000). This technique is suitable when there are likely to be very similar lexical choices across the texts being compared.

All the methods used in evaluating MT and Automatic Summarization for quality and informativeness have been applied to NLG. Thus, comparison against the input, in the case of evaluation of generated weather forecasts, has involved showing experts the raw forecast data, in the form of partly numeric tabular data, along with the textual forecasts (generated as well as reference texts), and soliciting judgements on quality and informativeness (Reiter and Belz 2008).

Extrinsic evaluations of NLG have sometimes used highly indirect methods. For example, Reiter et al. (2003) evaluated the STOP system, which generates personalized smoking-cessation letters, in terms of its medical effectiveness. Smokers were sent STOP-generated letters or controls, and the evaluation measured how many smokers in each group quit smoking.

Since NLG often relies on domain knowledge, portability has also been assessed in extrinsic evaluations of NLG. In Robin (1994), ‘Robustness’ was defined as the percentage of output test sentences that could be covered without adding new knowledge (linguistic and domain knowledge) to the system, and ‘Scalability’ measured the percentage of the knowledge base that consisted of new concepts that had to be added to cover the test sentences.

NLG is still a young field in terms of exploring evaluation methods. Evaluation of content determination and sentence planning, and the design of methods unique to NLG, are open issues. Evaluation methods specific to referring expression generation are discussed in section 5.3.

5 NLP Component Evaluation

There are many component processes that feature in a large number of applications. Although much of the evaluation of individual components to date is intrinsic, it does not follow that a component process judged to have superior performance in isolation is likely also to have superior extrinsic performance. Extrinsic evaluation of a component function C is rare because for N versions of C, the overall cost increases by a factor of N.

The first large-scale resource for intrinsic evaluation using a gold-standard corpus was the circa 1992 Penn TreeBank corpus of syntactic parse trees for Wall Street Journal text, which fostered enormous progress in part-of-speech tagging and parsing over the next decade. In this section, we highlight two more recent areas of component evaluation. For coreference resolution, which we briefly mentioned earlier, the standard evaluation suites are the two MUC corpora (1995 and 1998), which are relatively small, and the ACE corpora of 2004 and later. As discussed below, neither corpus is ideal, and the evaluation methodology is in flux, though gaining momentum due to the increasing availability of new corpora (Ng 2010).

5.1 Coreference resolution

Coreference resolution involves linking together mentions of the same entity in a given document. There has been an evolving debate over competing evaluation metrics. Issues include what to do about so-called singleton mentions that do not corefer with anything else, whether the scope of a coreference resolution component includes identifying all mentions or only the coreferring ones, and whether it includes discriminating between expressions that refer and those that do not (e.g. pleonastic it).

Differences in what is considered the scope of reference resolution lead to different annotation criteria for gold-standard corpora. Two extensively used sources of corpora are the earlier-mentioned MUC and ACE evaluations. In the relatively small and homogeneous MUC-6 and MUC-7 corpora (sixty news documents each), NPs that have no coreferent expression (singleton mentions) are not annotated, and in ACE corpora, only expressions that belong to an ACE entity type (see above) are annotated.

The MUC scoring algorithm (Vilain et al. 1995) treats reference resolution as the problem of finding all the coreferential chains of mentions of length two or more. It counts links in the chains, thus ignoring all singleton mentions, which omits many of the annotated NPs in ACE corpora from consideration. Several metrics have been proposed to improve upon the MUC score, including B3 (Bagga and Baldwin 1998), CEAF (Luo 2005) and its variants, and BLANC (Recasens and Hovy 2010). The various coreference metrics all use precision, recall, and F-measure, but differ in what they count. B3 counts all mentions, and assumes that the system and gold-standard mentions are identical. As a result, it cannot be used to evaluate how well a coreference resolver identifies all mentions. CEAF counts entities, and does so by finding the best one-to-one mapping from gold-standard entities to system entities. The variants of B3 and CEAF attempt to compensate for so-called twinless mentions, those which occur only in the gold standard or only in the system response (Stoyanov et al. 2009; Cai and Strube 2010). BLANC, an implementation of the Rand Index, rewards coreference links and non-coreference links, with separate recall and precision scores for each. So far, none of the proposals has been accepted as having ideal properties.
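The difference between what the link-based MUC score and the mention-based B3 score count can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions rather than any official scorer: chains are represented as sets of hypothetical mention identifiers, the function names are our own, and, as noted above for B3, the system and gold-standard mentions are assumed to be identical.

```python
from typing import Dict, Hashable, List, Set, Tuple

Chain = Set[Hashable]  # one entity = the set of its mention identifiers


def muc_recall(key: List[Chain], response: List[Chain]) -> float:
    """Link-based MUC recall; call with swapped arguments for precision."""
    index: Dict[Hashable, int] = {
        m: i for i, chain in enumerate(response) for m in chain
    }
    found_links = total_links = 0
    for chain in key:
        # Partition the key chain by the response chains; a mention that is
        # absent from the response forms a part of its own.
        parts = {index.get(m, ("missing", m)) for m in chain}
        found_links += len(chain) - len(parts)
        total_links += len(chain) - 1  # a singleton contributes no links
    return found_links / total_links if total_links else 0.0


def b_cubed(key: List[Chain], response: List[Chain]) -> Tuple[float, float]:
    """Mention-based B3, returned as (precision, recall)."""
    key_of = {m: chain for chain in key for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    if not key_of or key_of.keys() != resp_of.keys():
        raise ValueError("B3 assumes identical, non-empty mention sets")
    precision = recall = 0.0
    for m in key_of:
        overlap = len(key_of[m] & resp_of[m])
        precision += overlap / len(resp_of[m])
        recall += overlap / len(key_of[m])
    n = len(key_of)
    return precision / n, recall / n


def f_measure(precision: float, recall: float) -> float:
    """Balanced harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical example: the gold standard contains the singleton {"d"}.
key = [{"a", "b", "c"}, {"d"}]
response = [{"a", "b"}, {"c", "d"}]
print(muc_recall(key, response), muc_recall(response, key))
print(b_cubed(key, response))
```

In this toy example the singleton mention d plays no part in MUC recall, which is computed over the two links of the three-mention chain only, whereas B3 averages over all four mentions and so does register how d was handled.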

While there are many sources for the debate over coreference metrics, the most important is different views regarding the scope of reference resolution. Additionally, a general criticism levelled against recall-based measures is that they are overused, and do not always apply to the NLP tasks they are used for (Wilks 1999), because they require a fixed set of evaluation objects. With coreference, the mentions in a text can be enumerated, given a specific definition of mention, but it is harder to define, hence to enumerate, all possible entities. Other coreference metrics are possible; an alternative proposed in Passonneau (2006) applies an agreement coefficient (Krippendorff 1980), thus factoring out agreement between the system and gold standard that could have arisen by chance. Mentions are compared but not counted; for each mention, the set difference of the mention and all other mentions of the same entity produced by the system is compared with the corresponding value in the gold standard.

As noted above in the discussion of intrinsic evaluation, because all the metrics mentioned here are analogues of recall and precision, they all treat each link or mention or entity equally. Yet it is surely the case that not all entities mentioned in a document are equally important (nor all mentions), and that importance must be relative to some communicative goal. The ACE ‘value’ score mentioned earlier attempts to weight some data elements more than others, and similarly a weighted metric might be used here, but any such weighting can make the metric less transparent.

The evaluation of anaphora resolution is not the same as that of coreference resolution, since the two tasks produce different outputs. In anaphora resolution the system has to determine the antecedent of the anaphor; for nominal anaphora, any preceding NP which is coreferential with the anaphor is considered a correct antecedent. The objective of coreference resolution, on the other hand, is to identify all coreferential chains. For more on anaphora resolution evaluation, see Mitkov (2001).

5.2 Word sense disambiguation and semantic role labelling

One of the efficiencies of natural language is that every word has multiple meanings and uses. In general terms, a word sense disambiguation (WSD) component resolves the meaning of a word in its context, which of course depends on a method for representing word meaning (see Chapter 25 for more details). A second efficiency is that the same argument-taking word can have different syntactic realizations, sometimes but not always with the same meaning. The clause above, ‘a word sense disambiguation (WSD) component resolves the meaning of a word’, can be re-expressed in the passive voice as ‘the meaning of a word is resolved by a word sense disambiguation (WSD) component’. Active versus passive voice changes the perspective on an event, and permits the instigator of an action to be omitted, but does not change the meaning: ‘the clown popped the balloon’ and ‘the balloon was popped’ both entail ‘someone or something popped the balloon’. The job of semantic role labelling (SRL) is to resolve the syntactic arguments within a clause (and their grammatical roles) to a canonical predicate argument structure (see Chapter 24).

For polysemous words that have many syntactic realizations, WSD and SRL can overlap. For example, the verb name in the sense of ‘appoint’ or ‘award’ takes three core arguments. In the active voice, they are realized as the subject, direct object, and object of the preposition to: [The National Governor’s Association]1 named [New Jersey’s Governor Chris Christie]2 to [its executive committee]3. In its sense of ‘christen’ it can have three core arguments, none marked by to: [Clinton]1 named [his dog]2 [Buddy]3. Determining the sense of name in a given sentence could help in the SRL step of identifying all the relevant arguments. As noted in Màrquez et al. (2008), argument identification accounts for most of the errors in SRL for the CoNLL-2005 task, in comparison to the next step of argument labelling. Conversely, identifying the arguments of name in each sentence could facilitate WSD. Even where there is no overlap, WSD and SRL can be interdependent: resolving the sense of a noun can potentially facilitate argument identification and labelling, due to selectional constraints on verb arguments.

Until recently, WSD and SRL were evaluated independently of each other in the SensEval and SemEval evaluation efforts sponsored by ACL’s SIGLEX or CoNLL. However, SemEval-2007 included both WSD and SRL evaluation (Pradhan et al. 2007) based on the SRL-annotated PropBank corpus (Palmer, Gildea, and Kingsbury 2005). Here we briefly touch on some of the themes that affect the evaluation: defining the scope of the component, annotating the gold-standard corpora, design of reliable and valid evaluation metrics, and the role of intrinsic versus extrinsic evaluation of components.

For evaluation of WSD, annotated corpora require an annotation language for word senses (explicit tags or labels for each sense), or a method for identifying word sense that annotators can agree on. Annotating with explicit sense labels has relied on dictionaries for SENSEVAL-1 (Kilgarriff and Palmer 2000), or on other lexical resources. WordNet (Fellbaum 1998) has become widely used for this purpose (Edmonds and Kilgarriff 2002). Typically, annotators are asked to select a single sense, which can be overly restrictive if a given usage in the corpus is intermediate between a pair of sense labels from the resource. There have been proposals to allow annotators to select multiple senses (Veronis 1998), but implementing the proposal has raised issues for inter-annotator agreement (Dorr et al. 2010). Much debate is devoted to the balance between achieving good inter-annotator agreement and handling polysemous words, or using fine-grained versus coarse-grained sense inventories. Recently, there has been a push towards new corpora specifically for lexical research, such as DANTE (Kilgarriff 2010), or enhancements of existing corpora with word sense annotation, such as the MASC subset of the American National Corpus (Ide et al. 2010). DANTE’s word sense annotation method is still under development. MASC relies on WordNet. Concurrently, alternative annotation methods have been proposed, such as asking annotators to rate the applicability of all of a word’s WordNet senses (Erk and McCarthy 2009).

A widely adopted evaluation metric for WSD is accuracy. Accuracy, however, has clear shortcomings that make comparison across corpora and sense inventories difficult. One way to compensate for this is to report characteristics of the evaluation corpus, such as its average polysemy (Stevenson and Wilks 2001). Recently, as evidenced by SemEval-2007, there has been a push towards evaluation of performance on coarse-grained sense inventories (Pradhan et al. 2007).
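To illustrate why accuracy is usually reported alongside corpus characteristics, the following minimal Python sketch computes both accuracy and average polysemy for a toy test set. The sense inventory, the sense labels, and the function names are hypothetical, and average polysemy is taken here to mean the mean number of inventory senses per annotated token, which is one plausible reading rather than a fixed definition.

```python
from typing import Dict, List, Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Proportion of tokens whose predicted sense equals the gold sense."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def average_polysemy(words: Sequence[str],
                     inventory: Dict[str, List[str]]) -> float:
    """Mean number of senses the inventory lists per annotated token."""
    if not words:
        return 0.0
    return sum(len(inventory[w]) for w in words) / len(words)


# Hypothetical sense inventory and annotations, for illustration only.
inventory = {
    "name": ["appoint", "christen", "mention"],
    "bank": ["institution", "riverside"],
}
words = ["name", "bank", "name", "bank"]
gold = ["appoint", "institution", "christen", "riverside"]
predicted = ["appoint", "institution", "appoint", "institution"]

print(accuracy(gold, predicted))
print(average_polysemy(words, inventory))
```

The same accuracy figure is harder to achieve on a corpus with higher average polysemy, or with a finer-grained inventory, which is why the two numbers are more informative together than accuracy is alone.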

Evaluation of SRL typically involves two stages. The first, identification of a verb’s arguments, is essentially a parsing task. It is evaluated using precision, recall, and F-measure, with the assumption that the exact word boundaries of an argument must be identified. Labelling of the semantic roles is evaluated using classification accuracy. The most widely used corpus for SRL is PropBank, which uses framesets that account for syntactic alternations of the same sense, as illustrated above for the passive voice. It relies on theory-neutral semantic role labels (e.g. ARG0, ARG1), making it very general. On the other hand, this limits inferencing and generalization across verbs about specific roles (Pradhan et al. 2007).
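The two-stage scoring just described can be sketched in a few lines of Python. This is a minimal illustration, not the scorer of any shared task: arguments are represented as hypothetical (start, end) token spans mapped to role labels, an argument counts as identified only if its span matches the gold span exactly, and labelling accuracy is computed over the exactly matched spans, which is our assumption.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (first_token, last_token), inclusive


def srl_scores(gold: Dict[Span, str],
               system: Dict[Span, str]) -> Dict[str, float]:
    """Exact-span identification P/R/F plus label accuracy on matched spans."""
    matched = gold.keys() & system.keys()
    precision = len(matched) / len(system) if system else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    correct = sum(gold[span] == system[span] for span in matched)
    label_accuracy = correct / len(matched) if matched else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "label_accuracy": label_accuracy,
    }


# Tokens: 0 the, 1 clown, 2 popped, 3 the, 4 balloon
gold = {(0, 1): "ARG0", (3, 4): "ARG1"}
# The system labels 'the clown' correctly but truncates the second argument
# to 'balloon', so that span does not count as identified.
system = {(0, 1): "ARG0", (4, 4): "ARG1"}
print(srl_scores(gold, system))
```

Because of the exact-boundary assumption, the truncated span is penalized in both precision and recall even though its label is right, which shows how identification errors can dominate the overall result.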

5.3 Referring expression generation

Evaluating the component function of referring expression generation has been a focus of interest in automatic metrics for intrinsic evaluation in NLG, e.g. TUNA (Gatt et al. 2009) and GREC (Belz and Kow 2010). For example, in the TUNA-REG Shared-Task Evaluation Competition (Gatt et al. 2009), the system is given a set of entities, each with a set of attributes and values, and the goal is to generate a short description, such as a noun phrase, that picks out one of the entities (the referent) from the set of entities. The TUNA corpus consists of sets of entities with attributes and values, along with human-created descriptions of them; the latter are obtained by soliciting descriptions of pictures of the entities shown on a web page. Both edit distance and n-gram comparison (based on BLEU) have been used. Human judges assessed the informativeness (or Adequacy) and quality (or Fluency) of the descriptions generated by the systems. The scores on the automatic metrics did not significantly correlate with the human judgements, except for the counts of how often the edit distance was zero (i.e. identical matches between system and reference descriptions). In an extrinsic evaluation, the referring expression generation capability was assessed via a task of identifying the referent based on the generated description.
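As a rough illustration of the edit-distance comparison, the following Python sketch computes a word-level Levenshtein distance between a system description and a human reference, together with the proportion of exact matches (distance zero). Working at the word level with unit costs, and comparing against a single reference, are simplifying assumptions of ours; the example descriptions and function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over tokens, with unit insert/delete/substitute cost."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def exact_match_rate(pairs: List[Tuple[str, str]]) -> float:
    """Share of (system, reference) description pairs with edit distance zero."""
    if not pairs:
        return 0.0
    zero = sum(edit_distance(s.split(), r.split()) == 0 for s, r in pairs)
    return zero / len(pairs)


pairs = [
    ("the large red chair", "the red chair"),
    ("the small grey desk", "the small grey desk"),
]
for system_text, reference_text in pairs:
    print(edit_distance(system_text.split(), reference_text.split()))
print(exact_match_rate(pairs))
```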

6 Conclusion

While evaluation is critical to the field, there are few general guidelines about how to carry out NLP evaluations; each application area or component process adopts its own methods and metrics. Thus quality and informativeness play a role in evaluations of MT, summarization, and NLG, but are not measured the same way. Evaluation has borrowed methods from the natural sciences (e.g. experiments), software engineering (e.g. test suites), and the social and behavioural sciences (e.g. human factors). The one generalization to offer is that the best evaluation methodologies are essentially experimental, with a measurable criterion of success using metrics that are both reliable and valid.

As mentioned above, the gold-standard-based methodology for intrinsic evaluation is widely used but can be very expensive. Alternatives based on much faster, more accurate and lightweight annotation, some of it amenable to crowdsourcing, are therefore desirable. However, it is not clear how to develop more lightweight annotations for deeper semantic analysis, such as event extraction (ACE, TimeML), discourse parsing, and textual entailment. For coreference resolution, there has long been a push towards unsupervised methods that do not rely on annotated training data, with recent success for an unsupervised method that outperforms supervised ones (Haghighi and Klein 2010). Most SRL remains supervised, and in addition, is limited by the performance of syntactic parsing, with a perceived need to move towards unsupervised methods (Edmonds and Kilgarriff 2002).

Overall, evaluation has been critical in fostering the development of NLP systems and resources in recent decades, while also allowing for empirical comparisons of methods that are of interest to various stakeholders (researchers, funders, developers, users, etc.). It has also become an object of study in its own right, with considerable emphasis on the identification of different characteristics that evaluation metrics should have.

Further Reading and Relevant Resources

The Language Resources and Evaluation Conferences provide useful overviews and discussions of language evaluation research, as does the associated journal. For earlier evaluations, there are proceedings of the Message Understanding Conferences, the DARPA Speech Recognition Workshops, Human Language Technology Workshops, and Broadcast News Workshops. The Linguistic Data Consortium and ELRA (European Language Resources Association) have catalogues of language resources and test suites. NIST is a good source for evaluation tools and test suites.

References


ALPAC (1966). Language and Machines: Computers in Translation and Linguistics. A report by the Automatic Language Processing Advisory Committee, Division of Behavioral Sciences, National Academy of Sciences, National Research Council, Publication 1416. Washington, DC.

Artstein, Ron and Massimo Poesio (2008). ‘Inter-coder Agreement for Computational Linguistics’, Computational Linguistics 34(4): 555–596.

Bagga, Amit and Breck Baldwin (1998). ‘Algorithms for Scoring Coreference Chains’. In Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference (LREC ’98), Granada, Spain, 563–566. Paris, France: European Languages Resources Association.

Banerjee, Satanjeev and Alon Lavie (2005). ‘METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments’. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Sydney, Australia, 65–72. Stroudsburg, PA: Association for Computational Linguistics.

Bangalore, Srinivas, Owen Rambow, and Steve Whittaker (2000). ‘Evaluation Metrics for Generation’. In Proceedings of the First International Conference on Natural Language Generation (INLG 2000), Mitzpe Ramon, Israel, 1–8. Stroudsburg, PA: Association for Computational Linguistics.

Belz, Anja and Eric Kow (2010). ‘The GREC Challenges 2010: Overview and Evaluation Results’. In Proceedings of the 6th International Conference on Natural Language Generation (INLG 2010), Dublin, Ireland, 219–229. Stroudsburg, PA: Association for Computational Linguistics.

Cai, Jie and Michael Strube (2010). ‘Evaluation Metrics for End-to-End Coreference Resolution’. In Proceedings of SIGDIAL 2010: The 11th Annual Meeting of the Special Interest Group in Discourse and Dialogue, Tokyo, Japan, 28–36. Special Interest Group on Dialogue (SIGDIAL). Stroudsburg, PA: Association for Computational Linguistics.

Callison-Burch, Chris, Miles Osborne, and Philipp Koehn (2006). ‘Re-evaluating the Role of Bleu in Machine Translation Research’. In Diana McCarthy and Shuly Wintner (eds), EACL 2006: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, 249–256. Stroudsburg, PA: Association for Computational Linguistics.

Cohen, Jacob (1960). ‘A Coefficient of Agreement for Nominal Scales’, Educational and Psychological Measurement 20(1): 37–46.

Cohen, Jacob (1969). Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press.

Conroy, John and Hoa Trang Dang (2008). ‘Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality’. In Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, United Kingdom, 145–152. Stroudsburg, PA: Association for Computational Linguistics.

Denkowski, Michael and Alon Lavie (2011). ‘Meteor 1.3 Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems’. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation, Edinburgh, United Kingdom, 85–91. Stroudsburg, PA: Association for Computational Linguistics.

Doddington, George, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel (2004). ‘The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation’. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 837–840. Paris, France: European Languages Resources Association.

Dorr, Bonnie, Rebecca Passonneau, David Farwell, Rebecca Green, Nizar Habash, Stephen Helmreich, Eduard Hovy, Lori Levin, Keith Miller, Teruko Mitamura, Owen Rambow, Florence Reeder, and Advaith Siddharthan (2010). ‘Interlingual Annotation of Parallel Text Corpora: A New Framework for Annotation and Evaluation’, Natural Language Engineering 16(3): 197–243.

EAGLES (1995). ‘EAGLES: Evaluation of Natural Language Processing Systems’. Final report, EAGLES Document EAG-EWG-PR.2.

Edmonds, Philip and Adam Kilgarriff (2002). ‘Introduction to the special issue on evaluating word sense disambiguation systems’, Natural Language Engineering 8(4): 279–291.

Elhadad, Noemie, Kathleen McKeown, David Kaufman, and Desmond Jordan (2005). ‘Facilitating Physicians’ Access to Information via Tailored Text Summarization’. In Proceedings of the AMIA 2005 Annual Symposium, Washington, DC, 226–230. Bethesda, MD: American Medical Informatics Association.

Erk, Katrin and Diana McCarthy (2009). ‘Graded Word Sense Assignment’. In Philipp Koehn and Rada Mihalcea (eds), Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 440–449. Stroudsburg, PA: Association for Computational Linguistics.

Fellbaum, Christiane (ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

Gatt, Albert, Anja Belz, and Eric Kow (2009). ‘The TUNA-REG Challenge 2009: Overview and Evaluation Results’. In Proceedings of the 12th European Workshop on Natural Language Generation, Athens, Greece, 174–182. Stroudsburg, PA: Association for Computational Linguistics.

Giannakopoulos, George and Vangelis Karkaletsis (2008). ‘Summarization System Evaluation Revisited: N-gram Graphs’, ACM Transactions on Speech and Language Processing (TSLP) 5(3): 1–39.

Grishman, Ralph and Beth Sundheim (1996). ‘Message Understanding Conference-6: A Brief History’. In COLING-96: Proceedings of the 16th Conference on Computational Linguistics, Copenhagen, Denmark, 466–471. Stroudsburg, PA: Association for Computational Linguistics.

Haghighi, Aria and Dan Klein (2010). ‘Coreference Resolution in a Modular, Entity-Centered Model’. In Human Language Technologies: Proceedings of the 2010 Annual Conference of the North American Chapter of the Association of Computational Linguistics, Los Angeles, California, 385–393. Stroudsburg, PA: Association for Computational Linguistics.

Hirschman, Lynette (1998). ‘The Evolution of Evaluation: Lessons from the Message Understanding Conferences’, Computer Speech and Language 12(4): 283–285.

Ide, Nancy, Colin Baker, Christiane Fellbaum, Charles Fillmore, and Rebecca Passonneau (2010). ‘MASC: A Community Resource For and By the People’. In Jan Hajic, Sandra Carberry, and Stephen Clark (eds), ACL 2010: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 68–73. Stroudsburg, PA: Association for Computational Linguistics.

Ji, Heng and Ralph Grishman (2008). ‘Refining Event Extraction through Cross-Document Inference’. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 254–262. Stroudsburg, PA: Association for Computational Linguistics.

Jordan, Pamela, Bonnie Dorr, and Jack Benoit (1993). ‘A First-Pass Approach for Evaluating Machine Translation Systems’, Machine Translation 8(1–2): 49–58.

Kilgarriff, Adam (2010). ‘DANTE: A Detailed, Accurate, Extensive, Available English Lexical Database’. In Human Language Technologies: Proceedings of the 2010 Annual Conference of the North American Chapter of the Association of Computational Linguistics, Los Angeles, California, 21–24. Stroudsburg, PA: Association for Computational Linguistics.

Kilgarriff, Adam and Martha Palmer (2000). ‘Introduction to the Special Issue on SENSEVAL’, Computers in the Humanities 34(1–2): 1–13.

Kirk, Roger (1968). Experimental Design Procedures for the Behavioral Sciences. Belmont, CA: Wadsworth.

Krippendorff, Klaus (1980). Content Analysis. Newbury Park, CA: Sage Publications.

Lin, Chin-Yew (2004). ‘ROUGE: A Package for Automatic Evaluation of Summaries’. In Marie-Francine Moens and Stan Szpakowicz (eds), Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, 74–81. Stroudsburg, PA: Association for Computational Linguistics.

Louis, Annie and Ani Nenkova (2009). ‘Automatically Evaluating Content Selection in Summarization without Human Models’. In EMNLP 2009: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 306–314. Stroudsburg, PA: Association for Computational Linguistics.

Luo, Xiaoqiang (2005). ‘On Coreference Resolution Performance Metrics’. In HLT-EMNLP 2005: Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, 25–32. Stroudsburg, PA: Association for Computational Linguistics.

Mani, Inderjeet, Therese Firmin, David House, Michael Chrzanowski, Gary Klein, Lynette Hirschman, Beth Sundheim, and Leo Obrst (1998). ‘The TIPSTER SUMMAC Text Summarization Evaluation: Final Report’, Natural Language Engineering 8(1): 43–68.

Marcu, Daniel (1999). ‘Discourse Trees are Good Indicators of Importance in Text’. In Inderjeet Mani and Mark Maybury (eds), Advances in Automatic Text Summarization, 123–136. Cambridge, MA: MIT Press.

Màrquez, Lluís, Xavier Carreras, Kenneth Litkowski, and Suzanne Stevenson (2008). ‘Semantic Role Labeling: An Introduction to the Special Issue on Semantic Role Labeling’, Computational Linguistics 34(2): 145–159.

Minel, Jean-Luc, Sylvaine Nugier, and Gerald Piat (1997). ‘How to Appreciate the Quality of Automatic Text Summarization’. In Proceedings of the ACL/EACL’97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 25–30. Stroudsburg, PA: Association for Computational Linguistics.

Mitkov, Ruslan (2001). ‘Towards a More Consistent and Comprehensive Evaluation of Anaphora Resolution Algorithms and Systems’, Applied Artificial Intelligence: An International Journal 15(3): 253–276.

Morris, A., G. Kasper, and D. Adams (1992). ‘The Effects and Limitations of Automatic Text Condensing on Reading Comprehension Performance’, Information Systems Research 3(1): 17–35. In Inderjeet Mani and Mark Maybury (eds) (1999). Advances in Automatic Text Summarization, 305–324. Cambridge, MA: MIT Press.

Nagao, Makoto, Jun-ichi Tsujii, and Jun-ichi Nakamura (1985). ‘The Japanese Government Project for Machine Translation’, Computational Linguistics 11: 91–110.

Nenkova, Ani and Rebecca Passonneau (2004). ‘Evaluating Content Selection in Summarization: The Pyramid Method’. In Proceedings of the Joint Annual Meeting of Human Language Technology and the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), Boston, Massachusetts, 145–152. Stroudsburg, PA: Association for Computational Linguistics.

Ng, Vincent (2010). ‘Supervised Noun Phrase Coreference Research: The First Fifteen Years’. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 11–16 July, Uppsala, Sweden, 1396–1411. Stroudsburg, PA: Association for Computational Linguistics.

Orr, David and Victor Small (1967). ‘Comprehensibility of Machine-Aided Translations of Russian Scientific Documents’, Mechanical Translation and Computational Linguistics 10: 1–10.

Paice, Chris and Paul Jones (1993). ‘The Identification of Important Concepts in Highly Structured Technical Papers’. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM-SIGIR’93), Pittsburgh, Pennsylvania, 69–78. New York: Association for Computing Machinery’s Special Interest Group on Information Retrieval.

Palmer, Martha, Dan Gildea, and Paul Kingsbury (2005). ‘The Proposition Bank: An Annotated Corpus of Semantic Roles’, Computational Linguistics 31(1): 71–105.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). ‘BLEU: A Method for Automatic Evaluation of Machine Translation’. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, 311–318. Stroudsburg, PA: Association for Computational Linguistics.

Passonneau, Rebecca (2006). ‘Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation’. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), Genoa, Italy, 831–836. Paris, France: European Languages Resources Association.

Pradhan, Sameer, Edward Loper, Dmitriy Dligach, and Martha Palmer (2007). ‘SemEval-2007 Task 17: English Lexical Sample, SRL and All Words’. In SemEval ’07: Proceedings of the 4th International Workshop on Semantic Evaluations, Prague, Czech Republic, 87–92. Stroudsburg, PA: Association for Computational Linguistics.

Przybocki, Mark, Kay Peterson, Sebastien Bronsart, and Gregory Sanders (2009). ‘The NIST 2008 Metrics for Machine Translation Challenge: Overview, Methodology, Metrics, and Results’, Machine Translation 23(2–3): 71–103.

Rath, G., A. Resnick, and T. Savage (1961). ‘The Formation of Abstracts by the Selection of Sentences’, American Documentation 12(2): 139–143. In Inderjeet Mani and Mark Maybury (eds) (1999). Advances in Automatic Text Summarization, 287–292. Cambridge, MA: MIT Press.

                                                                                                          Recasens, Marta and Eduard Hovy (2010). ‘BLANC: Implementing the Rand Index for Coreference Evaluation’, Natural Language Engineering 17(4): 485–510.Find this resource:

                                                                                                            Reiter, Ehud and Anja Belz (2008). ‘An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems’, Computational Linguistics 35(4): 529–558.Find this resource:

                                                                                                              Reiter, Ehud, Roma Robertson, and Liesl Osman (2003). ‘Lessons from a Failure: Generating Tailored Smoking Cessation Letters’, Artificial Intelligence 144(1–2): 41–58.Find this resource:

                                                                                                                Robin, Jacques (1994). ‘Revision-based Generation of Natural Language Summaries Providing Historical Background: Corpus-based Analysis, Design and Implementation’. PhD thesis, Columbia University, Upper Manhattan, New York.Find this resource:

                                                                                                                  Saggion, Horacio and Guy Lapalme (2000). ‘Concept Identification and Presentation in the Context of Technical Text Summarization’. In Proceedings of the Workshop on Automatic Summarization, Hong Kong, China, 1–10. Stroudsburg, PA: Association for Computational Linguistics.Find this resource:

Salton, Gerard, Amit Singhal, Mandar Mitra, and Chris Buckley (1997). ‘Automatic Text Structuring and Summarization’, Information Processing and Management 33(2): 193–208. In Inderjeet Mani and Mark Maybury (eds) (1999). Advances in Automatic Text Summarization, 341–356. Cambridge, MA: MIT Press.

Siegel, Sidney and John Castellan (1988). Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.

Sinaico, Wallace and George Klare (1971). Further Experiments in Language Translation: Readability of Computer Translations. Arlington, VA: Institute for Defense Analyses.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul (2006). ‘A Study of Translation Edit Rate with Targeted Human Annotation’. In Proceedings of the Association for Machine Translation in the Americas, Cambridge, Massachusetts, 223–231. Stroudsburg, PA: Association for Machine Translation in the Americas.

Sparck Jones, Karen and Julia Galliers (1996). Evaluating Natural Language Processing Systems: An Analysis and Review. Lecture Notes in Artificial Intelligence 1083. Berlin, Germany: Springer Verlag.

Stent, Amanda, Rashmi Prasad, and Marilyn Walker (2004). ‘Trainable Sentence Planning for Complex Information Presentation in Spoken Dialog Systems’. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 79–86. Stroudsburg, PA: Association for Computational Linguistics.

Stevenson, Mark and Yorick Wilks (2001). ‘The Interaction of Knowledge Sources in Word Sense Disambiguation’, Computational Linguistics 27: 321–349.

Stoyanov, Veselin, Nathan Gilbert, Claire Cardie, and Ellen Riloff (2009). ‘Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art’. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, Singapore, 656–664. Stroudsburg, PA: Association for Computational Linguistics.

Tratz, Stephen and Eduard Hovy (2009). ‘BEwT-E for TAC 2009 AESOP’s Task’. In Proceedings of TAC-09. Gaithersburg, Maryland.

Turunen, Markku, Jaakko Hakulinen, and Anssi Kainulainen (2006). ‘Evaluation of a Spoken Dialogue System with Usability Tests and Long-term Pilot Studies’. In Proceedings of the Ninth International Conference on Spoken Language Processing (Interspeech 2006—ICSLP), Pittsburgh, Pennsylvania, 1057–1060. International Speech Communication Association.

Veronis, Jean (1998). ‘A Study of Polysemy Judgements and Inter-annotator Agreement’. In Web-based Proceedings of the SENSEVAL Workshop, Sussex, United Kingdom, 2–4.

Vilain, Marc, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman (1995). ‘A Model Theoretic Coreference Scoring Scheme’. In Proceedings of the 6th Message Understanding Conference, 45–52. San Mateo, CA: Morgan Kaufmann.

Vilar, David, Jia Xu, Fernando D’Haro, and Hermann Ney (2006). ‘Error Analysis of Statistical Machine Translation Output’. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 697–702. Paris, France: European Languages Resources Association.

White, John (1995). ‘Approaches to Black-Box MT Evaluation’. In Proceedings of MT Summit V, Luxembourg. Geneva, Switzerland: European Association for Machine Translation.

Wilks, Yorick (1999). ‘Book Review of K. Sparck Jones and J. Galliers, Evaluating Natural Language Processing Systems: An Analysis and Review’, Artificial Intelligence 107: 165–170.