PRINTED FROM OXFORD HANDBOOKS ONLINE (www.oxfordhandbooks.com). (c) Oxford University Press, 2015. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy).

# Machine Translation

## Abstract and Keywords

Machine Translation (MT) is and always has been a core application in the field of natural-language processing. It is a very active research area and it has been attracting significant commercial interest, most of which has been driven by the deployment of corpus-based, statistical approaches, which can be built in a much shorter time and at a fraction of the cost of traditional, rule-based approaches, and yet produce translations of comparable or superior quality. This chapter aims at introducing MT and its main approaches. It provides a historical overview of the field, an introduction to different translation methods, both rationalist (rule-based) and empirical, and a more in-depth description of state-of-the-art statistical methods. Finally, it covers popular metrics to evaluate the output of machine translation systems.

# 1 Introduction

Machine Translation (MT) is the field in language processing concerned with the automatic translation of texts from one (source) language into another (target) language. MT is one of the oldest applications of computer science, dating back to the late 1940s. However, unlike many other applications that quickly evolved and became widely adopted, MT remains a challenging yet interesting and relevant problem to address. After over sixty years of research and development with alternating periods of progress and stagnation, one can argue that we are finally at a stage where the large-scale adoption of existing MT technology is a real possibility. This results from a number of factors, including the evident scientific progress achieved particularly from the early 1990s with statistical methods, the availability of large collections of multilingual data for training such methods (and the reduction of the cost of processing such collections), the availability of open-source statistical systems, the popularization of online systems as a consequence of mainstream providers such as Google1 and Microsoft,2 and the increasing consumer demand for fast and cheap translations.

It was only in the last decade or so that MT started to be seen as useful not only in very limited domains, such as texts produced in controlled languages, but also for general-purpose translation. Building or adapting systems for specific domains will still lead to better translations, but this adaptation can now be done in more dynamic ways. Many translation service providers and institutions that need to translate large amounts of text have already adopted MT as part of their workflow. A common scenario, especially for translation service providers, is to use MT, either by itself or in combination with translation memories, terminology databases, and other resources, to produce draft translations that can then be post-edited by human translators (see Chapter 33, ‘Translation Technology’). This has been shown to reduce translation costs and time. Interactive MT is also an attractive approach, where translators revise and correct translations as they are produced and these corrections can then be used to guide the choices of the MT system in subsequent portions of the text being translated. Another possibility is the use of MT without any human intervention. This can generally be done in specific domains or in cases where perfect quality is not a crucial requirement for the translated documents: for example, companies may use MT to publish their product information/reviews directly in other languages so that they can reach a larger number of potential customers, or to translate internal communication documents, such as emails.

Four main approaches to MT can be distinguished in research and commercial environments: rule-based MT (RBMT), example-based MT (EBMT), statistical MT (SMT), and hybrid MT. Rule-based approaches consist of sets of rules that can operate at different linguistic levels to translate a text. These are generally handcrafted by linguists and language experts, making the process not only very language-dependent, but also costly and time-consuming. Designing rules to cover all possible language constructions is also an inherently difficult task. On the other hand, a mature and well-maintained rule-based system has the potential to produce correct translations in most cases where its rules apply. Rule-based systems can vary from direct translation systems, which use little or no linguistic information, to interlingual systems, which abstract the source text into a semantic representation language, to then translate it to another language. Intermediate systems based on transfer rules, generally expressed at the syntactic level, are the most successful type of rule-based system. These constitute the vast majority of systems used in commercial environments, and consist of rules to transfer constructions specific to a source language into a particular target language. These systems are discussed in section 3.

Example-based and statistical MT approaches make up the so-called corpus-based approaches: those that rely on a database of examples of translations to build translations for new texts, as opposed to using handcrafted rules. While statistical systems can sometimes be seen as an automatic way of extracting translation rules that make up a ‘translation model’ (which can also use linguistic information), in example-based systems the definition of a ‘model’ is not so clear. Example-based MT approaches fetch previously translated segments that are similar to the new segment to translate. They then produce translations for the entire source text by combining the target side of possibly multiple partial matching segments. Example-based systems are discussed in section 4.

Statistical approaches constitute the bulk of the current research in MT. These systems can vary from simple, word-by-word translation, to complex models including syntactic and even semantic information. While for most language pairs, state-of-the-art performance is achieved with reasonably shallow models based on sequences of words (phrase-based statistical MT), a number of novel and promising developments incorporate linguistic information into such models. The basic statistical approaches along with some recent developments are discussed in section 5.

A number of strategies for combining MT systems to take advantage of their individual strengths have been proposed, and these can be done at different levels. At one extreme, MT systems can be considered as black boxes and a strategy can be defined to select the best translation from many MT systems (system selection). These MT systems may be based on the same or different paradigms. The selection of the best translation can also be done by considering partial translations, such as phrases, particularly with statistical MT systems (system combination). At the other extreme, much tighter integration strategies can be exploited by creating hybrid systems that combine features of different types of MT paradigms, such as rule-based and example-based MT. Interested readers are referred to Way (2010a: ch. 19) for recent hybrid systems.

In the remainder of this chapter we give a historical overview of the field of MT (section 2), briefly describe the rule- and example-based MT approaches (sections 3 and 4), and then focus on statistical approaches (section 5), covering both phrase-based and tree-based variations. We also present a number of quality evaluation metrics for MT (section 6). We conclude with a discussion of perspectives on and future directions for the field (section 7). It is important to note that this book chapter is not intended to serve as an extensive survey of the field of MT, but rather to provide a basic introduction with a few pointers for more advanced topics.

# 2 History

The historical overview in this section is very much inspired by the detailed description in Hutchins (2007). MT began alongside the development of computers themselves, as a new application for machines which had proved so very useful in solving mathematical problems. The first documented effort came from Warren Weaver in 1949, when he proposed ways of addressing translation problems such as ambiguity by using techniques from statistics, information theory, and cryptography (Hutchins 1986). Major projects began in the US, the then USSR, the UK, and France, and the first public demonstration was in 1954 by researchers from IBM and Georgetown of a system that could translate forty-nine Russian sentences into English with a restricted vocabulary (250 words) and a set of six grammar rules (Hutchins 2007).

For about a decade, most of the projects concentrated on designing rules to translate specific constructions from one language to another, in what nowadays we would call a ‘direct approach’, with virtually no linguistic analysis. As early as 1956, Gilbert King had predicted that MT could be done by statistics, even though no one knew how to gather relevant data. Research on more theory-orientated approaches for MT also started around that time, using transfer and interlingual rules, which were based on the linguistic analysis and generation of the text at different levels. However, their implementation as computational systems only took place in the early 1970s (see section 3).

By the mid-1960s a number of research groups had been established across the world. After almost two decades of modest (yet significant) progress, given the pessimism from some researchers about further progress (Bar-Hillel 1960), the US government requested a reassessment of the field in 1964. A committee was commissioned to study the achievements of recent years and estimate how far the field could progress, and what the barriers to progress were. The outcome was the rather pessimistic ALPAC (Automatic Language Processing Advisory Committee) report stating that MT was slower, less accurate, and more expensive than human translation and that there was no ‘immediate or predictable prospect of useful machine translation’ (Pierce et al. 1966).

As a consequence of the ALPAC report, government funding for MT-related projects was drastically reduced, particularly in the US. This situation continued for almost a decade. Meanwhile, in Europe and Canada the demand for translating documents between official languages became evident. A number of projects flourished, focusing mostly on interlingua and transfer-based approaches. The transfer-based METEO system was proposed at the University of Montreal for translating weather forecasts from English to French. With specialized vocabulary and grammar, METEO was operational for twenty years (until 2001) and is one of the first success stories in MT.

Still in the 1970s, a number of innovative interlingual approaches were proposed using different formalisms as interlingua, at lower or higher abstraction levels. The less ambitious transfer-based approaches appeared to be a better option once again. At Grenoble University a transfer system was implemented (by Bernard Vauquois, who suggested the famous ‘Vauquois triangle’—see section 3). Methods inspired by artificial intelligence were proposed to improve MT quality. The idea was to use deeper semantic knowledge in order to refine the understanding of the text to be translated. This included the use of Yorick Wilks’ preference semantics and semantic templates (Wilks 1973a, 1973b) and Roger Schank’s conceptual dependency theory (Schank 1973), and resulted later in the development of expert systems and knowledge-based approaches to translation.

A decade after the ALPAC report, around 1976, the interest in MT resurged with a more modest ambition: to translate texts in very restricted domains or translate texts for gisting. Other commercial systems appeared and became operational. These included Systran,3 which had originally been proposed as a direct rule-based system in 1968, but was restructured as a transfer-based system in the late 1970s and extended from Russian–English translation to a wide range of languages. Systran was the world’s first major commercial, open-domain MT system, and it is still operational today.

METAL and Logos, along with Systran, were the three most successful general-purpose commercial systems, which could be customized by adapting their dictionaries. An example of a domain-specific system developed from 1976 is PAHO,4 a transfer-based system still in use by the Pan American Health Organization. Also during the 1980s, Japan had an important role in the development of domain-specific systems for computer-aided human translation. Early in that decade, MT systems were released for the newly created personal computers (Hutchins 2007).

Besides attracting significant commercial interest, research in transfer-based MT also restarted. More advanced systems like Ariane (Grenoble University) and SUZY (Saarbrücken University) used linguistic representations at different levels (dependency, phrase structure, logical, etc.) and a wide range of types of techniques (phrase structure rules, transformational rules, dependency grammar, etc.). Although they did not become operational systems, they have influenced a number of subsequent projects in the 1980s.

One such project was Eurotra, a large EU project aimed at a multilingual transfer system for all EU languages. The approach combined lexical, syntactic, and semantic information in a complex transfer model. It strongly stimulated MT research across different countries in Europe.

In parallel, after the mid-1980s, there was also a revival of interest in interlingua MT, with systems like DLT (Distributed Language Translation), which used a modified version of Esperanto as the intermediate language, and Rosetta, which exploited the Montague grammar as interlingua, both from the Netherlands. In Japan, a large interlingua system was PIVOT (NEC), which counted on participants from many of the major research institutes in Japan and other countries in Asia.

Following the interlingua approach, knowledge-based systems were proposed, where the interlingua is a rich representation in the form of a network of propositions, generated from the process of deep semantic analysis. KANT, developed at Carnegie Mellon University, was a domain-specific system and required the development of domain-dependent components (such as a lexicon of concepts). When used with a controlled language, it achieved satisfactory quality.

While in the 1980s MT was still not considered good enough to help human translators, other tools, such as Computer Aided Translation (CAT) systems, emerged, focusing on aiding professional translators (Kay 1980; Melby 1982). These included electronic dictionaries, glossaries, concordancers, and especially translation memories (see Chapter 33). Translation memory systems, which are still extensively used nowadays, work by searching for the most similar segment to the one that needs to be translated in a database of previously translated segments and offering its translation to the user for revision.

Inspired by the ideas and success of the translation memory systems, Makoto Nagao proposed an alternative to the rule-based approach in the early 1980s, based on examples of translations (Nagao 1984). As in translation memory systems, the translation process involves searching for analogous sequences of words that have already been translated in a corpus of source texts and their translations. The search for matching sequences (and their translations) was proposed by Nagao using linguistic information, which included having a syntactic representation of the source text and examples in the database (the matching is thus constrained by syntax) and a rich thesaurus to allow the similarity between words to be measured during the matching process. Once matching sequences are found, they have to be combined to compose the final translation. As will be discussed in section 4, most modern EBMT systems use statistical techniques for matching and recombination, which makes the boundaries between EBMT and SMT blurred.

The real emergence of empirical or corpus-based approaches for MT came with the proposal of statistical MT (SMT) in 1989. A seminal work by IBM Research proposed generative models (Brown et al. 1990) to translate words in one language to another based on statistics collected from parallel corpora with potential mutual translations (see section 5.1.1). Initially based on word-to-word translation, the statistical models showed surprisingly good results given the resources used and the complete absence of linguistic information. This stimulated research in the field to advance word-based models further into phrase-based models (Koehn et al. 2003) and structural models (Chiang 2005) which can also incorporate linguistic information, as will be discussed in section 5.2. SMT has been the most prominent approach to MT up to now and it appears that it will remain so for some time to come. For almost twenty years after the proposal of the initial word-based models, most of the developments in SMT remained in academia, through projects funded by government initiatives. As a consequence of some of these projects, a number of free, open-source MT and related tools have been released, including various toolkits for SMT such as Moses (Koehn et al. 2007), Joshua (Li et al. 2009), cdec (Dyer et al. 2010), and phrasal (Green et al. 2014). In the last decade or so, however, commercial interest in statistical MT has significantly increased. Evidence of this interest is companies such as Language Weaver5 (acquired by SDL in 2010) and Asia Online,6 dedicated to developing customizable SMT systems.

Research and development of rule-based MT also continued through the 1990s. Among a number of projects, the following can be mentioned (Hutchins 2007): CATALYST, a commercially successful joint effort between Carnegie Mellon University and Caterpillar for multilingual large-scale knowledge-based MT (interlingual approach) using controlled languages to translate technical documentation; ULTRA (Farwell and Wilks 1991), an interlingua system at the New Mexico State University; UNITRAN (Dorr 1993), an interlingua system based on the linguistic theory of Principles and Parameters at the University of Maryland; the Pangloss project,7 a collaboration between a number of US universities funded by the then ARPA (Advanced Research Projects Agency) for multi-engine interlingua-based translation; and the UNL (Universal Networking Language) project,8 sponsored mostly by the Japanese government for multilingual interlingua-based MT (section 3.3). More recent projects include Apertium 3, a platform for shallow transfer-based MT. Most research in rule-based MT is now dedicated to some form of hybrid rule-based and corpus-based MT.

For detailed historical descriptions of the field of MT please refer to several publications by John Hutchins (Hutchins 1986, 2000, 2007), Harold Somers (Hutchins and Somers 1992), and Yorick Wilks (Wilks 2009), among others.

# 3 Rule-Based MT (RBMT)

Despite the evident progress in statistical approaches to MT, the RBMT approach is still widely used especially in commercial systems, either on its own or in combination with corpus-based approaches.

RBMT approaches are traditionally grouped in three types: (i) direct, (ii) transfer, and (iii) interlingua. The main feature distinguishing these three types is the level of representation at which the translation rules operate. With direct approaches, rules are mostly based on words, hence the ‘direct’ translation, word-by-word. With transfer approaches, the rules operate at a more abstract level, including part-of-speech (POS) tags, syntactic trees, or semantic representations. The rules consist in transferring this representation from the source language into an equivalent representation in the target language. Additional steps of analysis (to generate this representation from the text in the source language) and generation (to generate target language words from the representation in the target language) are necessary. Interlingua approaches operate at an even more abstract level, where the representations are presumably language-independent. In such approaches, the transfer step is replaced by a deeper process of analysis of the source text into this language-neutral representation, followed by a more complex generation step from this representation into the target text. An analogy between these three classical approaches for RBMT and the different levels of linguistic knowledge that can be represented in transfer-based systems can be made using the famous Vauquois Triangle in Figure 1.

Figure 1 Rule-based approaches: Adapted from the Vauquois Triangle

To exemplify the different rules that could be produced according to these three different RBMT approaches, consider the following sentence in English and its translation in Portuguese:

Source: I saw him.

Target: Eu o vi.

With the direct approach, simple lexical rules, including the use of variables, such as in Rule 1, and localized reordering, could be produced. With a transfer approach, rules could exploit morphological and syntactic information, such as in Rule 2. Finally, with an interlingual approach, rules would map text or syntactic representations into a semantic representation, such as in Rule 3, where a subject–verb–object sequence is transformed into two semantic relations between the concepts representing the words in these three roles. Similar rules from the interlingual representation into text in the target language are necessary.

Rule 1: [X saw Y] → [X Y vi]

Rule 2: [Subject see Object] → [Subject Object ver]

Rule 3: [Subject see Object] → [agent(concept-see, concept-subj), object(concept-see, concept-obj)]

It is important to mention that while these three classical approaches are usually associated with RBMT, rules can also be learned using corpus-based methods: for example, a word dictionary for the direct approach can be extracted from a parallel corpus. Syntactic transfer rules can also be induced from a parallel corpus preprocessed with syntactic information. In fact, this is what is done in the state-of-the-art syntax-based SMT systems.

## 3.1 Direct RBMT approach

Generally speaking, a direct RBMT approach translates the text word by word, without intermediate structures. The most important source of information is a bilingual dictionary. Information about words, such as morphology, can also be used. For example, one can extract the lemmas of the words, translate the lemmas, and then regenerate the morphological information in the target language. Simple reordering can be done as part of the rules, as in Rule 1, or by using POS tags instead of words: for example, a rule stating that [Adjective Noun] (e.g. ‘beautiful woman’) in English becomes [Noun Adjective] in Portuguese (‘mulher bonita’).
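As an illustration only, the direct approach can be sketched as a dictionary lookup followed by one local reordering rule over POS tags. The lexicon entries, the tag names, and the function `translate_direct` are hypothetical and chosen for this example; the dictionary hard-codes gender agreement, which a real system would have to generate.

```python
# Toy bilingual dictionary (English -> Portuguese) with source-side POS tags.
# All entries are illustrative assumptions, not taken from any real system.
LEXICON = {
    "a": ("uma", "DET"),
    "beautiful": ("bonita", "ADJ"),
    "woman": ("mulher", "NOUN"),
}


def translate_direct(sentence: str) -> str:
    """Word-by-word lookup, then a local [ADJ NOUN] -> [NOUN ADJ] swap."""
    # Step 1: dictionary lookup; unknown words are copied through unchanged.
    tokens = [LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]

    # Step 2: apply the single reordering rule from left to right.
    out = []
    i = 0
    while i < len(tokens):
        is_adj_noun = (
            i + 1 < len(tokens)
            and tokens[i][1] == "ADJ"
            and tokens[i + 1][1] == "NOUN"
        )
        if is_adj_noun:
            out.extend([tokens[i + 1][0], tokens[i][0]])
            i += 2
        else:
            out.append(tokens[i][0])
            i += 1
    return " ".join(out)


print(translate_direct("a beautiful woman"))  # uma mulher bonita
```

Even in this toy form, every new context or word order needs its own rule or dictionary entry.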

The direct RBMT approach is straightforward to implement; however, it is very limited and difficult to generalize. One can end up with very large sets of rules to represent different contexts and word orders in which certain words may appear and the final translations can still suffer from incorrect ordering for long-distance relationships.

While this approach served as a good starting point for the development of RBMT, it does not produce good-quality translations. As a consequence, most current systems use more advanced representations, as we will discuss in the following sections.

## 3.2 Transfer RBMT approach

The idea behind the transfer approach is to codify contrastive knowledge, i.e. knowledge about the difference between two languages, in rules. Most systems are made up of at least three major components:

1. Analysis: rules to convert the source text into some representation, generally at the syntactic level, but possibly also at some shallow semantic levels. This representation is dependent on the source language. Analysis steps can include morphological analysis, POS tagging, chunking and parsing, and semantic role labelling. These steps require source language resources with morphological, grammatical, and semantic information.

2. Transfer: rules to convert (transfer) the source representations from the analysis step into corresponding representations in the target language, e.g. a parse tree of the target language. Transfer rules can involve complex modifications such as long-distance reorderings. They can also deal with word sense disambiguation, assignment of preposition attachment, etc. Transfer rules can operate at different levels: lexical transfer rules, structural transfer rules, or semantic transfer rules. This step requires bilingual resources relating source and target languages (such as a dictionary of words or base forms).

3. Generation: rules to convert from the abstract representation of the target language into text (actual words) in the target language, including, for example, morphological generation. This step requires target language resources with morphological, grammatical, and semantic information.
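The three components can be sketched as a small pipeline for the running example ‘I saw him.’ → ‘Eu o vi.’ This is a minimal sketch under strong assumptions: the `Clause` structure, the feature labels (`1sg`, `3sg_m`), all lookup tables, and the function names are hypothetical, and the analysis step only handles three-word subject–verb–object sentences.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Clause:
    """Hypothetical intermediate representation shared by the three steps."""
    subject: str                      # person/number features, e.g. "1sg"
    verb: str                         # verb lemma
    obj: str                          # person/number/gender features
    tense: str
    order: tuple = ("subject", "verb", "obj")


# Source-language (English) resources for analysis.
EN_PRONOUNS = {"i": "1sg", "him": "3sg_m"}
EN_VERBS = {"saw": ("see", "past")}

# Bilingual resource for transfer.
VERB_TRANSFER = {"see": "ver"}

# Target-language (Portuguese) resources for generation.
PT_SUBJECTS = {"1sg": "eu"}
PT_CLITICS = {"3sg_m": "o"}
PT_VERB_FORMS = {("ver", "past", "1sg"): "vi"}


def analyse(sentence: str) -> Clause:
    """Analysis: source text -> source-dependent representation."""
    subj, verb, obj = sentence.rstrip(".").lower().split()
    lemma, tense = EN_VERBS[verb]
    return Clause(EN_PRONOUNS[subj], lemma, EN_PRONOUNS[obj], tense)


def transfer(src: Clause) -> Clause:
    """Transfer, in the spirit of Rule 2: [Subject see Object] -> [Subject Object ver]."""
    return replace(src, verb=VERB_TRANSFER[src.verb],
                   order=("subject", "obj", "verb"))


def generate(tgt: Clause) -> str:
    """Generation: target representation -> inflected target words."""
    words = {
        "subject": PT_SUBJECTS[tgt.subject],
        "obj": PT_CLITICS[tgt.obj],
        "verb": PT_VERB_FORMS[(tgt.verb, tgt.tense, tgt.subject)],
    }
    text = " ".join(words[slot] for slot in tgt.order)
    return text[0].upper() + text[1:] + "."


def translate_transfer(sentence: str) -> str:
    return generate(transfer(analyse(sentence)))


print(translate_transfer("I saw him."))  # Eu o vi.
```

Only `transfer` and `VERB_TRANSFER` depend on the language pair. `analyse` and `generate` each use resources for a single language.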

The internal representation in transfer approaches can vary significantly. It is common to have rules including at least syntactic and lexical information, but they can also include semantic constraints, in either one or both languages. For example, Rule 2 could be enriched to indicate that the subject needs to be animate, making Rule 4:

Rule 4: [Subject[+animate] see Object] → [Subject[+animate] Object ver]

Because of the knowledge about both source and target languages’ grammar, morphology, etc. and their relation, transfer approaches can produce fluent translations. On the other hand, this is an expensive approach: each language requires its own analysis and generation modules and resources; in addition, for each language pair, specific transfer rules are necessary. In most cases, these rules are not bidirectional: that is, they will not apply to translations in both directions (source–target and vice versa), so two sets of transfer rules are necessary. Because of the complexity of the rules, systems implemented using the transfer approach are usually difficult to maintain: the addition of a rule requires a deep understanding of the whole collection of rules and of the consequences any change may have.

Systems such as Systran and PROMT9 are some of the most well-known and widely used examples of commercial, open-domain transfer RBMT systems. PAHO, the system by the Pan American Health Organization, is a successful example of a specific-purpose/domain system. Although less common, open-source RBMT systems are also available. Apertium10 is a free/open-source platform for shallow transfer-based MT developed by the Universitat d’Alacant, in Spain. Besides resources for some language pairs, it provides language-independent components that can be instantiated with linguistic information and transfer rules for specific language pairs. It also provides tools to produce the resources necessary to build an MT system for new language pairs. Another example of a free, open-source system is OpenLogos.11 Although it was created as a commercial system in 1972, in 2000 it became available as open-source software.

## 3.3 Interlingua RBMT approach

Interlingua is a term used to define both the approach and the intermediate representation used in interlingua RBMT systems. An interlingua is, by definition, a conceptual, language-neutral representation. The main motivation for this approach is its applicability to multilingual MT systems, as opposed to bilingual MT systems. Instead of language-to-language components, interlingua approaches aim to extract the meaning of the source text so that it can be transformed into any target language. Interlingua approaches thus have two main steps, which are in principle completely independent from each other and share only the conceptual representation:

1. Analysis of the source text into a conceptual representation: this process is independent from the target language.

2. Generation of the target text from the conceptual representation: this process is independent from the source language.

Figure 2 A multilingual RBMT system with the transfer approach (three languages, twelve modules)

Figure 3 A multilingual RBMT system with the interlingua approach (three languages, six modules)

The interlingua approach avoids the problem of proliferation of modules in a multilingual environment. Adding a new language to the system requires fewer modules than with transfer-based approaches: only new analysis and generation modules for that language are necessary. Translation from and to any other language in the system can then be performed. For example, a multilingual transfer system performing translation from and to three languages (say English, French, and German) will require twelve modules: three for the analysis of each language, three for the generation of each language, and six for the transfer in both directions (see Figure 2). A multilingual interlingua system, on the other hand, will require only six modules: three for the analysis of each language and three for the generation of each language (see Figure 3).
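The module counts in this example generalize to any number of languages. Under the same counting scheme (one analysis and one generation module per language, plus one transfer module per ordered language pair), a system with n languages needs n + n + n(n − 1) modules with transfer and 2n with an interlingua. The function names below are hypothetical.

```python
def transfer_modules(n: int) -> int:
    """n analysis + n generation + one transfer module per ordered pair."""
    return n + n + n * (n - 1)


def interlingua_modules(n: int) -> int:
    """n analysis + n generation; no pair-specific modules."""
    return 2 * n


for n in (3, 4, 10):
    print(n, transfer_modules(n), interlingua_modules(n))
# 3 12 6
# 4 20 8
# 10 110 20
```

The transfer count grows quadratically with the number of languages, while the interlingua count grows linearly.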

Intuitively, with the interlingua approach it can also be simpler to write analysis/generation rules, since they require knowledge of a single language. On the other hand, this approach assumes that all necessary steps to transform a text into a conceptual representation are possible and accurate enough. Accurate deep semantic analysis is however still a major challenge. Choosing/specifying a conceptual representation is also a complex task. Such a representation should be very expressive to cover all linguistic variations that can be expressed in any of the languages in the multilingual MT system. However, a very expressive language will require complex analysis and generation grammars. Moreover, it can be argued that by abstracting from certain linguistic variations, useful/interesting information can be lost, such as stylistic choices.

One of the largest and most recent efforts towards interlingua RBMT is the Universal Networking Language (UNL) Project.12 UNL stands for both the project and its representation language. In UNL, information is represented sentence by sentence as a hypergraph composed of:

• A set of hypernodes, which represent concepts (the Universal Words, or UWs). In order to be human-readable, UWs are expressed using English words. They consist of a headword, e.g. ‘night’, and optionally a list of constraints to disambiguate or further describe the general concept by indicating its connection with other concepts in the UNL ontology, e.g. night(icl>natural_world), where ‘icl’ stands for ‘is a kind of’.

• Attributes, which represent information that cannot be conveyed by UWs or relations, including tense (‘@past’, ‘@future’), reference (‘@def’, ‘@indef’), modality (‘@can’, ‘@must’), focus (‘@topic’, ‘@focus’, ‘@entry’), etc.

• A set of directed binary labelled links between concepts representing semantic relations. Relations can be ontological (such as ‘icl’ = is a kind of and ‘iof’ = is an instance of), logical (such as ‘and’ and ‘or’), or thematic (such as ‘agt’ = agent, ‘tim’ = time, ‘plc’ = place).

Figure 4 Example of UNL graph for the sentence ‘The night was dark!’

For example, the English sentence ‘The night was dark!’ could be represented in UNL as in Figure 4:

where:

Concepts: night(icl>natural_world), dark(icl>color)

Attributes: @def, @past, @exclamation, @entry

Semantic relations: aoj(UW1,UW2), where ‘aoj’ = attribute of an object

The final textual representation for this sentence in UNL would be the following:

aoj(dark(icl>color).@entry.@past.@exclamation, night(icl>natural_world).@def)
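The way the textual form is assembled from concepts, attributes, and a relation can be illustrated with a minimal Python sketch. The class `UniversalWord` and the function `render_relation` are hypothetical names introduced here for illustration, not part of any UNL tool, and joining several constraints with a comma is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class UniversalWord:
    """A UW: headword, optional constraints, and attached attributes."""
    headword: str
    constraints: list = field(default_factory=list)  # e.g. ["icl>natural_world"]
    attributes: list = field(default_factory=list)   # e.g. ["@def"]

    def render(self) -> str:
        text = self.headword
        if self.constraints:
            # Assumption: multiple constraints are separated by commas.
            text += "(" + ",".join(self.constraints) + ")"
        # Attributes are appended with a dot, as in the example above.
        text += "".join("." + attribute for attribute in self.attributes)
        return text


def render_relation(label: str, source: UniversalWord, target: UniversalWord) -> str:
    """Render one directed binary relation between two UWs."""
    return f"{label}({source.render()}, {target.render()})"


dark = UniversalWord("dark", ["icl>color"], ["@entry", "@past", "@exclamation"])
night = UniversalWord("night", ["icl>natural_world"], ["@def"])
unl = render_relation("aoj", dark, night)
```

Here `unl` holds the one-relation representation of ‘The night was dark!’; a full sentence would generally need a list of such relations.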

While the UNL Project provides customizable tools to convert a source language into the UNL representation and vice versa, instantiating these tools requires significant effort. The representation itself has been criticized in various ways, but the project is still active in a number of countries, particularly through the UNL Foundation.13

# 4 Example-based MT (EBMT)

Example-based MT (EBMT) can be considered the first type of corpus-based MT approach. It has also been called ‘analogy-based’ or ‘memory-based’ translation. It uses, at run-time, a corpus of already translated examples aligned at the sentence level and three main components:

1. Matching: a process to match new input (source) text against the source side of the example corpus.

2. Extraction: sometimes called ‘alignment’, a process to identify and extract the corresponding translation fragments from the target side of the example corpus.

3. Recombination: a process to combine the partial target matches in order to produce a final complete translation for the input text.

The intuition behind the approach, as put by its proposer Makoto Nagao, is the following:

Man does not translate a simple sentence by doing deep linguistic analysis, rather, man does the translation, first, by properly decomposing an input sentence into certain fragmental phrases (very often, into case frame units), then, by translating these fragmental phrases into other language phrases, and finally by properly composing these fragmental translations into one long sentence. The translation of each fragmental phrase will be done by the analogy translation principle with proper examples as its reference …

The corpus of examples needs to be aligned at the sentence level, as in SMT (section 5). One example of such a type of corpus that is freely available is that of the European Parliament.14 However, depending on how the matching and recombination components are defined, further constraints are desirable in corpora for EBMT. These are similar to those of translation memory systems, particularly with respect to the consistency of the examples: ideally, the same segment (with a given meaning) should not have different translations in the corpus.

The way examples are stored is directly related to the matching technique that will be used to retrieve them. If examples are stored as strings, simple distance-based string-matching techniques are used. In his original proposal, Nagao suggested a thesaurus-based measure to compute word semantic similarity for inexact matches. A common representation is to use tree structures, including constituency and dependency trees. Tree unification techniques, among others, can be exploited as a similarity metric for the matching process. Depending on the types of additional information that are used (variables, POS tags, and syntactic information), one can have literal examples (words/sequences of words), pattern examples (variables instead of words), or linguistic examples (context-sensitive rewrite rules with or without semantic features, like transfer rules). The matching component is thus a search process that can be more or less linguistically motivated, depending on the way the examples are described. Besides exact string matching, even the simplest similarity metrics can consider deletions and insertions, some word reordering, and morphological and POS variants. One such technique is to store examples as strings with variables to replace symbols and numbers, such as:

Push button A for X seconds.

and then use exact matching techniques to match similar examples such as the following, where B and Z are the unmatched parts of the sequence and the remaining strings fully match.

Push button B for Z seconds.
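A minimal Python sketch of this kind of pattern example follows. The functions `compile_template` and `translate` are hypothetical, and the Spanish target side of the example is invented for illustration; a real EBMT system would retrieve it from the example corpus.

```python
import re

PUNCT = ".,;:!?"


def compile_template(template, variables):
    """Turn a stored example with variables into a regular expression."""
    parts = []
    for token in template.split():
        core = token.rstrip(PUNCT)
        trail = token[len(core):]
        if core in variables:
            # A variable slot matches any single non-space token.
            parts.append(f"(?P<{core}>\\S+?)" + re.escape(trail))
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts))


def translate(sentence, source_template, target_template, variables):
    """Match the input against the source side; fill the slots on the target side."""
    match = compile_template(source_template, variables).fullmatch(sentence)
    if match is None:
        return None
    output = []
    for token in target_template.split():
        core = token.rstrip(PUNCT)
        if core in variables:
            output.append(match.group(core) + token[len(core):])
        else:
            output.append(token)
    return " ".join(output)


result = translate(
    "Push button B for Z seconds.",
    "Push button A for X seconds.",
    "Pulse el botón A durante X segundos.",  # invented target side
    {"A", "X"},
)
```

Because the literal parts must match exactly, any input that differs outside the variable slots is rejected and would have to be handled by approximate matching.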

If an exact match is found between the input text and a translation example, the extraction step is trivial and there is no need for recombination. However, unless the translation task is highly repetitive, in most cases the matching procedure will retrieve one or more approximate matches. If examples are stored as tree structures, the matching is performed at the sub-tree level, and thus extraction is not necessary and recombination works using standard tree unification techniques. When examples are not stored as aligned trees, the extraction and recombination processes play an even more important role.

Extraction techniques to find translation fragments include word alignment as used in SMT (section 5.1.1). To combine these fragments, techniques common in SMT such as the language modelling of the target language (section 5.1.3) and even a standard SMT decoder (section 5.1.5) can also be used in EBMT. In fact, since SMT is able to deal with sequences of words as translation units (as opposed to single words), in modern EBMT systems the recombination step can be exactly the SMT decoding process applied to select the best fragments found in the matching and extraction steps and place them in the best order. Among other things, treating the recombination step as a decoding process mitigates the effects of inconsistencies in the corpus of translation examples and allows a more probabilistic modelling of the translation problem. In other words, redundancies in the training corpus (including those with inconsistent translations) will result in different translation candidates, with the best candidate chosen according to information such as their frequency.

In spite of this evident convergence between EBMT and SMT, the matching of new and existing source segments is still significantly different in the two approaches. In SMT a bilingual dictionary of words, short phrases, or trees is extracted from the set of translation examples at system-building time and therefore the matching of a new input text is restricted to these pre-computed units. Additionally, these units are already bilingual, eliminating the need for the extraction process. In EBMT the matching is performed for every new input text to translate at system run-time, generally looking for the longest possible match. In that respect, provided that the set of translation examples is correct and consistent, EBMT is able to ensure that for any previously translated segment, regardless of its length, a correct translation will be retrieved. SMT, on the other hand, is less likely to extract long enough segments for its dictionary, unless they are highly redundant. The process of combining many smaller segments in SMT can naturally result in a translation that is not the same as the previously translated example.

Examples of modern, open-source EBMT systems are CMU-EBMT15 (Brown 2011), based on the Pangloss project, OpenMaTrEx16 (Dandapat et al. 2010), and CUNEI (Phillips 2011). For a recent overview of the field, we refer the reader to Way (2010b).

# 5 Statistical MT (SMT)

Statistical machine translation, like EBMT, uses examples of translations to translate new texts. However, instead of using these examples at run-time, most SMT approaches use statistical techniques to ‘learn’ a model of how to translate texts beforehand. The core of SMT research has developed over the last two decades, after the seminal paper by Brown et al. (1990). The field has progressed considerably since then, moving from word-to-word translation to phrase translation and other more sophisticated models which take sentence structure and semantics into account.

Figure 5 The Noisy Channel Model

SMT is inspired by the 1940s’ view of translation as a cryptography problem where a decoding process is needed to translate from a foreign ‘code’ into the English language (Hutchins 1997). Through the application of the Noisy Channel Model (Shannon 1949) (see Chapter 12), this idea forms the basis for the fundamental approach to SMT. The use of the Noisy Channel Model assumes that the original text has been accidentally encrypted and the goal is to find the original text by ‘decoding’ the encrypted version, as depicted in Figure 5. The message I is the input to the channel (text in a native language), that gets encrypted into O (text in a foreign language) using a certain coding scheme. The goal is to find a decoder that can reconstruct the input message as faithfully as possible into $\hat{I}$.

Finding $\hat{I}$, i.e. the closest possible text to I, can be framed as finding the argument that maximizes the probability of recovering the original (noise-free) input given the noisy text. This problem is commonly defined as the task of translating from a foreign language sentence f into an English sentence e. Given f, we seek the translation e that maximizes P(e|f), i.e. the most likely translation:

$$\hat{e} = \arg\max_{e} P(e \mid f)$$

Applying Bayes’ Theorem, this problem can be decomposed in subproblems, which are modelled independently based on different resources:

$$\arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}$$

where P(f) can be disregarded, since the input for the translation task f is constant across all possible translations e, and thus will not contribute to the maximization problem. The basic model can therefore be rewritten as:

$$\hat{e} = \arg\max_{e} P(f \mid e)\, P(e)$$

Following the Noisy Channel Model interpretation, the original message e gets distorted into a foreign language message f, and the translation task consists in recovering a close enough representation of the original message, $\hat{e}$. This is based on some prior knowledge of what e could look like, that is P(e), and some evidence (likelihood) of how the input message gets distorted into the foreign language message, P(f|e). The combination (posterior) maximizing these prior and likelihood models will lead to the best hypothesis $\hat{e}$.

The process of decomposing the translation problem into subproblems and modelling each of them individually is motivated by the fact that more reliable statistics can be collected using two possible knowledge sources, one bilingual and one monolingual. The two generative models which result from the decomposition of P(e|f) are commonly referred to as (i) the translation model, P(f|e) (built from a bilingual corpus), which estimates the likelihood that e is a good explanation for f; in other words, the likelihood that the translation is faithful to the input text; and (ii) the language model, P(e) (built from a monolingual corpus), which estimates how good a text in the target language e is, and aims at ensuring a fluent translation. These models correspond to two of the fundamental components of a basic SMT system. The third fundamental component of an SMT system, the decoder, is a module that performs the search for the optimal combination of translation faithfulness, estimated by P(f|e), and translation fluency, estimated by P(e), resulting in the presumably best translation $\hat{e}$.
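The interplay of the two models can be illustrated with a toy Python sketch in which the ‘decoder’ simply enumerates a handful of candidates. All probabilities below are invented for illustration, and `noisy_channel_decode` is a hypothetical helper; a real decoder searches a vastly larger space.

```python
import math


def noisy_channel_decode(f, candidates, translation_model, language_model):
    """Return the candidate e that maximizes P(f|e) * P(e), computed in log space."""
    def score(e):
        return math.log(translation_model[(f, e)]) + math.log(language_model[e])
    return max(candidates, key=score)


f = "la bruja verde"
# P(e): fluency of each candidate in the target language (invented values).
language_model = {"the green witch": 0.02, "the witch green": 0.0005}
# P(f|e): faithfulness of each candidate to the input (invented values).
translation_model = {
    (f, "the green witch"): 0.3,
    (f, "the witch green"): 0.4,
}

best = noisy_channel_decode(f, list(language_model), translation_model, language_model)
```

In this toy setting the word-by-word candidate has the higher translation model score, but the language model outweighs it, so the fluent candidate is selected.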

This noisy channel generative approach to SMT has been later reformulated using discriminative training approaches, such as in Och and Ney (2002). This reformulation makes it easier to extend the basic approach to add a number of independent components that represent specific properties of alternative translations for a given input text, and to learn the importance of each of these components. These new components, along with the original language and translation models, are treated as feature functions. The relative importance (weight) of each feature is learned by discriminative training to directly model the posterior probability P(e|f) or minimize translation error according to an evaluation metric (Och 2003).

A common strategy to combine these feature functions and their weights uses a linear model with the following general form:

$$P(e \mid f) \propto \exp\left(\sum_{i=1}^{n} \lambda_i\, h_i(e, f)\right)$$

where the overall probability of translating the source sentence into a target sentence is given by a combination of n model components $h_i(e,f)$ to be used during the decoding process, weighted by parameters $\lambda_i$ estimated for each component (section 5.1.6). P(f|e) and P(e) are generally among the $h_i(e,f)$ functions, but many others can be defined, including the reverse translation probability P(e|f) (section 5.1.2) and reordering models (section 5.1.4). The number of feature functions can go from a handful of dense features (10–15) to thousands of sparse features (section 7). During the decoding process, the best translation can then be found by maximizing this linear model (section 5.1.5).
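As a rough illustration, the weighted combination of feature functions can be sketched in Python as follows. The feature names, feature values, and weights are all invented, and features are assumed to be log-probabilities or penalties.

```python
import math


def model_score(features, weights):
    """Weighted sum of feature function values for one candidate translation."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Pick the candidate with the highest weighted feature sum."""
    return max(candidates, key=lambda c: model_score(c["features"], weights))


weights = {"tm": 1.0, "inv_tm": 0.5, "lm": 1.0, "distortion": 0.3}  # hypothetical
candidates = [
    {"text": "Mary did not slap the green witch",
     "features": {"tm": math.log(0.2), "inv_tm": math.log(0.3),
                  "lm": math.log(0.01), "distortion": -1.0}},
    {"text": "Mary not slap the witch green",
     "features": {"tm": math.log(0.25), "inv_tm": math.log(0.3),
                  "lm": math.log(0.0001), "distortion": 0.0}},
]
best = best_translation(candidates, weights)
```

Adding a new knowledge source then amounts to adding one more entry to `features` and one more weight, which is the flexibility the discriminative reformulation is meant to provide.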

The units of translation in the linear framework can vary from words to flat phrases, gapped phrases, hierarchical representations, or syntactic trees. For close language pairs, the state-of-the-art performance is achieved with phrase-based SMT systems (Koehn et al. 2003), i.e. systems that consider a sequence of words as their translation unit. For more distant language pairs such as Chinese–English, models with structural information perform better. We cover these variations of models in what follows. Most of the description and the terminology used are based on the functioning of Moses17 (Koehn et al. 2007), the freely available and most widely used open-source SMT system. A more comprehensive coverage of Moses-like approaches to SMT can be found in Koehn (2010b).

## 5.1 Phrase-based SMT

In this section we describe a common pipeline for phrase-based SMT (PBSMT), including how to extract and score phrases, the major components of PBSMT systems, and the procedures for tuning the weights of these components and for decoding.

### 5.1.1 Word alignment

According to the noisy channel formulation of the SMT problem, given the input sentence f, the aim is to estimate a general model for P(f|e), i.e. the inverse translation probability, by looking at a parallel corpus with examples of translations between f and e. The most important resource for SMT is thus a parallel corpus containing texts in one language and their translation in another language, aligned at the sentence level.

Extracting probability estimates for whole sentences f and e is however not feasible, since it is unlikely that the corpus would contain enough repeated occurrences of complete sentences. Therefore, shorter portions of the sentences are considered. The first SMT models had words as the basic unit and were generally called word-based translation models.

The first step for estimating word probabilities is to find out which words are mutual translations in the parallel corpus. This process, usually referred to as word alignment, constitutes a fundamental step in virtually all SMT approaches, either for word-based translation or as part of the preprocessing for more advanced approaches.

Word alignment consists in identifying the correspondences between the two languages at the word level. The simplest model is based on lexical translation probability distributions, and aligns words in isolation, regardless of their position in the parallel sentence or any additional information. This model is called IBM Model 1 and it is part of a set of five generative models proposed by Brown et al. (1990, 1993), the IBM Models.

According to IBM Model 1, the translation probability of a foreign sentence $f = (f_1, \ldots, f_M)$ of length $M$ being generated from an English sentence $e = (e_1, \ldots, e_L)$ of length $L$ is modelled in terms of the probability $t$ that individual words $f_m$ and $e_l$ are translations of each other, as defined by an alignment function $a$:

$$P(f, a \mid e) = \frac{\epsilon}{(L + 1)^{M}} \prod_{m=1}^{M} t(f_m \mid e_{a(m)})$$

where ε is a normalization constant to guarantee a probability distribution, and $(L + 1)^{M}$ is the number of possible alignments that map the $L + 1$ English words (the $L$ words of e plus an empty word, which accounts for source words that have no English counterpart) to the $M$ source words.

The lexical translation probabilities $t(f_m \mid e_{a(m)})$ are normally estimated using the Expectation Maximization (EM) unsupervised algorithm (Dempster et al. 1977) from a sentence-aligned parallel corpus. EM is initialized with uniform probability distributions: that is, all words are equally likely to be translations of each other, and updated iteratively using counts of $(f_m, e_l)$ word pairs as observed in parallel sentences.
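The EM procedure for the lexical translation probabilities can be sketched in a few lines of Python. This is a minimal illustration, not the GIZA++ implementation: the three-sentence German–English corpus is a toy example, and the token `NULL` stands for the empty English word.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=10):
    """corpus: list of (foreign_tokens, english_tokens); returns t[(f, e)] = t(f|e)."""
    foreign_vocab = {f for foreign, _ in corpus for f in foreign}
    uniform = 1.0 / len(foreign_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e) pairs
        total = defaultdict(float)  # expected counts of e
        for foreign, english in corpus:
            english = ["NULL"] + english
            for f in foreign:
                # E-step: distribute one count for f over all English words.
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from the expected counts.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm_model1(corpus)
```

Although every pairing starts out equally likely, co-occurrence across sentences is enough for `t[("das", "the")]` to end up larger than `t[("Haus", "the")]`.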

Given translation probabilities, Model 1 can be used to compute the most likely alignment between words in f and e. An alignment a can be defined as a vector $a_1, \ldots, a_M$, where each $a_m$ represents the sentence position of the target word generating $f_m$ according to the alignment. The model defines no dependencies between the alignment points given by a, and thus the most likely alignment is found by choosing, for each m, the value for $a_m$ that leads to the highest value for t. By using such a model for translation, the best translation will be the one that maximizes the lexical alignment a between f and e; in other words, the translation that maximizes the probability that all words in f are translations of words in e. This alignment/translation model is very simple and has many flaws. More advanced models take into account other information, for example, the fact that the position of the words in the target sentence may be related to the position of the words in the source sentence (distortion model), the fact that some source words may be translated into multiple target words (fertility of the words), or the fact that the position of a target word may be related to the position of the neighbouring words (relative distortion model). An implementation of all IBM models, along with other important developments in word alignment (Vogel et al. 1996), is provided in the GIZA++ toolkit18 (Och and Ney 2003). In practice, in modern SMT systems, IBM models along with sequence-based models are used to produce word alignments that will feed other processes to extract richer translation units, as we describe in the remainder of this section.
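Because Model 1 treats alignment points independently, finding the most likely alignment reduces to one independent choice per source word. A minimal Python sketch, with an invented table of lexical probabilities and position 0 reserved for the empty word:

```python
def best_alignment(foreign, english, t):
    """Return a_1..a_M: for each foreign word, the position of its best English word."""
    english = ["NULL"] + english  # position 0 is the empty word
    alignment = []
    for f in foreign:
        scores = [t.get((f, e), 0.0) for e in english]
        alignment.append(max(range(len(english)), key=scores.__getitem__))
    return alignment


# Hypothetical lexical translation probabilities t(f|e).
t = {
    ("la", "the"): 0.7, ("la", "witch"): 0.05,
    ("bruja", "witch"): 0.8, ("bruja", "green"): 0.1,
    ("verde", "green"): 0.9, ("verde", "witch"): 0.05,
}
alignment = best_alignment(["la", "bruja", "verde"], ["the", "green", "witch"], t)
```

The result links ‘bruja’ to position 3 (‘witch’) and ‘verde’ to position 2 (‘green’), even though the model knows nothing about word positions.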

### 5.1.2 Phrase extraction and scoring

In the context of phrase-based SMT, a phrase is a contiguous sequence of words, as opposed to a linguistically motivated unit, which is used as the basic translation unit. A phrase dictionary in such systems, the so-called phrase table, contains non-empty source phrases and their corresponding non-empty target phrases, where the lengths of a given source–target phrase pair are not necessarily equal.

Figure 6 Word alignments in both directions and their intersection (black points) and union (black and grey points)

The most common way of identifying relevant segments of a sentence into phrases is to apply heuristics to extract phrase pairs which are consistent with the word alignment between the source and target sentences (Koehn 2010b). Since the parallel corpus can be handled in both directions (i.e. f → e and e → f), it is common to generate word alignments in both directions and then intersect these two alignments to get a high-precision alignment, or take their union to get a high-recall alignment. For example, consider the word alignments in both directions and their intersection/union for the English–Spanish sentence pair in Figure 6.19

A phrase pair $(\bar{f}, \bar{e})$ is consistent with an alignment $a$ if all words $f_1, \ldots, f_m$ in $\bar{f}$ that have alignment points in $a$ have these alignment points with words $e_1, \ldots, e_n$ in $\bar{e}$ and vice versa. In other words, a phrase pair is created if the words in the phrases are only aligned to each other, and not to words outside that phrase. Starting from the intersection of two certain word alignments, a new alignment point in the union of two word alignments can be added provided that it connects at least one unaligned word. In the example in Figure 6, the phrases in Table 1 would be generated at each step of the expansion.

Table 1 Phrase pairs extracted from word alignments in Figure 6 using common heuristics

| Step | Phrase pairs |
|---|---|
| 1 | (Maria, Mary), (no, did not), (dio una bofetada, slap), (a la, the), (bruja, witch), (verde, green) |
| 2 | (Maria no, Mary did not), (no dio una bofetada, did not slap), (dio una bofetada a la, slap the), (bruja verde, green witch) |
| 3 | (Maria no dio una bofetada, Mary did not slap), (no dio una bofetada a la, did not slap the), (a la bruja verde, the green witch) |
| 4 | (Maria no dio una bofetada a la, Mary did not slap the), (dio una bofetada a la bruja verde, slap the green witch) |
| 5 | (no dio una bofetada a la bruja verde, did not slap the green witch) |
| 6 | (Maria no dio una bofetada a la bruja verde, Mary did not slap the green witch) |
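The consistency criterion can be made concrete with a short Python sketch. The function `extract_phrases` is a simplified, hypothetical version of the usual heuristic: it does not extend phrases over unaligned boundary words. The alignment points below are an assumption, chosen to agree with the minimal phrase pairs in the first row of Table 1.

```python
def extract_phrases(src, tgt, alignment, max_len=10):
    """alignment: set of (i, j) pairs linking src[i] to tgt[j] (0-based)."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word in [j1, j2] may link outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = "Maria no dio una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# Assumed alignment points (source index, target index).
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)}
pairs = extract_phrases(src, tgt, alignment)
```

Under this assumed alignment the sketch returns the same set of phrase pairs as Table 1; spans such as ‘dio una’ are rejected because ‘slap’ is also linked to ‘bofetada’, which lies outside the span.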

Once phrases are extracted, phrase translation probabilities $\phi(\bar{f} \mid \bar{e})$ can be estimated using Maximum Likelihood Estimation (MLE), i.e. relative frequencies of such phrase pairs in the corpus:

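$$\phi(\bar{f} \mid \bar{e}) = \frac{\text{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}_i} \text{count}(\bar{e}, \bar{f}_i)}$$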

Although the initial formulation of SMT considers the inverse conditional translation probability, using translation probabilities in both translation directions often results in more reliable estimates. Therefore, $\phi(\bar{e} \mid \bar{f})$ is also estimated from the same word alignment matrix.
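Relative-frequency estimation over extracted phrase pairs can be sketched as follows. The list of extracted pairs is a toy example with invented counts, and `phrase_translation_probs` is a hypothetical helper.

```python
from collections import Counter


def phrase_translation_probs(extracted_pairs):
    """extracted_pairs: one (f_phrase, e_phrase) tuple per extraction event.

    Returns phi[(f_phrase, e_phrase)] = phi(f|e).
    """
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter()
    for (_, e_phrase), c in pair_counts.items():
        e_counts[e_phrase] += c
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}


extracted = (
    [("la bruja", "the witch")] * 3
    + [("la hechicera", "the witch")]
    + [("verde", "green")] * 2
)
phi = phrase_translation_probs(extracted)
```

Swapping the two elements of each tuple before calling the function gives the estimate for the other translation direction.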

Another model usually estimated for phrases is the lexical translation probability within phrases. Phrases are decomposed into their word translations so that their lexical weighting can be taken into account. This is motivated by the fact that rare phrase pairs may have high phrase translation probability if they are not seen as aligned to anything else. This often overestimates how reliable rare phrase pairs are, which is especially problematic if the phrases are extracted from noisy data. The computation of lexical probabilities relies on the word alignment within phrases. The lexical translation probability of a phrase $\bar{e}$ given the phrase $\bar{f}$ can be computed as (Koehn 2010b):

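$$\text{lex}(\bar{e} \mid \bar{f}, a) = \prod_{i=1}^{\text{length}(\bar{e})} \frac{1}{|\{j \mid (i, j) \in a\}|} \sum_{\forall (i, j) \in a} w(e_i \mid f_j)$$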

where each target word $e_i$ is produced by an aligned source word $f_j$ with the word translation probability $w(e_i \mid f_j)$, extracted from the word alignment. Similar to the phrase translation probabilities, both translation directions can be considered: $\text{lex}(\bar{e} \mid \bar{f}, a)$ and $\text{lex}(\bar{f} \mid \bar{e}, a)$.
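A small Python sketch of lexical weighting follows. The word translation probabilities are invented, and the treatment of unaligned target words (scored against a `NULL` token) is an assumption of this sketch.

```python
def lexical_weight(e_phrase, f_phrase, alignment, w):
    """lex(e|f, a); alignment is a set of (i, j) linking e_phrase[i] to f_phrase[j]."""
    score = 1.0
    for i, e_word in enumerate(e_phrase):
        links = [j for (i2, j) in alignment if i2 == i]
        if links:
            # Average w(e_i|f_j) over all source words aligned to e_i.
            score *= sum(w[(e_word, f_phrase[j])] for j in links) / len(links)
        else:
            # Assumption: unaligned target words are scored against NULL.
            score *= w[(e_word, "NULL")]
    return score


w = {  # hypothetical word translation probabilities w(e|f)
    ("did", "no"): 0.1, ("not", "no"): 0.6,
    ("slap", "dio"): 0.1, ("slap", "una"): 0.05, ("slap", "bofetada"): 0.75,
}
lex_did_not = lexical_weight(["did", "not"], ["no"], {(0, 0), (1, 0)}, w)
lex_slap = lexical_weight(["slap"], ["dio", "una", "bofetada"],
                          {(0, 0), (0, 1), (0, 2)}, w)
```

A phrase pair that was seen only once still receives a low lexical weight if its words are rarely translations of each other, which counteracts the overestimation described above.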

Phrase extraction and scoring could alternatively be done simultaneously and directly from a sentence-aligned parallel corpus. Similar to word alignment, it is possible to use the Expectation Maximization algorithm to produce phrase alignments and their probabilities with a joint source–target model (Marcu and Wong 2002), but the task becomes very computationally expensive. Inversion Transduction Grammar constraints were used by Cherry and Lin (2007) to reduce the complexity of the joint phrase model approach.

Extracted phrase pairs are added to the phrase table along with associated phrase and lexical translation probabilities and normally a few other scores for each phrase pair, depending on the feature functions used: for example, lexical reordering scores (see section 5.1.4).

### 5.1.3 Language model

The language model (LM) is a very important component in SMT. It estimates how likely a given target language sentence is to appear in that language, based on a monolingual corpus of the target language. The intuition is that common translations are more likely to be fluent translations. The language model component P(e) for a sentence with J words is defined as the joint probability over the sequence of all words in that sentence:

$$P(e) = P(e_1, \ldots, e_J)$$

This joint probability is decomposed into a series of conditional probabilities using the chain rule:

$$P(e_1, \ldots, e_J) = \prod_{j=1}^{J} P(e_j \mid e_1, \ldots, e_{j-1})$$

Since the chances of finding occurrences of long sequences of J target words in a corpus are very small because of language variability, the language model component usually computes frequencies of parts of such sentences: n-grams, i.e. sequences of up to n words. The larger the n, the more information about the context of specific sequences, but also the lower their frequencies and thus the lower the chances that reliable estimates can be computed. In practice, common lengths for n vary between 3 and 10, depending on the size of the corpus. The basis for n-gram language models is the Markov assumption that it is possible to approximate the probability of a word given its entire history by computing the probability of a word given the last few words:

$$P(e_j \mid e_1, \ldots, e_{j-1}) \approx P(e_j \mid e_{j-n+1}, \ldots, e_{j-1})$$

For example, a trigram language model (n = 3) considers only two previous words:

$$P(e_j \mid e_1, \ldots, e_{j-1}) \approx P(e_j \mid e_{j-2}, e_{j-1})$$

Each of these conditional probabilities can be estimated using MLE. For example, for trigrams:

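$$P(e_j \mid e_{j-2}, e_{j-1}) = \frac{\text{count}(e_{j-2}\, e_{j-1}\, e_j)}{\sum_{w} \text{count}(e_{j-2}\, e_{j-1}\, w)}$$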

Smoothing techniques can be applied to avoid having zero-counts for a given n-gram and as a consequence having P(e) = 0 for previously unseen sequences. One such technique consists in adding one to all the counts of n-grams.
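A trigram language model with add-one smoothing can be sketched in Python as below. The two-sentence training corpus is a toy example, and the sentence boundary markers `<s>` and `</s>` are a common convention assumed here rather than something prescribed by the text.

```python
from collections import Counter


class TrigramLM:
    def __init__(self, sentences):
        self.trigrams = Counter()  # count(a b c)
        self.contexts = Counter()  # sum over w of count(a b w)
        self.vocab = {"</s>"}      # words that can be predicted
        for words in sentences:
            self.vocab.update(words)
            padded = ["<s>", "<s>"] + words + ["</s>"]
            for a, b, c in zip(padded, padded[1:], padded[2:]):
                self.trigrams[(a, b, c)] += 1
                self.contexts[(a, b)] += 1

    def prob(self, a, b, c):
        """P(c | a, b) with add-one smoothing."""
        return (self.trigrams[(a, b, c)] + 1) / (self.contexts[(a, b)] + len(self.vocab))

    def sentence_prob(self, words):
        padded = ["<s>", "<s>"] + words + ["</s>"]
        p = 1.0
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            p *= self.prob(a, b, c)
        return p


lm = TrigramLM([["the", "green", "witch"], ["the", "witch"]])
fluent = lm.sentence_prob(["the", "green", "witch"])
disfluent = lm.sentence_prob(["the", "witch", "green"])
```

Thanks to smoothing, the unseen word order still receives a non-zero probability, but a lower one than the order observed in training.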

Off-the-shelf language modelling toolkits such as SRILM,20 IRSTLM,21 and KENLM22 are used by many SMT systems and provide a number of more advanced smoothing strategies.

### 5.1.4 Reordering models

Word order may vary in different languages, and a monotonic order in the translation, where words in e are in the same order as words in f, is likely to result in poor translations. Most PBSMT incorporates a model of reordering. A simple strategy to deal with reordering of words is a distance-based reordering model. According to such a model, each phrase pair is associated with a distance-based reordering function:

$$d(\mathit{start}_i - \mathit{end}_{i-1} - 1)$$

where the reordering of a phrase is relative to the previous phrase: $\mathit{start}_i$ is the position of the first word of the source phrase that translates to the $i$th target phrase; $\mathit{end}_i$ is the position of the last word of that source phrase. The reordering distance, computed as $(\mathit{start}_i - \mathit{end}_{i-1} - 1)$, is the number of words skipped (forward or backward) when source words are taken out of sequence. For example, if two contiguous source phrases are translated in sequence, then $\mathit{start}_i = \mathit{end}_{i-1} + 1$, i.e. the position of the first word of phrase i is next to the position of the last word of the previous phrase. In that case, the reordering distance will be zero, i.e. a cost of $d(0)$ will be applied to that phrase. This model therefore penalizes movements of phrases over large distances. A common practice to model the reordering probability is to use an exponentially decaying cost function $d(x) = \alpha^{|x|}$, where α is assigned a value in [0, 1] so that d scales as a probability distribution.

This absolute distance-based reordering model uses a cost that is dependent only on the reordering distance, i.e. skipping over two words will cost twice as much as skipping over one word, regardless of the actual words reordered. Therefore, such a model penalizes movement in general, which may lead to little reordering being done in practice.
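The distance-based cost can be worked through with a short Python sketch. Source positions are 1-based, the position before the first word is taken to be 0, and the value of α is arbitrary; `distortion_cost` is a hypothetical helper.

```python
def reordering_distance(start_i, end_prev):
    """Number of source words skipped: start_i - end_{i-1} - 1."""
    return start_i - end_prev - 1


def distortion_cost(source_spans, alpha=0.5):
    """source_spans: (start, end) source positions of phrases, in target order."""
    cost, end_prev = 1.0, 0
    for start, end in source_spans:
        cost *= alpha ** abs(reordering_distance(start, end_prev))
        end_prev = end
    return cost


# Monotone translation of three contiguous phrases: no penalty.
monotone = distortion_cost([(1, 2), (3, 7), (8, 9)])
# 'bruja verde' (positions 8 and 9) translated word by word as 'green witch':
# jump forward over one word (distance 1), then back over two (distance -2).
swapped = distortion_cost([(1, 7), (9, 9), (8, 8)])
```

With α = 0.5 the swap costs 0.5 × 0.25 = 0.125, whatever the words involved, which is the limitation that lexicalized models address.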

An alternative is to use lexicalized reordering models with different reordering probabilities for each phrase pair learned from data, in order to take into account the fact that some phrases are reordered more frequently than others. A reordering model $p_o$ will thus estimate how probable a given type or orientation of reordering (including no reordering) is for each phrase: $p_o(\text{orientation} \mid \bar{f}, \bar{e})$. Common orientations include (Koehn 2010b): monotone order, swap with previous phrase, and discontiguous.

Using the word alignment information, this probability distribution can be estimated together with phrase extraction using MLE:

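$$p_o(\text{orientation} \mid \bar{f}, \bar{e}) = \frac{\text{count}(\text{orientation}, \bar{e}, \bar{f})}{\sum_{o} \text{count}(o, \bar{e}, \bar{f})}$$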

where o ranges over all the orientation types.
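In code, this estimate is again a relative frequency, now conditioned on the phrase pair. The observations below are invented; in practice they would be collected during phrase extraction, and `orientation_probs` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def orientation_probs(observations):
    """observations: iterable of (orientation, f_phrase, e_phrase) events."""
    counts = defaultdict(Counter)
    for orientation, f_phrase, e_phrase in observations:
        counts[(f_phrase, e_phrase)][orientation] += 1
    return {
        pair: {o: c / sum(by_orientation.values()) for o, c in by_orientation.items()}
        for pair, by_orientation in counts.items()
    }


observations = (
    [("swap", "verde", "green")] * 3
    + [("monotone", "verde", "green")]
    + [("monotone", "la", "the")] * 2
)
p_o = orientation_probs(observations)
```

In this toy data the adjective pair is mostly swapped with the preceding phrase, while the determiner pair is always translated in monotone order.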

### 5.1.5 Decoding

The decoder is the component that searches for the best translation among the possibilities scored by the combination of different feature functions: for example, using the linear model described in section 5. The best translation $\hat{e}$ can be found by maximizing this linear model, i.e.:

$$\hat{e} = \arg\max_{e} \sum_{i=1}^{n} \lambda_i\, h_i(e, f)$$

Most SMT systems implement heuristic search methods to cope with the fact that there is an exponential number of translation options. A popular method is the stack-based beam search decoder (Koehn et al. 2003). This search process generates a translation from left to right in the target language order through the creation and expansion of translation hypotheses from options in the phrase table covering the source words in any arbitrary order (often constrained by a distance limit). It uses priority queues (stacks) as a data structure to store these hypotheses, and a few strategies to prune the search space to keep only the most promising hypotheses.

Given a source sentence, a number of phrase translations available in the phrase table can be applied to translate it. Each applicable phrase translation is called a translation option. Each word can be translated individually (phrases of length 1), or by phrases with two or more source words. For example, consider the translation of the Spanish sentence ‘Maria no dio una bofetada a la bruja verde’ into English, assuming some of the phrases available in the phrase table as shown in Table 2. A subset of the possible combinations of these translation options is shown in Table 3.

From the translation options, the decoder builds a graph starting with an initial state where no source words have been translated (or covered) and no target words have been generated. New states are created in the graph by extending the target output with a phrasal translation that covers some of the source words not yet translated. At every expansion, the current cost of the new state is the cost of the original state plus the values of the feature functions for that translation option: for example, translation, distortion, and language model costs of the added phrasal translation. Final states in the search graph are hypotheses that cover all source words. Among these, the hypothesis with the lowest cost (highest probability) is selected as the best translation.

A common way to organize hypotheses in the search space is by using stacks of hypotheses. Stacks are based on the number of source words translated by the hypotheses. One stack contains all hypotheses that translate one source word, another stack contains all hypotheses that translate two source words in their path, and so on. Figure 7 (see note 23) shows the representation of stacks of hypotheses considering some of the translation options given in Table 3.

Figure 7 Stacks of hypotheses in a beam search decoder for some translation options in Table 3

In the example in Figure 7, after the initial null hypothesis is placed in the stack of hypotheses with zero source words covered, we can cover the first word in the sentence (‘Maria’) or the second word in the sentence (‘no’), and so on. Each derived hypothesis is placed in a stack based on the number of source words it covers. The decoding algorithm proceeds through each hypothesis in the stacks, deriving new hypotheses for it and placing them into the appropriate stack. For example, the stack covering three words has different hypotheses translating ‘Maria no dio’: ‘Mary did not gave’ and ‘Mary not gave’.
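The stack organization can be illustrated with a deliberately small Python decoder. It is a sketch under strong simplifications, not the Moses algorithm: there is no hypothesis recombination or future cost estimation, the language model is a toy bigram table, and all scores (log-probabilities) are invented.

```python
import math


def make_bigram_lm(bigram_logprobs, unseen=-5.0):
    """Toy bigram LM: log-probability of new words given the output so far."""
    def lm_logprob(history, words):
        prev = history[-1] if history else "<s>"
        total = 0.0
        for word in words:
            total += bigram_logprobs.get((prev, word), unseen)
            prev = word
        return total
    return lm_logprob


def stack_decode(source, phrase_table, lm_logprob,
                 stack_size=10, distortion_limit=3, alpha=0.5):
    n = len(source)
    # stacks[k]: hypotheses covering k source words,
    # each stored as (score, covered positions, end of last phrase, output words).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, frozenset(), 0, ()))
    for k in range(n):
        # Histogram pruning: expand only the best `stack_size` hypotheses.
        stacks[k] = sorted(stacks[k], key=lambda h: h[0], reverse=True)[:stack_size]
        for score, covered, end_prev, output in stacks[k]:
            for start in range(n):
                for end in range(start + 1, n + 1):
                    span = range(start, end)
                    if any(p in covered for p in span):
                        continue
                    # 1-based start is start + 1, so the distance is |start - end_prev|.
                    distance = abs(start - end_prev)
                    if distance > distortion_limit:
                        continue
                    options = phrase_table.get(tuple(source[start:end]), [])
                    for target, tm_logprob in options:
                        words = tuple(target.split())
                        new_score = (score + tm_logprob
                                     + distance * math.log(alpha)
                                     + lm_logprob(output, words))
                        stacks[k + end - start].append(
                            (new_score, covered | set(span), end, output + words))
    return max(stacks[n], key=lambda h: h[0]) if stacks[n] else None


phrase_table = {  # hypothetical translation options with log-probabilities
    ("la",): [("the", -0.1)],
    ("bruja",): [("witch", -0.2)],
    ("verde",): [("green", -0.2)],
    ("bruja", "verde"): [("green witch", -0.3)],
}
lm = make_bigram_lm({("<s>", "the"): -0.5, ("the", "green"): -1.0,
                     ("green", "witch"): -1.0, ("the", "witch"): -1.5})
best = stack_decode("la bruja verde".split(), phrase_table, lm)
```

Here the two-word phrase option wins over both the monotone word-by-word path, which the language model penalizes, and the reordered word-by-word path, which pays a distortion cost.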

Table 2 Examples of translation options in a Spanish–English phrase table

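| Spanish | English |
|---|---|
| Maria | Mary |
| no | not |
| no | did |
| no | did not |
| dio | gave |
| dio | slap |
| una | a |
| | slap |
| a | to |
| la | the |
| bruja | witch |
| verde | green |
| | slap |
| a la | the |
| Maria no | Mary did not |
| | did not slap |
| bruja verde | green witch |
| a la bruja verde | the green witch |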

Table 3 A subset of combinations of translation options for the Spanish sentence ‘Maria no dio una bofetada a la bruja verde’ into English given the phrase pairs in Table 2

| Maria no dio una bofetada a la bruja verde |
|---|
| Mary not gave a slap to the witch green |
| Mary not slap a slap to the witch green |
| Mary did gave a slap to the witch green |
| Mary did slap a slap to the witch green |
| Mary did not gave a slap to the witch green |
| Mary did not slap a slap to the witch green |
| Mary did not gave a slap to the witch green |
| Mary did not slap a slap to the witch green |
| Mary did not slap to the witch green |
| Mary did not slap the green witch |

If an exhaustive search were to be performed for the best hypothesis covering all source words, all translation options in different orders could be used to build alternative hypotheses. However, this would result in a very large search space, and thus in practice this search space is pruned in different ways. For example, a reordering limit is often specified to constrain the difference in the word order for the source and target segments. The use of stacks for decoding also allows for different pruning strategies. A stack has fixed space, so after a new hypothesis is placed into a stack, some hypotheses might need to be pruned to make space for better hypotheses. The idea is to keep only a number of hypotheses that are promising (according to this early-stage guess) and remove the worst hypotheses from the stack.

Examples of pruning strategies are threshold pruning and histogram pruning. Histogram pruning simply keeps a certain number n of hypotheses in each stack (e.g. n = 1000). In threshold pruning, a hypothesis is rejected if its score falls below that of the best hypothesis in the stack multiplied by a given factor (e.g. threshold = 0.001). This threshold defines a beam of good hypotheses and their neighbours, and prunes those hypotheses that fall out of this beam.
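The two pruning strategies can be sketched as follows. This is a minimal illustration, assuming that each hypothesis carries a positive, probability-like score; the function names and the example scores are hypothetical and are not taken from any particular decoder.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (partial translation, probability-like score)


def histogram_prune(stack: List[Hypothesis], n: int) -> List[Hypothesis]:
    """Keep only the n highest-scoring hypotheses in the stack."""
    return sorted(stack, key=lambda h: h[1], reverse=True)[:n]


def threshold_prune(stack: List[Hypothesis], threshold: float) -> List[Hypothesis]:
    """Keep hypotheses whose score is at least threshold times the best score."""
    if not stack:
        return []
    best = max(score for _, score in stack)
    return [h for h in stack if h[1] >= threshold * best]


# Hypothetical stack of hypotheses covering three source words.
stack = [
    ("Mary did not gave", 0.02),
    ("Mary not gave", 0.004),
    ("Mary did gave", 0.00001),
]
kept_by_histogram = histogram_prune(stack, n=2)
kept_by_threshold = threshold_prune(stack, threshold=0.001)
```

With log-probability scores, the multiplicative threshold would become an additive margin below the best score.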

As a consequence of the use of stacks, the beam search algorithm also keeps track of a number of alternative translations. For some applications, besides the actual best translation for a given source sentence, it can be helpful to have the second-best translation, third-best translation, and so on. A list of n-best translations, the so-called n-best list, can thus be produced. N-best lists can be used, among other applications, to rerank the output of an SMT system as a post-processing step using features that are richer than those internal to the SMT system. For example, one could parse all n-best translations and rerank them according to their parse tree score in an attempt to reward the more grammatical translations. N-best lists are also used to tune the system parameters, as we describe in section 5.1.6.
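Reranking an n-best list with an additional feature can be sketched as below. This is only an illustration: the model scores are made up, and `toy_grammar_score` is a hypothetical stand-in for a richer feature such as a parse tree score.

```python
from typing import Callable, List, Tuple

NBestList = List[Tuple[str, float]]  # (translation, model score; higher is better)


def rerank(nbest: NBestList, extra_feature: Callable[[str], float],
           weight: float = 1.0) -> NBestList:
    """Sort translations by model score plus a weighted external feature."""
    return sorted(nbest,
                  key=lambda item: item[1] + weight * extra_feature(item[0]),
                  reverse=True)


def toy_grammar_score(translation: str) -> float:
    """Hypothetical stand-in for a parser-based grammaticality score."""
    return -2.0 if "did not gave" in translation else 0.0


nbest = [
    ("Mary did not gave a slap to the green witch", -4.0),
    ("Mary did not slap the green witch", -4.5),
]
reranked = rerank(nbest, toy_grammar_score)
```

In this toy case the second-best translation moves to the top of the list because the external feature penalizes the ungrammatical first candidate.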

### 5.1.6 Parameter tuning

The PBSMT approach discussed so far is modelled as the interpolation of a number of feature functions following a supervised machine learning approach in which a weight is learned for each feature function. The goal of this approach is to estimate such weights using iterative search methods to find the single optimal solution. However, this is a computationally expensive process. In what follows, we describe a popular approximation to such a process for estimating the weights of a small set of features, the Minimum Error-Rate Training (MERT) algorithm (Och 2003).

MERT assumes that the best model is the one that produces the smallest overall error with respect to a given error function, i.e. a function that evaluates the quality of the system translation. It is common to use the same function as that according to which the final translations will be evaluated, generally BLEU (Papineni et al. 2002) (section 6). Parameter tuning is performed on a development set C containing a relatively small number (usually 2–10K) of source sentences and their reference translations, which makes this a supervised learning process. Over several iterations, the current version of the system is used to produce an n-best list of translations for each source sentence, and MERT optimizes the weights of the feature functions to rerank these n-best lists so that the system translations with the smallest error according to BLEU (i.e. those that are the closest to the reference translations) appear at the top of the list.

Given an error function $E(e^*, e)$ defining the difference (error) between the hypothesized translation $e^*$ and a reference translation $e$, e.g. BLEU, learning the vector of parameters for all features $\lambda_1^K$ can be defined as (Lopez 2008):

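$$\hat{\lambda}_1^K = \arg\min_{\lambda_1^K} \sum_{(e, f) \in C} E\left(\arg\max_{e^*} P_{\lambda_1^K}(e^* \mid f),\; e\right)$$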

where argmax corresponds to the decoding step and results in the best-scoring hypothesis $e^*$ with respect to the set of feature weights $\lambda_1^K$ at a given iteration, and argmin defines the search for the set of $\lambda_1^K$ that minimizes the overall error $E$ for the whole development set $C$.

The algorithm iteratively generates sets of values for $\lambda_1^K$ and tries to improve them by minimizing the error resulting from changing each parameter $\lambda_k$ while holding the others constant. At the end of this optimization step, the optimized $\lambda_1^K$ yielding the greatest error reduction is used as input to the next iteration. Heuristics are used to generate values for the parameters, as the space of possible parameter values is too large to be exhaustively searched in a reasonable time even with a small set of features. This process is repeated until convergence or for a predefined number of iterations.
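The coordinate-wise search described above can be sketched as follows. This is a simplified illustration and not the algorithm of Och (2003): it assumes fixed n-best lists, a sentence-level error that can simply be summed over the development set (BLEU itself is computed at corpus level), and a small grid of candidate values in place of an exact line search. All names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[Sequence[float], float]  # (feature values, error vs. reference)
NBestList = List[Candidate]


def corpus_error(weights: Sequence[float], nbest_lists: List[NBestList]) -> float:
    """Sum the errors of the hypotheses ranked first under the given weights."""
    total = 0.0
    for nbest in nbest_lists:
        # argmax step: the hypothesis that would be ranked first.
        best = max(nbest, key=lambda c: sum(w * h for w, h in zip(weights, c[0])))
        total += best[1]
    return total


def coordinate_search(weights: Sequence[float], nbest_lists: List[NBestList],
                      grid: Sequence[float], max_iters: int = 20):
    """Change one weight at a time, keeping the change that reduces error most."""
    weights = list(weights)
    current = corpus_error(weights, nbest_lists)
    for _ in range(max_iters):
        best_change = None
        for k in range(len(weights)):
            for value in grid:
                trial = weights[:k] + [value] + weights[k + 1:]
                err = corpus_error(trial, nbest_lists)
                if err < current and (best_change is None or err < best_change[0]):
                    best_change = (err, trial)
        if best_change is None:  # no single change helps: converged
            break
        current, weights = best_change
    return weights, current


# Two development sentences, two features (e.g. translation and language model scores).
nbest_lists = [
    [((-1.0, -5.0), 0.6), ((-2.0, -2.0), 0.1)],
    [((-0.5, -4.0), 0.5), ((-1.5, -1.0), 0.2)],
]
initial_error = corpus_error([1.0, 0.0], nbest_lists)
tuned_weights, tuned_error = coordinate_search([1.0, 0.0], nbest_lists,
                                               grid=[0.0, 0.5, 1.0])
```

In a full tuning loop, the tuned weights would be used to decode the development set again and produce new n-best lists for the next iteration.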

A number of alternative approaches for discriminative training in SMT have been proposed, including online methods (Watanabe et al. 2007a) and pairwise ranking methods (Hopkins and May 2011). A comparison covering different approaches is given in Cherry and Foster (2012).

## 5.2 Tree-based SMT

The PBSMT approach does not handle long-distance reorderings, which are necessary for many language pairs. Although reordering is a naturally occurring phenomenon within phrases and the PBSMT model has a component for phrase reordering, both cases are generally limited to movements over very short distances. Over-relaxing the constraints of phrase reordering to allow longer-distance reorderings to happen is likely to yield disfluent translations, in addition to making inference more complex. The introduction of structural information in PBSMT models is thus a natural development of such models.

Attempts include hierarchical PBSMT models, which formalize the use of structural information via synchronous context-free grammars (SCFG) that derive both source and target language simultaneously, and extensions of such models utilizing syntactic information. Synchronous rules are commonly represented by $LHS \rightarrow \langle RHS_f, RHS_e \rangle$, where LHS stands for left-hand side, RHS for right-hand side, and f and e for source and target languages, respectively.

### 5.2.1 Hierarchical PBSMT

Hierarchical PBSMT models (Chiang 2005) convert standard phrases into SCFG rules, and use a different decoding algorithm. SCFG rules have the form $X \rightarrow \langle \gamma, \alpha \rangle$, with $X$ a non-terminal, $\gamma$ a string of non-terminals and source terminal symbols, and $\alpha$ a string of non-terminals and target terminal symbols. As in any context-free grammar, only one non-terminal symbol can appear on the left-hand side of the rules. An additional constraint is that there is a one-to-one alignment between the source and target non-terminal symbols on the right-hand side of the rules.

Translation rules are extracted from the flat phrases induced as discussed in section 5.1.2, from a parallel word-aligned corpus. Words or sequences of words in phrases are replaced by variables, which can later be instantiated by other words or phrases (hence the notion of hierarchy). SCFG rules are therefore constructed by subtraction out of the flat phrase pairs: every phrase pair $(\bar{f}, \bar{e})$ becomes a rule $X \rightarrow \langle \bar{f}, \bar{e} \rangle$. Additionally, phrases are generalized into other rules: a phrase pair $(\bar{f}, \bar{e})$ can be subtracted from a rule $X \rightarrow \langle \gamma_1 \bar{f} \gamma_2, \alpha_1 \bar{e} \alpha_2 \rangle$ to form a new rule $X \rightarrow \langle \gamma_1 X \gamma_2, \alpha_1 X \alpha_2 \rangle$, where any other rule (phrase pair) can, in principle, be used to fill in the slots. For example, consider that the following two phrase pairs are extracted: (the blue car is noisy, la voiture bleu est bruyante) and (car, voiture). These would be converted into the following rules:

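$$X \rightarrow \langle \text{the blue car is noisy},\ \text{la voiture bleu est bruyante} \rangle$$

$$X \rightarrow \langle \text{car},\ \text{voiture} \rangle$$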

Additionally, a rule with non-terminals would be generated:

$$X \rightarrow \langle \text{the blue } X_1 \text{ is noisy},\ \text{la } X_1 \text{ bleu est bruyante} \rangle$$

We note that these rules naturally allow the reordering of the ‘adjective noun’ constructions.

Replacing multiple smaller phrases may result in multiple non-terminals on the right-hand side of the rules. For example, if the phrase (blue, bleu) is also available, the following rule can be extracted:

$$X \rightarrow \langle \text{the } X_1\ X_2 \text{ is noisy},\ \text{la } X_2\ X_1 \text{ est bruyante} \rangle$$
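The subtraction of phrase pairs to form hierarchical rules can be sketched as follows. This is a simplification that works on token sequences only: it does not check consistency with the word alignment, and it replaces the first matching occurrence on each side. The function names and the `X1`, `X2` labels are hypothetical notation for co-indexed non-terminals.

```python
from typing import List, Tuple

PhrasePair = Tuple[str, str]  # (source side, target side)


def replace_span(tokens: List[str], sub_tokens: List[str], symbol: str) -> List[str]:
    """Replace the first occurrence of sub_tokens in tokens with symbol."""
    n = len(sub_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == sub_tokens:
            return tokens[:i] + [symbol] + tokens[i + n:]
    raise ValueError("sub-phrase not found")


def subtract(rule: PhrasePair, sub: PhrasePair, index: int) -> PhrasePair:
    """Subtract a smaller phrase pair from both sides of a rule."""
    symbol = "X" + str(index)
    source = replace_span(rule[0].split(), sub[0].split(), symbol)
    target = replace_span(rule[1].split(), sub[1].split(), symbol)
    return " ".join(source), " ".join(target)


flat = ("the blue car is noisy", "la voiture bleu est bruyante")
one_gap = subtract(flat, ("car", "voiture"), 1)
two_gaps = subtract(subtract(flat, ("blue", "bleu"), 1), ("car", "voiture"), 2)
```

The shared index on the two sides records the one-to-one alignment between source and target non-terminals, which is what allows the adjective and the noun to swap positions in the target.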

To control for the combinatorial growth of the rule sets and alleviate the computational complexity of the resulting models, a number of restrictions can be imposed on the rule extraction process. These include limiting the number of non-terminals on the right-hand side of the rules, or limiting the number of words (terminals) in the rules. These restrictions also reduce the number of ambiguous derivations, leading to better estimates.

Once the rules are extracted, the scoring of each rule can be done in different ways, generating different features, some of which are analogous to those used in PBSMT. For example (Koehn 2010b):

• joint rule probability $P(LHS, RHS_f, RHS_e)$

• inverse translation probability $P(RHS_f \mid RHS_e, LHS)$

• direct translation probability $P(RHS_e \mid RHS_f, LHS)$

• rule application probability $P(RHS_f, RHS_e \mid LHS)$, etc.

These probability distributions can be estimated using MLE, i.e. counting the relative frequency of the rules, as in PBSMT. The lexical probabilities can be estimated using word alignment information about the words in the rules. The language model component is usually computed over n-grams of words, as in PBSMT, although research in exploiting syntactic information for language modelling does exist (Tan et al. 2011).
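Relative-frequency estimation of one of these features, the direct translation probability, can be sketched as follows. The rule occurrences are made-up counts for illustration, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Rule = Tuple[str, str, str]  # (LHS, RHS_f, RHS_e)


def direct_translation_prob(rules: List[Rule]) -> Dict[Rule, float]:
    """Estimate P(RHS_e | RHS_f, LHS) as a relative frequency of extracted rules."""
    rule_counts = Counter(rules)
    context_counts = Counter((lhs, rhs_f) for lhs, rhs_f, _ in rules)
    return {rule: count / context_counts[(rule[0], rule[1])]
            for rule, count in rule_counts.items()}


# Hypothetical multiset of rule occurrences extracted from a parallel corpus.
extracted = [("X", "car", "voiture")] * 3 + [("X", "car", "auto")]
probs = direct_translation_prob(extracted)
```

The other distributions in the list differ only in which part of the rule is counted in the denominator.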

As in PBSMT, the overall score of a given translation is computed as the product of all rule scores that are used to derive that translation, where the scores are given by the combination of the model components described above using a linear model over synchronous derivations. The weights of the model components can be estimated using MERT. The decoding process is similar to that of finding the best parse tree for an input sentence using a probabilistic context-free grammar parser. Decoding is thus performed using a beam search algorithm that computes the space of parse trees of the source text and their projection into target trees (through the synchronous grammar). This space is efficiently organized using a chart structure, similar to a monolingual chart parser (Chiang 2007). In chart parsing, data is organized in a chart structure with chart entries that cover contiguous spans of increasing length of the input sentence. Chart entries are filled bottom-up, generally first with lexical rules, then with rules including non-terminal nodes, until the sentence node (root of the tree) is reached. Similar heuristics as in PBSMT can be used to make the search process more efficient.

### 5.2.2 Syntax-based SMT

While the definition of hierarchical models does not imply the need for syntactic information, this information can be used to produce linguistically motivated hierarchical rules. In order to use syntactic information, the parallel corpus needs to be preprocessed to produce a parse tree for each source and/or target sentence. When syntactic information is used in both source and target texts, the resulting approach is called tree-to-tree syntax-based SMT. Less constrained models, using syntax on the source or target sentences only, are commonly called tree-to-string or string-to-tree models, respectively.

The use of syntax for SMT had actually been proposed before hierarchical PBSMT (Wu 1997; Yamada and Knight 2001), but the work remained theoretical or was not able to achieve comparable performance to PBSMT models on large-scale data sets. The general framework that we describe in what follows is just one among many other approaches for syntax-based SMT. The description that follows for tree-to-tree syntax-based SMT is based on the models presented in Koehn (2010b).

Taking hierarchical PBSMT as a basis for comparison, syntactic information allows different and more informative non-terminal symbols, as opposed to the single symbol ‘X’. These symbols constrain the application of the rules according to the grammar of the languages. For example, given the following English and French sub-trees:

English: (NP (DT the) (JJ blue) (NN car))

French: (NP (DT la) (NN voiture) (JJ bleu))

A rule for noun phrases with reordering of adjectives could be extracted using similar heuristics as in the basic hierarchical models, but now with linguistic information (POS and phrase tags) as part of the rules:

$Display mathematics$

Rule extraction in syntax-based SMT follows the same basic constraints as in hierarchical models: (i) rules can have a single non-terminal on the left-hand side, (ii) rules need to be consistent with word alignment, and (iii) there needs to be a one-to-one alignment between source and target non-terminal symbols on the right-hand side of the rules. For example, given the following sentence pair and its word alignment in Table 4:

English: (S (NP (PRP I)) (VP (VBP have) (NP (JJ black) (NNS eyes))))

French: (S (NP (PRP J’)) (VP (VBP ai) (NP (DT les) (NNS yeux) (JJ noirs))))

Table 4 Example of French–English word alignment for SCFG rule extraction

French: J’ ai les yeux noirs

English: I have black eyes

The following rules could be generated, among others:

$Display mathematics$

$Display mathematics$

$Display mathematics$

$Display mathematics$

$Display mathematics$

More advanced models allow rules to cover syntactic trees that are not isomorphic in terms of child–parent relationships in both languages. For example, synchronous tree substitution grammars include not only non-terminal and terminal symbols in the right-hand side of the rules, but also trees (Zhang et al. 2007).

A number of variations of the tree-to-tree syntax-based SMT approaches have been proposed in recent years (Hanneman et al. 2008; Zhang et al. 2008). In addition, since syntactic parsers with good enough quality are not available for many languages, other approaches use syntactic information for the source or target language only. Tree-to-string models—which use syntactic information for the source language only—can further constrain the application of rules based on linguistic restrictions of the source language (Quirk et al. 2005; Huang et al. 2006; Zhou et al. 2008; Liu and Gildea 2008). String-to-tree models, which use syntactic information on the target language only, attempt to refine translations by ensuring that they follow the syntax of the target language (Galley et al. 2006; Marcu et al. 2006; Zollmann et al. 2008). Some approaches allow multiple parse trees. These are known as forest-based SMT (Mi and Huang 2008; Zhang et al. 2009). Some of these approaches use different grammar formalisms than the one described here, as well as different feature functions, parameter estimations, and decoding strategies.

Establishing syntactic constraints on tree-based models can make rule tables very sparse, limiting their coverage. In an attempt to reduce such sparsity, Zollmann and Venugopal (2006) propose very effective (yet simple) heuristics to relax parse trees, commonly referred to as syntax augmented machine translation. Significant gains have been observed by grouping non-terminals under more general labels when these non-terminals do not span across syntactic constituents. For example, given a noun phrase sub-tree containing a determiner (DET) followed by an adjective (ADJ) and a noun (NN), ADJ and NN could be grouped to form an ADJ\\NN node. Also aimed at reducing the sparsity in syntax-based models, Hoang and Koehn (2010) propose a soft syntax-based model which combines the precision of such models with the coverage of unconstrained hierarchical models. Constrained and unconstrained non-terminals are used together. If a syntax-based rule cannot be retrieved, the model falls back to the purely hierarchical approach, retrieving a rule with unlabelled non-terminals.

Syntax-based translation models have been shown to improve performance for translating between languages which differ considerably in word ordering, such as English and Chinese. Variations of hierarchical and syntax-based models are implemented in Moses, cdec24 and Joshua,25 which are all freely available open-source SMT toolkits.

## 5.3 Other types of linguistic information for SMT

Besides structure/syntax, other levels of linguistic information have been used to improve purely statistical models. The use of morphological information, especially for translating into morphologically complex languages, such as Arabic and Turkish, has been extensively studied. This includes techniques to preprocess data, such as to segment complex words with affixes, or to post-process data, such as to generate adequate morphological variations once the translation is done. In a popular approach, the use of morphological information in the form of factored models (Koehn and Hoang 2007) has been proposed as an extension of PBSMT models. Words can be represented in their basic form (lemmas or stems) and word-level information such as POS tags and morphological features can be attached to such words. Translation can be performed from the basic word forms (or an intermediate representation) and any additional word-level information can be used to generate appropriate features (inflections, etc.) in the target language.

Another line of research looking into incorporating more information in SMT models focuses on exploiting additional source context and potentially linguistic information associated with it. The use of source context in SMT is limited to a few words in the phrases. In order to guarantee reliable probability estimates, phrases are limited to 3–7 words depending on the size of the parallel corpora used. While in principle longer phrases can be considered if large quantities of parallel data are available, this is not the case for most language pairs and text domains. This results in a number of problems, particularly due to ambiguity issues. For example, it may not be possible to choose among different translations of a highly ambiguous word without having access to the global context of the source text. While hierarchical models allow the use of some contextual information, this has more noticeable effects in terms of reordering. Other attempts have been made to explicitly use contextual information. For example, Carpuat and Wu (2007) incorporate features for word sense disambiguation models (typically based on words in the context and linguistic information about them, such as their POS tags) as part of the SMT system. Alternative translations for a given phrase are considered as alternative senses, and the source sentence context is used to choose among them. Specia et al. (2008) use WSD models with dictionary translations as senses and a method to rerank translations in the n-best list according to their lexical choices for ambiguous words. Mirkin et al. (2009) and Aziz et al. (2010) use contextual models on the source and target languages to choose among alternative substitutions for unknown words in SMT. Alternative ways of using contextual information include Stroppa et al. (2007), Gimpel and Smith (2008), Chiang et al. (2009), Haque et al. (2010), and Devlin et al. (2014).

Using more semantically orientated types of linguistic information is an interesting recent direction. Wu and Fung (2009) propose a two-pass model to incorporate semantic information into the standard PBSMT pipeline. Standard PBSMT is applied in a first pass, followed by a constituent reordering step seeking to maximize the cross-lingual match of the semantic roles between the source and the translation. Liu and Gildea (2010) choose to add features extracted from the source sentences annotated with semantic role labels to a tree-to-string SMT model. The source sentence is parsed for semantic roles and these are then projected onto the translations using word alignment information at decoding time. The model is modified in order to penalize/reward role reordering and role deletion. Baker et al. (2010) graft semantic information, namely named entities and modalities, to syntactic tags in a syntax-based model. The vocabulary of non-terminals is thus specialized with named entities and modality information. For instance, a noun phrase (NP) whose head is a geopolitical entity (GPE) will be tagged as NPGPE, making the rule table less ambiguous (at the cost of a larger grammar).

Figure 8 Example of shallow semantic tree

An alternative approach is proposed in Aziz et al. (2011) to extend hierarchical models by using semantic roles to create shallow semantic trees. Semantic roles are used to augment the vocabulary of non-terminals in hierarchical models (X). The hypothesis is that semantic role information should help in selecting the correct synchronous production for better reordering decisions and better lexical selection for ambiguous words. Source sentences are first processed for POS tags, base phrases, and semantic roles. In order to generate a single semantic tree for each source sentence, semantic labels are directly grafted to base phrase annotations when a predicate argument coincides with a single base phrase, and simple heuristics are applied in the remaining cases. Tags are also lexicalized: that is, semantic labels are composed of their type (e.g. A0) and target predicate lemma (verb). An example for the sentence He intends to donate his money to charity, but he has not decided which yet, which has multiple predicates and overlapping arguments, is given in Figure 8.

This approach was not able to lead to significant improvements over the performance of standard hierarchical approaches. This was mostly due to the highly specialized resulting grammars, which made the probability estimates less informative, and required more aggressive pruning due to the very large number of non-terminals. However, the approach led to a considerable reduction in the number of rules extracted. As an alternative to tree-based semantic representations, Jones et al. (2012) use graph-shaped representations. Algorithms for graph-to-word alignment and for synchronous grammar rule extraction from these alignments are proposed. The resulting translation model is based on synchronous context-free graph grammars and leads to promising results.

Going beyond sentence semantics, recent work has started exploring discourse information for SMT. Most existing SMT decoders translate sentences one by one, in isolation. This is mostly motivated by computational complexity issues. Considering more than one sentence at a time will result in a much larger search space, and in the need for even more aggressive pruning of translation candidates. In addition, feature functions in standard beam search decoders are limited in the amount of information they can use about the translation being produced, as only partial information is available for scoring these translations (and pruning the search space) before reaching a translation that covers all source words. Recent work has looked into both new decoding algorithms and new feature functions. Hardmeier et al. (2012, 2013) introduced Docent, a document-wide decoder. The decoder works as a multi-stage translation process, with the first stage generated by arbitrarily picking a translation among the pool of candidates or taking the output of a standard Moses-based PBSMT system for the document. Instead of starting with an empty translation and expanding it until all source words are covered, each state of the search corresponds to a complete translation of a document. Improvements on these states (versions) are performed via local search. Search proceeds by making small changes to the current state to transform it gradually into a (supposedly) better translation. Changes are performed based on a set of operations, such as replacing a given phrase by an alternative from the phrase table, deleting or moving a phrase. Different search algorithms can be used. For example, search can be done via standard hill-climbing methods that start with an initial state and generate possible successor states by randomly applying operations to the state. After each operation, the new state is evaluated and accepted if its score is better than that of the previous state, else rejected. 
Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Other search algorithms include simulated annealing, which also accepts moves towards lower-scoring states, and local beam search, which keeps a beam of a fixed number of multiple states at any time and randomly picks a state from the beam to modify at each step.
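The hill-climbing variant of local search described above can be sketched generically as below. This is not the Docent implementation: the state, the single 'replace a phrase' operation, and the toy document-level score (which rewards consistent lexical choice) are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

State = List[str]  # one chosen translation per occurrence of a source term


def hill_climb(initial: State,
               propose: Callable[[State, random.Random], State],
               score: Callable[[State], float],
               max_steps: int = 1000,
               patience: int = 200,
               seed: int = 0) -> Tuple[State, float]:
    """Accept a randomly proposed successor only if it scores strictly higher."""
    rng = random.Random(seed)
    state, best = initial, score(initial)
    rejected = 0
    for _ in range(max_steps):
        candidate = propose(state, rng)
        candidate_score = score(candidate)
        if candidate_score > best:
            state, best = candidate, candidate_score
            rejected = 0
        else:
            rejected += 1
            if rejected >= patience:  # no acceptable successor found
                break
    return state, best


OPTIONS = ["witch", "sorceress"]  # hypothetical phrase table alternatives


def replace_phrase(state: State, rng: random.Random) -> State:
    """Operation: replace one phrase by an alternative from the options."""
    successor = list(state)
    successor[rng.randrange(len(successor))] = rng.choice(OPTIONS)
    return successor


def consistency(state: State) -> float:
    """Toy document-level score: size of the largest group of equal choices."""
    return float(max(Counter(state).values()))


final_state, final_score = hill_climb(["witch", "sorceress", "witch"],
                                      replace_phrase, consistency)
```

Simulated annealing would differ only in the acceptance test, which would sometimes admit a lower-scoring successor.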

In order to avoid search errors due to pruning, Aziz et al. (2014, 2013) propose exact optimization algorithms for SMT decoding. They replace the intractable combinatorial space of translation derivations in (hierarchical) phrase-based statistical machine translation—given by the intersection between a translation lattice and a target language model—by a tractable relaxation which incorporates a low-order upper bound on the language model. Exact optimization is achieved through a coarse-to-fine strategy with connections to adaptive rejection sampling. In the experiments presented, it is shown how optimization with unpruned language models leads to no search errors, and therefore better translation quality as compared to beam search.

In terms of feature functions exploring discourse information, a handful have been proposed recently, the vast majority for standard beam search decoders. For example, Tiedemann (2010) and Gong et al. (2011) use cache-based language models based on word distributions in previous sentences. Focusing on lexical cohesion, Xiong et al. (2013) attempt to reinforce the choice of lexical items during decoding by computing lexical chains in the source document and predicting target lexical chains from the source ones. Variants of features include a count cohesion model that rewards a hypothesis whenever a chain word occurs in the hypothesis, and a probabilistic cohesion model that takes chain word translation probabilities into account. Also with the aim of enforcing consistency in lexical choices between test and training sentences and across test sentences, Alexandrescu and Kirchhoff (2009) use graph-based learning to exploit similarities between words used in these sentences.

Within the Docent framework, focusing on pronoun resolution, Hardmeier et al. (2014) use a neural network to predict the translation of a source language pronoun from a list of possible target language pronouns using features from the context of the source language pronouns and the translations of antecedents. Previous approaches to pronoun resolution in SMT applied anaphora resolution systems prior to translation (Le Nagard and Koehn 2010; Hardmeier and Federico 2010), and were heavily affected by the low performance of these systems.

# 6 Quality Evaluation and Estimation

The evaluation of the quality of MT has always been a major concern in the field. Unlike in most tagging or classification applications (see Chapter 15), many translations are possible for a given source text, and several of them could be considered equally good in most cases. Therefore, a simple binary comparison between the system output and a human translation is not an acceptable solution. A number of evaluation metrics have been proposed over the years, initially focusing on manual evaluation and more recently on automatic and semi-automatic evaluation. Manual evaluation metrics rely on human translators to judge criteria such as comprehensibility, intelligibility, fluency, adequacy, informativeness, etc.

NIST, the National Institute of Standards and Technology, has been running a number of evaluation campaigns, both as open competitions, where any MT system can participate,26 but also closed competitions as part of research programs such as those funded by DARPA, which only allow systems from partners in the research programs to participate, e.g. the GALE program evaluations (Olive et al. 2011). The NIST campaigns serve both to compare different MT systems and to measure progress over the years (by using the same test sets). Over a number of years, the way the evaluation is performed in these campaigns has changed. It went from manual scoring of translations for fluency and adequacy following different scales (e.g. a seven-point scale), to fully automated evaluation using various metrics, to human post-editing of machine translations followed by the computation of the edit distance between the original automatic translation and its post-edited version (see section 6.3).

WMT,27 the Workshop on Statistical Machine Translation, is another large evaluation campaign, which was initiated mostly to compare SMT systems, but nowadays allows systems of various types to participate. To date, WMT has had nine editions held jointly with major NLP conferences and has been serving as a major platform for open MT evaluation. WMT concentrates on comparing systems, as opposed to evaluating general systems’ quality or their progress over the years: the test sets are different every year and the evaluation is done by means of ranking systems. Moreover, most of the evaluation is done by volunteers or paid Mechanical Turk workers. Besides evaluating MT systems, WMT also promotes the comparison of MT evaluation and estimation metrics and methods for the combination of MT systems. In recent years, WMT has been showing that SMT systems achieve comparable if not superior performance compared to popular commercial rule-based systems for many language pairs. In addition, for some language pairs SMT systems built under ‘constrained’ conditions (limited training sets) have been shown to outperform online, unconstrained systems. For the results of the most recent campaigns, we refer the reader to Bojar et al. (2014, 2013) and Callison-Burch et al. (2012).

Although manual evaluation is clearly the most reliable way of assessing the performance of MT systems, obtaining manual judgements is costly and time-consuming. Particularly for system development, i.e. to measure the progress of a given system over different versions, most researchers rely on automatic evaluation metrics. Automatic metrics are also an essential component for the discriminative training in SMT models, where hundreds of thousands of translation hypotheses have to be scored over multiple iterations. Many automatic metrics have been proposed to automatically assess the performance of MT systems. A common element in most automatic MT evaluation metrics is the use of human translations, i.e. the reference translations, as ground truth. The hypothesis is that the MT output should be as close as possible to such a reference to be considered a correct translation. In order to measure the level of resemblance, some form of overlap or distance between system and reference translations is computed. Single words or phrases can be considered as the matching units. Most metrics can be applied when either a single reference or multiple references are available for every sentence in the test set. Since a source sentence can usually have more than one correct translation, the use of multiple references minimizes biases towards a specific human translation. A recent development to exhaustively generate reference translations from multiple partial (word, phrases, etc.) sentence translations manually produced by humans has been shown to make reference-based metrics much more reliable (Dreyer and Marcu 2012). Some metrics also consider inexact matches, for example, using lemmas instead of word forms, paraphrases, or entailments, when comparing human and machine translations. 
In the rest of this section, we review the arguably most popular evaluation metrics (BLEU/NIST, METEOR, and TER/HTER), and briefly touch upon some extensions adding richer linguistic information, and upon a family of metrics which disregard reference translations: quality estimation metrics.

## 6.1 BLEU

BLEU (Bilingual Evaluation Understudy) (Papineni et al. 2002) is the most commonly used evaluation metric for research in MT, although it has well-known limitations and a number of alternative metrics are available. BLEU focuses on lexical precision, i.e. the proportion of n-grams in the automatic translation which are covered by a reference translation. Therefore, BLEU rewards translations whose word choice and word order are similar to the reference.

Let $\mathit{count}_{\mathit{match\_clip}}(\mathit{ngram})$ be the count of n-gram matches between a given system output sentence $C$ and the reference translation, where the count of a repeated n-gram is clipped at the maximum number of occurrences of that n-gram in the reference. Let $\mathit{count}(\mathit{ngram})$ be the count of that n-gram in the MT system output. BLEU sums up the clipped n-gram matches for all the sentences in the test corpus, normalizing them by the number of candidate n-grams in the machine-translated test corpus. For a given $n$, this results in the precision score, $p_n$, for the entire corpus:

$$p_n = \frac{\sum_{C \in \mathit{Corpus}} \sum_{\mathit{ngram} \in C} \mathit{count}_{\mathit{match\_clip}}(\mathit{ngram})}{\sum_{C \in \mathit{Corpus}} \sum_{\mathit{ngram} \in C} \mathit{count}(\mathit{ngram})}$$

BLEU averages multiple n-gram precisions $p_n$ for n-grams of different sizes. The score for a given test corpus is the geometric mean of the $p_n$, using n-grams up to a length $N$ (usually 4) and positive weights $w_n = 1/N$, summing to 1:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

BLEU uses a brevity penalty $BP$ to avoid giving preference to short translations, since the denominator in each $p_n$ contains the total number of n-grams used in the machine-translated text, as opposed to the reference text. The brevity penalty aims at compensating for the lack of a recall component by contrasting the total number of words $c$ in the system translations against the reference length $r$. If multiple references are used, $r$ is defined as the length of the closest reference (in size) to the system output:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}$$

BLEU has many limitations, including the following:

• The n-gram matching is done over exact word forms, ignoring morphological variations, synonyms, etc.

• All matched words are weighted equally, i.e. the matching of a function word will count the same as the matching of a content word.

• A zero match for a given n-gram, which is common for higher-order n-grams, will result in a BLEU score equal to zero, unless smoothing techniques are used.

• The brevity penalty does not adequately compensate for the lack of recall.

• It does not correlate well with human judgements at sentence level.

• It does not provide an absolute quality score, but instead a score which is highly dependent on the test corpus, given its n-gram distributions.

Despite these limitations, BLEU has been shown to correlate well with human evaluation when comparing document-level outputs from different SMT systems, or measuring improvements of a given SMT system during its development, in both cases using the same test corpus for every evaluation round. Given its simplicity, BLEU is also very efficient for discriminative training in SMT.

A number of other MT evaluation metrics have been proposed to overcome the limitations of BLEU. A similar metric that is also commonly used is NIST (Doddington 2002). It differs from BLEU in the way n-gram scores are averaged, the weights given to n-grams, and the way the brevity penalty is computed. While BLEU relies on the geometric mean, NIST computes the arithmetic mean. Moreover, while BLEU uses uniform weights for all n-grams, NIST gives more weight to n-grams which occur less frequently, as an indicator of their higher informativeness. For example, very frequent bigrams in English like ‘of the’ will be weighted low since they are very likely to occur in many sentences and a match with a reference translation could therefore happen merely by chance. Finally, the modified brevity penalty minimizes the impact of small variations in the length of the system output on the final NIST score. Song et al. (2013) further explored variations of components in BLEU for higher correlation with human judgements and improved discriminative training.

## 6.2 METEOR

Another popular metric is METEOR (Metric for Evaluation of Translation with Explicit Ordering) (Lavie and Agarwal 2007). This metric includes a fragmentation score that accounts for word ordering, enhances token matching by considering stemming, synonymy, and paraphrase look-up, and allows the weights of its scoring components to be tuned to optimize correlation with human judgements for different purposes. METEOR is defined as:

$$METEOR = (1 - Pen) \cdot F_{mean}$$

A matching algorithm performs word alignment between the system output and reference translations. If multiple references are available, the matches are computed against each reference separately and the best match is selected. METEOR allows the unigram matches to be exact word matches, or generalized to stems, synonyms, and paraphrases, if language resources are available. Based on those matches, precision and recall are calculated, resulting in the following Fmean metric:

$$F_{mean} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}$$

where P is the unigram precision, i.e. the fraction of words in the system output that match words in the reference, and R is the unigram recall, i.e. the fraction of the words in the reference translation that match words in the system output.

The matching algorithm returns the fragmentation fraction, which is used to compute a discount factor Pen (for ‘penalty’) as follows. The sequence of matched unigrams between system output and reference translation is split into the fewest (and hence longest) possible chunks, where the matched words in each chunk are adjacent and in identical order in both strings. The number of chunks (ch) and the total number of matching words in all chunks (m) are then used to calculate a fragmentation fraction frag = ch/m. The discount factor Pen is computed as:

$$Pen = \gamma \cdot frag^{\beta}$$

The parameters of METEOR determine the relative weight of precision and recall (α), the discount factor (γ), and the functional relation between the fragmentation and the discount factor (β). These weights can be optimized for better correlation with human judgements on a particular quality aspect (fluency, adequacy, etc.), dataset, language pair, or evaluation unit (system, document, or sentence level) (Lavie and Agarwal 2007; Agarwal and Lavie 2008).
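To make the interaction between Fmean, the fragmentation fraction, and the discount factor concrete, the following Python sketch scores a single hypothesis against a single reference. It rests on several simplifying assumptions: only exact word matches are used (no stems, synonyms, or paraphrases), the alignment is a greedy left-to-right heuristic rather than the actual METEOR matching algorithm, and the default values of `alpha`, `beta`, and `gamma` are illustrative rather than tuned. The function name `meteor_sketch` is hypothetical.

```python
def meteor_sketch(hyp, ref, alpha=0.9, beta=3.0, gamma=0.5):
    """Simplified METEOR-style score for one tokenized hypothesis/reference pair."""
    used = [False] * len(ref)
    alignment = []  # (hypothesis index, reference index), in hypothesis order
    prev = None     # reference index of the most recent match
    for i, word in enumerate(hyp):
        candidates = [j for j, r in enumerate(ref) if r == word and not used[j]]
        if not candidates:
            continue
        # Prefer the position that extends the current chunk, else the first free one.
        j = prev + 1 if prev is not None and prev + 1 in candidates else candidates[0]
        used[j] = True
        alignment.append((i, j))
        prev = j

    m = len(alignment)
    if m == 0:
        return 0.0
    precision = m / len(hyp)
    recall = m / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A new chunk starts whenever two consecutive matches are not adjacent
    # in both the hypothesis and the reference.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    frag = chunks / m
    penalty = gamma * frag ** beta
    return (1 - penalty) * f_mean


reference = "the cat sat on the mat".split()
print(meteor_sketch("the cat sat on the mat".split(), reference))
print(meteor_sketch("on the mat the cat sat".split(), reference))
```

Both hypotheses match every reference word, so precision and recall are identical; the reordered one is split into two chunks instead of one and therefore receives a larger discount.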

## 6.3 TER/HTER

Inspired by the Word Error Rate (WER) metrics from the automatic speech recognition field, a popular family of metrics for MT evaluation is that of the edit / error rate metrics based on the Levenshtein distance (Levenshtein 1966). The Translation Edit Rate (TER) metric (Olive et al. 2011) computes the minimum number of substitutions, deletions and insertions that have to be performed to convert the automatic translation into a reference translation, as in WER; however, it uses an additional edit operation that takes into account movements (shifts) of sequences of words:

$$TER = \frac{\#\,\text{of edits}}{\text{average } \#\,\text{of reference words}}$$

For multiple references, the number of edits is computed with respect to each reference individually, and the reference with the fewest number of edits necessary is chosen. In the search process for the minimum number of edits, shifts are prioritized over other edits.
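The edit-counting core of these metrics can be illustrated with a word-level Levenshtein distance. The sketch below deliberately omits the shift operation, so it corresponds to a WER-style edit rate and only gives an upper bound on the number of edits TER would count. It also assumes normalization by the average reference length. The function names `word_edit_distance` and `edit_rate_no_shifts` are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word substitutions, deletions, and insertions
    needed to turn hyp into ref (no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_rate_no_shifts(hyp, refs):
    """Edits against the closest reference, divided by the average reference length."""
    edits = min(word_edit_distance(hyp, ref) for ref in refs)
    avg_ref_len = sum(len(ref) for ref in refs) / len(refs)
    return edits / avg_ref_len


hyp = "the cat sat on mat".split()
refs = ["the cat sat on the mat".split()]
print(edit_rate_no_shifts(hyp, refs))
```

Adding shifts requires a search over candidate block movements before the dynamic programme is run, which is why full TER implementations are considerably more involved than this sketch.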

Human-targeted Translation Edit Rate (HTER) (Snover et al. 2006) is a semi-automatic variation of TER in which the references are built as human-corrected versions of the machine translations via post-editing. As long as adequate post-editing guidelines are used, the edit rate is measured as the minimum number of edits necessary to transform the system output into a correct translation. Recent versions of TER/HTER also allow the tuning of the weights for each type of edit and the use of paraphrases for inexact matches (Snover et al. 2009).

## 6.4 Linguistically informed metrics

MT evaluation is currently a very active field. A number of alternative metrics have been proposed and many of these metrics have been shown to correlate better with human evaluation, particularly at sentence level. Some of these metrics exploit the matching of linguistic information at different levels, as opposed to simple matching at the lexical level. These include the matching of base phrases, named entities, syntactic sub-trees, semantic role labels, etc. For example, Giménez and Màrquez (2010) present a number of variations of linguistic metrics for document and sentence-level evaluation. These are available as part of the Asiya toolkit.28 A few recent metrics exploit shallow semantic information in a principled way by attempting to align/match predicates in the system output and reference translation, to only then align/match their arguments (roles and fillers) (Rios et al. 2011; Lo, Tumuluru, and Wu 2012). Other metrics consider the matching of discourse relations (Guzmán et al. 2014).

For other recent developments in MT evaluation metrics, the readers are referred to the proceedings of recent MT evaluation campaigns, which now include tracks for meta-evaluation of MT evaluation metrics (Callison-Burch et al. 2010, 2011, 2012; Bojar et al. 2013, 2014). In spite of many criticisms (Callison-Burch et al. 2006), BLEU and other simple lexical matching metrics continue to be the most commonly used alternative, since they are fast and cheap to compute.

## 6.5 Quality estimation metrics

Reference-based MT evaluation metrics are very useful to compare MT systems and measure systems’ progress, but their application is limited to the data sets for which references are available, and their results may not generalize to new data sets. Some effort has been put towards reference-free MT evaluation metrics. These are generally built using machine learning algorithms and data sets annotated with some form of quality scores and a number of automatically extracted features. Reference-free metrics are aimed at predicting the quality of new, unseen translated text and have a number of applications, for example:

• Decide whether a given translation is good enough for publishing as is.

• Inform readers of the target language only whether or not they can rely on a translation.

• Filter out translations that are not good enough for post-editing by professional translators.

• Select the best translation among options from multiple MT and/or translation memory systems.

The first metrics were derived from the field of automatic speech recognition and were referred to as confidence estimation (Blatz et al. 2003, 2004). These metrics estimate the confidence of an SMT system in the translations it produces by taking into account information from the SMT system itself (the probabilities of phrases used in the translation, its language model score, the overall model score, similarity between the translation and other candidates in the n-best list, information from the search graph, such as number of possible hypotheses, etc.), as well as MT system-independent features, such as the size of the source and candidate hypotheses, the grammaticality of the translations, etc. The first confidence estimation metrics were modelled using reference translations at training time in order to predict an automatic score such as WER or NIST, or to predict binary scores dividing the test set into ‘good’ and ‘bad’ translations. A number of promising features were identified, but the overall results were not encouraging. Quirk (2004) obtained better results with similar learning algorithms and features, but using a relatively small set of translations annotated with human scores for training.

With the overall improvement of MT systems, this challenge has been reshaped as a more general problem of quality estimation (Specia et al. 2010). The goal of quality estimation is to predict the overall quality of a translated text, in particular using features that are independent from the MT system that produced the translations, the so-called ‘black-box features’. These features include indicators of how common the source text is, how fluent the translations are, whether they contain spelling mistakes or unknown words, and structural differences between the source and translation texts, among others. System-dependent (or ‘glass-box’) features are an additional component that can further contribute to the overall performance of quality estimation metrics. Work has been proposed to estimate document-level quality (Soricut and Echihabi 2010; Scarton and Specia 2014) and subsentence-level quality (Bach et al. 2011), and to address practical applications such as directly estimating post-editing time (Specia 2011) or selecting between candidates from MT and translation memories (He et al. 2010). Shared tasks on quality estimation organized as part of WMT12–14 resulted in a number of interesting systems for sentence-level scoring and ranking, as well as word-level prediction (Callison-Burch et al. 2012; Bojar et al. 2013, 2014).
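As a rough illustration of what black-box features look like in practice, the Python sketch below extracts a handful of system-independent indicators from a source sentence and its translation. Everything here is an assumption made for illustration: the function name `black_box_features`, the choice of features, the whitespace tokenization, and the use of a plain vocabulary set as a stand-in for a target-language lexicon. Real quality estimation systems use much richer feature sets and feed them to a learning algorithm trained on quality-annotated data.

```python
def black_box_features(source, translation, known_target_words):
    """Extract a few system-independent features for one sentence pair.

    known_target_words: a set of lower-cased target-language words
    (hypothetical stand-in for a lexicon or language model vocabulary).
    """
    src_tokens = source.split()
    tgt_tokens = translation.split()
    unknown = [w for w in tgt_tokens if w.lower() not in known_target_words]
    return {
        "source_length": len(src_tokens),
        "target_length": len(tgt_tokens),
        # Large deviations from typical length ratios can signal structural problems.
        "length_ratio": len(tgt_tokens) / len(src_tokens) if src_tokens else 0.0,
        # Proportion of target words not found in the vocabulary.
        "unknown_word_ratio": len(unknown) / len(tgt_tokens) if tgt_tokens else 0.0,
    }


features = black_box_features("the blue car", "la voiture bleue", {"la", "voiture"})
print(features)
```

None of these features requires access to the MT system's internals, which is what makes them applicable to translations from any source, including online systems.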

# 7 Remarks and Perspectives

The state-of-the-art performance in MT varies according to language pair, corpora available, and training conditions, among other variables. It is not possible to provide absolute numbers reflecting the performance of the best MT systems, since evaluation campaigns focus on comparing different systems and ranking them according to such comparisons, as opposed to providing absolute quality scores. Moreover, for practical reasons the evaluations are always limited to a relatively small test set, on a given text genre and domain. However, from the results of recent campaigns, it is evident that SMT systems or hybrid systems involving SMT are ranked top for most language pairs considered (Bojar et al. 2014). These top systems often include online systems, such as Google Translate and Microsoft Bing Translator, and variations of open-source toolkits such as Moses.

While the field of MT, and particularly SMT, continues to progress at a fast pace, there are a number of opportunities for improvement. Among the interesting directions for development, the following can be emphasized:

#### Fully discriminative models for SMT.

A practical limitation of the discriminative learning framework for parameter tuning described in section 5.1.6 is that it can only handle a small number of features due to the very large space of possible parameter values that has to be searched. A larger number of parameters is likely to result in overfitting. This approach has been extended to fully discriminative methods, where the idea is to use larger feature sets and machine learning techniques that can cope with these feature sets. Common features include word or phrase pairs, whose value could be binary. In other words, instead of using maximum likelihood estimates for word or phrase probabilities, each word or phrase pair from a phrase table can be represented as a feature function: for example, (the blue car, la voiture bleue). The feature values will be binary indicators of whether a candidate translation contains that phrase pair, and their weights will indicate how useful that phrase pair is for the model in general. Other examples of features are words or phrases in the target language: for example, (la voiture bleue), whose value will be a binary indicator of the presence of the phrase in the candidate translation. Linguistic information can also be added: for example, the phrase pairs can be represented by their POS tags, as opposed to the actual words. The tuning of the feature weights can be done using alternative discriminative training methods to minimize error functions, such as perceptron-style learning (Liang et al. 2006). Exploiting very large feature sets requires that the tuning is performed using a large parallel corpus, with millions of sentences, since all variations of translation units must be seen during tuning. Issues such as scalability and overfitting when tuning millions of parameters and sentences are ongoing research topics.

Other approaches try to keep the tuning corpus small but still add thousands of features by using ranking approaches to directly learn how to rank alternative translations (Watanabe et al. 2007b; Chiang et al. 2009). Alternative approaches include Bangalore et al. (2007), Venkatapathy and Bangalore (2009), and Kääriäinen (2009). The cdec toolkit facilitates efficient fully discriminative training (Dyer et al. 2010).
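The idea of binary phrase-pair features with perceptron-style weight updates can be sketched as follows. This is a toy illustration under stated assumptions, not the method of any of the cited papers: each candidate translation is represented simply as the list of phrase pairs used to build it, the ‘oracle’ candidate (the one closest to the reference) is assumed to be given, and the names `indicator_features`, `model_score`, and `perceptron_update` are hypothetical.

```python
def indicator_features(phrase_pairs):
    """Binary features: one per phrase pair and one per target phrase used."""
    features = {}
    for source_phrase, target_phrase in phrase_pairs:
        features[("pair", source_phrase, target_phrase)] = 1
        features[("target", target_phrase)] = 1
    return features


def model_score(weights, features):
    """Linear model score: sum of the weights of the active features."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def perceptron_update(weights, candidates, oracle_index, learning_rate=1.0):
    """One perceptron step: if the model's best candidate is not the oracle,
    move weights towards the oracle's features and away from the prediction's.
    Returns the index of the candidate the model preferred before the update."""
    feats = [indicator_features(c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: model_score(weights, feats[i]))
    if best != oracle_index:
        for f, v in feats[oracle_index].items():
            weights[f] = weights.get(f, 0.0) + learning_rate * v
        for f, v in feats[best].items():
            weights[f] = weights.get(f, 0.0) - learning_rate * v
    return best


weights = {}
candidates = [
    [("the blue car", "la bleue voiture")],
    [("the blue car", "la voiture bleue")],
]
first = perceptron_update(weights, candidates, oracle_index=1)
second = perceptron_update(weights, candidates, oracle_index=1)
print(first, second)
```

Because every phrase pair becomes its own feature, the weight vector grows with the phrase table, which is the source of the scalability and overfitting issues mentioned above.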

#### Domain adaptation.

SMT is known to achieve better translations when large quantities of parallel data are available for the text domain under consideration. When this is not possible, the only option is to use data from different domains to build SMT systems, and to leverage any smaller quantities of data available for the domain of interest through domain adaptation techniques. Existing strategies include building phrase tables from in- and out-of-domain corpora and then interpolating them by learning weights for each of these phrase tables (Foster and Kuhn 2007). An alternative strategy consists in exploiting large monolingual in-domain corpora, either in the source or in the target language, which are normally much easier to obtain than parallel corpora. Monolingual data can be used to train in-domain target language models, or to generate additional synthetic bilingual data, which is then used to adapt the model components through the interpolation of multiple phrase tables (Bertoldi and Federico 2009). Parallel sentences that are close enough to the domain of interest can also be selected/weighted from large out-of-domain corpora using string similarity metrics (Shah et al. 2012), language modelling, and cross-entropy metrics (Axelrod et al. 2011), among others.
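Linear interpolation of phrase tables can be sketched as below. The sketch assumes each phrase table is a dictionary mapping (source phrase, target phrase) pairs to a translation probability, and it uses a fixed, hand-set interpolation weight `lam`. In the work cited above the weights are learned, so the fixed value here is purely illustrative, as is the function name `interpolate_phrase_tables`.

```python
def interpolate_phrase_tables(in_domain, out_of_domain, lam=0.7):
    """Linearly interpolate two phrase tables.

    Each table maps (source_phrase, target_phrase) -> probability.
    lam is the weight of the in-domain table; pairs missing from a
    table contribute probability 0 from that table.
    """
    merged = {}
    for pair in set(in_domain) | set(out_of_domain):
        merged[pair] = (lam * in_domain.get(pair, 0.0)
                        + (1 - lam) * out_of_domain.get(pair, 0.0))
    return merged


in_domain = {("bank", "banque"): 0.9, ("bank", "rive"): 0.1}
out_of_domain = {("bank", "banque"): 0.4, ("bank", "rive"): 0.6}
merged = interpolate_phrase_tables(in_domain, out_of_domain, lam=0.5)
print(merged)
```

With a higher `lam`, the in-domain preference for a given translation dominates even when the much larger out-of-domain corpus favours a different sense.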

#### MT for end-users.

Tools that make MT more suitable for end-users, especially professional translators, are becoming more popular nowadays, given the general belief that MT has reached levels of quality that render it useful in scenarios other than gisting. This is particularly true for SMT approaches, which remained limited to the research community for many years, where the usability of the tools developed was not a concern. Several research initiatives towards making MT useful and efficient for end-users can be mentioned, including the integration of MT and translation memory systems (He et al. 2010; Koehn and Senellart 2010), metrics to estimate the quality of machine translations (Specia 2011), design of post-editing tools (Koehn 2010a), automatic post-editing (Simard et al. 2007), etc. A few recently funded projects in Europe focus on the development of user-friendly translation tools, the integration of computer-aided translation tools and MT, and the development of pre/post-editing interfaces. These include MosesCore,29 MATECAT,30 CASMACAT,31 and ACCEPT.32

#### Dealing with noisy data.

Translation of social media and other user-generated content such as product reviews, blogs, etc. is also a topic that has been attracting increasing levels of interest. More and more of this type of content gets translated automatically, especially using online SMT systems. For example, Facebook offers translations from Bing for posts in languages other than the language set by the user. User-generated content is known to contain unusual text, such as abbreviations, misspellings and alternative spellings of words, broken grammar, and other aspects that are difficult if not impossible to handle for MT systems built for standard language. Beyond the social motivation of enabling direct communication between end-users, one important reason for translating this type of content is that it can be the sole or most significant source of information in many scenarios. For example, products sold online worldwide may have user-provided reviews only in languages that are not understandable to the current buyer. Most approaches attempt to deal with this type of content by preprocessing it so that it is normalized into standard language text before translation. Others attempt to collect user-generated data to build (statistical) MT systems that can process such data directly (Banerjee et al. 2012). Recent studies have shown that end-users may be more forgiving of lower-quality translation for this type of content, finding it useful in many cases, especially when compared to not having any other translations available (Mitchell and Roturier 2012). CNGL (Centre for Global Intelligent Content) and Microsoft Research developed brazilator,33 a service to provide a live translation stream of tweets relating to the 2014 FIFA World Cup from and into various languages.

While some of these directions are very recent, others have achieved a degree of maturity over the years, although the problems they address are far from solved. The directions, and particularly the few references given for each of them, only represent a small fraction of the research done in the vast field of SMT.

# Further Reading and Relevant Resources

A wealth of information about recent developments in the field of machine translation can be found online. The Machine Translation Archive34 is a compilation sponsored by the European Association for Machine Translation35 that puts together electronic versions of publications (from conferences, workshops, journals, etc.) on various topics related to machine translation, computer translation systems, and computer-based translation tools. These are organized in various ways to facilitate search, including by author, affiliation, and methodology.

The University of Edinburgh’s Statistical Machine Translation research group maintains a website36 with relevant information on the topic, including core references and slides, a list of conferences and workshops, and links for tools and corpora. A large part of this website is dedicated to Moses,37 containing relevant information on how to install and use the popular toolkit, with tutorials, manuals, and specialized mailing lists, as well as information on how to contribute to the code. A recent wiki-based initiative on the same website38 aims at putting together references for publications on various topics in the field of statistical machine translation.

WMT (Workshop on Statistical Machine Translation),39 the series of open evaluation campaigns, is the best source for up-to-date, freely available resources to build machine translation systems, as well as for the latest data and results produced in the competitions on machine translation and evaluation metrics, among other tasks.40 Other relevant open campaigns are the NIST OpenMT,41 which is organized less frequently and often connected to DARPA-funded programs, and IWSLT (International Workshop on Spoken Language Translation),42 which focuses on spoken language translation. In addition to the data made available by these campaigns, an important source of data is the OPUS project,43 which contains a very large and varied collection of parallel corpora in dozens of language pairs that can be used for machine translation. The corpora are crawled from open-source products on the web, like fan-made subtitles, automatically aligned, and in some cases, automatically annotated with linguistic information.

Some of the most important references on the topic, which have been used throughout this chapter, are the following: Hutchins (1997, 2000, 2007) and Wilks (2009) for the history of MT and rule-based approaches; Brown et al. (1990, 1993) for the mathematical foundations of SMT; Lopez (2008) for a recent survey on SMT; and Koehn (2010b) for a textbook on MT, with particular emphasis on SMT.

## References

Agarwal, Abhaya and Alon Lavie (2008). ‘METEOR, M-BLEU and M-TER: Evaluation Metrics for High Correlation with Human Rankings of Machine Translation Output’. In Proceedings of the Third Workshop on Statistical Machine Translation, 115–118. Stroudsburg, PA: Association for Computational Linguistics.

Alexandrescu, Andrei and Katrin Kirchhoff (2009). ‘Graph-Based Learning for Statistical Machine Translation’. In Human Language Technologies: Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, CO, 119–127. Stroudsburg, PA: Association for Computational Linguistics.

Axelrod, Amittai, Xiaodong He, and Jianfeng Gao (2011). ‘Domain Adaptation via Pseudo In-Domain Data Selection’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11), Edinburgh, UK, 355–362. Stroudsburg, PA: Association for Computational Linguistics.

Aziz, Wilker, Marc Dymetman, and Lucia Specia (2014). ‘Exact Decoding for Phrase-Based Statistical Machine Translation’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’14), Doha, Qatar, 1237–1249. Stroudsburg, PA: Association for Computational Linguistics.

Aziz, Wilker, Marc Dymetman, and Sriram Venkatapathy (2013). ‘Investigations in Exact Inference for Hierarchical Translation’. In Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, 472–483. Stroudsburg, PA: Association for Computational Linguistics.

Aziz, Wilker, Marc Dymetman, Shachar Mirkin, Lucia Specia, Nicola Cancedda, and Ido Dagan (2010). ‘Learning an Expert from Human Annotations in Statistical Machine Translation: The Case of Out-of-Vocabulary Words’. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT-2010), Saint-Raphaël, France, 28–35. European Association for Machine Translation.

Aziz, Wilker, Miguel Rios, and Lucia Specia (2011). ‘Shallow Semantic Trees for SMT’. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 316–322. Stroudsburg, PA: Association for Computational Linguistics.

Bach, Nguyen, Fei Huang, and Yaser Al-Onaizan (2011). ‘Goodness: A Method for Measuring Machine Translation Confidence’. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, 211–219. Stroudsburg, PA: Association for Computational Linguistics.

Baker, Kathryn, Michael Bloodgood, Chris Callison-Burch, Bonnie J. Dorr, Nathaniel W. Filardo, Lori Levin, Scott Miller, and Christine Piatko (2010). ‘Semantically-Informed Machine Translation: A Tree-Grafting Approach’. In Proceedings of the Ninth Biennial Conference of the Association for Machine Translation in the Americas, Denver, CO. Association for Machine Translation in the Americas.

Banerjee, Pratyush, Sudip Kumar Naskar, Johann Roturier, Andy Way, and Josef van Genabith (2012). ‘Domain Adaptation in SMT of User-Generated Forum Content Guided by OOV Word Reduction: Normalization and/or Supplementary Data?’ In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT-2012), Trento, Italy, 169–176. European Association for Machine Translation.

Bangalore, Srinivas, Patrick Haffner, and Stephan Kanthak (2007). ‘Statistical Machine Translation through Global Lexical Selection and Sentence Reconstruction’. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, 152–159. Stroudsburg, PA: Association for Computational Linguistics.

Bar-Hillel, Yehoshua (1960). ‘The Present Status of Automatic Translation of Languages’, Advances in Computers, 1: 91–163.

Bertoldi, Nicola and Marcello Federico (2009). ‘Domain Adaptation for Statistical Machine Translation with Monolingual Resources’. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, 182–189. Stroudsburg, PA: Association for Computational Linguistics.

Blatz, John, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing (2003). ‘Confidence Estimation for Machine Translation’. Technical report, Johns Hopkins University, Baltimore.

Blatz, John, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing (2004). ‘Confidence Estimation for Machine Translation’. In Proceedings of the 20th International Conference on Computational Linguistics (COLING ’04), Geneva, 315–321. Stroudsburg, PA: Association for Computational Linguistics.

Bojar, Ondrej, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia (2013). ‘Findings of the 2013 Workshop on Statistical Machine Translation’. In Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, 1–44. Stroudsburg, PA: Association for Computational Linguistics.

Bojar, Ondrej, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna (2014). ‘Findings of the 2014 Workshop on Statistical Machine Translation’. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, 12–58. Stroudsburg, PA: Association for Computational Linguistics.

Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin (1990). ‘A Statistical Approach to Machine Translation’, Computational Linguistics, 16(2): 79–85.

Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer (1993). ‘The Mathematics of Statistical Machine Translation: Parameter Estimation’, Computational Linguistics, 19(2): 263–311.

Brown, Ralf D. (2011). ‘The CMU-EBMT Machine Translation System’, Machine Translation, 25(2): 179–195.

Callison-Burch, Chris, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan (2010). ‘Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation’. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, Uppsala, Sweden, 17–53. Stroudsburg, PA: Association for Computational Linguistics.

Callison-Burch, Chris, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia (2012). ‘Findings of the 2012 Workshop on Statistical Machine Translation’. In Proceedings of the Seventh Workshop on Statistical Machine Translation, Montreal, Canada, 10–51. Stroudsburg, PA: Association for Computational Linguistics.

Callison-Burch, Chris, Philipp Koehn, Christof Monz, and Omar Zaidan (2011). ‘Findings of the 2011 Workshop on Statistical Machine Translation’. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 22–64. Stroudsburg, PA: Association for Computational Linguistics.

Callison-Burch, Chris, Miles Osborne, and Philipp Koehn (2006). ‘Re-evaluating the Role of BLEU in Machine Translation Research’. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL ’06), Trento, Italy, 249–256. Stroudsburg, PA: Association for Computational Linguistics.

Carpuat, Marine and Dekai Wu (2007). ‘Improving Statistical Machine Translation Using Word Sense Disambiguation’. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 61–72. Stroudsburg, PA: Association for Computational Linguistics.Find this resource:

Cherry, Colin and George Foster (2012). ‘Batch Tuning Strategies for Statistical Machine Translation’. In Human Language Technologies: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics, Montreal, Canada, 427–436. Stroudsburg, PA: Association for Computational Linguistics.Find this resource:

Cherry, Colin and Dekang Lin (2007). ‘Inversion Transduction Grammar for Joint Phrasal Translation Modeling’. In Proceedings of the AMTA Workshop on Syntax and Structure in Statistical Translation, Rochester, NY, 17–24. Association for Machine Translation in the Americas.Find this resource:

Lo, Chi-kiu, Anand Karthik Tumuluru, and Dekai Wu (2012). ‘Fully Automatic Semantic MT Evaluation’. In Proceedings of the Seventh Workshop on Statistical Machine Translation, Montreal, Canada, 243–252. Stroudsburg, PA: Association for Computational Linguistics.Find this resource:

Chiang, David (2005). ‘A Hierarchical Phrase-Based Model for Statistical Machine Translation’. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, MI, 263–270. Stroudsburg, PA: Association for Computational Linguistics.Find this resource:

Chiang, David (2007). ‘Hierarchical Phrase-Based Translation’, Computational Linguistics, 33: 201–228.Find this resource:

Chiang, David, Kevin Knight, and Wei Wang (2009). ‘11,001 New Features for Statistical Machine Translation’. In Human Language Technologies: Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, CO, 218–226. Stroudsburg, PA: Association for Computational Linguistics.

Dandapat, Sandipan, Mikel Forcada, Declan Groves, Sergio Penkale, John Tinsley, and Andy Way (2010). ‘OpenMaTrEx: A Free/Open-Source Marker-Driven Example-Based Machine Translation System’. In Proceedings of the 7th International Conference on Natural Language Processing, Reykjavik, Iceland, 16–18.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’, Journal of the Royal Statistical Society, 39(1): 1–38.

Devlin, Jacob, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul (2014). ‘Fast and Robust Neural Network Joint Models for Statistical Machine Translation’. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, 1370–1380. Stroudsburg, PA: Association for Computational Linguistics.

Doddington, George (2002). ‘Automatic Evaluation of Machine Translation Quality Using n-gram Co-occurrence Statistics’. In Proceedings of the 2nd International Conference on Human Language Technology Research, San Diego, 138–145. San Francisco, CA: Morgan Kaufmann Publishers.

Dorr, Bonnie J. (1993). ‘Interlingual Machine Translation: A Parameterized Approach’, Artificial Intelligence, 63: 429–492.

Dreyer, Markus and Daniel Marcu (2012). ‘Hyter: Meaning-Equivalent Semantics for Translation Evaluation’. In Human Language Technologies: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics, Montreal, Canada, 162–171. Stroudsburg, PA: Association for Computational Linguistics.

Dyer, Chris, Jonathan Weese, Hendra Setiawan, Adam Lopez, Ferhan Ture, Vladimir Eidelman, Juri Ganitkevitch, Phil Blunsom, and Philip Resnik (2010). ‘cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models’. In Proceedings of the ACL 2010 System Demonstrations, Uppsala, Sweden, 7–12. Stroudsburg, PA: Association for Computational Linguistics.

Farwell, David and Yorick Wilks (1991). ‘ULTRA: A Multilingual Machine Translator’. In Proceedings of the Machine Translation Summit III, Washington, DC, 19–24.

Foster, George and Roland Kuhn (2007). ‘Mixture-Model Adaptation for SMT’. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, 128–135. Stroudsburg, PA: Association for Computational Linguistics.

Galley, Michel, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve Deneefe, Wei Wang, and Ignacio Thayer (2006). ‘Scalable Inference and Training of Context-Rich Syntactic Translation Models’. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 961–968. Stroudsburg, PA: Association for Computational Linguistics.

Giménez, Jesús and Lluís Màrquez (2010). ‘Linguistic Measures for Automatic Machine Translation Evaluation’, Machine Translation, 24(3–4): 209–240.

Gimpel, Kevin and Noah A. Smith (2008). ‘Rich Source-Side Context for Statistical Machine Translation’. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, OH, 9–17. Stroudsburg, PA: Association for Computational Linguistics.

Gong, Zhengxian, Min Zhang, and Guodong Zhou (2011). ‘Cache-Based Document-Level Statistical Machine Translation’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11), Edinburgh, UK, 909–919. Stroudsburg, PA: Association for Computational Linguistics.

Green, Spence, Daniel Cer, and Christopher Manning (2014). ‘Phrasal: A Toolkit for New Directions in Statistical Machine Translation’. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, 114–121. Stroudsburg, PA: Association for Computational Linguistics.

Guzmán, Francisco, Shafiq Joty, Lluís Màrquez, and Preslav Nakov (2014). ‘Using Discourse Structure Improves Machine Translation Evaluation’. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, 687–698. Stroudsburg, PA: Association for Computational Linguistics.

Hanneman, Greg, Edmund Huber, Abhaya Agarwal, Vamshi Ambati, Alok Parlikar, Erik Peterson, and Alon Lavie (2008). ‘Statistical Transfer Systems for French–English and German–English Machine Translation’. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, OH, 163–166. Stroudsburg, PA: Association for Computational Linguistics.

Haque, Rejwanul, Sudip Kumar Naskar, Antal van den Bosch, and Andy Way (2010). ‘Supertags as Source Language Context in Hierarchical Phrase-Based SMT’. In Proceedings of the Ninth Biennial Conference of the Association for Machine Translation in the Americas, Denver, CO. Association for Machine Translation in the Americas.

Hardmeier, Christian and Marcello Federico (2010). ‘Modelling Pronominal Anaphora in Statistical Machine Translation’. In Proceedings of the 7th International Workshop on Spoken Language Translation, Paris, France, 283–289.

Hardmeier, Christian, Joakim Nivre, and Jörg Tiedemann (2012). ‘Document-Wide Decoding for Phrase-Based Statistical Machine Translation’. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju, Korea, 1179–1190. Stroudsburg, PA: Association for Computational Linguistics.

Hardmeier, Christian, Sara Stymne, Jörg Tiedemann, and Joakim Nivre (2013). ‘Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation’. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Sofia, Bulgaria, 193–198. Stroudsburg, PA: Association for Computational Linguistics.

Hardmeier, Christian, Sara Stymne, Jörg Tiedemann, Aaron Smith, and Joakim Nivre (2014). ‘Anaphora Models and Reordering for Phrase-Based SMT’. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, 122–129. Stroudsburg, PA: Association for Computational Linguistics.

He, Yifan, Yanjun Ma, Josef van Genabith, and Andy Way (2010). ‘Bridging SMT and TM with Translation Recommendation’. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 622–630. Stroudsburg, PA: Association for Computational Linguistics.

Hoang, Hieu and Philipp Koehn (2010). ‘Improved Translation with Source Syntax Labels’. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, Uppsala, Sweden, 409–417. Stroudsburg, PA: Association for Computational Linguistics.

Hopkins, Mark and Jonathan May (2011). ‘Tuning as Ranking’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11), Edinburgh, UK, 1352–1362. Stroudsburg, PA: Association for Computational Linguistics.

Huang, Liang, Kevin Knight, and Aravind Joshi (2006). ‘A Syntax-Directed Translator with Extended Domain of Locality’. In Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, New York, 1–8. Stroudsburg, PA: Association for Computational Linguistics.

Hutchins, W. John (1986). Machine Translation: Past, Present, Future. New York: Halsted Press.

Hutchins, W. John (1997). ‘Milestones in Machine Translation. Part 1: How It All Began in 1947 and 1948’, Language Today, (3): 22–23.

Hutchins, W. John (ed.) (2000). Early Years in Machine Translation: Memoirs and Biographies of Pioneers. Studies in the History of the Language Sciences, 97. Amsterdam and Philadelphia: John Benjamins.

Hutchins, W. John (2007). ‘Machine Translation: A Concise History’, <http://www.hutchinsweb.me.uk/CUHK-2006.pdf>.

Hutchins, W. John and Harold L. Somers (eds) (1992). An Introduction to Machine Translation. London: Academic Press.

Jones, Bevan, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight (2012). ‘Semantics-Based Machine Translation with Hyperedge Replacement Grammars’. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Bombay, 1359–1376. Stroudsburg, PA: Association for Computational Linguistics.

Kääriäinen, Matti (2009). ‘Sinuhe: Statistical Machine Translation Using a Globally Trained Conditional Exponential Family Translation Model’. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP ’09), Singapore, 1027–1036. Stroudsburg, PA: Association for Computational Linguistics.

Kay, Martin (1980). ‘The Proper Place of Men and Machines in Language Translation’. Research report Xerox, Palo Alto Research Center, National Academy of Sciences, National Research Council, Palo Alto, CA.

Koehn, Philipp (2010a). ‘Enabling Monolingual Translators: Post-Editing vs. Options’. In Human Language Technologies: Proceedings of the 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, CA, 537–545. Stroudsburg, PA: Association for Computational Linguistics.

Koehn, Philipp (2010b). Statistical Machine Translation. Cambridge: Cambridge University Press.

Koehn, Philipp and Hieu Hoang (2007). ‘Factored Translation Models’. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 868–876. Stroudsburg, PA: Association for Computational Linguistics.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst (2007). ‘Moses: Open Source Toolkit for Statistical Machine Translation’. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, 177–180. Stroudsburg, PA: Association for Computational Linguistics.

Koehn, Philipp, Franz Josef Och, and Daniel Marcu (2003). ‘Statistical Phrase-Based Translation’. In Human Language Technologies: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Canada, 48–54. Stroudsburg, PA: Association for Computational Linguistics.

Koehn, Philipp and Jean Senellart (2010). ‘Convergence of Translation Memory and Statistical Machine Translation’. In Proceedings of the AMTA-2010 Workshop Bringing MT to the User: MT Research and the Translation Industry, Denver, CO, 21–31. Association for Machine Translation in the Americas.

Lavie, Alon and Abhaya Agarwal (2007). ‘METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments’. In Proceedings of the 2nd Workshop on Statistical Machine Translation, Prague, Czech Republic, 228–231. Stroudsburg, PA: Association for Computational Linguistics.

Le Nagard, Ronan and Philipp Koehn (2010). ‘Aiding Pronoun Translation with Co-reference Resolution’. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, Uppsala, Sweden, 252–261. Stroudsburg, PA: Association for Computational Linguistics.

Levenshtein, Vladimir I. (1966). ‘Binary Codes Capable of Correcting Deletions, Insertions, and Reversals’, Soviet Physics Doklady, 10(8): 707–710.

Li, Zhifei, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren N. G. Thornton, Jonathan Weese, and Omar F. Zaidan (2009). ‘Joshua: An Open Source Toolkit for Parsing-Based Machine Translation’. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, 135–139. Stroudsburg, PA: Association for Computational Linguistics.

Liang, Percy, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar (2006). ‘An End-to-End Discriminative Approach to Machine Translation’. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 761–768. Stroudsburg, PA: Association for Computational Linguistics.

Liu, Ding and Daniel Gildea (2008). ‘Improved Tree-to-String Transducer for Machine Translation’. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, OH, 62–69. Stroudsburg, PA: Association for Computational Linguistics.

Liu, Ding and Daniel Gildea (2010). ‘Semantic Role Features for Machine Translation’. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING ’10), Beijing, China, 716–724. Stroudsburg, PA: Association for Computational Linguistics.

Lopez, Adam (2008). ‘Statistical Machine Translation’, ACM Computing Surveys, 40: 1–49.

Marcu, Daniel, Wei Wang, Abdessamad Echihabi, and Kevin Knight (2006). ‘SPMT: Statistical Machine Translation with Syntactified Target Language Phrases’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’06), Sydney, Australia, 44–52. Stroudsburg, PA: Association for Computational Linguistics.

Marcu, Daniel and William Wong (2002). ‘A Phrase-Based, Joint Probability Model for Statistical Machine Translation’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’02), Philadelphia, PA, 133–139. Stroudsburg, PA: Association for Computational Linguistics.

Melby, Alan K. (1982). ‘A Bilingual Concordance System and its Use in Linguistic Studies’. In Proceedings of the Eighth LACUS Forum, 541–549. Columbia, SC: Hornbeam Press.

Mi, Haitao and Liang Huang (2008). ‘Forest-Based Translation Rule Extraction’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), Honolulu, HI, 206–214. Stroudsburg, PA: Association for Computational Linguistics.

Mirkin, Shachar, Lucia Specia, Nicola Cancedda, Ido Dagan, Marc Dymetman, and Idan Szpektor (2009). ‘Source-Language Entailment Modeling for Translating Unknown Terms’. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, 791–799. Stroudsburg, PA: Association for Computational Linguistics.

Mitchell, Linda and Johann Roturier (2012). ‘Evaluation of Machine-Translated User Generated Content: A Pilot Study Based on User Ratings’. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT-2012), Trento, Italy, 61–64. European Association for Machine Translation.

Nagao, Makoto (1984). ‘A Framework of a Mechanical Translation between Japanese and English by Analogy Principle’. In Proceedings of the International NATO Symposium on Artificial and Human Intelligence, Lyon, France, 173–180. New York: Elsevier North-Holland.

Och, Franz Josef (2003). ‘Minimum Error Rate Training in Statistical Machine Translation’. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 160–167. Stroudsburg, PA: Association for Computational Linguistics.

Och, Franz Josef and Hermann Ney (2002). ‘Discriminative Training and Maximum Entropy Models for Statistical Machine Translation’. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 295–302. Stroudsburg, PA: Association for Computational Linguistics.

Och, Franz Josef and Hermann Ney (2003). ‘A Systematic Comparison of Various Statistical Alignment Models’, Computational Linguistics, 29: 19–51.

Olive, Joseph, Caitlin Christianson, and John McCary (eds) (2011). Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. New York: Springer.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). ‘BLEU: A Method for Automatic Evaluation of Machine Translation’. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 311–318. Stroudsburg, PA: Association for Computational Linguistics.

Phillips, Aaron B. (2011). ‘Cunei: Open-Source Machine Translation with Relevance-Based Models of Each Translation Instance’, Machine Translation, 25(2): 161–177.

Pierce, John R., John B. Carroll, Eric P. Hamp, David G. Hays, Charles F. Hockett, Anthony G. Oettinger, and Alan Perlis (1966). ‘Language and Machines: Computers in Translation and Linguistics’. ALPAC report, National Academy of Sciences, National Research Council, Washington, DC.

Quirk, Chris, Arul Menezes, and Colin Cherry (2005). ‘Dependency Treelet Translation: Syntactically Informed Phrasal SMT’. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, MI, 271–279. Stroudsburg, PA: Association for Computational Linguistics.

Quirk, Chris B. (2004). ‘Training a Sentence-Level Machine Translation Confidence Measure’. In Proceedings of the 4th Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 825–828. Paris: ELRA.

Rios, Miguel, Wilker Aziz, and Lucia Specia (2011). ‘Tine: A Metric to Assess MT Adequacy’. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 116–122. Stroudsburg, PA: Association for Computational Linguistics.

Scarton, Carolina and Lucia Specia (2014). ‘Document-Level Translation Quality Estimation: Exploring Discourse and Pseudo-References’. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT-2014), Dubrovnik, Croatia, 101–108. European Association for Machine Translation.

Schank, Roger (1973). ‘Identification of Conceptualizations Underlying Natural Language’. In Computer Models of Thought and Language. San Francisco, CA: W. H. Freeman, 187–247.

Shah, Kashif, Loïc Barrault, and Holger Schwenk (2012). ‘A General Framework to Weight Heterogeneous Parallel Data for Model Adaptation in Statistical Machine Translation’. In Proceedings of the Conference of the Association for Machine Translation in the Americas, San Diego. Association for Machine Translation in the Americas.

Shannon, Claude E. (1949). A Mathematical Theory of Communication. Champaign, IL: University of Illinois Press.

Simard, Michel, Cyril Goutte, and Pierre Isabelle (2007). ‘Statistical Phrase-Based Post-Editing’. In Human Language Technologies: Proceedings of the 2007 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, NY, 508–515. Stroudsburg, PA: Association for Computational Linguistics.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul (2006). ‘A Study of Translation Edit Rate with Targeted Human Annotation’. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas, Cambridge, MA, 223–231. Association for Machine Translation in the Americas.

Snover, Matthew, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz (2009). ‘Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric’. In Proceedings of the 4th Workshop on Statistical Machine Translation, 259–268. Stroudsburg, PA: Association for Computational Linguistics.

Song, Xingyi, Trevor Cohn, and Lucia Specia (2013). ‘BLEU Deconstructed: Designing a Better MT Evaluation Metric’, International Journal of Computational Linguistics and Applications, 4(2): 29–44.

Soricut, Radu and Abdessamad Echihabi (2010). ‘Trustrank: Inducing Trust in Automatic Translations via Ranking’. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 612–621. Stroudsburg, PA: Association for Computational Linguistics.

Specia, Lucia (2011). ‘Exploiting Objective Annotations for Measuring Translation Post-Editing Effort’. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation (EAMT-2011), Leuven, Belgium, 73–80. European Association for Machine Translation.

Specia, Lucia, Dhwaj Raj, and Marco Turchi (2010). ‘Machine Translation Evaluation versus Quality Estimation’, Machine Translation, 24(1): 39–50.

Specia, Lucia, Baskaran Sankaran, and Maria Das Graças Volpe Nunes (2008). ‘n-Best Reranking for the Efficient Integration of Word Sense Disambiguation and Statistical Machine Translation’. In Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, Haifa, Israel, 399–410. Berlin and Heidelberg: Springer.

Stroppa, Nicolas, Antal van den Bosch, and Andy Way (2007). ‘Exploiting Source Similarity for SMT Using Context-Informed Features’. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation, 231–240. Skövde, Sweden: University of Skövde.

Tan, Ming, Wenli Zhou, Lei Zheng, and Shaojun Wang (2011). ‘A Large-Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation’. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, OR, 201–210. Stroudsburg, PA: Association for Computational Linguistics.

Tiedemann, Jörg (2010). ‘To Cache or Not to Cache? Experiments with Adaptive Models in Statistical Machine Translation’. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, Uppsala, Sweden, 189–194. Stroudsburg, PA: Association for Computational Linguistics.

Venkatapathy, Sriram and Srinivas Bangalore (2009). ‘Discriminative Machine Translation Using Global Lexical Selection’, ACM Transactions on Asian Language Information Processing, 8(2): article 8.

Vogel, Stephan, Hermann Ney, and Christoph Tillmann (1996). ‘HMM-Based Word Alignment in Statistical Translation’. In Proceedings of the 16th Conference on Computational Linguistics, Copenhagen, Denmark, 836–841. Stroudsburg, PA: Association for Computational Linguistics.

Watanabe, Taro, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki (2007a). ‘Online Large-Margin Training for Statistical Machine Translation’. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 764–773. Stroudsburg, PA: Association for Computational Linguistics.

Watanabe, Taro, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki (2007b). ‘Online Large-Margin Training for Statistical Machine Translation’. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 764–773. Stroudsburg, PA: Association for Computational Linguistics.

Way, Andy (2010a). ‘Machine Translation’. In A. Clark, C. Fox and S. Lappin (eds), The Handbook of Computational Linguistics and Natural Language Processing, 531–573. Chichester: Wiley Blackwell.

Way, Andy (2010b). ‘Panning for EBMT Gold, or “Remembering Not to Forget”’, Machine Translation, 24: 177–208.

Wilks, Yorick (1973a). ‘An Artificial Intelligence Approach to Machine Translation’. In R. Schank and K. Colby (eds), Computer Models of Thought and Language. San Francisco, CA: W. H. Freeman.

Wilks, Yorick (1973b). ‘The Stanford Machine Translation and Understanding Project’. In R. Rustin (ed.), Natural Language Processing, 243–290. New York: Algorithmics Press.

Wilks, Yorick (2009). Machine Translation: Its Scope and Limits. New York: Springer.

Wu, Dekai (1997). ‘Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora’, Computational Linguistics, 23(3): 377–403.

Wu, Dekai and Pascale Fung (2009). ‘Semantic Roles for SMT: A Hybrid Two-Pass Model’. In Human Language Technologies: Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Boulder, CO, 13–16. Stroudsburg, PA: Association for Computational Linguistics.

Xiong, Deyi, Yang Ding, Min Zhang, and Chew Lim Tan (2013). ‘Lexical Chain Based Cohesion Models for Document-Level Statistical Machine Translation’. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’13), Seattle, WA, 1563–1573. Stroudsburg, PA: Association for Computational Linguistics.

Yamada, Kenji and Kevin Knight (2001). ‘A Syntax-Based Statistical Translation Model’. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, Toulouse, France, 523–530. Stroudsburg, PA: Association for Computational Linguistics.

Zhang, Hui, Min Zhang, Haizhou Li, Aiti Aw, and Chew Lim Tan (2009). ‘Forest-Based Tree Sequence to String Translation Model’. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, 172–180. Stroudsburg, PA: Association for Computational Linguistics.

Zhang, Min, Hongfei Jiang, Ai Ti Aw, Jun Sun, Sheng Li, and Chew Lim Tan (2007). ‘A Tree-to-Tree Alignment-Based Model for Statistical Machine Translation’. In Proceedings of the Machine Translation Summit XI, Copenhagen, Denmark, 535–542.

Zhang, Min, Hongfei Jiang, Aiti Aw, Haizhou Li, Chew Lim Tan, and Sheng Li (2008). ‘A Tree Sequence Alignment-Based Tree-to-Tree Translation Model’. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics and the Human Language Technology Conference, Columbus, OH, 559–567. Stroudsburg, PA: Association for Computational Linguistics.

Zhou, Bowen, Xiaodan Zhu, Bing Xiang, and Yuqing Gao (2008). ‘Prior Derivation Models for Formally Syntax-Based Translation Using Linguistically Syntactic Parsing and Tree Kernels’. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation, Columbus, OH, 19–27. Stroudsburg, PA: Association for Computational Linguistics.

Zollmann, Andreas and Ashish Venugopal (2006). ‘Syntax Augmented Machine Translation via Chart Parsing’. In Proceedings of the Workshop on Statistical Machine Translation, New York, 138–141. Stroudsburg, PA: Association for Computational Linguistics.

Zollmann, Andreas, Ashish Venugopal, and Stephan Vogel (2008). ‘The CMU Syntax-Augmented Machine Translation System: SAMT on Hadoop with n-Best Alignments’. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Honolulu, HI, 18–25. Kyoto, Japan: National Institute of Information and Communications Technology.

## Notes:

(38) <http://www.statmt.org/survey/>. As of November 2014, the survey reported listing 3173 publications.