Recent Applications of Machine Translation
Abstract and Keywords
Many large companies have incorporated methods of controlling the input language, minimizing problems of ambiguity and so improving the quality of machine translation (MT). In large-scale enterprise systems, MT is used to produce drafts, which are then edited by bilingual personnel. A significant development has been the introduction of specialized systems, designed for Internet service providers and for large corporations to supply and edit translations of their own webpages localized to their domain, and for cross-language communication with customers. MT also finds application in healthcare communication, the military field, and translation for foreign tourists. The future of MT may lie in developing hybrid systems combining the best of the statistical and rule-based approaches. A specific target of MT for immigrants and minorities has been the translation of subtitles for television programmes. Beyond minorities and immigrants, other disadvantaged members of society are now beginning to be helped by MT-related systems.
Until the middle of the 1990s there were just two basic types of machine translation (MT) system. The first and oldest was the traditional large-scale system mounted on mainframe computers in large companies. Its purpose was to produce publishable translations, so its results were revised (‘post-edited’) by human translators or editors familiar with both source and target languages. There was opposition from translators (particularly those with the task of post-editing) to the use of this system, but the advantages of fast and consistent output have made large-scale MT cost-effective. In order to improve the quality of the raw MT output, many large companies included methods of ‘controlling’ the input language (by restricting vocabulary and syntactic structures): by such means, the problems of ambiguity and of alternative structural interpretations could be minimized and the quality (p. 442) of the output could be improved. Companies such as Xerox used MT systems with a ‘controlled language’ from the early 1990s: many companies followed their example, and the Smart Corporation specializes to this day in setting up controlled language MT systems for large companies in North America.1 In a few cases, it was possible to develop systems specifically for the particular ‘sublanguage’ of the texts to be translated (as in the Météo system for weather forecasts: Grimaila and Chandioux 1992). Indeed, nearly all systems operating in large organizations are in some way ‘adapted’ to the subject areas they operate in: earth-moving machines (Caterpillar: Nyberg, Mitamura, and Huijsen 2003), job applications (JobBank in Canada: McIntosh 2009), health reports (Global Health Intelligence Network: Blench 2007), patents (Japan Patent Information Office: Bani 2009), health and social welfare (Pan American Health Organization: Vasconcellos and Leon 1985), police data (ProLingua),2 and many more.
These large-scale applications of MT continue to expand and develop, and they are certain to do so for the foreseeable future.3
Included in such expansion will undoubtedly be the application of MT to the localization of products. Localization became a specialist application of MT and translation memories in the early 1990s.4 Initially stimulated by the need of software producers to market versions of their systems in other languages simultaneously with, or very closely following, the launch of the original-language version (usually English), localization has become a necessity in global markets. Given the time pressures, and the many languages to be translated into, MT seemed the obvious solution. In addition, the documentation (e.g. software manuals) was both internally repetitive and changed little from one product to another and from one edition to the next, making it possible to use translation memories and to develop controlled terminologies for MT systems. The process involves more than just the translation of texts. Localization means the adaptation of products to particular circumstances, e.g. dates (day-month-year vs. month-day-year), times (12-hour vs. 24-hour), address conventions and abbreviations, reformatting (re-paragraphing), and even restructuring complete texts to suit the expectations of recipients. (See Chapters 18 and 27.)
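The kinds of locale adaptation mentioned above (day-month-year vs. month-day-year dates, 12- vs. 24-hour clocks) are mechanical enough to sketch in a few lines. The following Python fragment is purely illustrative: the locale table is invented for this example, and a real localization pipeline would draw such conventions from CLDR data or a library such as Babel rather than hard-coding them.

```python
from datetime import datetime

# Illustrative (hand-invented) locale conventions; a real system would
# obtain these from CLDR locale data rather than a hard-coded table.
LOCALE_FORMATS = {
    "en_US": {"date": "%m/%d/%Y", "time": "%I:%M %p"},  # month-day-year, 12-hour
    "de_DE": {"date": "%d.%m.%Y", "time": "%H:%M"},     # day-month-year, 24-hour
}

def localize_timestamp(dt: datetime, locale: str) -> str:
    """Render a timestamp according to the target locale's conventions."""
    fmt = LOCALE_FORMATS[locale]
    return f"{dt.strftime(fmt['date'])} {dt.strftime(fmt['time'])}"

stamp = datetime(2009, 3, 5, 14, 30)
print(localize_timestamp(stamp, "en_US"))  # 03/05/2009 02:30 PM
print(localize_timestamp(stamp, "de_DE"))  # 05.03.2009 14:30
```

The same table-driven pattern extends to address conventions and abbreviations, which is why localization tooling treats these conversions separately from the translation of running text.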
The second utilization of MT before the mid 1990s was software on personal computer (PC) systems.5 The first such systems appeared in the early 1980s soon after the appearance of PCs. They were followed by offerings from many companies—including most of the Japanese manufacturers of PCs—covering an increasingly wide range of language pairs on an increasingly wide range of operating systems. While desktop PCs continue to be manufactured and used, (p. 443) this method of delivering MT will continue. What has always been uncertain is how purchasers have been using these systems. In the case of large-scale (mainframe) ‘enterprise’ systems, it is clear that MT is used to produce drafts which are then edited by bilingual personnel. This may also be the case for PC-based systems, i.e. they may have been and may still be used to create draft translations which users edit to a higher quality. On the other hand, it seems more likely that some users want just to get some idea of the contents (the basic ‘message’) of foreign-language texts and are not concerned about the quality of translations. This usage is generally referred to as ‘assimilation’ (in contrast to the aim of translating texts into a ‘foreign’ language, referred to as ‘dissemination’). We know (anecdotally) that some users of PC-based MT systems have trusted them too much and have sent out ‘raw’ (unedited) MT translations as if they were as good as human translations. However, it is an unfortunate fact that we do not know in any detail how PC systems have been and are being used. We know that sales of systems continue to be high enough for manufacturers to remain in business over many years, but it is suspected by many observers that purchasers use systems rarely after the initial enthusiasm, once they learn how poor the quality of MT output can be.
Mainframe, client-server, and PC systems are overwhelmingly ‘general purpose’ systems, i.e. they are built to deal with texts in any subject domain. Of course, ‘enterprise’ systems (particularly controlled-language systems) are over time focused on particular subject areas, and adaptation to new areas is offered by most large MT systems (such as Systran). A few PC-based systems are available for texts in specific subject areas. Examples are the English/Japanese Transer systems for medical texts and patents.6 On the whole, however, PC systems deal with specific subjects by the availability of subject glossaries, which can be ranked in preference by users. For some systems the range of dictionaries is very wide, embracing most engineering topics, computer science, business and marketing, law, sports, cookery, music, etc.
29.2 Special Devices, On-line MT
From the middle of the 1990s onwards, these two basic types of MT system have been joined by a range of other types. First should be mentioned the obvious development from PC systems: the numerous systems for hand-held devices. There is a bewildering variety of ‘pocket translators’ in the marketplace. Many, such as the Ectaco range of special devices,7 are in effect computerized versions of the familiar phrasebook or pocket dictionary, and they are clearly marketed primarily (p. 444) to the tourist and business traveller. The dictionary sizes are often quite small, and where they include phrases, they are obviously limited. However, they are sold in large numbers and for a very wide range of language pairs. As with PC-based systems, there is no indication of how successful in actual use they may be: it cannot be much different from the ‘success’ of traditional printed phrasebooks. (Users may be able to ask their way to the bus station, for example, but they may not be able to understand the answer.) Since the early 2000s, many of these hand-held devices have included voice output of phrases, an obvious attraction for users unfamiliar with the pronunciation of the phrases which may be output.
While many of these automated phrasebooks and dictionaries are purchased on special-purpose devices, an increasing number of manufacturers offer such software for mobile telephones. This software is seen as an obvious extension of their text facilities: text messages can be translated and sent in other languages. The range of languages is so far not very wide, limited on the whole to the commercially dominant languages: English, French, German, and Spanish. It can be predicted that software for mobile telephones will eventually supersede software for special-purpose devices, particularly as more of them provide direct access to on-line MT services.
This has been the second major change since the middle of the 1990s: the availability of free MT services on the Internet (Gaspari and Hutchins 2007). On-line MT services appeared in the late 1980s, but they were not free. In 1988 Systran in France offered a subscription to its translation software using the French postal service's Minitel network. At about the same time, Fujitsu made its Atlas English-Japanese and Japanese-English systems available through the on-line service Nifty-serve. Then in 1992 CompuServe launched its MT service (based on the Intergraph DP/Translator), initially restricted to selected forums but proving highly popular, and in 1994 Globalink offered an on-line subscription service: texts were submitted on-line and translations returned by e-mail. A similar service was provided by Systran Express. However, it was the launch of AltaVista's Babelfish service in 1997 (based on the various Systran MT systems) that attracted the greatest publicity. Not only was it free, but results were (virtually) immediate. Within the next few years, the Babelfish service was joined by FreeTranslation (using the Intergraph system), Gist-in-Time, ProMT, PARS, and many others; in most cases, these were on-line versions of already existing PC-based (or mainframe) systems. The great attraction of these services was (and is) that they are free to users (even if not to providers). It is evidently the expectation of the developers that free on-line use will lead to sales of translation software—although evidence for this has not been shown—or that it will encourage the use of the fee-based ‘value-added’ post-editing services offered to users by some providers (e.g. FreeTranslation). While on-line MT has undoubtedly raised the profile of MT for the general public, there have of course been drawbacks.
To most users ‘discovering’ on-line MT services, the idea of automatic translation has usually been something completely new, despite the availability of (p. 445) translation software. Attracted by the possibilities, many users have ‘tested’ the services by submitting for translation sentences containing idiomatic phrases, ambiguous words, and complex structures, and even proverbs and deliberately opaque sayings. A favourite method of ‘evaluation’ was, and continues to be, ‘back-and-forth’ or ‘round-trip’ translation, i.e. translation of a text into another language and then back into the original—a method which might appear valid to the uninitiated but which is not at all satisfactory (see Chapter 28, and Somers 2007a). Not surprisingly, they often found that the results were unintelligible, that MT was prone to ‘faulty’ and ‘inaccurate’ results, and that it suffered from many limitations, findings all well known to company users and to purchasers of translation software, not to mention researchers and developers. Numerous commentators have enjoyed finding fault with on-line MT and, by implication, with MT itself. Users have undoubtedly been gravely disappointed by the poor quality of much MT output, where they are capable of judging it. There is no doubt that the less knowledge users have of the language of the original texts, the more value they attach to the MT output; and some users must have found that on-line MT enabled them to read texts which they would previously have had to pass over.
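Why round-trip translation is an unsatisfactory test can be shown with a deliberately simplified sketch. In the hypothetical word-for-word ‘translators’ below (the tiny lexicons are invented purely for illustration), the forward translation picks the wrong sense of an ambiguous word, yet the round trip reproduces the source exactly, so the error is invisible to a monolingual evaluator.

```python
# Toy word-for-word "translators" with a deliberately wrong sense choice;
# all vocabulary here is a hypothetical illustration, not a real MT system.
EN_FR = {"the": "le", "bank": "banque", "is": "est", "steep": "raide"}
FR_EN = {"le": "the", "banque": "bank", "est": "is", "raide": "steep"}

def translate(sentence, lexicon):
    """Naive word-by-word substitution; unknown words pass through."""
    return " ".join(lexicon.get(w, w) for w in sentence.split())

source = "the bank is steep"          # 'bank' here means riverbank
forward = translate(source, EN_FR)    # picks 'banque' (financial bank): wrong sense
round_trip = translate(forward, FR_EN)

print(forward)     # le banque est raide  <- mistranslated (and ungrammatical)
print(round_trip)  # the bank is steep    <- identical to the source
```

Because the back-translation dictionary simply inverts the forward one, the round trip ‘succeeds’ however badly the intermediate translation fails, which is exactly why the method flatters and misleads in equal measure.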
However, we know very little (indeed almost nothing) about who uses on-line MT and what for. We do not know their ages, backgrounds, or knowledge of languages; we do not know how many translate only into their native language, how many use on-line MT to translate into an unknown foreign language, how many are translators using MT output as rough drafts, how many use the subject glossaries available, and so forth. Almost all that we do know are the surprising facts that translation of webpages is very much a minor use (no more than about 15 per cent at best), that the average length of texts submitted is just twenty words, and that more than 50 per cent of submissions are one- or two-word phrases (Gaspari and Hutchins 2007). It had been anticipated by the providers of these services that longer texts would be submitted—the usual limit of 150 words is clearly no impediment—and that much of the translation would be of webpages. The surprisingly low submission of texts longer than a few words seems to suggest that on-line MT is being used primarily for dictionary consultation, and perhaps therefore by people with some familiarity with foreign languages, despite the availability of many free on-line dictionaries and the inherent unsuitability of MT for this task, since it generally offers just one translation for any given input word, whereas a dictionary will offer a range of alternatives. Whatever the ways people are using them, overall usage of on-line MT continues to increase exponentially (e.g. FreeTranslation from 50,000 in 1999 to 3.4 million in 2006; the totals for Babelfish are much higher).
The translation of webpages—a facility provided by PC-based systems before the on-line MT services became available—has complications in addition to the obvious problems of satisfactorily and intelligibly rendering the often colloquial and culture-dependent nature of the texts. Many webpages include text in graphic format, which no MT system can deal with, and therefore often much of the (p. 446) webpage will be untranslated. This may account for the low usage of on-line MT systems for webpage translation. However, it is all the more surprising that so many website developers and owners recommend users to use on-line MT services for translation of their webpages (Gaspari and Somers 2007). It is clear that they do not appreciate the potentially poor results of MT, nor are they aware of consequent negative impacts on their company or products.
A recent development is systems designed for website localization. As mentioned above, localization became a specialist application of MT and translation memories in the early 1990s, and it has become a major application of MT. The extension into website localization was an obvious move, which did not come, however, until after 2000. The most significant development has been the introduction of specialized systems, notably IBM Websphere,8 which are designed for Internet service providers and for large corporations to supply and edit translations of their own webpages localized to their specific domain, as well as for cross-language communication with customers and for providing ‘gist’ translations internally.
The limitations of MT when dealing with colloquial and elliptical ‘normal’ language—as opposed to the formal written texts of books and magazines—are highlighted by its problems with e-mail. Just as most translation software has provided facilities for translating webpages, many systems seek to embrace e-mail text as well, though with what success or user satisfaction is unknown. Few researchers have focused specifically on this type of text—those that have are mainly in Japan and Korea—and even fewer have marketed such systems. An exception is Translution,9 which offers on-line translation of e-mails for companies. Subscriptions vary according to the level of service and according to whether the software is web-based or located on a client-server system.
Even more challenging, perhaps, is the language of chatrooms and social networking sites. Some tentative attempts have been made to deal with chatroom conversation: Condon and Miller (2002) illustrate the similarities of such texts to spoken language and the translation problems the two share. But the huge possibilities of devising MT for social networking in general appear not to have been tackled yet—perhaps because all users are expected to manage in (some variant of) English.
29.3 Speech Translation
As mentioned earlier, an increasing number of phrasebook systems offer voice output. This facility is also increasingly available for translation software—it seems (p. 447) that Globalink in 1995 was the earliest—and it is likely that it will be an additional feature for on-line MT some time in the future. But automatic speech synthesis of text-to-text translation is not at all the same as genuine speech-to-speech translation, the focus of research efforts in Japan (ATR), USA (Carnegie-Mellon University), Germany (Verbmobil project), and Italy (ITC-irst, NESPOLE) since the late 1980s,10 and many more recent projects besides. The research in speech translation is beset with numerous problems, not just variability of voice input but also the nature of spoken language. By contrast with written language, spoken language is colloquial, elliptical, context-dependent, interpersonal, and primarily in the form of dialogues. MT has focused mainly on grammatically well-formed technical and scientific language and has tended to neglect informal modes of communication. Speech translation therefore represents a radical departure from traditional MT. Some of the difficulties of speech translation may be overcome by adding visual clues to reduce ambiguities, i.e. as multimodal systems to aid dialogue communication (e.g. Burger, Costantini, and Pianesi 2003). Complexities of speech translation are, however, generally reduced by restricting communication to relatively narrow domains: a favourite for many researchers has been cooperative dialogues as in business communication, booking of hotel rooms, negotiating dates of meetings, etc. From these long-term projects no commercial systems have appeared yet. There are, however, other areas of speech translation which do have working (but not yet commercial) systems. 
These are communications between patient and doctor and other healthcare specialists, communication by soldiers with civilians in military (field) operations, and communication in the tourism domain.
The potentialities of healthcare communication applications are obvious, particularly for communication involving immigrant and other ‘minority’ language speakers. However, there are different views of the most effective and most appropriate methods. In some cases this may be one-way communication, e.g. from a doctor or medical professional (nurse, paramedic, pharmacist, etc.) asking the patient a question, which might be answered nonverbally or by a simple ‘yes’ or ‘no’. In other cases, communication may be two-way or interactive, e.g. patient and doctor consulting a screen displaying possible health conditions, or communication may be via a phrasebook-type system with voice input to locate phrases and spoken output of the translated phrase (Rayner and Bouillon 2002), and/or with interactive multimodal assistance (Seligman and Dillinger 2006). Nearly all systems are currently somewhat inflexible and limited to specific narrow domains. Speech translation itself may be only one factor in successful healthcare-related consultation, since cultural and environmental issues are also involved; and whether medical personnel should be the initiators and ‘in control’ is another issue: in (p. 448) some circumstances the patients are likely to be regular users and could be more familiar with a language-specific device than the medical professional, and might also use it in other than health-related situations.11
However, before such issues of usability and appropriateness can be resolved, the robustness of speech translation even in highly constrained domains has to be satisfactory: the weakest point is still automatic speech recognition, even though domain-specific translation itself is also still inadequate.
In the military field, the MT team at Carnegie-Mellon University developed a speech translation system (DIPLOMAT)12 which could be quickly adapted to new languages, i.e. languages spoken in areas where the US Army is deployed (Serbo-Croat, Haitian Creole, Korean, Arabic). The system was based on an example-based MT approach: spoken language was matched against phrases (examples) in the database and the translations output by a speech synthesis module. An evaluation in the field concluded that the speech components were satisfactory but the MT component was not adequate: translation was far too slow in practice, and a feedback (‘back translation’) module enabling users to check the appropriateness of the translation introduced additional errors. Further development was not pursued. In the same domain, however, it seems that another system on a hand-held PDA device has been more successful. This device (Phraselator, from VoxTec) contains a database of phrases in the foreign language from which the English-speaking user can select via a screen of English phrases.13 Output is not synthesized speech but speech prerecorded by native speakers. The device has been used by the US Army in various operations in Croatia, Iraq, and Indonesia, including civilian emergency situations (e.g. the tsunami relief effort in 2005), by the US Navy, by law enforcement officers, etc. A wide range of languages is now covered, and the device and its software are now more widely available commercially. Adaptation to medical domains is being planned.
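The phrase-matching approach behind devices of this kind can be sketched very simply: incoming text (here typed rather than spoken, leaving speech recognition aside) is fuzzily matched against a stored phrase list, and the pre-stored translation of the best match is returned. The phrase database below is invented for illustration and has no connection with the actual Phraselator content.

```python
import difflib

# Hypothetical English phrases paired with pre-stored translations
# (here loosely Croatian-flavoured, invented for the example).
PHRASES = {
    "where does it hurt": "gdje vas boli",
    "do you need water": "trebate li vode",
    "please remain calm": "molim vas ostanite mirni",
}

def lookup(utterance: str, cutoff: float = 0.6):
    """Return the stored translation of the closest matching phrase, if any."""
    matches = difflib.get_close_matches(utterance.lower(), list(PHRASES),
                                        n=1, cutoff=cutoff)
    return PHRASES[matches[0]] if matches else None

print(lookup("where does it hurt?"))  # close enough to a stored phrase
print(lookup("good morning"))         # no sufficiently close phrase -> None
```

The strength and the weakness of the approach are the same thing: anything close to a stored phrase translates reliably, and anything else translates not at all, which matches the field experience reported above.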
One of the most obvious applications of speech translation is by tourists in foreign countries. Many of the organizations mentioned earlier are involved in developing systems—most utilizing the Basic Travel Expression Corpus of Japanese-English developed by ATR—and often extending investigation to Chinese-English, Arabic-English and Italian-English. A welcome feature of this activity is the collaborative efforts and the exchange of resources by research groups.14 In many cases, translation is restricted to standard phrases extracted from corpora of dialogues and interactions in tourist situations. However, in recent years, researchers have moved to systems capable of dealing with spontaneous speech, i.e. something more like real-life applications. Despite the amount of research in an apparently highly restricted domain, it is clear that commercially viable products lie some way in the future. In (p. 449) the meantime, for some years yet, the market will see only the voice-output phrase-book devices and systems mentioned above.
29.4 Rapid Development, Open Source, Hybrid Systems
As mentioned already, the rapid development of systems is becoming recognized as important for MT applications. One of the advantages of statistical MT (SMT)— the focus of most MT research in the early decades of the present century (see further Chapter 28)—is claimed to be the rapid production of systems in new language pairs.15 Researchers do not need to know the languages involved as long as they have confidence in the reliability of the corpora which they work with. This is in contrast to the slower development of rule-based (RBMT) systems which require careful lexical and grammatical analyses by researchers familiar with both source and target languages. Nearly all commercially available MT systems (whether for mainframe, client-server, or PC) are rule-based systems, the result of many years of development (cf. Hutchins 1986). SMT systems have only recently appeared on the marketplace. The Language Weaver company,16 an offshoot of the research group at the University of Southern California, began marketing SMT systems in 2002. It began with Arabic-English and has now added many other language pairs. Many users of these systems are US government agencies involved in information gathering and analysis operations (see below). Perhaps more significantly, the online translation service offered by Google is based on SMT systems.17
Increasingly, resources for MT, both rule-based and statistical, are widely available as ‘open source’ materials. The Apertium system from Spain has been the basis for freely available rule-based MT systems for Spanish, Portuguese, Galician, Catalan, etc.18 There are other open-source translation systems (less widely used), such as GPLTrans for Dutch, French, German, Indonesian, Italian, Spanish, etc.19 but it is to be expected that many more will be available in the coming years. Most of the resources needed to build an SMT system are freely available, for example the Moses system,20 developed by a consortium of many of the leading SMT researchers.
(p. 450) Many researchers believe that the future for MT lies in the development of hybrid systems combining the best of the statistical and rule-based approaches. In the meantime, however, until a viable framework for hybrid MT appears, experiments are being made with ‘multi-engine’ systems and with adopting statistical techniques with rule-based (and example-based) systems. The multi-engine approach involves the translation of a given text by two or more different MT architectures (e.g. SMT and RBMT) and the integration of outputs for the selection of the best output, for which statistical techniques can be used. The idea is attractive and quality improvements have been achieved, but it is difficult to see this approach as a feasible economic method for large-scale or commercial MT. An example of appending statistical techniques to RBMT is the experiment (by a number of researchers in Spain, Japan, and Canada) of ‘statistical post-editing’ (Diaz de Illaraza, Labaka, and Sarasola 2008). In essence, the method involves the submission of the output of an RBMT system to a ‘language model’ of the kind found in SMT systems. One advantage of the approach is that the deficiencies of RBMT for less-resourced languages may be overcome.
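The selection step in a multi-engine setup can be illustrated with a toy target-language model of the kind used in SMT. Everything below (the corpus, the candidate outputs, the add-one smoothing) is a deliberately minimal sketch: real systems use far larger n-gram or neural language models, but the principle of ranking alternative engine outputs by target-language probability is the same.

```python
import math
from collections import Counter

# Tiny monolingual "corpus" standing in for the target-language data
# behind an SMT-style language model (all sentences invented).
corpus = [
    "the patient has a high fever",
    "the patient needs a doctor",
    "a doctor examined the patient",
]

unigrams = Counter(w for s in corpus for w in s.split())
bigrams = Counter(
    pair for s in corpus for pair in zip(s.split(), s.split()[1:])
)

def lm_score(sentence: str) -> float:
    """Add-one-smoothed bigram log-probability; higher means more fluent."""
    words = sentence.split()
    vocab = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(words, words[1:])
    )

# Hypothetical outputs from two different engines for one source sentence:
candidates = ["the patient has a doctor", "patient the doctor a has"]
best = max(candidates, key=lm_score)
print(best)  # the candidate whose word order matches the corpus wins
```

The same scoring idea underlies statistical post-editing: the RBMT output is rescored (and rephrased) against a target-language model, pulling it towards word sequences actually attested in the corpus.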
29.5 Language Coverage, Especially ‘Minority’ Languages
The language pairs most often in demand and available commercially are those from and to English. At the time of writing, the most frequent pairs (for on-line MT services and apparently for PC-based systems) are English-Spanish and English-Japanese. These are followed by (in no particular order) English coupled with French, German, Italian, Chinese and Korean, and French-German. Other European languages such as Czech, Polish, Bulgarian, Romanian, Latvian, Lithuanian, Estonian, and Finnish are more rarely found on the market. Until the middle of the 1990s, Arabic-English and Arabic-French were also rare, but this situation has changed for obvious political reasons. Other Asian languages have also been relatively neglected: Malay, Indonesian, Thai, Vietnamese, and even the major languages of India—Hindi, Urdu, Bengali, Punjabi, Tamil, etc.—though this situation is slowly changing. African languages have been mostly ignored, apart from relatively recent work in South Africa on languages with official status in that country. In terms of numbers of speakers, these are not ‘minor’ languages: many are among the world's most spoken languages. The reason is a combination of low commercial viability and lack of language resources (whether for rule-based lexicons and grammars or for statistical MT corpora).
(p. 451) The categorization of a language as a ‘minority language’ is determined geographically. In the UK, world languages such as Hindi, Punjabi, and Bengali are minority languages, because the major language is English. In the context of the European Union, languages such as Welsh, Irish, Estonian, and Lithuanian are minor, whether official languages of a country or not. From a global point of view, ‘minor’ languages are those which are not commercially or economically significant. The language coverage of MT systems reflects this global perspective, and so the problems and needs of speakers of ‘lesser’ languages were long ignored, although recently they have received more attention: in Spain with MT systems for Catalan, Basque, and Galician; in Eastern Europe with systems for Czech, Estonian, Latvian, and Bulgarian; and in South and Southeast Asia with MT activity on Bengali, Hindi, Tamil, Thai, Vietnamese, etc. This growing interest is reflected in the holding of regular workshops on minority-language MT.21 The problems for minority and immigrant languages are many and varied: there is often no word-processing software (indeed, some languages lack scripts), no spellcheckers (some languages lack standard spelling conventions), no dictionaries (monolingual or bilingual), and indeed a general lack of language resources (e.g. corpora of translations) and of qualified and experienced researchers.22 Before MT can be contemplated, these resources must be created, and the Internet may help to some extent with glossaries and bilingual corpora. There is, in addition, the question whether the communication needs of immigrants and minorities are best met with MT or with lower-level technologies, as indicated above with reference to speech translation.
One specific target of MT for immigrants or minorities has been the translation of captions (or subtitles) for television programmes. The most ambitious experiment is at the Institute for Language and Speech Processing (Athens) involving speech recognition, English text analysis, and caption generation in English, Greek, and French (Piperidis et al. 2004). Usually, however, captions in foreign languages are generated from caption texts produced as a normal service for the deaf or hearing impaired by television companies. A group at Simon Fraser University in Canada has investigated the translation of English television captions into Spanish and Portuguese (Turcato et al. 2000), and a group at the Electronics and Telecommunications Research Institute in Korea is developing CaptionEye/EK, an MT system for translating English television captions into Korean (Seo et al. 2001). In both cases, translation is based on pattern matching of short phrases (in systems of the example-based MT type).
Apart from minorities and immigrants, there are other disadvantaged members of society now beginning to be helped by MT-related systems. In recent years, researchers have looked at translating into sign languages for the deaf. The (p. 452) problems go, of course, beyond those encountered with text translation. The most obvious one is that signs are made by complex combinations of face, hand, and body movements which have to be notated for translation and reproduced by computer. In most cases, conventional rule-based approaches are adopted, but Morrissey et al. (2007) have experimented with hybrid statistical and example-based methods. Experiments have reported work on translating from English text into American, British, or Irish Sign Languages (Huenerfauth 2005, Marshall and Sáfár 2003, Morrissey et al. 2007), while Stein et al. (2007) also refer to work on systems translating from German, Chinese, and Spanish to their respective sign languages. The same report discusses translation in the opposite direction—a task involving the processing of moving images, arguably even more difficult than speech processing. We may expect more in the future.
29.6 Information Retrieval and Extraction
Translation is rarely an isolated activity; it is usually a means for accessing, acquiring, and imparting information. This is clearly the case with many examples already mentioned: translation in healthcare-related communication, translation of patents and technical documentation, translation of television subtitles, etc. MT systems are therefore often integrated (combined or linked) with various other language-processing activities: information retrieval (IR), information extraction and analysis, question answering, summarization, and technical authoring.
Multilingual access to information in documentary sources (articles, conferences, monographs, etc.) was a major interest in the early years of MT, but as IR became more statistics-oriented and MT became more rule-based the reciprocal relations diminished. However, since the mid-1990s, with the increasing interest in SMT, the relations have revived, and cross-language information retrieval (CLIR) is now a vigorous area of research with strong links to MT: both fields share the task of retrieving words and phrases in foreign languages which match (exactly or ‘fuzzily’) with words and phrases of input ‘texts’ (queries in IR, source texts in MT), and both combine linguistic resources (dictionaries, thesauri) and statistical techniques. There are extensions of CLIR to images and to spoken ‘documents’, e.g. the experiments by Flank (2000) and by Etzioni et al. (2007) on multilingual image retrieval, and by Meng et al. (2001) for retrieving Chinese broadcast stories which are ‘similar’ to a given input English text (not just a query).23
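The ‘fuzzy’ matching of query terms against a bilingual lexicon can be sketched as follows. The lexicon entries here are invented for illustration; production CLIR systems use far larger resources and combine dictionary lookup with statistical translation models rather than simple string similarity.

```python
# A toy illustration of fuzzy term matching in CLIR, assuming a
# small French-English lexicon (entries are invented for
# illustration only).
from difflib import get_close_matches

LEXICON = {
    "informatique": "computing",
    "traduction": "translation",
    "recherche": "research",
}

def match_query_terms(query_terms, cutoff=0.8):
    """Map each query term to a lexicon entry, tolerating minor
    spelling variation (e.g. inflected forms or OCR errors)."""
    matches = {}
    for term in query_terms:
        close = get_close_matches(term.lower(), LEXICON, n=1, cutoff=cutoff)
        if close:
            matches[term] = LEXICON[close[0]]
    return matches
```

Inexact matching of this kind lets an inflected query term such as ‘traductions’ retrieve documents indexed under the dictionary form ‘traduction’.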
(p. 453) Information extraction has similar close links to MT, strengthened likewise by the growing statistical orientation of MT. Many government-funded (international and national) organizations have to scrutinize foreign-language documents for information relevant to their activities (from commercial and economic to surveillance, intelligence, and espionage). The scanning (skimming) of documents received—previously an onerous human task—is now routinely performed automatically. The cues for relevant information include not just keywords such as export, strategic, attack, (and their foreign language equivalents), but also the names of persons, companies, and organizations. Where languages use different orthography, the systems need to incorporate transliteration facilities which can convert, say, a Japanese version of a politician's name into its (perhaps original) English form. The identification of names (or ‘named entities’) and their transliteration has become an increasingly active field in the last few years.24
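The cue-based scanning described above can be reduced to a minimal sketch. The cue words and scoring are illustrative assumptions only; operational systems add named-entity recognition and transliteration, which are omitted here.

```python
# A simplified sketch of cue-based relevance scanning, assuming a
# fixed cue-word list per language (lists are illustrative only).
import re

CUE_WORDS = {
    "en": {"export", "strategic", "attack"},
    "fr": {"exportation", "stratégique", "attaque"},
}

def relevance_score(text, lang="en"):
    """Count cue-word hits in a document; documents scoring above
    some threshold would be routed to an analyst, or to MT for a
    full draft translation."""
    tokens = re.findall(r"\w+", text.lower())
    cues = CUE_WORDS.get(lang, set())
    return sum(1 for t in tokens if t in cues)
```

A scanner of this kind is cheap enough to run over every incoming document, reserving expensive human attention (or full MT) for the small fraction that scores highly.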
Information analysis and summarization is frequently the second stage after information extraction. These activities have also, until recently, been performed by human analysts. Now at least drafts can be obtained by statistical means: methods for summarization have been researched since the 1960s. The development of working systems that combine MT and summarization is apparently still something for the future (Siddharthan and McKeown 2005; Saggion 2006).25 The major problems are the unreliability of MT (incorrect translations, distorted syntax, etc.) and the imperfections of current summarization systems, which are based on the detection of sentences important as indicators of content (paragraph-initial sentences, sentences containing lexical clues, particular names, etc.). Combining MT and summarization would be a desirable development in many areas, not just for information gathering by government bodies but also for managers of large corporations and most researchers with no knowledge of the original language. Such potential users of MT rarely want to read the whole of a document; what they want is to extract information for a specific need.
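The sentence-detection approach to summarization mentioned above can be sketched in a few lines. The position bonus, clue-word list, and scoring weights are illustrative assumptions, not drawn from any of the systems cited.

```python
# A minimal extractive summarizer of the kind described above,
# scoring sentences by position (paragraph-initial) and lexical
# clues; the clue list and weights are illustrative only.

CLUE_WORDS = {"significant", "conclude", "results", "important"}

def summarize(paragraphs, max_sentences=2):
    """Score each sentence (+2 if paragraph-initial, +1 per clue
    word) and return the top scorers in document order."""
    scored = []
    for para in paragraphs:
        sentences = [s.strip() for s in para.split(".") if s.strip()]
        for i, sent in enumerate(sentences):
            score = 2 if i == 0 else 0
            score += sum(1 for w in sent.lower().split() if w in CLUE_WORDS)
            scored.append((score, len(scored), sent))
    top = sorted(scored, key=lambda x: -x[0])[:max_sentences]
    return [sent for _, _, sent in sorted(top, key=lambda x: x[1])]
```

The weaknesses noted in the text are visible even in this sketch: the selected sentences are merely extracted, not rewritten, so an incorrect MT rendering of a key sentence passes straight into the ‘summary’.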
The field of question answering involves retrieving answers in text form from databases in response to (ideally) natural-language questions. Like summarization, this is a difficult task; but the possibility of multilingual question-answering is attracting more attention in recent years.26
Finally, the impetus in large corporations to produce documentation in multiple languages on as short a timescale as possible has led to the closer integration of the processes of authoring (technical writing) and translating. This is true not only where companies have decided to adopt ‘controlled languages’ for their (p. 454) documentation—as we have seen above—but also where writers make use of rough translations as aids. Surveys of the use of Systran at the European Union have shown that much of its use is by administrators and other officials when writing documents in languages they are not fully fluent in: a draft translation from a text in their own language is used as the basis for writing in another (Senez 1995). Perhaps this is what some users of on-line MT and of PC-based systems are doing; if translation systems are used as aids to writing in another, relatively poorly known language, this may explain to some extent the frequency (mentioned above) with which on-line MT systems are used to translate individual words and short phrases.
What these examples of MT applications illustrate is that MT is being used not for ‘pure’ translation but to aid bilingual communication in an ever-widening range of situations; and it is becoming just one component of multilingual, multimodal document (text) and image (video) extraction and analysis systems. The future scope of MT and its applications seems to be without limit.
Further Reading and Relevant Resources
The main sources of information on the uses of machine translation are the proceedings of conferences held by the Association for Machine Translation in the Americas (http://www.amtaweb.org), the European Association for Machine Translation (http://www.eamt.org), and the Asia-Pacific Association for Machine Translation (http://www.aamt.info), the biennial ‘Machine Translation Summit’ conferences, and the annual series of ‘Translating and the Computer’ conferences organized by Aslib (http://www.aslib.com). Organizations specifically concerned with MT usage and holding regular conferences include the Localization Industry Standards Association (http://www.lisa.org) and the Translation Automation User Society (http://translationautomation.com). Important journals include Machine Translation, published by Springer (mainly research-oriented, however) and ‘Multilingual Computing & Technology’ (http://www.multilingual.com), and a valuable source of information and opinion about translation aids of all kinds is to be found in the ‘Toolkit’ newsletter (http://www.internationalwriters.com/toolkit). A general resource for articles on all aspects of MT (current and historical) is the ‘Machine Translation Archive’ (http://www.mt-archive.info).
(3) Research on controlled languages and MT has been regularly reported at the CLAW (Controlled Language Workshop) conference series, started in 1996.
(14) e.g. the series of International Workshops on Spoken Language Translation, launched in 2004.
(15) The main impediment in most cases is the lack of text corpora (bilingual and monolingual) in electronic form, although the growth of Internet (website) resources is gradually filling the gaps.
(21) SALTMIL (Speech And Language Technology for Minority Languages) Workshops have been held since 1998.
(23) Workshops on CLIR have taken place regularly and frequently since 1996.
(25) The MMIES (Multi-source, Multilingual Information Extraction and Summarization) workshops on this topic have taken place since 2007.
(26) See e.g. the proceedings of the Workshop on Multilingual Question Answering (MLQA’06), held in April 2006 in conjunction with the EACL conference in Trento, Italy.