Michael P. Oakes
Author profiling is the analysis of people’s writing in an attempt to find out which classes they belong to, such as gender, age group, or native language. Many of the techniques for author profiling are derived from the related task of author identification, so we will look at this topic first. Author identification is the task of finding out who is most likely to have written a disputed document, and there are a number of computational approaches to it. The three main subtasks are the compilation of corpora of texts known to be written by the candidate authors, the selection of linguistic features to represent those texts, and the use of statistics to discriminate the features most indicative of a particular author’s writing style. Plagiarism is the unacknowledged use of another author’s original work, and we will look at software for its detection. The chapter will cover the types of text obfuscation strategies used by plagiarists, commercial plagiarism detection software and its shortcomings, and recent research systems. Strategies have been developed for both external plagiarism detection (where the original source is searched for in a large document collection) and intrinsic plagiarism detection (where the source text is not available, necessitating a search for inconsistencies within the suspicious document). The specific problems of plagiarism by translation of an original in another language, and of the unauthorized copying of sections of computer code, are also described. Evaluation forums and publicly available test data sets are covered for each of the main topics of this chapter.
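As a rough illustration of the external plagiarism detection strategy mentioned above, the following sketch scores a suspicious document against a candidate source collection by character n-gram overlap. The n-gram length and threshold are illustrative assumptions, not values taken from the chapter.

```python
def char_ngrams(text, n=5):
    """Return the set of character n-grams in a case- and space-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def rank_sources(suspicious, collection, threshold=0.25):
    """Rank candidate source documents by n-gram overlap with the suspicious text."""
    sus = char_ngrams(suspicious)
    scored = ((jaccard(sus, char_ngrams(doc)), doc_id)
              for doc_id, doc in collection.items())
    return sorted((s, d) for s, d in scored if s >= threshold)[::-1]
```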
The ability to communicate in writing is an essential skill in modern society. But ability in writing varies considerably; and no matter what their existing level of competence, most writers would acknowledge that what they write could often be improved. Given that the output of the writing process is natural language, it seems plausible that natural language processing techniques might be used to analyse this output and to suggest ways to improve it. In various guises, this has indeed been an application of NLP at least since the 1960s. In this chapter, we survey the different kinds of assistance to authors that NLP makes possible; we describe what can be done today, and explore what might be possible in the future.
This chapter surveys methods of analysing phonological change that rely on computers because they require lengthy operations, mathematical precision, and reproducibility. Applications include techniques for discovering and verifying sound correspondences, modelling the course of sound change, computing the most likely genetic tree consistent with a set of innovations, testing the significance of the phonetic evidence for genetic relationship between languages, and exploring the relationships between dialects via quantification of phonetic and phonological differences.
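One common ingredient of the dialect-comparison measures mentioned last is length-normalized edit distance over paired phonetic transcriptions; a minimal sketch follows, with hypothetical word lists.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two segment sequences."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        cur = [i]
        for j, sb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (sa != sb)))   # substitution
        prev = cur
    return prev[-1]

def dialect_distance(words_a, words_b):
    """Mean normalized edit distance over aligned cognate transcriptions."""
    dists = [levenshtein(a, b) / max(len(a), len(b))
             for a, b in zip(words_a, words_b)]
    return sum(dists) / len(dists)

# e.g. dialect_distance(["hus", "mus"], ["haus", "maus"]) -> 0.25
```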
This chapter introduces the fields of Computational Linguistics (CL)—the computational modelling of linguistic representations and theories—and Natural Language Processing (NLP)—the design and implementation of tools for automated language understanding and production—and discusses some of the existing tensions between the formal approach to linguistics and the current state of research and development in CL and NLP. The chapter goes on to explain the specific challenges faced by CL and NLP for Persian, many of which derive from the intricacies the Perso-Arabic script presents for automatically identifying word and phrase boundaries in text, as well as difficulties in the automatic processing of compound words and light verb constructions. The chapter then provides an overview of the state of the art in current and recent CL and NLP for Persian. It concludes with areas for improvement and suggestions for future directions.
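To make the word-boundary problem concrete: Persian compounds and light verb constructions commonly use the zero-width non-joiner (ZWNJ, U+200C) as a word-internal boundary, which a naive whitespace tokenizer will mishandle. The sketch below is a deliberately minimal illustration, not a full Persian tokenizer.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, marks internal boundaries in Persian words

def tokenize_persian(text):
    """Whitespace tokenization that keeps ZWNJ-joined units intact, so that
    e.g. the verb form می‌روم ('I go') stays a single token."""
    # Attach stray ZWNJs to their neighbours before splitting on whitespace.
    text = re.sub(r"\s*" + ZWNJ + r"\s*", ZWNJ, text)
    return text.split()
```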
Temporality in computational linguistics and natural language processing can be considered from two aspects. One concerns the use of linguistic and philosophical theories of temporality in computational applications. The other concerns the use of computational theory in its own right to define new kinds of theories of dynamical systems, including natural language and its temporal semantics. As in the case of nominal expressions in natural language, we should be careful to distinguish temporal semantics, or the question of what kinds of objects and relations temporal categories denote, from the question of temporal reference to the particular times or events that the discourse context affords. It is useful to draw a further distinction within the semantics between temporal ontology, that is, the types of temporal entity the theory entertains, such as instants, intervals, events, and states; temporal quantification over such entities; and the temporal relations over them that the theory countenances, such as priority or posteriority, causal dependence, and the like. This article examines temporal semantics within computational linguistics, and also considers ontologies, quantifiers, relations, and temporal reference.
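A toy rendering of such a temporal ontology may help fix ideas; the entity types and the single priority relation shown here are illustrative choices only, not the chapter’s proposal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instant:
    t: float            # a point on the timeline

@dataclass(frozen=True)
class Interval:
    start: float
    end: float

@dataclass(frozen=True)
class Event:
    label: str
    span: Interval      # in this toy ontology, events occupy intervals

def precedes(a: Interval, b: Interval) -> bool:
    """Priority: a lies wholly before b."""
    return a.end < b.start
```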
Computational linguistics grew out of early projects in machine translation. Initially it was conceived of as a branch of artificial intelligence with the goal of complete human-like language understanding, and was concerned with symbolic methods of parsing and semantic analysis. In recent years, because of more powerful computers, the development of machine-learning algorithms, and the rise of the World Wide Web, computational linguistics has taken an empiricist view of language processing that is based on corpora and statistical methods. It emphasizes practical applications with a tolerance for some degree of error.
Sentence comprehension draws on multiple levels of linguistic knowledge, including the phonological, orthographic, lexical, syntactic, and discoursal. This article focuses on computational models of second language sentence processing. Understanding the computational mechanisms responsible for using this knowledge in real time provides basic insights into how language and the mind work. For a cognitive theory of second language acquisition, a better understanding of how the second language learner develops the capacity to process sentences fluently also has important implications for theories of acquisition and instruction. This article examines two perspectives on written sentence comprehension in the second language: the syntax-based and the constraint-based approach. The two approaches make fundamentally different assumptions concerning the nature of linguistic representation and how the human sentence processing mechanism uses this knowledge in online comprehension. They also represent a basic division between formalist and functionalist/usage-based approaches to second language learning and use.
Edward P. Stabler
While research in the ‘principles and parameters’ tradition can be regarded as attributing as much as possible to universal grammar (UG) in order to understand how language acquisition is possible, Chomsky characterizes the ‘minimalist program’ as an effort to attribute as little as possible to UG while still accounting for the apparent diversity of human languages. These two research strategies aim to be compatible, and ultimately should converge. Several of Chomsky's own early contributions to the minimalist program have been fundamental and simple enough to allow easy mathematical and computational study. Among these are (i) the characterization of ‘bare phrase structure’; and (ii) the definition of a structure-building operation Merge which applies freely to lexical material, with constraints that ‘filter’ the results only at the phonetic form and logical form interfaces. The first studies inspired by (i) and (ii) are ‘stripped down’ to such a degree that they may seem unrelated to minimalist proposals, but this article shows how some easy steps begin to bridge the gap. It briefly surveys some proposals about (iii) syntactic features that license structure building; (iv) ‘locality’, the domain over which structure-building functions operate; (v) ‘linearization’, determining the order of pronounced forms; and (vi) the proposal that Merge involves copying.
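To convey how stripped down such a starting point is, here is Merge rendered as a single set-forming operation; the example derivation and lexical items are ours, offered only as a sketch of the idea.

```python
def merge(a, b):
    """Free, binary Merge: form the unordered set {a, b}. No label and no
    linear order are imposed here; linearization is left to the interfaces,
    in the spirit of the 'bare phrase structure' conception."""
    return frozenset({a, b})

# A toy derivation from lexical material (hypothetical items):
dp = merge("the", "dog")      # {the, dog}
tp = merge(dp, "barked")      # {{the, dog}, barked}
```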
This chapter presents a characterisation of the field of computational pragmatics, discusses some of the fundamental issues in the field, and provides a survey of recent developments. Central to computational pragmatics is the development and use of computational tools and models for studying the relations between utterances and their context of use. Essential for understanding these relations are the use of inference and the description of language use as actions inspired by the context, and intended to influence the context. The chapter therefore focuses on recent work in the use of inference for utterance interpretation and in dialogue modeling in terms of dialogue acts, viewed as context-changing actions. The chapter concludes with a survey of recent activities concerning the construction and use of resources in computational pragmatics, in particular annotation schemes, annotated corpora, and tools for corpus construction and use.
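One toy way to render the view of dialogue acts as context-changing actions is an update function over an information state; the state components and act types below are assumptions made for the sketch, not a standard from the literature.

```python
def update(context, act):
    """Apply a dialogue act to an information state (a plain dict)."""
    ctx = {k: list(v) for k, v in context.items()}
    if act["type"] == "inform":
        ctx.setdefault("commitments", []).append(act["content"])
    elif act["type"] == "question":
        ctx.setdefault("qud", []).append(act["content"])  # questions under discussion
    return ctx

state = update({}, {"type": "question", "content": "departure time?"})
```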
Carlos Ramisch and Aline Villavicencio
In natural-language processing, multiword expressions (MWEs) have been the focus of much attention in their many forms, including idioms, nominal compounds, verbal expressions, and collocations. In addition to their relevance for lexicographic and terminographic work, their ubiquity in language affects the performance of tasks like parsing, word sense disambiguation, and natural-language generation. They lend a mark of naturalness and fluency to applications that can deal with them, ranging from machine translation to information retrieval. This chapter presents an overview of their linguistic characteristics and discusses a variety of proposals for incorporating them into language technology, covering type-based discovery, token-based identification, and MWE-aware language technology applications.
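Type-based MWE discovery is often bootstrapped with lexical association measures; the sketch below uses pointwise mutual information (PMI), one widely used such measure, over adjacent word pairs.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI for each adjacent word pair: log2 of p(x,y) / (p(x) * p(y)).
    High-PMI pairs (e.g. fixed expressions) are MWE candidates."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {bg: math.log2((c / n) /
                          ((unigrams[bg[0]] / n) * (unigrams[bg[1]] / n)))
            for bg, c in bigrams.items()}
```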
Carol A. Chapelle
Computer-assisted language learning (CALL), defined as “the search for and study of applications of the computer in language teaching and learning”, covers a broad spectrum of concerns, but the central issues are the pedagogies implemented through technology and their evaluation. In view of the range of complex materials included under the umbrella of CALL, research and practice in this area draw from other areas within and beyond applied linguistics for conceptual and technical tools to develop practices and evaluate success. Like technologies for language learning, theories of instructed SLA have evolved dramatically over the past twenty years. One change is the evolution in the input theory that Underwood drew upon. Whereas that theory asserts that the second language is acquired unconsciously, Schmidt claims the opposite: that subliminal language learning is impossible, and that intake is what learners consciously notice. This requirement of noticing is meant to apply equally to all aspects of language.
This chapter surveys the use of corpora in natural-language processing (NLP). It begins by defining what a corpus is, introducing in the process different types of corpora, such as monolingual, parallel, and comparable corpora. It also discusses key issues in corpus design, notably balance and representativeness. The chapter then traces the history of corpus linguistics, from its early beginnings in the pre-computer age to its current digital form. Following this there is a brief survey of the current state of corpora, taking into account recent innovations in corpus construction, notably the development of the notion of the ‘Web as corpus’. The chapter concludes by briefly considering the use of corpora in a range of NLP systems.
Deep learning has rapidly gained huge popularity among researchers in natural-language processing and computational linguistics in recent years. This chapter gives a comprehensive and detailed overview of recent deep-learning-based approaches to challenging problems in natural-language processing, focusing on document classification, language modelling, and machine translation. The chapter ends by discussing new opportunities in natural-language processing made possible by deep learning, namely multilingual modelling and larger-context modelling.
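For document classification, the flavour of such models can be conveyed by a deliberately tiny ‘neural bag-of-words’ forward pass; the dimensions and random weights below are arbitrary stand-ins, and real systems would be trained and far deeper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, C = 1000, 64, 2            # vocabulary size, embedding dim, classes (assumed)
E = rng.normal(size=(V, D))      # word embedding table (untrained here)
W = rng.normal(size=(D, C))      # linear classifier weights

def classify(token_ids):
    """Average the document's word embeddings, apply a linear layer,
    and return a softmax distribution over classes."""
    h = E[token_ids].mean(axis=0)
    logits = h @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = classify([3, 17, 512])   # hypothetical token ids
```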
Research on dialogue deals with the study of language as it is used in spontaneous conversation. Dialogue is a multi-agent activity that takes place in real time, with speakers interacting with each other in an online fashion. This makes conversational language markedly different from the kind of language found in texts and brings in new challenges for computational linguistics. This chapter introduces the main phenomena that characterize language in dialogue interaction—including disfluencies, dialogue acts, alignment, grounding, and turn taking—and discusses some of the key approaches to modelling dialogue that are fundamental in computational research, such as dialogue act taxonomies and dynamic semantic theories of dialogue.
Discourse is the area of linguistics concerned with the aspects of language use that go beyond the sentence—and in particular, with the study of coherence and salience. In this chapter we present a few key theories of these phenomena. We distinguish between two main types of coherence: entity coherence, primarily established through anaphora; and relational coherence, expressed through connectives and other relational devices. Our discussion of anaphora and entity coherence covers the basic facts about anaphoric reference and introduces the dynamic approach to the semantics of anaphora implemented in theories such as Discourse Representation Theory, based on the notion of discourse model and its updates. With regard to relational coherence, we review some of the main claims about the relational structure of discourse—such as the claim that coherent discourses have a tree structure, or the right frontier hypothesis—and four main theoretical approaches: Rhetorical Structure Theory, Grosz and Sidner’s intentional structure theory, the inference-based approach developed by Hobbs and expanded in Segmented DRT, and the connective-based account. Finally, we cover theories of local and global salience and its effects, including Gundel’s Activation Hierarchy theory and Grosz and Sidner’s theory of the local and global focus.
Rebecca Passonneau and Inderjeet Mani
This chapter introduces a conceptual framework for the evaluation of natural language processing systems. It characterizes evaluation in terms of four dimensions: intrinsic versus extrinsic evaluation, stand-alone systems versus components, manual versus automated methods, and laboratory versus real-world conditions. A comparative overview of evaluation methods in major areas of NLP is provided, covering distinct applications such as information extraction, machine translation, automatic summarization, and natural language generation. The discussion of these applications emphasizes commonalities across evaluation methods. Next, evaluation of particular component technologies is discussed, addressing coreference, word sense disambiguation and semantic role labelling, and finally referring-expression generation. The chapter concludes with a brief assessment of the status of evaluation in NLP.
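Many of the intrinsic, automated evaluations surveyed reduce to set comparison between system output and a gold standard; the standard precision/recall/F1 computation is sketched below.

```python
def precision_recall_f1(gold, predicted):
    """Score a set of predicted items against a gold-standard set."""
    tp = len(gold & predicted)                       # true positives
    p = tp / len(predicted) if predicted else 0.0    # precision
    r = tp / len(gold) if gold else 0.0              # recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0     # harmonic mean
    return p, r, f1

precision_recall_f1({"a", "b", "c"}, {"b", "c", "d"})  # -> (0.667, 0.667, 0.667)
```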
Finite-state machines—automata and transducers—are ubiquitous in natural-language processing and computational linguistics. This chapter introduces the fundamentals of finite-state automata and transducers, both probabilistic and non-probabilistic. It covers the construction of automata, which correspond to regular languages, and of transducers, which correspond to regular relations. The technologies introduced are widely employed in natural language processing, computational phonology and morphology in particular, and this is illustrated through example applications and common practical use cases.
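As a minimal illustration, here is a deterministic finite-state automaton for the regular language (ab)*; the table-driven encoding is one of several common representations.

```python
def make_dfa(transitions, start, accepting):
    """Build a recognizer from a (state, symbol) -> state transition table."""
    def accepts(string):
        state = start
        for symbol in string:
            state = transitions.get((state, symbol))
            if state is None:        # no transition defined: reject
                return False
        return state in accepting
    return accepts

# DFA for (ab)* over the alphabet {a, b}:
accepts = make_dfa({(0, "a"): 1, (1, "b"): 0}, start=0, accepting={0})
assert accepts("abab") and accepts("") and not accepts("aba")
```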
Information extraction constructs a structured knowledge representation from unstructured text, so that the knowledge may be further used for search, inference, and analysis. Given a specification of select types of entities, semantic relations, and events, it builds a database from instances of this information in text. This chapter describes the stages of processing involved and considers how such systems may be built using hand-coded rules, supervised training, and semi-supervised training.
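A toy hand-coded extraction rule of the kind described, here a single regular expression for a hypothetical ‘employment’ relation; real systems combine many such rules with entity recognition and, increasingly, learned models.

```python
import re

# Hypothetical pattern: "<Person>, CEO of <Organization>"
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), (?:the )?CEO of (?P<org>[A-Z][\w&. ]*\w)")

def extract(text):
    """Return database-ready records for each pattern match."""
    return [{"type": "employment", "person": m["person"], "org": m["org"]}
            for m in PATTERN.finditer(text)]

extract("Jane Smith, CEO of Acme Corp, announced a merger.")
# -> [{'type': 'employment', 'person': 'Jane Smith', 'org': 'Acme Corp'}]
```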
Qiaozhu Mei and Dragomir Radev
This chapter is a basic introduction to text information retrieval. Information Retrieval (IR) refers to the activity of obtaining, from a much larger collection, information resources (usually in the form of textual documents) that are relevant to an information need of the user (usually expressed as a query). Practical instances of an IR system include digital libraries and Web search engines. This chapter presents the typical architecture of an IR system, an overview of the methods underlying the design and implementation of each major component of an IR system, a discussion of evaluation methods, and finally a summary of recent developments and research trends in the field of information retrieval.
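The core ranking component of such a system can be sketched in a few lines: a TF-IDF vector space model with cosine similarity, one classic design among those surveyed (tokenization here is naive whitespace splitting).

```python
import math
from collections import Counter

def tfidf(docs):
    """Term frequency x inverse document frequency weights per document."""
    df = Counter(t for d in docs for t in set(d.split()))
    idf = {t: math.log(len(docs) / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(d.split()).items()}
            for d in docs], idf

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, docs):
    """Return document indices ranked by similarity to the query."""
    vecs, idf = tfidf(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.split()).items()}
    return sorted(range(len(docs)), key=lambda i: cosine(q, vecs[i]), reverse=True)
```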
This article focuses on the current state of affairs in the field of Arabic computational linguistics. It begins by briefly reviewing relevant trends in phonetics and phonology, morphology, syntax, lexicology, semantics, stylistics, and pragmatics. The chapter then describes changes and special accents within formal Arabic syntax. After some evaluative remarks about the chosen approach, it continues with a linguistic description of literary Arabic for analysis purposes, as well as an introduction to a formal description, pointing to some early results. The article hints at further perspectives for ongoing research and possible spin-offs, such as a description of Arabic syntax in formalized dependency rules, as well as a subset thereof for information retrieval purposes.