
A more recent version of this content exists; this version was replaced on 6 September 2017.

PRINTED FROM OXFORD HANDBOOKS ONLINE (www.oxfordhandbooks.com). © Oxford University Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy and Legal Notice).

date: 17 February 2020

Issues in Arabic Computational Linguistics

Abstract and Keywords

This article focuses on the current state of affairs in the field of Arabic computational linguistics. It begins by briefly monitoring relevant trends in phonetics and phonology, morphology, syntax, lexicology, semantics, stylistics, and pragmatics. Then, changes or special accents within formal Arabic syntax are described. After some evaluative remarks about the approach opted for, it continues with a linguistic description of literary Arabic for analysis purposes as well as an introduction to a formal description, pointing to some early results. The article hints at further perspectives for ongoing research and possible spin-offs such as a formalized description of Arabic syntax in formalized dependency rules as well as a subset thereof for information retrieval purposes.

Keywords: phonetics, phonology, morphology, syntax, lexicology, semantics, stylistics, pragmatics

9.1 Introduction

At a meeting in Doha (Qatar 2011), experts discussed the challenges for Natural Language Processing (NLP)1 applied to (and, if possible, in) Arabic, concerning technologies, resources, and applications in cultural, social, educational, medical, and touristic areas in the region concerned, for the near future. Interestingly enough, there was a consensus (by majority of votes)2 on placing more focus on the large-scale capture of written Arabic (OCR), in view of the preservation and accessibility of the Arabic and Islamic cultural heritage; on the spoken varieties of Arabic, in view of the development of all kinds of conversion and answering systems (AS) to and from a standard, speech-to-speech3 (STS) as well as speech-to-text (STT) and text-to-speech (TTS)4 conversion; and, (p. 214) finally, on multisimultaneous signal processing (subtitling, visualization, and instant translation),5 if possible with event (EE) or factoid (FE) extraction, for information retrieval (IR), document routing (DR), archiving purposes, mass storage, Aboutness suggestions, and different forms of retracing facilities.

NLP overlaps to a large degree with computational linguistics (CL), especially when both are applied to standard Arabic or spoken varieties. The former (NLP) usually centers on the interaction between man and machine, deals with “processing” and “automated processes,” and is as exact as possible in nature. It therefore remains measurable and verifiable (NLP is eager for applications). The latter (CL) concentrates exclusively on linguistic theory and language modeling, while using any computational means it can exploit, for an adequate, coherent, and consistent linguistic description or language model.

CL is usually characterized as a subsection of artificial intelligence (AI). However, I would like to underline its communicative (action and reaction) perspective against the purely cognitive environment of AI. Moreover, the communicative aspect of CL points to a reference to reality. Even when formalized and in its most abstract and logically implemented form, semantics still remains an open domain. I would also like to underline two complementary aspects of CL concerning Arabic: one is the application of computer science to Arabic; the other is Arabic linguistics making as much use as possible of computational as well as linguistic means and techniques. The former is more striking, and the latter is more basic.

For information on relevant trends in Arabic NLP and CL, we do not need to start from scratch. General trends in NLP are adequately described in Manning and Schütze (2003) and Jurafsky and Martin (2009);6 for Arabic NLP, see Habash (2010) and the references therein. For CL in general, one should certainly consult Bunt et al. (2004, 2010).7 Soudi et al. (2007) and Farghaly (2010) offer valuable contributions in the field of Arabic CL but also deal with Arabic NLP. Levelt (1974) is a must for formal grammars (and psycholinguistics). On (more linguistically oriented) main and subentries, there is valuable information in Versteegh (2006–2009).8 Needless to say, the Internet is always a good, if not the best, starting point for a literature search.

The main topic of interest here is the current state of affairs in the field of Arabic CL. Relevant trends in phonetics and phonology, morphology, syntax, lexicology, semantics, and stylistics and pragmatics will be briefly examined. Then changes or special accents within the field of interest, namely, formal Arabic syntax, will be noted. After some evaluative remarks about the approach of this chapter, it continues with a linguistic description of MSA for analysis purposes as well as an introduction to a formal description. Some early results will be highlighted. Further perspectives are then (p. 215) offered for ongoing research and possible spin-offs such as a formalized description of Arabic syntax in formalized dependency rules as well as a subset thereof for IR purposes. Appendix 1 contains a list of acronyms frequently encountered in NLP and CL. Appendix 2, found only in the online version, gives a glossary of frequently used terms in NLP and CL.

9.2 Arabic CL

Defined as a statistical or rule-based modeling of natural language from a computational point of view,9 CL should always contain a linguistic dimension in the form of a specific theory combined with a descriptive model together with a formal implementation in which that linguistic theory about a natural language or its adequate, coherent, and consistent description is entered in a processing environment for analysis or synthesis purposes. Jurafsky and Martin (2009) adequately describe a modern “toolkit” for CL, but we limit ourselves here mainly to rule-based modeling by means of a relational programming algorithm using a nondeterministic formalism of two interwoven context-free grammars, resulting in a bottom-up unification-based parser for Arabic.10 Levelt (1974) provides a descriptively and didactically good introduction in the field of Formal Grammars and Psycholinguistics.

Arabic CL thus combines a linguistic and a computational part. The linguistic part exploits the most recent developments in the field of adequate, coherent, and consistent language description. The formal part tests the (natural or programming) language description concerned with computational viability. Nowadays, linguistic description testing usually takes place in the framework of corpus linguistics (CoL) using large collections of authentic language data, as such serving as a reliable test bed and learning model for refining the linguistic description. The formal part of such a linguistic implementation can be tested using personal computers.

Authentic text data contain both stylistic and pragmatic elements. There we are dealing with a form of semantics, hidden in structured combinations (syntax) of lexical elements (lexicon). Relations and dependencies between elements may be underlined with morphemes (morphology) that should be accounted for in a description of the language concerned. Such a description comprises the inventory of the smallest unit of linguistic description, which is called the phoneme (phonology), or its orthographic counterpart, the grapheme (orthography). In this way, authentic (Arabic) data represent a single multiple-layer syntax, distinguished in modules only for practical reasons.

(p. 216) 9.2.1 Computational Phonetics and Phonology

Here the term phonetics indicates the study of the physical properties of the smallest units of linguistic description in the Arabic language (i.e., phonemes), whether for analysis (recognition) or synthesis (generation) purposes.11 At an early stage, this study was extended with the study of its graphic counterpart, the grapheme. Later, research started on the development of all kinds of remedial support such as Arabic Braille (AB), text-to-speech, and speech-to-text conversion, or combinations thereof (for blind, deaf, and speech-impaired users).

Phonology, on the other hand, is more concerned with the generalized, grammatical characterization of the Arabic phoneme and grapheme inventory of the language system. Computational phonology is the use of computational models in phonological theory. For the description of the typical nonconcatenative root and pattern system of Semitic languages in general and of Arabic phonology and morphology in particular, McCarthy (1981) proposes a representation of different layers, further developed by Kay (1987), Beesley (1996), Kiraz (1994, 1997), Kiraz and Anton (2000, 2001), and Ratcliffe (1998) [Ratcliffe, “Morphology”]. More recent developments go in the direction of optimality theory (OT; Prince and Smolensky 2004) and syllabification (Kiraz and Möbius 1998). There are already some specialized studies in this field on Arabic phonology [Hellmuth, “Phonology”].
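The nonconcatenative root-and-pattern idea behind these multitiered representations can be illustrated with a minimal sketch. This is my own toy illustration, not McCarthy's autosegmental formalism or any cited implementation: the consonants of a triliteral root are slotted into a vocalized CV template.

```python
# A minimal sketch (not any of the cited systems) of root-and-pattern
# interdigitation: consonants of a triliteral root fill the C slots
# of a vocalized pattern template.

def interdigitate(root, pattern):
    """Merge a root (tuple of consonants) into a CV template.

    `pattern` uses 'C' as a consonant slot; other symbols (vowels,
    affix consonants such as the m- of maCCuuC) pass through unchanged.
    """
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot
                   for slot in pattern)

# Illustrative forms of the root k-t-b 'write' (transliterated):
root = ("k", "t", "b")
print(interdigitate(root, "CaCaCa"))   # kataba  'he wrote'
print(interdigitate(root, "CuCiCa"))   # kutiba  'it was written'
print(interdigitate(root, "maCCuuC"))  # maktuub 'written'
```

A real system must of course also handle weak roots, gemination, and phonological adjustments, which is precisely where the multitier and finite-state machinery of the cited literature comes in.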

Müller (2002) adds an NLP flavor with his probabilistic context-free grammars for phonology.12 Computational phonology is the basis of many NLP applications, such as the previously mentioned AS systems and STT, TTS, and STS conversion and others such as Arabic speech recognition (SR), OCR, and text-to-text conversion (TTT) or machine translation (MT).13

In computational phonetics and phonology, we are confronted with terms such as tiers (distinct layers of representation, numbering two, three, or four), finite-state automata (FSA or FSM), transducers, programming languages, tables, tagging, ± deterministic, and a few other technical terms. Sometimes, a decisive discussion about progress in the field of research concerned can be worded in general (Arabic) linguistic terms, but for certain entries in Appendix 9.2 it was necessary to employ less frequently used linguistic terms.
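As a minimal illustration of the finite-state machinery mentioned above, the following hand-built toy acceptor (purely illustrative, not drawn from any cited system) recognizes one syllable shape commonly discussed in Arabic syllabification work, CV(C): an onset consonant, a vowel nucleus, and an optional coda consonant.

```python
# A toy deterministic finite-state acceptor for the syllable shape
# CV(C), stated over the abstract alphabet {"C", "V"}.

TRANSITIONS = {
    (0, "C"): 1,   # onset consonant
    (1, "V"): 2,   # vowel nucleus
    (2, "C"): 3,   # optional coda consonant
}
ACCEPTING = {2, 3}  # accept after the vowel (CV) or after the coda (CVC)

def accepts(symbols):
    """Run the automaton; return True iff it ends in an accepting state."""
    state = 0
    for sym in symbols:
        state = TRANSITIONS.get((state, sym))
        if state is None:          # no transition: reject
            return False
    return state in ACCEPTING

print(accepts(["C", "V"]))       # True  (CV)
print(accepts(["C", "V", "C"]))  # True  (CVC)
print(accepts(["V", "C"]))       # False (no onset)
```

A transducer, as used in two-level morphology, extends this idea by pairing an input symbol with an output symbol on each transition.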

9.2.2 Arabic Computational Morphology

As Richard Sproat correctly mentioned (Soudi et al. 2007: viii), Kaplan and Kay (1981) and Kay (1987), in line with Koskenniemi (1983), paved the way for Kenneth Beesley’s (1989) (p. 217) research on Arabic computational morphology, which led to the work of many others as well as to the development of applications in the field of Arabic morphology.

One of the pioneers in computational Arabic morphology, Tim Buckwalter, developed BAMA, an Arabic morphological analyzer (Buckwalter 2010). Initially, his research was oriented toward automated corpus-based Arabic lexicology. Later, three lexicons, compatibility scripts, and an algorithm in the feature-rich, dynamically typed14 programming language Perl were combined in a software package for the morphological analysis of Arabic words (Buckwalter 2002, 2004), used, inter alia, for morphological and part-of-speech (POS) tagging as well as for syllabification of authentic data in existing Arabic treebanks15 (Maamouri and Bies 2010; Smrž and Hajič 2010) for morphological or syntactic annotation.
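The basic design of such a lexicon-plus-compatibility analyzer can be sketched as follows. The entries and class names below are invented toy data, not Buckwalter's actual tables: the word is split into every possible prefix, stem, and suffix, each segment is looked up in its own table, and only class-compatible combinations survive.

```python
# Toy illustration of the BAMA design: three lookup tables plus
# compatibility checks between adjacent segment classes.
# All entries are invented for the example.

PREFIXES = {"": "Pref-0", "wa": "Conj"}           # e.g. wa- 'and'
STEMS    = {"kitAb": "Noun", "katab": "VerbPerf"}
SUFFIXES = {"": "Suff-0", "a": "VSuff", "u": "NSuff"}

# Which (prefix-class, stem-class, suffix-class) triples may combine:
COMPAT = {
    ("Pref-0", "Noun", "NSuff"), ("Conj", "Noun", "NSuff"),
    ("Pref-0", "VerbPerf", "VSuff"), ("Conj", "VerbPerf", "VSuff"),
    ("Pref-0", "Noun", "Suff-0"), ("Conj", "Noun", "Suff-0"),
}

def analyze(word):
    """Return all licensed (prefix, stem, suffix, classes) segmentations."""
    results = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in PREFIXES and stem in STEMS and suf in SUFFIXES:
                classes = (PREFIXES[pre], STEMS[stem], SUFFIXES[suf])
                if classes in COMPAT:
                    results.append((pre, stem, suf, classes))
    return results

print(analyze("wakataba"))
# [('wa', 'katab', 'a', ('Conj', 'VerbPerf', 'VSuff'))]
```

The real analyzer works with tens of thousands of stem entries and far richer class inventories, and returns all analyses of an ambiguous unvocalized form rather than a single one.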

It is not surprising that research on Arabic computational morphology is easily adopted, adapted, and incorporated into general approaches to computational phonetics, phonology, and morphophonemics. Al-Sughaiyer and Al-Kharashi (2004) classify a number of Arabic morphological analyzers and synthesizers (generators) according to the approach employed: table lookup; linguistic (two-level, FSA or FSM, traditional applications); combinatorial; and pattern-based approaches. As Köprü and Miller (2009) point out, “Very few of the available systems are evaluated using systematic and scientific procedures.” This is perhaps a bit too harsh a criticism. However, it is always worthwhile to scrutinize and evaluate the advantages and disadvantages as well as the adequacy, coherency, and consistency of a chosen approach.

Evaluating 20-odd Arabic morphological analyzers and synthesizers, Al-Sughaiyer and Al-Kharashi (2004: 198, Table 4) mention their algorithm name and type: some “brand” names and even one “Sebawai”;16 many “linguistics”; and one “rule based.” Smrž (2007: 5–6) qualifies Beesley (2001), Ramsay and Mansur (2001), and Buckwalter (2002) as “lexical” in nature. Habash (2004) calls his own work “lexical-realizational” in nature. Finally, Cavalli-Forza et al. (2000), Habash et al. (2005), Dada and Ranta (2006), and Forsberg and Ranta (2004) are rather “inferential-realizational.”

For his ElixirFM, Smrž (2007: 2) emphasizes its implementation within the Prague framework of function generative dependency (FGD) in a functional programming language (Haskell), contrasting with the dynamically typed programming language (Perl) of Buckwalter (2002) and resulting in “a yet more refined linguistic model.”

Partly based on the operational tagging system of Buckwalter’s BAMA morphological analyzer for Arabic, Otakar Smrž developed a “description of [Arabic] surface syntax in the dependency framework” (Smrž and Hajič 2010). This brings us to the doorstep (p. 218) between Arabic phonology–morphology and Arabic computational syntax, at least as far as the representation of the analysis results in the form of dependency trees is concerned. These results are obtained on the basis of a pretagged corpus. The Prague linguists opted for a functional dependency grammar approach. Nonetheless, also for the computational description of Arabic morphology and syntax, a programming language, Haskell,17 is being used.18 There is an important difference between the use of a programming language and a formalism for implementable and operational descriptions of a natural language.19

9.2.3 Arabic Computational Syntax

Syntax is the description of the overall organization of a natural language in which different complementary building blocks such as phonology, morphology, lexicology, semantics, stylistics, and pragmatics come together to convey a particular message between an A and a B. To describe the general structure of this organization for natural languages in general, or for a specific language such as Arabic, is the objective of the linguistic part of the description. To find an implementable formal model for such a description is the objective of the computational part of that description.

9.2.3.1 Linguistic Part

For a historical overview of language description, I refer to HSK 18.3 (2006). Here we limit ourselves to the century of the dominance of immediate constituency (IC) and the rise of many other linguistic theories and descriptive models such as dependency grammar, of importance or used for the linguistic description of Standard and spoken Arabic varieties.

It is evident that the splitting up of a natural language system into its largest and smallest units of linguistic description, as well as the description of mutual relationships and (p. 219) dependencies between these units, forms an excellent starting point for any research on the fundamentals of human communication in general, and on the organization of a specific natural language system in particular (Habash 2010: chapter 6). Computational linguistics (cf. Winograd 1983) started with the annotation (POS tagging) of formal elements (e.g., parts of speech; word and phrasal categories; sentences, sections, chapters, volumes; and the marking of nontextual insertions)20 and functional elements (e.g., cases, clitics, determiners, heads and modifiers, slots and fillers) in authentic text data (CoL), and continued later with the presentation of derivation trees or labeled bracketing, extracted from this (earlier inserted) information.

9.2.3.2 Formal Part

For a historical overview of computational language description in general, I refer to Winograd (1983). Here we speak about the current state of syntactic parsing of Arabic text data wherein different steps can be distinguished. Usually, they are labeled with terms such as tokenization, diacritization, POS tagging, morphological disambiguation (Marton et al. 2010), base phrase chunking, semantic role labeling, lemmatization, stemming, and the like (cf. Appendix 9.2; cf. also Mesfar 2010). Most of these processes have been automated by now, but all the existing collections of syntactically analyzed Arabic text data (Habash 2010: section 6.2) such as the Penn Arabic Treebank (Maamouri et al. 2004), the Prague Arabic Dependency Treebank (Hajič et al. 2001), and the Columbia Arabic Treebank (Habash and Roth 2009) have been manually checked. This “forest of treebanks” (Habash 2010: 111) can now be used as learning models for the development of new statistical parsers, evaluating parsers and general Arabic parsers.
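The pipeline of steps named above can be sketched schematically. The function names and the toy tag lexicon below are illustrative placeholders of my own, not the API of any cited system; a real pipeline would add diacritization, clitic segmentation, and statistical disambiguation at each stage.

```python
# Schematic three-stage sketch of an Arabic text-processing pipeline:
# tokenization -> POS tagging -> base phrase chunking.

def tokenize(text):
    # Naive whitespace tokenization; real tokenizers also split clitics.
    return text.split()

def pos_tag(tokens):
    # Toy lookup; a real tagger disambiguates in context.
    tags = {"qara'a": "VERB", "al-waladu": "NOUN", "al-kitAba": "NOUN"}
    return [(tok, tags.get(tok, "UNK")) for tok in tokens]

def chunk(tagged):
    # Group maximal NOUN runs into base NP chunks. Note that this naive
    # run-based grouping wrongly merges adjacent subject and object NPs;
    # a real chunker uses case and agreement cues to keep them apart.
    chunks, current = [], []
    for tok, tag in tagged:
        if tag == "NOUN":
            current.append(tok)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [tok]))
    if current:
        chunks.append(("NP", current))
    return chunks

sent = "qara'a al-waladu al-kitAba"   # 'the boy read the book'
print(chunk(pos_tag(tokenize(sent))))
```

Even this toy version shows why the treebanks cited above were manually checked: each automated stage introduces errors that propagate downstream.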

9.2.4 Arabic Computational Lexicology

Computational lexicology is the branch of linguistics concerned with the use of computers in the study of machine-readable dictionaries (the lexicon). Sometimes this term is used synonymously with computational lexicography, though the latter refers more specifically to the use of computers in the construction of dictionaries (Al-Shalabi and Kanaan 2004).21

Piek Vossen, a well-known computational lexicologist, founder and president of the Global Wordnet Association, worked on the first WordNet project (Fellbaum 1998) and supervised parallel projects such as EuroWordNet (Vossen 1998) and Arabic WordNet (Black et al. 2006; Elkateb et al. 2006). He is thinking in terms of (multi)lingual lexical (p. 220) databases with lexical semantic networks. We come close to a distinction in form, function, meaning, and contextual realization of a lexical entry. Besides this distinction we always have the linguistic and the formal part.

Linguistic part

Relevant here are studies such as those of Dévényi et al. (1993) on Arabic lexicology and lexicography as well as other research with valuable bibliographical references (Bohas 1997; Bohas and Dat 2008; Hassanein 2008; Hoogland 2008; Seidensticker 2008). Moreover, one should include the studies about affixes, features (Dichy 2005; Ditters 2007), or parameters hinting at theta, thematic, or semantic roles.

Formal part

On the formal side I would like to mention the tag sets (Habash 2010: 79–85; Maamouri et al. 2009; Diab et al. 2004; Diab 2007; Kulick et al. 2006; Habash and Roth 2009; Khoja et al. 2001; Hajič et al. 2005) used for the annotation of the corpora of Arabic text data as well as the by then enriched corpora (treebanks) from which all kinds of relevant information can be extracted. Here should also be included studies on Arabic semantic labeling (Diab et al. 2007) and Arabic semantic roles (Diab et al. 2008).

9.2.5 Arabic Computational Semantics, Stylistics, and Pragmatics

Computational syntax, at the academic level, is still not common practice. Computational semantics, stylistics, and pragmatics are at an even more rudimentary stage,22 not only as far as the Arabic language is concerned but even for more intensively studied natural languages. It is worthwhile here to refer to the HSK volumes on dependency and valency (HSK 25, 2003–2006), and in particular to contributions of interest for our discussion23 (Owens 2003; Msellek 2006; Bielický and Smrž 2008, 2009).

According to Wikipedia:24


Figure 9.1 An example from the Arabic Propbank (Habash 2010, 115).

Computational semantics is the study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions. It consequently plays an important role in natural language processing and computational linguistics. Some traditional topics of interest are: semantic analysis, semantic (p. 221) underspecification, anaphora resolution, presupposition projection, and quantifier scope resolution. Methods employed usually draw from formal semantics or statistical semantics.

In a note on Arabic semantics, Habash (2010: chapter 7) mentions the Arabic Proposition Bank (Propbank) and Arabic WordNet. A propbank (Palmer et al. 2005) is, in contrast with a treebank, a semantically annotated corpus. On the basis of predicate–argument information about the complement structure of a verb, so-called frameset definitions (Baker et al. 1998) are associated with each entry of the verbal lexicon (Palmer et al. 2008). The description of the nature, number, and role of the arguments can be as detailed and specific as a linguistic description of the semantics of a language allows. It may be clear that here lies the greatest challenge in the development of adequate, coherent, and consistent parsers for any natural language text data, MSA and spoken varieties of Arabic included.

Figure 9.1 presents information about the linguistic unit of description (S) and the type of sentence (NP-TPC1), with a verb phrase (VP) as comment. The topic (ARG0) is realized by a noun phrase (NP). The comment, which may be termed “predicate,” is realized by a finite transitive verb (PRED) with an implicit subject (ARG0), a direct object noun phrase (ARG1), and a prepositional time adverbial (ARGM-TMP). There is a form of subcategorization at the phrasal level (NP-TPC1, NP, NP-SBJ1, NP-OBJ). Nouns are divided by subscripts into common nouns and subcategories. There is some form of description in terms of functions and categories, but it is not maintained in a consistent and coherent way until the final lexical entries have been reached.
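The frameset-style predicate–argument annotation exemplified by Figure 9.1 can be sketched as a simple data structure. The argument tags follow the figure, while the Python classes, the transliterations, and the sense id are my own illustrative assumptions, not the actual Arabic Propbank format.

```python
# Sketch of a Propbank-style annotated predicate instance.

from dataclasses import dataclass, field

@dataclass
class Argument:
    label: str     # ARG0, ARG1, ARGM-TMP, ...
    phrase: str    # surface span filling the role (transliterated)

@dataclass
class Frameset:
    predicate: str
    roleset: str                       # sense id (hypothetical here)
    arguments: list = field(default_factory=list)

# The magazine-presents-the-issue example, following Figure 9.1's tags:
instance = Frameset(
    predicate="qaddamat",              # 'presented'
    roleset="qaddam.01",               # hypothetical sense id
    arguments=[
        Argument("ARG0", "al-majalla"),            # agent: the magazine
        Argument("ARG1", "qadiyyat al-lugha"),     # theme: the issue
        Argument("ARGM-TMP", "fi bab al-adab"),    # adjunct (illustrative)
    ],
)
print([a.label for a in instance.arguments])  # ['ARG0', 'ARG1', 'ARGM-TMP']
```

The point of the structure is that the roleset, not the surface order, fixes what counts as ARG0 or ARG1, which is what makes a propbank a semantic rather than a syntactic annotation layer.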

Arabic WordNet has been mentioned earlier in relation to computational lexicology (§9.2.4). Here I want to explicitly underline the importance of electronically available collections of text data, and nowadays also parallel corpora, for linguistic research (CoL = corpus linguistics). In the following section I defend a purely linguistic approach (p. 222) to Arabic language description, exploiting as many computational means as possible, on the basis of authentic data and within a long-standing Arabic grammatical tradition.

Other studies in the fields of semantics,25 dialogue (Hafez 1991), discourse,26 stylistics (Mitchell 1985; Mohammad 1985; Somekh, 1979, 1981), and pragmatics27 remain mainly theoretical. Little can be found on heuristics. Something like a dialogue act annotation system allowing the ranking of communicative functions of utterances in terms of their subjective importance (Włodarczak 2012) for Arabic has still to be written. Computational semantics has points of contact with the areas of lexical semantics (word sense disambiguation and semantic role labeling), discourse semantics, knowledge representation, and automated reasoning (in particular, automated theorem proving).

9.3 A Formalized Linguistic Description of Arabic Syntax

There are, in my opinion, some basic concepts and rules important for long-term linguistic research. The first point is that syntax encompasses a number of subfields, including phonology, morphology, lexicology, semantics, stylistics, pragmatics, and heuristics, together with their respective branches, including the use of computational tools. Moreover, linguistic research should improve the field both positively and negatively: positively in the sense of enriching the discipline and being socially relevant; negatively in the sense of convincingly demonstrating that a specific approach did not and will not lead to useful results or meaningful research.

A rather important rule is that any account of linguistic research, with some additional information and footnotes, should be readable and understandable for, as well as verifiable by, colleagues. Finally, scientific linguistic research should not be presented encoded in machine language or a programming language printout, or even in PDF form, and, moreover, should not be superficial, as are many of the presentations of commercial researchers and product developers.

(p. 223) The description of the syntactic structure28 of Standard Arabic, readable and understandable for, as well as verifiable by, colleagues, may have the form of a hypothesis, to be tested against authentic language data. After refining and renewed testing, this leads to a theory about the syntactic structure of the same layer of data of Standard Arabic as tested in the data. The same approach can be followed by further research on other Arabic text data.

Earlier, a listing was made of useful and (moreover) operational computational instruments, including machine-readable resources of all kinds, for the automated linguistic research on Arabic. We discussed the difference between Arabic NLP and Arabic CL, underlining the independence of the linguistic and formal parts in this research, while acknowledging a bias in favor of the linguistic part. Here I will defend an approach to an adequate, consistent, and coherent description of, in this case, MSA for the automated analysis29 of authentic Arabic text data.

First, we position this section in Arabic NLP history (9.3.1). Then I say something about linguistic and formal concepts for language description within the Arabic grammatical tradition (9.3.2). Next, I present a sample of a linguistic (9.3.3) and a formal part (9.3.4) of a description of MSA. Finally (9.4), I say something about perspectives on the basis of the options chosen.

9.3.1 Evaluative Remarks about the Approach Opted For

When Smrž (2007: 68) says, “The tokens30 with their disambiguated grammatical information enter the annotation of analytical syntax,” we are in the linguistic part of our discussion about computational Arabic syntax. The same is the case with Topologische Dependenzgrammatik fürs Arabische (Odeh 2004). In both, the results of the analysis of some interesting syntactic peculiarities of the Arabic language, processed in a language-independent dependency-oriented environment (Debusmann 2006), are presented in the form of unambiguous and quite elegant tectogrammatical dependency trees, on the basis of an analytical representation in the case of Smrž (2007) and, except for the labeling, in almost identical ID (immediate dominance) and LP (linear precedence) representations in Odeh (2004).31


Figure 9.2a Tectogrammatic representation of the analysis of a sentence (Smrž 2007b, 73).

(p. 224) The sentence can be interpreted as containing a predicate (PRED) with an agent (ACT), no expressed addressee (ADDR), and an object (PAT). This object consists of two coordinated topics (ID), both further specified by an attributive modifier (RSTR). The second modifier does not have an agent but does govern an object (PAT). A positional apposition (LOC) plays the role of sentence adverbial. The second part of Figure 9.2b lists the Arabic words and their English translations as well as the tags used for the analytical representation (column 1). Column 2 lists the values for some variables used in the analysis. The third column (3) gives the values (in upper case) and the representation of the variables (in lower case) that I use in my two-level approach to the description of the same target language.

Odeh (2004: Figure 9.3) presents the ID and the LP representation of a sentence with a finite verb form in first position. The abbreviations in Figure 9.3a are self-evident. Those in Figure 9.3b refer to the topological fields: sentence field (sf) and article field (artf).32

Notwithstanding the vagueness of the dismissal (Smrž 2007: 6), I would like to comment on it:


Figure 9.2b The analyzed sentence: ‘and in the section on literature, the magazine presented the issue on the Arabic language and the dangers that threaten it.’ (Smrž 2007b, 72–73).

The outline of formal grammar (Ditters, 2001), for instance, works with grammatical categories like number, gender, humanness, definiteness, but one cannot see which of the existing systems could provide for this information correctly, as they (p. 225) misinterpret some morphs for bearing a category, and underdetermine lexical morphemes in general as to their intrinsic morphological functions.33

This is a correct remark, as far as Ditters (2001) is concerned. I am working with a description in terms of grammatical functions and categories, final lexical entries, dependency relations, and, additionally, an opening toward a description of semantic features as well.34

Grammatical in this context involves, as said earlier, the phonological, morphological, structural, and lexical modules for language description, with rudimentary extensions to semantics, stylistics, and pragmatics. It is necessary to remain understandable for and verifiable by colleagues. Serious semantic extensions await further, computationally more coordinated research under supervision of the linguistic twin part of this kind of research. Let us continue with some words about the outline of the formal grammar.


Figure 9.3a Immediate dominance (ID) representation.


Figure 9.3b Linear precedence (LP) representation.

(p. 226)

Computational means can be used to test a hypothesis about the linguistic structure of MSA sentences35 in an efficient but linguistically understandable wording of nonterminals and terminals. The approach applies only a context-free grammar formalism (with room for some additional context-free layers in the second level of description for semantics), while taking into account the availability of compilers for one, two, or more levels of context-free attribute grammar formalisms. Until now this approach has proved to be promising.

9.3.2 Linguistic and Formal Concepts about Language Description in the Arabic Grammatical Tradition

Immediate constituency (IC) has dominated descriptive linguistics for a long time. Dependency grammar (DG) concepts were already (according to Carter 1973, 1980; Owens 1988) familiar to Arab and Arabic grammarians and became a welcome, and needed, addition to an implementable descriptive power for natural languages in general and MSA in particular. Moreover, one can always explore other valuable suggestions.

(p. 227) The basis for the description of parts of speech, functions, and categories in phrasal categories in MSA can be found in the Kitāb of Sībawayhi (d. 798).36 Carter (1973: 146)37 calls it “a type of structuralist analysis unknown to the West until the 20th century.” He ends his abstract with:

Utterances are analysed not into eight Greek-style “parts” but into more than seventy function classes. Each function is normally realized as a binary unit containing one active ‘operator’ (the speaker himself or an element of his utterance) and one passive component operated on (not “governed”) by the active member of the unit. Because every utterance is reduced to binary units, Sībawayhi’s method is remarkably similar to immediate constituent Analysis, with which it shares both common techniques and inadequacies, as is shown.

As a first example of such a class of functions, Carter (1973: 151) presents the triad of “grammatical effect” (ʿamal), comprising a “grammatically affecting” (ʿāmil) and a “grammatically affected” (maʿmūl) component. Similar triads (possibly considered as dependency rules) can be drafted from the other function classes listed. Moreover, the function classes could be arranged in subcategories, for example, accounting for relationships between constituents, sentence types, or word formation. Other function classes deal with phonology and morphology, have discourse functions, or are related to stylistics. Finally, on the basis of Sībawayhi’s comprehensive (exhaustive?) description of Arabic syntax, the function classes could easily be extended with more “dependency rules” of this kind, for example, with triads accounting for transitivity, or the subject–object–predicate relations.
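Such triads can be given a simple machine-readable form, as in this illustrative sketch. The rule inventory below is a toy sample of my own (though each rule reflects a standard fact of Arabic grammar), not Sībawayhi's full list of function classes.

```python
# Encoding the triad of grammatical effect as a dependency-like rule:
# an operator ('aamil) acting on an operand (ma'muul) yields an
# effect ('amal).

from collections import namedtuple

Triad = namedtuple("Triad", ["aamil", "maamuul", "amal"])

rules = [
    # a transitive verb operates on its object, assigning accusative
    Triad(aamil="verb", maamuul="object NP", amal="nasb (accusative)"),
    # the particle 'inna operates on the topic, assigning accusative
    Triad(aamil="'inna", maamuul="topic NP", amal="nasb (accusative)"),
    # a preposition operates on its complement, assigning genitive
    Triad(aamil="preposition", maamuul="complement NP", amal="jarr (genitive)"),
]

def effects_of(operator):
    """List the (operand, effect) pairs licensed by a given operator."""
    return [(t.maamuul, t.amal) for t in rules if t.aamil == operator]

print(effects_of("preposition"))  # [('complement NP', 'jarr (genitive)')]
```

The point of the exercise is that each triad is already a head–dependent–relation triple, which is exactly the shape of a dependency-grammar rule.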

Owens’s (1988) historical overview, with an apparently DG-related perspective, is broader. In section 2, all the relevant issues are discussed: constituency (IC) in Arabic theory (§2.9); dependency (DG) (§2.9.2); and dependency in Sībawayhi (§2.9.3). There are also chapters about markedness in Arabic theory (§8) and about syntax, semantics, and pragmatics (§9).

9.3.3 Linguistic Description of MSA for Analysis Purposes38

An ideal compromise between IC and DG seems to me to be a context-free grammar description of MSA in IC terms, accounting for the horizontal sequential order, enriched with a second context-free grammar level attached to the nonterminals of the first level (Appendix 9.2), accounting for relationships and dependencies (DG) in the vertical relational order between elements of a constituent or between constituents of one of the two sentence types (nominal and verbal). A third context-free grammar layer, describing only (p. 228) semantic extensions and properties, proved (for the moment) to be locatable within the two-layer context-free grammar frame.39
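The layering just described can be sketched informally in code. The following minimal illustration (in Python, with invented names; it is not the author's implementation) keeps the first, IC-style layer for constituent order separate from a second, DG-style layer of head–dependent relations over the same elements.

```python
# Sketch of the two layers; all names are invented for illustration.

# First layer (IC): phrase structure recording the horizontal, linear order.
ic_tree = ("Sn",
           ("NP", "muḥammad"),   # topic slot
           ("VP", "kataba"))     # comment slot, verbal

# Second layer (DG): relations in the vertical, relational order.
dg_links = [("kataba", "muḥammad", "topic")]  # (head, dependent, relation)

def linear_order(tree):
    """Read the surface word order off the first (IC) layer."""
    return [leaf for _, leaf in tree[1:]]

def heads(links):
    """Collect the governing elements of the second (DG) layer."""
    return {h for h, _, _ in links}
```

Keeping the layers separate mirrors the compromise described in the text: linear order and dependency relations can be stated, tested, and refined independently.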

Since MSA has two semantically, and therefore also syntactically, different sentence types, for both types one can distinguish categorial and functional nonterminals, categorial and functional terminals, lexical terminals, and dependencies and relationships between elements of a constituent and between constituents at sentence level. The syntactic consequences of a semantic property of a lexical entry are, of course, important for the analysis and generation of text data in the language concerned.40

The syntax of the MSA sentence types Sn and Sv is described by means of alternating layers of categories, functions, and categories until terminals (here: lexical entries) have been reached. In the Sn, we distinguish two obligatory slots: a topic and a comment. The filler of the topic function characteristically belongs to the class of Ns (here: head) or to the category of NPs (optionally also containing modifiers of the head). In the Sv, we are dealing with a single obligatory slot, a predicate. The filler of the predicate function typically belongs to the class of Vs (head) and is usually realized as a VP (optionally also containing modifiers of the head). The comment function in the Sn, as well as optional slots in both the Sn and the Sv, may be filled with entries of the different word classes or with different phrasal categories.

In line with the first words of the Kitāb of Sībawayhi, as parts of speech we distinguish elements of the open word classes nouns (N) and verbs (V) and the closed class of particles (P).41 Elements of these classes realize the head function in phrasal constituents such as noun phrases (NP), verb phrases (VP), and particle phrases (PP). We also distinguish properties of elements of the word classes (N, V, and P), including morphological, syntactic, and semantic features (valency indications), as well as some pragmatic ones.42 The following section introduces a simple sample implementation.

9.3.4 The Formal Description of MSA

For an adequate, coherent, and consistent description of MSA, I examined different linguistic theories and models (Ditters 1992) for the best products implementable in a processing environment for analysis purposes. For example, generalizations in the form of transformations in a transformational generative (TG) framework are too powerful a tool for an overall linguistic description: a simple machine like a computer may fail to decide in finite time whether or not a certain structure is described by the formal implementation of that linguistic description. I therefore opted for a different formalism (AGFL)43 to implement a description (Ditters 2001, 2007, 2011) in terms of a combination of IC and DG.

(p. 229) In contrast to a programming language (dynamic, functional, or relational in nature), a formalism (Ditters 1992: 134) serves exclusively for the static, nondeterministic, and declarative description of structures such as programming languages (e.g., Algol-68; Affix, Attribute, Feature, and Logic Grammars), and is also suited for the description of natural languages (e.g., Arabic, Dutch, English, Hebrew, Latin, Spanish). The objective is to test such a hypothetical description of, in our case, MSA syntax structure on real data, for example, an Arabic text corpus. It is the machine-readable text data that determine whether a match occurs or not. Once tested, corrected, and refined, the hypothesis becomes a theory, at least for the language structure represented in the test bed, and in turn a new hypothesis to be tested on other Arabic text corpora. Briefly, AGFL is an interwoven two-level context-free grammar formalism44 with almost context-sensitive properties.45

On the scale of the Chomskyan hierarchy of grammars (Levelt 1974), context-free grammars are rather unproblematic for the description and automated testing of natural language descriptions. As a matter of fact, Chomsky qualified the context-free grammar as an inadequate descriptive tool for natural languages. However, he never showed, as far as I know, any interest in a combination of two (or even more) context-free grammars (with an almost context-sensitive descriptive power),46 enough to describe, for example, most of Standard Arabic, including at least part of its semantic richness. Furthermore, as far as I know, he never tested his linguistic hypotheses and theoretical models practically by computational means.

A rule in a context-free grammar rewrites a single nonterminal on the left-hand side into one or more nonterminals or lexical entries on the right-hand side. AGFL, the successor of EAG (Appendix 9.1), is a formalism for the description of programming or natural languages in which large context-free grammars can be described in a compact way. Along with attribute grammars47 and DCGs, AGFLs belong to the family of two-level grammars: the first, context-free level, which accounts for the description of the sequential word order of surface natural language elements, is augmented with set-valued features expressing agreement between constituents and between elements of a constituent, as well as linguistic properties (including semantic features). AGFL is implemented in CDL3 and C.48
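The mechanism of a context-free rule augmented with set-valued features can be imitated in a few lines of Python. This is a sketch under stated assumptions: the toy lexicon, feature domain, and function names are all invented and do not reflect AGFL syntax.

```python
# Toy two-level rule system; lexicon and names are invented,
# and the notation is not AGFL syntax.

GENDER = {"fem", "masc"}  # a finite feature domain, as in a metarule

# word -> (category, set of possible gender values)
lexicon = {
    "bint":    ("noun", {"fem"}),
    "kataba":  ("verb", {"masc"}),
    "katabat": ("verb", {"fem"}),
}

def unify(a, b):
    """Intersect two feature-value sets; an empty result means failure."""
    return a & b

def topic_comment_agrees(topic, predicate):
    """A rule applies only where the feature sets of its parts intersect."""
    _, g1 = lexicon[topic]
    _, g2 = lexicon[predicate]
    return bool(unify(g1, g2))
```

Set intersection is the key move: a first-level rule combines constituents, while the set-valued second level silently filters out combinations whose features cannot agree.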

Notational AGFL conventions include the rewrite symbol (:), the marking of alternatives (;), the separation of sequence members (,), the end-of-rule marker (.), and the layout of nonterminals and terminals of the first level and of the nonterminals and final values of variables of the second level in lowercase and uppercase representation, respectively. (p. 230) Besides that, there is no longer any capacity problem for the storage of electronic data. Therefore, the names and terminal values chosen for the elements of the first and second levels of description of, for example, MSA may be as linguistically recognizable as one prefers.

We use four types of rules within the AGFL formalism: the so-called hyperrules, metarules, predicate rules, and lexical rules:49

  • Hyperrules formally describe the occurrence of elements of word classes, in single or phrasal form. Variation in the sequence of those elements is dealt with by the formalism.

  • Metarules restrict the nonterminals of the second level of description to a finite set of terminal values.

  • Predicate rules describe and, if needed, condition relationships and dependencies between phrasal constituents or between elements of a constituent.

  • Lexical rules describe the final or terminal values of the first level of description, where possible with semantic features and some collocational information (including additional remarks about nonregular and unexpected occurrences in compositional semantics).
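Under the same caveat (plain Python stand-ins with invented names, not AGFL itself), the interplay of the four rule types might be mimicked like this: metarules become finite value sets, lexical rules become lexicon entries, predicate rules become boolean conditions, and a much-simplified hyperrule consults all three.

```python
# Hypothetical Python stand-ins for the four AGFL rule types; not AGFL.

# Metarules: finite domains for second-level nonterminals.
DEFNESS = {"def", "indef"}
HEADREAL = {"com", "pers", "prop"}

# Lexical rules: terminal values with their features.
lexicon = {
    "raǧul":    {"headreal": "com",  "defness": "indef"},
    "muḥammad": {"headreal": "prop", "defness": "def"},
}

# Predicate rules: conditions on feature values (cf. "DEF is" below in
# the sample grammar): common nouns may be def or indef; pers/prop
# heads must be def.
def def_is(headreal, defness):
    if headreal == "com":
        return defness in DEFNESS
    return defness == "def"

# Hyperrule (much simplified): an np is a head whose features
# satisfy the predicate rule.
def np(word):
    entry = lexicon[word]
    return def_is(entry["headreal"], entry["defness"])
```

The division of labor is the point: the hyperrule stays compact because the metarule domains and predicate conditions carry the agreement bookkeeping.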

However, as is well known, in natural languages the real meaning of the linguistic data concerned goes beyond the sum of the meanings of the individual elements.50

In Figure 9.4, Jaszczolt (2005) illustrates the process of utterance interpretation within the compositional theory of default semantics. It may be clear that the meaning of an utterance does not equal the sum of the meanings of its constituents.


Figure 9.4 Utterance interpretation in default semantics (Jaszczolt 2005, 73).

The following presents a sample grammar,51 written in the two-level context-free rewrite formalism AGFL,52 for Modern Standard Arabic:53 (p. 231)

  • GRAMMAR sentence.

  • ROOT sentence(topic).

  • # 1 Meta rules

  • # These define the finite domain of values for non-terminals of the second level of description.

  • # This second level enables accounting for relationships and dependencies.

  • CASE::acc|gen|nom|invar.

  • DECLEN::defec|dipt|CASE.

  • DEFNESS::def|indef.

  • GENDER::fem|masc.

  • HEADREAL::com|pers|prop.

  • MODE::nominal|verbal.

  • MOOD::imper|indic|juss|subj.

  • NUMBER::coll|dual|PLUR|sing.

  • ORDER::topic|elliptic_topic.

  • PERSON::first|second|third.

  • PLUR::explu|inplu.

  • TENSE::perfect|MOOD.

  • TYPES::direc|finalintr|place|timeprep.

  • VOICE::active|passive.

# 2 Hyper rules

# They describe functions and categories of non-terminals at the first level of description,

(p. 232) # until lexical values have been reached. This level enables accounting, in an efficient way,

# for the generalization of word order and sentence structure.54

sentence(topic):

  •  topic(GENDER,NUMBER),

  •   topic comp(GENDER,NUMBER).

topic(GENDER,NUMBER):

  •  nounphrase(def,GENDER,NUMBER,PERSON,nom|invar);

  •  prep(finalintr),

  •   np(HEADREAL,def,GENDER,NUMBER,PERSON,gen);

  •  bound prep(finalintr) +

  •   np(HEADREAL,def,GENDER,NUMBER,PERSON,gen).

topic comp(GENDER,NUMBER):

  •  predicate(MODE,DEFNESS,third,GENDER,NUMBER);

  •  np(HEADREAL,DEFNESS,GENDER,NUMBER,PERSON,nom);

  •  adjp(DEFNESS,GENDER,NUMBER,CASE);

  • ap;

  • pp.

nounphrase(def,GENDER,NUMBER,PERSON,nom|invar):

  •  np(HEADREAL,def,GENDER,NUMBER,PERSON, nom|invar).

np(HEADREAL,def,GENDER,NUMBER,PERSON,CASE):

  •  head(HEADREAL,DEFNESS,GENDER,NUMBER,PERSON,CASE), DEF is(HEADREAL,DEFNESS).

predicate(verbal,DEFNESS,third,GENDER,NUMBER):

  •  vp(TENSE,PERSON,GENDER,NUMBER).

predicate(nominal,DEFNESS,PERSON,GENDER,NUMBER):

  •  np(HEADREAL,DEFNESS,GENDER,NUMBER,PERSON,nom),

  •  headreal is(HEADREAL).

head(HEADREAL,DEFNESS,GENDER,NUMBER,PERSON,CASE):

  •  noun(DECLEN,GENDER,NUMBER).

noun(DECLEN,GENDER,NUMBER):

  •  common noun(DECLEN,GENDER,NUMBER);

  •  pers pronoun(GENDER,NUMBER,PERSON,CASE);

  •  proper noun(DECLEN,DEFNESS,GENDER,NUMBER).

vp(TENSE,PERSON,GENDER,NUMBER):

  •  verb(TENSE,VOICE,PERSON,GENDER,NUMBER).

# 3 Predicate rules

# They describe, or even determine, relations and dependencies between values of elements

(p. 233) # of the second level of description by means of the conditioning of specific values.

# Rules of this type are sometimes called “empty rules.”

# DEF is(HEADREAL,DEFNESS).

DEF is(com,DEFNESS):.

DEF is(pers,def):.

DEF is(prop,def):.

# Headreal is(HEADREAL).

headreal is(com):.

headreal is(pers):.

headreal is(prop):.

# 4 Lexical rules

# They describe the final or lexical value of entries in the lexicon.

# Adjp(DEFNESS,GENDER,NUMBER,CASE) and ap.

“adjp” adjp(DEFNESS,GENDER,NUMBER,CASE)

“ap”    ap

# Common noun(norm,masc,sing), including pronouns and proper nouns.

“raǧul”

common noun(norm,masc,sing)

“riǧāl”

common noun(norm,masc,inplu)

“bint”

common noun(norm,fem,sing)

“banāt”

common noun(norm,fem,inplu)

# Pers pronoun(fem|masc,sing,first,nom).

“ʾanā”

pers pronoun(fem|masc,sing,first,nom)

“naḥnu”

pers pronoun(fem|masc,inplu,first,nom)

“ī”

pers pronoun(fem|masc,sing,first,gen|acc)

“nī”

pers pronoun(fem|masc,sing,first,gen|acc)

“nā”

pers pronoun(fem|masc,inplu,first,gen|acc)

# Proper noun(dipt,def,masc,sing).

“ʾaḥmad”

proper noun(dipt,def,masc,sing)

“muḥammad”

proper noun(norm,def,masc,sing)

“fātima”

proper noun(norm,def,fem,sing)

# Prep(TYPES).

“la”

bound prep(finalintr)

“la”

prep(finalintr)

“pp”

pp

# Verb(TENSE,VOICE,PERSON,GENDER,NUMBER).

“kataba”

verb(perfect,active,third,masc,sing)

“kutiba”

verb(perfect,passive,third,masc,sing)

“yaktubu”

verb(indic,active,third,masc,sing)

“yuktabu”

verb(indic,passive,third,masc,sing)

# Remark 55
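To give an impression of what the sample grammar licenses, a rough recognizer for the Sn fragment can be hand-coded. This is a sketch only: the dictionary entries restate two lexical rules from the sample above, the flat feature check stands in for the second-level agreement machinery, and none of this is the AGFL parser.

```python
# Hand-coded recognizer for the Sn fragment: sentence = topic, topic comp,
# with a definite nominative noun as topic and an agreeing third-person
# verbal predicate as comment. Entries restate two sample lexical rules.

lexicon = {
    "muḥammad": {"cat": "noun", "defness": "def", "gender": "masc",
                 "number": "sing", "case": "nom"},
    "kataba":   {"cat": "verb", "tense": "perfect", "voice": "active",
                 "person": "third", "gender": "masc", "number": "sing"},
}

def sentence(words):
    """Accept a two-word nominal sentence (topic + verbal comment)."""
    if len(words) != 2:
        return False
    t, c = (lexicon.get(w) for w in words)
    if t is None or c is None:
        return False
    topic_ok = (t["cat"] == "noun" and t["defness"] == "def"
                and t["case"] == "nom")
    pred_ok = c["cat"] == "verb" and c["person"] == "third"
    agree = (t["gender"], t["number"]) == (c["gender"], c["number"])
    return topic_ok and pred_ok and agree
```

Here the word order muḥammad kataba is accepted because the topic is a definite nominative noun and the verbal predicate agrees in gender and number; reversing the two words fails the topic condition.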

9.4 Perspectives for Further Linguistic and Formal Research on Arabic Syntax

In a joint contribution to the Nemlar conference (Ditters and Koster 2004),56 we explored the potential of the existing approach to MSA syntax for other, socially equally relevant spin-offs of my description of Arabic for corpus-linguistic analysis purposes. First, I prefer to finish testing the current implementation hypothesis about MSA syntax on Arabic text data. Second, I should like to refine the verified theory by means of a formal description of MSA syntax for generative purposes. A dependency-grammar-worded implementation could be extracted from such research. Finally, I should like to build a subset of DG rules, implementable within the AGFL processing environment, for research on aboutness57 in Arabic text data.

Appendix: Frequently used abbreviations

(p. 234) (p. 235) (p. 236)

Symbol	Meaning

A	Aboutness
AB	Arabic Braille
ABP	Arabic proposition bank
AC	Automatic correction
AGFL	Affix grammar over finite lattices
AI	Artificial intelligence
AS	Answering system
ASL	Arabic Sign Language
ASR	Automatic speech recognition
BP	Base phrase
CALL	Computer-assisted language learning
CFG	Context-free grammar
CL	Computational linguistics
CoL	Corpus linguistics
DM	Data mining
DR	Document routing
EAG	Extended affix grammar
EBL	Example-based learning (cf. MBL)
EE	Event extraction
FE	Factoid extraction
FGD	Function generative dependency
FSA	Finite state automaton
FSM	Finite state machine
FST	Finite state transducer
HR	Handwriting recognition
IBL	Instance-based learning (cf. MBL)
IC	Immediate constituency
ID	Immediate dominancy
IR	Information retrieval
LL	Lazy learning (cf. MBL)
LM	Language modeling
LP	Linear precedence
MBC	Morphological behavior class
MBL	Memory-based learning
MFH	Morphological form harmony
MLA	Machine learning approach
MSA	Modern Standard Arabic
MT	Machine translation
(S)MT	(Statistical) MT
NER	Named entity recognition
NET	Named entity translation
NL	Natural language
NLP	Natural language processing
OCR	Optical character recognition
OT	Optimality theory
ON	Orthographic normalization
POS	Parts of speech
POS-T	POS tagging
QAS	Question answering system
RBA	Rule-based approach
SA	Statistical approach
SBA	Stem-based approach
SC	Spelling correction
SG	Speech generation
SLA	Supervised learning approach
SP	Signal processing
SP	Speech processing
SR	Speech recognition
STS	Speech-to-speech
STT	Speech-to-text
SVM	Support vector machines
TC	Text categorization
TDT	Topic detection and tagging
TG	Text generation
TM	Text mining
TP	Text processing
TTS	Text-to-speech
TTT	Text-to-text
ULA	Unsupervised learning approach

Selected Bibliography58

Al-Shalabi, Riyad, and Ghassan Kanaan. 2004. Constructing an automatic lexicon for Arabic language. International Journal of Computing and Information Sciences 2: 114–128.Find this resource:

Al-Sughaiyer, Imad A., and Ibrahim A. Al-Kharashi. 2004. Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology 55(3): 189–213.Find this resource:

Baker, Collin, Charles Fillmore, and John Lowe. 1998. The Berkeley FrameNet project. (COLING-ACL’98): Proceedings of the University of Montréal Conference, 86–90.Find this resource:

Beesley, Kenneth R. 1989. Computer analysis of Arabic morphology: A two-level approach with Detours. In Perspectives on Arabic linguistics III: Papers from the 3rd annual symposium on Arabic linguistics, ed. Bernard Comrie and Mushira Eid, 155–172. Amsterdam: John Benjamins.Find this resource:

——. 1996. Arabic finite-state morphological analysis and generation. In Proceedings of the 16th international conference on computational linguistics (COLING-96), Copenhagen, Denmark, 89–94. Copenhagen: Center for Sprogteknologi.Find this resource:

——. 2001. Finite-state morphological analysis and generation of Arabic at Xerox research: Status and plans in 2001. In Proceedings of the EACL workshop on language processing: Status and prospects. Toulouse, France, 1–8. Available at: www.xrce.xerox.com/content/download/20547/147632/file/

Bielick ý, Viktor, and Otakar Smrž. 2008. Building the valency lexicon of Arabic verbs. In Proceedings of the 6th international conference on language resources and evaluation (LREC), Marrakech, Morocco.Find this resource:

——. 2009. Enhancing the ElixirFM lexicon with verbal valency frames. In Proceedings of the 2nd international conference on Arabic language resources and tools, ed. Khalid Choukri and Bente Maegaard. Cairo: The MEDAR Consortium.Find this resource:

Black, William, Sabri Elkateb, Horacio Rodriguez, Musa Alkhalifa, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Introducing the Arabic WordNet project. In Proceedings of the 3rd international WordNet conference, Jeju Island, Korea.Find this resource:

Bohas, Georges. 1997. Matrices, étymons, racines: Éléments d’une theorie lexicographique du vocabulaire arabe. Leuven: Peeters.Find this resource:

(p. 237) Bohas, Georges, and Mihai Dat. 2008. Lexicon: Matrix and etymon model. In EALL vol. 3, ed. Kees Versteegh, Associate Editors: Mushira Eid, Alaa Elgibali, Manfred Woidich, Andrzej Zaborski, 45–52. Leiden: Brill.Find this resource:

Buckwalter, Timothy. 2002. Arabic morphological analyzer version 1.0. Philadelphia: Linguistic Data Consortium.Find this resource:

——. 2004. Issues in Arabic orthography and morphology analysis, in Proceedings of the workshop on computational approaches to Arabic script-based languages, (COLING 2004), Geneva, Switzerland. ed. M. Farghaly and K. Megerdoomian, 31–34. Stroudsburg: Association for Computational Linguistics.Find this resource:

——. 2010. The Buckwalter Arabic morphological analyzer, in ed. Farghaly, 85–101.Find this resource:

Bunt, Harry, John Carroll, and Giorgio Satta (eds.). 2004. New developments in parsing technology. Dordrecht: Kluwer Academic Publishers.Find this resource:

Bunt, Harry, Paola Merlo and Joakim Nivre. 2010. Trends in parsing technology. Springer: Dordrecht.Find this resource:

Carter, Michael G. 1973. An Arab grammarian of the eight century A.D.: A contribution to the history of linguistics. Journal of the American Oriental Society 93: 146–157.Find this resource:

——. 1980. Sibawayhi and modern linguistics. Histoire Épistémologique 2(1): 21–26.Find this resource:

Cavalli-Sforza, Violetta, Abdelhadi Soudi, and Teruko Mitamura. 2000. Arabic morphology generation using a concatenative strategy. In Proceedings of the 1st meeting of the North American Chapter of the Association for computational linguistics (NAACL 2000), Seattle, WA, 86–93.Find this resource:

Dada, Ali, and Aarne Ranta. 2006. Arabic resource grammar. In Proceedings of the Arabic language processing conference (JETALA), Rabat, Morocco: IERA.Find this resource:

Debusmann, Ralph. 2006. Extensible dependency grammar: A modular grammar formalism based on multigraph description. PhD diss., Saarland University.Find this resource:

Dévényi, Kinga, Tamás Iványi, and Avihai Shivtiel (eds.). 1993. Proceedings of the colloquium on Arabic lexicology and lexicography (C.A.L.L.): Budapest 1–7 September. Part one: Papers in European languages. Part two: Papers in Arabic. Budapest: Eötvös Loránd University Chair for Arabic Studies.Find this resource:

Diab, Mona T. Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of the 5th meeting of the North American chapter of the Association for computational linguistics/human language technologies conference (HLT-N/LICL04), Boston, MA, 149–152.Find this resource:

——. 2007. Towards an optimal POS tag set for modern standard Arabic processing. In Proceedings of recent advances in natural language processing (RANLP), Borovets, Bulgaria.Find this resource:

Diab, Mona, Alessandro Moschitti, and Daniele Pighin. 2008. Semantic role labeling systems for Arabic language using Kernel methods. In Proceedings of ACL-08: HLT, Columbus, Ohio, 798–806.Find this resource:

Diab, Mona, Musa Alkhalifa, Sabry ElKateb, Christiane Fellbaum, Aous Mansouri, and Martha Palmer. 2007. Arabic Semantic Labeling. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007 18), Prague, Czech Republic, 93–98.Find this resource:

Dichy, Joseph. 2005. Spécificateurs engendrés par les traits [±anim é], [±humain], [±concret] et structures d’arguments en arabe et en français. In De la mesure dans les termes. Actes du colloque en hommage à Philippe Thoiron, ed. Henri Bé joint and François Maniez, 151–181. Travaux Centre de Recherche en Terminologie et Traduction (CRTT), Lyon: Presses Universitaires de Lyon.Find this resource:

Ditters, Everhard. 1992. A formal approach to Arabic syntax: The noun phrase and the verb phrase. PhD diss., Nijmegen University.Find this resource:

——. 2001. A formal grammar for the description of sentence structure in modern standard Arabic. In EACL 2001 Workshop Proceedings on Arabic language processing: Status and prospects, Toulouse, France, 31–37.Find this resource:

(p. 238) ——. 2006. Computational linguistics. In EALL I, ed. Kees Versteegh et al., 511–518.Find this resource:

——. 2007. Featuring as a disambiguation tool in Arabic NLP, in Ditters and Motzki (eds.), 367–402.Find this resource:

——. 2011. A formal description of sentences in Modern Standard Arabic. In Studies in Semitic languages and linguistics, ed. T. Muraoka, A. Rubin, and C. Versteegh, 511–551. Leiden: Brill.Find this resource:

Ditters, Everhard, and Cornelis H. A. Koster. 2004. Transducing Arabic phrases into head-modifier (HM) pairs for Arabic information retrieval. In Proceedings of the NEMLAR 2004 international conference on Arabic language resources and tools, September 22–23, Cairo, 148–154.Find this resource:

Duchier, Denys and Ralph Debusmann. 2001. Topological dependency trees: A constraint-based account of linear precedence. In Proceedings of the Association for Computational Linguistics. Toulouse, France. 180–187. http://www.aclweb.org/anthology-new/P/P01/P01-1024.pdf

Elkateb, Sabri, William Black, Horacio Rodríguez, Musa Alkhalifa, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Building a WordNet for Arabic. In Proceedings of the 5th international conference on language resources and evaluation (LREC 2006), Genoa, Italy.Find this resource:

Farghaly, Ali, and Karine Megerdoomian (eds.). 2004. Proceedings of the workshop on computational approaches to Arabic script-based languages, (COLING 2004), Geneva, Switzerland. Stroudsburg: Association for Computational Linguistics.Find this resource:

—— (ed.). 2010. Arabic computational linguistics. Stanford: CSLI Publications.Find this resource:

Fellbaum, Christiane (ed.). 1998. WordNet: An electronic lexical database. Cambridge, MA: MIT Press.Find this resource:

Forsberg, Markus, and Aarne Ranta. 2004. Functional morphology. In Proceedings of the 9th ACM SIGPLAN international conference on functional programming, ICFP 2004, ACM Press, 213–223.Find this resource:

Habash, Nizar. 2004. Large scale lexeme based Arabic morphological generation. In Proceedings of traitement automatique des Langues Naturelles (TALN-04), Fez, Morocco, 271–276.Find this resource:

——. 2010. Introduction to Arabic natural language processing. San Raphael, CA: Morgan & Claypool Publishers.Find this resource:

Habash, Nizar, and Ryan Roth. 2009. CATIB: The Colombia Arabic treebank. In Proceedings of the ACL-IJCNLP 2009 conference Short Papers, Suntec, Singapore, 221–224.Find this resource:

Habash, Nizar, Owen Rambow, and George Kiraz. 2005. Morphological analysis and generation for Arabic dialects. In Proceedings of the Workshop on computational approaches to Semitic languages at 43rd meeting of the Association for computational linguistics (ACL’05), Ann Arbor, MI, 17–24.Find this resource:

Hafez, Ola M. 1991. Turn-taking in Egyptian Arabic: Spontaneous speech vs. drama dialogue. Journal of Pragmatics 15: 59–81.Find this resource:

Hassanein, Ahmed T.. 2008. Lexicography: Monolingual dictionaries. In EALL III, ed. Kees Versteegh et al., 37–45.Find this resource:

Hoogland, Jan. 2008. Lexicography: Bilingual dictionaries. In EALL III, ed. Kees Versteegh et al., 21–30.Find this resource:

Jaszczolt, Katarzyna M. 2005. Default semantics: Foundations of a compositional theory of acts of communication. Oxford: Oxford University Press.Find this resource:

Jurafsky, Daniel, and James H. Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Prentice Hall.Find this resource:

Kaplan, Ronald, and Martin Kay. 1981. Phonological rules and finite-state transducers. In Linguistic Society of America Meeting Handbook, 56th Annual Meeting. New York.Find this resource:

(p. 239) Kay, Martin. 1987. Noncatenative finite-state morphology. In Workshop on Arabic morphology. Stanford: Stanford University Press, 2–10.Find this resource:

Khoja, Shereen, Roger Garside, and Gerry Knowles. 2001. A tagset for the morphosyntactic tagging of Arabic. In Proceedings of corpus linguistics 2001. Lancaster, UK, 341–353.Find this resource:

Kiraz, George Anton. 1994. Multi-tape two-level morphology: A case study in Semitic non-linear morphology. In Proceedings of 15th international conference on computational linguistics (COLING-94), Kyoto, Japan, 180–186.Find this resource:

——. 1997. Compiling regular formalisms with rule features into finite-state automata. In ACL/EACL-97, Madrid, Spain, 329–336.Find this resource:

Kiraz, George Anton., 2000. Multi-tiered nonlinear morphology using multi-tape finite automata: A case study on Syriac and Arabic. Computational Linguistics 26: 77–105.Find this resource:

—— 2001. Computational nonlinear morphology with emphasis on Semitic languages. Studies in natural language processing. New York: CUP.Find this resource:

Kiraz, George Anton, and Bernd Mö bius. 1998. Multilingual syllabification using weighted finite-state transducers. In Proceedings of the 3rd ESCA workshop on speech synthesis, Jenolan Caves, Australia, 71–76.Find this resource:

Köprü, Selçuk, and Jude Miller. 2009. A unification based approach to the morphological analysis and generation of Arabic. In 3rd Workshop on computational approaches to Arabic script-based languages, Computational Approaches to Arabic Script-based Languages (CAASL3). Available at http://mt-archive.info/MTS-2009-Kopru.pdf.

Koskenniemi, Kimmo. 1983. Two-level morphology: A general computational model for word-form recognition and production. Publication 11. Helsinki: Department of General Linguistics, University of Helsinki.Find this resource:

Kulick, Seth, Ryan Gabbard, and Mitch Marcus. 2006. Parsing the Arabic treebank: Analysis and improvements. In Proceedings of the Treebanks and Linguistic Theories Conference. Prague, Czech Republic, 31–42.Find this resource:

Levelt, Willem J.M. 1974. Formal grammars in linguistics and psycholinguistics: An introduction to the theory of formal languages. 3 vols. Den Haag: Mouton.Find this resource:

Maamouri, Mohamed, Ann Bies, and Seth Kulick. 2009. Creating a methodology for large-scale correction of treebank annotation: The case of the Arabic treebank. In Proceedings of MEDAR international conference on Arabic language resources and tools. Cairo: Medlar.Find this resource:

Maamouri, Mohamed, Ann Bies, Timothy Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. Paper presented at the NEMLAR International Conference on Arabic Language Resources and Tools, September 22–23, Cairo.Find this resource:

Maamouri, Mohamed, and Ann Bies. 2010. The Penn Arabic treebank, in ed. Farghaly, 103–135.Find this resource:

Manning, Christopher D., and Hinrich Schütze. 20036 [1999]. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.Find this resource:

Marton, Yuval, Nizar Habash, and Owen Rambow. 2010. Improving Arabic dependency parsing with lexical and inflectional morphological features. In Proceedings of the NAACL HLT 2010 1st workshop on statistical parsing of morphologically-rich languages. Los Angeles, CA, 13–21.Find this resource:

McCarthy, John. 1981. A prosodic theory of non-concatenative morphology. Linguistic Inquiry 12: 373–418.Find this resource:

Mesfar, Slim. 2010. Towards a cascade of morpho-syntactic tools for Arabic natural language processing. In Computational linguistics and intelligent text processing: Proceedings of the 11th international conference CICLing 2010, Iaşi, Romania, ed. Alexander Gelbukh. 150–162. Berlin: Springer.Find this resource:

(p. 240) Mitchell, Terence F. 1985. Sociolinguistic and stylistic dimensions of the educated spoken Arabic of Egypt and the Levant. In Language standards and their codification: Process and application, ed. J. Douglas Woods, 42–57. Exeter: University of Exeter.Find this resource:

Mohammad, Mahmud D. 1985. Stylistic rules in classical Arabic and the levels of grammar. Studies in African Linguistics 9: 228–232.Find this resource:

Msellek, Abderrazaq. 2006. Kontrastive Fallstudie: Deutsch—Arabisch. HSK 25(2): 1287–1292.Find this resource:

Müller, Karin. 2002. Probabilistic context-free grammars for phonology. In Proceedings of ACL SIGPHON, Association for Computational Linguistics, PA, 70–80.Find this resource:

Odeh, Marwan. 2004. Topologische Dependenzgrammatik fürs Arabische. Abschlussbericht FR6.2-Informatik, Universität des Saarlandes.Find this resource:

Owens, Jonathan. 1988. The foundations of grammar: An introduction to medieval Arabic grammatical theory. Amsterdam: John Benjamins.Find this resource:

——. 2003. Valency-like concepts in the Arabic grammatical tradition. HSK 25(1): 26–32.Find this resource:

Palmer, Martha, Dan Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics 31: 71–106.Find this resource:

——, Olga Babko-Malaya, Ann Bies, Mona Diab, Mohamed Maamouri, Aous Mansouri, and Wajdi Zaghouani. 2008. A pilot Arabic propbank. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), ed. By Nicoletta Calzolari, 3467–3472. Marrakech, Morocco.Find this resource:

Prince, Alan, and Smolensky, Paul. 2004. Optimality theory: Constraint interaction in generative grammar. Oxford: Blackwell.Find this resource:

Ramsay, Alan, and Hanady Mansur. 2001. Arabic Morphology: A categorical approach. In EACL workshop proceedings on Arabic language processing: Status and prospects. Toulouse, France, 17–22.Find this resource:

Ratcliffe, Robert R. 1998. The “broken” plural problem in Arabic and comparative Semitic: Allomorphy and analogy in non-concatenative morphology. Amsterdam: John Benjamins.Find this resource:

Seidensticker, Tilman. 2008. Lexicography: Classical Arabic, In EALL III, ed. Kees Versteegh et al., 30–37.Find this resource:

Sībawayhi (d. 798). Al-Kitāb. 2 vols. Būlāq 1316 A.H. Reprint Baghdad: al-Muṯannān.d.Find this resource:

Smrž, Otakar. 2007. Functional Arabic morphology: Formal system and implementation. PhD diss., Charles University, Prague, Czech Republic.Find this resource:

Smrž, Otakar, and Jan Hajič. 2010. The other Arabic treebank: Prague dependencies and functions. In ed. Farghaly, 137–168.Find this resource:

Somekh, Sasson. 1979. The emergence of two sets of stylistic norms in the early literary translation into modern Arabic prose. Ha-Sifrut-Literature 8(28): 52–57.Find this resource:

——. 1981. The emergence of two sets of stylistic norms in the early literary translation into modern Arabic prose. Poetics Today 2: 193–200.Find this resource:

Soudi, Abdelhadi, Antal van den Bosch, and Günter Neumann (eds.). 2007. Arabic computational morphology: Knowledge-based and empirical methods. Dortrecht: Springer.Find this resource:

Vossen, Piek. 1998. EuroWordNet: A multilingual database with lexical semantic networks. Dordrecht: Kluwer.

Winograd, Terry. 1983. Language as a cognitive process. Vol. 1: Syntax. Reading, MA: Addison-Wesley.

Włodarczak, Marcin. 2012. Ranked multidimensional dialogue act annotation. In New directions in logic, language and computation, ed. Daniel Lassiter and Marija Slavkovik, Lecture Notes in Computer Science, 67–77. Berlin: Springer.

Notes:

(1) The Appendix at the end of the chapter lists some abbreviations and technical terms frequently used in this field in general and in the course of this contribution, together with paraphrases of some of the terms employed here.

(2) I would have liked to see some more attention for equally socially relevant matters such as pure linguistics.

(3) Cf. Bouillon et al. (2007).

(4) On the program of the annual ALS symposium on Arabic linguistics (2011), more than half of the presentations (17 of 31) dealt with Arabic colloquials, the diglossia situation, and the application of general linguistic theories for the description of Arabic colloquials. Beginning in 1990, this trend can be found in all the issues of Perspectives on Arabic Linguistics. For a while, the Moroccan Linguistic Society had a similar development.

(5) If possible, together with a deaf window as well as a form of simultaneous (Arabic) Braille output.

(6) See also Ali (2003).

(7) The end of this chapter offers suggestions for further reading.

(8) For the main topics of this chapter, see Chenfour (2006) and Ditters (2006). Subtopics and references will be referred to in the body of the text.

(9) The source here is Wikipedia; see also Ditters (2006).

(10) For a paraphrase of descriptive terms, see Appendix 9.2.

(11) There is a difference in interest between the computational subword (phonetics) and the computational word (phonology) level and beyond. This chapter is concerned with remedial and commercial applications.

(12) See also Coleman and Pierrehumbert (1997) on stochastic phonological grammars and acceptability.

(13) Cf. Farghaly (2010: chapters 3 and 4) and Habash (2010: chapter 8 and Appendix).

(14) Here in contrast with statically typed. Computer science presently distinguishes four main families of programming languages: imperative, functional, logic, and object oriented. For our purposes this information will suffice.

(15) Wikipedia paraphrases treebank as a text corpus in which each sentence has been parsed, that is, annotated with syntactic structure, which is commonly represented as a tree.

(16) I appreciate Darwish’s (2002) reference to Sībawayhi in his account of “a one-day construction of a shallow Arabic morphological analyzer.”

(17) Wikipedia paraphrases: Haskell is a standardized, general-purpose purely functional programming language, with nonstrict semantics and strong static typing.

(18) Quoting Smrž and Hajič (2010, 140): “these systems misinterpret some morphs for bearing a category, and underspecify lexical morphemes in general as to their intrinsic morphological functions.” I return to this point while discussing the automated linguistic description of Arabic by means of programming languages or computational formalisms.

(19) A programming language describes a dynamic and deterministic process. It is dynamic because there is a beginning and a series of steps to be taken that lead inevitably to an end. It is deterministic because the computer is told explicitly, from the very beginning, how to start, where to find what it needs for the execution of the program, what to do with it, what the next step will be, and when its activity will come to an end. A formalism is also an artificial, formal language, but one designed as a medium for the definition or description of static structures. Such an approach is declarative because the formal grammar only defines and describes structures. There is a beginning, a series of rules, and an end, but there is no logical link between beginning and end. It is not the computer but the machine-readable data that determine whether a match occurs; that is, the parser depends on the input string for deciding whether or not its structure can be recognized as defined or described in the formal grammar.
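This contrast can be made concrete with a toy sketch of my own (not from the chapter): the grammar below is purely declarative data, and a generic recognizer, driven by the input string alone, decides whether the defined structure matches. All rules and words are invented for illustration.

```python
from itertools import product

# A declarative CFG in Chomsky normal form: the rules only *define*
# structures; nothing here prescribes a processing order.
GRAMMAR = {            # (left child, right child) -> parent nonterminal
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {"the": "Det", "cat": "N", "mat": "N", "saw": "V"}

def recognize(words, start="S"):
    """CYK recognition: the input data decide whether a match occurs."""
    n = len(words)
    # table[i][j] = set of nonterminals deriving words[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i].add(LEXICON[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):          # try every split point
                for a, b in product(table[i][k], table[k + 1][j]):
                    if (a, b) in GRAMMAR:
                        table[i][j].add(GRAMMAR[(a, b)])
    return start in table[0][n - 1]
```

The same grammar could be handed to any other generic parsing algorithm unchanged, which is precisely what distinguishes a declarative formalism from a program.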

(20) For Arabic, Sībawayhi (d. 798, kitāb) described nouns (N), verbs (V) and non-noun non-verb particles (-N-V) as the basic word categories. He also hinted at greater constituents with an element of one of those categories as head, but the labeling into NPs, VPs, and PaPs here is mine.

(21) See, for example, also the objectives of The Arabic Language Computing Research Group (ALCRG), King Saud University (http://ccis.ksu.edu.sa/ar/en/cs/research/ALCRG).

(22) Carter (2007: 27) discusses an earlier form of pragmatics in Larcher’s approach of ʾinšāʾ (ibid., 28). See also Larcher (1990).

(23) Cf. Bangalore et al. (2003), Bröker (2003), Fillmore (2003), Hajičovà and Sgall (2003), Hellwig (2003), Hudson (2003), Kahane (2003), Maxwell (2003), Mel’čuk (2003), Oliva (2003), Starosta (2003), Busse (2006), Hellwig (2006), Horacek (2006), and Schubert (2006).

(24) See also Jurafsky and Martin (2009, section 3) and Eijck and Unger (2010).

(25) The references are slightly dated: Al-Najjar (1984), Bahloul (1994), Blohm (1989), DeMiller (1988), Eisele (1988), Gully (1992), Justice (1981, 1987), Mohammad (1983), Ojeda (1992), and Zabbal (2002).

(26) Most references are a bit dated but concern colloquial varieties as well as Standard Arabic: Abu Ghazaleh (1983), Abu Libdeh (1991), Alfalahi (1981), Al-Jubouri (1984), Al-Shabab (1987), Al-Tarouti (1992), Bar-Lev (1986), Daoud (1991), Fakhri (1995, 1998, 2002), Fareh (1988), Ghobrial (1993), Hatim (1987, 1989), Johnstone (1990, 1991), Khalil (1985), Koch (1981), Mughazy (2003), Russell (1977), Ryding (1992), Salib (1979), and Sawaie (1980).

(27) Cf. Dahl and Talmoudi (1979), Ghobrial (1993), Mahmoud (2008), Moutaouakil (1987, 1989), Mughazy (2008), and Suleiman (1989).

(28) Accounting for all the aforementioned branches of syntax, including an opening to a semantic description of language properties.

(29) A similar linguistic (rather than heuristic, hence pragmatic) description for generation purposes is not yet within reach.

(30) The results of his formal system and the implementation of functional Arabic morphology (Smrž 2007b: 69) are presented in the form of unambiguous dependency trees.

(31) Here is not the most appropriate place to initiate a discussion about the processing of a nondeterministic formal description of a natural language, for instance, in CFG terms, and the processing of a deterministic (each programming language) formal description of a natural language, whether or not the results are presented in IC, DG, HPSG, or any other form.

(32) Duchier and Debusmann (2001) describe a new framework for dependency grammar with a modular decomposition of immediate dependency and linear precedence. Their approach distinguishes two orthogonal yet mutually constraining structures: a syntactic dependency tree; and a topological dependency tree. The former is nonprojective and even nonordered, while the latter is projective and partially ordered.

(33) It is, unmistakably, my fault not to have been clear enough in explaining the basic principles of my approach to language description: the analysis of a linguistic unit in terms of alternating layers of functions and categories until final (lexical) entries have been reached.

(34) For an example, see §2.3.

(35) Maybe, for Arabic, we are not yet ready to think in terms of a linguistically grounded and implementable description of paragraph, section, text, and volume syntax, generally applicable to MSA or colloquial varieties of Arabic.

(36) For an overview of Arabic literature in general see, among others, Sezgin (1967–2000), in particular volumes 8 (lexicology) and 9 (syntax).

(37) Cf. in this perspective also Baalbaki (1979).

(38) See Appendix for a paraphrase of technical terms used.

(39) See §2.3.

(40) I work exclusively from an analysis perspective on authentic MSA text data.

(41) Superscripts to a nonterminal symbol, such as S, N, V, P, NP, VP, and PP point to categorial distinctions at the first level of description. Subscripts point to categorial, functional, or semantic characteristics at the second level of description.

(42) See §2.3.

(44) The main advantage of a formal description over a formalism, a computer, or a programming language is that a linguist is able, after a simple introduction, to read, understand, and comment on your description without first becoming a mathematical, logical, or computational linguist or scientist.

(45) It is important to repeat that I use the formalism exclusively for analysis purposes. This means that the description may really be “liberal.” For synthesis objectives it is quite a different story.

(46) 1 context-free grammar + 1 context-free grammar ≡ 2 context-free grammars ≠ 1 context-sensitive grammar.

(47) Koster (1971, 1991).

(48) CDL refers to Compiler Description Language. CDL and C are, both, imperative programming languages, but in CDL3 the notational conventions are more suited for AGFL-formatted natural language descriptions to be tested.

(49) For detailed information about probabilistic and frequency accounting properties of the AGFL formalism, I refer to the aforementioned AGFL site.

(50) For more details see the aforementioned references.

(51) A formal two-level context-free rewrite grammar means that one describes one and only one nonterminal on the left-hand side and rewrites it into one or more nonterminal or terminal values from the lexicon on the right-hand side. However, this action takes place at both the first and the second level of description. The second level of description is enclosed between parentheses (()), attached, where this has been considered desirable, to nonterminals of the first level.

(52) AGFL stands for affix grammar over finite lattices. Cf. www.agfl.cs.ru.nl; Koster (1971, 1991).

(53) Notational conventions: a hash (#) introduces a comment line; a double colon (::) rewrites the left-hand side of a nonterminal of the second level of description into one or more final values or another nonterminal of the second level of description; a vertical bar (|) separates alternatives on the right-hand side of the second level of description; a colon (:) rewrites the nonterminal at the left-hand side of the first level of description into one or more nonterminals or terminal values; a comma (,) separates successive elements on the right-hand side; a semicolon (;) separates alternatives on the right-hand side; an addition sign (+) tells the machine to ignore all spaces; a dot (.) ends each rule, with the exception of lexical rules. Nonterminals of the second level are written in uppercase. Terminal values of the second level are written in lowercase. Terminal values of the first level are enclosed in double quotes (“”).
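To make these conventions concrete, here is a minimal AGFL-style fragment of my own; the nonterminals, affix values, and transliterated word forms are invented for illustration and are not taken from the chapter’s grammar:

```
# a second-level domain of case values
CASE:: nom | gen | acc.

# a first-level rule: a simplified noun phrase, either determiner plus
# noun or noun plus adjective, agreeing in CASE
NP(CASE): det, + noun(CASE);
          noun(CASE), adjective(CASE).

# lexical rules take no final dot
noun(nom): "kitaabu"
noun(gen): "kitaabi"
```

The second-level affix CASE, copied unchanged across the members of NP, is what enforces agreement without leaving the context-free, two-level format.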

(54) In reality, the listing of meta-, hyper-, predicate or empty, and lexical rules is slightly longer.

(55) In this sample grammar, some alternatives of metarule rewritings of nonterminals at the second level of description occur for elucidation purposes only.

(56) At the AGFL site mentioned as: AP4IR (Arabic [Dependency] Pairs for Information Retrieval).

(57) Words from the open categories (nouns, verbs, and to a lesser extent adjectives and modifiers) carry the aboutness of a text; the others are in fact stopwords. Similarly, only triples whose head or modifier is from an open category carry the aboutness of a text, and any other triples can be discarded as stopword triples (Koster, 2011).
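The filtering idea can be sketched as follows; this is my own illustration of the aboutness principle attributed to Koster (2011), and the tag names and triple layout are hypothetical, not the chapter’s actual representation:

```python
# Open word classes assumed to carry aboutness; everything else is
# treated as stopword material (hypothetical tag inventory).
OPEN_CLASSES = {"noun", "verb", "adj", "adv"}

def aboutness(triples):
    """Keep only aboutness-bearing dependency triples.

    Each triple has the (assumed) shape
    ((head, head_pos), relation, (modifier, modifier_pos)).
    A triple is kept if its head or its modifier belongs to an open
    class; all other triples are discarded as stopword triples.
    """
    return [t for t in triples
            if t[0][1] in OPEN_CLASSES or t[2][1] in OPEN_CLASSES]
```

For information retrieval this means the index stores only the content-bearing head-modifier pairs of a document, which is the subset of the dependency description the note has in mind.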

(58) Only titles that appear in the main text (not the footnotes) are included in the bibliography. A complete bibliography can be found in the online chapter of the Handbook.