PRINTED FROM OXFORD HANDBOOKS ONLINE. © Oxford University Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy and Legal Notice).

Multimodal Systems

Abstract and Keywords

Recent years have witnessed a rapid growth in the development of multimodal systems. Improving technology and tools enable the development of more intuitive styles of interaction and convenient ways of accessing large data archives. Starting from the observation that natural language plays an integral role in many multimodal systems, this chapter focuses on the use of natural language in combination with other modalities, such as body gestures or gaze. It addresses the following three issues: (1) how to integrate multimodal input including spoken or typed language in a synergistic manner; (2) how to combine natural language with other modalities in order to generate more effective output; and (3) how to make use of natural language technology in combination with other modalities in order to enable better access to information.

Keywords: multimodal interfaces, human–computer interfaces, multimodal grammar, multimodal reference expressions, embodied conversational agent, human–robot interaction, affective computing

1 Introduction

Whereas traditional user interfaces typically follow the paradigm of direct manipulation combining keyboard, mouse, and screen, novel human–computer interfaces aim at more natural interaction using multiple human-like modalities, such as speech and gestures. The place of natural language as one of the most important means of communication makes natural-language technologies integral parts of interfaces that emulate aspects of human–human communication. An apparent advantage of natural language is its great expressive power. Being one of the most familiar means of human interaction, natural language can significantly reduce the training effort required to enable communication with machines.

On the other hand, the coverage of current natural-language dialogue systems is still strongly limited. This fact is aggravated by the lack of robust speech recognizers. The integration of non-verbal media often improves the usability and acceptability of natural-language components as they can help compensate for the deficiencies of current natural-language technology. In fact, empirical studies by Oviatt (1999) show that properly designed multimodal systems have a higher degree of stability and robustness than those that are based on speech input only. From a linguistic point of view, multimodal systems are interesting because communication by language is a specialized form of communication in general. Theories of natural-language processing have reached sufficiently high levels of maturity so that it is now time to investigate how they can be applied to other ways of communicating, such as touch- and gaze-based interaction.

The objective of this chapter is to investigate the use of natural language in multimodal interfaces. To start with, we shall clarify the basic terminology. The terms medium and modality, especially, have been a constant cause of confusion because they are used differently in various disciplines. In this chapter, we adopt the distinction by Maybury and Wahlster (1998) between medium, mode, and code. The term mode or modality is used to refer to different kinds of perceptible entities (e.g. visual, auditory, haptic, and olfactory) while the term medium relates to the carrier of information (e.g. paper or CD-ROM), different kinds of physical devices (e.g. screens, loudspeakers, microphones, and printers) and information types (e.g. graphics, text, and video). Finally, the term code refers to the particular means of encoding information (e.g. pictorial languages).

Multimedia/multimodal systems are then systems that are able to analyse and/or generate multimedia/multimodal information or provide support in accessing digital resources of multiple media.

Multimodal input analysis starts from low-level sensing of the single modes relying on interaction devices, such as speech and gesture recognizers and eye trackers. The next step is the transformation of sensory data into representation formats of a higher level of abstraction. In order to exploit the full potential of multiple input modes, input analysis should not handle the single modes independent of each other, but fuse them into a common representation format that supports the resolution of ambiguities and accounts for the compensation of errors. This process is called modality integration or modality fusion. Conventional multimodal systems usually do not maintain explicit representations of the user’s input and handle mode integration only in a rudimentary manner. In section 2, we will show how the generalization of techniques and representation formalisms developed for the analysis of natural language can help overcome some of these problems.

Multimedia generation refers to the activity of producing output in different media. It can be decomposed into the following subtasks: the selection and organization of information, the allocation of media, and content-specific media encoding. As we do not obtain coherent presentations by simply merging verbalization and visualization results into multimedia output, the generated media objects have to be tailored to each other in such a way that they complement each other in a synergistic manner. This process is called media coordination or media fission. While the automatic production of material is rarely addressed in the multimedia community, a considerable amount of research effort has been directed towards the automatic generation of natural language. In case information is presented by synthetic characters, we use the term multimodal generation and accordingly talk about modality coordination or modality fission. Section 3 surveys techniques for building automated multimedia presentation systems drawing upon lessons learned during the development of natural-language generators.

Depending on whether media/modalities are used in an independent or combined manner, the World Wide Web Consortium (W3C) distinguishes between the complementary or supplementary use of modalities/media. Modalities/media that are used in a complementary manner contribute synergistically to a common meaning. Supplementary media/modalities improve the accessibility of applications since the users may choose those modalities/media for communication that meet their requirements and preferences best. Furthermore, we distinguish between the sequential and simultaneous use of modalities/media depending on whether they are separated by time lags or whether they overlap with each other.

Multimedia access to digital data is facilitated by methods for document classification and analysis, techniques to condense and aggregate the retrieved information, as well as appropriate multimodal user interfaces to support search tasks. In particular, commercial multimedia retrieval systems do not always aim at a deeper analysis of the underlying information, but restrict themselves to classifying and segmenting static images and videos and integrate the resulting information with text-based information. In this case, we talk about media integration or media fusion. In section 4, we argue that the integration of natural-language technology can lead to a qualitative improvement of existing methods for document classification and analysis.

2 Multimodal/Multimedia Input Interpretation

Based on the observation that human–human communication is multimodal, a number of researchers have investigated the usage of multiple modalities and input devices in man–machine communication. Earlier systems focused on the analysis of the semantics of multimodal utterances and typically investigated a combination of pointing and drawing gestures and speech. The most prominent example is the ‘Put-that-there’ system (Bolt 1980), which analyses speech in combination with 3D pointing gestures referring to objects on a graphical display. Since this groundbreaking work, numerous researchers have developed mechanisms for multimodal input interpretation, mainly focusing on speech, gestures, and gaze, while the trend is moving towards intuitive interactions in everyday environments.

2.1 Mechanisms for integrating modalities

Most systems rely on different components for the low-level analysis of the single modes, such as eye trackers and speech and gesture recognizers, and make use of one or several modality integrators to come up with a comprehensive interpretation of the multimodal input. This approach raises two questions: How should the results of low-level analysis be represented in order to support the integration of the single modalities? How far should we process one input stream before integrating the results of other modality analysis processes?

Basically, two fusion architectures have been proposed in the literature, which differ in the level at which sensor data are fused:

  • Low-level fusion: the input from different sensors is integrated at an early stage of processing, which is why low-level fusion is often also called early fusion. The fusion input may consist of either raw data or low-level features, such as pitch. The advantage of low-level fusion is that it enables a tight integration of modalities. There is, however, no declarative representation of the relationship between the various sensor data, which complicates the interpretation of recognition results.

  • High-level fusion: low-level input has to pass modality-specific analysers before it is integrated, e.g. by summing recognition probabilities to derive a final decision. High-level fusion occurs at a later stage of processing and is therefore often also called late fusion. The advantage of high-level fusion is that it allows for the definition of declarative rules to combine the interpreted results of various sensors. There is, however, the danger that information gets lost because abstraction takes place too early.
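The high-level variant, in which recognition probabilities are summed to reach a final decision, can be illustrated with a minimal Python sketch. The function name `late_fusion`, the command labels, the probabilities, and the optional per-modality weights are all assumptions made for illustration; they are not taken from any of the systems discussed here.

```python
from collections import defaultdict


def late_fusion(modality_scores, weights=None):
    """Sum the class probabilities reported by each modality-specific analyser."""
    if weights is None:
        # Assumption: every modality is trusted equally unless stated otherwise.
        weights = {name: 1.0 for name in modality_scores}
    totals = defaultdict(float)
    for name, scores in modality_scores.items():
        for label, prob in scores.items():
            totals[label] += weights[name] * prob
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical outputs of two independent analysers for the same user action.
speech_scores = {"delete": 0.6, "zoom_in": 0.4}
gesture_scores = {"delete": 0.3, "zoom_in": 0.7}

decision, totals = late_fusion({"speech": speech_scores, "gesture": gesture_scores})
print(decision, totals)
```

Because each analyser has already reduced its input to a handful of labels, any detail that was discarded on the way (for instance the exact shape of the gesture) is no longer available at fusion time, which is the information loss mentioned above.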

Systems aiming at a semantic interpretation of multimodal input typically use a late fusion approach and process each modality individually. An example of such a system is QuickSet (Johnston 1998), which analyses a combination of speech and drawing gestures on a graphically displayed map. The SmartKom system uses a mixture of early fusion for analysing emotions from facial expressions and speech and late fusion for analysing the semantics of utterances (Wahlster 2003).

2.2 Criteria for modality integration

In the ideal case, multimodal systems should not just accept input in multiple modalities, but also support a variety of modality combinations. This requires sophisticated methods for modality integration.

An important prerequisite for modality integration is the explicit representation of the multimodal context. For instance, the interpretation of a pointing gesture often depends on the syntax and semantics of the accompanying natural-language utterance. In ‘Is this number <pointing gesture> correct?’, only referents of type ‘number’ can be considered as candidates for the pointing gesture. The case frame indicated by the main verb of a sentence is another source of information that can be used to disambiguate a referent since it usually provides constraints on the fillers of the frame slots. For instance, in ‘Can I add my travel expenses here <pointing gesture>?’, the semantics of add requires a field in a form where the user can input information.

Frequently, ambiguities of referring expressions can be resolved by considering spatial constraints. For example, the meaning of a gesture shape can be interpreted with respect to the graphical objects which are close to the gesture location. In instrumented rooms, the interpretation of referring expressions may be guided by proximity and visibility constraints. That is, nearby objects that are located in the user’s field of view are preferably considered as potential candidates for referring expressions, such as ‘Switch the device on’.
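As a rough illustration of proximity and visibility constraints, the following Python sketch filters candidate referents by type and field of view and then ranks them by distance to the user. The two-dimensional room model, the class `SceneObject`, the 120-degree field of view, and all coordinates are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    kind: str
    x: float
    y: float


def in_field_of_view(user_pos, gaze_deg, obj, fov_deg=120.0):
    """True if the object lies within fov_deg centred on the gaze direction."""
    angle = math.degrees(math.atan2(obj.y - user_pos[1], obj.x - user_pos[0]))
    # Normalise the angular difference to the range [-180, 180).
    diff = (angle - gaze_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0


def rank_referents(objects, kind, user_pos, gaze_deg):
    """Keep visible objects of the requested kind, nearest first."""
    visible = [
        o for o in objects
        if o.kind == kind and in_field_of_view(user_pos, gaze_deg, o)
    ]
    return sorted(visible, key=lambda o: math.dist(user_pos, (o.x, o.y)))


# Hypothetical room: the user stands at the origin and looks along the x-axis.
room = [
    SceneObject("lamp", "device", 1.0, 0.0),
    SceneObject("tv", "device", 3.0, 1.0),
    SceneObject("radio", "device", -1.0, 0.0),   # behind the user
    SceneObject("table", "furniture", 1.0, 0.5),
]
candidates = [o.name for o in rank_referents(room, "device", (0.0, 0.0), 0.0)]
print(candidates)
```

For ‘Switch the device on’, the radio is discarded because it is outside the assumed field of view, and the lamp is preferred over the television because it is closer.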

A further fusion criterion is the temporal relationship between two user events detected on two different modalities. Johnston (1998) considers temporal constraints to resolve referring expressions consisting of speech and 2D gestures in QuickSet. Kaiser et al. (2003) use a time-stamped history of objects to derive a set of potential referents for multimodal utterances in Augmented and Virtual Reality. Like Johnston (1998), they employ time-based constraints for modality integration.

Particular challenges arise in a situated environment because the information on the user’s physical context is required to interpret a multimodal utterance. For example, a robot has to know its location and orientation as well as the location of objects in its physical environment, to execute commands, such as ‘Move to the table’. In a mobile application, the GPS location of the device may be used to constrain search results for a natural-language user query. When a user says ‘restaurants’ without specifying an area on the map displayed on the phone, the system interprets this utterance as a request to provide only restaurants in the user’s immediate vicinity. Such an approach is used, for instance, by Johnston et al. (2011) in the mTalk system, a multimodal browser for location-based services.

A fundamental problem of most early systems was that there was no declarative formalism for the formulation of integration constraints. A noteworthy exception was the approach used in QuickSet which clearly separates the statements of the multimodal grammar from the mechanisms of parsing (Johnston 1998). This approach enabled not only the declarative formulation of type constraints, such as ‘the location of a flood zone should be an area’, but also the specification of spatial and temporal constraints, such as ‘two regions should be a limited distance apart’ and ‘the time of speech must either overlap with or start within four seconds of the time of the gesture’. Mehlmann and André (2012) introduce event logic charts to integrate input distributed over multiple modalities in accordance with spatial, temporal, and semantic constraints. The advantage of their approach is the tight coupling of incremental parsing and interaction management, which also makes it suitable for scenarios where analysis and production processes need to be aligned with each other, as in the human–robot dialogue described in section 2.4.
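The temporal constraint quoted above can be written down as a small predicate. The sketch below is one possible reading, not QuickSet's implementation: it assumes that ‘start within four seconds of the time of the gesture’ means that speech begins no more than four seconds after the gesture has ended. The names `Interval` and `temporally_compatible` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float  # seconds
    end: float    # seconds


def temporally_compatible(speech, gesture, max_lag=4.0):
    """Speech overlaps the gesture or starts at most max_lag seconds after it."""
    overlaps = speech.start <= gesture.end and gesture.start <= speech.end
    follows = 0.0 <= speech.start - gesture.end <= max_lag
    return overlaps or follows


gesture = Interval(1.0, 2.0)
print(temporally_compatible(Interval(1.5, 3.0), gesture))  # overlapping
print(temporally_compatible(Interval(5.0, 6.0), gesture))  # starts 3 s later
print(temporally_compatible(Interval(7.0, 8.0), gesture))  # starts 5 s later
```

Keeping the constraint in a separate predicate mirrors the separation of grammar statements from the parsing mechanism: the threshold can be changed without touching the code that combines the modalities.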

Many recent multimodal input systems, such as SmartKom (Wahlster 2003), make use of an XML language for representing messages exchanged between software modules. An attempt to standardize such a representation language has been made by the World Wide Web Consortium (W3C) with EMMA (Extensible MultiModal Annotation mark-up language). It enables the representation of characteristic features of the fusion process: ‘composite’ information (resulting from the fusion of several modalities), confidence scores, timestamps, as well as incompatible interpretations (‘one-of’). Johnston (2009) presents a variety of multimodal interfaces combining speech-, touch- and pen-based input that have been developed using the EMMA standard.

2.3 Natural-language technology as a basis for multimodal analysis

Typically systems that analyse multimodal input rely on mechanisms that have been originally introduced for the analysis of natural language. Johnston (1998) proposed an approach to modality integration for the QuickSet system that was based on unification over typed feature structures. The basic idea was to build up a common semantic representation of the multimodal input by unifying feature structures which represented the semantic contributions of the single modalities. For instance, the system was able to derive a partial interpretation for a spoken natural-language reference which indicated that the location of the referent was of type ‘point’. In this case, only unification with gestures of type ‘point’ would succeed.
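The idea of unifying the semantic contributions of speech and gesture can be sketched in a few lines of Python. This is a deliberately simplified illustration: feature structures are plain nested dictionaries, and atomic values must match exactly, whereas a typed system would also consult a type hierarchy. The feature names and values are invented for the example.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Recursively merge two feature structures, failing on conflicts."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            result[key] = unify(a[key], value) if key in a else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


def can_unify(a, b):
    try:
        unify(a, b)
        return True
    except UnificationFailure:
        return False


# Partial interpretation of the spoken input: the location must be a point.
speech = {"cat": "command", "location": {"type": "point"}}

# Hypothetical gesture interpretations.
point_gesture = {"location": {"type": "point", "coord": (12, 34)}}
area_gesture = {"location": {"type": "area", "coords": [(0, 0), (5, 5)]}}

combined = unify(speech, point_gesture)
print(combined)
print(can_unify(speech, area_gesture))
```

The spoken input leaves the coordinates unspecified and the gesture leaves the command category unspecified; unification fills both gaps, while the mismatch between ‘point’ and ‘area’ blocks the second combination.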

Kaiser et al. (2003) applied unification over typed feature structures to analyse multimodal input consisting of speech, 3D gestures, and head direction in augmented and virtual reality. Noteworthy is the fact that the system went beyond gestures referring to objects, but also considered gestures describing how actions should be performed. Among others, the system was able to interpret multimodal rotation commands, such as ‘Turn the table <rotation gesture> clockwise’, where the gesture specified both the object to be manipulated and the direction of rotation.

Usually, multimodal input systems combine several n-best hypotheses produced by multiple modality-specific generators. This leads to several possibilities of fusion, each with a score computed as a weighted sum of the recognition scores provided by individual modalities. Mutual disambiguation is a mechanism used in multimodal input systems in which a modality can help a badly ranked hypothesis to get a better multimodal ranking. Thus, multimodality enables us to use the strength of one modality to compensate for weaknesses of others. For example, errors in speech recognition (see Chapter 30) can be compensated by gesture recognition and vice versa. Oviatt (1999) reported that 12.5% of pen/voice interactions in QuickSet could be successfully analysed due to multimodal disambiguation, while Kaiser et al. (2003) even obtained a success rate of 46.4% that could be attributed to multimodal disambiguation.
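A minimal Python sketch of mutual disambiguation follows. It pairs every speech hypothesis with every gesture hypothesis, drops incompatible pairs, and scores the rest by a weighted sum. The hypotheses, scores, equal weights, and the compatibility test are assumptions made for illustration only.

```python
from itertools import product


def fuse_nbest(speech_nbest, gesture_nbest, compatible, w_speech=0.5, w_gesture=0.5):
    """Rank all compatible speech/gesture pairs by a weighted sum of scores."""
    fused = []
    for (s, s_score), (g, g_score) in product(speech_nbest, gesture_nbest):
        if compatible(s, g):
            fused.append((w_speech * s_score + w_gesture * g_score, s, g))
    return sorted(fused, reverse=True)


# Each speech hypothesis is (recognized phrase, gesture type it requires).
speech_nbest = [(("ditch", "line"), 0.5), (("flood zone", "area"), 0.4)]
gesture_nbest = [("area", 0.8), ("line", 0.1)]

ranking = fuse_nbest(speech_nbest, gesture_nbest, lambda s, g: s[1] == g)
best_score, best_speech, best_gesture = ranking[0]
print(best_speech[0], best_gesture)
```

Here the speech recognizer ranks ‘ditch’ first, but the confidently recognized area gesture lifts the second-ranked hypothesis ‘flood zone’ to the top of the multimodal ranking.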

Another approach that was inspired by work on natural-language analysis used finite-state machines consisting of n+1 tapes which represent the n input modalities to be analysed and their combined meaning (Bangalore and Johnston 2009). When analysing a multimodal utterance, lattices that correspond to possible interpretations of the single input streams are created by writing symbols on the corresponding tapes. Multiple input streams are then aligned by transforming their lattices into a lattice that represents the combined semantic interpretation. Temporal constraints are not explicitly encoded as in the unification-based approaches described above, but implicitly given by the order of the symbols written on the single tapes.

Bangalore and Johnston (2009) present a mobile restaurant guide to demonstrate how such an approach may be used to support multimodal applications combining speech with complex pen input, including free-form drawings as well as handwriting. For illustration, let us suppose the user utters ‘Show me Italian restaurants in this area’, while drawing a circle on the map. To analyse the multimodal input, the system builds up a weighted lattice for possible word strings and a weighted lattice for possible interpretations of the user’s ink. The drawn circle might be interpreted as an area or the handwritten letter ‘O’. To represent this ambiguity, the ink lattice would include two different paths, one indicating an area and one indicating a letter. Due to the speech input, the system would only consider the path referring to the area when building up the lattice for the semantic interpretation and thus be able to resolve the ambiguity of the gestural input.

A particular challenge is the analysis of plural multimodal referring expressions, such as ‘the restaurants in this area’ accompanied by a pointing gesture. To analyse such expressions, Martin et al. (2006) consider perceptual groups which might elicit multiple-object selection with a single gesture, or for which a gesture on a single object might have to be interpreted as a selection of the whole group, such as the group of pictures on the wall (Landragin 2006).

More recent work focuses on the challenge to support speech-based multimodal interfaces on heterogeneous devices, including not only desktop PCs, but also mobile devices, such as smart phones (Johnston 2009). In addition, there is a trend towards less traditional platforms, such as in-car interfaces (Gruenstein et al. 2009) or home-controlling interfaces (Dimitriadis and Schroeter 2011). Such environments raise particular challenges to multimodal analysis due to the increased noise level, the less controlled environment, and multi-threaded conversations. In addition, we need to consider that users are continuously producing multimodal output and not only when interacting with a system. For example, a gesture performed by a user to greet another user should not be mixed up with a gesture to control a system. In order to relieve the users from the burden of explicitly indicating when they wish to interact, a system should be able to distinguish automatically between commands and non-commands.

2.4 Reconsidering phenomena of natural-language dialogue in a multimodal context

In the previous section, we surveyed approaches to the analysis of multimodal utterances independently of a particular discourse. In this section, we discuss multimodality in the context of natural-language dialogue focusing on two phenomena: grounding and turn management.

Figure 1 Example of a multimodal human–robot dialogue.

A necessary requirement for successful human–computer interaction (HCI) is the establishment of common ground between the human and the machine. That is, the human and the machine need to agree upon what a conversation is about and ground their utterances. Grounding requires that all communicative partners continuously indicate that they are following a conversation or else communicate comprehension problems which have impeded this. To illustrate the construct of common ground, consider the human–robot interaction (HRI) shown in Figure 1.

In this dialogue, the robot initiates a referring act by gazing at the target object, pointing at it, and specifying discriminating attributes (colour and shape) verbally. The referring act by the robot is followed by a gaze of the human at the target object which may be taken as evidence that common ground was established successfully by directed gaze. The human then produces a backchannel signal consisting of a brief nod to signal the robot that she has understood the robot’s request, and starts executing the request by the robot, i.e. produces the relevant next contribution to the interaction. After successfully conducting the request by the robot, the human gazes at the robot to receive its feedback. The robot responds to the human’s gaze by looking at her. Thus, the attempt by the human to establish mutual gaze with the robot was successful. The robot then produces a backchannel signal consisting of a head nod and brief verbal feedback to communicate to the human that the action was performed to its satisfaction.

An integrated approach that models direct gaze, mutual gaze, relevant next contribution, and backchannel behaviours as an indicator of engagement in a dialogue has been presented by Rich et al. (2010) and validated for a simple dialogue scenario between a human and a robot. The approach was used for modelling the behaviour of both the robot and the human. As a consequence, it was able to explain failures in communication from the perspective of both interlocutors.

The dialogue example shown in Figure 1 also illustrates how non-verbal behaviours may be employed to regulate the flow of a conversation. By looking at the interlocutor after a contribution to the discourse, the human and the robot signal that they are willing to give up the turn. The role of gaze as a mechanism to handle turn taking in human–robot interaction has been explored, among others, by Mutlu et al. (2012). By means of an empirical study, they were able to show that human-like gaze behaviours implemented in a robot may help handle turn assignment and signal the role to human interlocutors as addressees, bystanders, or non-participants.

2.5 Analysis of emotional signals in natural-language dialogue

In section 2.3, we discussed approaches for analysing the semantics of multimodal input. A number of empirical studies revealed, however, that a pure semantic analysis does not always suffice. Rather, a machine should also be sensitive towards communicative signals that are communicated by a human user in a more unconscious manner. For example, Martinovsky and Traum (2003) demonstrated by means of user dialogues with a training system and a telephone-based information system that many breakdowns in man–machine communication could be avoided if the machine was able to recognize the emotional state of the user and responded to it appropriately.

Inspired by their observations, Bosma and André (2004) presented an approach to the joint interpretation of emotional input and natural-language utterances. Short utterances in particular tend to be highly ambiguous when solely the linguistic data is considered. An utterance like ‘right’ may be interpreted as a confirmation as well as a rejection, if intended cynically, and so may the absence of an utterance. To integrate the meanings of the users’ spoken input and their emotional state, Bosma and André combined a Bayesian network to recognize the user’s emotional state from physiological data, such as heart rate, with weighted finite-state machines to recognize dialogue acts from the user’s speech. The finite-state machine approach was similar to that presented by Bangalore and Johnston (2009). However, while Bangalore and Johnston used finite-state machines to analyse the propositional content of dialogue acts, Bosma and André focused on the speaker’s intentions. Their objective was to discriminate a proposal from a directive, an acceptance from a rejection, etc., as opposed to Bangalore and Johnston who aimed at parsing user commands that are distributed over multiple modalities, each of the modalities conveying partial information. That is, Bosma and André did not expect the physiological modality to contribute to the propositional interpretation of an utterance. Instead, the emotional input was used to estimate the probabilities of dialogue acts, which were represented by weights in the finite-state machines.
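The effect of emotional input on the interpretation of an ambiguous utterance can be illustrated with a small calculation. The Python sketch below is not the weighted finite-state machinery of Bosma and André; it substitutes a plain application of Bayes' rule, in which an emotion-dependent prior over dialogue acts reweights what the words alone suggest. All probabilities and the labels ‘calm’ and ‘frustrated’ are invented for the example.

```python
def rank_dialogue_acts(act_likelihood, act_prior):
    """Combine word evidence with an emotion-dependent prior and normalise."""
    scores = {
        act: act_likelihood.get(act, 0.0) * prior
        for act, prior in act_prior.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no dialogue act is compatible with both inputs")
    return {act: score / total for act, score in scores.items()}


# Hypothetical: the word 'right' alone does not favour either reading.
RIGHT_LIKELIHOOD = {"accept": 0.5, "reject": 0.5}

# Hypothetical priors over dialogue acts for two recognized emotional states.
PRIOR_BY_EMOTION = {
    "calm": {"accept": 0.8, "reject": 0.2},
    "frustrated": {"accept": 0.3, "reject": 0.7},
}

calm = rank_dialogue_acts(RIGHT_LIKELIHOOD, PRIOR_BY_EMOTION["calm"])
frustrated = rank_dialogue_acts(RIGHT_LIKELIHOOD, PRIOR_BY_EMOTION["frustrated"])
print(max(calm, key=calm.get), max(frustrated, key=frustrated.get))
```

The propositional content of the utterance is untouched; only the ranking of the candidate dialogue acts changes with the emotional state.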

Another approach that fuses emotional states with natural-language dialogue acts has been presented by Crook et al. (2012) who integrated a system to recognize emotions from speech, developed by Vogt et al. (2008), into a natural-language dialogue system in order to improve the robustness of a speech recognizer. Their system fuses emotional states recognized from the acoustics of speech with sentiments extracted from the transcript of speech. For example, when the users employ words that are not included in the dictionary to express their emotional state, the system would still be able to recognize their emotions from the acoustics of speech.

3 Generation of Multimedia Output Including Natural Language

In many situations, information is only presented efficiently through a particular media combination. Multimedia presentation systems take advantage of both the individual strength of each medium and the fact that several media can be employed in parallel. Most early systems combine spoken or written language with static or dynamic graphics, including bar charts and tables, such as MAGIC (Dalal et al. 1996), maps, such as AIMI (Maybury 1993) and depictions of three-dimensional objects, such as WIP (André et al. 1993).

While these early systems assumed a given hardware setup, later systems enabled the presentation of multimodal information on heterogeneous devices. For example, the SmartKom system (Wahlster 2003) supports a variety of non-desktop applications including smart rooms, kiosks, and mobile environments. More recent systems exploit the benefits of multiple media and modalities in order to improve the accessibility for a diversity of users. For instance, Piper and Hollan (2008) developed a multimodal interface for tabletop displays that incorporates keyboard input by the patient and speech input by the doctor. To facilitate medical conversations between a deaf patient and a hearing, non-signing physician, the interface made use of movable speech bubbles. In addition, it exploited the affordances of tabletop displays to leverage face-to-face communication. The ambition of this work lies in the fact that it aims at satisfying the needs of several users with very different requirements at the same time.

3.1 Natural-language technology as a basis for multimedia generation

Encouraged by progress achieved in natural-language generation (see Chapter 29), several researchers have tried to generalize the underlying concepts and methods in such a way that they can be used in the broader context of multimedia generation.

A number of multimedia document generation systems make use of a notion of schemata introduced by McKeown (1992) for text generation. Schemata describe standard patterns of discourse by means of rhetorical predicates which reflect the relationships between the parts of a multimedia document. One example of a system using a schema-based approach is COMET (Feiner and McKeown 1991). COMET employs schemata to determine the contents and the structure of the overall document. The result of this process is forwarded to a media coordinator which determines which generator should encode the selected information.

Besides schema-based approaches, operator-based approaches similar to those used for text generation have become increasingly popular for multimedia document generation. Examples include AIMI (Maybury 1993), MAGIC (Dalal et al. 1996), and WIP (André et al. 1993). The main idea behind these systems is to generalize communicative acts to multimedia acts and to formalize them as operators of a planning system. Starting from a generation goal, such as describing a technical device, the planner looks for operators whose effect subsumes the goal. If such an operator is found, all expressions in the body of the operator will be set up as new subgoals. The planning process terminates if all subgoals have been expanded to elementary generation tasks which are forwarded to the medium-specific generators. The result of the planning process is a hierarchically organized graph that reflects the discourse structure of the multimedia material.
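The expansion of a generation goal into subgoals can be sketched as a recursive decomposition. The Python example below is a strong simplification of operator-based planning: an operator is selected by exact name match rather than by testing whether its effect subsumes the goal, operators have no applicability constraints, and the operator set is assumed to be acyclic. The goal names and the `medium:task` notation for elementary tasks are invented for illustration.

```python
# Hypothetical operators: each goal maps to the subgoals in its body.
OPERATORS = {
    "describe-device": ["introduce-device", "explain-parts"],
    "introduce-device": ["text:name-device", "graphics:show-overview"],
    "explain-parts": ["graphics:show-part", "text:label-part"],
}


def plan(goal):
    """Expand a goal until only elementary generation tasks remain."""
    if goal not in OPERATORS:
        return goal  # elementary task for a medium-specific generator
    return (goal, [plan(subgoal) for subgoal in OPERATORS[goal]])


def elementary_tasks(tree):
    """Collect the leaves of the plan tree in presentation order."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [task for child in children for task in elementary_tasks(child)]


discourse_tree = plan("describe-device")
tasks = elementary_tasks(discourse_tree)
print(tasks)
```

The nested tuples returned by `plan` play the role of the hierarchically organized structure that reflects the discourse organization, while the leaves are the tasks handed to the individual generators.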

The use of operator-based approaches has proved promising not only for the generation of static documents, but also for the generation of multimodal presentations as in the AutoBriefer system (André et al. 2005). AutoBriefer uses declarative presentation planning strategies to synthesize a narrated multimedia briefing in various presentation formats. The narration employs synthesized audio (see Chapter 31) as well as, optionally, an agent embodying the narrator. From a technical point of view, it does not make a difference whether we plan presentation scripts for the display of static or dynamic media or communicative acts to be executed by animated characters. Basically, we have to define a repertoire of plan operators which control a character’s conversational behaviour. The planning approach also allows us to incorporate models of a character’s personality and emotions by treating them as an additional filter during the selection and instantiation of plan operators. For example, we may define specific plan operators for characters of a specific personality and formulate constraints which restrict their applicability (André et al. 2000).

While the approaches already discussed focus on the generation of multimedia documents or multimodal presentations, various state chart dialects have been shown to be a suitable method for modelling multimodal interactive dialogues. Such an approach has been presented, for example, by Gebhard et al. (2012). The basic idea is to organize the content as a collection of scenes which are described by a multimodal script while the transitions between single scenes are modelled by hierarchical state charts. The approach also supports the development of interactive multimodal scripts since transitions from one scene to another may be elicited by specific user interactions.
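To make the scene-and-transition idea concrete, the following Python sketch models scenes as states and lets user events trigger transitions, with transitions defined on an enclosing super-state applying to all scenes inside it. It is an assumed toy model, not the formalism or tooling of Gebhard et al.; the class `SceneFlow`, the scene names, and the event names are hypothetical.

```python
class SceneFlow:
    """A tiny hierarchical state chart over scenes."""

    def __init__(self, parents, transitions, start):
        self.parents = parents          # scene -> enclosing super-state or None
        self.transitions = transitions  # (state, event) -> next scene
        self.state = start

    def fire(self, event):
        """Follow the innermost matching transition; ignore unknown events."""
        node = self.state
        while node is not None:
            target = self.transitions.get((node, event))
            if target is not None:
                self.state = target
                return target
            node = self.parents.get(node)
        return self.state


parents = {"greeting": "tour", "exhibit": "tour", "tour": None, "farewell": None}
transitions = {
    ("greeting", "user_nods"): "exhibit",
    ("tour", "user_leaves"): "farewell",  # inherited by every scene in 'tour'
}

flow = SceneFlow(parents, transitions, start="greeting")
visited = [flow.fire("user_nods"), flow.fire("user_coughs"), flow.fire("user_leaves")]
print(visited)
```

Attaching the ‘user_leaves’ transition to the super-state avoids repeating it for every scene, which is the main practical benefit of the hierarchy.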

3.2 Multimodal/multimedia coordination

Multimedia presentation design involves more than just merging output in different media; it also requires a fine-grained coordination of different modalities/media. This includes distributing information onto different generators, tailoring the generation results to each other, and integrating them into a multimodal/multimedia output.

Modality/media allocation

Earlier approaches, such as modality theory presented by Bernsen (1997), rely on a formal representation of modality properties that helps find an appropriate modality combination in a particular context. For example, speech may be classified as an acoustic modality which does not require limb (including haptic) or visual activity. As a consequence, spoken commands are appropriate in situations where the user’s hands and eyes are occupied.
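
A property-based selection of this kind can be sketched as a lookup over a table of modality properties. The table below is a strongly simplified, hypothetical stand-in for Bernsen's taxonomy; only the entry for speech follows the description above, and the remaining entries are assumptions.

```python
# Hypothetical property table for input modalities.
MODALITY_PROPERTIES = {
    "speech":   {"medium": "acoustic", "limb_activity": False, "visual_activity": False},
    "touch":    {"medium": "haptic",   "limb_activity": True,  "visual_activity": True},
    "keyboard": {"medium": "haptic",   "limb_activity": True,  "visual_activity": True},
    "gaze":     {"medium": "visual",   "limb_activity": False, "visual_activity": True},
}


def suitable_modalities(hands_busy, eyes_busy):
    """Return the modalities that do not need a resource the user cannot spare."""
    result = []
    for name, props in MODALITY_PROPERTIES.items():
        blocked = (hands_busy and props["limb_activity"]) or \
                  (eyes_busy and props["visual_activity"])
        if not blocked:
            result.append(name)
    return result


print(suitable_modalities(hands_busy=True, eyes_busy=True))
```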

While earlier work on media selection focused on the formalization of knowledge that influences the selection process, more recent work on modality selection is guided by empirical studies. For example, Cao et al. (2010) conducted a study to identify suitable media combinations for presenting warnings to car drivers. They recommend auditory modalities, such as speech and beeps, in situations where visibility is low or the driver is tired, while visual media should preferably be employed in noisy environments. A combination of visual and auditory media is particularly suitable when the driver’s cognitive load is very high.
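
Recommendations of this kind translate naturally into selection rules. The following sketch encodes the three recommendations reported above; the order in which conflicting conditions are resolved and the default for unremarkable situations are our own assumptions and are not part of the study.

```python
def choose_warning_media(low_visibility=False, driver_tired=False,
                         noisy=False, high_cognitive_load=False):
    """Map a driving context to a set of media for presenting a warning."""
    if high_cognitive_load:
        return {"visual", "auditory"}   # combine both channels
    if noisy:
        return {"visual"}               # auditory cues may be masked
    if low_visibility or driver_tired:
        return {"auditory"}             # speech or beeps
    return {"visual"}                   # assumed default, not from the study


print(choose_warning_media(driver_tired=True))
```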

Figure 2 The MARC character pointing, disapproving and applauding (from left to right).

Particular challenges arise when choosing appropriate modalities for embodied conversational agents (ECAs). According to their functional role in a dialogue, such agents must be able to exhibit a variety of conversational behaviours. Among other things, they have to execute verbal and non-verbal behaviours that express emotions (e.g. show anger by facial displays and body gestures), convey the communicative function of an utterance (e.g. warn the user by lifting the index finger), support referential acts (e.g. look at an object and point at it), regulate dialogue management (e.g. establish eye contact with the user during communication), and articulate what is being said. Figure 2 shows a number of postures and gestures of the MARC character (Courgeon et al. 2011) to express a variety of communicative acts.

The design of multimodal behaviours for embodied conversational agents is usually informed by corpora of human behaviours which include video recordings of multiple modalities, such as speech, hand gesture, facial expression, head movements, and body postures. For an overview of corpus-based generation approaches, we refer to Kipp et al. (2009). A significant amount of work has been conducted on corpus studies that investigate the role of gestures in multimodal human-agent dialogue. Noteworthy is the work by Bergmann et al. (2011) who recorded a multimodal corpus of route descriptions as a basis for the implementation of a virtual agent that is able to convey form and spatial features by gestures and speech. For example, their agent was able to produce multimodal utterances, such as ‘You will pass a U-shaped building’ while forming a U-shape with its hands.

Cross-modality references

To ensure the consistency and coherency of a multimedia document, the media-specific generators have to tailor their results to each other. An effective means of establishing coreferential links between different modalities is the generation of cross-modality referring expressions that refer to document parts in other presentation media. Examples of cross-modality referring expressions are ‘the upper left corner of the picture’ or ‘Fig. x’. To support modality coordination, a common data structure is required which explicitly represents the design decisions of the single generators and allows for communication between them. The EMMA standard introduced earlier includes an XML mark-up language representing the interpretation of multimodal user input.

An algorithm widely used to generate natural-language referring expressions has been presented by Reiter and Dale (1992). The basic idea of the algorithm is to determine a set of attributes that distinguish a reference object from alternatives with which the reference object might be mixed up.
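
The attribute-selection idea can be sketched as an incremental procedure that walks through a fixed preference order of attributes and keeps an attribute whenever it rules out at least one remaining distractor. The sketch below is written in the spirit of this idea and does not reproduce the published algorithm in detail; the attribute names and the preference order are assumptions.

```python
def distinguishing_description(target, distractors,
                               preferred=("type", "colour", "size")):
    """Select attribute-value pairs of `target` until no distractor is left.
    Returns None if the target cannot be distinguished."""
    description = {}
    remaining = list(distractors)
    for attribute in preferred:
        value = target.get(attribute)
        if value is None:
            continue
        ruled_out = [d for d in remaining if d.get(attribute) != value]
        if ruled_out:
            # The attribute has discriminatory power, so it is included.
            description[attribute] = value
            remaining = [d for d in remaining if d.get(attribute) == value]
        if not remaining:
            return description
    return None


target = {"type": "cup", "colour": "red", "size": "small"}
distractors = [
    {"type": "cup", "colour": "blue", "size": "small"},
    {"type": "plate", "colour": "red", "size": "large"},
]
print(distinguishing_description(target, distractors))
```

For the objects above, the type rules out the plate and the colour rules out the blue cup, so that the size is never considered and the description corresponds to ‘the red cup’.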

This algorithm has often been used as a basis for the generation of cross-modality references considering visual and discourse salience, as in the work by Kelleher and Kruijff (2006), or additional modalities, such as gestures, as in the work by van der Sluis and Krahmer (2007).

Synchronization of multimodal behaviours

The synchronization of multimodal behaviours is a main issue in multimodal generation and appears as a major task of several architectures. One example is mTalk (Johnston et al. 2011), which offers capabilities for synchronized multimodal output generation combining graphical actions with synchronized speech. In addition to the audio stream, the rendering component of mTalk receives specific mark events that indicate the progress of text to speech (Chapter 31) and can thus be used to synchronize speech output with graphical output. To illustrate this feature, the developers of mTalk implemented a Newsreader that highlights phrases in an HTML page while they are spoken.
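
The principle of mark-based synchronization can be sketched as follows. The code does not use the mTalk API: `speak_with_marks` is a hypothetical stand-in for a speech synthesizer that emits a mark event whenever it starts a new phrase, and `render` is a stand-in for the graphical component, which here simply brackets the phrase currently being spoken.

```python
def speak_with_marks(phrases):
    """Hypothetical synthesizer: yields a mark event as each phrase starts.
    A real engine would play the audio for the phrase between two events."""
    for index, _phrase in enumerate(phrases):
        yield {"type": "mark", "index": index}


def render(phrases, events):
    """Produce one display frame per mark event, highlighting the
    phrase that is being spoken at that moment."""
    frames = []
    for event in events:
        if event["type"] == "mark":
            frame = " ".join(f"[{p}]" if i == event["index"] else p
                             for i, p in enumerate(phrases))
            frames.append(frame)
    return frames


news = ["Storm hits coast.", "Trains delayed.", "Power restored."]
for frame in render(news, speak_with_marks(news)):
    print(frame)
```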

A more fine-grained synchronization is required for the generation of verbal and non-verbal behaviours of embodied conversational agents. For example, the body gestures, facial displays, and lip movements of an agent have to be tightly synchronized with the phonemes of a spoken utterance. Even small failures in the synchronization may make the agent appear unnatural and negatively influence how the agent is perceived by a human observer. To synchronize multimodal behaviours, a variety of scheduling approaches have been developed that automatically compose animation sequences following time constraints, such as the PPP system (André et al. 1998) or the BEAT system (Cassell et al. 2001). More recent approaches, such as the SmartBody (Thiebaux et al. 2008) or MARC system (Courgeon et al. 2011), assemble synchronized animations and speech based on performance descriptions in BML (Behavior Markup Language, Vilhjálmsson et al. 2007). This XML language includes specific tags for controlling the temporal relations between modalities. For example, BML allows us to specify that a particular behaviour should start only when another one has finished.
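
How such temporal relations can be resolved into a concrete schedule is sketched below. The data structure is not BML and the code is not the scheduler of any of the systems mentioned: the class `Behaviour`, its fields `after` and `together_with`, and the durations are assumptions chosen to illustrate the two simplest relations, namely starting when another behaviour has finished and starting together with another behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Behaviour:
    id: str
    duration: float                       # seconds
    after: Optional[str] = None           # start when this behaviour has finished
    together_with: Optional[str] = None   # start at the same time as this one


def schedule(behaviours):
    """Resolve the temporal constraints into absolute start times."""
    by_id = {b.id: b for b in behaviours}
    start = {}

    def resolve(bid, visiting=()):
        if bid in start:
            return start[bid]
        if bid in visiting:
            raise ValueError(f"cyclic constraint involving {bid!r}")
        b = by_id[bid]
        if b.after is not None:
            t = resolve(b.after, visiting + (bid,)) + by_id[b.after].duration
        elif b.together_with is not None:
            t = resolve(b.together_with, visiting + (bid,))
        else:
            t = 0.0
        start[bid] = t
        return t

    for b in behaviours:
        resolve(b.id)
    return start


performance = [
    Behaviour("speech1", 2.0),
    Behaviour("gaze1", 0.5, together_with="speech1"),
    Behaviour("point1", 1.0, after="speech1"),
    Behaviour("nod1", 0.5, after="point1"),
]
print(schedule(performance))
```

A realistic scheduler would additionally have to align behaviours with positions inside an utterance, for example with individual phonemes, which this sketch does not attempt.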

4 Language Processing for Accessing Multimedia Data

Rapid progress in technology for the creation, processing, and storage of multimedia documents has opened up completely new possibilities for building up large multimedia archives. Furthermore, platforms enabling social networking, such as Facebook, Flickr, and Twitter, have encouraged the production of an enormous amount of multimedia content on the Web. As a consequence, tools are required that make this information accessible to users in a beneficial way. Methods for natural-language processing facilitate the access to multimedia information in at least three ways: (1) information can often be retrieved more easily from meta-data, audio, or closed caption streams; (2) natural-language access to visual data is often much more convenient since it allows for a more efficient formulation of queries; and (3) natural language provides a good means of condensing and summarizing visual information.

4.1 NL-based video/image analysis

Whereas it is still not feasible to analyse arbitrary visual data, a great deal of progress has been made in the analysis of spoken and written language. Based on the observation that a lot of information is encoded redundantly, a number of research projects rely on the linguistic sources (e.g. transcribed speech or closed captions) when analysing image/video material. Indeed a number of projects, such as the Broadcast News Navigator (BNN, Merlino et al. 1997), have shown that the use of linguistic sources in multimedia retrieval may help overcome the so-called semantic gap, i.e. the discrepancy between low-level features and higher-level semantic concepts.

Typically, systems for NL-based video/image processing do not aim at a complete syntactic and semantic analysis of the underlying information. Instead, they usually restrict themselves to tasks, such as image classification, video classification, and video segmentation, employing standard techniques for shallow natural-language processing, such as text-based information retrieval (see Chapter 34) and information extraction (see Chapter 35).
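
A shallow analysis of this kind can be sketched with a keyword-based classifier over closed captions, where a topic boundary is assumed wherever the label of consecutive captions changes. The topic lexicon and the function names are hypothetical; actual systems rely on considerably richer retrieval and extraction techniques.

```python
import re
from collections import Counter
from itertools import groupby

# Hypothetical topic lexicon.
TOPIC_KEYWORDS = {
    "sports": {"match", "goal", "team", "score", "player"},
    "weather": {"rain", "sunny", "forecast", "temperature", "storm"},
    "politics": {"election", "parliament", "minister", "vote", "policy"},
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def classify_caption(caption):
    """Assign the topic whose keywords occur most often in the caption."""
    counts = Counter(tokenize(caption))
    scores = {topic: sum(counts[word] for word in keywords)
              for topic, keywords in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def segment(captions):
    """Group consecutive captions with the same label into one video segment."""
    labelled = [(classify_caption(c), c) for c in captions]
    return [(label, [caption for _, caption in group])
            for label, group in groupby(labelled, key=lambda pair: pair[0])]


captions = [
    "The team took the lead with a late goal.",
    "Every player celebrated after the match.",
    "Heavy rain and a storm are in the forecast.",
]
for label, members in segment(captions):
    print(label, len(members))
```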

Due to the increasing popularity of social media, novel applications for NL-based video/image analysis have emerged. For example, Firan et al. (2010) take advantage of different kinds of user-provided natural-language content for image classification. In addition, online resources, such as Wikipedia and WordNet, can be exploited for image/video analysis. For example, information extracted from Wikipedia can be used to resolve ambiguities of text accompanying an image or to refine queries for an image retrieval system; see Kliegr et al. (2008).

4.2 Natural-language access to multimodal information

Direct manipulation interfaces often require the user to access objects by a series of mouse operations. Even if the user knows the location of the objects he or she is looking for, this process may still cost a lot of time and effort. Natural language supports direct access to information and enables the efficient formulation of queries by using simple keywords or free-form text.

The vast majority of information systems allow the user to input some natural-language keywords that refer to the contents of an image or a video. Such keywords may specify a subject matter, such as ‘sports’ (Aho et al. 1997), but also subjective impressions, such as ‘sad movie’ (Chan and Jones 2005). With the advent of touch-screen phones, additional modalities have become available that allow for more intuitive NL-based interaction in mobile environments. For example, the iMOD system (Johnston 2009) enables users to browse for movies on an iPhone by formulating queries, such as ‘comedy movies by Woody Allen’ and selecting individual titles to view details with tactile gestures, such as touch and pen.
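
Keyword-based access of this kind can be sketched as a match between the content words of a query and the meta-data of each item. The sketch is not the iMOD implementation; the small catalogue, the stop-word list, and the function `search` are assumptions made for the example.

```python
import re

# Small illustrative catalogue of movie meta-data.
CATALOGUE = [
    {"title": "Annie Hall", "genre": "comedy", "director": "Woody Allen"},
    {"title": "Manhattan", "genre": "comedy", "director": "Woody Allen"},
    {"title": "Some Like It Hot", "genre": "comedy", "director": "Billy Wilder"},
    {"title": "Alien", "genre": "horror", "director": "Ridley Scott"},
]

# Hypothetical stop-word list for movie queries.
STOPWORDS = {"a", "the", "by", "of", "movie", "movies", "film", "films"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def search(query, catalogue=CATALOGUE):
    """Return the items whose meta-data contain every content word of the query."""
    terms = [t for t in tokenize(query) if t not in STOPWORDS]
    if not terms:
        return []
    hits = []
    for item in catalogue:
        words = set(tokenize(" ".join(item.values())))
        if all(term in words for term in terms):
            hits.append(item)
    return hits


for movie in search("comedy movies by Woody Allen"):
    print(movie["title"])
```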

4.3 NL summaries of multimedia information

One major problem associated with visual data is information overload. Natural language has the advantage that it permits the condensation of visual data at various levels of detail according to the application-specific demands. Indeed, a number of experiments performed by Merlino and Maybury (1999) showed that reducing the amount of information (e.g. presenting users just with a one-line summary of a video) significantly reduces performance time in information-seeking tasks, but leads to nearly the same accuracy.

Most approaches that produce summaries for multimedia information combine methods for the analysis of natural language with image retrieval techniques. An early example is the Columbia Digital News System (CDNS, Aho et al. 1997) that provides summaries over multiple news articles by employing methods for text-based information extraction (see Chapter 35) and text generation (see Chapter 29). To select a representative sample of retrieved images that are relevant to the generated summary, the system makes use of image classification tools.

A more recent approach that makes use of natural language parsing techniques in combination with image retrieval techniques has been presented by UzZaman et al. (2011). Their objective, however, is not to deliver summaries in terms of multimedia documents. Instead they focus on the generation of multimedia diagrams that combine compressed text with images retrieved from Wikipedia.

While the approaches just mentioned assume the existence of linguistic channels, systems, such as ROCCO II, which generates natural-language commentaries for games of the RoboCup simulator league, start from visual information and transform it into natural language (André et al. 2000). Here, the basic idea is to perform a higher-level analysis of the visual scene in order to recognize conceptual units at a higher level of abstraction, such as spatial relations or typical motion patterns.
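
The step from low-level scene data to conceptual units and from there to natural language can be sketched as follows. The code is not the method used in ROCCO II: the input format (one ball holder per time step), the two event types, and the templates are assumptions chosen to keep the example small.

```python
def recognise_events(possession, team_of):
    """Derive higher-level events from a sequence of ball holders.
    `None` entries denote time steps in which nobody controls the ball."""
    events = []
    previous = None
    for holder in possession:
        if holder is None:
            continue
        if previous is not None and holder != previous:
            same_team = team_of[holder] == team_of[previous]
            kind = "pass" if same_team else "interception"
            events.append((kind, previous, holder))
        previous = holder
    return events


# Hypothetical templates for verbalizing the recognized events.
TEMPLATES = {
    "pass": "{a} passes to {b}.",
    "interception": "{b} takes the ball from {a}.",
}


def verbalise(events):
    return [TEMPLATES[kind].format(a=a, b=b) for kind, a, b in events]


team_of = {"Red 7": "red", "Red 9": "red", "Blue 4": "blue"}
possession = ["Red 7", "Red 7", None, "Red 9", "Blue 4"]
for sentence in verbalise(recognise_events(possession, team_of)):
    print(sentence)
```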

5 Conclusions

Multimodal/multimedia systems pose significant challenges for natural-language processing, which traditionally focuses on the analysis or generation of one input or output modality/medium only. A key observation of this chapter is that methods for natural-language processing may be extended in such a way that they become useful for the broader context of multimodal/multimedia systems as well. While unification-based grammars have proven useful for modality orchestration and analysis, text planning methods have been successfully applied to multimedia content selection and structuring. Work done in the area of multimedia information retrieval demonstrates that the integration of natural-language methods enables a deeper analysis of the underlying multimedia information and thus leads to better search results.

The evolution of multimodal/multimedia systems is evidence of the trend away from procedural approaches towards more declarative approaches, which maintain explicit representations of the syntax and semantics of multimodal/multimedia input and output. While earlier systems make use of separate components for processing multiple modalities/media, and are only able to integrate and coordinate modalities/media to a limited extent, more recent approaches are based on a unified view of language and rely on a common representation formalism for the single modalities/media. Furthermore, there is a trend towards natural multimodal interaction in situated environments. This development is supported by new sensors that allow us to capture multimodal user data in an unobtrusive manner.

Further Reading and Relevant Resources

The Springer Journal on Multimodal User Interfaces (JMUI), Editor-in-Chief Jean-Claude Martin, publishes regular papers and special issues. In addition, we recommend a variety of survey papers on multimodal user interfaces. The survey by Dumas et al. (2009) focuses on guidelines and cognitive principles for multimodal interfaces. The literature overview by Jaimes and Sebe (2007) discusses technologies for body, gesture, gaze, and affective interaction. Another survey paper by Sebe (2009) analyses challenges and perspectives for multimodal user interfaces. The article by Lalanne et al. (2009) provides an overview of multimodal fusion engines.


Aho, Alfred, Shih-Fu Chang, Kathleen McKeown, Dragomir Radev, John Smith, and Kazi Zaman (1997). ‘Columbia Digital News System: An Environment for Briefing and Search over Multimedia Information’. In International Journal on Digital Libraries, 1(4): 377–385, Berlin, Heidelberg: Springer-Verlag.

André, Elisabeth, Kristian Concepcion, Inderjeet Mani, and Linda van Guilder (2005). ‘Autobriefer: A System for Authoring Narrated Briefings’. In Oliviero Stock and Massimo Zancanaro (eds), Multimodal Intelligent Information Presentation, 143–158. Dordrecht: Springer.

André, Elisabeth, Wolfgang Finkler, Winfried Graf, Thomas Rist, Anne Schauder, and Wolfgang Wahlster (1993). ‘WIP: The Automatic Synthesis of Multimodal Presentations’. In Mark T. Maybury (ed.), Intelligent Multimedia Interfaces, 75–93. Menlo Park, CA: American Association for Artificial Intelligence.

André, Elisabeth, Thomas Rist, Susanne van Mulken, Martin Klesen, and Stephan Baldes (2000). ‘The Automated Design of Believable Dialogues for Animated Presentation Teams’. In Justine Cassell, Joseph Sullivan, and Elizabeth Churchill (eds), Embodied Conversational Agents, 220–255. Cambridge, MA: MIT Press.

André, Elisabeth, Thomas Rist, and Jochen Müller (1998). ‘Integrating Reactive and Scripted Behaviors in a Life-Like Presentation Agent’. In Katia P. Sycara and Michael Wooldridge (eds), Proceedings of the Second International Conference on Autonomous Agents (AGENTS ’98), 261–268. New York: ACM.

Bangalore, Srinivas and Michael Johnston (2009). ‘Robust Understanding in Multimodal Interfaces’, Computational Linguistics 35(3): 345–397.

Bergmann, Kirsten, Hannes Rieser, and Stefan Kopp (2011). ‘Regulating Dialogue with Gestures: Towards an Empirically Grounded Simulation with Conversational Agents’. In Proceedings of the SIGDIAL 2011 Conference (SIGDIAL ’11), 88–97. Stroudsburg, PA: Association for Computational Linguistics.

Bolt, Richard A. (1980). ‘Put-That-There: Voice and Gesture at the Graphics Interface’. In ACM SIGGRAPH Computer Graphics, 14(3): 262–270. New York: ACM.

Bernsen, Niels Ole (1997). ‘Defining a Taxonomy of Output Modalities from an HCI Perspective’, Computer Standards and Interfaces 18(6–7): 537–553.

Bosma, Wauter and Elisabeth André (2004). ‘Exploiting Emotions to Disambiguate Dialogue Acts’. In Jean Vanderdonckt, Nuno Jardim Nunes, Charles Rich (eds), Proceedings of the 9th International Conference on Intelligent User Interfaces (IUI ’04), 85–92. New York: ACM.

Cao, Yujia, Frans van der Sluis, Mariët Theune, Rieks op den Akker, and Anton Nijholt (2010). ‘Evaluating Informative Auditory and Tactile Cues for In-Vehicle Information Systems’. In Anind K. Dey, Albrecht Schmidt, Susanne Boll, Andrew L. Kuhn (eds), Proceedings of the 2nd International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutomotiveUI ’10), 102–109. New York: ACM.

Cassell, Justine, Hannes Högni Vilhjálmsson, and Timothy Bickmore (2001). ‘Beat: The Behavior Expression Animation Toolkit’. In Lynn Pocock (ed.), Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’01), 477–486. New York: ACM.

Chan, Ching Hau and Gareth J. F. Jones (2005). ‘Affect-Based Indexing and Retrieval of Films’. In Hongjiang Zhang, Tat-Seng Chua, Ralf Steinmetz, Mohan Kankanhalli, Lynn Wilcox (eds), Proceedings of the 13th Annual ACM International Conference on Multimedia (MULTIMEDIA ’05), 427–430. New York: ACM.

Courgeon, Matthieu, Céline Clavel, Ning Tan, and Jean-Claude Martin (2011). ‘Front View vs. Side View of Facial and Postural Expressions of Emotions in a Virtual Character’, Transactions on Edutainment 6: 132–143.

Crook, Nigel, Debora Field, Cameron Smith, Sue Harding, Stephen Pulman, Marc Cavazza, Daniel Charlton, Roger Moore, and Johan Boye (2012). ‘Generating Context-Sensitive ECA Responses to User Barge-in Interruptions’, Journal on Multimodal User Interfaces 6: 13–25.

Dalal, Mukesh, Steven Feiner, Kathleen McKeown, Shimei Pan, Michelle X. Zhou, Tobias Höllerer, James Shaw, Yong Feng, and Jeanne Fromer (1996). ‘Negotiation for Automated Generation of Temporal Multimedia Presentations’. In Philippe Aigrain, Wendy Hall, Thomas D. C. Little, and V. Michael Bove Jr (eds), Proceedings of the 4th ACM International Conference on Multimedia, Boston, MA, 18–22 November 1996, 55–64. New York: ACM Press.

Dimitriadis, Dimitrios B. and Juergen Schroeter (2011). ‘Living Rooms Getting Smarter with Multimodal and Multichannel Signal Processing’, IEEE SLTC newsletter, July. Available online at <>.

Dumas, Bruno, Denis Lalanne, and Sharon Oviatt (2009). ‘Multimodal Interfaces: A Survey of Principles, Models and Frameworks’. In Denis Lalanne and Jürg Kohlas (eds), Human Machine Interaction, 3–26. Berlin and Heidelberg: Springer-Verlag.

Feiner, Steven and Kathleen McKeown (1991). ‘Comet: Generating Coordinated Multimedia Explanations’. In Scott P. Robertson, Gary M. Olson, and Judith S. Olson (eds), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (CHI ’91), New Orleans, LA, 27 April–2 May 1991, 449–450. New York: ACM.

Firan, Claudiu S., Mihai Georgescu, Wolfgang Nejdl, and Raluca Paiu (2010). ‘Bringing Order to your Photos: Event-Driven Classification of Flickr Images Based on Social Knowledge’. In Jimmy Huang, Nick Koudas, Gareth Jones, Xindong Wu, Kevyn Collins-Thompson, Aijun An (eds), Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM ’10), 189–198. New York: ACM.

Gebhard, Patrick, Gregor Mehlmann, and Michael Kipp (2012). ‘Visual Scenemaker: A Tool for Authoring Interactive Virtual Characters’, Journal on Multimodal User Interfaces 6: 3–11.

Gruenstein, Alexander, Jarrod Orszulak, Sean Liu, Shannon Roberts, Jeff Zabel, Bryan Reimer, Bruce Mehler, Stephanie Seneff, James R. Glass, and Joseph F. Coughlin (2009). ‘City Browser: Developing a Conversational Automotive HMI’. In Dan R. Olsen Jr, Richard B. Arthur, Ken Hinckley, Meredith Ringel Morris, Scott E. Hudson, and Saul Greenberg (eds), Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI 2009), Extended Abstracts Volume, Boston, MA, 4–9 April 2009, 4291–4296. New York: ACM.

Jaimes, Alejandro and Nicu Sebe (2007). ‘Multimodal Human Computer Interaction: A Survey’, Computer Vision and Image Understanding, 108(1–2): 116–134.

Johnston, Michael (1998). ‘Unification-based Multimodal Parsing’. In Christian Boitet and Pete Whitelock (eds), Proceedings of the International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (Coling-ACL), Montreal, Canada, 624–630. San Francisco, CA: Morgan Kaufmann Publishers and Stroudsburg, PA: Association for Computational Linguistics.

Johnston, Michael (2009). ‘Building Multimodal Applications with EMMA’. In James L. Crowley, Yuri A. Ivanov, Christopher Richard Wren, Daniel Gatica-Perez, Michael Johnston, and Rainer Stiefelhagen (eds), Proceedings of the 11th International Conference on Multimodal Interfaces (ICMI 2009), Cambridge, MA, 2–4 November 2009, 47–54. New York: ACM.

Johnston, Michael, Giuseppe Di Fabbrizio, and Simon Urbanek (2011). ‘mTalk – A Multimodal Browser for Mobile Services’. In INTERSPEECH 2011: 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011, 3261–3264. Baixas, France: ISCA.

Kaiser, Ed, Alex Olwal, David McGee, Hrvoje Benko, Andrea Corradini, Xiaoguang Li, Phil Cohen, and Steven Feiner (2003). ‘Mutual Disambiguation of 3D Multimodal Interaction in Augmented and Virtual Reality’. In Sharon Oviatt, Trevor Darrell, Mark Maybury, Wolfgang Wahlster (eds), Proceedings of the 5th International Conference on Multimodal Interfaces (ICMI ’03), 12–19. New York: ACM.

Kelleher, John D. and Geert-Jan M. Kruijff (2006). ‘Incremental Generation of Spatial Referring Expressions in Situated Dialog’. In Nicoletta Calzolari, Claire Cardie, Pierre Isabelle (eds), Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 1041–1048. Stroudsburg, PA: Association for Computational Linguistics.

Kipp, Michael, Jean-Claude Martin, Patrizia Paggio, and Dirk Heylen (2009). Multimodal Corpora: From Models of Natural Interaction to Systems and Applications, Lecture Notes in Computer Science 5509. Berlin and Heidelberg: Springer-Verlag.

Kliegr, Tomas, Krishna Chandramouli, Jan Nemrava, Vojtech Svatek, and Ebroul Izquierdo (2008). ‘Combining Image Captions and Visual Analysis for Image Concept Classification’. In Alejandro Jaimes, Jia-Yu (Tim) Pan, Maria Luisa Sapino (eds), Proceedings of the 9th International Workshop on Multimedia Data Mining: Held in Conjunction with the ACM SIGKDD 2008 (MDM ’08), 8–17. New York: ACM.

Lalanne, Denis, Laurence Nigay, Philippe Palanque, Peter Robinson, Jean Vanderdonckt, and Jean-François Ladry (2009). ‘Fusion Engines for Multimodal Input: A Survey’. In James L. Crowley, Yuri Ivanov, Christopher Wren, Daniel Gatica-Perez, Michael Johnston, Rainer Stiefelhagen (eds), Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI-MLMI ’09), 153–160. New York: ACM.

Landragin, Frédéric (2006). ‘Visual Perception, Language and Gesture: A Model for their Understanding in Multimodal Dialogue Systems’, Signal Processing 86(12): 3578–3595.

Martin, Jean-Claude, Stéphanie Buisine, Guillaume Pitel, and Niels Ole Bernsen (2006). ‘Fusion of Children’s Speech and 2D Gestures when Conversing with 3D Characters’, Signal Processing 86(12): 3596–3624.

Martinovsky, B. and D. Traum (2003). ‘Breakdown in Human-Machine Interaction: The Error is the Clue’. In Proceedings of the ISCA Tutorial and Research Workshop on Error Handling in Dialogue Systems, 11–16. Château d’Oex, Switzerland: ISCA.

Maybury, Mark T. (1993). ‘Planning Multimedia Explanations Using Communicative Acts’. In Mark T. Maybury (ed.), Intelligent Multimedia Interfaces, 75–93. Menlo Park, CA: American Association for Artificial Intelligence.

Maybury, Mark T. and Wolfgang Wahlster (1998). ‘Intelligent User Interfaces: An Introduction’. In Mark T. Maybury, Wolfgang Wahlster (eds), Readings in Intelligent User Interfaces, San Francisco, CA, USA: Morgan Kaufmann.

McKeown, Kathleen (1992). Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text, Studies in Natural Language Processing. New York: Cambridge University Press.

Mehlmann, Gregor and Elisabeth André (2012). ‘Modeling Multimodal Integration with Event Logic Charts’. In Louis-Philippe Morency, Dan Bohus, Hamid Aghajan, Justine Cassell, Anton Nijholt, Julien Epps (eds), Proceedings of the 14th ACM International Conference on Multimodal Interfaces (ICMI ’12), Santa Monica, USA, 22–26 October 2012. New York: ACM.

Merlino, Andrew and Mark Maybury (1999). ‘An Empirical Study of the Optimal Presentation of Multimedia Summaries of Broadcast News’. In I. Mani and M. Maybury (eds), Automated Text Summarization, 391–402. Cambridge, MA: MIT Press.

Merlino, Andrew, Daryl Morey, and Mark Maybury (1997). ‘Broadcast News Navigation Using Story Segmentation’. In Ephraim P. Glinert, Mark Scott Johnson, Jim Foley, and Jim Hollan (eds), Proceedings of the Fifth ACM International Conference on Multimedia ’97, 381–391. New York: ACM.

Mutlu, Bilge, Takayuki Kanda, Jodi Forlizzi, Jessica K. Hodgins, and Hiroshi Ishiguro (2012). ‘Conversational Gaze Mechanisms for Humanlike Robots’, ACM Transactions on Interactive Intelligent Systems (TiiS) 1(2): 12:1–12:33.

Oviatt, Sharon L. (1999). ‘Mutual Disambiguation of Recognition Errors in a Multimodal Architecture’. In Marian G. Williams and Mark W. Altom (eds), Proceedings of the CHI ’99 Conference on Human Factors in Computing Systems: The CHI is the Limit, Pittsburgh, PA, 15–20 May 1999, 576–583. New York: ACM.

Piper, Anne Marie and James D. Hollan (2008). ‘Supporting Medical Conversations between Deaf and Hearing Individuals with Tabletop Displays’. In Bo Begole, David W. McDonald (eds), Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work (CSCW ’08), 147–156. New York: ACM.

Reiter, Ehud and Robert Dale (1992). ‘A Fast Algorithm for the Generation of Referring Expressions’. In Proceedings of the 14th International Conference on Computational Linguistics (COLING 1992), i.232–238. Stroudsburg, PA: Association for Computational Linguistics.

Rich, Charles, Brett Ponsleur, Aaron Holroyd, and Candace L. Sidner (2010). ‘Recognizing Engagement in Human–Robot Interaction’. In Pamela Hinds, Hiroshi Ishiguro, Takayuki Kanda, Peter Kahn (eds), Proceedings of the 5th ACM/IEEE International Conference on Human–Robot Interaction (HRI ’10), 375–382. Piscataway, NJ: IEEE Press.

Sebe, Nicu (2009). ‘Multimodal Interfaces: Challenges and Perspectives’, Journal of Ambient Intelligence and Smart Environments 1(1): 23–30.

Sluis, Ielka van der and Emiel Krahmer (2007). ‘Generating Multimodal Referring Expressions’, Discourse Processes 44(3): 145–174.

Thiebaux, Marcus, Stacy Marsella, Andrew N. Marshall, and Marcelo Kallmann (2008). ‘SmartBody: Behavior Realization for Embodied Conversational Agents’. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS ’08), i.151–158. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.

UzZaman, Naushad, Jeffrey P. Bigham, and James F. Allen (2011). ‘Multimodal Summarization of Complex Sentences’. In Proceedings of the 16th International Conference on Intelligent User Interfaces (IUI ’11), 43–52. New York: ACM.

Vilhjálmsson, Hannes, Nathan Cantelmo, Justine Cassell, Nicolas E. Chafai, Michael Kipp, Stefan Kopp, Maurizio Mancini, Stacy Marsella, Andrew N. Marshall, Catherine Pelachaud, Zsofi Ruttkay, Kristinn R. Thórisson, Herwin Welbergen, and Rick J. Werf (2007). ‘The Behavior Markup Language: Recent Developments and Challenges’. In Proceedings of the 7th International Conference on Intelligent Virtual Agents (IVA ’07), 99–111. Berlin and Heidelberg: Springer-Verlag.

Vogt, Thurid, Elisabeth André, and Nikolaus Bee (2008). ‘Emovoice: A Framework for Online Recognition of Emotions from Voice’. In Elisabeth André, Laila Dybkjær, Wolfgang Minker, Heiko Neumann, Roberto Pieraccini, and Michael Weber (eds), Perception in Multimodal Dialogue Systems: Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems (PIT 2008), Kloster Irsee, Germany, 16–18 June 2008, Lecture Notes in Computer Science 5078, 188–199. Berlin and Heidelberg: Springer-Verlag.

Wahlster, Wolfgang (2003). ‘Towards Symmetric Multimodality: Fusion and Fission of Speech, Gesture, and Facial Expression’. In Andreas Günter, Rudolf Kruse, and Bernd Neumann (eds), KI 2003: Advances in Artificial Intelligence, Proceedings of the 26th Annual German Conference on AI, Hamburg, Germany, 15–18 September 2003, Lecture Notes in Computer Science 2821, 1–18. Berlin and Heidelberg: Springer-Verlag.