SwePub

Search results for the query "WFRF:(Beskow Jonas)"

  • Results 1-50 of 186
2.
  • Agelfors, Eva, et al. (författare)
  • Synthetic visual speech driven from auditory speech
  • 1999
  • Ingår i: Proceedings of Audio-Visual Speech Processing (AVSP'99).
  • Konferensbidrag (refereegranskat)abstract
    • We have developed two different methods for using auditory, telephone speech to drive the movements of a synthetic face. In the first method, Hidden Markov Models (HMMs) were trained on a phonetically transcribed telephone speech database. The output of the HMMs was then fed into a rule-based visual speech synthesizer as a string of phonemes together with time labels. In the second method, Artificial Neural Networks (ANNs) were trained on the same database to map acoustic parameters directly to facial control parameters. These target parameter trajectories were generated by using phoneme strings from a database as input to the visual speech synthesis. The two methods were evaluated through audiovisual intelligibility tests with ten hearing-impaired persons, and compared to “ideal” articulations (where no recognition was involved), to a natural face, and to the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio-alone condition (54% and 34% keywords correct, respectively), but not as well as the “ideal” articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.
  •  
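The ANN method summarised in the abstract above maps acoustic parameters frame by frame onto facial control parameters. Purely as an illustration of that idea, the sketch below trains a small feed-forward regressor on synthetic stand-in data; the feature dimensions, network size and data are assumptions of this example, not details taken from the paper.

```python
# Illustrative sketch only: a small feed-forward network mapping acoustic
# feature vectors to facial control parameters, in the spirit of the ANN
# method described above. Dimensions and data are assumed, not from the paper.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

n_frames = 5000
n_acoustic = 16      # e.g. spectral coefficients per frame (assumed)
n_facial = 10        # e.g. lip/jaw control parameters per frame (assumed)

# Stand-in training data: in the paper this would come from a phonetically
# transcribed speech database with target facial-parameter trajectories.
X = rng.normal(size=(n_frames, n_acoustic))
true_map = rng.normal(size=(n_acoustic, n_facial))
Y = np.tanh(X @ true_map) + 0.05 * rng.normal(size=(n_frames, n_facial))

net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X, Y)

# At synthesis time, each incoming acoustic frame is mapped directly to
# facial control parameters that drive the synthetic face.
new_frames = rng.normal(size=(5, n_acoustic))
facial_params = net.predict(new_frames)
print(facial_params.shape)   # (5, 10)
```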
4.
  • Agelfors, Eva, et al. (författare)
  • User evaluation of the SYNFACE talking head telephone
  • 2006
  • Ingår i: Computers Helping People With Special Needs, Proceedings. - Berlin, Heidelberg : Springer Berlin Heidelberg. - 3540360204 ; , s. 579-586
  • Konferensbidrag (refereegranskat)abstract
    • The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.
  •  
5.
  • Al Moubayed, Samer, et al. (författare)
  • A novel Skype interface using SynFace for virtual speech reading support
  • 2011
  • Ingår i: Proceedings from Fonetik 2011, June 8 - June 10, 2011. - Stockholm, Sweden. ; , s. 33-36
  • Konferensbidrag (övrigt vetenskapligt/konstnärligt)abstract
    • We describe in this paper a support client interface to the IP telephony application Skype. The system uses a variant of SynFace, a real-time speech reading support system using facial animation. The new interface is designed for use by elderly persons, and tailored for use in systems supporting touch screens. The SynFace real-time facial animation system has previously shown the ability to enhance speech comprehension for hearing-impaired persons. In this study we employ at-home field studies on five subjects in the EU project MonAMI. We present insights from interviews with the test subjects on the advantages of the system, and on the limitations of such real-time speech reading technology in reaching the homes of the elderly and the hard of hearing.
  •  
6.
  • Al Moubayed, Samer, et al. (författare)
  • A robotic head using projected animated faces
  • 2011
  • Ingår i: Proceedings of the International Conference on Audio-Visual Speech Processing 2011. - Stockholm : KTH Royal Institute of Technology. ; , s. 71-
  • Konferensbidrag (refereegranskat)abstract
    • This paper presents a setup which employs virtual animated agents for robotic heads. The system uses a laser projector to project animated faces onto a three-dimensional face mask. This approach of projecting animated faces onto a three-dimensional head surface, as an alternative to using flat, two-dimensional surfaces, eliminates several deteriorating effects and illusions that come with flat surfaces and that hinder interaction purposes such as exclusive mutual gaze and situated, multi-partner dialogues. In addition to that, it provides robotic heads with a flexible solution for facial animation which takes advantage of the advancements of facial animation using computer graphics over mechanically controlled heads.
  •  
7.
  • Al Moubayed, Samer, et al. (författare)
  • Animated Faces for Robotic Heads : Gaze and Beyond
  • 2011
  • Ingår i: Analysis of Verbal and Nonverbal Communication and Enactment. - Berlin, Heidelberg : Springer Berlin/Heidelberg. - 9783642257742 ; , s. 19-35
  • Konferensbidrag (refereegranskat)abstract
    • We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly referred to as the Mona Lisa gaze effect. This effect results from the use of 2D surfaces to display 3D images and causes the gaze of a portrait to seemingly follow the observer no matter where it is viewed from. The experiment investigates the perception of gaze direction by observers. The analysis shows that the 3D model eliminates the effect, and provides an accurate perception of gaze direction. We discuss at the end the different requirements of gaze in interactive systems, and explore the different settings these findings give access to.
  •  
8.
  • Al Moubayed, Samer, et al. (författare)
  • Audio-Visual Prosody : Perception, Detection, and Synthesis of Prominence
  • 2010
  • Ingår i: 3rd COST 2102 International Training School on Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. - Berlin, Heidelberg : Springer Berlin Heidelberg. - 9783642181832 ; , s. 55-71
  • Konferensbidrag (refereegranskat)abstract
    • In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted, where speech quality is acoustically degraded and then the speech is presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to how they look at a natural face when gestures are coupled with pitch movements, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.
  •  
9.
  • Al Moubayed, Samer, et al. (författare)
  • Auditory visual prominence : From intelligibility to behavior
  • 2009
  • Ingår i: Journal on Multimodal User Interfaces. - : Springer Science and Business Media LLC. - 1783-7677 .- 1783-8738. ; 3:4, s. 299-309
  • Tidskriftsartikel (refereegranskat)abstract
    • Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded and the fundamental frequency is removed from the signal, then the speech is presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to how they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.
  •  
10.
  • Al Moubayed, Samer, 1982- (författare)
  • Bringing the avatar to life : Studies and developments in facial communication for virtual agents and robots
  • 2012
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • The work presented in this thesis comes in pursuit of the ultimate goal of building spoken and embodied human-like interfaces that are able to interact with humans on human terms. Such interfaces need to employ the subtle, rich and multidimensional signals of communicative and social value that complement the stream of words – signals humans typically use when interacting with each other.
The studies presented in the thesis concern facial signals used in spoken communication, and can be divided into two connected groups. The first is targeted towards exploring and verifying models of facial signals that come in synchrony with speech and its intonation. We refer to this as visual prosody, and as part of visual prosody, we take prominence as a case study. We show that the use of prosodically relevant gestures in animated faces results in a more expressive and human-like behaviour. We also show that animated faces supported with these gestures result in more intelligible speech, which in turn can be used to aid communication, for example in noisy environments.
The other group of studies targets facial signals that complement speech. Spoken language is a relatively poor system for the communication of spatial information, since such information is visual in nature. Hence, the use of visual movements of spatial value, such as gaze and head movements, is important for an efficient interaction. The use of such signals is especially important when the interaction between the human and the embodied agent is situated – that is, when they share the same physical space and this space is taken into account in the interaction.
We study the perception, the modelling, and the interaction effects of gaze and head pose in regulating situated and multiparty spoken dialogues in two conditions. The first is the typical case where the animated face is displayed on a flat surface, and the second is where it is displayed on a physical three-dimensional model of a face. The results from the studies show that projecting the animated face onto a face-shaped mask results in an accurate perception of the direction of gaze generated by the avatar, and hence can allow for the use of these movements in multiparty spoken dialogue.
Driven by these findings, the Furhat back-projected robot head was developed. Furhat employs state-of-the-art facial animation that is projected on a 3D printout of that face, and a neck to allow for head movements. Although the mask in Furhat is static, the fact that the animated face matches the design of the mask results in a physical face that is perceived to “move”. We present studies that show how this technique renders a more intelligible, human-like and expressive face. We further present experiments in which Furhat is used as a tool to investigate properties of facial signals in situated interaction.
Furhat is built to study, implement, and verify models of situated, multiparty, multimodal human-machine spoken dialogue, a study that requires that the face is physically situated in the interaction environment rather than on a two-dimensional screen. It has also received much interest from several communities, and has been showcased at several venues, including a robot exhibition at the London Science Museum. We present an evaluation study of Furhat at the exhibition, where it interacted with several thousand persons in a multiparty conversation. The analysis of the data from the setup further shows that Furhat can accurately regulate multiparty interaction using gaze and head movements.
  •  
11.
  • Al Moubayed, Samer, et al. (författare)
  • Effects of Visual Prominence Cues on Speech Intelligibility
  • 2009
  • Ingår i: Proceedings of Auditory-Visual Speech Processing AVSP'09. - Norwich, England.
  • Konferensbidrag (refereegranskat)abstract
    • This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permuted into different gestural conditions, were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally accented (prominent) words are supplemented with head-nods or with eyebrow-raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of non-verbal movements in the visual modality to support audio-visual speech perception.
  •  
12.
  • Al Moubayed, Samer, 1982-, et al. (författare)
  • Furhat : A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction
  • 2012
  • Ingår i: Cognitive Behavioural Systems. - Berlin, Heidelberg : Springer Berlin/Heidelberg. - 9783642345838 ; , s. 114-130
  • Konferensbidrag (refereegranskat)abstract
    • In this chapter, we first present a summary of findings from two previous studies on the limitations of using flat displays with embodied conversational agents (ECAs) in the context of face-to-face human-agent interaction. We then motivate the need for a three-dimensional display of faces to guarantee accurate delivery of gaze and directional movements, and present Furhat, a novel, simple, highly effective, and human-like back-projected robot head that utilizes computer animation to deliver facial movements and is equipped with a pan-tilt neck. After presenting a detailed summary of why and how Furhat was built, we discuss the advantages of using optically projected animated agents for interaction. We discuss using such agents in terms of situatedness, environment, context awareness, and social, human-like face-to-face interaction with robots where subtle nonverbal and social facial signals can be communicated. At the end of the chapter, we present a recent application of Furhat as a multimodal multiparty interaction system that was presented at the London Science Museum as part of a robot festival. We conclude the paper by discussing future developments, applications and opportunities of this technology.
  •  
13.
  • Al Moubayed, Samer, et al. (författare)
  • Furhat goes to Robotville: a large-scale multiparty human-robot interaction data collection in a public space
  • 2012
  • Ingår i: Proc of LREC Workshop on Multimodal Corpora. - Istanbul, Turkey.
  • Konferensbidrag (refereegranskat)abstract
    • In the four days of the Robotville exhibition at the London Science Museum, UK, during which the back-projected head Furhat in a situated spoken dialogue system was seen by almost 8 000 visitors, we collected a database of 10 000 utterances spoken to Furhat in situated interaction. The data collection is an example of a particular kind of corpus collection of human-machine dialogues in public spaces that has several interesting and specific characteristics, both with respect to the technical details of the collection and with respect to the resulting corpus contents. In this paper, we take the Furhat data collection as a starting point for a discussion of the motives for this type of data collection, its technical peculiarities and prerequisites, and the characteristics of the resulting corpus.
  •  
14.
  • Al Moubayed, Samer, et al. (författare)
  • Human-robot Collaborative Tutoring Using Multiparty Multimodal Spoken Dialogue
  • 2014
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots which are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task, and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies, such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used to build a situated model of the interaction based on the participants' personalities, their state of attention, their conversational engagement and verbal dominance, and how these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we also show the detailed design methodologies for an affective and multimodally rich dialogue system that allows the robot to measure incrementally the attention states and the dominance of each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation that attempts to maximize the agreement and the contribution to solving the task. This project sets the first steps towards exploring the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team-building, and collaborative task-solving applications.
  •  
15.
  • Al Moubayed, Samer, et al. (författare)
  • Lip-reading : Furhat audio visual intelligibility of a back projected animated face
  • 2012
  • Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin/Heidelberg. ; , s. 196-203
  • Konferensbidrag (refereegranskat)abstract
    • Back-projecting a computer-animated face onto a three-dimensional static physical model of a face is a promising technology that is gaining ground as a solution to building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back-projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; we then motivate the need to investigate the contribution to speech intelligibility that Furhat's face offers. We present an audio-visual speech intelligibility experiment, in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility from lip reading a face visualized on a 2D screen with that from a 3D back-projected face, and from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations that back-projected animated face models bring about, their audio-visual speech intelligibility is equal, or even higher, compared to the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception for 3D-projected faces.
  •  
16.
  • Al Moubayed, Samer, et al. (författare)
  • Multimodal Multiparty Social Interaction with the Furhat Head
  • 2012
  • Konferensbidrag (refereegranskat)abstract
    • We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.
  •  
17.
  • Al Moubayed, Samer, et al. (författare)
  • Perception of Nonverbal Gestures of Prominence in Visual Speech Animation
  • 2010
  • Ingår i: Proceedings of the ACM/SSPNET 2nd International Symposium on Facial Analysis and Animation. - Edinburgh, UK : ACM. - 9781450305228 ; , s. 25-
  • Konferensbidrag (refereegranskat)abstract
    • It has long been recognized that visual speech information is important for speech perception [McGurk and MacDonald 1976] [Summerfield 1992]. Recently there has been an increasing interest in the verbal and non-verbal interaction between the visual and the acoustic modalities from production and perception perspectives. One of the prosodic phenomena which attracts much focus is prominence. Prominence is defined as when a linguistic segment is made salient in its context.
  •  
18.
  • Al Moubayed, Samer, et al. (författare)
  • Prominence Detection in Swedish Using Syllable Correlates
  • 2010
  • Ingår i: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010. - Makuhari, Japan. - 9781617821233 ; , s. 1784-1787
  • Konferensbidrag (refereegranskat)abstract
    • This paper presents an approach to estimating word-level prominence in Swedish using syllable-level features. The paper discusses the mismatch problem of annotations between word-level perceptual prominence and its acoustic correlates, context, and data scarcity. 200 sentences are annotated by 4 speech experts with prominence on 3 levels. A linear model for feature extraction is proposed based on syllable-level features, and the weights of these features are optimized to match word-level annotations. We show that using syllable-level features and estimating weights for the acoustic correlates to minimize the word-level estimation error gives better detection accuracy compared to word-level features, and that both feature sets exceed the baseline accuracy.
  •  
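The core of the approach in the entry above is a linear model whose feature weights are optimised so that syllable-level acoustic correlates reproduce word-level prominence annotations. The following sketch illustrates one plausible reading of that setup with a least-squares fit on synthetic data; the feature set, the per-word aggregation and the data are assumptions of this example rather than the published model.

```python
# Minimal sketch of fitting weights for syllable-level acoustic correlates so
# that a weighted combination matches word-level prominence annotations, in the
# spirit of the paper above. Features, aggregation and data are illustrative.
import numpy as np

rng = np.random.default_rng(1)

features = ["duration", "f0_range", "intensity", "spectral_emphasis"]  # assumed
n_words = 200

# Synthetic stand-in: each word has 1-4 syllables with per-syllable features,
# and a perceptual prominence annotation on a 3-level scale (0, 1, 2).
labels = rng.integers(0, 3, size=n_words)
word_feats = []
for lab in labels:
    syllables = rng.normal(loc=lab, scale=1.0,
                           size=(rng.integers(1, 5), len(features)))
    word_feats.append(syllables.max(axis=0))   # aggregate syllables to word level
X = np.vstack(word_feats)
X = np.hstack([X, np.ones((n_words, 1))])      # bias term

# Least-squares weights minimizing the word-level estimation error.
w, *_ = np.linalg.lstsq(X, labels.astype(float), rcond=None)
pred = np.rint(X @ w).clip(0, 2)
print("weights:", dict(zip(features + ["bias"], np.round(w, 2))))
print("accuracy:", (pred == labels).mean())
```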
19.
  • Al Moubayed, Samer, et al. (författare)
  • Spontaneous spoken dialogues with the Furhat human-like robot head
  • 2014
  • Ingår i: HRI '14 Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction. - Bielefeld, Germany : ACM. ; , s. 326-
  • Konferensbidrag (refereegranskat)abstract
    • We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK [4] dialogue authoring toolkit developed at KTH. The system will also be able to perform as a moderator in a quiz game, showing different strategies for regulating spoken situated interactions.
  •  
20.
  • Al Moubayed, Samer, et al. (författare)
  • Studies on Using the SynFace Talking Head for the Hearing Impaired
  • 2009
  • Ingår i: Proceedings of Fonetik'09. - Stockholm : Stockholm University. - 9789163348921 ; , s. 140-143
  • Konferensbidrag (övrigt vetenskapligt/konstnärligt)abstract
    • SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large-scale hearing-impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing-impaired people, where groups of hearing-impaired subjects with different impairment levels, from mild to severe and cochlear implants, are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. But looking at the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech with stereo babble noise.
  •  
21.
  • Al Moubayed, Samer, et al. (författare)
  • SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech
  • 2008
  • Ingår i: Proceedings of The second Swedish Language Technology Conference (SLTC). - Stockholm, Sweden.. ; , s. 3-6
  • Konferensbidrag (övrigt vetenskapligt/konstnärligt)abstract
    • In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing impaired. Preliminary results show high recognition accuracy compared to other languages.
  •  
22.
  • Al Moubayed, Samer, et al. (författare)
  • Talking with Furhat - multi-party interaction with a back-projected robot head
  • 2012
  • Ingår i: Proceedings of Fonetik 2012. - Gothenburg, Sweden. ; , s. 109-112
  • Konferensbidrag (övrigt vetenskapligt/konstnärligt)abstract
    • This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.
  •  
23.
  • Al Moubayed, Samer, et al. (författare)
  • Taming Mona Lisa : communicating gaze faithfully in 2D and 3D facial projections
  • 2012
  • Ingår i: ACM Transactions on Interactive Intelligent Systems. - : Association for Computing Machinery (ACM). - 2160-6455 .- 2160-6463. ; 1:2, s. 25-
  • Tidskriftsartikel (refereegranskat)abstract
    • The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five-subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.
  •  
24.
  • Al Moubayed, Samer, et al. (författare)
  • The Furhat Back-Projected Humanoid Head : Lip Reading, Gaze And Multi-Party Interaction
  • 2013
  • Ingår i: International Journal of Humanoid Robotics. - 0219-8436. ; 10:1, s. 1350005-
  • Tidskriftsartikel (refereegranskat)abstract
    • In this paper, we present Furhat - a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented where we investigate how the head might facilitate human-robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to an animated agent presented on a flat screen, as well as to a human face. Second, we investigate the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, where Furhat and a human are sitting around a table. The accuracy of the perception of Furhat's gaze is measured depending on eye design, head movement and viewing angle. Third, we investigate the turn-taking accuracy of Furhat in a multi-party interactive setting, as compared to an animated agent on a flat screen. We conclude with some observations from a public setting at a museum, where Furhat interacted with thousands of visitors in a multi-party interaction.
  •  
25.
  • Al Moubayed, Samer, et al. (författare)
  • The Furhat Social Companion Talking Head
  • 2013
  • Ingår i: Interspeech 2013 - Show and Tell. ; , s. 747-749
  • Konferensbidrag (refereegranskat)abstract
    • In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkit designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. The demonstrator will present a social dialogue system with Furhat that allows for several simultaneous interlocutors, and takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed-initiative dialogue, using state-of-the-art speech synthesis with rich prosody, lip-animated facial synthesis, eye and head movements, and gestures.
  •  
26.
  • Al Moubayed, Samer, et al. (författare)
  • Tutoring Robots: Multiparty Multimodal Social Dialogue With an Embodied Tutor
  • 2014
  • Konferensbidrag (refereegranskat)abstract
    • This project explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. With the participants sits a tutor that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals, captured and auto-synchronized by different audio-visual capture technologies, were coupled with manual annotations to build a situated model of the interaction based on the participants' personalities, their temporally changing state of attention, their conversational engagement and verbal dominance, and the way these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. At the end of this chapter we discuss the potential areas of research and development this work opens up and some of the challenges that lie on the road ahead.
  •  
27.
  • Al Moubayed, Samer, et al. (författare)
  • Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting
  • 2009
  • Ingår i: INTERSPEECH 2009. - BAIXAS : ISCA-INST SPEECH COMMUNICATION ASSOC. ; , s. 1443-1446
  • Konferensbidrag (refereegranskat)abstract
    • In this paper we present recent results on the development of the SynFace lip synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when using SynFace by hearing impaired people. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many subjects benefit from SynFace especially with speech with stereo babble noise.
  •  
28.
  • Alexanderson, Simon, et al. (författare)
  • Animated Lombard speech : Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions
  • 2014
  • Ingår i: Computer speech & language (Print). - : Elsevier BV. - 0885-2308 .- 1095-8363. ; 28:2, s. 607-618
  • Tidskriftsartikel (refereegranskat)abstract
    • In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.
  •  
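The error-minimisation procedure mentioned in the abstract above, which drives the animated talker to match the captured performance, can be pictured as a per-frame least-squares fit of animation controls to marker positions. The sketch below shows that idea with a toy linear blendshape model; the basis, dimensions and data are assumptions of this example and not the paper's actual rig.

```python
# Illustrative sketch of the error-minimisation idea: per frame, find linear
# blendshape weights so that the model's marker positions match the captured
# markers as closely as possible in a least-squares sense. All data are synthetic.
import numpy as np

rng = np.random.default_rng(2)

n_markers, n_blendshapes, n_frames = 30, 8, 100
neutral = rng.normal(size=(n_markers, 3))
basis = rng.normal(scale=0.1, size=(n_blendshapes, n_markers, 3))  # per-shape offsets

# Synthetic "capture": random weights in [0, 1] plus measurement noise.
true_w = rng.uniform(0, 1, size=(n_frames, n_blendshapes))
captured = neutral + np.einsum("fb,bmd->fmd", true_w, basis)
captured += 0.01 * rng.normal(size=captured.shape)

A = basis.reshape(n_blendshapes, -1).T          # (3*n_markers, n_blendshapes)
fitted = np.empty_like(true_w)
for f in range(n_frames):
    b = (captured[f] - neutral).reshape(-1)
    fitted[f], *_ = np.linalg.lstsq(A, b, rcond=None)

print("mean absolute weight error:", np.abs(fitted - true_w).mean())
```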
29.
  • Alexanderson, Simon, et al. (författare)
  • Aspects of co-occurring syllables and head nods in spontaneous dialogue
  • 2013
  • Ingår i: Proceedings of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013). - : The International Society for Computers and Their Applications (ISCA). ; , s. 169-172
  • Konferensbidrag (refereegranskat)abstract
    • This paper reports on the extraction and analysis of head nods taken from motion capture data of spontaneous dialogue in Swedish. The head nods were extracted automatically and then manually classified in terms of gestures having a beat function or multifunctional gestures. Prosodic features were extracted from syllables co-occurring with the beat gestures. While the peak rotation of the nod is on average aligned with the stressed syllable, the results show considerable variation in fine temporal synchronization. The syllables co-occurring with the gestures generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. A functional analysis shows that the majority of the syllables belong to words bearing a focal accent.
  •  
30.
  • Alexanderson, Simon, et al. (författare)
  • Automatic annotation of gestural units in spontaneous face-to-face interaction
  • 2016
  • Ingår i: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction. - New York, NY, USA : ACM. - 9781450345620 ; , s. 15-19
  • Konferensbidrag (refereegranskat)abstract
    • Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.
  •  
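The paper above models the gestural flow with a two-level hierarchical HMM whose sub-states correspond to gesture phases. As a greatly simplified stand-in, the sketch below fits a flat Gaussian HMM (via hmmlearn) to a synthetic motion-feature stream and turns the decoded state sequence into segments; the feature choice and state count are assumptions of this example, not the published model.

```python
# Simplified stand-in for the gesture segmentation idea above: a flat Gaussian
# HMM decodes per-frame "phase" states from motion features, and runs of equal
# states are collapsed into segments. The real system uses a 2-level HHMM.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(3)

# Synthetic stand-in for per-frame motion features (e.g. hand speed, acceleration).
rest = rng.normal(0.0, 0.2, size=(300, 2))
stroke = rng.normal(2.0, 0.5, size=(100, 2))
retract = rng.normal(1.0, 0.3, size=(80, 2))
X = np.vstack([rest, stroke, retract, rest])

model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                        n_iter=50, random_state=0)
model.fit(X)
states = model.predict(X)              # per-frame phase labels

# Collapse runs of identical states into (start_frame, end_frame, state) segments.
change = np.flatnonzero(np.diff(states)) + 1
bounds = np.concatenate([[0], change, [len(states)]])
segments = [(int(bounds[i]), int(bounds[i + 1]), int(states[bounds[i]]))
            for i in range(len(bounds) - 1)]
print(segments[:5])
```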
33.
  • Alexanderson, Simon, et al. (författare)
  • Extracting and analyzing head movements accompanying spontaneous dialogue
  • 2013
  • Ingår i: Conference Proceedings TiGeR 2013.
  • Konferensbidrag (refereegranskat)abstract
    • This paper reports on a method developed for extracting and analyzing head gestures taken from motion capture data of spontaneous dialogue in Swedish. Candidate head gestures with beat function were extracted automatically and then manually classified using a 3D player which displays time-synced audio and 3D point data of the motion capture markers together with animated characters. Prosodic features were extracted from syllables co-occurring with a subset of the classified gestures. The beat gestures show considerable variation in temporal synchronization with the syllables, while the syllables generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. Additional features for further analysis and automatic classification of the head gestures are discussed.
  •  
34.
  • Alexanderson, Simon, et al. (författare)
  • Generating coherent spontaneous speech and gesture from text
  • 2020
  • Ingår i: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020. - New York, NY, USA : Association for Computing Machinery (ACM).
  • Konferensbidrag (refereegranskat)abstract
    • Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video.
  •  
35.
  • Alexanderson, Simon, et al. (författare)
  • Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
  • 2023
  • Ingår i: ACM Transactions on Graphics. - : Association for Computing Machinery (ACM). - 0730-0301 .- 1557-7368. ; 42:4
  • Tidskriftsartikel (refereegranskat)abstract
    • Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.
  •  
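One concrete ingredient in the abstract above is classifier-free guidance, which scales how strongly the stylistic conditioning is expressed at sampling time. The sketch below shows only that guidance step with a dummy denoiser; in the paper the denoiser is a Conformer-based adaptation of DiffWave over 3D pose sequences, which is not reproduced here.

```python
# Minimal sketch of a classifier-free guidance step: combine conditional and
# unconditional noise estimates, with the scale controlling style strength.
# The denoiser below is a dummy stand-in, not the model from the paper.
import numpy as np

def dummy_denoiser(x, style):
    # Stand-in for a trained network predicting the noise in x; style=None
    # corresponds to the unconditional (style-dropped) branch.
    bias = 0.0 if style is None else 0.5
    return 0.1 * x + bias

def guided_noise_estimate(x, style, guidance_scale):
    eps_uncond = dummy_denoiser(x, None)
    eps_cond = dummy_denoiser(x, style)
    # guidance_scale = 1 reproduces the conditional model; > 1 exaggerates the
    # stylistic expression, < 1 attenuates it.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

x = np.zeros((4, 3))                      # toy "pose" batch
print(guided_noise_estimate(x, style="happy", guidance_scale=2.0))
```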
36.
  • Alexanderson, Simon, et al. (författare)
  • Mimebot—Investigating the Expressibility of Non-Verbal Communication Across Agent Embodiments
  • 2017
  • Ingår i: ACM Transactions on Applied Perception. - : Association for Computing Machinery (ACM). - 1544-3558 .- 1544-3965. ; 14:4
  • Tidskriftsartikel (refereegranskat)abstract
    • Unlike their human counterparts, artificial agents such as robots and game characters may be deployed with a large variety of face and body configurations. Some have articulated bodies but lack facial features, and others may be talking heads ending at the neck. Generally, they have many fewer degrees of freedom than humans through which they must express themselves, and there will inevitably be a filtering effect when mapping human motion onto the agent. In this article, we investigate filtering effects on three types of embodiments: (a) an agent with a body but no facial features, (b) an agent with a head only, and (c) an agent with a body and a face. We performed a full performance capture of a mime actor enacting short interactions varying the non-verbal expression along five dimensions (e.g., level of frustration and level of certainty) for each of the three embodiments. We performed a crowd-sourced evaluation experiment comparing the video of the actor to the video of an animated robot for the different embodiments and dimensions. Our findings suggest that the face is especially important to pinpoint emotional reactions but is also most volatile to filtering effects. The body motion, on the other hand, had more diverse interpretations but tended to preserve the interpretation after mapping and thus proved to be more resilient to filtering.
  •  
37.
  • Alexanderson, Simon (författare)
  • Performance, Processing and Perception of Communicative Motion for Avatars and Agents
  • 2017
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Artificial agents and avatars are designed with a large variety of face and body configurations. Some of these (such as virtual characters in films) may be highly realistic and human-like, while others (such as social robots) have considerably more limited expressive means. In both cases, human motion serves as the model and inspiration for the non-verbal behavior displayed. This thesis focuses on increasing the expressive capacities of artificial agents and avatars using two main strategies: 1) improving the automatic capturing of the most communicative areas for human communication, namely the face and the fingers, and 2) increasing communication clarity by proposing novel ways of eliciting clear and readable non-verbal behavior.
The first part of the thesis covers automatic methods for capturing and processing motion data. In paper A, we propose a novel dual-sensor method for capturing hands and fingers using optical motion capture in combination with low-cost instrumented gloves. The approach circumvents the main problems with marker-based systems and glove-based systems, and it is demonstrated and evaluated on a key-word signing avatar. In paper B, we propose a robust method for automatic labeling of sparse, non-rigid motion capture marker sets, and we evaluate it on a variety of marker configurations for finger and facial capture. In paper C, we propose an automatic method for annotating hand gestures using Hierarchical Hidden Markov Models (HHMMs).
The second part of the thesis covers studies on creating and evaluating multimodal databases with clear and exaggerated motion. The main idea is that this type of motion is appropriate for agents under certain communicative situations (such as noisy environments) or for agents with reduced expressive degrees of freedom (such as humanoid robots). In paper D, we record motion capture data for a virtual talking head with variable articulation style (normal-to-over-articulated). In paper E, we use techniques from mime acting to generate clear non-verbal expressions custom-tailored for three agent embodiments (face-and-body, face-only and body-only).
  •  
38.
  • Alexanderson, Simon, et al. (författare)
  • Real-time labeling of non-rigid motion capture marker sets
  • 2017
  • Ingår i: Computers & graphics. - : Elsevier. - 0097-8493 .- 1873-7684. ; 69:Supplement C, s. 59-67
  • Tidskriftsartikel (refereegranskat)abstract
    • Passive optical motion capture is one of the predominant technologies for capturing high fidelity human motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to the fingers and face provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to non-rigid structures. The method is especially suited for large capture volumes and sparse marker sets. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). In three experiments, we evaluate the method for labeling a variety of marker configurations for finger and facial capture. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.
  •  
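The labelling problem described above boils down to repeatedly matching predicted marker positions to incoming observations while tolerating occlusions and ghost markers. The published method uses multiple assignment hypotheses and soft decisions; the sketch below illustrates only the basic matching step with a single Hungarian assignment and a distance gate, and all numbers are assumptions of this example.

```python
# Greatly simplified sketch of frame-to-frame marker labelling: predicted marker
# positions are matched to new observations with the Hungarian algorithm, and
# matches beyond a distance gate are treated as occlusions or ghost markers.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def label_frame(predicted, observed, gate=0.05):
    """predicted: (n_labels, 3) predicted positions; observed: (m, 3) raw markers.
    Returns {label_index: observation_index} for confident matches only."""
    cost = cdist(predicted, observed)           # Euclidean distances (metres)
    rows, cols = linear_sum_assignment(cost)
    return {int(r): int(c) for r, c in zip(rows, cols) if cost[r, c] < gate}

rng = np.random.default_rng(4)
predicted = rng.uniform(size=(5, 3))
observed = predicted[[3, 0, 4, 1]] + 0.005 * rng.normal(size=(4, 3))  # marker 2 occluded
observed = np.vstack([observed, rng.uniform(size=(1, 3))])            # one ghost marker
print(label_frame(predicted, observed))
```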
39.
  • Alexanderson, Simon, et al. (författare)
  • Robust online motion capture labeling of finger markers
  • 2016
  • Ingår i: Proceedings - Motion in Games 2016. - New York, NY, USA : ACM Digital Library. - 9781450345927 ; , s. 7-13
  • Konferensbidrag (refereegranskat)abstract
    • Passive optical motion capture is one of the predominant technologies for capturing high fidelity human skeletal motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to fingers provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to the fingers of the hands. The method is especially suited for large capture volumes and sparse marker sets of 3 to 10 markers per hand. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). We evaluate the method on a collection of sparse marker sets commonly used in industry and in the research community. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.
  •  
40.
  • Alexanderson, Simon, et al. (författare)
  • Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
  • 2020
  • Ingår i: Computer graphics forum (Print). - : Wiley. - 0167-7055 .- 1467-8659. ; 39:2, s. 487-496
  • Tidskriftsartikel (refereegranskat)abstract
    • Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just like humans, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well match the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
  •  
41.
  • Alexanderson, Simon, et al. (författare)
  • Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar
  • 2015
  • Ingår i: ACM Transactions on Accessible Computing. - New York, NY, USA : Association for Computing Machinery (ACM). - 1936-7228 .- 1936-7236. ; 7:2, s. 7:1-7:17
  • Tidskriftsartikel (refereegranskat)abstract
    • Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.
  •  
42.
  • Ambrazaitis, Gilbert, et al. (författare)
  • Acoustic features of multimodal prominences : Do visual beat gestures affect verbal pitch accent realization?
  • 2017
  • Ingår i: Proceedings of The 14th International Conference on Auditory-Visual Speech Processing (AVSP2017). - Stockholm : International Speech Communication Association. - 2308-975X.
  • Konferensbidrag (refereegranskat)abstract
    • The interplay of verbal and visual prominence cues has attracted recent attention, but previous findings are inconclusive as to whether and how the two modalities are integrated in the production and perception of prominence. In particular, we do not know whether the phonetic realization of pitch accents is influenced by co-speech beat gestures, and previous findings seem to generate different predictions. In this study, we investigate acoustic properties of prominent words as a function of visual beat gestures in a corpus of read news from Swedish television. The corpus was annotated for head and eyebrow beats as well as sentence-level pitch accents. Four types of prominence cues occurred particularly frequently in the corpus: (1) pitch accent only, (2) pitch accent plus head, (3) pitch accent plus head plus eyebrows, and (4) head only. The results show that (4) differs from (1-3) in terms of a smaller pitch excursion and shorter syllable duration. They also reveal significantly larger pitch excursions in (2) than in (1), suggesting that the realization of a pitch accent is to some extent influenced by the presence of visual prominence cues. Results are discussed in terms of the interaction between beat gestures and prosody, with a potential functional difference between head and eyebrow beats.
  •  
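The comparison reported above, of acoustic prominence correlates across the four multimodal cue types, amounts to grouping measurements by condition and comparing pitch excursion and syllable duration. The sketch below shows only that analysis pattern on clearly labelled synthetic values; none of the numbers are taken from the paper.

```python
# Toy illustration of the analysis pattern: mean pitch excursion and syllable
# duration per prominence-cue condition. All values are synthetic stand-ins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
conditions = ["accent_only", "accent+head", "accent+head+eyebrows", "head_only"]
rows = []
for cond in conditions:
    n = 40
    excursion_st = rng.normal(3.5 if cond != "head_only" else 2.0, 0.8, size=n)
    duration_ms = rng.normal(220 if cond != "head_only" else 180, 25, size=n)
    rows.append(pd.DataFrame({"condition": cond,
                              "pitch_excursion_st": excursion_st,
                              "syllable_duration_ms": duration_ms}))
df = pd.concat(rows, ignore_index=True)
print(df.groupby("condition")[["pitch_excursion_st",
                               "syllable_duration_ms"]].mean().round(1))
```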
43.
  • Andréasson, Maia, 1960, et al. (författare)
  • Swedish CLARIN activities
  • 2009
  • Ingår i: Proceedings of the Nodalida 2009 workshop on CLARIN activities in the Nordic countries. NEALT Proceedings Series. - 1736-6305. ; 5, s. 1-5
  • Konferensbidrag (refereegranskat)
  •  
44.
  • Andréasson, Maia, et al. (författare)
  • Swedish CLARIN Activities
  • 2009
  • Ingår i: Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. - : Northern European Association for Language Technology (NEALT). ; , s. 1-5
  • Konferensbidrag (refereegranskat)abstract
    • Although Sweden has yet to allocate funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.
  •  
45.
  • Beskow, Jonas, et al. (författare)
  • A hybrid harmonics-and-bursts modelling approach to speech synthesis
  • 2016
  • Ingår i: Proceedings 9th ISCA Speech Synthesis Workshop, SSW 2016. - : International Speech Communication Association (ISCA). ; , s. 208-213
  • Konferensbidrag (refereegranskat)abstract
    • Statistical speech synthesis systems rely on a parametric speech generation model, typically some sort of vocoder. Vocoders are great for voiced speech because they offer independent control over voice source (e.g. pitch) and vocal tract filter (e.g. vowel quality) through control parameters that typically vary smoothly in time and lend themselves well to statistical modelling. Voiceless sounds and transients such as plosives and fricatives on the other hand exhibit fundamentally different spectro-temporal behaviour. Here the benefits of the vocoder are not as clear. In this paper, we investigate a hybrid approach to modelling the speech signal, where speech is decomposed into a harmonic part and a noise burst part through spectrogram kernel filtering. The harmonic part is modelled using a vocoder and statistical parameter generation, while the burst part is modelled by concatenation. The two channels are then mixed together to form the final synthesized waveform. The proposed method was compared against a state-of-the-art statistical speech synthesis system (HTS 2.3) in a perceptual evaluation, which revealed that the harmonics-plus-bursts method was perceived as significantly more natural than the purely statistical variant.
  •  
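The decomposition step in the entry above, splitting speech into a harmonic channel and a burst/noise channel before resynthesis, can be approximated with off-the-shelf harmonic/percussive separation. The sketch below uses librosa's median-filter-based separation as a stand-in for the paper's spectrogram kernel filtering; the file name is a placeholder, and the vocoding and concatenation stages are not shown.

    import librosa
    import soundfile as sf

    # Load an utterance (placeholder file name) and split it into a harmonic
    # part and a percussive/burst-like part. In the paper, the harmonic part
    # would be vocoded and statistically modelled while the burst part would
    # be handled by concatenation; here we only verify the split-and-remix.
    y, sr = librosa.load("utterance.wav", sr=None)
    y_harmonic, y_burst = librosa.effects.hpss(y)

    # Mixing the two channels back together should closely recover the input.
    sf.write("utterance_remixed.wav", y_harmonic + y_burst, sr)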
46.
  • Beskow, Jonas, et al. (författare)
  • A Model for Multimodal Dialogue System Output Applied to an Animated Talking Head
  • 2005
  • Ingår i: SPOKEN MULTIMODAL HUMAN-COMPUTER DIALOGUE IN MOBILE ENVIRONMENTS. - Dordrecht : Springer. - 9781402030758 ; , s. 93-113
  • Bokkapitel (refereegranskat)abstract
    • We present a formalism for specifying verbal and non-verbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multimodal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.
  •  
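The formalism in the entry above specifies communicative functions rather than concrete realisations, so each output device maps the same specification to its own behaviour. The toy example below illustrates that idea with an invented XML layout and an invented talking-head realiser; it does not reproduce the actual AdApt schema.

    import xml.etree.ElementTree as ET

    # Invented output specification: what to say plus abstract communicative
    # functions, with no device-specific realisation details.
    SPEC = """
    <output>
      <utterance text="The train leaves at ten fifteen.">
        <function type="emphasis" words="ten fifteen"/>
        <function type="turn_end"/>
      </utterance>
    </output>
    """

    def realise_for_talking_head(spec_xml):
        """One possible realiser: map abstract functions to facial gestures."""
        root = ET.fromstring(spec_xml)
        for utt in root.iter("utterance"):
            print("say:", utt.get("text"))
            for fn in utt.iter("function"):
                if fn.get("type") == "emphasis":
                    print("  gesture: eyebrow raise on", repr(fn.get("words")))
                elif fn.get("type") == "turn_end":
                    print("  gesture: nod and shift gaze to the user")

    realise_for_talking_head(SPEC)

A plain audio-only device could consume the same specification and simply ignore the gesture-oriented functions, which is the point of separating function from realisation.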
47.
  • Beskow, Jonas, et al. (författare)
  • Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents
  • 2007
  • Ingår i: VERBAL AND NONVERBAL COMMUNICATION BEHAVIOURS. - BERLIN : SPRINGER-VERLAG BERLIN. - 9783540764410 ; , s. 250-263
  • Konferensbidrag (refereegranskat)abstract
    • The use of animated talking agents is a novel feature of many multimodal spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. However, understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is closely related to the speech acoustics, while there are other articulatory movements affecting speech acoustics that are not visible on the outside of the face. Many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. This chapter looks into the communicative function of the animated talking agent, and its effect on intelligibility and the flow of the dialogue.
  •  
48.
  • Beskow, Jonas (författare)
  • ANIMATION OF TALKING AGENTS
  • 1997
  • Ingår i: Proceedings of International Conference on Auditory-Visual Speech Processing. - Rhodos, Greece. ; , s. 149-152
  • Konferensbidrag (övrigt vetenskapligt/konstnärligt)abstract
    • It is envisioned that autonomous software agents that can communicate using speech and gesture will soon be on everybody's computer screen. This paper describes an architecture that can be used to design and animate characters capable of lip-synchronised synthetic speech as well as body gestures, for use in for example spoken dialogue systems. A general scheme for computationally efficient parametric deformation of facial surfaces is presented, as well as techniques for generation of bimodal speech, facial expressions and body gestures in a spoken dialogue system. Results indicating that an animated cartoon-like character can be a significant contribution to speech intelligibility are also reported.
  •  
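The "computationally efficient parametric deformation of facial surfaces" mentioned in the entry above can be thought of as each articulation parameter displacing a weighted subset of mesh vertices. The sketch below is a generic, hypothetical version of such a scheme with random placeholder data; it is not the actual deformation model described in the paper.

    import numpy as np

    # Each parameter moves every vertex along a parameter-specific direction,
    # scaled by a per-vertex weight:
    #   V' = V + sum_p  value_p * weight_p[:, None] * direction_p
    def deform(vertices, values, weights, directions):
        deformed = vertices.copy()
        for value, w, d in zip(values, weights, directions):
            deformed += value * w[:, None] * d[None, :]
        return deformed

    rest_pose = np.random.rand(500, 3)            # placeholder face mesh (V, 3)
    values = np.array([0.7, 0.2])                 # e.g. jaw opening, lip rounding
    weights = np.random.rand(2, 500)              # per-parameter vertex weights
    directions = np.array([[0.0, -1.0, 0.0],      # jaw opens downwards
                           [0.0, 0.0, 1.0]])      # lips protrude forwards
    print(deform(rest_pose, values, weights, directions).shape)   # (500, 3)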
49.
  • Beskow, Jonas, et al. (författare)
  • Data-driven synthesis of expressive visual speech using an MPEG-4 talking head
  • 2005
  • Ingår i: 9th European Conference on Speech Communication and Technology. - Lisbon. ; , s. 793-796
  • Konferensbidrag (refereegranskat)abstract
    • This paper describes initial experiments with synthesis of visual speech articulation for different emotions, using a newly developed MPEG-4 compatible talking head. The basic problem with combining speech and emotion in a talking head is to handle the interaction between emotional expression and articulation in the orofacial region. Rather than trying to model speech and emotion as two separate properties, the strategy taken here is to incorporate emotional expression in the articulation from the beginning. We use a data-driven approach, training the system to recreate the expressive articulation produced by an actor while portraying different emotions. Each emotion is modelled separately using principal component analysis and a parametric coarticulation model. The results so far are encouraging but more work is needed to improve naturalness and accuracy of the synthesized speech.
  •  
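The data-driven modelling step in the entry above, a separate principal component analysis per emotion, is straightforward to sketch with scikit-learn. The array shapes and emotion labels below are placeholders, and the parametric coarticulation model that generates component trajectories from phoneme input is not shown.

    import numpy as np
    from sklearn.decomposition import PCA

    # One PCA model per emotion, fitted on motion-capture frames of facial
    # marker coordinates (here: random placeholder data, 30 markers x 3 axes).
    emotions = {"happy": np.random.rand(1000, 90),
                "angry": np.random.rand(1000, 90)}

    models = {}
    for emotion, frames in emotions.items():
        models[emotion] = PCA(n_components=6).fit(frames)
        print(emotion, "explained variance:",
              round(float(models[emotion].explained_variance_ratio_.sum()), 3))

    # Animation would be driven in the 6-dimensional component space; here we
    # just round-trip one frame through the model as a sanity check.
    z = models["happy"].transform(emotions["happy"][:1])
    print(models["happy"].inverse_transform(z).shape)   # (1, 90)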
50.
  • Beskow, Jonas, et al. (författare)
  • Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction
  • 2008
  • Ingår i: Comunicazione parlata e manifestazione delle emozioni. - 9788820740191
  • Konferensbidrag (refereegranskat)abstract
    • This paper describes a first attempt at synthesis and evaluation of expressive visual articulation using an MPEG-4 based virtual talking head. The synthesis is data-driven, trained on a corpus of emotional speech recorded using optical motion capture. Each emotion is modelled separately using principal component analysis and a parametric coarticulation model. In order to evaluate the expressivity of the data-driven synthesis, two tests were conducted. Our talking head was used in interactions with a human being in a given realistic usage context. The interactions were presented to external observers that were asked to judge the emotion of the talking head. The participants in the experiment could only hear the voice of the user, which was a pre-recorded female voice, and see and hear the talking head. The results of the evaluation, even if constrained by the results of the implementation, clearly show that the visual expression plays a relevant role in the recognition of emotions.
  •  
Typ av publikation
konferensbidrag (141)
tidskriftsartikel (24)
doktorsavhandling (7)
bokkapitel (7)
annan publikation (5)
licentiatavhandling (2)
Typ av innehåll
refereegranskat (147)
övrigt vetenskapligt/konstnärligt (39)
Författare/redaktör
Beskow, Jonas (180)
Granström, Björn (40)
Al Moubayed, Samer (30)
Alexanderson, Simon (30)
Gustafson, Joakim (28)
Edlund, Jens (25)
House, David (25)
Skantze, Gabriel (20)
Salvi, Giampiero (17)
Stefanov, Kalin (17)
Henter, Gustav Eje, ... (15)
Székely, Eva (13)
Kucherenko, Taras, 1 ... (9)
Bruce, Gösta (8)
Schötz, Susanne (8)
Oertel, Catharine (7)
Agelfors, Eva (5)
Bollepalli, Bajibabu (5)
Leite, Iolanda (5)
Johansson, Martin (4)
Enflo, Laura (4)
Henter, Gustav Eje, ... (4)
Cerrato, Loredana (4)
Hjalmarsson, Anna (4)
Hagman, Göran (3)
Kivipelto, Miia (3)
Spens, Karl-Erik (3)
Öhman, Tobias (3)
Karlsson, Inger (3)
Al Moubayed, Samer, ... (3)
Heldner, Mattias, 19 ... (2)
Colas, J. (2)
Håkansson, Krister (2)
Lundeberg, Magnus (2)
Merkel, Magnus (2)
Akenine, Ulrika (2)
Beskow, Jonas, Docen ... (2)
Tscheligi, Manfred (2)
Hussen-Abdelaziz, A. (2)
Koutsombogera, M. (2)
Novikova, J. (2)
Varol, G. (2)
van Son, Nic (2)
Ormel, Ellen (2)
Herzke, Tobias (2)
Blomberg, Mats (2)
O'Sullivan, Carol (2)
Beskow, Jonas, Profe ... (2)
O'Sullivan, C. (2)
Strömqvist, Sven (2)
Lärosäte
Kungliga Tekniska Högskolan (174)
Lunds universitet (9)
Stockholms universitet (5)
Mälardalens universitet (2)
Linköpings universitet (2)
Göteborgs universitet (1)
Umeå universitet (1)
Uppsala universitet (1)
Örebro universitet (1)
Linnéuniversitetet (1)
Språk
Engelska (186)
Forskningsämne (UKÄ/SCB)
Naturvetenskap (136)
Humaniora (22)
Teknik (21)
Samhällsvetenskap (6)
Medicin och hälsovetenskap (3)
