SwePub

Result list for the search "WFRF:(Gustafsson Joakim professor 1966)"


  • Results 1-10 of 27
1.
  • Kontogiorgos, Dimosthenis, 1987- (author)
  • Mutual Understanding in Situated Interactions with Conversational User Interfaces : Theory, Studies, and Computation
  • 2022
  • Doctoral thesis (other academic/artistic), abstract:
    • This dissertation presents advances in HCI through a series of studies focusing on task-oriented interactions between humans and between humans and machines. The notion of mutual understanding, also known as grounding in psycholinguistics, is central: in particular, how people establish understanding in conversations and what interactional phenomena are present in that process. Addressing the gap in computational models of understanding, interactions in this dissertation are observed through multisensory input and evaluated with statistical and machine-learning models. As becomes apparent, miscommunication is ordinary in human conversations, and embodied computer interfaces interacting with humans are therefore subject to a large number of conversational failures. Investigating how these interfaces can evaluate human responses to distinguish whether spoken utterances are understood is one of the central contributions of this thesis. The first papers (Papers A and B) included in this dissertation describe studies on how humans establish understanding incrementally and how they co-produce utterances to resolve misunderstandings in joint-construction tasks. Utilising the same interaction paradigm from such human-human settings, the remaining papers describe collaborative interactions between humans and machines with two central manipulations: embodiment (Papers C, D, E, and F) and conversational failures (Papers D, E, F, and G). The methods used investigate whether embodiment affects grounding behaviours among speakers and what verbal and non-verbal channels are utilised in response to and recovery from miscommunication. For application to robotics and conversational user interfaces, failure-detection systems are developed that predict user uncertainty in real time, paving the way for new multimodal computer interfaces that are aware of dialogue breakdown and system failures. Through the lens of Theory, Studies, and Computation, a comprehensive overview is presented of how mutual understanding has been observed in interactions between humans and between humans and machines. A summary of the literature on mutual understanding from psycholinguistics and human-computer interaction perspectives is reported. An overview is also presented of how prior knowledge in mutual understanding has been and can be observed through experimentation and empirical studies, along with perspectives on how knowledge acquired through observation is put into practice through the analysis and development of computational models. Derived from literature and empirical observations, the central thesis of this dissertation is that embodiment and mutual understanding are intertwined in task-oriented interactions, both in successful communication and in situations of miscommunication.
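
The failure-detection idea mentioned in the abstract above, deciding from multisensory input whether a user's response signals non-understanding, can be illustrated with a minimal classifier over windowed multimodal features. The sketch below is an assumption-laden toy: the feature set (gaze aversion, pause length, pitch rise) and the data are invented for illustration and are not the dissertation's actual feature inventory or models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one row per response window with assumed features
# [gaze-aversion ratio, pause length (s), pitch rise (semitones)].
X = np.array([[0.1, 0.2, 0.5],   # fluent, confident responses
              [0.2, 0.3, 0.1],
              [0.7, 1.5, 3.0],   # hesitant, uncertain responses
              [0.8, 1.1, 2.4]])
y = np.array([0, 0, 1, 1])       # 1 = user likely did not understand

clf = LogisticRegression().fit(X, y)

# At run time, the interface scores each incoming response window and
# can trigger a recovery strategy when P(uncertain) is high.
print(clf.predict_proba([[0.6, 1.2, 2.0]])[0, 1])
```
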
2.
  • Dalmas, T., et al. (author)
  • Introduction
  • 2014
  • In: Proceedings 2014 Workshop on Dialogue in Motion, DM 2014. - Association for Computational Linguistics (ACL).
  • Conference paper (peer-reviewed)
3.
4.
  • Jonell, Patrik, 1988- (author)
  • Scalable Methods for Developing Interlocutor-aware Embodied Conversational Agents : Data Collection, Behavior Modeling, and Evaluation Methods
  • 2022
  • Doctoral thesis (other academic/artistic), abstract:
    • This work presents several methods, tools, and experiments that contribute to the development of interlocutor-aware Embodied Conversational Agents (ECAs). Interlocutor-aware ECAs take the interlocutor's behavior into consideration when generating their own non-verbal behaviors. This thesis targets the development of such adaptive ECAs by identifying and contributing to three important and related topics: 1) Data collection methods are presented, both for large-scale crowdsourced data collection and for in-lab data collection with a large number of sensors in a clinical setting. Experiments show that experts deemed dialog data collected using a crowdsourcing method to be better for dialog generation purposes than dialog data from other commonly used sources. 2) Methods for behavior modeling are presented, where machine learning models are used to generate facial gestures for ECAs. Methods for both single-speaker and interlocutor-aware generation are presented. 3) Evaluation methods are explored: both third-party evaluation of generated gestures and interaction experiments with interlocutor-aware gesture generation are discussed. For example, an experiment is carried out investigating the social influence of a mimicking social robot. Furthermore, a method for more efficient perceptual experiments is presented. This method is validated by replicating a previously conducted perceptual experiment on virtual agents, and shows that the results obtained using the new method provide similar (in fact, more) insights into the data while requiring less of the evaluators' time. A second study compared subjective evaluations of generated gestures performed in the lab with evaluations performed via crowdsourcing, and showed no difference between the two settings. A special focus in this thesis is on scalable methods, which make it possible to efficiently and rapidly collect interaction data from a broad range of people and to efficiently evaluate the results produced by the machine learning methods. This in turn allows fast iteration when developing interlocutor-aware ECA behaviors.
5.
  • Lameris, Harm, 1997-, et al. (author)
  • Prosody-Controllable Spontaneous TTS with Neural HMMs
  • 2023
  • In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). - Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system’s capability of synthesizing two types of creaky voice.
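
The utterance-level prosody control described above can be pictured as conditioning the acoustic model on a small per-utterance control vector alongside the text encoding. The PyTorch sketch below shows that shape only; the module and the two control features are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class UtteranceProsodyConditioning(nn.Module):
    """Hypothetical utterance-level prosody conditioning.

    Projects a small vector of per-utterance prosody controls
    (e.g. mean log-F0 and speech rate) and adds it to every frame
    of the text-encoder output, so the decoder sees the requested
    prosody throughout the utterance.
    """

    def __init__(self, encoder_dim: int, n_controls: int = 2):
        super().__init__()
        self.proj = nn.Linear(n_controls, encoder_dim)

    def forward(self, encoder_out: torch.Tensor, controls: torch.Tensor):
        # encoder_out: (batch, time, encoder_dim)
        # controls:    (batch, n_controls), e.g. [mean log-F0, rate]
        bias = self.proj(controls).unsqueeze(1)  # (batch, 1, encoder_dim)
        return encoder_out + bias                # broadcast over time

# Usage: raise the requested mean F0 while keeping speech rate neutral.
cond = UtteranceProsodyConditioning(encoder_dim=256)
enc = torch.randn(1, 50, 256)                    # dummy text encoding
out = cond(enc, torch.tensor([[0.5, 0.0]]))
```
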
6.
  • Lameris, Harm, et al. (author)
  • Spontaneous Neural HMM TTS with Prosodic Feature Modification
  • 2022
  • In: Proceedings of Fonetik 2022.
  • Conference paper (other academic/artistic), abstract:
    • Spontaneous speech synthesis is a complex enterprise, as the data has large variation as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text to Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity of prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time, which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. The subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
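
The two control features named in this abstract, speech rate and fundamental frequency, can be extracted per utterance from training data before a model is conditioned on them. A minimal sketch using librosa follows; taking words per second as the speech-rate proxy is an assumption for illustration, not necessarily the paper's exact feature definition.

```python
import numpy as np
import librosa

def utterance_prosody_features(wav_path: str, transcript: str):
    """Per-utterance mean log-F0 and an approximate speech rate.

    Assumed proxies: F0 tracked with pYIN over voiced frames only,
    speech rate measured as words per second of audio.
    """
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"), sr=sr
    )
    mean_log_f0 = float(np.nanmean(np.log(f0[voiced])))  # voiced only
    rate = len(transcript.split()) / (len(y) / sr)       # words/second
    return mean_log_f0, rate
```
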
7.
  • Székely, Éva, et al. (author)
  • Breathing and Speech Planning in Spontaneous Speech Synthesis
  • 2020
  • In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). - IEEE, pp. 7649-7653.
  • Conference paper (peer-reviewed), abstract:
    • Breathing and speech planning in spontaneous speech are coordinated processes, often exhibiting disfluent patterns. While synthetic speech is not subject to respiratory needs, integrating breath into synthesis has advantages for naturalness and recall. At the same time, a synthetic voice reproducing disfluent breathing patterns learned from the data can be problematic. To address this, we first propose training stochastic TTS on a corpus of overlapping breath-group bigrams, to take context into account. Next, we introduce an unsupervised automatic annotation of likely-disfluent breath events, through a product-of-experts model that combines the output of two breath-event predictors, each using complementary information and operating in opposite directions. This annotation enables creating an automatically-breathing spontaneous speech synthesiser with a more fluent breathing style. A subjective evaluation on two spoken genres (impromptu and rehearsed) found the proposed system to be preferred over the baseline approach treating all breath events the same.
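
The product-of-experts combination described above can be read as multiplying the two predictors' per-event probabilities and renormalizing. A minimal numpy sketch under that reading follows; the two probability arrays stand in for the paper's actual breath-event predictors.

```python
import numpy as np

def product_of_experts(p_fluent_a: np.ndarray,
                       p_fluent_b: np.ndarray) -> np.ndarray:
    """Combine two binary breath-event classifiers multiplicatively.

    Each input holds P(event is fluent) per breath event from one
    expert; the experts are assumed to use complementary information.
    Returns the renormalized product P(fluent | both experts).
    """
    joint_fluent = p_fluent_a * p_fluent_b
    joint_disfluent = (1.0 - p_fluent_a) * (1.0 - p_fluent_b)
    return joint_fluent / (joint_fluent + joint_disfluent)

# Example: one confident expert and one uncertain expert.
print(product_of_experts(np.array([0.9, 0.2]), np.array([0.6, 0.3])))
```
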
8.
  • Tånnander, Christina, doctoral student, 1971-, et al. (author)
  • Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
  • 2024
  • In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. - European Language Resources Association (ELRA), pp. 14111-14121.
  • Conference paper (peer-reviewed), abstract:
    • In order to investigate the strengths and weaknesses of an Audience Response System (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirm that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses of ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.
9.
  • Wang, Siyang, 1995-, et al. (author)
  • A comparative study of self-supervised speech representations in read and spontaneous TTS
  • 2023
  • In: ICASSPW 2023. - Institute of Electrical and Electronics Engineers (IEEE).
  • Other publication (other academic/artistic), abstract:
    • Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrograms, in both read and spontaneous TTS. Our work sheds light both on how speech SSL can readily improve current TTS systems, and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr tts
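
Selecting one intermediate layer of a 12-layer wav2vec2.0 model, as the listening tests above favor, looks roughly like the following with the HuggingFace transformers API; the specific ASR-finetuned checkpoint used here is an assumed stand-in, not necessarily the one in the paper.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# ASR-finetuned 12-layer base model (assumed stand-in checkpoint).
name = "facebook/wav2vec2-base-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000,
                   return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the CNN front-end output; the transformer
# layers are indexed 1..12, so layer 9 of 12 is hidden_states[9].
layer9 = out.hidden_states[9]  # (batch, frames, 768)
```
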
10.
  • Wang, Siyang, 1994-, et al. (author)
  • Integrated Speech and Gesture Synthesis
  • 2021
  • In: ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction. - New York, NY, USA: Association for Computing Machinery (ACM), pp. 177-185.
  • Conference paper (peer-reviewed), abstract:
    • Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully designed user studies, two of which evaluate the synthesized speech and gesture in isolation, plus a combined study that evaluates the models as they will be used in real-world applications: speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against, in all three tests. The model is able to achieve this with faster synthesis time and a greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis together as a single, unified problem.
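
The single-model idea can be pictured as one shared text encoder feeding two output heads, one for acoustic frames and one for gesture motion, so the two streams stay synchronized by construction. The heavily simplified PyTorch sketch below shows only that shape; layer types and sizes are illustrative assumptions, and a real ISG system would also model durations to map tokens to frames.

```python
import torch
import torch.nn as nn

class IntegratedSpeechGesture(nn.Module):
    """Toy integrated speech-and-gesture model: one shared text
    encoder feeding two synchronized heads (acoustic + motion)."""

    def __init__(self, vocab=100, dim=256, n_mels=80, n_joints=45):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.speech_head = nn.Linear(dim, n_mels)     # acoustic frames
        self.gesture_head = nn.Linear(dim, n_joints)  # joint rotations

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        # Both heads read the same hidden states, so the two output
        # streams are aligned step-by-step by construction.
        return self.speech_head(h), self.gesture_head(h)

model = IntegratedSpeechGesture()
mel, motion = model(torch.randint(0, 100, (1, 20)))
```
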