SwePub
Search the SwePub database


Hit list for the search "WFRF:(Henter Gustav Eje Assistant Professor)"

Search: WFRF:(Henter Gustav Eje Assistant Professor)

  • Results 1-49 of 49
1.
  •  
2.
  • Jonell, Patrik, et al. (author)
  • Multimodal Capture of Patient Behaviour for Improved Detection of Early Dementia : Clinical Feasibility and Preliminary Results
  • 2021
  • In: Frontiers in Computer Science. - Frontiers Media SA. - 2624-9898. ; 3
  • Journal article (peer-reviewed). Abstract:
    • Non-invasive automatic screening for Alzheimer's disease has the potential to improve diagnostic accuracy while lowering healthcare costs. Previous research has shown that patterns in speech, language, gaze, and drawing can help detect early signs of cognitive decline. In this paper, we describe a highly multimodal system for unobtrusively capturing data during real clinical interviews conducted as part of cognitive assessments for Alzheimer's disease. The system uses nine different sensor devices (smartphones, a tablet, an eye tracker, a microphone array, and a wristband) to record interaction data during a specialist's first clinical interview with a patient, and is currently in use at Karolinska University Hospital in Stockholm, Sweden. Furthermore, complementary information in the form of brain imaging, psychological tests, speech therapist assessment, and clinical meta-data is also available for each patient. We detail our data-collection and analysis procedure and present preliminary findings that relate measures extracted from the multimodal recordings to clinical assessments and established biomarkers, based on data from 25 patients gathered thus far. Our findings demonstrate feasibility for our proposed methodology and indicate that the collected data can be used to improve clinical assessments of early dementia.
  •  
3.
  • Jonell, Patrik, 1988- (author)
  • Scalable Methods for Developing Interlocutor-aware Embodied Conversational Agents : Data Collection, Behavior Modeling, and Evaluation Methods
  • 2022
  • Doctoral thesis (other academic/artistic). Abstract:
    • This work presents several methods, tools, and experiments that contribute to the development of interlocutor-aware Embodied Conversational Agents (ECAs). Interlocutor-aware ECAs take the interlocutor's behavior into consideration when generating their own non-verbal behaviors. This thesis targets the development of such adaptive ECAs by identifying and contributing to three important and related topics: 1) Data collection methods are presented, both for large-scale crowdsourced data collection and for in-lab data collection with a large number of sensors in a clinical setting. Experiments show that experts deemed dialog data collected using a crowdsourcing method to be better for dialog generation purposes than dialog data from other commonly used sources. 2) Methods for behavior modeling are presented, where machine learning models are used to generate facial gestures for ECAs. Methods for both single-speaker and interlocutor-aware generation are presented. 3) Evaluation methods are explored, and both third-party evaluation of generated gestures and interaction experiments with interlocutor-aware gesture generation are discussed. For example, an experiment is carried out investigating the social influence of a mimicking social robot. Furthermore, a method for more efficient perceptual experiments is presented. This method is validated by replicating a previously conducted perceptual experiment on virtual agents, and shows that the results obtained using this new method provide similar (in fact, more) insights into the data, while being more efficient in terms of the time evaluators needed to spend participating in the experiment. A second study compared subjective evaluations of generated gestures performed in the lab vs. using crowdsourcing, and showed no difference between the two settings. A special focus in this thesis is given to scalable methods, which make it possible to efficiently and rapidly collect interaction data from a broad range of people and to efficiently evaluate results produced by the machine learning methods. This in turn allows for fast iteration when developing interlocutor-aware ECA behaviors.
  •  
4.
  • Lameris, Harm, 1997-, et al. (author)
  • Prosody-Controllable Spontaneous TTS with Neural HMMs
  • 2023
  • In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). - Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed). Abstract:
    • Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky voice.
  •  
5.
  • Lameris, Harm, et al. (author)
  • Spontaneous Neural HMM TTS with Prosodic Feature Modification
  • 2022
  • In: Proceedings of Fonetik 2022.
  • Conference paper (other academic/artistic). Abstract:
    • Spontaneous speech synthesis is a complex enterprise, as the data has large variation and contains speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text-to-Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time, which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
  •  
6.
  • Malisz, Zofia, et al. (author)
  • The speech synthesis phoneticians need is both realistic and controllable
  • 2019
  • In: Proceedings from FONETIK 2019. - Stockholm.
  • Conference paper (peer-reviewed). Abstract:
    • We discuss the circumstances that have led to a disjoint advancement of speech synthesis and phonetics in recent decades. The difficulties mainly rest on the pursuit of orthogonal goals by the two fields: realistic vs. controllable synthetic speech. We make a case for realising the promise of speech technologies in areas of speech sciences by developing control of neural speech synthesis and bringing the two areas into dialogue again.
  •  
7.
  • Székely, Éva, et al. (author)
  • Breathing and Speech Planning in Spontaneous Speech Synthesis
  • 2020
  • In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). - IEEE. ; pp. 7649-7653
  • Conference paper (peer-reviewed). Abstract:
    • Breathing and speech planning in spontaneous speech are coordinated processes, often exhibiting disfluent patterns. While synthetic speech is not subject to respiratory needs, integrating breath into synthesis has advantages for naturalness and recall. At the same time, a synthetic voice reproducing disfluent breathing patterns learned from the data can be problematic. To address this, we first propose training stochastic TTS on a corpus of overlapping breath-group bigrams, to take context into account. Next, we introduce an unsupervised automatic annotation of likely-disfluent breath events, through a product-of-experts model that combines the output of two breath-event predictors, each using complementary information and operating in opposite directions. This annotation enables creating an automatically-breathing spontaneous speech synthesiser with a more fluent breathing style. A subjective evaluation on two spoken genres (impromptu and rehearsed) found the proposed system to be preferred over the baseline approach treating all breath events the same.
  •  
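The product-of-experts fusion described in the entry above can be sketched in a few lines of Python. This is only an illustration of combining two per-event probability estimates multiplicatively, assuming both experts score the same binary event; the function name and the renormalisation against the complementary hypothesis are illustrative choices, not the authors' implementation.

```python
import numpy as np

def product_of_experts(p_a, p_b, eps=1e-9):
    """Fuse two per-event probability estimates by multiplying them and
    renormalising, so an event is only scored highly when both experts agree."""
    p_a, p_b = np.asarray(p_a, dtype=float), np.asarray(p_b, dtype=float)
    joint = p_a * p_b                      # both experts say "yes"
    joint_not = (1.0 - p_a) * (1.0 - p_b)  # both experts say "no"
    return joint / (joint + joint_not + eps)

# Two hypothetical experts scoring three breath events:
print(product_of_experts([0.9, 0.2, 0.6], [0.8, 0.3, 0.4]))
```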
8.
  • Székely, Éva, et al. (author)
  • Spontaneous conversational speech synthesis from found data
  • 2019
  • In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. - ISCA. ; pp. 4435-4439
  • Conference paper (peer-reviewed). Abstract:
    • Synthesising spontaneous speech is a difficult task due to disfluencies, high variability and syntactic conventions different from those of written language. Using found data, as opposed to lab-recorded conversations, for speech synthesis adds to these challenges because of overlapping speech and the lack of control over recording conditions. In this paper we address these challenges by using a speaker-dependent CNN-LSTM breath detector to separate continuous recordings into utterances, which we here apply to extract nine hours of clean single-speaker breath groups from a conversational podcast. The resulting corpus is transcribed automatically (both lexical items and filler tokens) and used to build several voices on a Tacotron 2 architecture. Listening tests show: i) pronunciation accuracy improved with phonetic input and transfer learning; ii) it is possible to create a more fluent conversational voice by training on data without filled pauses; and iii) the presence of filled pauses improved perceived speaker authenticity. Another listening test showed the found podcast voice to be more appropriate for prompts from both public speeches and casual conversations, compared to synthesis from found read speech and from a manually transcribed lab-recorded spontaneous conversation.
  •  
9.
  •  
10.
  • Wang, Siyang, 1995-, et al. (author)
  • A comparative study of self-supervised speech representations in read and spontaneous TTS
  • 2023
  • In: ICASSPW 2023. - Institute of Electrical and Electronics Engineers (IEEE).
  • Other publication (other academic/artistic). Abstract:
    • Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr tts
  •  
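Hidden-layer speech representations of the kind compared in the entry above can be extracted with the Hugging Face transformers library, as in this minimal sketch. The checkpoint name is an assumed example of a 12-layer, ASR-finetuned wav2vec 2.0 model; the paper's exact model, layer indexing, and preprocessing may differ.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-base-960h"  # assumed example checkpoint
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt).eval()

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the convolutional front-end output;
# hidden_states[9] is the output of the 9th transformer layer.
layer9 = out.hidden_states[9]
print(layer9.shape)  # (batch, frames, feature_dim)
```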
11.
  • Wang, Siyang, 1994-, et al. (author)
  • Integrated Speech and Gesture Synthesis
  • 2021
  • In: ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction. - New York, NY, USA : Association for Computing Machinery (ACM). ; pp. 177-185
  • Conference paper (peer-reviewed). Abstract:
    • Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully-designed user studies, two of which evaluate the synthesized speech and gesture in isolation, plus a combined study that evaluates the models like they will be used in real-world applications - speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against, in all three tests. The model is able to achieve this with faster synthesis time and greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis together as a single, unified problem.
  •  
12.
  • Alexanderson, Simon, et al. (author)
  • Generating coherent spontaneous speech and gesture from text
  • 2020
  • In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020. - New York, NY, USA : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed). Abstract:
    • Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video.
  •  
13.
  • Alexanderson, Simon, et al. (author)
  • Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
  • 2023
  • In: ACM Transactions on Graphics. - Association for Computing Machinery (ACM). - 0730-0301, 1557-7368. ; 42:4
  • Journal article (peer-reviewed). Abstract:
    • Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.
  •  
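The control over stylistic strength mentioned in the entry above uses classifier-free guidance. As background, the standard, general form of the guidance rule (not notation taken from the paper) blends the conditional and unconditional predictions of a single denoising network:

```latex
% Classifier-free guidance in its usual general form: s is the style/conditioning
% signal, w the guidance weight, and \varnothing denotes dropped conditioning.
\hat{\epsilon}_\theta(x_t, t, s)
  = \epsilon_\theta(x_t, t, \varnothing)
  + w \, \bigl( \epsilon_\theta(x_t, t, s) - \epsilon_\theta(x_t, t, \varnothing) \bigr)
```

Setting w = 1 recovers ordinary conditional sampling; w > 1 makes the stylistic expression more pronounced, and w < 1 weakens it.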
14.
  • Alexanderson, Simon, et al. (author)
  • Robust model training and generalisation with Studentising flows
  • 2020
  • In: Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models. ; pp. 25:1-25:9
  • Conference paper (peer-reviewed). Abstract:
    • Normalising flows are tractable probabilistic models that leverage the power of deep learning to describe a wide parametric family of distributions, all while remaining trainable using maximum likelihood. We discuss how these methods can be further improved based on insights from robust (in particular, resistant) statistics. Specifically, we propose to endow flow-based models with fat-tailed latent distributions such as multivariate Student's t, as a simple drop-in replacement for the Gaussian distribution used by conventional normalising flows. While robustness brings many advantages, this paper explores two of them: 1) We describe how using fatter-tailed base distributions can give benefits similar to gradient clipping, but without compromising the asymptotic consistency of the method. 2) We also discuss how robust ideas lead to models with reduced generalisation gap and improved held-out data likelihood. Experiments on several different datasets confirm the efficacy of the proposed approach in both regards.
  •  
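The "drop-in replacement" idea in the entry above can be illustrated with a minimal PyTorch sketch. This is not the authors' code: for brevity it uses a product of independent per-dimension Student's t distributions (the paper describes a multivariate Student's t), and the flow network itself is abstracted away.

```python
import torch
from torch.distributions import Independent, Normal, StudentT

dim, nu = 64, 5.0  # nu (degrees of freedom) is an assumed example value

# Conventional Gaussian base distribution of a normalising flow ...
gaussian_base = Independent(Normal(torch.zeros(dim), torch.ones(dim)), 1)
# ... and a fat-tailed Student's t base used as a drop-in replacement.
student_base = Independent(
    StudentT(df=torch.full((dim,), nu), loc=torch.zeros(dim), scale=torch.ones(dim)), 1
)

def flow_log_likelihood(base, z, log_det_jacobian):
    """log p(x) = log p_base(z) + log|det dz/dx| for an invertible flow x -> z."""
    return base.log_prob(z) + log_det_jacobian

z = torch.randn(8, dim)    # latents produced by some flow network (placeholder)
log_det = torch.zeros(8)   # placeholder log-Jacobian term
print(flow_log_likelihood(student_base, z, log_det).shape)  # torch.Size([8])
```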
15.
  • Alexanderson, Simon, et al. (author)
  • Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
  • 2020
  • In: Computer graphics forum (Print). - Wiley. - 0167-7055, 1467-8659. ; 39:2, pp. 487-496
  • Journal article (peer-reviewed). Abstract:
    • Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just like humans, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well match the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
  •  
16.
  • Beck, Gustavo, et al. (author)
  • Wavebender GAN : An architecture for phonetically meaningful speech manipulation
  • 2022
  • In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). - IEEE conference proceedings.
  • Conference paper (peer-reviewed). Abstract:
    • Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require, e.g., in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-processing methods. This limits the range, accuracy, and speech quality of the manipulations. Also, audible artefacts have a negative impact on the methodological validity of results in speech perception studies. This work introduces a system capable of manipulating speech properties through learning rather than design. The architecture learns to control arbitrary speech properties and leverages progress in neural vocoders to obtain realistic output. Experiments with copy synthesis and manipulation of a small set of core speech features (pitch, formants, and voice quality measures) illustrate the promise of the approach for producing speech stimuli that have accurate control and high perceptual quality.
  •  
17.
  • De Gooijer, Jan G., et al. (author)
  • Kernel-based hidden Markov conditional densities
  • 2022
  • In: Computational Statistics & Data Analysis. - Elsevier BV. - 0167-9473, 1872-7352. ; 169
  • Journal article (peer-reviewed). Abstract:
    • A natural way to obtain conditional density estimates for time series processes is to adopt a kernel-based (nonparametric) conditional density estimation (KCDE) method. To this end, the data generating process is commonly assumed to be Markovian of finite order. Markov processes, however, have limited memory range so that only the most recent observations are informative for estimating future observations, assuming the underlying model is known. Hidden Markov models (HMMs), on the other hand, can integrate information over arbitrary lengths of time and thus describe a wider variety of data generating processes. The KCDE and HMMs are combined into one method. The resulting KCDE-HMM method is described in detail, and an iterative algorithm is presented for estimating its transition probabilities, weights and bandwidths. Consistency and asymptotic normality of the resulting conditional density estimator are proved. The conditional forecast ability of the proposed conditional density method is examined and compared via a rolling forecasting window with three benchmark methods: HMM, autoregressive HMM, and KCDE-MM. Large-sample performance of the above conditional estimation methods as a function of training data size is explored. Finally, the methods are applied to the U.S. Industrial Production series and the S&P 500 index. The results indicate that KCDE-HMM outperforms the benchmark methods for moderate-to-large sample sizes, irrespective of the number of hidden states considered.
  •  
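As background for the entry above, the basic kernel-based conditional density estimator that the KCDE-HMM builds on has the following textbook form (this is the generic estimator, not the paper's exact formulation, whose terms are additionally weighted using hidden-state information):

```latex
% Generic kernel conditional density estimate of a target y given conditioning x,
% from training pairs (X_i, Y_i), with kernels K and bandwidths g and h:
\hat{f}(y \mid x)
  = \frac{\sum_{i=1}^{n} K_g(x - X_i) \, K_h(y - Y_i)}
         {\sum_{j=1}^{n} K_g(x - X_j)}
```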
18.
  • Fong, Jason, et al. (author)
  • Speech Audio Corrector : using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech
  • 2022
  • In: INTERSPEECH 2022. - International Speech Communication Association. ; pp. 1213-1217
  • Conference paper (peer-reviewed). Abstract:
    • Correct pronunciation is essential for text-to-speech (TTS) systems in production. Most production systems rely on pronouncing dictionaries to perform grapheme-to-phoneme conversion. Unlike end-to-end TTS, this enables pronunciation correction by manually altering the phoneme sequence, but the necessary dictionaries are labour-intensive to create and only exist in a few high-resourced languages. This work demonstrates that accurate TTS pronunciation control can be achieved without a dictionary. Moreover, we show that such control can be performed without requiring any model retraining or fine-tuning, merely by supplying a single correctly-pronounced reading of a word in a different voice and accent at synthesis time. Experimental results show that our proposed system successfully enables one-off correction of mispronunciations in grapheme based TTS with maintained synthesis quality. This opens the door to production-level TTS in languages and applications where pronunciation dictionaries are unavailable.
  •  
19.
  • Ghosh, Anubhab, et al. (author)
  • Robust classification using hidden Markov models and mixtures of normalizing flows
  • 2020
  • In: 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP). - Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed). Abstract:
    • We test the robustness of a maximum-likelihood (ML) based classifier when the observed sequential data are corrupted by noise. The hypothesis is that a generative model that combines the state transitions of a hidden Markov model (HMM) with neural-network-based probability distributions for the hidden states of the HMM can provide robust classification performance. The combined model is called the normalizing-flow mixture model based HMM (NMM-HMM). It can be trained using a combination of expectation-maximization (EM) and backpropagation. We verify the improved robustness of NMM-HMM classifiers in an application to speech recognition.
  •  
20.
  • Henter, Gustav Eje, Assistant Professor, et al. (author)
  • MoGlow : Probabilistic and controllable motion synthesis using normalising flows
  • 2019
  • Other publication (other academic/artistic). Abstract:
    • Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive assumptions such as the motion being cyclic in nature. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method attains a motion quality close to recorded motion capture for both humans and animals.
  •  
21.
  • Henter, Gustav Eje, Assistant Professor, et al. (author)
  • MoGlow : Probabilistic and controllable motion synthesis using normalising flows
  • 2020
  • In: ACM Transactions on Graphics. - New York, NY, USA : Association for Computing Machinery (ACM). - 0730-0301, 1557-7368. ; 39:6, pp. 1-14
  • Journal article (peer-reviewed). Abstract:
    • Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.
  •  
22.
  • Jonell, Patrik, 1988-, et al. (author)
  • HEMVIP: Human Evaluation of Multiple Videos in Parallel
  • 2021
  • In: ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction. - New York, NY, United States : Association for Computing Machinery (ACM). ; pp. 707-711
  • Conference paper (peer-reviewed). Abstract:
    • In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation paradigms either present individual stimuli to be scored on Likert-type scales, or ask users to compare and rate videos in a pairwise fashion. However, the time and resources required for such evaluations scale poorly as the number of conditions to be compared increases. Building on standards used for evaluating the quality of multimedia codecs, this paper instead introduces a framework for granular rating of multiple comparable videos in parallel. This methodology essentially analyses all condition pairs at once. Our contributions are 1) a proposed framework, called HEMVIP, for parallel and granular evaluation of multiple video stimuli and 2) a validation study confirming that results obtained using the tool are in close agreement with results of prior studies using conventional multiple pairwise comparisons.
  •  
23.
  • Jonell, Patrik, et al. (author)
  • Let's face it : Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings
  • 2020
  • In: IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. - New York, NY, USA : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed). Abstract:
    • To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example, facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior. Those that do, typically use deterministic methods that risk producing repetitive and non-vivid motions. In this paper, we introduce a probabilistic method to synthesize interlocutor-aware facial gestures, represented by highly expressive FLAME parameters, in dyadic conversations. Our contributions are: a) a method for feature extraction from multi-party video and speech recordings, resulting in a representation that allows for independent control and manipulation of expression and speech articulation in a 3D avatar; b) an extension to MoGlow, a recent motion-synthesis method based on normalizing flows, to also take multi-modal signals from the interlocutor as input and subsequently output interlocutor-aware facial gestures; and c) a subjective evaluation assessing the use and relative importance of the different modalities in the synthesized output. The results show that the model successfully leverages the input from the interlocutor to generate more appropriate behavior. Videos, data, and code are available at: https://jonepatr.github.io/lets_face_it/
  •  
24.
  • Kucherenko, Taras, 1994-, et al. (author)
  • A large, crowdsourced evaluation of gesture generation systems on common data : The GENEA Challenge 2020
  • 2021
  • In: Proceedings IUI '21: 26th International Conference on Intelligent User Interfaces. - New York, NY, USA : Association for Computing Machinery (ACM). ; pp. 11-21
  • Conference paper (peer-reviewed). Abstract:
    • Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems now are solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.
  •  
25.
  • Kucherenko, Taras, 1994- (author)
  • Developing and evaluating co-speech gesture-synthesis models for embodied conversational agents
  • 2021
  • Doctoral thesis (other academic/artistic). Abstract:
    • A large part of our communication is non-verbal: humans use non-verbal behaviors to express various aspects of our state or intent. Embodied artificial agents, such as virtual avatars or robots, should also use non-verbal behavior for efficient and pleasant interaction. A core part of non-verbal communication is gesticulation: gestures communicate a large share of non-verbal content. For example, around 90% of spoken utterances in descriptive discourse are accompanied by gestures. Since gestures are important, generating co-speech gestures has been an essential task in the Human-Agent Interaction (HAI) and Computer Graphics communities for several decades. Evaluating the gesture-generating methods has been an equally important and equally challenging part of field development. Consequently, this thesis contributes to both the development and evaluation of gesture-generation models. This thesis proposes three deep-learning-based gesture-generation models. The first model is deterministic and uses only audio and generates only beat gestures. The second model is deterministic and uses both audio and text, aiming to generate meaningful gestures. A final model uses both audio and text and is probabilistic, to learn the stochastic character of human gesticulation. The methods have applications to both virtual agents and social robots. Individual research efforts in the field of gesture generation are difficult to compare, as there are no established benchmarks. To address this situation, my colleagues and I launched the first-ever gesture-generation challenge, which we called the GENEA Challenge. We have also investigated if online participants are as attentive as offline participants and found that both are equally attentive, provided that they are well paid. Finally, we developed a system that integrates co-speech gesture-generation models into a real-time interactive embodied conversational agent. This system is intended to facilitate the evaluation of modern gesture-generation models in interaction. To further advance the development of capable gesture-generation methods, we need to advance their evaluation, and the research in this thesis supports an interpretation that evaluation is the main bottleneck that limits the field. There are currently no comprehensive co-speech gesture datasets, which should be large, high-quality, and diverse. In addition, no strong objective metrics are yet available. Creating speech-gesture datasets and developing objective metrics are highlighted as essential next steps for further field development.
  •  
26.
  • Kucherenko, Taras, 1994-, et al. (author)
  • GENEA Workshop 2021 : The 2nd Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents
  • 2021
  • In: Proceedings of ICMI '21: International Conference on Multimodal Interaction. - New York, NY, USA : Association for Computing Machinery (ACM). ; pp. 872-873
  • Conference paper (peer-reviewed). Abstract:
    • Embodied agents benefit from using non-verbal behavior when communicating with humans. Despite several decades of non-verbal behavior-generation research, there is currently no well-developed benchmarking culture in the field. For example, most researchers do not compare their outcomes with previous work, and if they do, they often do so in their own way, which is frequently incompatible with others'. With the GENEA Workshop 2021, we aim to bring the community together to discuss key challenges and solutions, and to find the most appropriate ways to move the field forward.
  •  
27.
  • Kucherenko, Taras, 1994-, et al. (author)
  • Gesticulator : A framework for semantically-aware speech-driven gesture generation
  • 2020
  • In: ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction. - New York, NY, USA : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed). Abstract:
    • During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page svito-zar.github.io/gesticula
  •  
28.
  • Kucherenko, Taras, 1994-, et al. (author)
  • Moving Fast and Slow : Analysis of Representations and Post-Processing in Speech-Driven Automatic Gesture Generation
  • 2021
  • In: International Journal of Human-Computer Interaction. - Informa UK Limited. - 1044-7318, 1532-7590. ; 37:14, pp. 1300-1316
  • Journal article (peer-reviewed). Abstract:
    • This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyze the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference in the studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
  •  
29.
  • Kucherenko, Taras, 1994-, et al. (author)
  • Multimodal analysis of the predictability of hand-gesture properties
  • 2022
  • In: AAMAS '22: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. - ACM Press. ; pp. 770-779
  • Conference paper (peer-reviewed). Abstract:
    • Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, while rhythm-related gesture properties (phase) on the other hand can be predicted from audio features better than from text. These results are encouraging as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.
  •  
30.
  • Kucherenko, Taras, 1994-, et al. (author)
  • Speech2Properties2Gestures : Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech
  • 2021
  • In: IVA '21. - New York, NY, USA : Association for Computing Machinery (ACM). ; pp. 145-147
  • Conference paper (peer-reviewed). Abstract:
    • We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-quality output. This empowers the approach to generate gestures that are both diverse and representational. Follow-ups and more information can be found on the project page: https://svito-zar.github.io/speech2properties2gestures
  •  
31.
  • Kucherenko, Taras, 1994-, et al. (author)
  • The GENEA Challenge 2020 : Benchmarking gesture-generation systems on common data
  • Other publication (other academic/artistic). Abstract:
    • Automatic gesture generation is a field of growing interest, and a key technology for enabling embodied conversational agents. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA gesture-generation challenge, wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study. Since differences in evaluation outcomes between systems now are solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another and investigating the state of the art in the field. This paper provides a first report on the purpose, design, and results of our challenge, with each individual team's entry described in a separate paper also presented at the GENEA Workshop. Additional information about the workshop can be found at https://genea-workshop.github.io/2020/ .
  •  
32.
  • Kucherenko, Taras, 1994-, et al. (author)
  • The GENEA challenge 2023 : a large-scale evaluation of gesture generation models in monadic and dyadic settings
  • 2023
  • In: ICMI '23. - Association for Computing Machinery (ACM). - 9798400700552 ; pp. 792-801
  • Conference paper (peer-reviewed). Abstract:
    • This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems using the same speech and motion dataset, followed by a joint evaluation. This year's challenge provided data on both sides of a dyadic interaction, allowing teams to generate full-body motion for an agent given its speech (text and audio) and the speech and motion of the interlocutor. We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies. The studies focused on three aspects: 1) the human-likeness of the motion, 2) the appropriateness of the motion for the agent's own speech whilst controlling for the human-likeness of the motion, and 3) the appropriateness of the motion for the behaviour of the interlocutor in the interaction, using a setup that controls for both the human-likeness of the motion and the agent's own speech. We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap. Appropriateness seems far from being solved, with most submissions performing in a narrow range slightly above chance, far behind natural motion. The effect of the interlocutor is even more subtle, with submitted systems at best performing barely above chance. Interestingly, a dyadic system being highly appropriate for agent speech does not necessarily imply high appropriateness for the interlocutor.
  •  
33.
  • Mehta, Shivam, et al. (author)
  • Matcha-TTS : A fast TTS architecture with conditional flow matching
  • 2024
  • In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. - Institute of Electrical and Electronics Engineers (IEEE). ; pp. 11341-11345
  • Conference paper (peer-reviewed). Abstract:
    • We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
  •  
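For readers unfamiliar with optimal-transport conditional flow matching, the training objective in its standard general form (as used in the flow-matching literature; the conditioning on text that an acoustic model needs is omitted here) regresses a vector field onto straight-line interpolants between noise and data:

```latex
% Standard OT-CFM objective: x_0 ~ N(0, I) is noise, x_1 a data sample,
% t ~ U[0, 1], sigma_min a small constant, and v_theta the learned vector field.
x_t = \bigl(1 - (1 - \sigma_{\min}) t\bigr) x_0 + t\, x_1
\qquad
\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \bigl\lVert v_\theta(x_t, t) - \bigl(x_1 - (1 - \sigma_{\min}) x_0\bigr) \bigr\rVert^2
```

At synthesis time the learned vector field defines an ODE that is integrated from noise towards an acoustic representation in a small number of steps, which is what makes the approach fast.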
34.
  • Mehta, Shivam, et al. (author)
  • Neural HMMs are all you need (for high-quality attention-free TTS)
  • 2022
  • In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). - IEEE Signal Processing Society. ; pp. 7457-7461
  • Conference paper (peer-reviewed). Abstract:
    • Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
  •  
35.
  • Mehta, Shivam, et al. (author)
  • OverFlow : Putting flows on top of neural transducers for better TTS
  • 2023
  • In: Interspeech 2023. - International Speech Communication Association. ; pp. 4279-4283
  • Conference paper (peer-reviewed). Abstract:
    • Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
  •  
36.
  • Nyatsanga, S., et al. (author)
  • A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
  • 2023
  • In: Computer graphics forum (Print). - Wiley. - 0167-7055, 1467-8659. ; 42:2, pp. 569-596
  • Journal article (peer-reviewed). Abstract:
    • Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology for creating believable characters in film, games, and virtual social spaces, as well as for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. The field of gesture generation has seen surging interest in the last few years, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text and non-linguistic input. Concurrent with the exposition of deep learning approaches, we chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method (e.g., optical motion capture or pose estimation from video). Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.
  •  
37.
  • Pérez Zarazaga, Pablo, 1993-, et al. (author)
  • A processing framework to access large quantities of whispered speech found in ASMR
  • 2023
  • In: ICASSP 2023. - Rhodes, Greece : IEEE Signal Processing Society.
  • Conference paper (peer-reviewed). Abstract:
    • Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with human-in-the-loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.
  •  
38.
  • Pérez Zarazaga, Pablo, 1993-, et al. (author)
  • Speaker-independent neural formant synthesis
  • 2023
  • In: Interspeech 2023. - International Speech Communication Association. ; pp. 5556-5560
  • Conference paper (peer-reviewed). Abstract:
    • We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).
  •  
39.
  • Sorkhei, Mohammad Moein, 1995-, et al. (author)
  • Full-Glow : Fully conditional Glow for more realistic image generation
  • 2021
  • In: Pattern Recognition. - Cham, Switzerland : Springer Nature. ; pp. 697-711
  • Conference paper (peer-reviewed). Abstract:
    • Autonomous agents, such as driverless cars, require large amounts of labeled visual data for their training. A viable approach for acquiring such data is training a generative model with collected real data, and then augmenting the collected real dataset with synthetic images from the model, generated with control of the scene layout and ground truth labeling. In this paper we propose Full-Glow, a fully conditional Glow-based architecture for generating plausible and realistic images of novel street scenes given a semantic segmentation map indicating the scene layout. Benchmark comparisons show our model to outperform recent works in terms of the semantic segmentation performance of a pretrained PSPNet. This indicates that images from our model are, to a higher degree than from other models, similar to real images of the same kinds of scenes and objects, making them suitable as training data for a visual semantic segmentation or object recognition system.
  •  
40.
  • Valentini-Botinhao, Cassia, et al. (author)
  • Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks
  • 2022
  • In: INTERSPEECH 2022. - International Speech Communication Association. ; pp. 471-475
  • Conference paper (peer-reviewed). Abstract:
    • Automatically predicting the outcome of subjective listening tests is a challenging task. Ratings may vary from person to person even if preferences are consistent across listeners. While previous work has focused on predicting listeners' ratings (mean opinion scores) of individual stimuli, we focus on the simpler task of predicting subjective preference given two speech stimuli for the same text. We propose a model based on anti-symmetric twin neural networks, trained on pairs of waveforms and their corresponding preference scores. We explore both attention and recurrent neural nets to account for the fact that stimuli in a pair are not time aligned. To obtain a large training set we convert listeners' ratings from MUSHRA tests to values that reflect how often one stimulus in the pair was rated higher than the other. Specifically, we evaluate performance on data obtained from twelve MUSHRA evaluations conducted over five years, containing different TTS systems, built from data of different speakers. Our results compare favourably to a state-of-the-art model trained to predict MOS scores.
  •  
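The anti-symmetry mentioned in the entry above (a preference score that flips sign when the two stimuli are swapped) can be obtained, for example, by scoring the difference of the two embeddings with a bias-free linear head. The sketch below is a hypothetical minimal version with a shared 1-D convolutional encoder; it is not the published model.

# Hypothetical sketch of an anti-symmetric twin network for pairwise preference:
# score(a, b) = w . (enc(a) - enc(b)), so score(a, b) == -score(b, a) by design.
import torch
import torch.nn as nn

class TwinPreference(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(                     # shared waveform encoder
            nn.Conv1d(1, hidden, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(hidden, 1, bias=False)  # bias-free => anti-symmetric

    def forward(self, wav_a, wav_b):                  # (batch, 1, samples) each
        return self.head(self.enc(wav_a) - self.enc(wav_b)).squeeze(-1)

if __name__ == "__main__":
    model = TwinPreference()
    a, b = torch.randn(2, 1, 16000), torch.randn(2, 1, 16000)
    print(torch.allclose(model(a, b), -model(b, a), atol=1e-6))  # True

Such a score can be trained against preference targets derived from parallel ratings (the paper converts MUSHRA ratings to how often one stimulus outranked the other); the conversion itself is not shown here.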
41.
  • Valle-Perez, Guillermo, et al. (författare)
  • Transflower : probabilistic autoregressive dance generation with multimodal attention
  • 2021
  • Ingår i: ACM Transactions on Graphics. - : Association for Computing Machinery (ACM). - 0730-0301 .- 1557-7368. ; 40:6
  • Tidskriftsartikel (refereegranskat)abstract
    • Dance requires skillful composition of complex movements that follow rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder. Second, we introduce the currently largest 3D dance-motion dataset, obtained with a variety of motion-capture technologies, and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines, via objective metrics and a user study, and show that both the ability to model a probability distribution, as well as being able to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.
  •  
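To make the autoregressive structure in the entry above concrete, the sketch below shows a hypothetical generation loop: a transformer encoder summarises the recent pose and music context, and the next pose is sampled from a conditional distribution. For brevity, the normalizing-flow output head of the actual model is replaced here by a conditional Gaussian; all names and dimensions are assumptions.

# Hypothetical sketch of autoregressive, music-conditioned pose generation.
# A transformer encodes the recent motion + music context; the next pose is then
# sampled from a conditional distribution (a Gaussian here, standing in for the
# normalizing-flow head used in the actual Transflower model).
import torch
import torch.nn as nn

class ContextToPose(nn.Module):
    def __init__(self, pose_dim=69, music_dim=35, d_model=128):
        super().__init__()
        self.embed_pose = nn.Linear(pose_dim, d_model)
        self.embed_music = nn.Linear(music_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, 2 * pose_dim)   # mean and log-variance

    def forward(self, poses, music):    # (B, T, pose_dim), (B, T, music_dim)
        tokens = torch.cat([self.embed_pose(poses), self.embed_music(music)], dim=1)
        ctx = self.encoder(tokens).mean(dim=1)        # pooled context vector
        mean, log_var = self.out(ctx).chunk(2, dim=-1)
        return mean, log_var

def generate(model, init_poses, music, n_steps=30):
    poses = init_poses                                # (1, T0, pose_dim)
    for _ in range(n_steps):
        ctx_music = music[:, : poses.shape[1], :]
        mean, log_var = model(poses, ctx_music)
        next_pose = mean + torch.exp(0.5 * log_var) * torch.randn_like(mean)
        poses = torch.cat([poses, next_pose.unsqueeze(1)], dim=1)
    return poses

if __name__ == "__main__":
    model = ContextToPose()
    seed = torch.zeros(1, 10, 69)                     # 10 seed poses
    music = torch.randn(1, 200, 35)                   # music features per frame
    out = generate(model, seed, music)
    print(out.shape)                                  # torch.Size([1, 40, 69])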
42.
  • Webber, Jacob J., et al. (författare)
  • Autovocoder: Fast Waveform Generation from a Learned Speech Representation Using Differentiable Digital Signal Processing
  • 2023
  • Ingår i: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Konferensbidrag (refereegranskat)abstract
    • Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fast DSP operation, but generating a high-quality waveform from a mel-spectrogram requires computationally expensive machine learning: a neural vocoder. Our proposed "autovocoder" reverses this arrangement. We use machine learning to obtain a representation that replaces the mel-spectrogram, and that can be inverted back to a waveform using simple, fast operations including a differentiable implementation of the inverse STFT. The autovocoder generates a waveform 5 times faster than the DSP-based Griffin-Lim algorithm, and 14 times faster than the neural vocoder HiFi-GAN. We provide perceptual listening test results to confirm that the speech is of comparable quality to HiFi-GAN in the copy synthesis task.
  •  
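The central trick in the entry above, inverting a learned representation with cheap DSP, can be illustrated with PyTorch's built-in differentiable torch.stft and torch.istft. The tiny autoencoder below is a hypothetical stand-in for the learned representation, not the published autovocoder.

# Hypothetical sketch: learn a bottleneck over the complex STFT and invert it
# back to a waveform with the differentiable torch.istft (cheap DSP, no neural
# vocoder). The actual autovocoder uses a far more capable encoder/decoder.
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
N_BINS = N_FFT // 2 + 1                              # 257 frequency bins

class TinySpectrumAutoencoder(nn.Module):
    def __init__(self, latent=128):
        super().__init__()
        self.enc = nn.Linear(2 * N_BINS, latent)     # real+imag -> latent
        self.dec = nn.Linear(latent, 2 * N_BINS)     # latent -> real+imag

    def forward(self, wav):
        window = torch.hann_window(N_FFT, device=wav.device)
        spec = torch.stft(wav, N_FFT, hop_length=HOP, window=window,
                          return_complex=True)       # (batch, bins, frames)
        feats = torch.cat([spec.real, spec.imag], dim=1).transpose(1, 2)
        rec = self.dec(self.enc(feats)).transpose(1, 2)
        real, imag = rec.chunk(2, dim=1)
        rec_spec = torch.complex(real, imag)
        return torch.istft(rec_spec, N_FFT, hop_length=HOP, window=window,
                           length=wav.shape[-1])

if __name__ == "__main__":
    model = TinySpectrumAutoencoder()
    wav = torch.randn(2, 16000)                      # one second at 16 kHz
    out = model(wav)
    print(out.shape)                                 # torch.Size([2, 16000])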
43.
  • Wennberg, Ulme, et al. (författare)
  • Exploring Internal Numeracy in Language Models: A Case Study on ALBERT
  • 2024
  • Ingår i: MathNLP 2024: 2nd Workshop on Mathematical Natural Language Processing at LREC-COLING 2024 - Workshop Proceedings. - : European Language Resources Association (ELRA). ; , s. 35-40
  • Konferensbidrag (refereegranskat)abstract
    • It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
  •  
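The analysis described in the entry above can be approximated in a few lines with Hugging Face Transformers and scikit-learn: embed numerals and number words, then project with PCA. This is a hypothetical re-creation for illustration only; the token handling (averaging sub-word pieces) and the choice of albert-base-v2 are assumptions, not the paper's exact setup.

# Hypothetical sketch: project ALBERT's input embeddings for numerals and number
# words onto their first two principal components (not the paper's exact setup).
import torch
from transformers import AlbertTokenizer, AlbertModel
from sklearn.decomposition import PCA

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
emb = model.get_input_embeddings()                   # nn.Embedding lookup table

words = ["one", "two", "three", "four", "five", "1", "2", "3", "4", "5"]

vectors = []
for w in words:
    ids = tokenizer(w, add_special_tokens=False)["input_ids"]
    with torch.no_grad():
        # Average the embeddings of the sub-word pieces for this surface form.
        vectors.append(emb(torch.tensor(ids)).mean(dim=0))
X = torch.stack(vectors).numpy()

coords = PCA(n_components=2).fit_transform(X)
for w, (x, y) in zip(words, coords):
    print(f"{w:>6}: PC1={x:+.3f}  PC2={y:+.3f}")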
44.
  • Wennberg, Ulme, et al. (författare)
  • The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models
  • 2021
  • Ingår i: ACL-IJCNLP 2021. - : ASSOC COMPUTATIONAL LINGUISTICS-ACL. ; , s. 130-140
  • Konferensbidrag (refereegranskat)abstract
    • Mechanisms for encoding positional information are central for transformer-based language models. In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention. The degree of translation invariance increases during training and correlates positively with model performance. Our findings lead us to propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings. Our proposal has several theoretical advantages over existing position-representation approaches. Experiments show that it improves on regular ALBERT on GLUE tasks, while adding orders of magnitude fewer positional parameters.
  •  
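A minimal way to make self-attention depend only on relative token offsets, in the spirit of the entry above, is to add a learned bias b[i - j] to the attention logits instead of using absolute position embeddings. The sketch below uses a per-head learned bias per clipped offset, which is a deliberate simplification and an assumption, not the parameterisation used in TISA.

# Simplified sketch of translation-invariant attention scores: the positional
# contribution depends only on the clipped relative offset i - j, via a learned
# per-head bias table (a simplification of, not identical to, TISA).
import torch
import torch.nn as nn

class RelativeBiasSelfAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4, max_offset=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.max_offset = max_offset
        self.bias = nn.Parameter(torch.zeros(n_heads, 2 * max_offset + 1))

    def forward(self, x):                               # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))                   # (B, heads, T, d_head)

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5

        # Translation-invariant positional term: bias indexed by clipped (i - j).
        offsets = torch.arange(T)[:, None] - torch.arange(T)[None, :]
        offsets = offsets.clamp(-self.max_offset, self.max_offset) + self.max_offset
        scores = scores + self.bias[:, offsets]          # broadcast over batch

        attn = scores.softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y)

if __name__ == "__main__":
    layer = RelativeBiasSelfAttention()
    print(layer(torch.randn(2, 16, 64)).shape)           # torch.Size([2, 16, 64])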
45.
  • Wolfert, Pieter, et al. (författare)
  • "Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion
  • 2023
  • Ingår i: ICMI 2023 Companion. - : Association for Computing Machinery (ACM). ; , s. 6-10
  • Konferensbidrag (refereegranskat)abstract
    • This paper asks whether recent models for generating co-speech gesticulation may also learn to exhibit listening behaviour. We consider two models from recent gesture-generation challenges and train them on a dataset of audio and 3D motion capture from dyadic conversations. One model is driven by information from both sides of the conversation, whereas the other only uses the character's own speech. Several user studies are performed to assess the motion generated when the character is speaking actively, versus when the character is the listener in the conversation. We find that participants are reliably able to discern motion associated with listening, whether from motion capture or generated by the models. Both models are thus able to produce distinctive listening behaviour, even though only one model is truly a listener, in the sense that it has access to information from the other party in the conversation. Additional experiments on both natural and model-generated motion find motion associated with listening to be rated as less human-like than motion associated with active speaking.
  •  
46.
  • Wolfert, Pieter, et al. (författare)
  • Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour
  • 2024
  • Ingår i: Applied Sciences. - : MDPI AG. - 2076-3417. ; 14:4
  • Tidskriftsartikel (refereegranskat)abstract
    • This paper compares three methods for evaluating computer-generated motion behaviour for animated characters: two commonly used direct rating methods and a newly designed questionnaire. The questionnaire is specifically designed to measure the human-likeness, appropriateness, and intelligibility of the generated motion. Furthermore, this study investigates the suitability of these evaluation tools for assessing subtle forms of human behaviour, such as the subdued motion cues shown when listening to someone. This paper reports six user studies, namely studies that directly rate the appropriateness and human-likeness of a computer character's motion, along with studies that instead rely on a questionnaire to measure the quality of the motion. As test data, we used the motion generated by two generative models and recorded human gestures, which served as a gold standard. Our findings indicate that when evaluating gesturing motion, the direct rating of human-likeness and appropriateness is to be preferred over a questionnaire. However, when assessing the subtle motion of a computer character, even the direct rating method yields less conclusive results. Despite demonstrating high internal consistency, our questionnaire proves to be less sensitive than directly rating the quality of the motion. The results provide insights into the evaluation of human motion behaviour and highlight the complexities involved in capturing subtle nuances in nonverbal communication. These findings have implications for the development and improvement of motion generation models and can guide researchers in selecting appropriate evaluation methodologies for specific aspects of human behaviour.
  •  
47.
  • Wolfert, Pieter, et al. (författare)
  • GENEA Workshop 2022 : The 3rd Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents
  • 2022
  • Ingår i: ACM International Conference Proceeding Series. - New York, NY, USA : Association for Computing Machinery (ACM). ; , s. 799-800
  • Konferensbidrag (refereegranskat)abstract
    • Embodied agents benefit from using non-verbal behavior when communicating with humans. Despite several decades of non-verbal behavior-generation research, there is currently no well-developed benchmarking culture in the field. For example, most researchers do not compare their outcomes with previous work, and if they do, they often do so in their own way which frequently is incompatible with others. With the GENEA Workshop 2022, we aim to bring the community together to discuss key challenges and solutions, and find the most appropriate ways to move the field forward.
  •  
48.
  • Yoon, Youngwoo, et al. (författare)
  • GENEA Workshop 2023 : The 4th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents
  • 2023
  • Ingår i: ICMI 2023. - : Association for Computing Machinery (ACM). ; , s. 822-823
  • Konferensbidrag (refereegranskat)abstract
    • Non-verbal behavior is advantageous for embodied agents when interacting with humans. Despite many years of research on the generation of non-verbal behavior, there is no established benchmarking practice in the field. Most researchers do not compare their results to prior work, and if they do, they often do so in a manner that is not compatible with other approaches. The GENEA Workshop 2023 seeks to bring the community together to discuss the major challenges and solutions, and to identify the best ways to progress the field.
  •  
49.
  • Yoon, Youngwoo, et al. (författare)
  • The GENEA Challenge 2022 : A large evaluation of data-driven co-speech gesture generation
  • 2022
  • Ingår i: ICMI 2022. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450393904 ; , s. 736-747
  • Konferensbidrag (refereegranskat)abstract
    • This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. This year's dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field. The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings.
  •  
Typ av publikation
konferensbidrag (35)
tidskriftsartikel (9)
annan publikation (3)
doktorsavhandling (2)
Typ av innehåll
refereegranskat (43)
övrigt vetenskapligt/konstnärligt (6)
Författare/redaktör
Henter, Gustav Eje, ... (26)
Henter, Gustav Eje, ... (23)
Beskow, Jonas (19)
Kucherenko, Taras, 1 ... (14)
Székely, Eva (11)
Alexanderson, Simon (9)
Kjellström, Hedvig, ... (8)
Gustafsson, Joakim, ... (7)
Nagy, Rajmund (5)
Malisz, Zofia (5)
Mehta, Shivam (5)
Gustafson, Joakim, p ... (4)
Wennberg, Ulme (3)
Jonell, Patrik, 1988 ... (3)
Hagman, Göran (2)
Håkansson, Krister (2)
Kivipelto, Miia (2)
Neff, Michael (2)
Leite, Iolanda (2)
Belpaeme, Tony (2)
Moell, Birger (2)
King, Simon (2)
Holleman, Jasper (1)
Honore, Antoine (1)
Neff, M. (1)
Holzapfel, Andre (1)
Akenine, Ulrika (1)
Edlund, Jens (1)
Chatterjee, Saikat (1)
Beck, Gustavo (1)
Betz, Simon (1)
Wagner, Petra (1)
Liu, Dong (1)
De Gooijer, Jan G. (1)
Yuan, Ao (1)
Wang, Siyang, 1994- (1)
Tang, Hao (1)
Tånnander, Christina (1)
Wang, Siyang, 1995- (1)
Fong, Jason (1)
Lyth, Daniel (1)
Ghosh, Anubhab (1)
Bonnard, Alexandre (1)
Tu, Ruibo (1)
van Waveren, Sanne (1)
Kucherenko, T. (1)
Rydén, Marie (1)
Stormoen, Sara (1)
Peres, Kristal Moral ... (1)
Juvela, Lauri (1)
Lärosäte
Kungliga Tekniska Högskolan (49)
Umeå universitet (2)
Språk
Engelska (49)
Forskningsämne (UKÄ/SCB)
Naturvetenskap (40)
Teknik (9)
Humaniora (4)
Medicin och hälsovetenskap (1)
Samhällsvetenskap (1)
