SwePub
Search the SwePub database


Search: WFRF:(Stymne Sara)

  • Result 1-50 of 81
1.
  • Adams, Allison, et al. (author)
  • Learning with learner corpora : Using the TLE for native language identification
  • 2017
  • In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition. ; , s. 1-7
  • Conference paper (peer-reviewed) abstract
    • This study investigates the usefulness of the Treebank of Learner English (TLE) when applied to the task of Native Language Identification (NLI). The TLE is effectively a parallel corpus of Standard/Learner English, as there are two versions: one based on original learner essays, and the other an error-corrected version. We use the corpus to explore how useful a parser trained on ungrammatical relations is compared to a parser trained on grammatical relations, when used as features for a native language classification task. While parsing results are much better when trained on grammatical relations, native language classification is slightly better using a parser trained on the original treebank containing ungrammatical relations.
  •  
2.
  •  
3.
  • Bremin, Sofia, et al. (author)
  • Methods for human evaluation of machine translation
  • 2010
  • In: Proceedings of the Swedish Language Technology Conference (SLTC2010). ; , s. 47-48
  • Conference paper (other academic/artistic) abstract
    • Evaluation of machine translation (MT) is a difficult task, both for humans and using automatic metrics. The main difficulty lies in the fact that there is not one single correct translation, but many alternative good translation options. MT systems are often evaluated using automatic metrics, which commonly rely on comparing a translation to only a single human reference translation. An alternative is different types of human evaluation, commonly ranking between systems, estimations of adequacy and fluency on some scale, or error analyses. We have explored four different evaluation methods on output from three different statistical MT systems. The main focus is on different types of human evaluation. We compare two conventional evaluation methods, human error analysis and automatic metrics, to two lesser used evaluation methods based on reading comprehension and eye-tracking. These two evaluations are performed without the subjects seeing the source sentence. There have been few previous attempts at using reading comprehension and eye-tracking for MT evaluation. One example of a reading comprehension study is Fuji (1999), who conducted an experiment to compare English-to-Japanese MT to several versions of manual corrections of the system output. He found significant differences between texts with large differences on reading comprehension questions. Doherty and O’Brien (2009) is the only study we are aware of using eye-tracking for MT evaluation. They found that the average gaze time and fixation counts were significantly lower for sentences judged as excellent in an earlier evaluation than for bad sentences. Like previous research, we find that both reading comprehension and eye-tracking can be useful for MT evaluation. The results of these methods are consistent with the other methods for comparison between systems with a big quality difference. For systems with similar quality, however, the different evaluation methods often do not show any significant differences.
  •  
4.
  •  
5.
  • Cerniavski, Rafal, et al. (author)
  • Multilingual Automatic Speech Recognition for Scandinavian Languages
  • 2023
  • In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). - Tartu : University of Tartu. - 9789916219997 ; , s. 460-466
  • Conference paper (peer-reviewed) abstract
    • We investigate the effectiveness of multilingual automatic speech recognition models for Scandinavian languages by further fine-tuning a Swedish model on Swedish, Danish, and Norwegian. We first explore zero-shot models, which perform poorly across the three languages. However, we show that a multilingual model based on a strong Swedish model, further fine-tuned on all three languages, performs well for Norwegian and Danish, with a relatively low decrease in the performance for Swedish. With a language classification module, we improve the performance of the multilingual model even further.
  •  
6.
  • Černiavski, Rafal, et al. (author)
  • Uppsala University at SemEval-2022 Task 1 : Can Foreign Entries Enhance an English Reverse Dictionary?
  • 2022
  • In: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). - Stroudsburg, PA, USA : Association for Computational Linguistics. - 9781955917803 ; , s. 88-93
  • Conference paper (peer-reviewed) abstract
    • We present the Uppsala University system for SemEval-2022 Task 1: Comparing Dictionaries and Word Embeddings (CODWOE). We explore the performance of multilingual reverse dictionaries as well as the possibility of utilizing annotated data in other languages to improve the quality of a reverse dictionary in the target language. We mainly focus on character-based embeddings. In our main experiment, we train multilingual models by combining the training data from multiple languages. In an additional experiment, using resources beyond the shared task, we use the training data in Russian and French to improve the English reverse dictionary using unsupervised embeddings alignment and machine translation. The results show that multilingual models can occasionally, but not consistently, outperform the monolingual baselines. In addition, we demonstrate an improvement of an English reverse dictionary using translated entries from the Russian training data set.
  •  
7.
  • Danilova, Vera, et al. (author)
  • UD-MULTIGENRE : a UD-Based Dataset Enriched with Instance-Level Genre Annotations
  • 2023
  • In: Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL). - : Association for Computational Linguistics. - 9798891760561 ; , s. 253-267
  • Conference paper (peer-reviewed) abstract
    • Prior research on the impact of genre on cross-lingual dependency parsing has suggested that genre is an important signal. However, these studies suffer from a scarcity of reliable data for multiple genres and languages. While Universal Dependencies (UD), the only available large-scale resource for cross-lingual dependency parsing, contains data from diverse genres, the documentation of genre labels is missing, and there are multiple inconsistencies. This makes studies of the impact of genres difficult to design. To address this, we present a new dataset, UD-MULTIGENRE, where 17 genres are defined and instance-level annotations of these are applied to a subset of UD data, covering 38 languages. It provides a rich ground for research related to text genre from a multilingual perspective. Utilizing this dataset, we can overcome the data shortage that hindered previous research and reproduce experiments from earlier studies with an improved setup. We revisit a previous study that used genre-based clusters and show that the clusters for most target genres provide a mix of genres. We compare training data selection based on clustering and gold genre labels and provide an analysis of the results. The dataset is publicly available. (https://github.com/UppsalaNLP/UD-MULTIGENRE)
  •  
8.
  • de Lhoneux, Miryam, 1990-, et al. (author)
  • Arc-Hybrid Non-Projective Dependency Parsing with a Static-Dynamic Oracle
  • 2017
  • In: IWPT 2017 15th International Conference on Parsing Technologies. - Pisa, Italy : Association for Computational Linguistics. - 9781945626739 ; , s. 99-104
  • Conference paper (peer-reviewed) abstract
    • We extend the arc-hybrid transition system for dependency parsing with a SWAP transition that enables reordering of the words and construction of non-projective trees. Although this extension potentially breaks the arc-decomposability of the transition system, we show that the existing dynamic oracle can be modified and combined with a static oracle for the SWAP transition. Experiments on five languages with different degrees of non-projectivity show that the new system gives competitive accuracy and is significantly better than a system trained with a purely static oracle.
  •  
9.
  • de Lhoneux, Miryam, 1990-, et al. (author)
  • From raw text to Universal Dependencies : look, no tags!
  • 2017
  • In: Proceedings of the CoNLL 2017 Shared Task. - Vancouver, Canada : Association for Computational Linguistics. - 9781945626708 ; , s. 207-217
  • Conference paper (peer-reviewed) abstract
    • We present the Uppsala submission to the CoNLL 2017 shared task on parsing from raw text to universal dependencies. Our system is a simple pipeline consisting of two components. The first performs joint word and sentence segmentation on raw text; the second predicts dependency trees from raw words. The parser bypasses the need for part-of-speech tagging, but uses word embeddings based on universal tag distributions. We achieved a macro-averaged LAS F1 of 65.11 in the official test run and obtained the 2nd best result for sentence segmentation with a score of 89.03. After fixing two bugs, we obtained an unofficial LAS F1 of 70.49.
  •  
10.
  • de Lhoneux, Miryam, 1990- (author)
  • Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages
  • 2019
  • Doctoral thesis (other academic/artistic) abstract
    • This thesis presents several studies in neural dependency parsing for typologically diverse languages, using treebanks from Universal Dependencies (UD). The focus is on informing models with linguistic knowledge. We first extend a parser to work well on typologically diverse languages, including morphologically complex languages and languages whose treebanks have a high ratio of non-projective sentences, a notorious difficulty in dependency parsing. We propose a general methodology where we sample a representative subset of UD treebanks for parser development and evaluation. Our parser uses recurrent neural networks which construct information sequentially, and we study the incorporation of a recursive neural network layer in our parser. This follows the intuition that language is hierarchical. This layer turns out to be superfluous in our parser and we study its interaction with other parts of the network. We subsequently study transitivity and agreement information learned by our parser for auxiliary verb constructions (AVCs). We suggest that a parser should learn similar information about AVCs as it learns for finite main verbs. This is motivated by work in theoretical dependency grammar. Our parser learns different information about these two if we do not augment it with a recursive layer, but similar information if we do, indicating that there may be benefits from using that layer and we may not yet have found the best way to incorporate it in our parser. We finally investigate polyglot parsing. Training one model for multiple related languages leads to substantial improvements in parsing accuracy over a monolingual baseline. We also study different parameter sharing strategies for related and unrelated languages. Sharing parameters that partially abstract away from word order appears to be beneficial in both cases but sharing parameters that represent words and characters is more beneficial for related than unrelated languages.
  •  
11.
  • de Lhoneux, Miryam, 1990-, et al. (author)
  • Old School vs. New School : Comparing Transition-Based Parsers with and without Neural Network Enhancement
  • 2017
  • In: Proceedings of the 15th Treebanks and Linguistic Theories Workshop (TLT). ; , s. 99-110
  • Conference paper (peer-reviewed) abstract
    • In this paper, we attempt a comparison between "new school" transition-based parsers that use neural networks and their classical "old school" counterpart. We carry out experiments on treebanks from the Universal Dependencies project. To facilitate the comparison and analysis of results, we only work on a subset of those treebanks. However, we carefully select this subset in the hope of obtaining results that are representative for the whole set of treebanks. We select two parsers that are hopefully representative of the two schools, MaltParser and UDPipe, and we look at the impact of training size on the two models. We hypothesize that neural network enhanced models have a steeper learning curve with increased training size. We observe, however, that, contrary to expectations, neural network enhanced models need only a small amount of training data to outperform the classical models, but the learning curves of both models increase at a similar pace after that. We carry out an error analysis on the development sets parsed by the two systems and observe that overall MaltParser suffers more than UDPipe from longer dependencies. We observe that MaltParser is only marginally better than UDPipe on a restricted set of short dependencies.
  •  
12.
  • de Lhoneux, Miryam, 1990-, et al. (author)
  • What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions?
  • 2019
  • In: CoRR. ; abs/1907.07950
  • Journal article (other academic/artistic) abstract
    • This article is a linguistic investigation of a neural parser. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical work in dependency grammar and in particular the work of Tesnière (1959), where AVCs and FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus; it consists of at least two words, and an FMV is its non-dissociated counterpart, consisting of exactly one word. We suggest that the representation of AVCs and FMVs should capture similar information. We use diagnostic classifiers to probe agreement and transitivity information in vectors learned by a transition-based neural parser in four typologically different languages. We find that the parser learns different information about AVCs and FMVs if only sequential models (BiLSTMs) are used in the architecture but similar information when a recursive layer is used. We find explanations for why this is the case by looking closely at how information is learned in the network and looking at what happens with different dependency representations of AVCs.
  •  
13.
  • de Lhoneux, Miryam, 1990-, et al. (author)
  • What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions?
  • 2020
  • In: Computational linguistics - Association for Computational Linguistics (Print). - : MIT Press. - 0891-2017 .- 1530-9312. ; 46:4, s. 763-784
  • Journal article (peer-reviewed) abstract
    • There is a growing interest in investigating what neural NLP models learn about language. A prominent open question is the question of whether or not it is necessary to model hierarchical structure. We present a linguistic investigation of a neural parser adding insights to this question. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical work in dependency grammar and in particular the work of Tesnière (1959), where AVCs and FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus; it consists of at least two words, and an FMV is its non-dissociated counterpart, consisting of exactly one word. We suggest that the representation of AVCs and FMVs should capture similar information. We use diagnostic classifiers to probe agreement and transitivity information in vectors learned by a transition-based neural parser in four typologically different languages. We find that the parser learns different information about AVCs and FMVs if only sequential models (BiLSTMs) are used in the architecture but similar information when a recursive layer is used. We find explanations for why this is the case by looking closely at how information is learned in the network and looking at what happens with different dependency representations of AVCs. We conclude that there may be benefits to using a recursive layer in dependency parsing and that we have not yet found the best way to integrate it in our parsers.
  •  
14.
  • Della Corte, Giuseppe, et al. (author)
  • IESTAC : English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation
  • 2020
  • In: Proceedings of the First International Workshop on Natural Language Processing Beyond Text. - Stroudsburg, PA, USA : Association for Computational Linguistics. ; , s. 41-50
  • Conference paper (peer-reviewed) abstract
    • We discuss a set of methods for the creation of IESTAC: an English-Italian speech and text parallel corpus designed for the training of end-to-end speech-to-text machine translation models and publicly released as part of this work. We first mapped English LibriVox audiobooks and their corresponding English Gutenberg Project e-books to Italian e-books with a set of three complementary methods. Then we aligned the English and the Italian texts using both traditional Gale-Church based alignment methods and a recently proposed tool that performs bilingual sentence alignment by computing the cosine similarity of multilingual sentence embeddings. Finally, we forced the alignment between the English audiobooks and the English side of our textual parallel corpus with a text-to-speech and dynamic time warping based forced alignment tool. For each step, we provide the reader with a critical discussion based on detailed evaluation and comparison of the results of the different methods.
  •  
15.
  • Dürlich, Luise, et al. (author)
  • Cause and Effect in Governmental Reports: Two Data Sets for Causality Detection in Swedish
  • 2022
  • In: Proceedings of the First Workshop on Natural Language Processing for Political Sciences (PoliticalNLP), Marseille, France, 24 June 2022. ; , s. 46-55
  • Conference paper (peer-reviewed) abstract
    • Causality detection is the task of extracting information about causal relations from text. It is an important task for different types of document analysis, including political impact assessment. We present two new data sets for causality detection in Swedish. The first data set is annotated with binary relevance judgments, indicating whether a sentence contains causality information or not. In the second data set, sentence pairs are ranked for relevance with respect to a causality query, containing a specific hypothesized cause and/or effect. Both data sets are carefully curated and mainly intended for use as test data. We describe the data sets and their annotation, including detailed annotation guidelines. In addition, we present pilot experiments on cross-lingual zero-shot and few-shot causality detection, using training data from English and German.
  •  
16.
  • Dürlich, Luise, et al. (author)
  • What Causes Unemployment? : Unsupervised Causality Mining from Swedish Governmental Reports
  • 2023
  • In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023). - : Association for Computational Linguistics. - 9781959429739 ; , s. 25-29
  • Conference paper (peer-reviewed) abstract
    • Extracting statements about causality from text documents is a challenging task in the absence of annotated training data. We create a search system for causal statements about user-specified concepts by combining pattern matching of causal connectives with semantic similarity ranking, using a language model fine-tuned for semantic textual similarity. Preliminary experiments on a small test set from Swedish governmental reports show promising results in comparison to two simple baselines.
  •  
17.
  • Guillou, Liane, et al. (author)
  • Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction
  • 2016
  • In: Proceedings of the First Conference on Machine Translation. ; , s. 525-542
  • Conference paper (other academic/artistic) abstract
    • We describe the design, the evaluation setup, and the results of the 2016 WMT shared task on cross-lingual pronoun prediction. This is a classification task in which participants are asked to provide predictions on what pronoun class label should replace a placeholder value in the target-language text, provided in lemmatised and PoS-tagged form. We provided four subtasks, for the English–French and English–German language pairs, in both directions. Eleven teams participated in the shared task; nine for the English–French subtask, five for French–English, nine for English–German, and six for German–English. Most of the submissions outperformed two strong language-model-based baseline systems, with systems using deep recurrent neural networks outperforming those using other architectures for most language pairs.
  •  
18.
  • Hardmeier, Christian, et al. (author)
  • Anaphora Models and Reordering for Phrase-Based SMT
  • 2014
  • In: Proceedings of the Ninth Workshop on Statistical Machine Translation. - : Association for Computational Linguistics. - 9781941643174 ; , s. 122-129
  • Conference paper (peer-reviewed) abstract
    • We describe the Uppsala University systems for WMT14. We look at the integration of a model for translating pronominal anaphora and a syntactic dependency projection model for English–French. Furthermore, we investigate post-ordering and tunable POS distortion models for English–German.
  •  
19.
  • Hardmeier, Christian, et al. (author)
  • Docent : A Document-Level Decoder for Phrase-Based Statistical Machine Translation
  • 2013
  • In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. - : Association for Computational Linguistics. ; , s. 193-198
  • Conference paper (peer-reviewed) abstract
    • We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-by-sentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models.
  •  
20.
  • Hardmeier, Christian, et al. (author)
  • Pronoun-Focused MT and Cross-Lingual Pronoun Prediction: Findings of the 2015 DiscoMT Shared Task on Pronoun Translation
  • 2015
  • In: Proceedings of the Second Workshop on Discourse in Machine Translation (DiscoMT). - Stroudsburg, PA : Association for Computational Linguistics. - 9781941643327 ; , s. 1-16
  • Conference paper (other academic/artistic) abstract
    • We describe the design, the evaluation setup, and the results of the DiscoMT 2015 shared task, which included two subtasks, relevant to both the machine translation (MT) and the discourse communities: (i) pronoun-focused translation, a practical MT task, and (ii) cross-lingual pronoun prediction, a classification task that requires no specific MT expertise and is interesting as a machine learning task in its own right. We focused on the English–French language pair, for which MT output is generally of high quality, but has visible issues with pronoun translation due to differences in the pronoun systems of the two languages. Six groups participated in the pronoun-focused translation task and eight groups in the cross-lingual pronoun prediction task.
  •  
21.
  • Holmqvist, Maria, et al. (author)
  • Alignment-based reordering for SMT
  • 2012
  • In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). ; , s. 3436-3440
  • Conference paper (other academic/artistic) abstract
    • We present a method for improving word alignment quality for phrase-based statistical machine translation by reordering the source text according to the target word order suggested by an initial word alignment. The reordered text is used to create a second word alignment which can be an improvement of the first alignment, since the word order is more similar. The method requires no other pre-processing such as part-of-speech tagging or parsing. We report improved Bleu scores for English-to-German and English-to-Swedish translation. We also examined the effect on word alignment quality and found that the reordering method increased recall while lowering precision, which partly can explain the improved Bleu scores. A manual evaluation of the translation output was also performed to understand what effect our reordering method has on the translation system. We found that where the system employing reordering differed from the baseline in terms of having more words, or a different word order, this generally led to an improvement in translation quality.
  •  
22.
  • Holmqvist, Maria, 1979-, et al. (author)
  • Experiments with word alignment, normalization and clause reordering for SMT between English and German
  • 2011
  • In: Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011). ; , s. 393-398
  • Conference paper (peer-reviewed) abstract
    • This paper presents the LIU system for the WMT 2011 shared task for translation between German and English. For English–German we attempted to improve the translation tables with a combination of standard statistical word alignments and phrase-based word alignments. For German–English translation we tried to make the German text more similar to the English text by normalizing German morphology and performing rule-based clause reordering of the German text. This resulted in small improvements for both translation directions.
  •  
23.
  •  
24.
  •  
25.
  • Holmqvist, Maria, 1979-, et al. (author)
  • Improving alignment for SMT by reordering and augmenting the training corpus
  • 2009
  • In: Proceedings of the Fourth Workshop on Statistical Machine Translation (WMT09). - Athens, Greece. ; , s. 120-124
  • Conference paper (peer-reviewed) abstract
    • We describe the LIU systems for English-German and German-English translation in the WMT09 shared task. We focus on two methods to improve the word alignment: (i) by applying Giza++ in a second phase to a reordered training corpus, where reordering is based on the alignments from the first phase, and (ii) by adding lexical data obtained as high-precision alignments from a different word aligner. These methods were studied in the context of a system that uses compound processing, a morphological sequence model for German, and a part-of-speech sequence model for English. Both methods gave some improvements to translation quality as measured by Bleu and Meteor scores, though not consistently. All systems used both out-of-domain and in-domain data as the mixed corpus had better scores in the baseline configuration.
  •  
26.
  •  
27.
  • Karamolegkou, Antonia, et al. (author)
  • Investigation of Transfer Languages for Parsing Latin: Italic Branch vs. Hellenic Branch
  • 2021
  • In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). - : Linköping University Electronic Press. ; , s. 315-320
  • Conference paper (peer-reviewed) abstract
    • Choosing a transfer language is a crucial step in transfer learning. In much previous research on dependency parsing, related languages have successfully been used. However, when parsing Latin, it has been suggested that languages such as ancient Greek could be helpful. In this work we parse Latin in a low-resource scenario, with the main goal to investigate if Greek languages are more helpful for parsing Latin than related Italic languages, and show that this is indeed the case. We further investigate the influence of other factors including training set size and content as well as linguistic distances. We find that one explanatory factor seems to be the syntactic similarity between Latin and Ancient Greek. The influence of genres or shared annotation projects seems to have a smaller impact.
  •  
28.
  • Lameris, Harm, et al. (author)
  • Whit’s the Richt Pairt o Speech: PoS tagging for Scots
  • 2021
  • In: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). - : Association for Computational Linguistics. ; , s. 39-48
  • Conference paper (peer-reviewed) abstract
    • In this paper we explore PoS tagging for the Scots language. Scots is spoken in Scotland and Northern Ireland, and is closely related to English. As no linguistically annotated Scots data were available, we manually PoS tagged a small set that is used for evaluation and training. We use English as a transfer language to examine zero-shot transfer and transfer learning methods. We find that training on a very small amount of Scots data was superior to zero-shot transfer from English. Combining the Scots and English data led to further improvements, with a concatenation method giving the best results. We also compared the use of two different English treebanks and found that a treebank containing web data was superior in the zero-shot setting, while it was outperformed by a treebank containing a mix of genres when combined with Scots data.
  •  
29.
  • Loáiciga, Sharid, et al. (author)
  • Findings of the 2017 DiscoMT Shared Task on Cross-lingual Pronoun Prediction
  • 2017
  • In: Proceedings of the Third Workshop on Discourse in Machine Translation.
  • Conference paper (other academic/artistic) abstract
    • We describe the design, the setup, and the evaluation results of the DiscoMT 2017 shared task on cross-lingual pronoun prediction. The task asked participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provided a lemmatized target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. The aim of the task was to predict, for each target-language pronoun placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the entire document. We offered four subtasks, each for a different language pair and translation direction: English-to-French, English-to-German, German-to-English, and Spanish-to-English. Five teams participated in the shared task, making submissions for all language pairs. The evaluation results show that all participating teams outperformed two strong n-gram-based language-model baseline systems by a sizable margin.
  •  
30.
  • Parks, Magdalena, et al. (author)
  • Plausibility Testing for Lexical Resources
  • 2017
  • In: Proceedings of CLEF 2017. - Cham : Springer International Publishing. ; , s. 132-137
  • Conference paper (peer-reviewed) abstract
    • This paper describes principles for evaluation metrics for lexical components and an implementation of them based on requirements from practical information systems.
  •  
31.
  • Ramisch, Carlos, et al. (author)
  • Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions
  • 2020
  • In: Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons. ; , s. 107-118
  • Conference paper (peer-reviewed) abstract
    • We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.
  •  
32.
  • Reimann, Sebastian, et al. (author)
  • Exploring Cross-Lingual Transfer to Counteract Data Scarcity for Causality Detection
  • 2022
  • In: WWW '22. - New York, USA : Association for Computing Machinery (ACM). - 9781450391306 ; , s. 501-508
  • Conference paper (peer-reviewed)abstract
    • Finding causal relations in text is an important task for many types of textual analysis. It is a challenging task, especially for the many languages with little or no annotated training data available. To overcome this issue, we explore cross-lingual methods. Our main focus is on Swedish, for which we have a limited amount of data, and where we explore transfer from English and German. We also present additional results for German with English as a source language. We explore both a zero-shot setting without any target training data, and a few-shot setting with a small amount of target data. An additional challenge is the fact that the annotation schemes for the different data sets differ, and we discuss how we can address this issue. Moreover, we explore the impact of different types of sentence representations. We find that we have the best results for Swedish with German as a source language, for which we have a rather small but compatible data set. We are able to take advantage of a limited amount of noisy Swedish training data, but only if we balance its classes. In addition, we find that the newer transformer-based representations can make better use of target language data, but that a representation based on recurrent neural networks is surprisingly competitive in the zero-shot setting.
  •  
33.
  • Rizal, Arra’Di Nur, et al. (author)
  • Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data
  • 2020
  • In: Proceedings of the 4th Workshop on Computational Approaches to Code Switching. ; , s. 26-35
  • Conference paper (peer-reviewed)abstract
    • Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.
  •  
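The corpus synthesis step described in the abstract above can be sketched roughly as follows. This is an illustrative assumption, not the authors' exact procedure: a monolingual Indonesian sentence is code-mixed by swapping words for English dictionary translations at a fixed switch probability. The function name, dictionary, and probability value are all invented here for illustration.

```python
# Hedged sketch of code-mixed corpus synthesis (illustrative only):
# replace each word that has a bilingual dictionary entry with its
# English translation, with a fixed switching probability.
import random

def synthesize_code_mixed(sentence, bilingual_dict, switch_prob=0.3, rng=None):
    """Return one synthetic code-mixed sentence: each word with a
    dictionary translation is switched with probability switch_prob."""
    rng = rng or random.Random(0)   # fixed seed for reproducible output
    out = []
    for word in sentence.split():
        translation = bilingual_dict.get(word.lower())
        if translation is not None and rng.random() < switch_prob:
            out.append(translation)   # switch to the other language
        else:
            out.append(word)          # keep the original word
    return " ".join(out)
```

Embeddings (e.g. with word2vec or fastText) would then be trained on a large corpus generated this way.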
34.
  • Ruby, Ahmed, et al. (author)
  • A Mention-Based System for Revision Requirements Detection
  • 2021
  • In: Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language. - Stroudsburg, PA, USA : Association for Computational Linguistics. - 9781954085763 ; , s. 58-63
  • Conference paper (peer-reviewed)abstract
    • Exploring aspects of sentential meaning that are implicit or underspecified in context is important for sentence understanding. In this paper, we propose a novel architecture based on mentions for revision requirements detection. The goal is to improve understandability, addressing some types of revisions, especially for the Replaced Pronoun type. We show that our mention-based system can predict replaced pronouns well on the mention-level. However, our combined sentence-level system does not improve on the sentence-level BERT baseline. We also present additional contrastive systems, and show results for each type of edit.
  •  
35.
  • Sagemo, Oscar, 1994-, et al. (author)
  • The UU Submission to the Machine Translation Quality Estimation Task
  • 2016
  • In: Proceedings of the First Conference on Machine Translation. ; , s. 825-830
  • Conference paper (peer-reviewed)abstract
    • This paper outlines the UU-SVM system for Task 1 of the WMT16 Shared Task in Quality Estimation. Our system uses Support Vector Machine Regression to investigate the impact of a series of features aiming to convey translation quality. We propose novel features measuring reordering and noun translation errors. We show that we can outperform the baseline when we combine it with a subset of our new features.
  •  
36.
  • Savary, Agata, et al. (author)
  • PARSEME Corpus Release 1.3
  • 2023
  • In: Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023). - Stroudsburg : Association for Computational Linguistics. - 9781959429593 ; , s. 24-35
  • Conference paper (peer-reviewed)abstract
    • We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split following the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.
  •  
37.
  • Savary, Agata, et al. (author)
  • PARSEME Meets Universal Dependencies : Getting on the Same Page in Representing Multiword Expressions
  • 2023
  • In: Northern European Journal of Language Technology (NEJLT). - : Linköping University Electronic Press. - 2000-1533. ; 9:1
  • Journal article (peer-reviewed)abstract
    • Multiword expressions (MWEs) are challenging and pervasive phenomena whose idiosyncratic properties show notably at the levels of lexicon, morphology, and syntax. Thus, they should best be annotated jointly with morphosyntax. In this position paper we discuss two multilingual initiatives, Universal Dependencies and PARSEME, addressing these annotation layers in cross-lingually unified ways. We compare the annotation principles of these initiatives with respect to MWEs, and we put forward a roadmap towards their gradual unification. The expected outcomes are more consistent treebanking and higher universality in modeling idiosyncrasy.
  •  
38.
  •  
39.
  • Smith, Aaron, 1985-, et al. (author)
  • An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing
  • 2018
  • In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. - : Association for Computational Linguistics. - 9781948087841 ; , s. 2711-2720
  • Conference paper (peer-reviewed)abstract
    • We provide a comprehensive analysis of the interactions between pre-trained word embeddings, character models and POS tags in a transition-based dependency parser. While previous studies have shown POS information to be less important in the presence of character models, we show that in fact there are complex interactions between all three techniques. In isolation each produces large improvements over a baseline system using randomly initialised word embeddings only, but combining them quickly leads to diminishing returns. We categorise words by frequency, POS tag and language in order to systematically investigate how each of the techniques affects parsing quality. For many word categories, applying any two of the three techniques is almost as good as the full combined system. Character models tend to be more important for low-frequency open-class words, especially in morphologically rich languages, while POS tags can help disambiguate high-frequency function words. We also show that large character embedding sizes help even for languages with small character sets, especially in morphologically rich languages.
  •  
40.
  • Šoštarić, Margita, et al. (author)
  • Discourse-Related Language Contrasts in English-Croatian Human and Machine Translation
  • 2018
  • In: Proceedings of the Third Conference on Machine Translation: Research Papers. ; , s. 36-48
  • Conference paper (peer-reviewed)abstract
    • We present an analysis of a number of coreference phenomena in English-Croatian human and machine translations. The aim is to shed light on the differences in the way these structurally different languages make use of discourse information and provide insights for discourse-aware machine translation system development. The phenomena are automatically identified in parallel data using annotation produced by parsers and word alignment tools, enabling us to pinpoint patterns of interest in both languages. We make the analysis more fine-grained by including three corpora pertaining to three different registers. In a second step, we create a test set with the challenging linguistic constructions and use it to evaluate the performance of three MT systems. We show that both SMT and NMT systems struggle with handling these discourse phenomena, even though NMT tends to perform somewhat better than SMT. By providing an overview of patterns frequently occurring in actual language use, as well as by pointing out the weaknesses of current MT systems that commonly mistranslate them, we hope to contribute to the effort of resolving the issue of discourse phenomena in MT applications.
  •  
41.
  •  
42.
  •  
43.
  • Stymne, Sara, 1977- (author)
  • A Comparison of Merging Strategies for Translation of German Compounds
  • 2009
  • In: Proceedings of the Student Research Workshop at the 12th Conference of the European Chapter of the ACL (EACL 2009). - : Association for Computational Linguistics. ; , s. 61-69
  • Conference paper (peer-reviewed)abstract
    • In this article, compound processing for translation into German in a factored statistical MT system is investigated. Compounds are handled by splitting them prior to training, and merging the parts after translation. I have explored eight merging strategies using different combinations of external knowledge sources, such as word lists, and internal sources that are carried through the translation process, such as symbols or parts-of-speech. I show that for merging to be successful, some internal knowledge source is needed. I also show that an extra sequence model for part-of-speech is useful in order to improve the order of compound parts in the output. The best merging results are achieved by a matching scheme for part-of-speech tags.
  •  
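The POS-matching merging step described in the abstract above can be illustrated with a minimal sketch. This is not the paper's exact matching scheme: the tag convention (a '-PART' suffix carried through translation to mark compound modifiers) and all names are assumptions made here for illustration.

```python
# Hedged sketch of POS-based compound merging (illustrative only):
# adjacent translated tokens are merged into one compound when a
# compound-part tag is followed by a head word with the matching base tag.

def merge_compounds(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs. A tag ending in '-PART'
    marks a compound modifier that should attach to a following head word
    carrying the matching base tag (e.g. 'NN-PART' + 'NN' -> one noun)."""
    merged = []
    buffer = []        # pending compound parts waiting for a head
    part_tag = None    # base POS tag shared by the pending parts
    for word, tag in tagged_tokens:
        if tag.endswith("-PART"):
            base = tag[: -len("-PART")]
            if part_tag is None or part_tag == base:
                buffer.append(word)
                part_tag = base
            else:
                merged.extend(buffer)            # tag mismatch: give up merging
                buffer, part_tag = [word], base
        elif buffer and tag == part_tag:
            merged.append("".join(buffer) + word)  # matching head: merge
            buffer, part_tag = [], None
        else:
            merged.extend(buffer)                # no matching head: flush parts
            merged.append(word)
            buffer, part_tag = [], None
    merged.extend(buffer)                        # sentence-final leftover parts
    return merged
```

For example, `[("Haus", "NN-PART"), ("tür", "NN")]` would be merged into the single noun `Haustür`, while a part followed by a non-matching tag is left as separate words.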
44.
  • Stymne, Sara, 1977-, et al. (author)
  • Annotating Errors in Student Texts : First Experiences and Experiments
  • 2017
  • In: Proceedings of Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop. - Göteborg. ; , s. 47-60
  • Conference paper (peer-reviewed)abstract
    • We describe the creation of an annotation layer for word-based writing errors for a corpus of student writings. The texts are written in Swedish by students between 9 and 19 years old. Our main purpose is to identify errors regarding spelling, split compounds and merged words. In addition, we also identify simple word-based grammatical errors, including morphological errors and extra words. In this paper we describe the corpus and the annotation process, including detailed descriptions of the error types and guidelines. We find that we can perform this annotation with substantial inter-annotator agreement, but that there are still some remaining issues with the annotation. We also report results on two pilot experiments regarding spelling correction and the consistency of downstream NLP tools, to exemplify the usefulness of the annotated corpus.
  •  
45.
  • Stymne, Sara, 1977- (author)
  • Blast: A Tool for Error Analysis of Machine Translation Output
  • 2011
  • In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, system demonstrations. - : Association for Computational Linguistics. - 9781932432909 ; , s. 56-61
  • Conference paper (peer-reviewed)abstract
    • We present BLAST, an open source tool for error analysis of machine translation (MT) output. We believe that error analysis, i.e., to identify and classify MT errors, should be an integral part of MT development, since it gives a qualitative view, which is not obtained by standard evaluation methods. BLAST can aid MT researchers and users in this process, by providing an easy-to-use graphical user interface. It is designed to be flexible, and can be used with any MT system, language pair, and error typology. The annotation task can be aided by highlighting similarities with a reference translation.
  •  
46.
  • Stymne, Sara, 1977- (author)
  • Clustered Word Classes for Preordering in Statistical Machine Translation
  • 2012
  • In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. - : Association for Computational Linguistics. ; , s. 28-34
  • Conference paper (peer-reviewed)abstract
    • Clustered word classes have been used in connection with statistical machine translation, for instance for improving word alignments. In this work we investigate if clustered word classes can be used in a preordering strategy, where the source language is reordered prior to training and translation. Part-of-speech tagging has previously been successfully used for learning reordering rules that can be applied before training and translation. We show that we can use word clusters for learning rules, and significantly improve on a baseline with only slightly worse performance than for standard POS-tags on an English–German translation task. We also show the usefulness of the approach for the less-resourced language Haitian Creole, for translation into English, where the suggested approach is significantly better than the baseline.
  •  
47.
  • Stymne, Sara, 1977- (author)
  • Compound Merging Strategies for Statistical Machine Translation
  • 2010
  • In: Grace Hopper Celebration of Women in Computing. ; , s. 43-43
  • Conference paper (other academic/artistic)abstract
    • Translation into compounding languages like German and Swedish is a challenge for statistical machine translation. I present a novel algorithm for merging compound parts, based on part-of-speech matching with an extended tag set. It improves the quality of merged compounds compared to previously suggested methods, both measured automatically and shown in an error analysis. Translation is also improved compared to systems without compound processing for Swedish, Danish, and German.
  •  
48.
  • Stymne, Sara, 1977- (author)
  • Compound Processing for Phrase-Based Statistical Machine Translation
  • 2009
  • Licentiate thesis (other academic/artistic)abstract
    • In this thesis I explore how compound processing can be used to improve phrase-based statistical machine translation (PBSMT) between English and German/Swedish. Both German and Swedish generally use closed compounds, which are written as one word without spaces or other indicators of word boundaries. Compounding is both common and productive, which makes it problematic for PBSMT, mainly due to sparse data problems. The adopted strategy for compound processing is to split compounds into their component parts before training and translation. For translation into Swedish and German the parts are merged after translation. I investigate the effect of different splitting algorithms for translation between English and German, and of different merging algorithms for German. I also apply these methods to a different language pair, English–Swedish. Overall the studies show that compound processing is useful, especially for translation from English into German or Swedish. But there are improvements for translation into English as well, such as a reduction of unknown words. I show that for translation between English and German different splitting algorithms work best for different translation directions. I also design and evaluate a novel merging algorithm based on part-of-speech matching, which outperforms previous methods for compound merging, showing the need for information that is carried through the translation process, rather than only external knowledge sources such as word lists. Most of the methods for compound processing were originally developed for German. I show that these methods can be applied to Swedish as well, with similar results.
  •  
49.
  • Stymne, Sara, 1977- (author)
  • Cross-Lingual Domain Adaptation for Dependency Parsing
  • 2020
  • In: Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (TLT). - Stroudsburg, PA, USA : Association for Computational Linguistics. ; , s. 62-69
  • Conference paper (peer-reviewed)abstract
    • We show how we can adapt parsing to low-resource domains by combining treebanks across languages for a parser model with treebank embeddings. We demonstrate how we can take advantage of in-domain treebanks from other languages, and show that this is especially useful when only out-of-domain treebanks are available for the target language. The method is also extended to low-resource languages by using out-of-domain treebanks from related languages. Two parameter-free methods for applying treebank embeddings at test time are proposed, which give competitive results to tuned methods when applied to Twitter data and transcribed speech. This gives us a method for selecting treebanks and training a parser targeted at any combination of domain and language.
  •  
50.
  • Stymne, Sara, 1977- (author)
  • Definite Noun Phrases in Statistical Machine Translation into Danish
  • 2009
  • In: Proceedings of the Workshop on Extracting and Using Constructions in NLP. ; , s. 4-9
  • Conference paper (peer-reviewed)abstract
    • There are two ways to express definiteness in Danish, which makes it problematic for statistical machine translation (SMT) from English, since the wrong realisation can be chosen. We present a part-of-speech-based method for identifying and transforming English definite NPs that would likely be expressed in a different way in Danish. The transformed English is used for training a phrase-based SMT system. This technique gives significant improvements of translation quality, of up to 22.1% relative on Bleu, compared to a baseline trained on original English, in two different domains.
  •  
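The source-side transformation described above can be illustrated with a minimal sketch. Danish expresses definiteness either as a noun suffix (hus 'house' vs. huset 'the house') or, when an adjective intervenes, with a separate article (det store hus 'the big house'). A hypothetical preprocessing step, with an invented token marker, might fuse English 'the' into a directly following noun so that the SMT system can learn the suffixed realisation; this is an illustrative assumption, not the paper's exact transformation.

```python
# Hedged sketch of a definiteness transformation (illustrative only):
# 'the' immediately followed by a noun is fused into a single marked
# token, while 'the' before an adjective is left unchanged, mirroring
# the two Danish realisations of definiteness.

def transform_definite_nps(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs for an English sentence.
    'the' directly before a noun (NN/NNS) becomes part of a fused token
    'noun_DEF'; any other 'the' is kept as a separate article."""
    out = []
    i = 0
    while i < len(tagged_tokens):
        word, pos = tagged_tokens[i]
        nxt = tagged_tokens[i + 1] if i + 1 < len(tagged_tokens) else None
        if word.lower() == "the" and nxt and nxt[1] in ("NN", "NNS"):
            out.append((nxt[0] + "_DEF", nxt[1]))   # fuse article into noun
            i += 2
        else:
            out.append((word, pos))
            i += 1
    return out
```

At translation time, a phrase pair such as `house_DEF -> huset` could then be learned directly from the transformed training data.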
Type of publication
conference paper (72)
journal article (6)
doctoral thesis (2)
licentiate thesis (1)
Type of content
peer-reviewed (67)
other academic/artistic (13)
pop. science, debate, etc. (1)
Author/Editor
Stymne, Sara, 1977- (69)
Stymne, Sara (11)
Tiedemann, Jörg (10)
Nivre, Joakim, 1962- (10)
Hardmeier, Christian (10)
de Lhoneux, Miryam, ... (9)
Nivre, Joakim (8)
Ahrenberg, Lars, 194 ... (8)
Ahrenberg, Lars (4)
Savary, Agata (3)
Cap, Fabienne (3)
Dürlich, Luise (2)
Danielsson, Henrik, ... (2)
Loáiciga, Sharid (2)
Karlsson, Johanna (2)
Bremin, Sofia (2)
Hu, Hongzhan (2)
Prytz Lillkull, Anna (2)
Wester, Martin (2)
Cerniavski, Rafal (2)
Von Arnold, Sara (1)
Lankinen, Åsa (1)
Sundberg, Björn (1)
Adams, Allison (1)
Sundberg, Eva (1)
Ljung, Karin (1)
Stenlid, Jan (1)
Krek, Simon (1)
Merkel, Magnus (1)
Nilsson, Ove (1)
Karlgren, Jussi (1)
Guillou, Liane (1)
Andersson, Inger (1)
Bhalerao, Rishikesh ... (1)
Bozhkov, Peter (1)
Dixelius, Christina (1)
Mellerowicz, Ewa (1)
Stymne, Sten (1)
Wingsle, Gunnar (1)
Östling, Robert (1)
Megyesi, Beáta, 1971 ... (1)
Smith, Christian (1)
Smith, Aaron (1)
Gatt, Albert (1)
Basirat, Ali, 1982- (1)
Palmér, Anne, 1961- (1)
Kovalevskaite, Jolan ... (1)
Bohnet, Bernd (1)
Ginter, Filip (1)
Svedjedal, Johan, 19 ... (1)
University
Uppsala University (53)
Linköping University (28)
RISE (3)
Royal Institute of Technology (1)
Swedish University of Agricultural Sciences (1)
Language
English (78)
Swedish (3)
Research subject (UKÄ/SCB)
Natural sciences (76)
Humanities (11)
Agricultural Sciences (1)
Social Sciences (1)
