SwePub
Sök i SwePub databas


Träfflista för sökning "WFRF:(Joakim Nivre) "

Sökning: WFRF:(Joakim Nivre)

  • Resultat 1-50 av 328
Numrering  Referens  Omslagsbild  Hitta
1.
  • Adesam, Yvonne, 1975- (författare)
  • The Multilingual Forest : Investigating High-quality Parallel Corpus Development
  • 2012
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • This thesis explores the development of parallel treebanks, collections of language data consisting of texts and their translations, with syntactic annotation and alignment, linking words, phrases, and sentences to show translation equivalence. We describe the semi-manual annotation of the SMULTRON parallel treebank, consisting of 1,000 sentences in English, German and Swedish. This description is the starting point for answering the first of two questions in this thesis. What issues need to be considered to achieve a high-quality, consistent, parallel treebank? The units of annotation and the choice of annotation schemes are crucial for quality, and some automated processing is necessary to increase the size. Automatic quality checks and evaluation are essential, but manual quality control is still needed to achieve high quality. Additionally, we explore improving the automatically created annotation for one language, using information available from the annotation of the other languages. This leads us to the second of the two questions in this thesis. Can we improve automatic annotation by projecting information available in the other languages? Experiments with automatic alignment, which is projected from two language pairs, L1–L2 and L1–L3, onto the third pair, L2–L3, show an improvement in precision, in particular if the projected alignment is intersected with the system alignment. We also construct a test collection for experiments on annotation projection to resolve prepositional phrase attachment ambiguities. While majority vote projection improves the annotation, compared to the basic automatic annotation, using linguistic clues to correct the annotation before majority vote projection is even better, although more laborious. However, some structural errors cannot be corrected by projection at all, as different languages have different wording, and thus different structures.
  •  
2.
  • Ahlsén, Elisabeth, 1951, et al. (författare)
  • Feedback in different social activities
  • 2006
  • Ingår i: Current trends in Research on Spoken Language in the Nordic Countries. ; , s. 26-44
  • Tidskriftsartikel (refereegranskat)
  •  
4.
  • Allwood, Jens, 1947, et al. (författare)
  • Speech Management - on the Non-Written Life of Speech
  • 1990
  • Ingår i: Nordic Journal of Linguistics. - 0332-5865. ; 13:1, s. 3-48
  • Tidskriftsartikel (refereegranskat)abstract
    • This paper introduces the concept of speech management (SM), which refers to processes whereby a speaker manages his or her linguistic contributions to a communicative interaction, and which involves phenomena which have previously been studied under such rubrics as "planning", "editing", "(self)repair", etc.
  •  
6.
  • Baldwin, Timothy, et al. (författare)
  • Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics
  • 2021
  • Ingår i: Dagstuhl Reports. - Dagstuhl. - 2192-5283. ; 11:7, s. 89-138
  • Tidskriftsartikel (refereegranskat)abstract
    • Computational linguistics builds models that can usefully process and produce language and that can increase our understanding of linguistic phenomena. From the computational perspective, language data are particularly challenging notably due to their variable degree of idiosyncrasy (unexpected properties shared by few peer objects), and the pervasiveness of non-compositional phenomena such as multiword expressions (whose meaning cannot be straightforwardly deduced from the meanings of their components, e.g. red tape, by and large, to pay a visit and to pull one's leg) and constructions (conventional associations of forms and meanings). Additionally, if models and methods are to be consistent and valid across languages, they have to face specificities inherent either to particular languages, or to various linguistic traditions. These challenges were addressed by the Dagstuhl Seminar 21351 entitled "Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics", which took place on 30-31 August 2021. Its main goal was to create synergies between three distinct though partly overlapping communities: experts in typology, in cross-lingual morphosyntactic annotation and in multiword expressions. This report documents the program and the outcomes of the seminar. We present the executive summary of the event, reports from the 3 Working Groups and abstracts of individual talks and open problems presented by the participants.
  •  
7.
  • Ballesteros, Miguel, et al. (författare)
  • Going to the Roots of Dependency Parsing
  • 2013
  • Ingår i: Computational linguistics - Association for Computational Linguistics (Print). - : MIT Press. - 0891-2017 .- 1530-9312. ; 39:1, s. 5-13
  • Tidskriftsartikel (refereegranskat)abstract
    • Dependency trees used in syntactic parsing often include a root node representing a dummy word prefixed or suffixed to the sentence, a device that is generally considered a mere technical convenience and is tacitly assumed to have no impact on empirical results. We demonstrate that this assumption is false and that the accuracy of data-driven dependency parsers can in fact be sensitive to the existence and placement of the dummy root node. In particular, we show that a greedy, left-to-right, arc-eager transition-based parser consistently performs worse when the dummy root node is placed at the beginning of the sentence (following the current convention in data-driven dependency parsing) than when it is placed at the end or omitted completely. Control experiments with an arc-standard transition-based parser and an arc-factored graph-based parser reveal no consistent preferences but nevertheless exhibit considerable variation in results depending on root placement. We conclude that the treatment of dummy root nodes in data-driven dependency parsing is an underestimated source of variation in experiments and may also be a parameter worth tuning for some parsers.
  •  
8.
  • Ballesteros, Miguel, et al. (författare)
  • MaltOptimizer : Fast and Effective Parser Optimization
  • 2016
  • Ingår i: Natural Language Engineering. - 1351-3249 .- 1469-8110. ; 22:2, s. 187-213
  • Tidskriftsartikel (refereegranskat)abstract
    • Statistical parsers often require careful parameter tuning and feature selection. This is a nontrivial task for application developers who are not interested in parsing for its own sake, and it can be time-consuming even for experienced researchers. In this paper we present MaltOptimizer, a tool developed to automatically explore parameters and features for MaltParser, a transition-based dependency parsing system that can be used to train parsers given treebank data. MaltParser provides a wide range of parameters for optimization, including nine different parsing algorithms, an expressive feature specification language that can be used to define arbitrarily rich feature models, and two machine learning libraries, each with their own parameters. MaltOptimizer is an interactive system that performs parser optimization in three stages. First, it performs an analysis of the training set in order to select a suitable starting point for optimization. Second, it selects the best parsing algorithm and tunes the parameters of this algorithm. Finally, it performs feature selection and tunes machine learning parameters. Experiments on a wide range of data sets show that MaltOptimizer quickly produces models that consistently outperform default settings and often approach the accuracy achieved through careful manual optimization.
  •  
10.
  • Basirat, Ali, et al. (författare)
  • A statistical model for grammar mapping
  • 2016
  • Ingår i: Natural Language Engineering. - : Cambridge University Press. - 1351-3249 .- 1469-8110. ; 22:2, s. 215-255
  • Tidskriftsartikel (refereegranskat)abstract
    • The two main classes of grammars are (a) hand-crafted grammars, which are developed by language experts, and (b) data-driven grammars, which are extracted from annotated corpora. This paper introduces a statistical method for mapping the elementary structures of a data-driven grammar onto the elementary structures of a hand-crafted grammar in order to combine their advantages. The idea is employed in the context of Lexicalized Tree-Adjoining Grammars (LTAG) and tested on two LTAGs of English: the hand-crafted LTAG developed in the XTAG project, and the data-driven LTAG, which is automatically extracted from the Penn Treebank and used by the MICA parser. We propose a statistical model for mapping any elementary tree sequence of the MICA grammar onto a proper elementary tree sequence of the XTAG grammar. The model has been tested on three subsets of the WSJ corpus that have average lengths of 10, 16, and 18 words, respectively. The experimental results show that full-parse trees with average F1-scores of 72.49, 64.80, and 62.30 points could be built from 94.97%, 96.01%, and 90.25% of the XTAG elementary tree sequences assigned to the subsets, respectively. Moreover, by reducing the amount of syntactic lexical ambiguity of sentences, the proposed model significantly improves the efficiency of parsing in the XTAG system.
  •  
11.
  • Basirat, Ali, et al. (författare)
  • Greedy Universal Dependency Parsing with Right Singular Word Vectors
  • 2016
  • Konferensbidrag (refereegranskat)abstract
    • A set of continuous feature vectors formed by right singular vectors of a transformed co-occurrence matrix are used with the Stanford neural dependency parser to train parsing models for a limited number of languages in the corpus of universal dependencies. We show that the feature vector can help the parser to remain greedy and be as accurate as (or even more accurate than) some other greedy and non-greedy parsers.
  •  
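    The abstract above describes word vectors formed by the right singular vectors of a transformed co-occurrence matrix. A minimal sketch of that linear algebra is shown below; the toy counts, the log(1 + x) transformation, and the dimensionality are invented for illustration and are not taken from the paper.

    ```python
    import numpy as np

    # Toy word-word co-occurrence counts (rows/columns = vocabulary).
    # In practice the matrix is built from a large corpus; these numbers
    # are made up purely to illustrate the computation.
    C = np.array([
        [0., 2., 1., 0.],
        [2., 0., 3., 1.],
        [1., 3., 0., 2.],
        [0., 1., 2., 0.],
    ])

    # A simple transformation to dampen raw frequency effects (the paper
    # uses its own transformation; log(1 + x) is a common stand-in).
    M = np.log1p(C)

    # SVD: M = U @ diag(S) @ Vt. The rows of Vt are the right singular
    # vectors; keeping the top-k gives k-dimensional word vectors.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    k = 2
    word_vectors = Vt[:k].T  # one k-dimensional vector per vocabulary word

    print(word_vectors.shape)  # (4, 2): 4 words, 2 dimensions
    ```

    Using the right singular vectors (rather than U scaled by the singular values) ties each word's representation to the contexts it appears in, which is the property the papers above exploit for parsing features.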
13.
  • Basirat, Ali, 1982- (författare)
  • Principal Word Vectors
  • 2018
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequency of seeing words in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires a very large amount of memory and CPU time to be processed. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form. We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in the literature, enable the principal word embedding to train high-quality word embeddings in an efficient way. We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it. The principal word embeddings are evaluated in both intrinsic and extrinsic ways. The intrinsic evaluations are directed towards the study of the distribution of word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to some standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method is able to generate word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.
  •  
14.
  • Basirat, Ali, 1982-, et al. (författare)
  • Real-valued syntactic word vectors
  • 2020
  • Ingår i: Journal of experimental and theoretical artificial intelligence (Print). - 0952-813X .- 1362-3079. ; 32:4, s. 557-579
  • Tidskriftsartikel (refereegranskat)abstract
    • We introduce a word embedding method that generates a set of real-valued word vectors from a distributional semantic space. The semantic space is built with a set of context units (words) which are selected by an entropy-based feature selection approach with respect to the certainty involved in their contextual environments. We show that the most predictive context of a target word is its preceding word. An adaptive transformation function is also introduced that reshapes the data distribution to make it suitable for dimensionality reduction techniques. The final low-dimensional word vectors are formed by the singular vectors of a matrix of transformed data. We show that the resulting word vectors are as good as other sets of word vectors generated with popular word embedding methods.
  •  
15.
  • Basirat, Ali, 1982-, et al. (författare)
  • Real-valued Syntactic Word Vectors (RSV) for Greedy Neural Dependency Parsing
  • 2017
  • Konferensbidrag (refereegranskat)abstract
    • We show that a set of real-valued word vectors formed by right singular vectors of a transformed co-occurrence matrix are meaningful for determining different types of dependency relations between words. Our experimental results on the task of dependency parsing confirm the superiority of the word vectors to the other sets of word vectors generated by popular methods of word embedding. We also study the effect of using these vectors on the accuracy of dependency parsing in different languages versus using more complex parsing architectures.
  •  
16.
  • Basirat, Ali, Postdoctoral Researcher, 1982-, et al. (författare)
  • Syntactic Nuclei in Dependency Parsing : A Multilingual Exploration
  • 2021
  • Ingår i: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics. - Stroudsburg, PA, USA : Association for Computational Linguistics. - 9781954085022 ; , s. 1376-1387
  • Konferensbidrag (refereegranskat)abstract
    • Standard models for syntactic dependency parsing take words to be the elementary units that enter into dependency relations. In this paper, we investigate whether there are any benefits from enriching these models with the more abstract notion of nucleus proposed by Tesnière. We do this by showing how the concept of nucleus can be defined in the framework of Universal Dependencies and how we can use composition functions to make a transition-based dependency parser aware of this concept. Experiments on 12 languages show that nucleus composition gives small but significant improvements in parsing accuracy. Further analysis reveals that the improvement mainly concerns a small number of dependency relations, including relations of coordination, direct objects, nominal modifiers, and main predicates.
  •  
17.
  • Bengoetxea, Kepa, et al. (författare)
  • On WordNet Semantic Classes and Dependency Parsing
  • 2014
  • Ingår i: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). ; , s. 649-655
  • Konferensbidrag (refereegranskat)
  •  
18.
  • Bigert, Johnny, 1976- (författare)
  • Automatic and unsupervised methods in natural language processing
  • 2005
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Natural language processing (NLP) means the computer-aided processing of language produced by a human. But human language is inherently irregular and the most reliable results are obtained when a human is involved in at least some part of the processing. However, manual work is time-consuming and expensive. This thesis focuses on what can be accomplished in NLP when manual work is kept to a minimum. We describe the construction of two tools that greatly simplify the implementation of automatic evaluation. They are used to implement several supervised, semi-supervised and unsupervised evaluations by introducing artificial spelling errors. We also describe the design of a rule-based shallow parser for Swedish called GTA and a detection algorithm for context-sensitive spelling errors based on semi-supervised learning, called ProbCheck. In the second part of the thesis, we first implement a supervised evaluation scheme that uses an error-free treebank to determine the robustness of a parser when faced with noisy input such as spelling errors. We evaluate the GTA parser and determine the robustness of the individual components of the parser as well as the robustness for different phrase types. Second, we create an unsupervised evaluation procedure for parser robustness. The procedure allows us to evaluate the robustness of parsers using different parser formalisms on the same text and compare their performance. Five parsers and one tagger are evaluated. For four of these, we have access to annotated material and can verify the estimations given by the unsupervised evaluation procedure. The results turned out to be very accurate with few exceptions and thus, we can reliably establish the robustness of an NLP system without any need of manual work. Third, we implement an unsupervised evaluation scheme for spell checkers. Using this, we perform a very detailed analysis of three spell checkers for Swedish. Last, we evaluate the ProbCheck algorithm. Two methods are included for comparison: a full parser and a method using tagger transition probabilities. The algorithm obtains results superior to the comparison methods. The algorithm is also evaluated on authentic data in combination with a grammar and spell checker.
  •  
21.
  • Borg, Markus, et al. (författare)
  • Time extraction from real-time generated football reports
  • 2007
  • Ingår i: [Host publication title missing]. - 9789985405130 ; , s. 37-43
  • Konferensbidrag (refereegranskat)abstract
    • This paper describes a system to extract events and time information from football match reports generated through minute-by-minute reporting. We describe a method that uses regular expressions to find the events and divides them into different types to determine in which order they occurred. In addition, our system detects time expressions and we present a way to structure the collected data using XML.
  •  
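    The report above describes matching events in minute-by-minute football reports with regular expressions and ordering them by their time stamps. A minimal sketch in the same spirit follows; the pattern, event labels, and example lines are invented for illustration and do not come from the paper.

    ```python
    import re

    # Hypothetical minute-by-minute report lines; real input would come
    # from a live-text feed.
    report = [
        "23' Goal for AIK, header by Eriksson.",
        "41' Yellow card, Johansson (Hammarby).",
        "67' Substitution: Larsson replaces Berg.",
    ]

    # One alternation per event type; the minute prefix gives the ordering.
    pattern = re.compile(r"^(?P<minute>\d+)'\s+(?P<event>Goal|Yellow card|Substitution)")

    events = []
    for line in report:
        m = pattern.match(line)
        if m:
            events.append((int(m.group("minute")), m.group("event")))

    # Sort by minute to recover the order in which events occurred.
    events.sort()
    print(events)  # [(23, 'Goal'), (41, 'Yellow card'), (67, 'Substitution')]
    ```

    Structuring the resulting (minute, event) pairs as XML, as the paper does, is then a straightforward serialization step.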
23.
  • Buljan, Maja, et al. (författare)
  • A Tale of Four Parsers : Methodological Reflections on Diagnostic Evaluation and In-Depth Error Analysis for Meaning Representation Parsing
  • 2022
  • Ingår i: Language Resources and Evaluation. - : Springer Science and Business Media LLC. - 1574-020X .- 1574-0218. ; 56:4, s. 1075-1102
  • Tidskriftsartikel (refereegranskat)abstract
    • We discuss methodological choices in diagnostic evaluation and error analysis in meaning representation parsing (MRP), i.e. mapping from natural language utterances to graph-based encodings of semantic structure. We expand on a pilot quantitative study in contrastive diagnostic evaluation, inspired by earlier work in syntactic dependency parsing, and propose a novel methodology for qualitative error analysis. This two-pronged study is performed using a selection of submissions, data, and evaluation tools featured in the 2019 shared task on MRP. Our aim is to devise methods for identifying strengths and weaknesses in different broad families of parsing techniques, as well as investigating the relations between specific parsing approaches, different meaning representation frameworks, and individual linguistic phenomena—by identifying and comparing common error patterns. Our preliminary empirical results suggest that the proposed methodologies can be meaningfully applied to parsing into graph-structured target representations, as a side-effect uncovering hitherto unknown properties of the different systems that can inform future development and cross-fertilization across approaches.
  •  
24.
  • Buljan, Maja, et al. (författare)
  • A Tale of Three Parsers : Towards Diagnostic Evaluation for Meaning Representation Parsing
  • 2020
  • Ingår i: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). - Paris : European Language Resources Association (ELRA). - 9791095546344 ; , s. 1902-1909
  • Konferensbidrag (refereegranskat)abstract
    • We discuss methodological choices in contrastive and diagnostic evaluation in meaning representation parsing, i.e. mapping from natural language utterances to graph-based encodings of semantic structure. Drawing inspiration from earlier work in syntactic dependency parsing, we transfer and refine several quantitative diagnosis techniques for use in the context of the 2019 shared task on Meaning Representation Parsing (MRP). As in parsing proper, moving evaluation from simple rooted trees to general graphs brings along its own range of challenges. Specifically, we seek to begin to shed light on relative strengths and weaknesses in different broad families of parsing techniques. In addition to these theoretical reflections, we conduct a pilot experiment on a selection of top-performing MRP systems and two of the five meaning representation frameworks in the shared task. Empirical results suggest that the proposed methodology can be meaningfully applied to parsing into graph-structured target representations, uncovering hitherto unknown properties of the different systems that can inform future development and cross-fertilization across approaches.
  •  
25.
  • Bunt, Harry, et al. (författare)
  • Grammars, Parsers and Recognizers
  • 2014
  • Ingår i: Journal of Logic and Computation. - : Oxford Journals. ; 24:2, s. 309-
  • Tidskriftsartikel (refereegranskat)
  •  
26.
  • Calacean, Mihaela, et al. (författare)
  • A Data-Driven Dependency Parser for Romanian
  • 2009
  • Ingår i: Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories.. - 9789078328773 ; , s. 65-76
  • Konferensbidrag (refereegranskat)
  •  
27.
  • Cap, Fabienne, et al. (författare)
  • SWORD : Towards Cutting-Edge Swedish Word Processing
  • 2016
  • Ingår i: Proceedings of SLTC 2016.
  • Konferensbidrag (refereegranskat)abstract
    • Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization, where we compare the output of six different tokenizers on four different text types. For one text type (Wikipedia articles), we also compare to the tokenization produced by six manual annotators.
  •  
29.
  • Carlsson, Fredrik, et al. (författare)
  • Fine-Grained Controllable Text Generation Using Non-Residual Prompting
  • 2022
  • Ingår i: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. - Stroudsburg, PA, USA : Association for Computational Linguistics. - 9781955917216 ; , s. 6837-6857
  • Konferensbidrag (refereegranskat)abstract
    • The introduction of immensely large Causal Language Models (CLMs) has rejuvenated the interest in open-ended text generation. However, controlling the generative process for these Transformer-based models is at large an unsolved problem. Earlier work has explored either plug-and-play decoding strategies, or more powerful but blunt approaches such as prompting. There hence currently exists a trade-off between fine-grained control, and the capability for more expressive high-level instructions. To alleviate this trade-off, we propose an encoder-decoder architecture that enables intermediate text prompts at arbitrary time steps. We propose a resource-efficient method for converting a pre-trained CLM into this architecture, and demonstrate its potential on various experiments, including the novel task of contextualized word inclusion. Our method provides strong results on multiple experimental settings, proving itself to be both expressive and versatile.
  •  
31.
  • Constant, Matthieu, et al. (författare)
  • A Transition-Based System for Joint Lexical and Syntactic Analysis
  • 2016
  • Ingår i: PROCEEDINGS OF THE 54TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1. ; , s. 161-171
  • Konferensbidrag (refereegranskat)abstract
    • We present a transition-based system that jointly predicts the syntactic structure and lexical units of a sentence by building two structures over the input words: a syntactic dependency tree and a forest of lexical units including multiword expressions (MWEs). This combined representation allows us to capture both the syntactic and semantic structure of MWEs, which in turn enables deeper downstream semantic analysis, especially for semi-compositional MWEs. The proposed system extends the arc-standard transition system for dependency parsing with transitions for building complex lexical units. Experiments on two different data sets show that the approach significantly improves MWE identification accuracy (and sometimes syntactic accuracy) compared to existing joint approaches.
  •  
32.
  • Csató, Éva Ágnes, 1948-, et al. (författare)
  • Parallel corpora and Universal Dependencies for Turkic
  • 2015
  • Ingår i: Turkic languages. - Wiesbaden. - 1431-4983. ; 19:2, s. 259-273
  • Tidskriftsartikel (refereegranskat)abstract
    • The first part of this paper presents ongoing work on Turkic parallel corpora at the Department of Linguistics and Philology, Uppsala University. Moreover, examples are given of how the Swedish-Turkish-English corpus is used in teaching Turkish and in comparative linguistic studies. The second part deals with the annotation scheme Universal Dependencies (UD) used in treebanks, and its application to Turkic languages.
  •  
33.
  • de Lhoneux, Miryam, 1990-, et al. (författare)
  • Arc-Hybrid Non-Projective Dependency Parsing with a Static-Dynamic Oracle
  • 2017
  • Ingår i: IWPT 2017 15th International Conference on Parsing Technologies. - Pisa, Italy : Association for Computational Linguistics. - 9781945626739 ; , s. 99-104
  • Konferensbidrag (refereegranskat)abstract
    • We extend the arc-hybrid transition system for dependency parsing with a SWAP transition that enables reordering of the words and construction of non-projective trees. Although this extension potentially breaks the arc-decomposability of the transition system, we show that the existing dynamic oracle can be modified and combined with a static oracle for the SWAP transition. Experiments on five languages with different degrees of non-projectivity show that the new system gives competitive accuracy and is significantly better than a system trained with a purely static oracle.
  •  
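    The abstract above extends the arc-hybrid transition system with a SWAP transition for non-projective parsing. A schematic sketch of the four transition types is given below; it is an illustration, not the authors' implementation: preconditions are simplified, arcs are unlabeled, the oracle is omitted, and SWAP follows the standard formulation of moving the second-topmost stack item back to the buffer.

    ```python
    # Schematic arc-hybrid transition system with a SWAP transition.
    # Configurations are (stack, buffer, arcs); word ids start at 1 and
    # id 0 is the artificial root. Arcs are (head, dependent) pairs.

    def shift(stack, buffer, arcs):
        stack.append(buffer.pop(0))

    def left_arc(stack, buffer, arcs):
        # Arc-hybrid: the front of the buffer becomes the head of the
        # stack top, which is popped.
        dep = stack.pop()
        arcs.append((buffer[0], dep))

    def right_arc(stack, buffer, arcs):
        # The second-topmost stack item becomes the head of the stack top.
        dep = stack.pop()
        arcs.append((stack[-1], dep))

    def swap(stack, buffer, arcs):
        # Move the second-topmost stack item back to the buffer, enabling
        # non-projective trees by reordering words online.
        buffer.insert(0, stack.pop(-2))

    # Parse a toy 3-word sentence with a hand-chosen transition sequence.
    stack, buffer, arcs = [0], [1, 2, 3], []
    for t in (shift, shift, left_arc, shift, right_arc, right_arc):
        t(stack, buffer, arcs)
    print(arcs)  # [(3, 2), (1, 3), (0, 1)]
    ```

    The paper's contribution is the oracle that decides which of these transitions to take during training: a dynamic oracle for the arc transitions combined with a static oracle for SWAP.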
34.
  • de Lhoneux, Miryam, 1990-, et al. (författare)
  • From raw text to Universal Dependencies : look, no tags!
  • 2017
  • Ingår i: Proceedings of the CoNLL 2017 Shared Task. - Vancouver, Canada : Association for Computational Linguistics. - 9781945626708 ; , s. 207-217
  • Konferensbidrag (refereegranskat)abstract
    • We present the Uppsala submission to the CoNLL 2017 shared task on parsing from raw text to universal dependencies. Our system is a simple pipeline consisting of two components. The first performs joint word and sentence segmentation on raw text; the second predicts dependency trees from raw words. The parser bypasses the need for part-of-speech tagging, but uses word embeddings based on universal tag distributions. We achieved a macro-averaged LAS F1 of 65.11 in the official test run and obtained the 2nd best result for sentence segmentation with a score of 89.03. After fixing two bugs, we obtained an unofficial LAS F1 of 70.49.
  •  
35.
  • de Lhoneux, Miryam, 1990- (författare)
  • Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages
  • 2019
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • This thesis presents several studies in neural dependency parsing for typologically diverse languages, using treebanks from Universal Dependencies (UD). The focus is on informing models with linguistic knowledge. We first extend a parser to work well on typologically diverse languages, including morphologically complex languages and languages whose treebanks have a high ratio of non-projective sentences, a notorious difficulty in dependency parsing. We propose a general methodology where we sample a representative subset of UD treebanks for parser development and evaluation. Our parser uses recurrent neural networks which construct information sequentially, and we study the incorporation of a recursive neural network layer in our parser. This follows the intuition that language is hierarchical. This layer turns out to be superfluous in our parser and we study its interaction with other parts of the network. We subsequently study transitivity and agreement information learned by our parser for auxiliary verb constructions (AVCs). We suggest that a parser should learn similar information about AVCs as it learns for finite main verbs. This is motivated by work in theoretical dependency grammar. Our parser learns different information about these two if we do not augment it with a recursive layer, but similar information if we do, indicating that there may be benefits from using that layer and we may not yet have found the best way to incorporate it in our parser. We finally investigate polyglot parsing. Training one model for multiple related languages leads to substantial improvements in parsing accuracy over a monolingual baseline. We also study different parameter sharing strategies for related and unrelated languages. Sharing parameters that partially abstract away from word order appears to be beneficial in both cases but sharing parameters that represent words and characters is more beneficial for related than unrelated languages.
  •  
36.
  • de Lhoneux, Miryam, 1990-, et al. (författare)
  • Old School vs. New School : Comparing Transition-Based Parsers with and without Neural Network Enhancement
  • 2017
  • Ingår i: Proceedings of the 15th Treebanks and Linguistic Theories Workshop (TLT). ; , s. 99-110
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we attempt a comparison between "new school" transition-based parsers that use neural networks and their classical "old school" counterpart. We carry out experiments on treebanks from the Universal Dependencies project. To facilitate the comparison and analysis of results, we only work on a subset of those treebanks, carefully selected in the hope of obtaining results that are representative of the whole set of treebanks. We select two parsers that are hopefully representative of the two schools, MaltParser and UDPipe, and we look at the impact of training size on the two models. We hypothesize that neural network enhanced models have a steeper learning curve with increased training size. We observe, however, that, contrary to expectations, neural network enhanced models need only a small amount of training data to outperform the classical models, but that the learning curves of both models increase at a similar pace after that. We carry out an error analysis on the development sets parsed by the two systems and observe that overall MaltParser suffers more than UDPipe from longer dependencies. We observe that MaltParser is only marginally better than UDPipe on a restricted set of short dependencies.
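The transition-based approach compared above can be illustrated with the arc-standard transition system popularized by MaltParser. In this sketch the trained classifier (whether classical or neural) is replaced by a static oracle that reads gold heads, so it handles projective trees only; the function name and token encoding are illustrative, not taken from either parser.

```python
# Arc-standard transition system: a stack, a buffer, and three
# transitions (SHIFT, LEFT-ARC, RIGHT-ARC). A real parser predicts the
# next transition with a classifier; here a static oracle derives it
# from gold heads (projective trees only).

def parse(gold_heads):
    """gold_heads[i] = head of token i+1 (0 = artificial root)."""
    n = len(gold_heads)
    stack, buffer, arcs = [0], list(range(1, n + 1)), set()

    def head(i):
        return gold_heads[i - 1]

    def finished(i):
        # True when token i has already collected all of its dependents
        return all(head(j) != i or (i, j) in arcs for j in range(1, n + 1))

    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if below != 0 and head(below) == top:      # LEFT-ARC
                arcs.add((top, below))
                stack.pop(-2)
                continue
            if head(top) == below and finished(top):   # RIGHT-ARC
                arcs.add((below, top))
                stack.pop()
                continue
        stack.append(buffer.pop(0))                    # SHIFT
    return arcs
```

For "the dog barks" with heads [2, 3, 0], the oracle recovers the arcs {(2, 1), (3, 2), (0, 3)}.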
  •  
37.
  • de Lhoneux, Miryam, 1990-, et al. (författare)
  • Recursive Subtree Composition in LSTM-Based Dependency Parsing
  • 2019
  • Ingår i: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. - Stroudsburg : Association for Computational Linguistics. - 9781950737130 ; , s. 1566-1576
  • Konferensbidrag (refereegranskat)abstract
    • The need for tree structure modelling on top of sequence modelling is an open issue in neural dependency parsing. We investigate the impact of adding a tree layer on top of a sequential model by recursively composing subtree representations (composition) in a transition-based parser that uses features extracted by a BiLSTM. Composition seems superfluous with such a model, suggesting that BiLSTMs capture information about subtrees. We perform model ablations to tease out the conditions under which composition helps. When ablating the backward LSTM, performance drops and composition does not recover much of the gap. When ablating the forward LSTM, performance drops less dramatically and composition recovers a substantial part of the gap, indicating that a forward LSTM and composition capture similar information. We take the backward LSTM to be related to lookahead features and the forward LSTM to the rich history-based features both crucial for transition-based parsers. To capture history-based information, composition is better than a forward LSTM on its own, but it is even better to have a forward LSTM as part of a BiLSTM. We correlate results with language properties, showing that the improved lookahead of a backward LSTM is especially important for head-final languages.
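The composition operation studied above can be sketched as follows. The actual model composes vectors with a learned cell on top of BiLSTM features; this toy stand-in uses a fixed elementwise mix, purely to show how a head's vector is updated from a dependent's vector each time an arc is added.

```python
import math

def compose(head_vec, dep_vec):
    # Stand-in for a learned composition cell: update the head's
    # representation from its own vector and the new dependent's.
    return [math.tanh(0.5 * (h + d)) for h, d in zip(head_vec, dep_vec)]

# Applied recursively, bottom-up: compose "dog" with its dependent
# "the" first, then "barks" with the updated "dog" vector, yielding a
# representation of the whole subtree.
the, dog, barks = [0.1, -0.2], [0.4, 0.3], [0.0, 0.5]
subtree = compose(barks, compose(dog, the))
```

The key property is that information from every word in the subtree can flow into the head's vector, which is what the composition layer adds over purely sequential features.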
  •  
38.
  • de Lhoneux, Miryam, 1990-, et al. (författare)
  • Should Have, Would Have, Could Have : Investigating Verb Group Representations for Parsing with Universal Dependencies.
  • 2016
  • Ingår i: Proceedings of the Workshop on Multilingual and Crosslingual Methods in NLP. - Stroudsburg : Association for Computational Linguistics (ACL). - 9781941643877 ; , s. 10-19
  • Konferensbidrag (refereegranskat)abstract
    • Treebanks have recently been released for a number of languages with the harmonized annotation created by the Universal Dependencies project. The representation of certain constructions in UD is known to be suboptimal for parsing, and these constructions may be worth transforming for the purpose of parsing. In this paper, we focus on the representation of verb groups. Several studies have shown that parsing works better when auxiliaries are the head of auxiliary dependency relations, which is not the case in UD. We therefore transformed verb groups in UD treebanks, parsed the test set and transformed it back, and, contrary to expectations, observed significant decreases in accuracy. We provide suggestive evidence that improvements in previous studies were obtained because the transformation helps disambiguate the POS tags of main verbs and auxiliaries. The question of why parsing accuracy decreases with this approach in the case of UD is left open.
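The transformation tested above can be sketched roughly as follows. In UD the main verb heads its auxiliaries via the aux relation; the transformed representation makes the auxiliary the head instead. The token format (id, head, deprel) and the function name are assumptions for illustration, not the paper's actual tooling, and dependents other than the verbs themselves are left in place here.

```python
def make_aux_head(tokens):
    """tokens: sequence of (id, head, deprel). Flip every aux arc so the
    auxiliary heads its main verb (the aux inherits the verb's head).
    With a chain of auxiliaries, left-to-right processing produces a
    chain aux -> aux -> main verb. Other dependents stay attached."""
    out = [list(t) for t in tokens]
    by_id = {t[0]: t for t in out}
    for aux_id, main_id, rel in tokens:
        if rel == "aux":
            by_id[aux_id][1] = by_id[main_id][1]  # aux takes over verb's head
            by_id[main_id][1] = aux_id            # verb attaches to the aux
    return out
```

For "should have gone" annotated UD-style (both auxiliaries attached to "gone"), the result is the chain root -> should -> have -> gone.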
  •  
40.
  • de Lhoneux, Miryam, 1990-, et al. (författare)
  • What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions?
  • 2019
  • Ingår i: CoRR. ; abs/1907.07950
  • Tidskriftsartikel (övrigt vetenskapligt/konstnärligt)abstract
    • This article is a linguistic investigation of a neural parser. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical work in dependency grammar and in particular the work of Tesnière (1959), where AVCs and FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus: it consists of at least two words, and an FMV is its non-dissociated counterpart, consisting of exactly one word. We suggest that the representation of AVCs and FMVs should capture similar information. We use diagnostic classifiers to probe agreement and transitivity information in vectors learned by a transition-based neural parser in four typologically different languages. We find that the parser learns different information about AVCs and FMVs if only sequential models (BiLSTMs) are used in the architecture but similar information when a recursive layer is used. We find explanations for why this is the case by looking closely at how information is learned in the network and at what happens with different dependency representations of AVCs.
  •  
41.
  • de Lhoneux, Miryam, 1990-, et al. (författare)
  • What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions?
  • 2020
  • Ingår i: Computational linguistics - Association for Computational Linguistics (Print). - : MIT Press. - 0891-2017 .- 1530-9312. ; 46:4, s. 763-784
  • Tidskriftsartikel (refereegranskat)abstract
    • There is a growing interest in investigating what neural NLP models learn about language. A prominent open question is the question of whether or not it is necessary to model hierarchical structure. We present a linguistic investigation of a neural parser adding insights to this question. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical work in dependency grammar and in particular the work of Tesnière (1959), where AVCs and FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus; it consists of at least two words, and an FMV is its non-dissociated counterpart, consisting of exactly one word. We suggest that the representation of AVCs and FMVs should capture similar information. We use diagnostic classifiers to probe agreement and transitivity information in vectors learned by a transition-based neural parser in four typologically different languages. We find that the parser learns different information about AVCs and FMVs if only sequential models (BiLSTMs) are used in the architecture but similar information when a recursive layer is used. We find explanations for why this is the case by looking closely at how information is learned in the network and looking at what happens with different dependency representations of AVCs. We conclude that there may be benefits to using a recursive layer in dependency parsing and that we have not yet found the best way to integrate it in our parsers.
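A diagnostic classifier of the kind used in this study is simply a small supervised model trained to predict a linguistic property (e.g. transitivity or agreement) from a parser's internal vectors; if it succeeds, the vectors encode that property. The sketch below uses a plain perceptron to stay dependency-free, whereas published probes typically use logistic regression or a small feed-forward network.

```python
def train_probe(vectors, labels, epochs=20):
    """Train a linear probe: labels are +1/-1, vectors stand in for a
    parser's hidden states (any fixed-length float lists)."""
    w, b = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # misclassified: perceptron update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(probe, x):
    w, b = probe
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

In a probing study, the interesting quantity is the probe's accuracy on held-out vectors, compared across architectures (e.g. BiLSTM-only versus BiLSTM plus recursive layer).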
  •  
42.
  • de Marneffe, Marie-Catherine, et al. (författare)
  • Dependency Grammar
  • 2019
  • Ingår i: Annual review of linguistics. - : ANNUAL REVIEWS. - 2333-9691 .- 2333-9683. ; 5, s. 197-218
  • Tidskriftsartikel (refereegranskat)abstract
    • Dependency grammar is a descriptive and theoretical tradition in linguistics that can be traced back to antiquity. It has long been influential in the European linguistics tradition and has more recently become a mainstream approach to representing syntactic and semantic structure in natural language processing. In this review, we introduce the basic theoretical assumptions of dependency grammar and review some key aspects in which different dependency frameworks agree or disagree. We also discuss advantages and disadvantages of dependency representations and introduce Universal Dependencies, a framework for multilingual dependency-based morphosyntactic annotation that has been applied to more than 60 languages.
  •  
43.
  • De Marneffe, Marie-Catherine, et al. (författare)
  • Universal Dependencies
  • 2021
  • Ingår i: Computational Linguistics. - : MIT Press. - 0891-2017 .- 1530-9312. ; 47, s. 255-308
  • Tidskriftsartikel (refereegranskat)abstract
    • Universal Dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are centrally used to explain how predicate–argument structures are encoded morphosyntactically in different languages while morphological features and part-of-speech classes give the properties of words. We argue that this theory is a good basis for crosslinguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.
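UD treebanks are distributed in the CoNLL-U format, with ten tab-separated columns per token. The reader below keeps only five of them (ID, FORM, UPOS, HEAD, DEPREL) and the sample is simplified accordingly, so this is a sketch of the format rather than a full CoNLL-U parser.

```python
# A UD-style analysis of "the dog barks", reduced to five of the ten
# CoNLL-U columns: ID, FORM, UPOS, HEAD, DEPREL.
SAMPLE = "1\tthe\tDET\t2\tdet\n2\tdog\tNOUN\t3\tnsubj\n3\tbarks\tVERB\t0\troot\n"

def read_tokens(conllu):
    """Parse simplified five-column CoNLL-U lines into tuples."""
    rows = [line.split("\t") for line in conllu.strip().splitlines()]
    return [(int(i), form, upos, int(head), rel)
            for i, form, upos, head, rel in rows]
```

Real CoNLL-U files additionally carry lemma, morphological features, enhanced dependencies and miscellaneous fields, plus comment lines and multiword-token ranges.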
  •  
44.
  • de Marneffe, Marie-Catherine, et al. (författare)
  • Universal Stanford Dependencies : A Cross-Linguistic Typology
  • 2014
  • Ingår i: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC). - 9782951740884 ; , s. 4585-4592
  • Konferensbidrag (refereegranskat)abstract
    • Revisiting the now de facto standard Stanford dependency representation, we propose an improved taxonomy to capture grammatical relations across languages, including morphologically rich ones. We suggest a two-layered taxonomy: a set of broadly attested universal grammatical relations, to which language-specific relations can be added. We emphasize the lexicalist stance of the Stanford Dependencies, which leads to a particular, partially new treatment of compounding, prepositions, and morphology. We show how existing dependency schemes for several languages map onto the universal taxonomy proposed here and close with consideration of practical implications of dependency representation choices for NLP applications, in particular parsing.
  •  
46.
  • Dobrovoljc, Kaja, et al. (författare)
  • The Universal Dependencies Treebank of Spoken Slovenian
  • 2016
  • Ingår i: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). - 9782951740891 ; , s. 1566-1573
  • Konferensbidrag (refereegranskat)abstract
    • This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirms significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptic sentences, fewer and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality.
  •  
47.
  • Dubremetz, Marie, 1988- (författare)
  • Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora
  • 2017
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • This thesis deals with the detection of three rhetorical figures based on repetition of words: chiasmus (“Fair is foul, and foul is fair.”), epanaphora (“Poor old European Commission! Poor old European Council.”) and epiphora (“This house is mine. This car is mine. You are mine.”). For a computer, locating all repetitions of words is trivial, but locating just those repetitions that achieve a rhetorical effect is not. How can we make this distinction automatically? First, we propose a new definition of the problem. We observe that rhetorical figures are a graded phenomenon, with universally accepted prototypical cases, equally clear non-cases, and a broad range of borderline cases in between. This makes it natural to view the problem as a ranking task rather than a binary detection task. We therefore design a model for ranking candidate repetitions in terms of decreasing likelihood of having a rhetorical effect, which allows potential users to decide for themselves where to draw the line with respect to borderline cases. Second, we address the problem of collecting annotated data to train the ranking model. Thanks to a selective method of annotation, we can reduce the annotation work by three orders of magnitude for chiasmus, and by one order of magnitude for epanaphora and epiphora. In this way, we prove that it is feasible to develop a system for detecting the three figures without an insurmountable amount of human work. Finally, we propose an evaluation scheme and apply it to our models. The evaluation reveals that, even with a very incompletely annotated corpus, a system for repetitive figure detection can be trained to achieve reasonable accuracy. We investigate the impact of different linguistic features, including length, n-grams, part-of-speech tags, and syntactic roles, and find that different features are useful for different figures. We also apply the system to four different types of text: political discourse, fiction, titles of articles and novels, and quotations. Here the evaluation shows that the system is robust to shifts in genre and that the frequencies of the three rhetorical figures vary with genre.
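The candidate-generation step underlying the ranking approach can be sketched as below: enumerate every criss-cross repetition A … B … B … A inside a window. As the thesis stresses, this over-generates massively; the ranking model that orders candidates by likelihood of rhetorical effect is the actual contribution and is not reproduced here. The function name and window size are illustrative.

```python
def chiasmus_candidates(words, window=30):
    """Enumerate criss-cross repetitions A .. B .. B .. A (brute force)."""
    words = [w.lower().strip(".,!?") for w in words]
    found, n = [], len(words)
    for a1 in range(n):
        end = min(n, a1 + window)
        for b1 in range(a1 + 1, end):
            for b2 in range(b1 + 1, end):
                for a2 in range(b2 + 1, end):
                    # two distinct words, each repeated, in mirrored order
                    if (words[a1] == words[a2] and words[b1] == words[b2]
                            and words[a1] != words[b1]):
                        found.append((words[a1], words[b1]))
    return found
```

On the prototypical example, the criss-crossed pair ("fair", "foul") is among the candidates, alongside accidental ones such as those involving "is", which is exactly what the ranking model must demote.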
  •  
48.
  • Dubremetz, Marie, 1988-, et al. (författare)
  • Extraction of Nominal Multiword Expressions in French
  • 2014
  • Ingår i: Proceedings of the 10th Workshop on Multiword Expressions (MWE). - Gothenburg, Sweden : Association for Computational Linguistics. ; , s. 72-76
  • Konferensbidrag (refereegranskat)
  •  
50.
  • Dubremetz, Marie, 1988-, et al. (författare)
  • Rhetorical Figure Detection : Chiasmus, Epanaphora, Epiphora
  • 2018
  • Ingår i: Frontiers in Digital Humanities. - : Frontiers Media SA. - 2297-2668. ; 5:10
  • Tidskriftsartikel (refereegranskat)abstract
    • Rhetorical figures are valuable linguistic data for literary analysis. In this article, we target the detection of three rhetorical figures that belong to the family of repetitive figures: chiasmus (“I go where I please, and I please where I go.”), epanaphora, also called anaphora (“Poor old European Commission! Poor old European Council.”), and epiphora (“This house is mine. This car is mine. You are mine.”). Detecting repetition of words is easy for a computer, but detecting only the repetitions that produce a rhetorical effect is difficult because of the many accidental and irrelevant repetitions. For all figures, we train a log-linear classifier on a corpus of political debates. The corpus is only very partially annotated, but we nevertheless obtain good results, with more than 50% precision for all figures. We then apply our models to totally different genres and perform a comparative analysis by comparing corpora of fiction, science and quotes. Thanks to the automatic detection of rhetorical figures, we discover that chiasmus is more likely to appear in the scientific context whereas epanaphora and epiphora are more common in fiction.
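Candidate extraction for epanaphora, the simplest part of the pipeline described above, can be sketched as runs of consecutive sentences sharing their opening words; the trained classifier that separates rhetorical from accidental repetition is not reproduced. Parameter names and thresholds are illustrative.

```python
def epanaphora_candidates(sentences, min_prefix=2, min_run=2):
    """Return (start, end, prefix) for every run of >= min_run
    consecutive sentences opening with the same min_prefix words."""
    openings = [tuple(s.lower().split()[:min_prefix]) for s in sentences]
    runs, start = [], 0
    for i in range(1, len(openings) + 1):
        # close the current run at the end or when the opening changes
        if i == len(openings) or openings[i] != openings[start]:
            if i - start >= min_run:
                runs.append((start, i, " ".join(openings[start])))
            start = i
    return runs
```

On the article's example, the two "Poor old …" sentences form a single candidate run; a classifier would then score such runs as rhetorical or accidental.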
  •  
Typ av publikation
konferensbidrag (209)
tidskriftsartikel (49)
doktorsavhandling (24)
bokkapitel (18)
annan publikation (7)
proceedings (redaktörskap) (5)
licentiatavhandling (5)
bok (4)
samlingsverk (redaktörskap) (3)
rapport (2)
recension (2)
Typ av innehåll
refereegranskat (256)
övrigt vetenskapligt/konstnärligt (53)
populärvet., debatt m.m. (19)
Författare/redaktör
Nivre, Joakim (161)
Nivre, Joakim, 1962- (150)
Hall, Johan (41)
Nilsson, Jens (35)
Megyesi, Beata (19)
McDonald, Ryan (17)
Tiedemann, Jörg (14)
de Lhoneux, Miryam, ... (14)
Hardmeier, Christian (14)
Stymne, Sara, 1977- (13)
Nivre, Joakim, Profe ... (10)
Seraji, Mojgan (10)
Nilsson, Mattias (9)
Pettersson, Eva (9)
Ginter, Filip (9)
Kübler, Sandra (9)
Hajic, Jan (8)
Goldberg, Yoav (8)
Eryigit, Gülsen (8)
Pettersson, Eva, 197 ... (8)
Tsarfaty, Reut (8)
Shao, Yan, 1990- (8)
Kulmizev, Artur (7)
de Marneffe, Marie-C ... (7)
Dubremetz, Marie, 19 ... (7)
Megyesi, Beáta, 1971 ... (6)
Ballesteros, Miguel (6)
Stymne, Sara (6)
Manning, Christopher ... (6)
Marinov, Svetoslav (6)
Petrov, Slav (6)
Dürlich, Luise (5)
Karlgren, Jussi (5)
Basirat, Ali, 1982- (5)
Dahlqvist, Bengt (5)
Schuster, Sebastian (5)
Zeman, Daniel (5)
Saers, Markus (5)
Täckström, Oscar (5)
Wu, Dekai (5)
Allwood, Jens, 1947 (4)
Ahrenberg, Lars (4)
Kuhlmann, Marco (4)
Smith, Aaron (4)
Gómez-Rodríguez, Car ... (4)
Johansson, Richard (4)
Gogoulou, Evangelia (4)
Foster, Jennifer (4)
Habash, Nizar (4)
Seddah, Djamé (4)
Lärosäte
Uppsala universitet (264)
Linnéuniversitetet (55)
RISE (15)
Göteborgs universitet (9)
Linköpings universitet (6)
Lunds universitet (5)
Stockholms universitet (4)
Kungliga Tekniska Högskolan (3)
Luleå tekniska universitet (2)
Högskolan i Skövde (2)
Högskolan i Borås (1)
Karolinska Institutet (1)
Språk
Engelska (325)
Svenska (3)
Forskningsämne (UKÄ/SCB)
Naturvetenskap (263)
Humaniora (37)
Teknik (4)
Medicin och hälsovetenskap (1)
