SwePub
Search the SwePub database


Result list for search "WFRF:(Joakim Nivre) srt2:(2015-2019)"


  • Results 1-50 of 66
1.
  • Ballesteros, Miguel, et al. (authors)
  • MaltOptimizer : Fast and Effective Parser Optimization
  • 2016
  • In: Natural Language Engineering, ISSN 1351-3249, E-ISSN 1469-8110, 22:2, pp. 187-213
  • Journal article (peer-reviewed), abstract:
    • Statistical parsers often require careful parameter tuning and feature selection. This is a nontrivial task for application developers who are not interested in parsing for its own sake, and it can be time-consuming even for experienced researchers. In this paper we present MaltOptimizer, a tool developed to automatically explore parameters and features for MaltParser, a transition-based dependency parsing system that can be used to train parsers given treebank data. MaltParser provides a wide range of parameters for optimization, including nine different parsing algorithms, an expressive feature specification language that can be used to define arbitrarily rich feature models, and two machine learning libraries, each with their own parameters. MaltOptimizer is an interactive system that performs parser optimization in three stages. First, it performs an analysis of the training set in order to select a suitable starting point for optimization. Second, it selects the best parsing algorithm and tunes the parameters of this algorithm. Finally, it performs feature selection and tunes machine learning parameters. Experiments on a wide range of data sets show that MaltOptimizer quickly produces models that consistently outperform default settings and often approach the accuracy achieved through careful manual optimization.
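[Editor's note] The staged optimization described in this abstract (select an algorithm first, then tune features for that algorithm only) can be sketched generically. This is a minimal illustration, not MaltOptimizer's actual interface: the algorithm names, feature-model names, scores, and the `evaluate` stub are all invented, and stage 1 (analysing the training set) is omitted.

```python
# Hypothetical sketch of a staged search in the spirit of MaltOptimizer.
# evaluate() stands in for "train MaltParser with these settings, return LAS".

def evaluate(algorithm, features):
    scores = {  # invented scores for illustration
        ("nivre-eager", "basic"): 78.2,
        ("nivre-eager", "rich"): 80.1,
        ("stack-proj", "basic"): 77.5,
        ("stack-proj", "rich"): 79.0,
    }
    return scores[(algorithm, features)]

def staged_search(algorithms, feature_models):
    # Stage 2: pick the best parsing algorithm under a default feature model.
    best_alg = max(algorithms, key=lambda a: evaluate(a, "basic"))
    # Stage 3: tune the feature model for the chosen algorithm only,
    # avoiding a search over the full cross-product of settings.
    best_feats = max(feature_models, key=lambda f: evaluate(best_alg, f))
    return best_alg, best_feats, evaluate(best_alg, best_feats)

print(staged_search(["nivre-eager", "stack-proj"], ["basic", "rich"]))
# ('nivre-eager', 'rich', 80.1)
```

The point of staging is that each stage fixes one decision before the next begins, which keeps the number of training runs linear rather than multiplicative in the number of options.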
2.
  • Basirat, Ali, et al. (authors)
  • A statistical model for grammar mapping
  • 2016
  • In: Natural Language Engineering, Cambridge University Press, ISSN 1351-3249, E-ISSN 1469-8110, 22:2, pp. 215-255
  • Journal article (peer-reviewed), abstract:
    • The two main classes of grammars are (a) hand-crafted grammars, which are developed by language experts, and (b) data-driven grammars, which are extracted from annotated corpora. This paper introduces a statistical method for mapping the elementary structures of a data-driven grammar onto the elementary structures of a hand-crafted grammar in order to combine their advantages. The idea is employed in the context of Lexicalized Tree-Adjoining Grammars (LTAG) and tested on two LTAGs of English: the hand-crafted LTAG developed in the XTAG project, and the data-driven LTAG, which is automatically extracted from the Penn Treebank and used by the MICA parser. We propose a statistical model for mapping any elementary tree sequence of the MICA grammar onto a proper elementary tree sequence of the XTAG grammar. The model has been tested on three subsets of the WSJ corpus that have average lengths of 10, 16, and 18 words, respectively. The experimental results show that full-parse trees with average F1-scores of 72.49, 64.80, and 62.30 points could be built from 94.97%, 96.01%, and 90.25% of the XTAG elementary tree sequences assigned to the subsets, respectively. Moreover, by reducing the amount of syntactic lexical ambiguity of sentences, the proposed model significantly improves the efficiency of parsing in the XTAG system.
3.
  • Basirat, Ali, et al. (authors)
  • Greedy Universal Dependency Parsing with Right Singular Word Vectors
  • 2016
  • Conference paper (peer-reviewed), abstract:
    • A set of continuous feature vectors formed by right singular vectors of a transformed co-occurrence matrix is used with the Stanford neural dependency parser to train parsing models for a limited number of languages in the corpus of universal dependencies. We show that the feature vectors can help the parser to remain greedy and be as accurate as (or even more accurate than) some other greedy and non-greedy parsers.
5.
  • Basirat, Ali, 1982- (author)
  • Principal Word Vectors
  • 2018
  • Doctoral thesis (other academic/artistic), abstract:
    • Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequency of seeing words in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires a very large amount of memory and CPU time to be processed. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form. We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in the literature, enable the principal word embedding to train high-quality word embeddings in an efficient way. We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it. The principal word embeddings are evaluated in both intrinsic and extrinsic ways. The intrinsic evaluations are directed towards the study of the distribution of word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to some standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method is able to generate word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.
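[Editor's note] The core pipeline described in this thesis abstract (build a word-context frequency matrix, reshape its distribution, then take singular vectors) can be sketched at toy scale. This is a minimal sketch only: the corpus, the window size, the `log1p` transform (standing in for the thesis's adaptive transformation), and the use of exact SVD (the thesis uses a randomized SVD for scalability) are all illustrative choices, not the thesis's actual configuration.

```python
import numpy as np

# Toy corpus with a symmetric context window of 1.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Word-context co-occurrence frequency matrix.
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            C[idx[w], idx[corpus[j]]] += 1

# Reshape the skewed count distribution before PCA; a simple log
# transform stands in for the thesis's adaptive transformation.
X = np.log1p(C)
X -= X.mean(axis=0)  # centre the columns, as in PCA

# Word vectors from the left singular vectors of the transformed matrix,
# scaled by the singular values (top 4 principal directions kept).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :4] * S[:4]
print(word_vectors.shape)  # (7, 4)
```

At realistic vocabulary sizes the exact SVD above becomes the bottleneck, which is precisely the efficiency challenge the abstract says is addressed with a randomized SVD procedure.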
6.
  • Basirat, Ali, 1982-, et al. (authors)
  • Real-valued Syntactic Word Vectors (RSV) for Greedy Neural Dependency Parsing
  • 2017
  • Conference paper (peer-reviewed), abstract:
    • We show that a set of real-valued word vectors formed by right singular vectors of a transformed co-occurrence matrix are meaningful for determining different types of dependency relations between words. Our experimental results on the task of dependency parsing confirm the superiority of the word vectors to the other sets of word vectors generated by popular methods of word embedding. We also study the effect of using these vectors on the accuracy of dependency parsing in different languages versus using more complex parsing architectures.
9.
  • Cap, Fabienne, et al. (authors)
  • SWORD : Towards Cutting-Edge Swedish Word Processing
  • 2016
  • In: Proceedings of SLTC 2016.
  • Conference paper (peer-reviewed), abstract:
    • Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization, where we compare the output of six different tokenizers on four different text types. For one text type (Wikipedia articles), we also compare to the tokenization produced by six manual annotators.
12.
  • Constant, Matthieu, et al. (authors)
  • A Transition-Based System for Joint Lexical and Syntactic Analysis
  • 2016
  • In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 161-171
  • Conference paper (peer-reviewed), abstract:
    • We present a transition-based system that jointly predicts the syntactic structure and lexical units of a sentence by building two structures over the input words: a syntactic dependency tree and a forest of lexical units including multiword expressions (MWEs). This combined representation allows us to capture both the syntactic and semantic structure of MWEs, which in turn enables deeper downstream semantic analysis, especially for semi-compositional MWEs. The proposed system extends the arc-standard transition system for dependency parsing with transitions for building complex lexical units. Experiments on two different data sets show that the approach significantly improves MWE identification accuracy (and sometimes syntactic accuracy) compared to existing joint approaches.
13.
  • Csató, Éva Ágnes, 1948-, et al. (authors)
  • Parallel corpora and Universal Dependencies for Turkic
  • 2015
  • In: Turkic Languages, Wiesbaden, ISSN 1431-4983, 19:2, pp. 259-273
  • Journal article (peer-reviewed), abstract:
    • The first part of this paper presents ongoing work on Turkic parallel corpora at the Department of Linguistics and Philology, Uppsala University. Moreover, examples are given of how the Swedish-Turkish-English corpus is used in teaching Turkish and in comparative linguistic studies. The second part deals with the annotation scheme Universal Dependencies (UD) used in treebanks, and its application to Turkic languages.
14.
  • de Lhoneux, Miryam, 1990-, et al. (authors)
  • Arc-Hybrid Non-Projective Dependency Parsing with a Static-Dynamic Oracle
  • 2017
  • In: IWPT 2017: 15th International Conference on Parsing Technologies, Pisa, Italy: Association for Computational Linguistics, ISBN 9781945626739, pp. 99-104
  • Conference paper (peer-reviewed), abstract:
    • We extend the arc-hybrid transition system for dependency parsing with a SWAP transition that enables reordering of the words and construction of non-projective trees. Although this extension potentially breaks the arc-decomposability of the transition system, we show that the existing dynamic oracle can be modified and combined with a static oracle for the SWAP transition. Experiments on five languages with different degrees of non-projectivity show that the new system gives competitive accuracy and is significantly better than a system trained with a purely static oracle.
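[Editor's note] The transition mechanics behind this abstract can be illustrated with a minimal configuration stepper: arc-hybrid's SHIFT/LEFT-ARC/RIGHT-ARC plus the SWAP transition that moves the second stack item back to the buffer, enabling non-projective attachments. This sketch shows only the mechanics; it has no oracle, no scoring model, and no well-formedness checks, and the hand-picked transition sequence is illustrative.

```python
# Minimal arc-hybrid transition stepper with SWAP. Arcs are (head, dependent)
# pairs over word positions; no oracle or scoring, just configuration updates.

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))  # move the buffer front onto the stack

def left_arc(stack, buffer, arcs):
    # Front of the buffer becomes head of the stack top, which is popped.
    arcs.append((buffer[0], stack.pop()))

def right_arc(stack, buffer, arcs):
    # Second stack item becomes head of the stack top, which is popped.
    dep = stack.pop()
    arcs.append((stack[-1], dep))

def swap(stack, buffer, arcs):
    # Move the second stack item back to the buffer front; this reorders
    # the words and is what makes non-projective trees reachable.
    buffer.insert(0, stack.pop(-2))

# Hand-picked derivation over a 3-word sentence (positions 1..3):
stack, buffer, arcs = [], [1, 2, 3], []
for t in (shift, shift, swap, left_arc, shift, shift, right_arc):
    t(stack, buffer, arcs)
print(arcs)           # [(1, 2), (1, 3)] : word 1 heads words 2 and 3
print(stack, buffer)  # [1] []
```

Note how SWAP lets word 1 re-enter the buffer so that LEFT-ARC (which in arc-hybrid always uses the buffer front as head) can attach word 2 to it.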
15.
  • de Lhoneux, Miryam, 1990-, et al. (authors)
  • From raw text to Universal Dependencies : look, no tags!
  • 2017
  • In: Proceedings of the CoNLL 2017 Shared Task, Vancouver, Canada: Association for Computational Linguistics, ISBN 9781945626708, pp. 207-217
  • Conference paper (peer-reviewed), abstract:
    • We present the Uppsala submission to the CoNLL 2017 shared task on parsing from raw text to universal dependencies. Our system is a simple pipeline consisting of two components. The first performs joint word and sentence segmentation on raw text; the second predicts dependency trees from raw words. The parser bypasses the need for part-of-speech tagging, but uses word embeddings based on universal tag distributions. We achieved a macro-averaged LAS F1 of 65.11 in the official test run and obtained the 2nd best result for sentence segmentation with a score of 89.03. After fixing two bugs, we obtained an unofficial LAS F1 of 70.49.
16.
  • de Lhoneux, Miryam, 1990- (author)
  • Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages
  • 2019
  • Doctoral thesis (other academic/artistic), abstract:
    • This thesis presents several studies in neural dependency parsing for typologically diverse languages, using treebanks from Universal Dependencies (UD). The focus is on informing models with linguistic knowledge. We first extend a parser to work well on typologically diverse languages, including morphologically complex languages and languages whose treebanks have a high ratio of non-projective sentences, a notorious difficulty in dependency parsing. We propose a general methodology where we sample a representative subset of UD treebanks for parser development and evaluation. Our parser uses recurrent neural networks which construct information sequentially, and we study the incorporation of a recursive neural network layer in our parser. This follows the intuition that language is hierarchical. This layer turns out to be superfluous in our parser and we study its interaction with other parts of the network. We subsequently study transitivity and agreement information learned by our parser for auxiliary verb constructions (AVCs). We suggest that a parser should learn similar information about AVCs as it learns for finite main verbs. This is motivated by work in theoretical dependency grammar. Our parser learns different information about these two if we do not augment it with a recursive layer, but similar information if we do, indicating that there may be benefits from using that layer and we may not yet have found the best way to incorporate it in our parser. We finally investigate polyglot parsing. Training one model for multiple related languages leads to substantial improvements in parsing accuracy over a monolingual baseline. We also study different parameter sharing strategies for related and unrelated languages. Sharing parameters that partially abstract away from word order appears to be beneficial in both cases but sharing parameters that represent words and characters is more beneficial for related than unrelated languages.
17.
  • de Lhoneux, Miryam, 1990-, et al. (authors)
  • Old School vs. New School : Comparing Transition-Based Parsers with and without Neural Network Enhancement
  • 2017
  • In: Proceedings of the 15th Treebanks and Linguistic Theories Workshop (TLT), pp. 99-110
  • Conference paper (peer-reviewed), abstract:
    • In this paper, we attempt a comparison between "new school" transition-based parsers that use neural networks and their classical "old school" counterpart. We carry out experiments on treebanks from the Universal Dependencies project. To facilitate the comparison and analysis of results, we only work on a subset of those treebanks. However, we carefully select this subset in the hope to have results that are representative for the whole set of treebanks. We select two parsers that are hopefully representative of the two schools; MaltParser and UDPipe and we look at the impact of training size on the two models. We hypothesize that neural network enhanced models have a steeper learning curve with increased training size. We observe, however, that, contrary to expectations, neural network enhanced models need only a small amount of training data to outperform the classical models but the learning curves of both models increase at a similar pace after that. We carry out an error analysis on the development sets parsed by the two systems and observe that overall MaltParser suffers more than UDPipe from longer dependencies. We observe that MaltParser is only marginally better than UDPipe on a restricted set of short dependencies.
18.
  • de Lhoneux, Miryam, 1990-, et al. (authors)
  • Recursive Subtree Composition in LSTM-Based Dependency Parsing
  • 2019
  • In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Stroudsburg: Association for Computational Linguistics, ISBN 9781950737130, pp. 1566-1576
  • Conference paper (peer-reviewed), abstract:
    • The need for tree structure modelling on top of sequence modelling is an open issue in neural dependency parsing. We investigate the impact of adding a tree layer on top of a sequential model by recursively composing subtree representations (composition) in a transition-based parser that uses features extracted by a BiLSTM. Composition seems superfluous with such a model, suggesting that BiLSTMs capture information about subtrees. We perform model ablations to tease out the conditions under which composition helps. When ablating the backward LSTM, performance drops and composition does not recover much of the gap. When ablating the forward LSTM, performance drops less dramatically and composition recovers a substantial part of the gap, indicating that a forward LSTM and composition capture similar information. We take the backward LSTM to be related to lookahead features and the forward LSTM to the rich history-based features both crucial for transition-based parsers. To capture history-based information, composition is better than a forward LSTM on its own, but it is even better to have a forward LSTM as part of a BiLSTM. We correlate results with language properties, showing that the improved lookahead of a backward LSTM is especially important for head-final languages.
19.
  • de Lhoneux, Miryam, 1990-, et al. (authors)
  • Should Have, Would Have, Could Have : Investigating Verb Group Representations for Parsing with Universal Dependencies.
  • 2016
  • In: Proceedings of the Workshop on Multilingual and Crosslingual Methods in NLP, Stroudsburg: Association for Computational Linguistics (ACL), ISBN 9781941643877, pp. 10-19
  • Conference paper (peer-reviewed), abstract:
    • Treebanks have recently been released for a number of languages with the harmonized annotation created by the Universal Dependencies project. The representation of certain constructions in UD is known to be suboptimal for parsing and may be worth transforming for the purpose of parsing. In this paper, we focus on the representation of verb groups. Several studies have shown that parsing works better when auxiliaries are the head of auxiliary dependency relations, which is not the case in UD. We therefore transformed verb groups in UD treebanks, parsed the test set and transformed it back, and, contrary to expectations, observed significant decreases in accuracy. We provide suggestive evidence that improvements in previous studies were obtained because the transformation helps disambiguate POS tags of main verbs and auxiliaries. The question of why parsing accuracy decreases with this approach in the case of UD is left open.
21.
  • de Lhoneux, Miryam, 1990-, et al. (authors)
  • What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions?
  • 2019
  • In: CoRR, abs/1907.07950
  • Journal article (other academic/artistic), abstract:
    • This article is a linguistic investigation of a neural parser. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical work in dependency grammar and in particular the work of Tesnière (1959), where AVCs and FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus: it consists of at least two words, whereas an FMV is its non-dissociated counterpart, consisting of exactly one word. We suggest that the representation of AVCs and FMVs should capture similar information. We use diagnostic classifiers to probe agreement and transitivity information in vectors learned by a transition-based neural parser in four typologically different languages. We find that the parser learns different information about AVCs and FMVs if only sequential models (BiLSTMs) are used in the architecture but similar information when a recursive layer is used. We find explanations for why this is the case by looking closely at how information is learned in the network and looking at what happens with different dependency representations of AVCs.
22.
  • de Marneffe, Marie-Catherine, et al. (authors)
  • Dependency Grammar
  • 2019
  • In: Annual Review of Linguistics, Annual Reviews, ISSN 2333-9691, E-ISSN 2333-9683, 5, pp. 197-218
  • Journal article (peer-reviewed), abstract:
    • Dependency grammar is a descriptive and theoretical tradition in linguistics that can be traced back to antiquity. It has long been influential in the European linguistics tradition and has more recently become a mainstream approach to representing syntactic and semantic structure in natural language processing. In this review, we introduce the basic theoretical assumptions of dependency grammar and review some key aspects in which different dependency frameworks agree or disagree. We also discuss advantages and disadvantages of dependency representations and introduce Universal Dependencies, a framework for multilingual dependency-based morphosyntactic annotation that has been applied to more than 60 languages.
24.
  • Dobrovoljc, Kaja, et al. (authors)
  • The Universal Dependencies Treebank of Spoken Slovenian
  • 2016
  • In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), ISBN 9782951740891, pp. 1566-1573
  • Conference paper (peer-reviewed), abstract:
    • This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirms significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptic sentences, fewer and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality.
25.
  • Dubremetz, Marie, 1988- (author)
  • Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora
  • 2017
  • Doctoral thesis (other academic/artistic), abstract:
    • This thesis deals with the detection of three rhetorical figures based on repetition of words: chiasmus (“Fair is foul, and foul is fair.”), epanaphora (“Poor old European Commission! Poor old European Council.”) and epiphora (“This house is mine. This car is mine. You are mine.”). For a computer, locating all repetitions of words is trivial, but locating just those repetitions that achieve a rhetorical effect is not. How can we make this distinction automatically? First, we propose a new definition of the problem. We observe that rhetorical figures are a graded phenomenon, with universally accepted prototypical cases, equally clear non-cases, and a broad range of borderline cases in between. This makes it natural to view the problem as a ranking task rather than a binary detection task. We therefore design a model for ranking candidate repetitions in terms of decreasing likelihood of having a rhetorical effect, which allows potential users to decide for themselves where to draw the line with respect to borderline cases. Second, we address the problem of collecting annotated data to train the ranking model. Thanks to a selective method of annotation, we can reduce by three orders of magnitude the annotation work for chiasmus, and by one order of magnitude the work for epanaphora and epiphora. In this way, we prove that it is feasible to develop a system for detecting the three figures without an unsurmountable amount of human work. Finally, we propose an evaluation scheme and apply it to our models. The evaluation reveals that, even with a very incompletely annotated corpus, a system for repetitive figure detection can be trained to achieve reasonable accuracy. We investigate the impact of different linguistic features, including length, n-grams, part-of-speech tags, and syntactic roles, and find that different features are useful for different figures. 
We also apply the system to four different types of text: political discourse, fiction, titles of articles and novels, and quotations. Here the evaluation shows that the system is robust to shifts in genre and that the frequencies of the three rhetorical figures vary with genre.
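[Editor's note] The abstract notes that locating all repetitions of words is trivial while deciding which ones are rhetorical is not. The trivial half can be sketched as candidate extraction for chiasmus: every pair of word types repeated in criss-cross order (A ... B ... B ... A) within a token window is a candidate. The tokenisation, window size, and function name below are illustrative; the thesis's contribution is the ranking model applied on top of such candidates, which is not shown here.

```python
import re

def chiasmus_candidates(text, window=30):
    # Every criss-cross pattern w1 ... w2 ... w2 ... w1 within a token
    # window is a candidate. Ranking which candidates actually achieve a
    # rhetorical effect is the hard part the thesis addresses.
    tokens = re.findall(r"[a-z']+", text.lower())
    cands = []
    for i, a in enumerate(tokens):
        end = min(i + window, len(tokens))
        for j in range(i + 1, end):
            for k in range(j + 1, end):
                for l in range(k + 1, end):
                    if (tokens[l] == a and tokens[j] == tokens[k]
                            and a != tokens[j]):
                        cands.append((a, tokens[j], i, j, k, l))
    return cands

print(chiasmus_candidates("fair is foul and foul is fair"))
# [('fair', 'is', 0, 1, 5, 6), ('fair', 'foul', 0, 2, 4, 6), ('is', 'foul', 1, 2, 4, 5)]
```

Even this seven-word example yields three candidates, only one of which ("fair ... foul ... foul ... fair") a reader would call a prototypical chiasmus, which motivates treating detection as a ranking task over extracted candidates.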
27.
  • Dubremetz, Marie, 1988-, et al. (authors)
  • Rhetorical Figure Detection : Chiasmus, Epanaphora, Epiphora
  • 2018
  • In: Frontiers in Digital Humanities, Frontiers Media SA, ISSN 2297-2668, 5:10
  • Journal article (peer-reviewed), abstract:
    • Rhetorical figures are valuable linguistic data for literary analysis. In this article, we target the detection of three rhetorical figures that belong to the family of repetitive figures: chiasmus (“I go where I please, and I please where I go.”), epanaphora, also called anaphora (“Poor old European Commission! Poor old European Council.”), and epiphora (“This house is mine. This car is mine. You are mine.”). Detecting repetition of words is easy for a computer, but detecting only the repetitions that provoke a rhetorical effect is difficult because of many accidental and irrelevant repetitions. For all figures, we train a log-linear classifier on a corpus of political debates. The corpus is only very partially annotated, but we nevertheless obtain good results, with more than 50% precision for all figures. We then apply our models to totally different genres and perform a comparative analysis, by comparing corpora of fiction, science and quotes. Thanks to the automatic detection of rhetorical figures, we discover that chiasmus is more likely to appear in the scientific context whereas epanaphora and epiphora are more common in fiction.
34.
  • Kann, Viggo, 1964-, et al. (authors)
  • En rekommenderad svensk språkteknologisk terminologi [A recommended Swedish language technology terminology]
  • 2016
  • In: Proc. Sixth Swedish Language Technology Conference, Umeå: Svenska språkteknologitermgruppen.
  • Conference paper (peer-reviewed), abstract:
    • In 2014 the Swedish Language Technology Terminology Group was created, with representatives from different parts of the language technology community: higher education and research, industry, and governmental agencies. In 2016 we recommended Swedish terms for the 270 language technology concepts in the Bank of Finnish Terminology in Arts and Sciences. The language technology terms are published on folkets-lexikon.csc.kth.se/LTterminology, where anyone can look up Swedish and English terms interactively and read the full list of terms. We also try to enter the most important Swedish terminology into the Swedish Wikipedia. We encourage use of these Swedish terms and welcome suggestions for improvements of the Swedish terminology.
35.
  • Kulmizev, Artur, et al. (authors)
  • Deep Contextualized Word Embeddings in Transition-Based and Graph-Based Dependency Parsing – A Tale of Two Parsers Revisited
  • 2019
  • In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2755-2768
  • Conference paper (peer-reviewed), abstract:
    • Transition-based and graph-based dependency parsers have previously been shown to have complementary strengths and weaknesses: transition-based parsers exploit rich structural features but suffer from error propagation, while graph-based parsers benefit from global optimization but have restricted feature scope. In this paper, we show that, even though some details of the picture have changed after the switch to neural networks and continuous representations, the basic trade-off between rich features and global optimization remains essentially the same. Moreover, we show that deep contextualized word embeddings, which allow parsers to pack information about global sentence structure into local feature representations, benefit transition-based parsers more than graph-based parsers, making the two approaches virtually equivalent in terms of both accuracy and error profile. We argue that the reason is that these representations help prevent search errors and thereby allow transition-based parsers to better exploit their inherent strength of making accurate local decisions. We support this explanation by an error analysis of parsing experiments on 13 languages.
38.
  • Nivre, Joakim, 1962- (author)
  • Om datorer och språkförståelse [On computers and language understanding]
  • 2015
  • In: Årsbok 2015, Kungliga Vetenskaps-Societeten i Uppsala, pp. 75-82
  • Book chapter (other academic/artistic)
39.
  • Nivre, Joakim (author)
  • Towards a Universal Grammar for Natural Language Processing
  • 2015
  • In: Computational Linguistics and Intelligent Text Processing, Springer International Publishing, ISBN 9783319181110, 9783319181103, pp. 3-16
  • Conference paper (peer-reviewed), abstract:
    • Universal Dependencies is a recent initiative to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. In this paper, I outline the motivation behind the initiative and explain how the basic design principles follow from these requirements. I then discuss the different components of the annotation standard, including principles for word segmentation, morphological annotation, and syntactic annotation. I conclude with some thoughts on the challenges that lie ahead.
40.
  • Nivre, Joakim, 1962-, et al. (authors)
  • Universal Dependencies v1 : A Multilingual Treebank Collection
  • 2016
  • In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris: European Language Resources Association (ELRA), ISBN 9782951740891, pp. 1659-1666
  • Conference paper (peer-reviewed), abstract:
    • Cross-linguistically consistent annotation is necessary for sound comparative evaluation and cross-lingual learning experiments. It is also useful for multilingual system development and comparative linguistic studies. Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. In this paper, we describe v1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages.
41.
  • Nivre, Joakim, 1962-, et al. (författare)
  • Universal Dependency Evaluation
  • 2017
  • In: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017). - ISBN 9789176855010; pp. 86-95
  • Conference paper (peer-reviewed)
  •  
42.
  •  
43.
  • Pettersson, Eva, 1978-, et al. (authors)
  • Ranking Relevant Verb Phrases Extracted from Historical Text
  • 2015
  • In: Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities.
  • Conference paper (peer-reviewed). Abstract:
    • In this paper, we present three approaches to automatic ranking of relevant verb phrases extracted from historical text. These approaches are based on conditional probability, log likelihood ratio, and bag-of-words classification, respectively. The aim of the ranking in our study is to present verb phrases that have a high probability of describing work at the top of the results list, but the methods are likely to be applicable to other information needs as well. The results are evaluated using three different evaluation metrics: precision at k, R-precision, and average precision. In the best setting, 91 out of the top-100 instances in the list are true positives.
  •  
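The ranking metrics named in the abstract above (precision at k and average precision) are standard measures and can be sketched in a few lines. This is a generic illustration, not code from the paper; the function names are chosen here for clarity:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k items in the ranked list that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision taken at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy example: two of the four ranked verb phrases are truly relevant.
ranked = ["vp1", "vp2", "vp3", "vp4"]
relevant = {"vp1", "vp3"}
p_at_2 = precision_at_k(ranked, relevant, 2)   # 0.5
ap = average_precision(ranked, relevant)       # (1/1 + 2/3) / 2
```

A figure like "91 of the top-100 instances are true positives" corresponds to precision at k = 100 of 0.91 under these definitions.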
44.
  • Pettersson, Eva, 1978- (author)
  • Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction
  • 2016
  • Doctoral thesis (other academic/artistic). Abstract:
    • Historical text constitutes a rich source of information for historians and other researchers in humanities. Many texts are however not available in an electronic format, and even if they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user. An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial, considering the lack of linguistically annotated historical data combined with the high variability in historical text, making it hard to train NLP tools specifically aimed at analysing historical text. In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting.
  •  
45.
  •  
46.
  •  
47.
  •  
48.
  • Seraji, Mojgan (author)
  • Morphosyntactic Corpora and Tools for Persian
  • 2015
  • Doctoral thesis (other academic/artistic). Abstract:
    • This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian. In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible. Given these requirements, the thesis investigates two main research questions. The first is how can we develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse? The approach taken is to accept the tokenization variations in the corpora to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variations of writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them. Therefore, they should not be trained on idealized data. The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
  •  
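The F-scores reported in the thesis evaluation above are the standard balanced F1 measure. As a generic illustration (not the thesis's own evaluation code), F1 over predicted versus gold token spans can be computed like this:

```python
def f1_score(gold, predicted):
    """Balanced F1 over sets of items, e.g. (start, end) token spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy example: a tokenizer recovers two of three gold spans and predicts
# one wrong span, so precision = recall = 2/3 and F1 = 2/3.
gold_spans = {(0, 3), (4, 7), (8, 10)}
predicted_spans = {(0, 3), (4, 7), (8, 9)}
score = f1_score(gold_spans, predicted_spans)
```

An "F-score close to 100%" thus means that the predicted segmentation almost exactly matches the gold boundaries in both precision and recall.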
49.
  • Seraji, Mojgan, et al. (authors)
  • ParsPer : A Dependency Parser for Persian
  • 2015
  • In: Depling 2015. - Uppsala: Uppsala universitet. - ISBN 9789163789656; pp. 300-309
  • Conference paper (peer-reviewed). Abstract:
    • We present a dependency parser for Persian, called ParsPer, developed using the graph-based parser in the Mate Tools. The parser is trained on the entire Uppsala Persian Dependency Treebank with a specific configuration that was selected by MaltParser as the best performing parsing representation. The treebank’s syntactic annotation scheme is based on Stanford Typed Dependencies with extensions for Persian. The results of the ParsPer evaluation revealed a best labeled accuracy over 82% with an unlabeled accuracy close to 87%. The parser is freely available and released as an open source tool for parsing Persian.
  •  
50.
  • Seraji, Mojgan, et al. (authors)
  • Universal Dependencies for Persian
  • 2016
  • In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). - Paris: European Language Resources Association (ELRA). - ISBN 9782951740891; pp. 2361-2365
  • Conference paper (peer-reviewed). Abstract:
    • The Persian Universal Dependency Treebank (Persian UD) is a recent effort to treebank Persian with Universal Dependencies (UD), an ongoing project that designs unified and cross-linguistically valid grammatical representations, including part-of-speech tags, morphological features, and dependency relations. The Persian UD is a conversion of the Uppsala Persian Dependency Treebank (UPDT) to the Universal Dependencies framework and consists of nearly 6,000 sentences and 152,871 word tokens, with an average sentence length of 25 words. Beyond the universal syntactic annotation guidelines, the two treebanks also differ in tokenization: all words containing unsegmented clitics (pronominal and copula clitics), which are annotated with complex labels in the UPDT, have been separated from the clitics and appear with distinct labels in the Persian UD. The original syntactic annotation scheme of the UPDT is based on Stanford Typed Dependencies. In this paper, we present the approaches taken in the development of the Persian UD.
  •  
Publication type
conference papers (48)
journal articles (7)
doctoral theses (6)
books (2)
book chapters (2)
edited collections (1)
Type of content
peer-reviewed (55)
other academic/artistic (11)
Author/editor
Nivre, Joakim, 1962- (44)
Nivre, Joakim (18)
de Lhoneux, Miryam, ... (13)
Stymne, Sara, 1977- (6)
Ginter, Filip (6)
Basirat, Ali, 1982- (4)
Hardmeier, Christian (4)
Schuster, Sebastian (4)
Hajic, Jan (4)
Tiedemann, Jörg (3)
Ahrenberg, Lars (3)
Östling, Robert (3)
Smith, Aaron (3)
Cap, Fabienne (3)
Kulmizev, Artur (2)
Nivre, Joakim, Profe ... (2)
Kann, Viggo (2)
Karlgren, Jussi (2)
Megyesi, Beáta, 1971 ... (2)
Ballesteros, Miguel (2)
Basirat, Ali (2)
Wirén, Mats (2)
Potthast, Martin (2)
Stymne, Sara (2)
Nilsson, Henrik (1)
Adesam, Yvonne, 1975 (1)
Borin, Lars, 1957 (1)
Bouma, Gerlof, 1979 (1)
Forsberg, Markus, 19 ... (1)
Dobrovoljc, Kaja (1)
Forsberg, Markus (1)
Kurfali, Murathan (1)
Björkelund, Anders (1)
Borin, Lars (1)
Karlsson, Ola (1)
Stella, Antonio (1)
Faili, Heshaam (1)
Schütze, Hinrich (1)
Bohnet, Bernd (1)
Kann, Viggo, 1964- (1)
Joakim, Nivre (1)
Domeij, Rickard (1)
Rehm, Georg (1)
Marheinecke, Katrin (1)
Bouma, Gosse (1)
Haug, Dag (1)
Solberg, Per Erik (1)
Øvrelid, Lilja (1)
Romary, Laurent (1)
Oepen, Stephan (1)
Institution
Uppsala universitet (64)
Stockholms universitet (2)
Göteborgs universitet (1)
Kungliga Tekniska Högskolan (1)
Language
English (65)
Swedish (1)
Research subject (UKÄ/SCB)
Natural sciences (58)
Humanities (11)
Engineering and technology (2)