UD-MULTIGENRE: A UD-Based Dataset Enriched with Instance-Level Genre Annotations
- Danilova, Vera (author)
- Uppsala universitet, Institutionen för lingvistik och filologi, Datorlingvistik
- Stymne, Sara, 1977- (author)
- Uppsala universitet, Institutionen för lingvistik och filologi
- Association for Computational Linguistics, 2023
- English.
In: Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL). Association for Computational Linguistics. ISBN 9798891760561, pp. 253-267
- Related links:
- https://doi.org/10.1...
- https://aclanthology...
- https://uu.diva-port... (primary) (Raw object)
- https://urn.kb.se/re...
Abstract
- Prior research on the impact of genre on cross-lingual dependency parsing has suggested that genre is an important signal. However, these studies suffer from a scarcity of reliable data for multiple genres and languages. While Universal Dependencies (UD), the only available large-scale resource for cross-lingual dependency parsing, contains data from diverse genres, the documentation of genre labels is missing, and there are multiple inconsistencies. This makes studies of the impact of genres difficult to design. To address this, we present a new dataset, UD-MULTIGENRE, where 17 genres are defined and instance-level annotations of these are applied to a subset of UD data, covering 38 languages. It provides a rich ground for research related to text genre from a multilingual perspective. Utilizing this dataset, we can overcome the data shortage that hindered previous research and reproduce experiments from earlier studies with an improved setup. We revisit a previous study that used genre-based clusters and show that the clusters for most target genres provide a mix of genres. We compare training data selection based on clustering and gold genre labels and provide an analysis of the results. The dataset is publicly available at https://github.com/UppsalaNLP/UD-MULTIGENRE.
Subject headings
- NATURAL SCIENCES -- Computer and Information Sciences -- Language Technology (hsv//eng)
Keywords
- Dependency parsing
- genres
- corpora
- Computational Linguistics
Publication and content type
- ref (subject category)
- kon (subject category)