1.
- Danilova, Vera, et al.
(author)
UD-MULTIGENRE: A UD-Based Dataset Enriched with Instance-Level Genre Annotations
- 2023
In: Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL). Association for Computational Linguistics. ISBN 9798891760561, pp. 253-267
Conference paper (peer-reviewed). Abstract:
- Prior research on the impact of genre on cross-lingual dependency parsing has suggested that genre is an important signal. However, these studies suffer from a scarcity of reliable data for multiple genres and languages. While Universal Dependencies (UD), the only available large-scale resource for cross-lingual dependency parsing, contains data from diverse genres, the documentation of genre labels is missing, and there are multiple inconsistencies. This makes studies of the impact of genres difficult to design. To address this, we present a new dataset, UD-MULTIGENRE, where 17 genres are defined and instance-level annotations of these are applied to a subset of UD data, covering 38 languages. It provides a rich ground for research related to text genre from a multilingual perspective. Utilizing this dataset, we can overcome the data shortage that hindered previous research and reproduce experiments from earlier studies with an improved setup. We revisit a previous study that used genre-based clusters and show that the clusters for most target genres provide a mix of genres. We compare training data selection based on clustering and gold genre labels and provide an analysis of the results. The dataset is publicly available at https://github.com/UppsalaNLP/UD-MULTIGENRE.