Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing that covers the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium that aims to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization in which we compare the output of six tokenizers on four text types. For one text type (Wikipedia articles), we also compare the tokenizer output with the tokenization produced by six human annotators.
Subject headings
HUMANITIES -- Languages and Literature -- General Language Studies and Linguistics (hsv//eng)
NATURAL SCIENCES -- Computer and Information Sciences -- Language Technology (hsv//eng)
HUMANITIES -- Languages and Literature -- Specific Languages (hsv//eng)