Sökning: id:"swepub:oai:DiVA.org:uu-400834" >
Code and Data for “...
Code and Data for “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters”
-
- Dahllöf, Mats, 1965- (författare)
- Uppsala universitet,Institutionen för lingvistik och filologi
-
(creator_code:org_t)
- 2020
- Engelska.
- Relaterad länk:
-
https://uu.diva-port...
-
visa fler...
-
https://urn.kb.se/re...
-
visa färre...
Abstract
Ämnesord
Stäng
- Code and data for the article Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters (to appear in DHN2020 Digital Humanities in the Nordic Countries}, Riga, 17--20 March 2020).The study based on this code and dataset is a comparative exploration of different classification tasks for Swedish medieval charters (transcriptions from the SDHK collection) and different classifier setups. In particular, we explore the identification of the issuer, place of issue, and decade of production. The experiments used features based on lowercased words and character 3- and 4-grams. We evaluated the performance of two learning algorithms: linear discriminant analysis and decision trees. For evaluation, five-fold cross-validation was performed. We report accuracy and macro-averaged F1 score. The validation made use of six labeled subsets of SDHK combining the three tasks with Old Swedish and Latin. Issuer identification for the Latin dataset (595 charters from 12 issuers) reached the highest scores, above 0.9, for the decision tree classifier using word features. The best corresponding accuracy for Old Swedish was 0.81. Place and decade identification produced lower performance scores for both languages. Which classifier design is the best one seems to depend on peculiarities of the dataset and the classification task. The present study does however support the idea that text classification is useful also for medieval documents characterized by extreme spelling variation.
Ämnesord
- NATURVETENSKAP -- Data- och informationsvetenskap -- Språkteknologi (hsv//swe)
- NATURAL SCIENCES -- Computer and Information Sciences -- Language Technology (hsv//eng)
Nyckelord
- Historia
- History
- Datorlingvistik
- Computational Linguistics
- Scandinavian Languages
- Nordiska språk
- Latin
- Latin
Publikations- och innehållstyp
- ovr (ämneskategori)