A Web Corpus for eCare : Collection, Lay Annotation and Learning - First Results
- Santini, Marina, 1960- (author)
- RISE, SICS, RISE SICS East Linköping, Sweden
- Jönsson, Arne, 1955- (author)
- Linköpings universitet, RISE, SICS, Linköping University, Sweden, Interaktiva och kognitiva system, Filosofiska fakulteten, RISE SICS East Linköping, Sweden
- Nyström, Mikael, 1977- (author)
- Linköpings universitet, RISE, SICS, Linköping University, Sweden, Institutionen för medicinsk teknik, Tekniska fakulteten
- Alireza, Marjan (author)
- Örebro University, Örebro, Sweden
- 2017-09-24
- 2017
- English.
- Related links:
- https://annals-csis....
- https://urn.kb.se/re...
- https://doi.org/10.1...
- https://urn.kb.se/re...
Abstract
- In this position paper, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without affecting performance. To support our claims, we describe the design, construction and limitations of a very specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents written in Swedish. The corpus contains documents about chronic diseases. The sublanguage used in each document has been labelled as “lay” or “specialized” by a lay annotator. The corpus is designed as a flexible text resource, where additional medical documents will be appended over time. Experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when the size of the datasets to be learned is increased from 156 to 801 documents. Experiment 2 shows that lay-specialized labels can be learned despite the large number of disturbing factors, such as machine-translated documents or low-quality texts, which are numerous in the corpus.
Subject headings
- NATURAL SCIENCES -- Computer and Information Sciences -- Language Technology (hsv//eng)
Publication and content type
- ref (subject category)
- kon (subject category)