SwePub

A Web Corpus for eCare : Collection, Lay Annotation and Learning - First Results

Santini, Marina, 1960- (author)
RISE, SICS, RISE SICS East Linköping, Sweden
Jönsson, Arne, 1955- (author)
Linköpings universitet, RISE, SICS, Linköping University, Sweden, Interaktiva och kognitiva system, Filosofiska fakulteten, RISE SICS East Linköping, Sweden
Nyström, Mikael, 1977- (author)
Linköpings universitet, RISE, SICS, Linköping University, Sweden, Institutionen för medicinsk teknik, Tekniska fakulteten
Alireza, Marjan (author)
Örebro University, Örebro, Sweden
2017-09-24
2017
English.
  • Conference paper (peer-reviewed)
Abstract
  • In this position paper, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without affecting performance. To support our claims, we describe the design, construction and limitations of a highly specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents about chronic diseases, written in Swedish. The sublanguage used in each document has been labelled as “lay” or “specialized” by a lay annotator. The corpus is designed as a flexible text resource to which additional medical documents will be appended over time. Experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when the size of the datasets to be learned is increased from 156 to 801 documents. Experiment 2 shows that lay-specialized labels can be learned despite the large number of disturbing factors in the corpus, such as machine-translated documents or low-quality texts.

Subject terms

NATURAL SCIENCES -- Computer and Information Sciences -- Language Technology (hsv//eng)

Publication and content type

ref (subject category)
kon (subject category)
