SwePub

Towards a Model of General Text Complexity for Swedish

Falkenjack, Johan, 1986- (author)
Linköpings universitet,Interaktiva och kognitiva system,Tekniska fakulteten,NLPLAB
Jönsson, Arne, 1955- (thesis advisor)
Linköpings universitet,Interaktiva och kognitiva system,Tekniska fakulteten
Östling, Robert, Ph.D. (opponent)
Stockholms Universitet, Stockholm, Sweden
ISBN 9789176851555
Linköping : Linköping University Electronic Press, 2018
English, 103 pp.
  • Licentiate thesis (other academic/artistic)
Abstract
In an increasingly networked world, where the amount of written information is growing at a rate never before seen, the ability to read and absorb written information is of utmost importance for anything but a superficial understanding of life's complexities. That is an example of a sentence which is not very easy to read. It can be said to have a relatively high degree of text complexity. Nevertheless, the sentence is also true. It is important to be able to read and understand written materials. While not everyone might have a job where they have to read a lot, access to written material is necessary in order to participate in modern society. Most information, from news reporting, to medical information, to governmental information, comes primarily in a written form.

But what makes the sentence at the start of this abstract so complex? We can probably all agree that the length is part of it. But then what? Researchers in the field of readability and text complexity analysis have been studying this question for almost 100 years. That research has over time come to include many computational and data-driven methods within the field of computational linguistics.

This thesis covers some of my contributions to this field of research, though with a main focus on Swedish rather than English text. It aims to explore two primary questions: (1) Which linguistic features are most important when assessing text complexity in Swedish? and (2) How can we deal with the problem of data sparsity with regard to complexity-annotated texts in Swedish?

The first issue is tackled by exploring the task of identifying easy-to-read ("lättläst") text using classification with Support Vector Machines. A large set of linguistic features is evaluated with regard to predictive performance and is shown to separate easy-to-read texts from regular texts with very high accuracy. Meanwhile, using a genetic algorithm for variable selection, we find that almost the same accuracy can be reached with only 8 features. This implies that this classification problem is not very hard and that results might not generalize to comparing less easy-to-read texts.

This, in turn, brings us to the second question. Except for easy-to-read labeled texts, the data with text complexity annotations is very sparse. It consists of multiple small corpora using different scales to label documents. To deal with this problem, we propose a novel statistical model. The model belongs to the larger family of Probit models and is implemented in a Bayesian fashion and estimated using a Gibbs sampler based on extending a well-established Gibbs sampler for the Ordered Probit model. This model is evaluated using both simulated and real-world readability data with very promising results.

Subject headings

NATURAL SCIENCES  -- Computer and Information Sciences -- Computer Sciences (hsv//eng)
NATURAL SCIENCES  -- Computer and Information Sciences -- Language Technology (hsv//eng)
NATURAL SCIENCES  -- Mathematics -- Probability Theory and Statistics (hsv//eng)
HUMANITIES  -- Languages and Literature -- Specific Languages (hsv//eng)

Publication and Content Type

vet (subject category)
lic (subject category)
