SwePub

Result list for search "WFRF:(Golub Koraljka) srt2:(2005-2009)"

  • Result 1-10 of 31
1.
  • Ardö, Anders, et al. (author)
  • Deliverable D7.2 : focused crawler software package
  • 2007
  • Other publication (other academic/artistic)
    • The focused crawler in ALVIS is based on the Combine system, which is an open source system for crawling Internet resources. This deliverable is the software package; the text here describes the software, packaging and distribution of the focused crawler. It provides instructions for how to download, install, test and use the Combine system for focused crawling. Evaluation of performance and scalability is described. Finally, many details about the software structure and configuration are provided.
2.
  • Ardö, Anders, et al. (author)
  • Focused crawler software package
  • 2007
  • Reports (other academic/artistic)
    • The focused crawler in ALVIS is based on the Combine system, which is an open source system for crawling Internet resources. This deliverable is the software package; the text here describes the software, packaging and distribution of the focused crawler. It provides instructions for how to download, install, test and use the Combine system for focused crawling. Evaluation of performance and scalability is described. Finally, many details about the software structure and configuration are provided.
3.
  •  
4.
  • Golub, Koraljka, et al. (author)
  • Automated classification of textual documents based on a controlled vocabulary in engineering
  • 2007
  • In: Knowledge organization. - : Ergon-Verlag. - 0943-7444. ; 34:4, pp. 247-263
  • Journal article (peer-reviewed)
    • Automated subject classification has been a challenging research issue for many years now, receiving particular attention in the past decade due to rapid increase of digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents--instead it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against title and abstract of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
5.
  •  
6.
  • Golub, Koraljka, et al. (author)
  • Automated classification of Web pages in hierarchical browsing
  • 2009
  • In: Journal of Documentation. - : Emerald Group Publishing Limited. - 0022-0418 .- 1758-7379. ; 65:6, pp. 901-925
  • Journal article (peer-reviewed)
    • Purpose - The purpose of this study is twofold: to investigate whether it is meaningful to use the Engineering Index (Ei) classification scheme for browsing, and then, if proven useful, to investigate the performance of an automated classification algorithm based on the Ei classification scheme. Design/methodology/approach - A user study was conducted in which users solved four controlled searching tasks. The users browsed the Ei classification scheme in order to examine the suitability of the classification system for browsing. The classification algorithm was evaluated by the users, who judged the correctness of the automatically assigned classes. Findings - The study showed that the Ei classification scheme is suited for browsing. Automatically assigned classes were on average partly correct, with some classes working better than others. Success of browsing proved to be correlated with and dependent on classification correctness. Research limitations/implications - Further research should address problems of disparate evaluations of one and the same web page. Additional reasons behind browsing failures in the Ei classification scheme also need further investigation. Practical implications - Improvements for browsing were identified: describing class captions and/or listing their subclasses from the start; allowing for searching for words from class captions with synonym search (easily provided for Ei since the classes are mapped to thesaurus terms); when searching for class captions, returning the hierarchical tree expanded around the class in whose caption the search term is found. The need for improvements of classification schemes was also indicated. Originality/value - A user-based evaluation of automated subject classification in the context of browsing has not been conducted before; hence the study also presents new findings concerning methodology.
7.
  • Golub, Koraljka (author)
  • Automated Subject Classification of Textual Documents in the Context of Web-Based Hierarchical Browsing
  • 2007
  • Doctoral thesis (other academic/artistic)
    • With the exponential growth of the World Wide Web, automated subject classification has become a major research issue. Organizing web pages into a hierarchical structure for subject browsing has been gaining more recognition as an important tool in information-seeking processes. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if they are similar enough to the former. In the thesis, a string-matching algorithm based on a controlled vocabulary was explored. It does not require training documents, but instead reuses the intellectual work invested into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against text of documents to be classified. Plain string-matching was enhanced in several ways, including term weighting with cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The final results were comparable to those of state-of-the-art machine-learning algorithms, especially for particular classes. Concerning web pages, it was indicated that all the structural information and metadata available in web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important. In the context of browsing, the biggest difference between three approaches to automated classification (machine learning, information retrieval, library science) is whether they use controlled vocabularies. It has been claimed that well-structured, high-quality classification schemes, such as those used predominantly in library science approaches, could serve as good browsing structures. In the thesis it was shown that Dewey Decimal Classification and Engineering Information classification scheme are suitable for the task. Moreover, a log analysis of a large web-based service using Dewey Decimal Classification demonstrated that browsing is used to a much larger degree than searching. The final conclusion is that an appropriate controlled vocabulary, with a large number of entry vocabulary designating classes, could be utilised in automated classification. If the same controlled vocabulary has an appropriate hierarchical structure, it could at the same time provide a good browsing structure to the automatically classified collection of documents.
8.
  • Golub, Koraljka (author)
  • Automated subject classification of textual web documents
  • 2006
  • In: Journal of Documentation. - : Emerald Group Publishing Limited. - 0022-0418 .- 1758-7379. ; 62:3, pp. 350-371
  • Journal article (peer-reviewed)
    • Purpose - To provide an integrated perspective on similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and point to problems with the approaches and automated classification as such. Design/methodology/approach - A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages. Findings - Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics is common to all the approaches; major differences are in applied algorithms, employment or not of the vector space model and of controlled vocabularies. Problems of automated classification are recognized. Research limitations/implications - The paper does not attempt to provide an exhaustive bibliography of related resources. Practical implications - As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community have the information on how similar tasks are conducted in different communities. Originality/value - To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.
9.
  • Golub, Koraljka (author)
  • Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations
  • 2006
  • In: New Review of Hypermedia and Multimedia. - : Informa UK Limited. - 1361-4568 .- 1740-7842. ; 12:1, pp. 11-27
  • Journal article (peer-reviewed)
    • The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
10.
  •  
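
Several abstracts in this result list (results 4, 7 and 9) describe classification by matching terms from a controlled vocabulary against document text, with term weights and cut-offs. The following is a minimal Python sketch of that general idea, not the authors' implementation. The term list, class codes, weights, title boost and cut-off value are all hypothetical and are not taken from the Engineering Information thesaurus or from the papers.

```python
import re
from collections import defaultdict

# Hypothetical term list: (term, class code, weight).
# Invented for illustration; not taken from the Ei thesaurus.
TERM_LIST = [
    ("heat transfer", "C1", 3.0),
    ("heat exchanger", "C1", 2.0),
    ("fluid flow", "C2", 3.0),
    ("turbulence", "C2", 2.0),
    ("flow", "C2", 0.5),
]


def classify(title, abstract, term_list=TERM_LIST, title_boost=2.0, cutoff=2.0):
    """Sum the weights of matched terms per class; keep classes scoring >= cutoff."""
    scores = defaultdict(float)
    # Assumption: matches in the title count more than matches in the abstract.
    fields = [(title.lower(), title_boost), (abstract.lower(), 1.0)]
    for term, cls, weight in term_list:
        # Whole-word (or whole-phrase) match of the vocabulary term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        for text, boost in fields:
            hits = len(pattern.findall(text))
            scores[cls] += hits * weight * boost
    return {cls: score for cls, score in scores.items() if score >= cutoff}


def precision_recall(assigned, correct):
    """Precision and recall of a set of assigned classes against the correct set."""
    true_positives = len(assigned & correct)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


if __name__ == "__main__":
    result = classify(
        "Heat transfer in pipes",
        "We study heat transfer and fluid flow in a heat exchanger.",
    )
    print(result)
    print(precision_recall(set(result), {"C1", "C3"}))
```

Raising the cut-off in a sketch like this trades recall for precision, which is one way to read the separate best-recall and best-precision figures reported in result 4.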