SwePub

Result list for search "LAR1:lu ;lar1:(lnu);pers:(Golub Koraljka)"

  • Results 1-10 of 12
1.
  • Golub, Koraljka (author)
  • Automated Subject Classification of Textual Documents in the Context of Web-Based Hierarchical Browsing
  • 2007
  • Doctoral thesis (other academic/artistic); abstract:
    • With the exponential growth of the World Wide Web, automated subject classification has become a major research issue. Organizing web pages into a hierarchical structure for subject browsing has been gaining more recognition as an important tool in information-seeking processes. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if they are similar enough to the former. In the thesis, a string-matching algorithm based on a controlled vocabulary was explored. It does not require training documents, but instead reuses the intellectual work invested into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against text of documents to be classified. Plain string-matching was enhanced in several ways, including term weighting with cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The final results were comparable to those of state-of-the-art machine-learning algorithms, especially for particular classes. Concerning web pages, it was indicated that all the structural information and metadata available in web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important. In the context of browsing, the biggest difference between three approaches to automated classification (machine learning, information retrieval, library science) is whether they use controlled vocabularies. It has been claimed that well-structured, high-quality classification schemes, such as those used predominantly in library science approaches, could serve as good browsing structures. In the thesis it was shown that Dewey Decimal Classification and Engineering Information classification scheme are suitable for the task. Moreover, a log analysis of a large web-based service using Dewey Decimal Classification demonstrated that browsing is used to a much larger degree than searching. The final conclusion is that an appropriate controlled vocabulary, with a large number of entry vocabulary designating classes, could be utilised in automated classification. If the same controlled vocabulary has an appropriate hierarchical structure, it could at the same time provide a good browsing structure to the automatically classified collection of documents.
2.
  • Golub, Koraljka (author)
  • Automated subject classification of textual web documents
  • 2006
  • In: Journal of Documentation. - : Emerald Group Publishing Limited. - 0022-0418 .- 1758-7379. ; 62:3, pp. 350-371
  • Journal article (peer-reviewed); abstract:
    • Purpose: To provide an integrated perspective to similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and point to problems with the approaches and automated classification as such. Design/methodology/approach: A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages. Findings: Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics is common to all the approaches; major differences are in applied algorithms, employment or not of the vector space model and of controlled vocabularies. Problems of automated classification are recognized. Research limitations/implications: The paper does not attempt to provide an exhaustive bibliography of related resources. Practical implications: As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community have the information on how similar tasks are conducted in different communities. Originality/value: To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.
3.
  • Golub, Koraljka (author)
  • Automated subject classification of textual Web pages, based on a controlled vocabulary : challenges and recommendations
  • 2006
  • In: New Review of Hypermedia and Multimedia. - : Informa UK Limited. - 1361-4568 .- 1740-7842. ; 12:1, pp. 11-27
  • Journal article (peer-reviewed); abstract:
    • The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
4.
  • Golub, Koraljka, et al. (authors)
  • Automatic Classification of Swedish Metadata Using Dewey Decimal Classification : A Comparison of Approaches
  • 2020
  • In: Journal of Data and Information Science. - : Walter de Gruyter GmbH. - 2096-157X .- 2543-683X. ; 5:1, pp. 18-38
  • Journal article (peer-reviewed); abstract:
    • With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC. State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records which had to be reduced to top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels). Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine, but reach close results, with the benefit of a smaller representation size. Impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often less than 100 per class) on which to train are available, and these hold only for top 3 hierarchical levels (803 instead of 14,413 classes). Having to reduce the number of hierarchical levels to top three levels of DDC because of the lack of training data for all classes skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems. In conclusion, for operative information retrieval systems applying purely automatic DDC does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using a string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance accuracy of automatic classification which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with quality of human decisions at the final stage should be the way for the future. The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.
5.
  • Golub, Koraljka, et al. (authors)
  • Automatic classification using DDC on the Swedish Union Catalogue
  • 2018
  • In: Proceedings of the 18th European Networked Knowledge Organization Systems (NKOS 2018) Workshop, Porto, Portugal, September 13, 2018. - : CEUR-WS.org. ; 2200, pp. 4-16
  • Conference paper (peer-reviewed); abstract:
    • With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of two machine learning algorithms for Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, evaluation shows that Support Vector Machine with linear kernel outperforms Multinomial Naïve Bayes algorithm. Also, using keywords or combining titles and keywords gives better results than using only titles as input. The class imbalance where many DDC classes only have few records greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few records on which to train are available. Proposed future research involves an exploration of the intellectual effort put into creating the DDC to further improve the algorithm performance as commonly applied in string matching, and to test the best approach on new digital collections that do not have DDC assigned.
6.
  • Golub, Koraljka, et al. (authors)
  • Comparing and combining two approaches to automated subject classification of text
  • 2006
  • In: Research and advanced technology for digital libraries. - Berlin, Heidelberg : Springer. - 9783540446361 - 9783540446385 ; 4172, pp. 467-470
  • Conference paper (peer-reviewed); abstract:
    • A machine-learning and a string-matching approach to automated subject classification of text were compared, as to their performance, advantages and downsides. The former approach was based on an SVM algorithm, while the latter comprised string-matching between a controlled vocabulary and words in the text to be classified. Data collection consisted of a subset from Compendex, classified into six different classes. It was shown that SVM on average outperforms the string-matching approach: our hypothesis that SVM yields better recall and string-matching better precision was confirmed only on one of the classes. The two approaches being complementary, we investigated different combinations of the two based on combining their vocabularies. The results have shown that the original approaches, i.e. machine-learning approach without using background knowledge from the controlled vocabulary, and string-matching approach based on controlled vocabulary, outperform approaches in which combinations of automatically and manually obtained terms were used. Reasons for these results need further investigation, including a larger data collection and combining the two using predictions.
7.
  • Golub, Koraljka, et al. (authors)
  • Digital humanities in Sweden and its infrastructure : Status quo and the sine qua non
  • 2020
  • In: Digital Scholarship in the Humanities. - Oxford : Oxford University Press (OUP). - 2055-7671 .- 2055-768X. ; 35:3, pp. 547-556
  • Journal article (peer-reviewed); abstract:
    • The article offers a state-of-the-art overview of a number of Digital Humanities (DH) initiatives that have emerged in Sweden over the past decade. We identify two major developments that seem to be taking place within DH, with a specific focus on the infrastructural aspects of the development: (1) a strive to open up and broaden the research output and (2) multi-disciplinary collaboration and its effects. The two major components accentuate the new infrastructural patterns that are developing and the challenges these infer on universities. While current research is at large multi-disciplinary, developing infrastructures also enable the move towards post-disciplinarity, bringing the universities closer to the surrounding society. At five universities in Sweden, individual-sited infrastructures supporting DH research have been built today. They are complemented by national and international infrastructures, thus supporting developments and tackling some of the major challenges. In the article, the relations between individual disciplines, the question of multi- and post-disciplinarity, and the field of Digital Humanities are discussed, while stressing the factors necessary—sine qua non—for a fruitful development of the scholarly infrastructures.
8.
  • Koch, Traugott, et al. (authors)
  • Browsing and searching behavior in the Renardus Web service: a study based on log analysis
  • 2004
  • In: Proceedings of the Fourth ACM/IEEE Joint Conference on Digital Libraries. - : ACM Press. - 1581138326 ; p. 378
  • Conference paper (peer-reviewed); abstract:
    • Renardus is a distributed Web-based service, which provides integrated searching and browsing access to quality-controlled Web resources. With the overall purpose of improving Renardus, the research aims to study: the detailed usage patterns (quantitative/qualitative, paths through the system); the balance between browsing and searching or mixed activities; typical sequences of usage steps and transition probabilities in a session; typical entry points, referring sites, points of failure and exit points; and the usage degree of the browsing support features.
9.
  •  
10.
  • Koch, Traugott, et al. (authors)
  • Users browsing behaviour in a DDC-based Web service : a log analysis
  • 2006
  • In: Cataloging & Classification Quarterly. - : Taylor & Francis. - 0163-9374 .- 1544-4554. ; 42:3-4, pp. 163-186
  • Journal article (peer-reviewed); abstract:
    • This study explores the navigation behaviour of all users of a large web service, Renardus, using web log analysis. Renardus provides integrated searching and browsing access to quality-controlled web resources from major individual subject gateway services. The main navigation feature is subject browsing through the Dewey Decimal Classification (DDC) based on mapping of classes of resources from the distributed gateways to the DDC structure. Among the more surprising results are the hugely dominant share of browsing activities, the good use of browsing support features like the graphical fish-eye overviews, rather long and varied navigation sequences, as well as extensive hierarchical directory-style browsing through the large DDC system.