SwePub

Result list for search "hsv:(NATURVETENSKAP) hsv:(Data och informationsvetenskap) ;pers:(Tiedemann Jörg)"


  • Results 1-10 of 120
1.
  • Tiedemann, Jörg (author)
  • Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing
  • 2014
  • In: Computational Linguistics and Intelligent Text Processing, CICLing 2014, Part I. - 9783642549052 - 9783642549069 ; pp. 102-112
  • Conference paper (peer-reviewed)
    • The inability of reliable text extraction from arbitrary documents is often an obstacle for large scale NLP based on resources crawled from the Web. One of the largest problems in the conversion of PDF documents is the detection of the boundaries of common textual units such as paragraphs, sentences and words. PDF is a file format optimized for printing and encapsulates a complete description of the layout of a document including text, fonts, graphics and so on. This paper describes a tool for extracting texts from arbitrary PDF files for the support of large-scale data-driven natural language processing. Our approach combines the benefits of several existing solutions for the conversion of PDF documents to plain text and adds a language-independent post-processing procedure that cleans the output for further linguistic processing. In particular, we use the PDF-rendering libraries pdfXtk, Apache Tika and Poppler in various configurations. From the output of these tools we recover proper boundaries using on-the-fly language models and language-independent extraction heuristics. In our research, we looked especially at publications from the European Union, which constitute a valuable multilingual resource, for example, for training statistical machine translation models. We use our tool for the conversion of a large multilingual database crawled from the EU bookshop with the aim of building parallel corpora. Our experiments show that our conversion software is capable of fixing various common issues leading to cleaner data sets in the end.
2.
  •  
3.
  • Ahrenberg, Lars and Merkel, Magnus and Ridings, Daniel and Sågvall Hein, Anna and Tiedemann, Jörg (authors)
  • Automatic processing of parallel corpora: A Swedish perspective.
  • 1999
  • Report (other academic/artistic)
    • As empirical methods have come to the fore in language technology and translation studies, the processing of parallel texts and parallel corpora have become a major issue. In this article we review the state of the art in alignment and data extraction tec
4.
  • Ahrenberg, Lars, 1948-, et al. (authors)
  • Automatic Processing of Parallel Corpora: A Swedish Perspective
  • 1999
  • Report (other academic/artistic)
    • As empirical methods have come to the fore in multilingual language technology and translation studies, the processing of parallel texts and parallel corpora have become a major research area in computational linguistics. In this article we review the state of the art in alignment and data extraction techniques for parallel texts, and give an overview of current work in Sweden in this area. In a final section, we summarize the results achieved so far and make some proposals for future research.
5.
  •  
6.
  • Ahrenberg, Lars, 1948-, et al. (authors)
  • Evaluation of word alignment systems
  • 2000
  • In: Proceedings of the Second International Conference on Linguistic Resources and Evaluation (LREC-2000). - Paris, France : European Language Resources Association (ELRA) ; pp. 1255-1261
  • Conference paper (peer-reviewed)
7.
  •  
8.
  • Bjerva, Johannes, et al. (authors)
  • What Do Language Representations Really Represent?
  • 2019
  • In: Computational Linguistics - Association for Computational Linguistics (Print). - : MIT Press - Journals. - 0891-2017 .- 1530-9312 ; 45:2, pp. 381-389
  • Journal article (other academic/artistic)
    • A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, whereas genetic relationships—a convenient benchmark used for evaluation in previous work—appears to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.
9.
  •  
10.
  •  
Publication type
conference paper (80)
book chapter (14)
journal article (8)
report (5)
other publication (4)
doctoral thesis (4)
edited collection (editorship) (1)
book (1)
proceedings (editorship) (1)
licentiate thesis (1)
review (1)
Content type
peer-reviewed (91)
other academic/artistic (29)
Author/editor
Hardmeier, Christian (18)
Plas, Lonneke van de ... (15)
Nivre, Joakim (11)
Mur, Jori (9)
Bouma, Gosse (8)
Noord, Gertjan van (8)
Sågvall Hein, Anna (7)
Stymne, Sara, 1977- (5)
Östling, Robert, 198 ... (5)
Stymne, Sara (5)
Nivre, Joakim, 1962- (4)
Fahmi, Ismail (4)
Forsbom, Eva (3)
Smith, Aaron (3)
Pettersson, Eva (3)
Nabende, Peter (2)
Nivre, Joakim, Profe ... (2)
Agić, Zeljko (2)
Dalianis, Hercules (2)
Ahrenberg, Lars, 194 ... (2)
Merkel, Magnus (2)
Guillou, Liane (2)
Ginter, Filip (2)
Kloosterman, Geert (2)
Merkler, Danijela (1)
Krek, Simon (1)
Dobrovoljc, Kaja (1)
Moze, Sara (1)
Ahrenberg, Lars and ... (1)
Olsson, Leif-Jöran (1)
Ahrenberg, Lars (1)
Merkel, Magnus, 1959 ... (1)
Ridings, Daniel (1)
Almqvist, Ingrid (1)
Östling, Robert (1)
Forsbom, Eva, 1964- (1)
Bollmann, Marcel (1)
Augenstein, Isabelle (1)
Prashant, Mathur (1)
Bertels, Ann (1)
Fairon, Cédrick (1)
Verlinde, Serge (1)
Loáiciga, Sharid (1)
Bjerva, Johannes (1)
Han Veiga, Maria (1)
Oepen, Stephan (1)
Callin, Jimmy (1)
Schleussner, Sebasti ... (1)
Cap, Fabienne (1)
University
Uppsala universitet (111)
Stockholms universitet (6)
Linköpings universitet (3)
Kungliga Tekniska Högskolan (1)
Language
English (119)
French (1)
Research subject (UKÄ/SCB)
Natural sciences (120)
Medical and health sciences (1)

