SwePub

Hit list for search "WFRF:(Semmar Nasredine)"

  • Results 1-6 of 6
1.
  • Adouane, Wafia, 1985, et al. (author)
  • A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts
  • 2018
  • In: Proceedings of the Second Workshop on Subword and Character Level Models in NLP (SCLeM), June 6, 2018 New Orleans, Louisiana. - New Orleans, Louisiana USA. - 9781948087186
  • Conference paper (peer-reviewed)
    • This paper examines the effect of including background knowledge, in the form of a pre-trained character-level neural language model (LM), and of data bootstrapping to overcome the problem of unbalanced, limited resources. As a test, we explore the task of language identification in mixed-language short non-edited texts with an under-resourced language, namely the case of Algerian Arabic, for which both labelled and unlabelled data are limited. We compare the performance of two traditional machine learning methods and a deep neural network (DNN) model. The results show that, overall, DNNs perform better on labelled data for the majority categories and struggle with the minority ones. While the effect of the untokenised and unlabelled data encoded as an LM differs for each category, bootstrapping improves the performance of all systems and all categories. These methods are language independent and could be generalised to other under-resourced languages for which a small labelled dataset and a larger unlabelled dataset are available.
2.
  • Adouane, Wafia, 1985, et al. (author)
  • Arabicized and Romanized Berber Automatic Identification
  • 2016
  • In: Proceedings of TICAM 2016. - Morocco : IRCAM.
  • Conference paper (peer-reviewed)
    • We present an automatic language identification tool for both Arabicized Berber (Berber written in the Arabic script) and Romanized Berber (Berber written in the Latin script). The focus is on short texts (social media content). We use a supervised machine learning method with character-based and word-based n-gram models as features. We also describe the corpora used in this paper. For both Arabicized and Romanized Berber, character-based 5-grams score best, giving an F-score of 99.50%.
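
As a rough illustration of character n-gram features for language identification, here is a minimal sketch. It is not the authors' tool: it swaps the supervised classifier described in the abstract for a simple cosine-similarity comparison of n-gram count profiles, and the function names (`char_ngrams`, `train_profiles`, `identify`) and the English/French toy sentences are assumptions made up for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=5):
    """Return the overlapping character n-grams of `text` (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=5):
    """Build one n-gram count profile per label from (text, label) pairs."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles, n=5):
    """Return the label whose profile is most similar to the n-grams of `text`."""
    query = Counter(char_ngrams(text, n))
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Toy data invented for this sketch; a real system needs a labelled corpus.
toy_samples = [
    ("the cat is on the table and the dog is in the garden", "en"),
    ("le chat est sur la table et le chien est dans le jardin", "fr"),
]
toy_profiles = train_profiles(toy_samples, n=3)
print(identify("the dog is on the table", toy_profiles, n=3))
print(identify("le chien est sur la table", toy_profiles, n=3))
```

The toy example uses trigrams because the training text is tiny; with a realistic corpus, the n-gram length is a parameter to tune, and the abstract reports that 5-grams worked best on the authors' data.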
3.
  •  
4.
  • Adouane, Wafia, 1985, et al. (author)
  • Automatic Detection of Arabicized Berber and Arabic Varieties
  • 2016
  • In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects; 63–72; December 12; Osaka, Japan.
  • Conference paper (peer-reviewed)
    • Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is the necessary first step for any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and the state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, there are many languages which are not yet automatically processed, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facilitated the emergence of a written format for these spoken languages, based on pronunciation. These languages are not well represented on the Web and are commonly referred to as under-resourced languages, and the currently available ALI tools fail to properly recognize them. In this paper, we revisit the problem of ALI with a focus on Arabicized Berber and dialectal Arabic short texts. We introduce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited for detecting Arabicized Berber and different Arabic varieties and distinguishing between them, giving a macro-average F-score of 92.94%.
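
For readers unfamiliar with the macro-average F-score reported above, the following is a minimal sketch of how such a score is commonly computed: the F1 score is calculated per class and the per-class scores are averaged with equal weight. The function name `macro_f1` and the toy labels are assumptions for illustration; the abstract does not specify the authors' exact evaluation code.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores over all labels seen in either list."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels invented for this sketch.
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
print(round(macro_f1(gold, pred), 4))
```

Because every class counts equally, a small class that is often misclassified lowers the macro average as much as a large one would.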
5.
  • Adouane, Wafia, 1985, et al. (author)
  • Romanized Arabic and Berber Detection Using PPM and Dictionary Methods
  • 2017
  • In: 13th ACS/IEEE International Conference on Computer Systems and Applications AICCSA 2016. - Morocco. - 2161-5322. - 9781509043200
  • Conference paper (peer-reviewed)
    • Arabic is one of the Semitic languages written in Arabic script in its standard form. However, the recent rise of social media and new technologies has contributed considerably to the emergence of a new form of Arabic, namely Arabic written in Latin script, often called Romanized Arabic or Arabizi. While Romanized Arabic is an informal language, Berber or Tamazight uses Latin script in its standard form, with some orthographic differences depending on the country it is used in. Both these languages are under-resourced and unknown to the state-of-the-art language identifiers. In this paper, we present an automatic language identifier for both Romanized Arabic and Romanized Berber. We also describe the linguistic resources we built (a large dataset and lexicons), which include a wide range of Arabic dialects (Algerian, Egyptian, Gulf, Iraqi, Levantine, Moroccan and Tunisian dialects) as well as the most popular Berber varieties (Kabyle, Tashelhit, Tarifit, Tachawit and Tamzabit). We use the Prediction by Partial Matching (PPM) and dictionary-based methods. The methods reach a macro-average F-Measure of 98.74% and 97.60% respectively.
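
The dictionary-based approach mentioned above can be pictured with a minimal sketch: count how many tokens of the input appear in each language's lexicon and pick the language with the most hits. This is only an assumed, simplified reading of a dictionary method; the abstract does not describe the authors' actual matching rules. The function name `dictionary_identify` and the tiny English/French lexicons are hypothetical stand-ins for the real Arabic and Berber resources.

```python
def dictionary_identify(text, lexicons, unknown="unknown"):
    """Return the label whose lexicon covers the most tokens of `text`.

    `lexicons` maps a language label to a set of lowercase word forms.
    Returns `unknown` when no token matches any lexicon.
    """
    tokens = text.lower().split()
    scores = {
        label: sum(token in words for token in tokens)
        for label, words in lexicons.items()
    }
    if not scores:
        return unknown
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else unknown


# Hypothetical toy lexicons, invented for this sketch.
toy_lexicons = {
    "en": {"the", "dog", "is", "in", "garden"},
    "fr": {"le", "chien", "est", "dans", "jardin"},
}
print(dictionary_identify("the dog is in the garden", toy_lexicons))
print(dictionary_identify("le chien est dans le jardin", toy_lexicons))
print(dictionary_identify("xyzzy", toy_lexicons))
```

A lookup of this kind needs no training, but its coverage depends entirely on the lexicons, which is one reason such methods are often combined with statistical models.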
6.
  • Adouane, Wafia, 1985, et al. (author)
  • Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning
  • 2016
  • In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects; 53–61; December 12, 2016 ; Osaka, Japan. - : Association for Computational Linguistics. - 0736-587X.
  • Conference paper (peer-reviewed)
    • Identifying the language of a text or speech input is the first step towards any language-dependent natural language processing. The task is called Automatic Language Identification (ALI). The field has been well studied since the early 1960s, and various methods have been applied to many standard languages. Standard ALI methods require datasets for training and use character-based or word-based n-gram models. However, social media and new technologies have contributed to the rise of informal and minority languages on the Web. The state-of-the-art automatic language identifiers fail to properly identify many of them. Romanized Arabic (RA) and Romanized Berber (RB) are examples of these informal, under-resourced languages. The goal of this paper is twofold: to detect RA and RB, at the document level, as separate languages, and to distinguish between them, as they coexist in North Africa. We treat the task as a classification problem and use supervised machine learning to solve it. For both languages, character-based 5-grams combined with additional lexicons score best, with F-scores of 99.75% and 97.77% for RB and RA respectively.
Publication type
conference paper (6)
Content type
peer-reviewed (6)
Author/editor
Adouane, Wafia, 1985 (6)
Semmar, Nasredine (6)
Johansson, Richard, ... (5)
Bernardy, Jean-Phili ... (1)
Dobnik, Simon, 1977 (1)
Bobicev, Victoria (1)
Higher education institution
Göteborgs universitet (6)
Language
English (6)
Research subject (UKÄ/SCB)
Natural sciences (6)
Humanities (3)