    •   Maseno IR Home
    • Journal Articles
    • School of Computing and informatics
    • Department of Computer science
    • View Item

    Building Text and Speech Datasets for Low Resourced Languages: A Case of Languages in East Africa

    View/Open
    building_text_and_speech_datas.pdf (430.2Kb)
    Publication Date
    2022
    Author
    Claire Babirye, Joyce Nakatumba-Nabende, Andrew Katumba, Ronald Ogwang, Jeremy Tusubira Francis, Jonathan Mukiibi, Medadi Ssentanda, Lilian D Wanzare, Davis David
    Abstract/Overview
    Africa has over 2000 languages; however, those languages are not well represented in the existing Natural Language Processing ecosystem. African languages lack the essential digital resources needed to engage effectively with advancing language technologies. This growing gap has motivated researchers to build resources for African languages so that Natural Language Processing methods can be transferred to them. This paper discusses the process we took to create, curate and annotate text and speech datasets for low resourced languages in East Africa. It focuses on five languages: Luganda, Runyankore-Rukiga, Acholi, and Lumasaaba, which are mainly spoken in Uganda, and Kiswahili, which is widely spoken across East Africa. We ran baseline machine translation models on the English-Luganda dataset in the parallel text corpora and Automatic Speech Recognition (ASR) models on the Luganda speech dataset. We recorded a Bilingual Evaluation Understudy (BLEU) score of 37 for the English-Luganda model and a BLEU score of 36.8 for the Luganda-English model. For the ASR experiments, we obtained a Word Error Rate (WER) of 33%.
    Keywords: Speech, Text, Luganda, Common Voice, ASR, Swahili
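For readers unfamiliar with the WER metric reported in the abstract, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer's output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; they are not taken from the paper, its datasets, or its evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only; not the paper's evaluation script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up English pair (not from the Luganda dataset):
# one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.2%}")
```

Under this definition, a WER of 33% means roughly one word-level error for every three reference words. Because insertions are counted, the value can exceed 100% for very poor transcripts.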
    Permalink
    https://repository.maseno.ac.ke/handle/123456789/5278
    Collections
    • Department of Computer science [62]

    Maseno University. All rights reserved | Copyright © 2022 