Restoring tone-marks in standard Yorùbá electronic text: improved model

Asahiah, F. O.; Odejobi, O. A.; Adagunodo, E. R.

doi:10.7494/csci.2017.18.3.2128

Abstract

Diacritic restoration is a necessity in processing languages with Latin-based scripts that use letters outside the basic Latin alphabet used by English. Yorùbá is one such language, marking an underdot (dot-below) on three characters and tone marks on all seven vowels and two syllabic nasals. The problem of restoring underdotted characters has been fairly well addressed using the character as the linguistic unit for restoration. However, the existing character-based and word-based approaches have not sufficiently addressed the restoration of tone marks in Yorùbá. In this study, we address tone-mark restoration as a subset of diacritic restoration. We propose using the syllable (derived from the word) as the linguistic token for tone-mark restoration. In our experimental setup, we used Yorùbá text collected from various sources, with a total count of 250,336 words. On syllabification, these words yielded 464,274 syllables. The syllables were divided into training and testing data in different proportions, ranging from 99% for training and 1% for testing to 70% for training and 30% for testing. The aim of evaluating different proportions was to determine how the ratio of training to test data affects the variation in the results. We applied memory-based learning to train the models. We also set up a similar experiment using character tokens in order to compare performance. The results showed that using syllables increased word-level accuracy to 96.23%, an average improvement of almost 15% over that obtained using characters. We also found that using 75% of the data for training and the remaining 25% for testing gives the results with the least variation in a ten-fold cross-validation test. Hybridizing this syllable-based method with other methods, such as lexicon lookup, is likely to improve on the current result.
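The abstract describes restoring tone marks to text that lacks them and scoring the result at word level. The following is a minimal Python sketch of those two surrounding steps only, not the authors' implementation: it produces tone-stripped input while preserving the underdot, and it computes a word-level accuracy. The function names `strip_tone_marks` and `word_accuracy` are hypothetical, and the choice of grave, acute, and macron as the tone-mark set is an assumption about how the text is encoded. The memory-based learner and the syllabifier are not shown.

```python
import unicodedata

# Assumption: tone marks are encoded as combining grave (low), acute (high),
# and macron (optional mid). The combining dot below (U+0323) is deliberately
# left out so that underdotted letters survive.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_tone_marks(text: str) -> str:
    """Remove tone marks but keep underdots, e.g. 'Yorùbá' -> 'Yoruba'."""
    # NFD splits each letter into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes base + dot below into a single code point where one exists.
    return unicodedata.normalize("NFC", kept)


def word_accuracy(reference: str, restored: str) -> float:
    """Fraction of whitespace-separated words that match the reference exactly."""
    ref_words = unicodedata.normalize("NFC", reference).split()
    out_words = unicodedata.normalize("NFC", restored).split()
    if len(ref_words) != len(out_words):
        raise ValueError("reference and restored text must have the same word count")
    if not ref_words:
        return 0.0
    correct = sum(r == o for r, o in zip(ref_words, out_words))
    return correct / len(ref_words)
```

Under these assumptions, a restoration system would take the output of `strip_tone_marks` as input, and its output would be compared with the fully marked original using `word_accuracy`. Normalizing both sides to NFC before comparison avoids counting a word as wrong merely because the same letter was encoded in composed and decomposed forms.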

Keywords

diacritic restoration, syllables, characters, word-level accuracy

Publisher

Wydawnictwa AGH

Journal

Computer Science

Year

2017

Volume

Vol. 18 (3)

Pages

301–315

Physical description

Bibliography: 24 items; figures, charts, tables.

Contributors

author

Asahiah F. O.

sobusola@oauife.edu.ng

Obafemi Awolowo University, Ile-Ife, Nigeria, Department of Computer Science and Engineering

author

Odejobi O. A.

oodejobi@oauife.edu.ng

Obafemi Awolowo University, Ile-Ife, Nigeria, Department of Computer Science and Engineering

author

Adagunodo E. R.

eadagun@oauife.edu.ng

Obafemi Awolowo University, Ile-Ife, Nigeria, Department of Computer Science and Engineering

Notes

Record prepared with funds from the Polish Ministry of Science and Higher Education (MNiSW) under agreement 812/P-DUN/2016 for science dissemination activities (2017 tasks).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-5fd2f5e9-b261-4202-8a70-9ac8835afbff