A Deep Learning Approach for the Romanized Tunisian Dialect Identification

Authors Jihene Younes¹, Hadhemi Achour¹, Emna Souissi², and Ahmed Ferchichi¹
¹Université de Tunis, ISGT, Tunisia
²Université de Tunis, ENSIT, Tunisia

Keywords #Tunisian dialect #language identification #deep learning #BLSTM #CRF #natural language processing

Abstract Language identification is an important task in natural language processing that consists of determining the language of a given text. It has increasingly piqued the interest of researchers in the past few years, especially for informal, code-switched textual content. This paper focuses on the identification of the Romanized user-generated Tunisian dialect on the social web. We segmented and annotated a corpus extracted from social media and propose a deep learning approach for the identification task. A Bidirectional Long Short-Term Memory neural network with Conditional Random Fields decoding (BLSTM-CRF) is used. For word embeddings, we use a combination of word-character BLSTM vector representations and FastText embeddings, which take character n-gram features into consideration. The overall accuracy obtained is 98.65%.
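To make the role of character n-gram features more concrete, the following is a minimal Python sketch of FastText-style subword extraction. It is not the authors' implementation. The function names `char_ngrams` and `ngram_overlap` are hypothetical, and the two input tokens are illustrative spelling variants chosen for this example, not items from the paper's corpus. The `<` and `>` boundary markers and the default n-gram lengths of 3 to 6 follow the usual FastText convention; the sketch omits the whole-word token that FastText also stores.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, FastText-style."""
    # Boundary markers let the model tell prefixes and suffixes apart.
    marked = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def ngram_overlap(a, b, min_n=3, max_n=6):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a = char_ngrams(a, min_n, max_n)
    grams_b = char_ngrams(b, min_n, max_n)
    union = grams_a | grams_b
    if not union:  # both words too short to yield any n-gram
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Illustrative (assumed) spelling variants of one Romanized token.
print(sorted(char_ngrams("barcha", min_n=3, max_n=3)))
# → ['<ba', 'arc', 'bar', 'cha', 'ha>', 'rch']
print(ngram_overlap("barcha", "barsha") > 0)
# → True
```

Because Romanized dialect writing has no standard orthography, two spellings of one word may never appear as the same vocabulary item, yet they still share several n-grams. An embedding built from n-grams can therefore place such variants close together, which is one plausible reason for combining it with a character-level BLSTM representation.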

The International Arab Journal of Information Technology, Vol. 17, No. 6, November 2020

Jihene Younes is a PhD student at the ISGT, University of Tunis, Tunisia. She received her Master’s in Computer Science from the ENSIT, University of Tunis, Tunisia. Her current research interests include the automatic processing of the Tunisian dialect.

Hadhemi Achour is an Assistant Professor teaching Computer Science at the ISGT, University of Tunis, Tunisia. She received her PhD in Computer Science from the University of Paris 7 in France. Her doctoral research was conducted at France’s National Scientific Research Centre (CNRS). Her main research interests are related to Text Mining, Natural Language Processing and their applications, including Arabic and Tunisian dialect language processing. She has participated in several European projects and in ALECSO-coordinated studies and research projects.

Emna Souissi is an Assistant Professor teaching Computer Science at the ENSIT, University of Tunis, Tunisia. She holds a PhD in Computer Science from the University of Paris 7, France. Her research interests are mainly related to natural language processing and its applications, with a focus on Arabic NLP. Her PhD research was conducted within the CNRS. In this context, she participated in several European and Canadian projects. She is currently conducting research on the processing of Arabic dialects, mainly Tunisian.

Ahmed Ferchichi has been a professor of computer science since 1980. He holds a PhD in computer science from Joseph Fourier University of Grenoble. His research interests include teaching programming and software engineering, modeling training curricula and educational systems, achieving sustainable development goals through the use of information technology and artificial intelligence, and promoting information technology culture in Arabic. He taught at the University of Tunis from 1980 to 2011, where he directed the academic affairs of the ISGT from 2000 to 2003. Since 2012, he has taught at the University of Jendouba. In 2018, he was a member of the national commission for the supervision of computer science study programs.