The International Arab Journal of Information Technology (IAJIT), Vol. 17, No. 3, May 2020


Issues of Dialectal Saudi Twitter Corpus

Meshrif Alruily

Text mining research relies heavily on the availability of a suitable corpus. This paper presents a dialectal Saudi corpus that contains 207,452 tweets generated by Saudi Twitter users. In addition, the Saudi tweets dataset is compared with an Egyptian Twitter corpus and an Arabic top news raw corpus (representing Modern Standard Arabic, MSA) in various aspects, such as the differences between formal and colloquial texts. Moreover, the issues and phenomena found in this type of dataset are investigated, including shortening, concatenation, colloquial language, compounding, foreign language, spelling errors and neologisms.

Meshrif Alruily is an assistant professor in the Department of Computer and Information Sciences at Jouf University, Saudi Arabia. He received his PhD in Computer Science from De Montfort University, UK, in 2012. He has published conference papers and journal articles, including papers at the European Conference on Artificial Intelligence (ECAI) and in the Information Processing & Management journal. His research interests lie in Arabic text mining, including information extraction, summarization, text classification and clustering, and data analysis.