The International Arab Journal of Information Technology (IAJIT)


Issues of Dialectal Saudi Twitter Corpus

Meshrif Alruily

Text mining research relies heavily on the availability of a suitable corpus. This paper presents a dialectal Saudi corpus that contains 207,452 tweets generated by Saudi Twitter users. In addition, the Saudi tweets dataset was compared with an Egyptian Twitter corpus and an Arabic top news raw corpus, which represents Modern Standard Arabic (MSA), in various aspects, such as the differences between formal and colloquial texts. Moreover, the issues and phenomena that occur in this type of dataset, such as shortening, concatenation, colloquial language, compounding, foreign language, spelling errors and neologisms, were investigated.

Meshrif Alruily is an Assistant Professor in the Department of Computer and Information Sciences at Jouf University, Saudi Arabia. He received his PhD in Computer Science from De Montfort University, UK, in 2012. He has published many conference papers and journal articles, including papers in the European Conference on Artificial Intelligence (ECAI) and the journal Information Processing & Management. His research interests are in the field of Arabic text mining, including information extraction, summarization, text classification and clustering, and data analysis.