The International Arab Journal of Information Technology (IAJIT), Vol. 17, No. 3, May 2020



Direct Text Classifier for Thematic Arabic Discourse Documents

Maintaining topical coherence while writing a discourse is a major challenge for novice and experienced writers alike. The challenge is even greater for Arabic discourse because of the complex morphology of the Arabic language and its abundance of synonyms. In this research, we present a direct classifier for Arabic discourse documents that operates while the document is being written. The proposed prescriptive framework consists of the following stages: data collection, pre-processing, construction of a Language Model (LM), topic identification, topic classification, and topic notification. To demonstrate the framework, we designed a system and applied it to a corpus of 2800 Arabic discourse documents synthesized into four predefined topics: Culture, Economy, Sport, and Religion. System performance was analysed in terms of accuracy, recall, precision, and F-measure. The results show that the proposed topic-modeling-based decision framework can classify topics while a discourse is being written, with an accuracy of 91.0%.
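The stages named in the abstract can be illustrated with a small sketch. The abstract does not specify the form of the language model, so the code below assumes a per-topic unigram model with add-one smoothing and uniform topic priors; it is not the authors' implementation. The names `UnigramTopicClassifier`, `topic_drifted`, and `precision_recall_f1`, the whitespace tokenizer, and the four toy training sentences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization. Real Arabic pre-processing
    # (normalization, stop-word removal, stemming) is not shown here.
    return text.split()


class UnigramTopicClassifier:
    """One unigram language model per topic, with add-one smoothing."""

    def __init__(self):
        self.counts = {}   # topic -> Counter of token frequencies
        self.totals = {}   # topic -> total number of tokens seen
        self.vocab = set()

    def fit(self, docs, labels):
        for text, topic in zip(docs, labels):
            tokens = tokenize(text)
            self.counts.setdefault(topic, Counter()).update(tokens)
            self.vocab.update(tokens)
        self.totals = {t: sum(c.values()) for t, c in self.counts.items()}
        return self

    def log_likelihood(self, tokens, topic):
        # log P(tokens | topic) under the smoothed unigram model.
        v = len(self.vocab)
        c, n = self.counts[topic], self.totals[topic]
        return sum(math.log((c[w] + 1) / (n + v)) for w in tokens)

    def predict(self, text):
        # Uniform topic priors are assumed, so the best likelihood wins.
        tokens = tokenize(text)
        return max(self.counts, key=lambda t: self.log_likelihood(tokens, t))


def topic_drifted(clf, draft, intended_topic):
    # Notification step: flag a draft whose predicted topic differs
    # from the topic the writer intended.
    return clf.predict(draft) != intended_topic


def precision_recall_f1(y_true, y_pred, topic):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == topic and p == topic)
    fp = sum(1 for t, p in pairs if t != topic and p == topic)
    fn = sum(1 for t, p in pairs if t == topic and p != topic)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy training data (hypothetical, two of the four topics only).
train_docs = ["مباراة كرة القدم", "فريق كرة السلة", "سوق الأسهم البنك", "البنك أسعار النفط"]
train_labels = ["Sport", "Sport", "Economy", "Economy"]
clf = UnigramTopicClassifier().fit(train_docs, train_labels)

print(clf.predict("كرة القدم"))
print(topic_drifted(clf, "البنك سوق", "Sport"))
```

Because the model is a simple table of counts, `predict` can be called again each time the draft grows, which is one plausible way to approximate classification "while writing"; the per-topic precision, recall, and F-measure would then be computed on a held-out set with `precision_recall_f1`.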


Khalid Nahar is an Assistant Professor in the Department of Computer Sciences, Faculty of IT, Yarmouk University, Irbid, Jordan. He received his BS and MS degrees in computer sciences from Yarmouk University, Jordan, in 1992 and 2005 respectively. He was awarded a full scholarship to pursue his PhD in Computer Sciences and Engineering at King Fahd University of Petroleum and Minerals (KFUPM), KSA. He completed his PhD in 2013 and then worked as an Assistant Professor at Tabuk University, KSA, for two years. In 2015 he returned to Yarmouk University, where he is currently the assistant dean for quality control. His research interests include: continuous speech recognition, Arabic computing, natural language processing, multimedia computing, content-based retrieval, Artificial Intelligence (AI), Machine Learning, IoT, and Data Science.

Ra’ed Al-Khatib is an Assistant Professor in the Department of Computer Sciences, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan, email: raed.m.alkhatib@yu.edu.jo. He received his BSc in Computer Sciences from Mu’tah University, Jordan, his MSc in Computer Science and Engineering from Yarmouk University, Jordan, in 2006, and his PhD in Computer Science from Universiti Sains Malaysia (USM), Penang, Malaysia, in 2012. He worked as an Assistant Professor at Jerash University, Jordan, before moving to Yarmouk University, Jordan, as an Assistant Professor in 2016. His research interests include: Artificial Intelligence (AI), Machine Learning, Natural Language Processing (NLP), High Parallel Computing (HPC), IoT, WSNs, Data Science, and Biometrics-Recognition Techniques.

Moy'awiah Al-Shannaq (CS Department Chairman) is an Assistant Professor of Computer Sciences in the Faculty of Information Technology and Computer Science, Yarmouk University. Before joining Yarmouk University, Dr. Al-Shannaq worked for two years as a Lecturer in the Department of Computer Sciences at Kent State University, Ohio, USA. He received his MSc and BSc in Computer Sciences from Yarmouk University, Jordan, and his PhD in Computer Sciences from Kent State University, Ohio, USA. His research interests include: Natural Language Processing, algorithmic graph and hypergraph theory, computational geometry, and network algorithms.

Mohammad Daradkeh is an Assistant Professor of Software and Information Technology in the Faculty of Information Technology and Computer Science, Yarmouk University. Before joining Yarmouk University, Dr. Daradkeh worked for two years as a Lecturer in the Department of Informatics and Enabling Technologies at Lincoln University, New Zealand. He received his PhD in Software and Information Technology from Lincoln University, New Zealand, and his MSc and BSc in Computer Science from Yarmouk University, Jordan. His research interests lie primarily in the areas of visual analytics, business intelligence and analytics, decision support systems, and uncertainty and risk management. He currently teaches undergraduate and graduate courses related to decision support systems, business intelligence and analytics, and information technology project management.

Rami Malkawi is an Assistant Professor in the Department of Computer Information Systems, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan. He received his BSc in Computer Science from Mu’tah University, Jordan, his MSc in Computer Science from Nottingham Trent University, UK, in 2003, and his PhD in Computer Science and Information Technology from the University of South Wales, UK, in 2013. In 2014 he worked as an Assistant Professor at Jadara University, Jordan, before moving to Yarmouk University, Jordan, as an Assistant Professor in 2016. His research interests include: Multimedia, Social Media, Data Analysis, e-Learning, Natural Language Processing, and Digital Storytelling technologies.