The International Arab Journal of Information Technology (IAJIT)


An Enhanced Corpus for Arabic Newspapers Comments

In this paper, we propose an enhanced approach to creating a dedicated corpus of Algerian Arabic newspaper comments. The approach improves on an existing one by enriching the available corpus and adding an annotation step that follows the Model, Annotate, Train, Test, Evaluate, Revise (MATTER) methodology. The corpus was created by collecting comments from the websites of three well-known Algerian newspapers. Three classifiers, support vector machines, naïve Bayes, and k-nearest neighbors, were used to classify the comments into positive and negative classes. To assess the influence of stemming on the results, the classification was tested with and without stemming. The results show that stemming does not considerably improve the classification, owing to the nature of Algerian comments, which are tied to the Algerian Arabic dialect. These promising results motivate us to improve our approach, especially in dealing with non-Arabic sentences, particularly dialectal and French ones.
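The with/without-stemming comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the four training comments, the two test comments, the affix lists in `light_stem`, and all function names are hypothetical, and a small hand-written multinomial naïve Bayes stands in for the three classifiers named in the abstract. It only shows how the same classifier can be trained and scored under both settings.

```python
import math
from collections import Counter

# Hypothetical affix lists for a toy light stemmer (an assumption, not the paper's stemmer).
PREFIXES = ("وال", "بال", "ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 characters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token


def tokenize(text, stem=False):
    """Whitespace tokenization, optionally followed by light stemming."""
    tokens = text.split()
    return [light_stem(t) for t in tokens] if stem else tokens


def train_nb(docs, stem=False):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text, stem):
            counts[tok] += 1
            vocab.add(tok)
    return {"classes": class_counts, "words": word_counts,
            "vocab": vocab, "stem": stem}


def predict_nb(model, text):
    """Return the class with the highest Laplace-smoothed log-probability."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["classes"].items():
        counts = model["words"][label]
        total_tokens = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text, model["stem"]):
            score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Fraction of (text, label) pairs that the model labels correctly."""
    correct = sum(predict_nb(model, text) == label for text, label in docs)
    return correct / len(docs)


# Toy, made-up comments used only to exercise the code.
TRAIN = [
    ("خبر جميل ومفيد", "positive"),
    ("مقال رائع شكرا", "positive"),
    ("خبر سيء وكاذب", "negative"),
    ("مقال ضعيف وممل", "negative"),
]
TEST = [
    ("مقال جميل", "positive"),
    ("خبر ضعيف", "negative"),
]

for use_stemming in (False, True):
    model = train_nb(TRAIN, stem=use_stemming)
    print("stemming:", use_stemming, "accuracy:", accuracy(model, TEST))
```

On a corpus as small as this toy set the two settings cannot be meaningfully compared; the sketch is only meant to show where the stemming switch enters the pipeline.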

The International Arab Journal of Information Technology, Vol. 17, No. 5, September 2020

Hichem Rahab is currently an Assistant Professor in the Department of Mathematics and Computer Science at the University of Khenchela, Algeria. He obtained his Master's degree in Computer Science from Batna University, Algeria, in 2012. His research interests include machine learning, Arabic opinion mining, and sentiment analysis.

Abdelhafid Zitouni received his PhD in Computer Science in 2008 from the University of Constantine, Algeria. He is currently a Professor at the University of Constantine 2 Abdelhamid Mehri. His research interests include cloud computing, security, and Arabic text mining. He has published many articles in international journals and conferences, and he has peer-reviewed conference and journal papers on these research topics.

Mahieddine Djoudi received a PhD in Computer Science from the University of Nancy, France, in 1991. His PhD thesis research was on acoustic-phonetic decoding for Standard Arabic speech recognition. He currently works in the Computer Science Department, Faculty of Fundamental and Applied Sciences, at the University of Poitiers, France, and is a member of the TechNE Technology Enhanced Learning Research Laboratory. His main scientific interests are e-learning, mobile learning, cloud computing, information literacy, and learning analytics. He has published over 100 scientific papers. He is also a program committee member, editor, or reviewer for international journals and conference proceedings.