The International Arab Journal of Information Technology (IAJIT)

Word Embedding as a Semantic Feature Extraction Technique in Arabic Natural Language Processing: An Overview

Feature extraction has transformed the field of Natural Language Processing (NLP) by providing an effective way to represent linguistic features. Various techniques are utilised for feature extraction, among which word embedding has emerged as a powerful technique for semantic feature extraction in Arabic Natural Language Processing (ANLP). Notably, research on feature extraction in Arabic remains relatively limited compared to English. In this paper, we present a review of recent studies focusing on word embedding as a semantic feature extraction technique applied in Arabic NLP. The review primarily covers studies applying word embedding techniques to Arabic corpora. We collected and analysed a selection of journal papers published between 2018 and 2023 in this field. Through our analysis, we categorised the different feature extraction techniques, identified the Machine Learning (ML) and/or Deep Learning (DL) algorithms employed, and assessed the performance metrics used in these studies. We demonstrate the superiority of word embeddings as a semantic feature representation in ANLP and compare their performance with other feature extraction techniques, highlighting the ability of word embeddings to capture semantic similarities, detect contextual associations, and facilitate a better understanding of Arabic text. Consequently, this article provides valuable insights into the current state of research on word embedding for Arabic NLP.
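To make the idea of word embedding as a semantic feature extractor concrete, the following is a minimal sketch, assuming the gensim library is available. The toy tokenised Arabic sentences and the hyperparameter values are purely illustrative and are not drawn from the reviewed studies; surveyed work typically trains on large corpora such as tweets or news articles.

```python
# Minimal illustrative sketch (not from the reviewed studies): training a
# skip-gram Word2Vec model on a tiny, hypothetical tokenised Arabic corpus
# with gensim, then using the learned vectors as semantic features.
from gensim.models import Word2Vec

# Hypothetical pre-tokenised Arabic sentences.
corpus = [
    ["اللغة", "العربية", "جميلة"],
    ["معالجة", "اللغة", "العربية", "الطبيعية"],
    ["التعلم", "العميق", "لمعالجة", "اللغة"],
]

# sg=1 selects the skip-gram architecture; vector_size, window, and epochs
# are illustrative values, not settings reported in the surveyed papers.
model = Word2Vec(corpus, vector_size=100, window=3, min_count=1, sg=1, epochs=50)

# Each word is now a dense vector that can serve as a semantic feature.
vector = model.wv["اللغة"]  # 100-dimensional embedding
print(vector.shape)

# Cosine similarity over the embeddings reflects semantic relatedness.
print(model.wv.most_similar("اللغة", topn=2))
```

In the studies reviewed here, such vectors are typically not used in isolation: they serve as the input representation to ML or DL classifiers (e.g., SVM, CNN, or LSTM models) for downstream ANLP tasks such as sentiment analysis and text categorisation.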
