The International Arab Journal of Information Technology (IAJIT)


Semantic Similarity Analysis for Corpus Development and Paraphrase Detection in Arabic

Paraphrase detection determines the extent to which an original document and a suspect document convey the same meaning. It has attracted attention from researchers in many Natural Language Processing (NLP) tasks, such as plagiarism detection, question answering, and information retrieval. Traditional methods (e.g., Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA)) cannot efficiently capture hidden semantic relations when sentences share no common words or when words rarely co-occur. Therefore, we proposed a deep learning model based on Global Vectors (GloVe) word embeddings and a Recurrent Convolutional Neural Network (RCNN). It was efficient at capturing more contextual dependencies between word vectors with precise semantic meanings. Given the lack of publicly available resources for the Arabic language, we automatically developed a paraphrased corpus that preserved the syntactic and semantic structures of Arabic sentences using a word2vec model and Part-Of-Speech (POS) annotation. Overall, experiments showed that our proposed model outperformed state-of-the-art methods in terms of precision and recall.
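The limitation of term-based methods noted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tfidf_vectors`, `cosine`), the smoothed IDF formula, and the toy English sentences are assumptions chosen for readability. It shows that a TF-IDF cosine score drops to zero for a paraphrase that shares no words with the original, while a near copy still scores high.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch) so common terms keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy sentences (illustrative only, not taken from the paper's corpus).
original = "the physician examined the child".split()
paraphrase = "a doctor checked that kid".split()      # same meaning, no shared words
near_copy = "the physician examined the boy".split()  # one word changed

vec_original, vec_paraphrase, vec_near_copy = tfidf_vectors(
    [original, paraphrase, near_copy]
)
sim_paraphrase = cosine(vec_original, vec_paraphrase)
sim_near_copy = cosine(vec_original, vec_near_copy)
print(f"paraphrase: {sim_paraphrase:.3f}, near copy: {sim_near_copy:.3f}")
```

Because the score depends only on overlapping terms, the paraphrase is indistinguishable from an unrelated sentence. Dense word embeddings, as used in the proposed model, are one way to assign nearby vectors to words such as "physician" and "doctor".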

Adnen Mahmoud is a PhD student in the Higher Institute of Computer Science and Communication Techniques ISITCom, Hammam Sousse, Tunisia. He is a member of the Research Laboratory in Algebra, Numbers Theory and Intelligent Systems RLANTIS, Monastir, Tunisia. His areas of interest include natural language processing (Arabic language), machine learning, data mining and information retrieval. He has published many research papers in international journals and conferences.

Mounir Zrigui received his PhD from the Paul Sabatier University, Toulouse, France in 1987 and his HDR from the Stendhal University, Grenoble, France in 2008. Since 1986, he has been a Computer Sciences Assistant Professor, first in Brest University, France, and afterwards in the Faculty of Science of Monastir, Tunisia. He started his research, focused on all aspects of automatic natural language processing (written and oral), in the RIADI laboratory and afterwards in the LaTICE Laboratory. In addition, he is a member of the Research Laboratory in Algebra, Numbers Theory and Intelligent Systems RLANTIS, Monastir, Tunisia. He has run many research projects and published many research papers in reputed international journals/conferences.