Semantic Similarity Analysis for Corpus Development and Paraphrase Detection in Arabic

Paraphrase detection determines to what extent an original document and a suspect document convey the same meaning. It has attracted attention in many Natural Language Processing (NLP) tasks, such as plagiarism detection, question answering, and information retrieval. Traditional methods (e.g., Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA)) cannot efficiently capture hidden semantic relations when sentences share no common words or when word co-occurrences are rare. Therefore, we propose a deep learning model based on Global Vectors (GloVe) word embeddings and a Recurrent Convolutional Neural Network (RCNN). The model efficiently captures contextual dependencies between word vectors with precise semantic meanings. Given the lack of publicly available Arabic resources, we developed a paraphrased corpus automatically, preserving the syntactic and semantic structures of Arabic sentences by means of a word2vec model and Part-Of-Speech (POS) annotation. Overall, experiments showed that the proposed model outperformed state-of-the-art methods in terms of precision and recall.
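The limitation of lexical methods noted in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the English sentences, the smoothed IDF formula, and the helper names `tfidf_vectors` and `cosine` are all assumptions chosen for illustration. It shows that a TF-IDF cosine score drops to zero for a paraphrase that shares no words with the original, while a verbatim copy scores 1.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) so every term keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical example sentences (English stand-ins for Arabic text).
original = "the physician examined the patient".split()
paraphrase = "a doctor checked on someone ill".split()  # same meaning, no shared words
copy = "the physician examined the patient".split()     # verbatim reuse

vecs = tfidf_vectors([original, paraphrase, copy])
sim_paraphrase = cosine(vecs[0], vecs[1])  # no term overlap, so the dot product is zero
sim_copy = cosine(vecs[0], vecs[2])        # identical vectors

print(f"paraphrase: {sim_paraphrase:.2f}, copy: {sim_copy:.2f}")
```

Because the score depends only on shared surface terms, the paraphrase is indistinguishable from an unrelated sentence. Dense word embeddings such as GloVe or word2vec are meant to close this gap by placing related words close together in vector space.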

Adnen Mahmoud is a PhD student at the Higher Institute of Computer Science and Communication Techniques (ISITCom), Hammam Sousse, Tunisia. He is a member of the Research Laboratory in Algebra, Numbers Theory and Intelligent Systems (RLANTIS), Monastir, Tunisia. His areas of interest include natural language processing (Arabic language), machine learning, data mining, and information retrieval. He has published many research papers in international journals and conferences.

Mounir Zrigui received his PhD from the Paul Sabatier University, Toulouse, France, in 1987 and his HDR from the Stendhal University, Grenoble, France, in 2008. Since 1986, he has been an Assistant Professor of Computer Science, first at Brest University, France, and later at the Faculty of Science of Monastir, Tunisia. He began his research, which covers all aspects of automatic natural language processing (written and spoken), in the RIADI laboratory and continued it in the LaTICE laboratory. He is also a member of the Research Laboratory in Algebra, Numbers Theory and Intelligent Systems (RLANTIS), Monastir, Tunisia. He has run many research projects and published many research papers in reputed international journals and conferences.