The International Arab Journal of Information Technology (IAJIT)



Texts Semantic Similarity Detection Based Graph Approach

Similarity of text documents is important for analyzing and extracting useful information from text documents and for generating appropriate data. Several lexical matching techniques have been proposed to determine the similarity between documents. They have been successful up to a certain limit, but they fail to find the semantic similarity between two texts. Therefore, semantic similarity approaches were suggested, such as corpus-based methods and knowledge-based methods, e.g., WordNet-based methods. This paper offers a new method for Paraphrase Identification (PI) that measures the semantic similarity of texts using the idea of a graph. We intend to account for the order of the words in a sentence. We offer a graph-based algorithm with a specific implementation for similarity identification that makes extensive use of word similarity information extracted from WordNet. Experiments were performed on the Microsoft Research Paraphrase Corpus, and we show that our approach achieves appropriate performance.
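The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of graph-based sentence similarity, not the method of this paper. It assumes a bipartite graph whose edges connect the words of two sentences and are weighted by a word-to-word similarity score, and it uses a simple greedy matching. The table `TOY_SIMILARITY` and the functions `word_similarity` and `sentence_similarity` are hypothetical names introduced for illustration; the table stands in for a WordNet-based measure. The sketch ignores word order, which the paper states it intends to address.

```python
from itertools import product

# Hypothetical toy scores standing in for a WordNet-based word similarity.
TOY_SIMILARITY = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["buy", "purchase"]): 0.8,
}


def word_similarity(a, b):
    """Return 1.0 for identical words, a table value for known pairs, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset([a, b]), 0.0)


def sentence_similarity(words_a, words_b):
    """Greedy matching on the bipartite word graph, normalized to [0, 1]."""
    if not words_a or not words_b:
        return 0.0
    # Edges of the bipartite graph: (weight, index in A, index in B).
    edges = [
        (word_similarity(wa, wb), i, j)
        for (i, wa), (j, wb) in product(enumerate(words_a), enumerate(words_b))
    ]
    edges.sort(key=lambda e: -e[0])  # heaviest edges first
    used_a, used_b, total = set(), set(), 0.0
    for weight, i, j in edges:
        # Take an edge only if both of its endpoints are still unmatched.
        if weight > 0.0 and i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += weight
    # Normalize by the longer sentence so unmatched words lower the score.
    return total / max(len(words_a), len(words_b))
```

For example, `sentence_similarity(["he", "buy", "car"], ["he", "purchase", "automobile"])` matches each word to its closest counterpart and averages the three edge weights. A greedy matching is not guaranteed to be optimal; an exact maximum-weight matching could be substituted without changing the interface.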


Majid Mohebbi received the MSc degree in software engineering from Shahid Beheshti University, Iran, in 2013. His research interests include semantic similarity and NLP.

Alireza Talebpour received his MSc degree in Artificial Intelligence and his PhD degree in Image Processing from the University of Surrey, United Kingdom. His research interests include image processing, pattern recognition, and intelligent methods for classification of massive data.