The International Arab Journal of Information Technology (IAJIT)


Conceptual Persian Text Summarizer: A New Model in Continuous Vector Space

Traditional methods of summarization are no longer cost-effective or practical today. Extractive summarization automatically selects the most important sentences of a text to produce a short, informative summary. In this work, we propose a novel unsupervised method for summarizing Persian texts. The proposed method adopts a hybrid approach that clusters the concepts of the text using deep learning together with traditional statistical methods. First, we build a word embedding from the Hamshahri2 corpus and a dictionary of word frequencies. The proposed algorithm then extracts the keywords of the document, clusters its concepts, and finally ranks the sentences to produce the summary. We evaluated the proposed method on the Pasokh single-document corpus using the ROUGE evaluation measure. Without using any hand-crafted features, it achieves better results than the state-of-the-art related work. Compared with the best supervised Persian methods, our unsupervised method achieves an overall improvement of 7.5% in ROUGE-2 recall.
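The extract-rank-select pipeline the abstract outlines can be illustrated with a toy sketch. This is not the authors' implementation: it replaces the embedding-based concept clustering with a simple word-frequency score, and the `summarize` function and its normalization heuristic are illustrative assumptions only.

```python
from collections import Counter
import math

def summarize(text, n_sentences=2):
    """Toy extractive summarizer: score each sentence by the document
    frequency of its words and keep the top-ranked sentences in their
    original order (a stand-in for the embedding-based ranking)."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())

    def score(s):
        tokens = [w.lower() for w in s.split()]
        # normalize by sqrt(length) so long sentences are not always favored
        return sum(freq[t] for t in tokens) / math.sqrt(len(tokens))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]  # preserve document order
```

Sentences sharing frequent content words (the document's recurring "concepts") score higher, while off-topic sentences fall out of the summary; the real system refines this ranking with word-embedding clusters rather than raw counts.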

[1] AleAhmad A., Amiri H., Darrudi E., Rahgozar M., and Oroumchian F., “Hamshahri: A Standard Persian Text Collection,” Knowledge-Based Systems, vol. 22, no. 5, pp. 382-387, 2009.

[2] Alkım E. and Çebi Y., “Machine Translation Infrastructure for Turkic Languages (MT-Turk),” The International Arab Journal of Information Technology, vol. 16, no. 3, pp. 380-388, 2019.

[3] Baxendale P., “Machine-Made Index for Technical Literature-An Experiment,” IBM Journal of Research and Development, vol. 2, no. 4, pp. 354-361, 1958.

[4] Bazghandi M., Tabrizi G., and Jahan M., “Extractive Summarization of Farsi Documents Based on PSO Clustering,” International Journal of Computer Science Issues, vol. 9, no. 4, pp. 329-332, 2012.

[5] Bengio Y., Ducharme R., Vincent P., and Jauvin C., “A Neural Probabilistic Language Model,” Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.

[6] Berger A. and Mittal V., “Query-Relevant Summarization Using FAQs,” in Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Stroudsburg, pp. 294-301, 2000.

[7] Brants T., “Large Language Models in Machine Translation,” in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, pp. 858-867, 2007.

[8] Chelba C., Mikolov T., Schuster M., Ge Q., Brants T., Koehn P., and Robinson T., “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling,” arXiv preprint arXiv:1312.3005, 2014.

[9] Chen Z., Lin W., Chen Q., Chen X., Wei S., Jiang X., and Zhu X., “Revisiting Word Embedding for Contrasting Meaning,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 106-115, 2015.

[10] Collobert R., Weston J., Bottou L., Kavukcuoglu K., and Kuksa P., “Natural Language Processing (Almost) from Scratch,” Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.

[11] Collobert R. and Weston J., “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” in Proceedings of the 25th International Conference on Machine Learning, Helsinki, pp. 160-167, 2008.

[12] Devlin J., Huang Z., Lamar T., Schwartz R., and Makhoul J., “Fast and Robust Neural Network Joint Models for Statistical Machine Translation,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, pp. 1370-1380, 2014.

[13] Edmundson H., “New Methods in Automatic Extracting,” Journal of the ACM, vol. 16, no. 2, pp. 264-285, 1969.

[14] Hassel M. and Mazdak N., “FarsiSum: A Persian Text Summarizer,” in Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Stroudsburg, pp. 82-84, 2004.

[15] Hinton G. and McClelland J., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, 1986.

[16] Hinton G. and Roweis S., “Stochastic Neighbor Embedding,” Advances in Neural Information Processing Systems, 2003.

[17] Honarpisheh M., Sani G., and Mirroshandel G., “A Multi-Document Multi-Lingual Automatic Summarization System,” in Proceedings of the 3rd International Joint Conference on Natural Language Processing, Hyderabad, pp. 733-738, 2008.

[18] Jin F., Huang M., and Zhu X., “A Comparative Study on Ranking and Selection Strategies for Multi-Document Summarization,” in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Beijing, pp. 525-533, 2010.

[19] Khanpour H., Sentence Extraction for Summarization and Note Taking, University of Malaya, 2011.

[20] Kiyomarsi F. and Esfahani F., “Optimizing Persian Text Summarization Based on Fuzzy Logic Approach,” in Proceedings of the International Conference on Intelligent Building and Management, Iran, pp. 264-269, 2011.

[21] Leskovec J., Mining of Massive Datasets, Cambridge University Press, 2014.

[22] Lin C., “Rouge: A Package for Automatic Evaluation of Summaries,” in Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, pp. 74-81, 2004.

[23] Liu Q., Jiang H., Wei S., Liang H., and Hu H., “Learning Semantic Word Embeddings Based on Ordinal Knowledge Constraints,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, pp. 1501-1511, 2015.

[24] Luhn H., “The Automatic Creation of Literature Abstracts,” IBM Journal of Research and Development, vol. 2, no. 2, pp. 159-165, 1958.

[25] Mikolov T., Sutskever I., Chen K., Corrado G., and Dean J., “Distributed Representations of Words and Phrases and their Compositionality,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, New York, pp. 3111-3119, 2013.

[26] Mikolov T., Deoras A., Kombrink S., and Burget L., “Empirical Evaluation and Combination of Advanced Language Modeling Techniques,” in Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, pp. 605-608, 2011.

[27] Miller G., “WordNet: A Lexical Database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.

[28] Moghaddas B., Kahani M., Toosi S., and Estiri A., “Pasokh: A Standard Corpus for the Evaluation of Persian Text Summarizers,” in Proceedings of the International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, pp. 471-475, 2013.

[29] Asef P., Mohsen K., Ahmad T., Ahmad E., and Hadi Q., “Ijaz: An Operational System for Single-Document Summarization of Persian News Texts,” Signal and Data Processing, vol. 11, no. 1, pp. 33-48, 2014.

[30] Roweis S. and Saul L., “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” Science, vol. 290, no. 5500, pp. 2323-2326, 2000.

[31] Schwenk H., “Continuous Space Language Models,” Computer Speech and Language, vol. 21, no. 3, pp. 492-518, 2007.

[32] Shafiee F. and Shamsfard M., “Similarity Versus Relatedness: A Novel Approach in Extractive Persian Document Summarisation,” Journal of Information Science, vol. 44, no. 3, pp. 314-330, 2018.

[33] Shakeri H., Gholamrezazadeh S., Salehi M., and Ghadamyari F., “A New Graph-Based Algorithm for Persian Text Summarization,” Computer Science and Convergence, vol. 114, pp. 21-30, 2012.

[34] Shamsfard M., Hesabi A., Fadaei H., and Mansoory N., “Developing FarsNet: A Lexical Ontology for Persian,” in Proceedings of the 4th Global WordNet Conference, Szeged, 2008.

[35] Shamsfard M., Akhavan T., and Joorabchi M., “Persian Document Summarization by Parsumist,” World Applied Sciences Journal, vol. 7, pp. 199-205, 2009.

[36] Shamsfard M., Hesabi A., Fadaei H., Mansoory N., Famian A., Bagherbeigi S., Fekr E., Monshizadeh M., and Assi M., “Semi Automatic Development of FarsNet; the Persian WordNet,” in Proceedings of the 5th Global WordNet Conference, Mumbai, 2010.

[37] Song W., Choi L., Park S., and Ding X., “Fuzzy Evolutionary Optimization Modeling and its Applications to Unsupervised Categorization and Extractive Summarization,” Expert Systems with Applications, vol. 38, no. 8, pp. 9112-9121, 2011.

[38] Strutz T., Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond, Vieweg and Teubner, 2010.

[39] Sutskever I., Vinyals O., and Le Q., “Sequence to Sequence Learning with Neural Networks,” Advances in Neural Information Processing Systems, vol. 2, pp. 3104-3112, 2014.

[40] Tang J., Yao L., and Chen D., “Multi-Topic Based Query-Oriented Summarization,” in Proceedings of the International Conference on Data Mining, Nevada, pp. 1148-1159, 2009.

[41] Tenenbaum J., Silva V., and Langford J., “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, no. 5500, pp. 2319-2323, 2000.

[42] Tofighy M., Kashefi O., Zamanifar O., and Javadi H., “Persian Text Summarization Using Fractal Theory,” in Proceedings of the International Conference on Informatics Engineering and Information Science, Kuala Lumpur, pp. 651-662, 2011.

[43] Tofighy S., Raj R., and Javadi H., “AHP Techniques for Persian Text Summarization,” Malaysian Journal of Computer Science, vol. 26, no. 1, pp. 1-8, 2013.

[44] Turney P. and Pantel P., “From Frequency to Meaning: Vector Space Models of Semantics,” Journal of Artificial Intelligence Research, vol. 37, no. 1, pp. 141-188, 2010.

[45] Zamanifar A., Bidgoli B., and Sharifi M., “A New Hybrid Farsi Text Summarization Technique Based on Term Co-Occurrence and Conceptual Property of the Text,” in Proceedings of the International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, Phuket, pp. 635-639, 2008.

[46] Zamanifar A. and Kashefi O., “AZOM: A Persian Structured Text Summarizer,” in Proceedings of the International Conference on Applications of Natural Language to Information Systems, Paris, pp. 234-237, 2011.

Mohammad Ebrahim Khademi received his M.S. degree in Computer Engineering from the Malek Ashtar University of Technology, Iran, in 2013. He is currently a PhD candidate in Computer Engineering there. His research interests include machine learning (deep learning) and natural language processing.

Mohammad Fakhredanesh received his B.S., M.S., and PhD degrees in Computer Science and Engineering from the Amirkabir University of Technology (Tehran Polytechnic), Iran, in 2005, 2007, and 2014, respectively. He is currently an assistant professor at the Malek Ashtar University of Technology. His research interests include artificial intelligence, pattern recognition, and text summarization.

Seyed Mojtaba Hoseini received his B.S. degree in Electronic Engineering from the Malek Ashtar University of Technology in 1991. He received his M.S. and PhD degrees in Computer Architecture Engineering from the Amirkabir University of Technology in 1995 and 2011, respectively. His research interests include wireless sensor networks, with an emphasis on target coverage and tracking applications, image and signal processing, and evolutionary computing.