Designing Punjabi Poetry Classifiers Using Machine Learning and Different Textual Features

Author Jasleen Kaur and Jatinderkumar Saini

Keywords #Classification #naïve bayes #hyper pipes #k-nearest neighbour #Punjabi #poetry #support vector machine #word net

Abstract Analysis of poetic text is very challenging from a computational linguistics perspective, and computational analysis of literary art, especially poetry, is a difficult classification task. For a library recommendation system, poems can be classified on various metrics such as poet, time period, sentiment, and subject matter. In this work, a content-based Punjabi poetry classifier was developed using the Weka toolset. Four categories were manually populated with 2034 poems: Nature and Festival (NAFE), Linguistic and Patriotic (LIPA), Relation and Romantic (RORE), and Philosophy and Spiritual (PHSP), consisting of 505, 399, 529, and 601 poems, respectively. These poems were passed through several pre-processing sub-phases: tokenization, noise removal, stop word removal, and special symbol removal. The 31938 extracted tokens were weighted using the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting schemes. Based on poetry elements, three textual features (lexical, syntactic, and semantic) were used to build classifiers with different machine learning algorithms: Naive Bayes (NB), Support Vector Machine (SVM), Hyper pipes, and K-nearest neighbour. The results revealed that the semantic feature performed better than the lexical and syntactic features. The best-performing algorithm was SVM, and the highest accuracy (76.02%) was achieved by incorporating the semantic information associated with words.
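The pre-processing and weighting steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Weka pipeline: the toy English poems, the `STOP_WORDS` set, the helper names, and the IDF variant `log(N / df)` are all assumptions chosen for illustration, and a real Punjabi pipeline would need a Gurmukhi stop word list and script-specific symbol handling.

```python
import math
import string
from collections import Counter

# Hypothetical stop-word list; a Punjabi system would use a Gurmukhi list instead.
STOP_WORDS = {"the", "of", "and", "in"}

# Translation table that deletes ASCII punctuation and digits
# (a stand-in for the noise and special symbol removal steps).
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize on whitespace after stripping symbols, then drop stop words."""
    tokens = text.lower().translate(_STRIP).split()
    return [t for t in tokens if t not in stop_words]


def term_frequency(tokens):
    """TF weighting: raw count of each token in one document."""
    return Counter(tokens)


def tf_idf(docs_tokens):
    """TF-IDF weighting using the assumed variant tf * log(N / df)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each token once per document
    weighted = []
    for tokens in docs_tokens:
        tf = term_frequency(tokens)
        weighted.append(
            {t: count * math.log(n_docs / doc_freq[t]) for t, count in tf.items()}
        )
    return weighted


# Toy documents, invented for this example only.
poems = [
    "The colours of spring, and the festival of lights!",
    "Love and longing in the monsoon rain.",
    "The festival of spring rain.",
]
tokens = [preprocess(p) for p in poems]
weights = tf_idf(tokens)
```

In this sketch, a token that appears in only one poem (such as "colours") receives a higher TF-IDF weight than one shared by several poems (such as "spring"), which is the property that makes the scheme useful for separating categories. The resulting weight dictionaries could then be fed to any of the classifiers listed in the abstract.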

The International Arab Journal of Information Technology, Vol. 17, No. 1, January 2020

Jasleen Kaur received her Bachelor of Technology in Computer Science and Engineering from Guru Teg Bahadur Khalsa Institute of Engineering Technology, Malout, Punjab, and her Master of Technology (Computer Engineering) from Punjabi University, Patiala, Punjab. She completed her Ph.D. at Uka Tarsadia University, Bardoli, Gujarat. She has published 15 papers in various international journals and has more than 60 citations. She has publications with Inderscience Publishers, Springer, and the ACM Digital Library.

Jatinderkumar Saini holds a Ph.D. from VNSGU, Surat. He secured First Rank in all three years of MCA and was awarded Gold Medals for this. Besides being University Topper, he is an IBM Certified Database Associate (DB2) as well as an IBM Certified Associate Developer (RAD). Associated with more than 50 countries, he has been a Member of the Program Committee for more than 50 International Conferences (including those by IEEE) and an Editorial Board Member or Reviewer for more than 30 International Journals (including many with a Thomson Reuters Impact Factor). He has more than 55 research paper publications and nearly 20 presentations in reputed International and National Conferences and Journals. He is a member of ISTE, IETE, ISG and CSI.