Received date September 5, 2015

Accepted date June 1, 2016

Comprehensive Stemmer for Morphologically Rich Urdu Language

Author Mubashir Ali, Shehzad Khalid, and Muhammad Saleemi

Abstract The Urdu language is used by approximately 200 million people for spoken and written communication. A large volume of unstructured Urdu text is available worldwide, and data mining techniques can be employed to extract useful information from this large potential information base. Many text processing systems are available; however, they are mostly language-specific, and a large proportion of them apply only to English text. This is primarily due to language-dependent pre-processing, mainly the stemming requirement. Stemming is a vital pre-processing step in text mining; its core aim is to reduce the many grammatical forms of a word (e.g., parts of speech, gender, tense) to their root form. In this work, we develop a comprehensive rule-based stemming method for Urdu text. The proposed stemmer generates the stem of Urdu words as well as loan words (words borrowed from other languages, e.g., Arabic, Persian, and Turkish) by removing prefixes, infixes, and suffixes. The technique introduces six novel Urdu infix word classes and a minimum word length rule. To cope with the challenge of Urdu infix stemming, we have developed infix stripping rules for the introduced infix word classes, along with generic rules for prefix and suffix stemming. The experimental results show the superiority of the proposed stemming approach over the existing technique.
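
The prefix and suffix stripping described in the abstract, combined with a minimum word length rule, might look roughly like the following minimal Python sketch. It is not the authors' rule set: the affix lists, the threshold `MIN_STEM_LENGTH`, and the function name `strip_affixes` are all hypothetical, the length rule is interpreted here as "never strip if the remaining stem would be too short", and infix stripping is left out because the abstract does not describe the six infix word classes.

```python
# Illustrative affix lists only; these are NOT the rules from the paper.
PREFIXES = ("بے", "بد")   # "be-" (without) and "bad-" (ill-)
SUFFIXES = ("وں", "یں")   # two common plural endings
MIN_STEM_LENGTH = 3        # assumed threshold; the abstract gives no value


def strip_affixes(word: str) -> str:
    """Remove at most one prefix and one suffix from a word.

    An affix is stripped only if the remaining stem keeps at least
    MIN_STEM_LENGTH characters (our reading of a minimum length rule).
    """
    stem = word
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LENGTH:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LENGTH:
            stem = stem[:-len(suffix)]
            break
    return stem


if __name__ == "__main__":
    for word in ("کتابوں", "بدتمیز", "بدن"):
        print(word, "->", strip_affixes(word))
```

In this sketch the plural "کتابوں" (books) loses its suffix, "بدتمیز" (ill-mannered) loses its prefix, and "بدن" (body) is left unchanged: it happens to begin with the letters of the prefix "بد", but stripping them would leave a single character, so the length check blocks the rule.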

Mubashir Ali received the MS degree from Bahria University, Islamabad, Pakistan in 2014. He received the BS degree in Computer Science from Allama Iqbal Open University, Islamabad, Pakistan in 2010. He is currently working as an Assistant Professor in the Department of Computer Science and IT, The University of Lahore, Gujrat Campus. Mubashir Ali is an active researcher, and his areas of interest are text mining, social network mining, natural language processing, computational linguistics, and software repository mining.

Shehzad Khalid is a professor and Head of the Computer Engineering Department. He is a qualified academician and researcher with more than 70 international publications in conferences and journals. He has also authored various books and book chapters. Dr. Shehzad graduated from Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Pakistan, in 2000. He received the M.Sc. degree from National University of Science and Technology, Pakistan in 2003 and the Ph.D. degree from the University of Manchester, U.K., in 2009.

Muhammad Saleemi is a linguistic expert. He received the BA degree from Punjab University, Lahore, Pakistan in 1967 and the MA degree in Urdu from the same university in 1971. He worked as a principal at Government High School, Khanki Head, for eight years. He is also an active member of the Punjab Education Department, and to promote education he set up a free tuition center for needy and poor people. He has published several grammar books for the Urdu language. His areas of interest are the Urdu, Arabic, Persian, and English languages.