The International Arab Journal of Information Technology (IAJIT)


Comprehensive Stemmer for Morphologically Rich Urdu Language

Urdu is used by approximately 200 million people for spoken and written communication, and a large body of unstructured Urdu text is available worldwide. Data mining techniques can be employed to extract useful information from such a large potential information base. Many text processing systems are available; however, most are language specific, and a large proportion apply only to English text. This is primarily due to language-dependent pre-processing, mainly the stemming requirement. Stemming is a vital pre-processing step in the text mining process, and its core aim is to reduce the many grammatical forms of a word (e.g., part of speech, gender, tense) to their root form. In this work, we develop a comprehensive rule-based stemming method for Urdu text. The proposed Urdu stemmer can generate the stem of Urdu words as well as loan words (words borrowed from other languages, e.g., Arabic, Persian, and Turkish) by removing prefixes, infixes, and suffixes. The proposed technique introduces six novel Urdu infix word classes and a minimum word length rule. To cope with the challenge of Urdu infix stemming, we develop infix stripping rules for the introduced infix word classes, together with generic rules for prefix and suffix stemming. The experimental results show the superiority of the proposed stemming approach compared with existing techniques.
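
To make the idea of rule-based affix stripping with a minimum word length rule concrete, the following is a minimal sketch in Python. It is not the stemmer proposed in this paper: the affix lists, the threshold `MIN_STEM_LEN`, and the function name `strip_affixes` are hypothetical and chosen only for illustration, and infix stripping is omitted because the infix word classes are not specified in this section.

```python
# Illustrative sketch only: affix lists and threshold are assumptions,
# not the rule set of the proposed stemmer.
PREFIXES = ["\u0646\u0627"]                  # "نا" (na-), a negating prefix
SUFFIXES = ["\u0648\u06BA", "\u06CC\u06BA"]  # "وں" and "یں", plural endings
MIN_STEM_LEN = 3                             # assumed minimum stem length


def strip_affixes(word, prefixes=PREFIXES, suffixes=SUFFIXES,
                  min_len=MIN_STEM_LEN):
    """Strip at most one prefix and one suffix from `word`.

    An affix is removed only if the remaining stem keeps at least
    `min_len` characters, which guards short words against over-stemming.
    """
    stem = word
    # Try longer affixes first so the most specific rule wins.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_len:
            stem = stem[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
            stem = stem[:-len(suffix)]
            break
    return stem


if __name__ == "__main__":
    kitab = "\u06A9\u062A\u0627\u0628"              # "کتاب" (kitab, book)
    print(strip_affixes(kitab + "\u0648\u06BA"))     # plural form reduced to the stem
    print(strip_affixes("\u06CC\u0648\u06BA"))       # "یوں" is too short, left unchanged
```

In this sketch the length guard is what keeps a short word such as "یوں" intact even though it ends in a listed suffix; a production stemmer would also need exception lists and the infix rules described above.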

Mubashir Ali received the BS degree in Computer Science from Allama Iqbal Open University, Islamabad, Pakistan, in 2010 and the MS degree from Bahria University, Islamabad, Pakistan, in 2014. He is currently an Assistant Professor in the Department of Computer Science and IT, The University of Lahore, Gujrat Campus. He is an active researcher, and his areas of interest are text mining, social network mining, natural language processing, computational linguistics, and software repository mining.

Shehzad Khalid is a Professor and Head of the Computer Engineering Department. He is a qualified academician and researcher with more than 70 international publications in conferences and journals. He has also authored various books and book chapters. He graduated from Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Pakistan, in 2000. He received the M.Sc. degree from National University of Science and Technology, Pakistan, in 2003 and the Ph.D. degree from the University of Manchester, U.K., in 2009.

Muhammad Saleemi is a linguistics expert. He received the BA degree from Punjab University, Lahore, Pakistan, in 1967 and the MA degree in Urdu from the same university in 1971. He worked as a principal at Govt. High School Khanki Head for eight years. He is also an active member of the Punjab Education Department and, to promote education, set up a free tuition center for needy and poor people. He has published several grammar books for the Urdu language. His areas of interest are the Urdu, Arabic, Persian, and English languages.