The International Arab Journal of Information Technology (IAJIT)



A Hybrid Approach for Urdu Sentence Boundary Disambiguation

Sentence boundary identification is a preliminary step in preparing a text document for Natural Language Processing tasks such as machine translation, POS tagging, and text summarization. We present a hybrid approach to Urdu sentence boundary disambiguation that combines a unigram statistical model with a rule-based algorithm. With this approach, we obtained 99.48% precision, 86.35% recall, and 92.45% F1-Measure when the training and testing data were different, and 99.36% precision, 96.45% recall, and 97.89% F1-Measure when the same data were used for both training and testing.
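As a quick consistency check on the reported figures, the F1-Measure can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short Python sketch below does this for both evaluation settings. The function name `f1_score` and the dictionary layout are illustrative choices, not part of the original work, and the sketch assumes that the standard balanced F1 definition is the one used.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, in percent: (precision, recall, F1).
reported = {
    "different training and testing data": (99.48, 86.35, 92.45),
    "same training and testing data": (99.36, 96.45, 97.89),
}

for setting, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{setting}: computed F1 = {f1:.2f}%, reported F1 = {f1_reported:.2f}%")
```

Under that assumption, the recomputed values agree with the reported ones to within about 0.01 percentage points, which is what one would expect when precision and recall have themselves been rounded to two decimal places.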


Zobia Rehman has been a lecturer at COMSATS Institute of Information Technology, Pakistan, since October 2009. She received her MS in computer science from COMSATS in 2009. Her areas of interest are natural language processing and artificial neural networks.

Waqas Anwar has been working at COMSATS Institute of Information Technology, Pakistan, as an assistant professor since April 2008. He received his PhD in computer application technology from Harbin Institute of Technology, PR China, in 2008, and his Master's in computer science from Hamdard University, Pakistan, in 2001. He is an active researcher, and his areas of interest are natural language processing and computational intelligence.