Development of a Hindi Named Entity Recognition System without Using Manually Annotated

Keywords #Natural language processing #machine learning #named entity recognition #resource scarcity #language transfer #semi-supervised learning

Abstract A machine-learning-based approach to Named Entity Recognition (NER) requires a sufficiently large annotated corpus to train the classifier. Other NER resources, such as gazetteers, are also needed to make the classifier more accurate. In many languages and domains, however, the relevant NER resources are still not available, and creating adequate resources is costly and time-consuming. By contrast, a large amount of resources and several NER systems are available for resource-rich languages such as English. Suitable language adaptation techniques, the NER resources of a resource-rich language, and minimally supervised learning may help to overcome this scarcity. In this paper we study a few such techniques in order to develop a Hindi NER system. Without using any Hindi NE-annotated corpus, the developed system achieves an F-measure of 73.87.
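For readers unfamiliar with the reported metric, the sketch below shows how an F-measure is commonly computed from entity-level counts. It assumes the balanced F-measure (harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_measure` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the balanced F-measure (F1) as a percentage.

    Assumption: precision and recall are weighted equally; the abstract
    reports an F-measure but does not say which variant was used.
    """
    predicted = true_positives + false_positives  # entities the system output
    actual = true_positives + false_negatives     # entities in the gold standard
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
score = f_measure(true_positives=80, false_positives=20, false_negatives=30)
print(round(score, 2))
```

With these illustrative counts, precision is 0.80 and recall is about 0.73, so the score falls between the two and closer to the lower value, as a harmonic mean does.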

The International Arab Journal of Information Technology, Vol. 15, No. 6, November 2018

Sujan Kumar Saha is an Assistant Professor in the Department of Computer Science and Engineering, Birla Institute of Technology Mesra, Ranchi, India. His main research interests include Natural Language Processing, Machine Learning, and Educational Technologies.

Mukta Majumder is an Assistant Professor in the Department of Computer Science and Application, University of North Bengal, Siliguri, India. Prior to this he served Vidyasagar University as an Assistant Professor for almost three years. His main research interests include Text Processing, Machine Learning, Microfluidic Systems, and Biochips.