The International Arab Journal of Information Technology (IAJIT)


Constructing a Lexicon of Arabic-English Named Entity using SMT and Semantic ...

Named Entity Recognition (NER) is the problem of locating and categorizing atomic entities in a given text. In this work, we used DBpedia linked datasets and combined existing open source tools to generate a bilingual lexicon of Named Entities (NE) from a parallel corpus. To annotate NE in the monolingual English corpus, we used linked data entities by mapping them to GATE gazetteers. To translate the entities identified by GATE in the English corpus, we used Moses, a Statistical Machine Translation (SMT) system. The construction of the Arabic-English NE lexicon is based on the results of the Moses translation. Our method is fully automatic and aims to help Natural Language Processing (NLP) tasks such as Machine Translation (MT), information retrieval, text mining and question answering. Our lexicon contains 48753 pairs of Arabic-English NE and is freely available for use by other researchers.
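The pipeline outlined in the abstract (gazetteer-based annotation, translation of each entity, collection of the pairs) can be illustrated with a minimal sketch. It is not the authors' implementation and does not call GATE, Moses or DBpedia: the function names `annotate_entities` and `build_lexicon`, the toy gazetteer and the dictionary standing in for the SMT system are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def annotate_entities(tokens: List[str],
                      gazetteer: Dict[str, str],
                      max_len: int = 4) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of token spans in a gazetteer.

    `gazetteer` maps an entity string to its type; in the paper this role
    is played by gazetteers derived from DBpedia (assumed format here).
    """
    found = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in gazetteer:
                found.append((span, gazetteer[span]))
                matched = n
                break
        i += matched if matched else 1
    return found


def build_lexicon(sentences: List[str],
                  gazetteer: Dict[str, str],
                  translate: Callable[[str], Optional[str]]
                  ) -> Set[Tuple[str, str, str]]:
    """Collect (Arabic, English, type) triples for every annotated entity.

    `translate` is a hypothetical stand-in for the SMT step; it returns
    None when no translation is available.
    """
    lexicon = set()
    for sentence in sentences:
        for entity, ne_type in annotate_entities(sentence.split(), gazetteer):
            arabic = translate(entity)
            if arabic:
                lexicon.add((arabic, entity, ne_type))
    return lexicon


# Toy data (hypothetical, not taken from the paper's corpus).
toy_gazetteer = {"Tunisia": "Location", "United Nations": "Organization"}
toy_translations = {"Tunisia": "تونس", "United Nations": "الأمم المتحدة"}

lexicon = build_lexicon(
    ["The United Nations office in Tunisia opened"],
    toy_gazetteer,
    toy_translations.get,
)
for arabic, english, ne_type in sorted(lexicon, key=lambda t: t[1]):
    print(english, arabic, ne_type)
```

Storing the triples in a set means an entity that occurs many times in the corpus contributes a single lexicon entry, which is the behaviour one would expect from a lexicon rather than from a list of annotations.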


Emna Hkiri is a PhD student in Computer Sciences at the Faculty of Economics and Management of Sfax. She is a member of LaTICE Laboratory. Her main research interests are in natural language processing (Arabic language): text translation, ontologies, NER and machine learning.

Souheyl Mallat is a PhD student in the Faculty of Economic Sciences and Management of Sfax, Tunisia. He is a member of LaTICE Laboratory, Monastir unit (Tunisia). His areas of interest include natural language processing, data mining and information retrieval.

Mounir Zrigui received his PhD from the Paul Sabatier University, Toulouse, France, in 1987 and his HDR from the Stendhal University, Grenoble, France, in 2008. Since 1986, he has been a Computer Sciences assistant professor at Brest University, France, and later at the Faculty of Science of Monastir, Tunisia. He started his research, focused on all aspects of automatic processing of natural language (written and oral), in RIADI laboratory and later in LaTICE Laboratory. He has run many research projects and published many research papers in reputed international journals and conferences.

Mourad Mars received his PhD from Grenoble Alpes University, France, in 2011. He is a member of LaTICE Laboratory, Monastir unit (Tunisia). His areas of interest include natural language processing, machine learning and feature extraction.