Exploring the Potential of Schemes in Building NLP Tools for Arabic Language

Author LaTICE Laboratory, Faculty of Sciences of Monastir, Tunisia


Abstract Arabic is known for its sparseness, which explains the difficulty of processing it automatically. The Arabic language is based on schemes; lemmas are produced by derivation from roots and schemes. This property offers two major advantages. First, this “hidden side” of the Arabic language, composed of schemes, suffers much less from sparseness because schemes form a finite set. Second, schemes keep a large number of the language's features in a much smaller vocabulary. Schemes therefore have great potential for building accurate natural language processing (NLP) tools for Arabic. In this work we explore this potential by building NLP tools that rely entirely on schemes. The work covers text classification and Probabilistic Context Free Grammar (PCFG) parsing.
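As an illustration of the root-and-scheme derivation described in the abstract, the following minimal sketch fills the consonant slots of a few schemes with triliteral roots. It is not the authors' implementation. The simplified Latin transliteration, the slot notation (digits 1 to 3 for the root consonants), the `SCHEMES` table and the `derive` function are all assumptions made for this example only.

```python
# Hypothetical scheme templates in a simplified transliteration (assumption):
# the digits 1, 2 and 3 mark the slots of the first, second and third
# root consonant; every other character belongs to the scheme itself.
SCHEMES = {
    "1a2a3a": "basic past-tense verb",
    "1aa2i3": "active participle",
    "ma12uu3": "passive participle",
}


def derive(root, scheme):
    """Fill the numbered slots of a scheme with the consonants of a root."""
    out = []
    for ch in scheme:
        # A digit selects a root consonant; anything else is copied as-is.
        out.append(root[int(ch) - 1] if ch.isdigit() else ch)
    return "".join(out)


# Three triliteral roots (write, study, kill), chosen only for illustration.
roots = [("k", "t", "b"), ("d", "r", "s"), ("q", "t", "l")]

# Every root combined with every scheme yields a distinct lemma.
lemmas = {derive(root, scheme) for root in roots for scheme in SCHEMES}

print(sorted(lemmas))
print(len(lemmas), "lemmas are covered by", len(SCHEMES), "schemes")
```

In this toy setting the lemma vocabulary grows with the number of roots, while the scheme vocabulary stays fixed, which is the sparseness argument made above.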

Mohamed Achraf Ben Mohamed is a PhD student in the Faculty of Economic Sciences and Management of Sfax, Tunisia. He is a member of LaTICE Laboratory, Monastir unit (Tunisia). His areas of interest include natural language processing, computer assisted language learning and machine learning.

Souheyl Mallat received his BSc degree in computer science from the Higher Institute of Applied Science and Technology of Sousse, Tunisia and his MSc degree from the Faculty of Sciences of Monastir, Tunisia. He is a member of LaTICE Laboratory, Monastir unit (Tunisia). His areas of interest include natural language processing, data mining and information retrieval.

Mohamed Amine Nahdi received his BA degree in computer science from the Faculty of Sciences of Monastir, Tunisia and his MA from the Grenoble Institute of Technology, France. He is a member of LaTICE Laboratory in Tunisia and LIDILEM Laboratory in Grenoble, France.

Mounir Zrigui is an associate professor at the University of Monastir, Tunisia. He received his PhD degree from the Paul Sabatier University, Toulouse, France in 1987 and his HDR in computer science from the Stendhal University, Grenoble, France in 2008. He has more than 25 years of experience, including teaching and research in all aspects of automatic processing of natural language (written and oral).