The International Arab Journal of Information Technology (IAJIT), Vol. 15, No. 3A, Special Issue 2018


Conditional Arabic Light Stemmer: CondLight

Yaser Al-Lahham, Khawlah Matarneh, and Mohammad Hassan

Arabic has a complex morphological structure, which makes it hard to select index terms for an information retrieval (IR) system. This complexity is caused by multimode terms, the use of diacritics, letters that take different forms depending on their position in the word, and affixes that can be attached at any position in a word. Several methods have been proposed to overcome these problems, such as root extraction and light stemming. Light stemming shows better retrieval efficiency. Light10, the best of a series of light stemmers, simply removes prefixes and suffixes that are listed in a predefined table. Because Light10 places no restrictions on the affixes, two terms with different meanings can be reduced to the same token. This paper proposes the CondLight stemmer, which adds new prefixes and suffixes to the Light10 table and imposes a set of conditions on removing these affixes. Implementation and testing of the proposed method show that CondLight achieves 38% precision, while Light10 achieves an average precision of 36.7%. Moreover, CondLight shows better average precision whether all of the conditions are imposed or only some of them.
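
The difference between table-driven affix removal and conditional removal can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the affix lists are small samples rather than the actual Light10 or CondLight tables, the function name `light_stem` is hypothetical, and the single condition used here (a minimum number of letters that must remain) merely stands in for the paper's set of conditions, which the abstract does not enumerate.

```python
# Illustrative affix tables (assumed samples, NOT the real Light10/CondLight tables).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def light_stem(word, min_stem_len=1):
    """Strip the longest listed prefix, then the longest listed suffix.

    An affix is removed only if at least `min_stem_len` letters remain.
    min_stem_len=1 approximates unconditional table lookup; a larger value
    is a stand-in (assumption) for a removal condition.
    """
    # Consider only the longest matching prefix.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix):
            if len(word) - len(prefix) >= min_stem_len:
                word = word[len(prefix):]
            break

    # Consider only the longest matching suffix.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= min_stem_len:
                word = word[:-len(suffix)]
            break

    return word


if __name__ == "__main__":
    for w in ["الكتاب", "المكتبات", "والد"]:
        print(w, light_stem(w, min_stem_len=1), light_stem(w, min_stem_len=3))
```

In this sketch the word والد ("father") starts with the listed prefix وال. Unrestricted removal leaves a single letter, whereas the length condition blocks the removal and keeps the word intact. This is the kind of over-stemming that conditions on affix removal are meant to avoid.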

Yaser Al-Lahham received the B.S. degree from the University of Jordan in 1985, the M.S. degree from the Arab Academy (Jordan) in 2004, and the PhD in computer science from Bradford University (UK) in 2009. He is working as an assistant professor in the Department of Computer Science at Zarqa University in Jordan. His research interests include P2P information retrieval systems, text clustering, and databases.

Mohammad Hassan received his B.S. degree from Yarmouk University in Jordan in 1987, the M.S. degree from the University of Jordan in 1996, and the PhD degree in computer information systems from Bradford University, UK, in 2003. He is working as an associate professor in the Department of Computer Science at Zarqa University in Jordan. His research interests include information retrieval systems and database systems.

Khawla Al Matarneh received her B.S. degree in computer science from Mu'tah University in Jordan in 2004 and the M.S. degree in computer science from Zarqa University in 2017. Her research interests include information retrieval systems and database systems.