The International Arab Journal of Information Technology (IAJIT)



New Language Models for Spelling Correction

Correcting spelling errors based on context is a significant problem in Natural Language Processing (NLP) applications. Most of the work that introduces context into the spelling correction process uses n-gram language models. However, in several cases these models fail to assign adequate probabilities to the suggested corrections of a misspelled word in a given context. To resolve this issue, we propose two new language models inspired by stochastic language models and combined with edit distance. The first phase finds the lexicon words that are orthographically close to the erroneous word, and the second phase ranks and limits these suggestions. We have applied the new approach to the Arabic language, taking into account its characteristic strong contextual connections between distant words in a sentence. To evaluate our approach, we have developed textual data processing applications, namely the extraction of distant transition dictionaries. The correction accuracy obtained exceeds 98% for the first 10 suggestions. Compared to n-gram language models, our approach simplifies the parameters to be estimated while achieving higher correction accuracy, which motivates its use.
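The two-phase idea described above can be illustrated with a minimal Python sketch. It is not the authors' models: the ranking score is a hypothetical stand-in that sums co-occurrence counts between each candidate and the other words of the sentence, and the names `candidates`, `rank`, `cooc`, `max_dist`, and `top_k` are assumptions introduced here for illustration only.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def candidates(word, lexicon, max_dist=2):
    """Phase 1: lexicon words within max_dist edits of the erroneous word."""
    return [w for w in lexicon if levenshtein(word, w) <= max_dist]

def rank(word, cands, context, cooc, top_k=10):
    """Phase 2 (hypothetical scoring): order candidates by how often they
    co-occur with the context words, then by edit distance, and keep top_k."""
    def score(c):
        return sum(cooc[(ctx, c)] for ctx in context)
    ordered = sorted(cands, key=lambda c: (-score(c), levenshtein(word, c), c))
    return ordered[:top_k]

# Toy data, invented for this example only.
lexicon = ["cat", "car", "cart", "dog"]
cooc = Counter({("drives", "car"): 5, ("drives", "cart"): 1, ("pet", "cat"): 3})

cands = candidates("caat", lexicon)
suggestions = rank("caat", cands, ["she", "drives"], cooc, top_k=2)
print(suggestions)
```

Because the context words may sit anywhere in the sentence, a score of this kind can use distant words and not only the immediately preceding ones, which is the limitation of n-gram models that the abstract points to.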


The International Arab Journal of Information Technology, Vol. 19, No. 6, November 2022

Saida Laaroussi is currently a PhD student in the ES-Lab at Ibn Tofail University in Kenitra, Morocco. She received her engineering degree in Computer Science from the ENSIAS at Mohamed V University in Rabat, Morocco, in 2010. Her main research interests include Machine Learning and Natural Language Processing.

Si Lhoussain Aouragh is a permanent qualified professor in the ENSIAS at Mohamed V University in Rabat. He is president of the Association of Arabic Language Engineering in Morocco and a member of several scientific research associations, research teams, and laboratories in Morocco. His main research interests include Computational Linguistics, Artificial Intelligence, Machine Learning, and Natural Language Processing.

Abdellah Yousfi is a Professor at the Faculty of Law, Economics and Social Sciences of Souissi at Mohamed V University in Rabat. He is a member of the ICES Team in the ENSIAS at Mohamed V University in Rabat, Morocco. His research interests include the creation of corpora for the Arabic language, Arabic speech recognition, Arabic handwriting recognition, and the correction of Arabic spelling errors. He is a reviewer for several journals, such as Journal of King Saud University - Computer and Information Sciences and Egyptian Informatics Journal.

Mohamed Nejja received his PhD in Computer Science and Engineering from the ENSIAS at Mohamed V University in Rabat, Morocco, in 2019. His research interests include Natural Language Processing, Machine Learning, and Artificial Intelligence.

Hicham Gueddah is currently an Associate Professor of Computer Science in the Department of Computer Science, ENS, at Mohammed V University in Rabat. He holds a doctorate in Computer Science from the ENSIAS at Mohamed V University in Rabat. His research covers Natural Language Processing, Data Mining, Machine Learning, and Deep Learning.

Said Ouatik El Alaoui is a Professor of Computer Science in the ENSA, Kenitra, where he is currently the head of the ES-Lab at Ibn Tofail University, Morocco. His research interests include Machine and Deep Learning and their applications, Natural Language Processing, Information Retrieval, Text Summarization, Biomedical Question Answering, Biomedical Information Extraction, Arabic Document Clustering and Categorization, High-dimensional Indexing, and Content-Based Image Retrieval.