GovdeTurk: A Novel Turkish Natural Language Processing Tool for Stemming, Morphological Labelling and Verb Negation

Authors Sait Yucebas, Rabia Tintin

Keywords #Natural language processing #stemming #morphological analysis #Turkish language

Abstract

GovdeTurk is a tool for stemming, morphological labeling and verb negation for the Turkish language. We designed comprehensive finite automata to represent Turkish grammar rules. Based on these automata, GovdeTurk finds the stem of a word by removing its inflectional suffixes with a longest-match strategy. Levenshtein distance is used to correct spelling errors that may occur during suffix removal. Morphological labeling identifies the function of a given token. Nine dictionaries are constructed, one for each word type, and are used in both stemming and morphological labeling. A verb negation module is developed for lexicon-based sentiment analysis. GovdeTurk is tested on a dataset of one million words, and the results are compared with Zemberek and the Turkish Snowball algorithm. In stemming, GovdeTurk reaches an accuracy of 97.3%, whereas its closest competitor, Zemberek, reaches 80%. The morphological labeling accuracy of GovdeTurk is 93.6%. On these results, GovdeTurk outperforms the tools it was compared against.
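The two stemming steps named in the abstract, longest-match suffix removal followed by a Levenshtein-based correction, can be illustrated with a minimal sketch. This is not GovdeTurk's implementation: the suffix list, the stem dictionary, the minimum stem length and the function names below are all hypothetical stand-ins for the tool's finite automata and nine dictionaries, chosen only to show the idea.

```python
# Toy illustration only: SUFFIXES and STEMS are hypothetical stand-ins
# for GovdeTurk's finite automata and dictionaries.
SUFFIXES = ["lar", "ler", "da", "de", "ı", "i"]
STEMS = {"kitap", "ev", "göz"}


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def stem(word: str, max_dist: int = 1, min_len: int = 2) -> str:
    """Strip the longest matching suffix repeatedly, then correct the result."""
    candidate = word
    while candidate not in STEMS:
        matches = [s for s in SUFFIXES
                   if candidate.endswith(s) and len(candidate) - len(s) >= min_len]
        if not matches:
            break
        longest = max(matches, key=len)
        candidate = candidate[:-len(longest)]
    if candidate in STEMS:
        return candidate
    # Assumed correction step: fall back to the nearest dictionary stem,
    # but only if it is within max_dist edits of the stripped form.
    best = min(sorted(STEMS), key=lambda s: levenshtein(candidate, s))
    return best if levenshtein(candidate, best) <= max_dist else candidate


print(stem("evlerde"))  # strips "de", then "ler"
print(stem("kitabı"))   # strips "ı", then corrects "kitab" against the dictionary
```

In the second call the stripped form "kitab" is not a dictionary entry because the final consonant softens before a vowel-initial suffix; the edit-distance lookup maps it back to "kitap", which is the kind of error the abstract says Levenshtein distance is used to repair.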

References

[1] Akın A. and Akın M., “Zemberek, an Open Source Nlp Framework for Turkic Languages,” Structure, vol. 10, pp. 1-5, 2007.

[2] Cilden E., “Stemming Turkish Words Using Snowball,” Retrieved 08.05.2019 from: http://snowball.tartarus.org/algorithms/turkish/stemmer.html, Last Visited, 2006.

[3] Dawson J., “Suffix Removal and Word Conflation,” ALLC Bulletin, vol. 2, no. 3, pp. 33-46, 1974.

[4] Dinçer B. and Karaoğlan B., “Stemming in Agglutinative Languages: A Probabilistic Stemmer for Turkish,” in Proceedings of Computer and Information Sciences-ISCIS 2003, 18th International Symposium, Antalya, pp. 244-251, 2003.

[5] Sever H. and Duran G., “Turkce Govdeleme Algoritmalarinin Analizi,” in Proceedings of Annual Conference of TBD'96, İstanbul, pp. 23-243, 1996.

[6] Eryigit G. and Adali E., “An Affix Stripping Morphological Analyzer for Turkish,” in Proceedings of Artificial Intelligence and Applications, Innsbruck, pp. 299-304, 2004.

[7] Freund G. and Willett P., “Online Identification Of Word Variants and Arbitrary Truncation Searching Using A String Similarity Measure,” Information Technology: Research and Development, vol. 1, no. 3, pp. 177-187, 1982.

[8] Hakkani-Tür D., Saraçlar M., Tür G., Oflazer K., and Yuret D., Turkish Natural Language Processing, Springer, 2018.

[9] Hull D. and Grefenstette G., “A Detailed Analysis of English Stemming Algorithms,” Xerox Research and Technology, vol. 6, pp. 1-16, 1996.

[10] Jivani A., “A Comparative Study of Stemming Algorithms,” International Journal of Computer Applications in Technology, vol. 2, no. 6, pp. 1930-1938, 2011.

[11] Jurafsky D. and Martin J., Speech and Language Processing, Prentice Hall PTR, 2018.

[12] Kısla T. and Karaoglan B., “A Hybrid Statistical Approach to Stemming in Turkish: an Agglutinative Language,” Anadolu University Journal of Science and Technology-A Applied Sciences and Engineering, vol. 17, no. 2, pp. 401-412, 2016.

[13] Khan S., Anwar W., Bajwa W., and Wang X., “Template Based Affix Stemmer for A Morphologically Rich Language,” The International Arab Journal of Information Technology, vol. 12, no. 2, pp. 146-154, 2015.

[14] Koksal A., “Tümüyle Özdevimli Deneysel Bir Belge Dizinleme Ve Erisim Dizgesi: TÜRDER,” in Proceedings of Bilisim 80’ Bildiriler, Ankara, pp. 37-44, 1981.

[15] Krovetz R., “Viewing Morphology As An Inference Process,” in Proceedings of 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, pp. 191-202, 1993.

[16] Levenshtein V., “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966.

[17] Lovins J., “Development of A Stemming Algorithm,” Mechanical Translation and Computational Linguistics, vol. 11, no. 1-2, pp. 22-31, 1968.

[18] Majumder P., Parui S., Mitra M., Kole G., Mitra P., and Datta K., “YASS: Yet Another Suffix Stripper,” ACM Transactions on Information Systems, vol. 25, no. 4, pp. 18, 2007.

[19] Melucci M. and Orio N., “A Novel Method for Stemmer Generation Based on Hidden Markov Models,” in Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, pp. 131-138, 2003.

[20] Nabiyev V., Yapay Zeka Insan Bilgisayar Etkileşimi, Seçkin Yayıncılık, 2010.

[21] Oflazer K. and Saraçlar M., Turkish Natural Language Processing, Springer Link, 2018.

[22] Paice C., “Another Stemmer,” ACM Special Interest Group on Information Retrieval Forum, vol. 24, no. 3, pp. 56-61, 1990.

[23] Porter M., “An Algorithm for Suffix Stripping,” Program: Electronic Library and Information Systems, vol. 14, no. 3, pp. 130-137, 1980.

[24] Solak A. and Can F., “Effects of Stemming on Turkish Text Retrieval,” in Proceedings of the Computer and Information Sciences ISCIS’94, Antalya, pp. 49-56, 1994.

[25] Xu J. and Croft W., “Corpus-Based Stemming Using Co-Occurrence of Word Variants,” ACM Transactions on Information Systems, vol. 16, no. 1, pp. 61-81, 1998.

[26] Yucebas S. and Tintin R., “GovdeTurk: A Turkish Stemming Method,” in Proceedings of International Conference on Computer Science and Engineering UBMK-17, Antalya, pp. 343-347, 2017.