The International Arab Journal of Information Technology (IAJIT)


Performance Evaluation of Keyword Extraction Techniques and Stop Word Lists on Speech-To-

The era of conversational user interfaces, through which humans communicate with computers by voice, has arrived. Natural Language Processing (NLP) techniques must therefore handle not only text but also speech. Keyword extraction is a technique for extracting key phrases from a document; these phrases can summarize the document and be used in text classification. Existing keyword extraction techniques have mostly been applied only to typed text datasets. Because text produced by speech recognition engines is less accurate than typed text, the suitability of keyword extraction for such text is questionable. This paper evaluates the suitability of conventional keyword extraction methods on a speech-to-text corpus. A new audio dataset for keyword extraction is collected using the World Wide Web (WWW) corpus. The performance of Rapid Automatic Keyword Extraction (RAKE) and TextRank is evaluated with different stoplists on both the originally typed corpus and the corresponding Speech-To-Text (STT) corpus obtained from the audio. Precision, recall, and F1 score were used as the evaluation metrics. In the results obtained, TextRank with the FOX stoplist showed the highest performance on both the text and audio corpora, with F1 scores of 16.59% and 14.22%, respectively. Although it lags behind the score on the text corpus, the F1 score of TextRank on the audio corpus is high enough to support its adoption for audio conversation without much concern. However, the absence of punctuation in the STT output lowered the F1 score for all the techniques.
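The evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only the RAKE-style candidate-splitting step (stop words and punctuation act as phrase delimiters, with no phrase scoring), and it assumes exact-match scoring against a reference keyword set. The tiny stoplist, the helper names `candidate_phrases` and `precision_recall_f1`, and the two sample sentences are hypothetical and chosen only to show how missing punctuation in speech-to-text output can merge phrases and lower the F1 score.

```python
import re

# Tiny illustrative stoplist (an assumption; the paper evaluates full stoplists such as FOX).
STOPLIST = {"a", "an", "and", "the", "of", "on", "in", "is", "are", "for", "to", "with"}


def candidate_phrases(text, stoplist=STOPLIST):
    """Split text into RAKE-style candidates: runs of words between stop words or punctuation."""
    phrases = []
    # Punctuation acts as a hard phrase boundary.
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stoplist:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


def precision_recall_f1(extracted, reference):
    """Exact-match precision, recall, and F1 between two keyword collections."""
    extracted, reference = set(extracted), set(reference)
    matched = len(extracted & reference)
    precision = matched / len(extracted) if extracted else 0.0
    recall = matched / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical reference keywords and two versions of the same content.
reference = {"keyword extraction", "speech recognition", "stop word lists"}
typed = "Keyword extraction is useful. Speech recognition, stop word lists."
spoken = "keyword extraction is useful speech recognition stop word lists"

typed_scores = precision_recall_f1(candidate_phrases(typed), reference)
spoken_scores = precision_recall_f1(candidate_phrases(spoken), reference)
print("typed  P/R/F1:", typed_scores)
print("spoken P/R/F1:", spoken_scores)
```

In this toy case the unpunctuated version yields one long merged candidate that matches no reference keyword, so its F1 score is lower than that of the punctuated version. The numbers themselves carry no meaning beyond the illustration.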

The International Arab Journal of Information Technology, Vol. 20, No. 1, January 2023

Blessed Guda received his bachelor's degree in computer engineering from the Federal University of Technology, Minna, in 2021. He made several contributions to the ITU Focus Group (FG) on ML for 5G and Autonomous Networks (AN). He has research interests in AI for NLP, network security, 5G and autonomous networks, and embedded systems. He is a mentor with the WINEST Research group and founder of the AI4Africa Research group. He received the Mentors Encouragement award from the ITU AI/ML in 5G Challenge, 2021. He is currently an AI engineer at Prunny Technologies and also mentors student research projects with ITU FG-AN.

Bello Kontagora Nuhu is a Lecturer in the Department of Computer Engineering at the Federal University of Technology Minna, Nigeria. He obtained his M.Tech in Computer Science and Engineering from the Ladoke Akintola University of Technology Ogbomoso, Nigeria, and his B.Eng in Electrical & Computer Engineering from the Federal University of Technology Minna, Nigeria, in 2010. He is currently a doctoral student at the Ahmadu Bello University, Zaria, Nigeria. His research interests include Artificial and Computational Intelligence, localization in sensor networks, Computer/network security, Internet of Things (IoT) and Software-Defined Networking.

James Agajo received a Bachelor of Engineering (B.Eng) degree in electrical and computer engineering from the Federal University of Technology Minna, a Master of Engineering (M.Eng.) degree in electronics and telecommunication engineering from Nnamdi Azikiwe University, and a PhD in telecommunication and computer engineering from Nnamdi Azikiwe University.
Dr James Agajo is presently an associate professor and the Head of the Department of Computer Engineering at the Federal University of Technology Minna. He is also a visiting professor at I.C.T University U.S.A, Nile University Abuja, Baze University Abuja, Kebbi State University of Science and Technology Kebbi, Nigeria, University of Pretoria, South Africa, and Federal University of Petroleum Resources Effurun, Nigeria.

Ibrahim Aliyu received his PhD in Computer Science and Engineering from Chonnam National University, South Korea, in 2022. He also holds B.Eng and M.Eng degrees in Computer Engineering from the Federal University of Technology, Minna, Nigeria, received in 2014 and 2018, respectively. He is currently a postdoctoral researcher at the Hyper Intelligence Media Network Platform Lab, Department of ICT Convergence System Engineering, Chonnam National University, Gwangju, South Korea. His research focuses on source routing, in-network computing and cloud-based computing for massive metaverse deployment. His other research interests include Federated Learning, data privacy, Network Security and AI for autonomous networks.