Received date April 01, 2024

Accepted date December 04, 2024

Exceeding Manual Labeling: VADER Lexicon as an Accurate Alternative to Automatic Sentiment Classification

Authors Vivine Nurcahyawati, Zuriani Mustaffa, Mohammed Khalaf

Keywords Lexicon-based, classification, customer review, text analysis

Abstract

The number of Internet users worldwide has increased dramatically, resulting in a surge of content uploaded to the Internet, particularly in text form. Global Internet users now exceed 5.16 billion, a penetration rate of 64.4 percent of the world's total population. Although only a small fraction of individuals actively express their opinions online, sentiment analysis aims to categorize this textual information as positive, negative, or neutral. When dealing with unlabeled datasets, the Valence Aware Dictionary and sEntiment Reasoner (VADER) lexicon proves to be an effective tool for extracting sentiment features and assigning labels. This allows machine learning techniques such as Support Vector Machine (SVM), Naive Bayes (NB), and K-Nearest Neighbor (KNN) to be applied directly to classify the datasets. Fuzzy Matching (FM) serves as a dimensionality reduction technique. Experimental results on three datasets from diverse sources show that the combination of FM and SVM yields the highest accuracy. Under K-Fold cross-validation, accuracy on dataset A is 94.69% with manual labeling and 95.92% with VADER labeling. On dataset B, accuracy rises marginally from 96.94% with manual labeling to 97.01% with VADER labeling. On dataset C, manual labeling achieves 95.51% and VADER labeling achieves 96.73%. These results indicate that automated VADER labeling performs at least as well as manual labeling on all three datasets.
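The two preparation steps named in the abstract, lexicon-based labeling and fuzzy matching, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the labeling thresholds or the similarity measure, so the sketch assumes the ±0.05 cut-offs conventionally used with VADER's compound score and uses the standard library's `difflib` for string similarity. The function names `label_from_compound` and `fuzzy_normalize`, the `cutoff` value, and the sample tokens are hypothetical.

```python
from difflib import get_close_matches

# Assumed thresholds: the cut-offs conventionally used with VADER's
# compound score, which ranges from -1 (most negative) to +1 (most positive).
POS_THRESHOLD = 0.05
NEG_THRESHOLD = -0.05


def label_from_compound(compound: float) -> str:
    """Map a VADER-style compound score to a three-class sentiment label."""
    if compound >= POS_THRESHOLD:
        return "positive"
    if compound <= NEG_THRESHOLD:
        return "negative"
    return "neutral"


def fuzzy_normalize(tokens, vocabulary, cutoff=0.8):
    """Replace each token with its closest vocabulary entry.

    A token is replaced only if its similarity ratio to some vocabulary
    entry is at least `cutoff`; otherwise it is kept unchanged. Collapsing
    spelling variants this way shrinks the number of distinct features.
    """
    normalized = []
    for token in tokens:
        matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        normalized.append(matches[0] if matches else token)
    return normalized


# Hypothetical review tokens containing misspellings.
tokens = ["delivery", "delivry", "excellent", "excelent", "box"]
vocabulary = ["delivery", "excellent"]
cleaned = fuzzy_normalize(tokens, vocabulary)

distinct_before = len(set(tokens))
distinct_after = len(set(cleaned))
```

In practice the compound score would come from a VADER implementation, for example `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in the `vaderSentiment` package. The resulting labels and normalized tokens would then feed a vectorizer and a classifier such as SVM, evaluated with K-Fold cross-validation.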
