The International Arab Journal of Information Technology (IAJIT)


The Impact of Natural Language Preprocessing on Big Data Sentiment Analysis

Sentiment analysis determines people's opinions, sentiments, and emotions by classifying their written text into positive or negative polarity. It is important for many critical applications, such as decision making and product evaluation. Social networks are one of the main sources of data for sentiment analysis. However, the huge volume of data they produce requires efficient and scalable analysis techniques. MapReduce has proved its efficiency and scalability in handling big data, and has thus attracted many researchers to use it as a processing framework. In this paper, a sentiment analysis method for big data is studied. The method uses the Naïve Bayes algorithm to classify texts into positive and negative polarity. Several linguistic and Natural Language Processing (NLP) preprocessing techniques are applied to a Twitter data set to study their impact on the accuracy of big data classification. The performed experiments indicate that the accuracy of the sentiment analysis is enhanced by 5%, yielding an accuracy of 73% on the Stanford Sentiment data set.
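As a rough illustration of the kind of pipeline the abstract describes, the following minimal sketch applies a few simple preprocessing steps to tweets and trains a multinomial Naïve Bayes classifier with add-one smoothing, with the counting arranged as a map step and a reduce step. It is not the authors' implementation: it runs in a single process rather than on a MapReduce cluster, and the stop-word list, the toy training tweets, and all function names are hypothetical, chosen only for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "this", "i", "it", "to", "and"}


def preprocess(text):
    """Lowercase, drop URLs and @mentions, tokenize, remove stop words."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def map_phase(label, text):
    """Map step: emit one ((label, token), 1) pair per token."""
    return [((label, token), 1) for token in preprocess(text)]


def reduce_phase(pairs):
    """Reduce step: sum the counts for each (label, token) key."""
    counts = defaultdict(Counter)
    for (label, token), n in pairs:
        counts[label][token] += n
    return counts


def train(samples):
    """Build per-class token counts, per-class document counts, and the vocabulary."""
    pairs = [p for label, text in samples for p in map_phase(label, text)]
    token_counts = reduce_phase(pairs)
    doc_counts = Counter(label for label, _ in samples)
    vocab = {t for c in token_counts.values() for t in c}
    return token_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior."""
    token_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = preprocess(text)
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for token in tokens:
            score += math.log((token_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical tweets).
samples = [
    ("positive", "I love this phone, it is great @store http://t.co/x"),
    ("positive", "Great day and a happy mood"),
    ("negative", "I hate this terrible service"),
    ("negative", "This is a sad and awful day"),
]
model = train(samples)
print(classify("what a great happy phone", model))
print(classify("awful terrible day", model))
```

Because the map step emits independent key-value pairs and the reduce step only sums them, the counting could in principle be distributed across workers; changing what `preprocess` does is the point at which different preprocessing choices would affect the resulting accuracy.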

The International Arab Journal of Information Technology, Vol. 16, No. 3A, Special Issue 2019

Mariam Khader is a PhD candidate in computer science at Princess Sumaya University for Technology (PSUT), Amman, Jordan. She received her BSc degree in computer networking systems from the World Islamic Science & Education University (WISE), Amman, Jordan, in 2012, and her MSc degree in IT security and digital criminology from PSUT in 2014. Between 2012 and 2015, she was a teaching assistant and then a lecturer in the network department at WISE University. Her interests include digital forensics, network security, and big data analytics.

Arafat Awajan is a Full Professor at Princess Sumaya University for Technology (PSUT). He received his PhD degree in Computer Science from the University of Franche-Comte, France, in 1987. He has held various administrative and academic positions at the Royal Scientific Society and Princess Sumaya University for Technology: Head of the Department of Computer Science (2000-2003); Head of the Department of Computer Graphics and Animation (2005-2006); Dean of the King Hussein School for Information Technology (2004-2007); Director of the Information Technology Center, RSS (2008-2010); Dean of Student Affairs (2011-2014); and Dean of the King Hussein School for Computing Sciences (2014-2017). He is currently the vice president of the university (PSUT). His research interests include Natural Language Processing, Arabic Text Mining, and Digital Image Processing.

Ghazi Al-Naymat received his PhD degree in May 2009 from the School of Information Technologies at The University of Sydney, Australia. He is an Associate Professor in the Department of Computer Science, King Hussein School of Computing Sciences, at Princess Sumaya University for Technology (PSUT), and is currently the chair of the computer science department. His research interests include data mining and machine learning, big data, and data science.