A Novel Approach to Maximize G-mean in Nonstationary Data with Recurrent Imbalance Shifts

Authors Radhika Kulkarni, Subramanion Revathy, Suhas Patil

Keywords Cost-sensitive algorithms, data stream classification, imbalanced data, online learning, population shift, skewed data stream

Abstract One of the notable difficulties in classifying nonstationary data is handling class imbalance. Imbalanced data contain many more samples of one class than of the other, which biases a classifier's accuracy in favour of the majority class. Streaming data may have inherent imbalance resulting from the nature of the data space, or extrinsic imbalance due to its nonstationary environment. In streaming data, time-varying class priors may lead to a shift in the imbalance ratio. Researchers have studied ensemble learning, online learning, class imbalance, and cost-sensitive algorithms independently, but have rarely addressed all of these issues jointly to deal with imbalance shift in nonstationary data. This paper presents a novel approach that combines these aspects to maximize G-mean in nonstationary data with Recurrent Imbalance Shifts (RIS). It modifies two state-of-the-art boosting algorithms: 1) AdaC2, to obtain G-mean based Online AdaC2 for Recurrent Imbalance Shifts (GOA-RIS) and Ageing and G-mean based Online AdaC2 for Recurrent Imbalance Shifts (AGOA-RIS); and 2) CSB2, to obtain G-mean based Online CSB2 for Recurrent Imbalance Shifts (GOC-RIS) and Ageing and G-mean based Online CSB2 for Recurrent Imbalance Shifts (AGOC-RIS). The study empirically and statistically analyses the performance of the proposed algorithms against the Online AdaC2 (OA) and Online CSB2 (OC) algorithms on benchmark datasets. The results show that the proposed algorithms outperform OA and OC overall.
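To illustrate why G-mean rather than plain accuracy is the target under class imbalance, the following minimal sketch computes both for a two-class problem. It uses the standard definition of G-mean as the geometric mean of the recall on each class. The function names and the toy labels are assumptions made for illustration only; this is not the paper's algorithm or data.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (positive recall) and specificity (negative recall)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy data: 95 majority samples (label 0), 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class.
always_majority = [0] * 100

# A classifier that raises 10 false alarms but finds 4 of the 5 minority samples.
minority_aware = [0] * 85 + [1] * 10 + [1] * 4 + [0] * 1

for name, y_pred in [("always_majority", always_majority),
                     ("minority_aware", minority_aware)]:
    print(name, "accuracy:", round(accuracy(y_true, y_pred), 3),
          "G-mean:", round(g_mean(y_true, y_pred), 3))
```

In this toy case the majority-only classifier has the higher accuracy but a G-mean of zero, because it never recognises the minority class. The second classifier has lower accuracy but a much higher G-mean.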

The International Arab Journal of Information Technology, Vol. 18, No. 1, January 2021

Radhika Kulkarni received the M.Tech. degree in Computer Science and Technology from the Shivaji University, Kolhapur, India in 2011. She is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, India. She is working as an Assistant Professor in the Department of Information Technology, Pune Institute of Computer Technology, Pune, India. Her current research interests include Machine Learning, Data Analytics and Big Data.

Subramanion Revathy (S. Revathy) is presently working as an Associate Professor in the Department of Information Technology, Sathyabama Institute of Science and Technology, Chennai, India. Her research interests include Machine Learning, Data Analytics and Big Data. She has published over twenty papers in refereed journals.

Suhas Patil received the Ph.D. degree in Computer Science and Engineering from Bharati Vidyapeeth Deemed University, Pune, India in 2009. He is currently working as a Professor in the Department of Computer Science and Engineering, Bharati Vidyapeeth Deemed University College of Engineering, Pune, India. His research areas include Machine Learning, Expert Systems, Computer Networks, Operating Systems, and System Software. He has published over 65 papers in international journals, 36 papers in international conferences and 42 papers in national conferences.