The International Arab Journal of Information Technology (IAJIT)


A Novel Feature Selection Method Based on Maximum Likelihood Logistic Regression for

The most frequently used machine learning feature ranking approaches fail to produce an optimal feature subset for accurately predicting defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), RelieF (RF) and Symmetric Uncertainty (SU) perform relatively poorly at prediction, even after the class distribution in the training data is balanced. In this study, we propose a novel FS method based on Maximum Likelihood Logistic Regression (MLLR). We apply this method to six software defect datasets, in both their sampled and unsampled forms, to select useful features for classification in the context of Software Defect Prediction (SDP). The Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied to the feature subsets selected from the sampled and unsampled datasets. The performance of the models, measured using the Area Under the Receiver Operating Characteristic Curve (AUC), is compared across all FS methods considered. The Analysis Of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the other FS techniques, on both sampled and unsampled data. The results confirm that MLLR can be useful in selecting an optimal feature subset for more accurate prediction of defective modules in the software development process.
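To make the idea concrete, the following is a minimal sketch of one plausible reading of logistic-regression-based feature ranking; it is not the procedure evaluated in this paper. It assumes that features are ranked by the absolute value of their standardized coefficients after a maximum likelihood fit, which the abstract does not specify. The data are synthetic (hypothetical features 0 and 2 are informative, 1 and 3 are noise), no class balancing is applied, and a logistic model stands in for the SVM and RaF classifiers so that the example needs only the Python standard library.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_scaler(X):
    """Column means and standard deviations, estimated on training data only."""
    n, d = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in X) / n) or 1.0
            for j in range(d)]
    return means, stds


def scale(X, means, stds):
    """Standardize rows so that coefficient magnitudes are comparable."""
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in X]


def fit_logistic_mle(X, y, lr=0.5, epochs=300):
    """Maximize the Bernoulli log-likelihood by batch gradient ascent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = target - sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        b += lr * grad_b / n
        w = [wj + lr * g / n for wj, g in zip(w, grad_w)]
    return w, b


def rank_features(w):
    """Feature indices ordered by decreasing absolute coefficient (assumed criterion)."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic, imbalanced "defect" data: about 25% of modules are defective.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    defective = 1 if rng.random() < 0.25 else 0
    X.append([rng.gauss(2.0 * defective, 1.0), rng.gauss(0.0, 1.0),
              rng.gauss(1.0 * defective, 1.0), rng.gauss(0.0, 1.0)])
    y.append(defective)

# Hold out the last 100 modules as out-of-sample data.
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]
means, stds = fit_scaler(X_train)
X_train, X_test = scale(X_train, means, stds), scale(X_test, means, stds)

# Rank all features, then keep the two highest-ranked ones.
w, b = fit_logistic_mle(X_train, y_train)
ranking = rank_features(w)
top_k = ranking[:2]


def subset(rows):
    return [[r[j] for j in top_k] for r in rows]


# Refit on the selected subset and score the held-out modules.
w_sub, b_sub = fit_logistic_mle(subset(X_train), y_train)
scores = [sigmoid(b_sub + sum(wj * xj for wj, xj in zip(w_sub, r)))
          for r in subset(X_test)]
print("ranking:", ranking, "held-out AUC:", round(auc(scores, y_test), 3))
```

Standardizing before the fit matters in this sketch because raw coefficient magnitudes depend on each feature's scale; without it, the ranking would partly reflect measurement units rather than predictive strength.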

The International Arab Journal of Information Technology, Vol. 17, No. 5, September 2020
Kamal Bashir is currently a Ph.D. candidate at the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He received his MSc. degree in Software Engineering from Khartoum University, Sudan, in 2013 and his BSc. degree in Computer Science from Karary University, Sudan, in 2009. His research interests include Data Mining, Machine Learning, and Software Quality Assessment.

Tianrui Li received his B.S., M.S. and Ph.D. degrees from Southwest Jiaotong University, China, in 1992, 1995 and 2002, respectively. He was a Post-Doctoral Researcher at the Belgian Nuclear Research Centre (SCK • CEN), Belgium, from 2005 to 2006, and a visiting professor at Hasselt University, Belgium, in 2008, the University of Technology Sydney, Australia, in 2009, and the University of Regina, Canada, in 2014. He is presently a Professor and the Director of the Key Lab of Cloud Computing and Intelligent Technique of Sichuan Province, Southwest Jiaotong University, China. Since 2000, he has co-edited 6 books, 10 special issues of international journals and 18 proceedings, received 6 Chinese invention patents, and published over 360 research papers.

Mahama Yahaya is currently a Ph.D. candidate in Transport and Logistics Engineering at Southwest Jiaotong University, Chengdu, China. He received his MSc. degree in Traffic Engineering from Southwest Jiaotong University, China, in 2018 and his BSc. degree in Geodetic Engineering from Kwame Nkrumah University of Science and Technology, Ghana, in 2007. His research interests include Machine Learning, Road Construction Project Management, and Road Traffic Survey and Data Analysis.