Assessing the Stability and Selection Performance of Feature Selection Methods Under Different Data Complexity

Authors Omaimah Al Hosni, Andrew Starkey

Keywords #Stability of feature selection #class overlapping #data characteristics #complex data

Abstract

This study investigates the stability and selection accuracy of feature selection methods under different levels of data complexity. The research community has contributed substantially to understanding how complex data characteristics, such as overlapping classes or non-linear decision boundaries, affect the performance of classification algorithms; however, relatively few studies have examined the stability and selection accuracy of feature selection methods on data with these characteristics. The study also examines how class overlap interacts with other data challenges, namely small sample size, high dimensionality associated with irrelevant features, and imbalanced classes, to provide insight into the root causes of feature selection methods misidentifying the relevant features under different real-world data challenges. The analysis is extended to real-world data to guide practitioners and researchers in choosing the feature selection methods most appropriate for a particular dataset. Our results indicate that feature selection techniques applied to datasets with different characteristics may generate different feature subsets under variations in the training data, and that, among the data challenges investigated, small sample size and overlapping classes have the greatest impact on stability and selection accuracy. We also survey the current state of research on feature selection stability to highlight areas that require more attention from other researchers.
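To make the two quantities in the abstract concrete, the following is a minimal sketch of how stability and selection accuracy could be measured on synthetic data. It is not the authors' experimental protocol: the data generator, the mean-difference filter, the use of average pairwise Jaccard similarity as the stability score, and every function and parameter name (`make_data`, `select_top_k`, `evaluate`, `separation`, and so on) are assumptions chosen for illustration.

```python
import random
from itertools import combinations
from statistics import mean


def make_data(n_samples, n_relevant, n_irrelevant, separation, rng):
    """Hypothetical two-class generator: relevant features are shifted by
    `separation` between classes (small values mean heavy class overlap);
    irrelevant features are pure Gaussian noise."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the classes are balanced
        row = [rng.gauss(label * separation, 1.0) for _ in range(n_relevant)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_irrelevant)]
        X.append(row)
        y.append(label)
    return X, y


def select_top_k(X, y, k):
    """Toy filter method: rank features by absolute difference of class means."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return frozenset(ranked[:k])


def bootstrap(X, y, rng):
    """Stratified bootstrap: resample with replacement within each class,
    so every resample contains both classes."""
    idx = []
    for label in (0, 1):
        members = [i for i, v in enumerate(y) if v == label]
        idx += rng.choices(members, k=len(members))
    return [X[i] for i in idx], [y[i] for i in idx]


def jaccard(a, b):
    """Jaccard similarity between two non-empty feature subsets."""
    return len(a & b) / len(a | b)


def evaluate(n_samples, separation, n_relevant=5, n_irrelevant=45,
             n_runs=20, seed=0):
    """Return (stability, selection accuracy) over bootstrap resamples.

    Stability: mean pairwise Jaccard similarity of the selected subsets.
    Selection accuracy: mean fraction of truly relevant features selected
    (known here only because the data are synthetic)."""
    rng = random.Random(seed)
    X, y = make_data(n_samples, n_relevant, n_irrelevant, separation, rng)
    subsets = [select_top_k(*bootstrap(X, y, rng), k=n_relevant)
               for _ in range(n_runs)]
    stability = mean(jaccard(a, b) for a, b in combinations(subsets, 2))
    relevant = set(range(n_relevant))
    accuracy = mean(len(s & relevant) / n_relevant for s in subsets)
    return stability, accuracy


if __name__ == "__main__":
    # Compare an easy setting with a small, heavily overlapped one.
    for n, sep in [(400, 5.0), (20, 0.3)]:
        stab, acc = evaluate(n_samples=n, separation=sep)
        print(f"n={n}, separation={sep}: stability={stab:.2f}, accuracy={acc:.2f}")
```

Varying `n_samples`, `separation`, and `n_irrelevant` in this sketch mirrors, in a very simplified way, the small-sample, class-overlap, and irrelevant-feature challenges the abstract names. Class imbalance is not modelled here.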

The International Arab Journal of Information Technology, Vol. 19, No. 3A, Special Issue 2022