A Method of Extracting Malware Features Based on Gini Impurity Increment and Improved TF-IDF
In recent years, the quantity and variety of malware have grown explosively, which poses many challenges for identification and detection. To improve the efficiency of malicious code identification, a malicious code feature representation method based on feature dimension reduction is proposed. By fusing the Gini impurity increment with the Improved Term Frequency-Inverse Document Frequency (ITF-IDF) algorithm, the ΔGini-ITFIDF method is presented, which obtains more valuable assembly instruction features for family detection. ΔGini-ITFIDF first standardizes the assembly instructions of the PE disassembly files, then measures two indicators for each malicious code assembly instruction feature, the expected error rate increment and the weight, and selects the features that are more valuable for identifying malicious code. The experimental results show that the classification accuracy of the ITF-IDF algorithm is significantly improved compared with the TF-IDF algorithm. At the same time, ΔGini-ITFIDF can effectively improve the classification performance.
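To make the two indicators concrete, the following is a minimal Python sketch, not the method as published. It assumes each sample is a list of already standardized opcodes with a family label. It uses the textbook Gini impurity decrease for a split on opcode presence and the standard TF-IDF weight, because the ITF-IDF formula itself is not given in this section. The function names, the toy data, and the product used to combine the two indicators in `rank_features` are all illustrative assumptions.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_increment(docs, labels, term):
    """Impurity decrease when samples are split on the presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    weighted = (len(with_term) / n) * gini(with_term) \
             + (len(without) / n) * gini(without)
    return gini(labels) - weighted

def tf_idf(docs, term):
    """Mean standard TF-IDF of `term` (assumption: not the paper's ITF-IDF)."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)
    mean_tf = sum(d.count(term) / len(d) for d in docs) / len(docs)
    return mean_tf * idf

def rank_features(docs, labels, k):
    """Top-k opcodes by the product of both indicators (hypothetical fusion rule)."""
    vocab = sorted({t for d in docs for t in d})
    scores = {t: gini_increment(docs, labels, t) * tf_idf(docs, t) for t in vocab}
    return sorted(vocab, key=lambda t: scores[t], reverse=True)[:k]

# Toy data: four disassembled samples from two hypothetical families.
docs = [["mov", "push", "call"],
        ["mov", "xor", "xor"],
        ["mov", "push", "ret"],
        ["mov", "xor", "jmp"]]
labels = ["A", "B", "A", "B"]

print(rank_features(docs, labels, 2))
```

In this toy data, `mov` appears in every sample, so both its impurity decrease and its IDF are zero and it is never selected, whereas `xor` and `push` each separate the two families perfectly and rank first.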