The International Arab Journal of Information Technology (IAJIT)


XAI-PDF: A Robust Framework for Malicious PDF Detection Leveraging SHAP-Based Feature Engineering

With the increasing number of malicious PDF files used in cyberattacks, efficient and accurate classifiers are essential for detecting and preventing these threats. Machine Learning (ML) models have been used successfully to detect malicious PDF files. This paper presents XAI-PDF, an efficient system for malicious PDF detection designed to enhance accuracy and minimize decision-making time on a modern dataset, Evasive-PDFMal2022. The proposed method optimizes malicious PDF classifier performance through feature engineering guided by SHapley Additive exPlanations (SHAP). Specifically, model development comprises four phases: data preparation, model building, model explainability, and feature derivation. Using the interpretability of SHAP values, crucial features are identified and new ones are generated, resulting in an improved classification model that demonstrates how interpretable AI techniques can enhance model performance. Various interpretable ML models were implemented, with the Light Gradient Boosting Machine (LGBM) outperforming the other classifiers. An Explainable Artificial Intelligence (XAI) global surrogate model generated explanations for the LGBM predictions. Experimental comparisons of XAI-PDF with baseline methods showed higher accuracy, precision, and F1-score (99.9%, 100%, and 99.89%, respectively) with minimal False Positive (FP) and False Negative (FN) rates (0.000 and 0.002, respectively). Additionally, XAI-PDF requires only 1.36 milliseconds per record for predictions and demonstrates increased resilience in detecting evasive malicious PDF files compared to state-of-the-art methods.
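
The idea of ranking features by Shapley attributions and then deriving a new feature from the top-ranked ones can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, the `toy_score` function standing in for a trained classifier, the all-zero baseline, and the product rule used to derive the new feature are all assumptions made for illustration. The sketch computes exact Shapley values by enumerating feature coalitions, which is only practical for a handful of features; SHAP explainers for tree models rely on faster algorithms.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names; the dataset's real columns are not listed here.
FEATURES = ["js_count", "obj_count", "page_count"]


def toy_score(x):
    """Stand-in for a trained classifier's output (an assumption, not LGBM)."""
    return (0.6 * x["js_count"]
            + 0.3 * x["obj_count"]
            + 0.1 * x["js_count"] * x["obj_count"])


def coalition_value(x, baseline, subset):
    """Score with features in `subset` taken from x and the rest from baseline."""
    mixed = {f: (x[f] if f in subset else baseline[f]) for f in FEATURES}
    return toy_score(mixed)


def shapley_values(x, baseline):
    """Exact Shapley value of every feature for one sample (exponential cost)."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                gain = (coalition_value(x, baseline, s | {f})
                        - coalition_value(x, baseline, s))
                total += weight * gain
        phi[f] = total
    return phi


def rank_features(samples, baseline):
    """Order features by mean absolute Shapley value across samples."""
    totals = {f: 0.0 for f in FEATURES}
    for x in samples:
        for f, value in shapley_values(x, baseline).items():
            totals[f] += abs(value)
    return sorted(FEATURES, key=lambda f: totals[f] / len(samples), reverse=True)


def add_derived_feature(x, top_two):
    """Hypothetical derivation rule: product of the two top-ranked features."""
    a, b = top_two
    derived = dict(x)
    derived[f"{a}_x_{b}"] = x[a] * x[b]
    return derived


BASELINE = {f: 0.0 for f in FEATURES}
SAMPLES = [
    {"js_count": 2.0, "obj_count": 3.0, "page_count": 5.0},
    {"js_count": 0.0, "obj_count": 1.0, "page_count": 9.0},
    {"js_count": 4.0, "obj_count": 0.0, "page_count": 1.0},
]

phi_first = shapley_values(SAMPLES[0], BASELINE)
ranking = rank_features(SAMPLES, BASELINE)
enriched = add_derived_feature(SAMPLES[0], ranking[:2])
```

In this toy setting the attributions of a sample sum to the difference between its score and the baseline score, which is the additivity property that makes SHAP values usable as a global ranking signal. A real pipeline would retrain the classifier on the enriched feature set and compare metrics before and after.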
