Highly Accurate Spam Detection with the Help of Feature Selection and Data Transformation
The volume of spam is growing rapidly along with the popularity of email, which has created a need for spam filtering. To date, many knowledge-based, learning-based, and clustering-based methods have been developed for this purpose. This study targets machine-learning-based spam detection using the C4.5, ID3, RndTree, C-Support Vector Classification (C-SVC), and Naïve Bayes algorithms. In addition, feature selection and data transformation methods were applied to improve detection performance. Experiments were performed on the Spambase dataset from the UC Irvine (UCI) Machine Learning Repository, and the results were compared in terms of accuracy, Receiver Operating Characteristic (ROC) analysis, and classification speed. In the accuracy comparison, the C-SVC algorithm achieved the highest accuracy, 93.13%, followed by the RndTree algorithm. In the ROC analysis, the RndTree algorithm achieved the best Area Under the Curve (AUC) value, 0.999, and the C4.5 algorithm gave the second-best result. In terms of classification speed, the Naïve Bayes and RndTree algorithms were the fastest. The experiments showed that feature selection and data transformation increased spam detection success. The data transformation that improved classification success the most was the binary transformation, and the most effective feature selection method was forward selection.
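As a rough illustration of how a binary transformation and forward selection can be combined with a Naïve Bayes classifier, the following minimal sketch may help. It rests on several assumptions that are not taken from the study: the binary transformation is taken to map each frequency to 1 if it is nonzero and to 0 otherwise, forward selection is implemented as a greedy search driven by validation accuracy, and the data is a small synthetic stand-in (`make_toy_data`, a hypothetical helper) rather than the Spambase dataset. It does not reproduce the reported results.

```python
import math
import random


def binarize(rows):
    """Binary transformation (assumed form): 1 if a value is nonzero, else 0."""
    return [[1 if v > 0 else 0 for v in row] for row in rows]


def train_bernoulli_nb(X, y, features):
    """Fit a Bernoulli Naive Bayes model restricted to the given feature indices."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + 1) / (len(X) + 2)
        # Laplace-smoothed estimate of P(feature = 1 | class)
        probs = {f: (sum(r[f] for r in rows) + 1) / (len(rows) + 2) for f in features}
        model[c] = (prior, probs)
    return model


def predict(model, x):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for f, p in probs.items():
            score += math.log(p if x[f] else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get)


def accuracy(model, X, y):
    return sum(predict(model, x) == label for x, label in zip(X, y)) / len(y)


def forward_selection(X_tr, y_tr, X_va, y_va, n_features):
    """Greedily add the feature that most improves validation accuracy."""
    selected, best_acc = [], 0.0
    while len(selected) < n_features:
        scored = []
        for f in range(n_features):
            if f not in selected:
                model = train_bernoulli_nb(X_tr, y_tr, selected + [f])
                scored.append((accuracy(model, X_va, y_va), f))
        acc, f = max(scored)
        if acc <= best_acc:  # stop when no candidate improves the score
            break
        selected.append(f)
        best_acc = acc
    return selected, best_acc


def make_toy_data(n, rng):
    """Hypothetical stand-in data: 2 informative and 4 noise frequency features."""
    present_prob = {1: [0.8, 0.7], 0: [0.1, 0.2]}  # P(word appears | class)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.4 else 0  # 1 = spam, 0 = ham
        probs = present_prob[label] + [0.5] * 4
        X.append([rng.uniform(0.1, 5.0) if rng.random() < p else 0.0 for p in probs])
        y.append(label)
    return X, y


rng = random.Random(0)
X, y = make_toy_data(600, rng)
Xb = binarize(X)
X_tr, y_tr, X_va, y_va = Xb[:400], y[:400], Xb[400:], y[400:]
selected, best_acc = forward_selection(X_tr, y_tr, X_va, y_va, n_features=6)
print("selected features:", selected, "validation accuracy:", round(best_acc, 3))
```

On real data, the same steps would be applied to the 57 Spambase attributes, with a held-out or cross-validated split deciding which features to keep.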
The International Arab Journal of Information Technology, Vol. 20, No. 1, January 2023