The International Arab Journal of Information Technology (IAJIT), Vol. 15, No. 3A, Special Issue 2018


Hybrid Support Vector Machine based Feature Selection Method for Text Classification

Automatic text classification is an effective way to organize the growing amount of online textual content. However, high dimensionality remains a considerable impediment in the text classification field: although many statistical methods are available to address it, none has proved effective enough. This paper proposes a machine learning based feature ranking and selection method named Support Vector Machine based Feature Ranking Method (SVM-FRM). The proposed method uses the Support Vector Machine (SVM) learning algorithm to weight and select the significant features in order to obtain better classification performance. Hybridization techniques are then applied to enhance the performance of SVM-FRM in some experimental situations. SVM-FRM and its enhancements are tested on three public text classification datasets, and the results are compared with those of statistical feature selection methods currently used for this purpose. The evaluation shows that SVM-FRM achieves higher F-measure and accuracy on balanced datasets. Moreover, the proposed hybridization techniques yield a noticeable performance improvement on an unbalanced dataset.
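
The abstract does not spell out how SVM-FRM computes its feature weights, so the following is only a minimal sketch of the general idea it names: train a linear SVM, treat the absolute value of each learned weight as that feature's importance, and keep the top-ranked features. The toy documents, the function names (`build_vocab`, `train_linear_svm`, `rank_features`), the hyperparameters, and the plain subgradient-descent trainer are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def build_vocab(docs):
    """Return a sorted vocabulary and term-count vectors for whitespace-tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(doc.split()).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vocab, vectors


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss (no bias term).

    This is a stand-in trainer chosen to keep the sketch dependency-free;
    any linear SVM solver that exposes a weight vector would serve.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the regularization term
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # hinge loss is active for this document
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def rank_features(vocab, w):
    """Sort (term, weight) pairs by absolute weight, largest first."""
    return sorted(zip(vocab, w), key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical two-class toy corpus: +1 = sports, -1 = finance.
docs = [
    "goal match team the",
    "team goal win the",
    "stock market price the",
    "market stock trade the",
]
labels = [1, 1, -1, -1]

vocab, X = build_vocab(docs)
weights = train_linear_svm(X, labels)
ranking = rank_features(vocab, weights)

# Feature selection step: keep only the k highest-ranked terms.
top_k = [term for term, _ in ranking[:4]]
print(top_k)
```

In this toy setting, terms that occur in every document of one class and never in the other end up at the top of the ranking, while the term shared equally by both classes ("the") carries essentially no weight and falls to the bottom. A real experiment would replace the toy trainer with an established SVM library and the term counts with a proper term-weighting scheme.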

Thabit Sabbah received his Bachelor of Computer Science (BSc CS) and Master of Computer Science (MSc CS) from Al Quds University, Jerusalem, Palestine, and his Doctor of Philosophy (PhD) in Computer Science from Universiti Teknologi Malaysia (UTM), Malaysia, in 1998, 2009, and 2015, respectively. His research interests focus mainly on Data Mining, Text Mining and Classification, Information Retrieval, Machine Learning, and Artificial Intelligence. He has broad experience in administrative work, teaching, and research. Over the past 20 years he has worked in many administrative and academic positions. Currently, he is a Faculty Member in the College of Technology and Applied Sciences at Al Quds Open University, Palestine. Dr. Sabbah has received many academic and research awards. He has published a number of articles in highly ranked international journals and many other research papers in international conferences and book chapters, and he has been a reviewer for various international journals and conferences.
Mosab Ayyash received his Bachelor of Computer Science (BSc CS) from Al Quds University, Jerusalem, Palestine, in 2003, and his Master's degree (MSc) in Scientific Computing from Birzeit University in 2007. Currently, he is a Lecturer and Faculty Member in the Computer Information Systems department of the College of Technology and Applied Sciences at Al Quds Open University (QOU). His research interests focus on Database Systems, Data Mining, Project Management, and Data Analysis.
Mahmood Ashraf received his Bachelor of Computer Science (BSc CS), Master of Computer Science (MSc CS), and second Master of Computer Science (MS CS) from Islamabad, Pakistan, and his Doctor of Philosophy (PhD) in Computer Science from Universiti Teknologi Malaysia (UTM), Johor Bahru, Malaysia, in 1999, 2002, 2008, and 2014, respectively. His areas of interest are Human-Computer Interaction, Physintuitive Systems, Smart Environment, Text Classification, Machine Learning, Artificial Intelligence, and Intelligent User Interfaces. He was the administrative, academic, and research head (Campus In-Charge) of the Islamabad Campus of the Federal Urdu University of Arts, Science and Technology (FUUAST) from 2017 to 2018. Dr. Mahmood Ashraf has published a number of research papers in national and international conferences, book chapters, and international journals. He is a Higher Education Commission (HEC) recognized MS/PhD supervisor. He has been a reviewer for various international conferences and an international journal.