The International Arab Journal of Information Technology (IAJIT)



Feature Selection Method Based On Statistics of Compound Words for Arabic Text Classification

One of the main problems of text classification is the high dimensionality of the feature space. Feature selection methods are normally used to reduce the dimensionality of datasets in order to improve classification performance, reduce processing time, or both. To improve the performance of text classification, this paper presents a feature selection algorithm, based on terminology extracted from the statistics of compound words, that reduces the high dimensionality of the feature space. The proposed method is evaluated both as a standalone method and in combination with other feature selection methods (a two-stage method). Its performance is compared with that of six well-known feature selection methods: Information Gain, Chi-Square, Gini Index, Support Vector Machine-Based, Principal Component Analysis and Symmetric Uncertainty. A wide range of comparative experiments was conducted on three standard Arabic datasets with three classification algorithms. The experimental results show the superiority of the proposed method in both cases, standalone and two-stage. The proposed method outperforms traditional approaches in classification accuracy, with a 6-10% gain in macro-average F1.
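As a brief illustration of the evaluation measure named above, the following is a minimal sketch of macro-averaged F1: the F1 score is computed separately for each class and the per-class scores are then averaged with equal weight. The function name `macro_f1` and the toy labels are assumptions made for this example; they are not taken from the paper's experiments.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted wrongly
            fn[t] += 1   # class t was missed
    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for a three-class toy problem (not data from the paper).
y_true = ["sport", "sport", "economy", "economy", "culture", "culture"]
y_pred = ["sport", "economy", "economy", "economy", "culture", "sport"]
print(round(macro_f1(y_true, y_pred), 3))
```

Because every class contributes equally regardless of its size, macro-averaging is sensitive to performance on small categories, which is one reason it is commonly reported for multi-class text classification.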


[1] Aghdam M., Ghasem-Aghaee N., and Basiri M., “Text Feature Selection using Ant Colony Optimization,” Expert Systems with Applications, vol. 36, no. 3, pp. 6843-6853, 2009.

[2] Alhutaish R. and Omar N., “Arabic Text Classification using K-Nearest Neighbour Algorithm,” The International Arab Journal of Information Technology, vol. 12, no. 2, pp. 190-195, 2015.

[3] Baccianella S., Esuli A., and Sebastiani F., “Feature Selection for Ordinal Text Classification,” Neural Computation, vol. 26, no. 3, pp. 557-591, 2014.

[4] Bespalov D., Bai B., Qi Y., and Shokoufandeh A., “Sentiment Classification based on Supervised Latent N-Gram Analysis,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, pp. 375-382, 2011.

[5] Chen Y., Sun Y., and Han B., “Improving Classification of Protein Interaction Articles using Context Similarity-based Feature Selection,” BioMed Research International, 2015.

[6] D’Orazio V., Landis S., Palmer G., and Schrodt P., “Separating the Wheat from the Chaff: Applications of Automated Document Classification using Support Vector Machines,” Political Analysis, vol. 22, no. 2, pp. 224-242, 2014.

The International Arab Journal of Information Technology, Vol. 16, No. 2, March 2019

[7] Dai J. and Xu Q., “Attribute Selection based on Information Gain Ratio in Fuzzy Rough Set Theory with Application to Tumor Classification,” Applied Soft Computing, vol. 13, no. 1, pp. 211-221, 2012.

[8] Dai Y. and Sun H., “The Naive Bayes Text Classification Algorithm based on Rough Set in the Cloud Platform,” Journal of Chemical and Pharmaceutical Research, vol. 6, no. 7, pp. 1636-1643, 2014.

[9] De Stefano C., Fontanella F., Marrocco C., and Di Freca A., “A GA-based Feature Selection Approach with an Application to Handwritten Character Recognition,” Pattern Recognition Letters, vol. 35, pp. 130-141, 2014.

[10] Dias G. and Kaalep H., “Automatic Extraction of Multiword Units for Estonian: Phrasal Verbs,” Languages in Development, vol. 41, pp. 81-89, 2003.

[11] Figueiredo F., Rocha L., Couto T., Salles T., Gonçalves M., and Meira W., “Word Co-Occurrence Features for Text Classification,” Information Systems, vol. 36, no. 5, pp. 843-858, 2011.

[12] Ganapathy S., Vijayakumar P., Yogesh P., and Kannan A., “An Intelligent CRF based Feature Selection for Effective Intrusion Detection,” The International Arab Journal of Information Technology, vol. 13, no. 1, pp. 44-50, 2016.

[13] Gao Z., Xu Y., Meng F., Qi F., and Lin Z., “Improved Information Gain-based Feature Selection for Text Categorization,” in Proceedings of 4th International Conference on Wireless Communications, Vehicular Technology, Information Theory and Aerospace and Electronic Systems, Aalborg, pp. 1-5, 2014.

[14] Li S., Xia R., Zong C., and Huang C., “A Framework of Feature Selection Methods for Text Categorization,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, pp. 692-700, 2009.

[15] Meena M., Chandran K., Karthik A., and Vijay A., “An Enhanced ACO Algorithm to Select Features for Text Categorization and its Parallelization,” Expert Systems with Applications, vol. 39, no. 5, pp. 5861-5871, 2012.

[16] Meng J., Lin H., and Yu Y., “A Two-Stage Feature Selection Method for Text Categorization,” Computers and Mathematics with Applications, vol. 62, no. 7, pp. 2793-2800, 2011.

[17] Mesleh A., “Support Vector Machines based Arabic Language Text Classification System: Feature Selection Comparative Study,” in Proceedings of Advances in Computer and Information Sciences and Engineering, pp. 11-16, 2008.

[18] Mladenic D. and Grobelnik M., “Feature Selection for Unbalanced Class Distribution and Naive Bayes,” in Proceedings of the 16th International Conference on Machine Learning, San Francisco, pp. 258-267, 1999.

[19] Nakagawa H., “Automatic Term Recognition based on Statistics of Compound Nouns,” Terminology, vol. 6, no. 2, pp. 195-210, 2001.

[20] Nakagawa H. and Mori T., “A Simple but Powerful Automatic Term Extraction Method,” in Proceedings of 2nd International Workshop on Computational Terminology, Stroudsburg, pp. 1-7, 2002.

[21] Pinheiro R., Cavalcanti G., Correa R., and Ren T., “A Global-Ranking Local Feature Selection Method for Text Categorization,” Expert Systems with Applications, vol. 39, no. 17, pp. 12851-12857, 2012.

[22] Ren F. and Sohrab M., “Class-Indexing-based Term Weighting for Automatic Text Classification,” Information Sciences, vol. 236, pp. 109-125, 2013.

[23] Saad M., The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification, Master of Science Thesis, Computer Engineering, The Islamic University, 2010.

[24] Shang W., Huang H., Zhu H., Lin Y., Qu Y., and Wang Z., “A Novel Feature Selection Algorithm for Text Categorization,” Expert Systems with Applications, vol. 33, no. 1, pp. 1-5, 2007.

[25] Singh B., Kushwaha N., and Vyas O., “A Feature Subset Selection Technique for High Dimensional Data Using Symmetric Uncertainty,” Journal of Data Analysis and Information Processing, vol. 2, no. 4, pp. 95-105, 2014.

[26] Tan C., Wang Y., and Lee C., “The Use of Bigrams to Enhance Text Categorization,” Information Processing and Management, vol. 38, no. 4, pp. 529-546, 2002.

[27] Uğuz H., “A Two-Stage Feature Selection Method for Text Categorization by using Information Gain, Principal Component Analysis and Genetic Algorithm,” Knowledge-Based Systems, vol. 24, no. 7, pp. 1024-1032, 2011.

[28] Uysal A. and Gunal S., “A Novel Probabilistic Feature Selection Method for Text Classification,” Knowledge-Based Systems, vol. 36, pp. 226-235, 2012.

[29] Vege S., Ensemble of Feature Selection Techniques for High Dimensional Data, Thesis, Western Kentucky University, 2012.

[30] Wang J., Zhou S., Yi Y., and Kong J., “An Improved Feature Selection based on Effective Range for Classification,” The Scientific World Journal, pp. 1-8, 2014.

Aisha Adel is a PhD candidate at UKM, Malaysia. She earned her MSc degree in computer science from UKM, Malaysia, in 2014 and her BSc degree from UST, Yemen, in 2009. Her research interests are machine learning and optimization algorithms.

Nazlia Omar is currently an Associate Professor at the Center for AI Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia. She holds a PhD in Computer Science from the University of Ulster, UK. Her main research interest is in the area of Natural Language Processing and Computational Linguistics.

Mohammed Albared obtained his BSc and his master's degree in Computer Science from Yarmouk University, Jordan, and his PhD in Computer Science from Universiti Kebangsaan Malaysia. He is now an Assistant Professor at Sana’a University. His research interests include Natural Language Processing (NLP), Machine Learning, Text and Web Mining, and Sentiment Analysis.

Adel Al-shabi earned his PhD degree in 2018 and his MSc degree in 2013, both in computer science, at Universiti Kebangsaan Malaysia. He obtained his BSc degree in 2006 at National University, Yemen. His research interests are machine learning and sentiment analysis.