The International Arab Journal of Information Technology (IAJIT)


A Genetic Algorithm based Domain Adaptation Framework for Classification of Disaster Topic

The ability to post short text and media messages on social media platforms such as Twitter and Facebook plays a major role in the exchange of information following a mass emergency event such as a hurricane, earthquake, or tsunami. Disaster victims, families, and relief operation teams use social media to help and support one another. Despite the benefits of these communication media, disaster topic related posts (posts that indicate conversations about the disaster event in its aftermath) get lost in the deluge of posts, because the volume of data exchanged surges after a mass emergency event. This hampers the emergency relief effort, which in turn affects the delivery of useful information to disaster victims. Research on emergency coordination via social media has received growing interest in recent years, mainly focusing on developing machine learning based models that can separate disaster-related topic posts from non-disaster-related topic posts. Of these, supervised machine learning approaches perform well when the source disaster dataset used to train the model and the target disaster dataset are similar. In the real world, however, this similarity may not hold, because different disasters have different characteristics. As a result, models developed using supervised machine learning approaches do not perform well on unseen disaster datasets. Domain adaptation approaches, which address this limitation by learning classifiers from unlabeled target data in addition to labeled source data, therefore represent a promising direction for social media crisis data classification tasks. Existing domain adaptation techniques for the classification of disaster tweets have been evaluated on pairs of single disaster event datasets; self-training is then performed on the source-target dataset pairs by adding the highly confident instances in subsequent iterations of training. 
This could be improved with better feature engineering. Thus, this research proposes a Genetic Algorithm based Domain Adaptation Framework (GADA) for the classification of disaster tweets. The proposed GADA combines 1) a Hybrid Feature Selection component that uses the Genetic Algorithm (GA) and the Chi-Square Feature Evaluator for feature selection, and 2) a Classifier component that uses Random Forest to separate disaster-related posts from noise on Twitter. The framework addresses the lack of labeled data in the target disaster event through its Genetic Algorithm based approach. Experimental results on Twitter datasets corresponding to four disaster domain pairs show that the proposed framework improves on the overall performance of the previous supervised approaches and significantly reduces the training time compared with previous domain adaptation techniques that do not use the GA for feature selection.
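As a rough illustration of how a chi-square filter and a GA search can be combined for feature selection, the sketch below scores binary term-presence features with the chi-square statistic, keeps the top-ranked ones, and then lets a small GA search over subsets of them. This is only one plausible arrangement and is not taken from the paper: the function names, the GA settings (binary tournament selection, one-point crossover, bit-flip mutation, elitism), and in particular the `make_fitness` rule are assumptions made for the example. The framework described above uses Random Forest as its classifier; the simple rule here stands in for it only to keep the sketch free of dependencies.

```python
import random
from typing import Callable, List, Optional, Sequence

Mask = List[int]  # 1 = keep the feature, 0 = drop it


def chi_square(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Chi-square statistic of a binary feature against a binary label (2x2 table)."""
    n = len(labels)
    a = sum(1 for f, y in zip(feature, labels) if f and y)      # term present, disaster
    b = sum(1 for f, y in zip(feature, labels) if f and not y)  # term present, noise
    c = sum(1 for f, y in zip(feature, labels) if not f and y)  # term absent, disaster
    d = n - a - b - c                                           # term absent, noise
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_k_by_chi_square(X: Sequence[Sequence[int]], y: Sequence[int], k: int) -> List[int]:
    """Filter step: indices of the k features with the highest chi-square score."""
    scores = [chi_square([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def make_fitness(X, y, columns: Sequence[int], size_penalty: float = 0.01) -> Callable[[Mask], float]:
    """Hypothetical stand-in fitness: accuracy of an 'any selected term present' rule.

    A Random Forest evaluated on held-out data would take this role in practice.
    """
    def fitness(mask: Mask) -> float:
        chosen = [col for col, bit in zip(columns, mask) if bit]
        hits = sum(1 for row, label in zip(X, y)
                   if int(any(row[col] for col in chosen)) == label)
        return hits / len(y) - size_penalty * len(chosen)  # prefer smaller subsets
    return fitness


def genetic_search(n_bits: int, fitness: Callable[[Mask], float], pop_size: int = 20,
                   generations: int = 30, p_mut: float = 0.05,
                   rng: Optional[random.Random] = None) -> Mask:
    """Wrapper step: GA over bit masks; requires n_bits >= 2 and pop_size >= 2."""
    if rng is None:
        rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _generation in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            parents = []
            for _slot in range(2):  # binary tournament selection
                a, b = rng.sample(pop, 2)
                parents.append(a if fitness(a) >= fitness(b) else b)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = parents[0][:cut] + parents[1][cut:]
            children.append([bit ^ 1 if rng.random() < p_mut else bit for bit in child])
        pop = children
        best = max(pop, key=fitness)
    return best


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0]  # 1 = disaster-related, 0 = noise (toy data)
    rows = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 1, 0]]
    kept = top_k_by_chi_square(rows, labels, k=2)
    mask = genetic_search(len(kept), make_fitness(rows, labels, kept))
    print("kept columns:", kept, "GA mask:", mask)
```

Because the best mask of each generation is copied into the next one, the best fitness never decreases from one generation to the next. Swapping the toy rule for a cross-validated classifier score changes only `make_fitness`; the filter and search steps stay the same.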

The International Arab Journal of Information Technology, Vol. 20, No. 1, January 2023

[Figure: Weighted auROC by disaster event pair for NB-EM, NB-ST, RF-Supervised, RF-ST, and GADA.]
[Figure: Accuracy (%) by disaster event pair for NB-EM, NB-ST, RF-Supervised, and GADA.]
[Figure: Total number of iterations by domain adaptation approach (NB-ST using 4000 instances from source data, RF-ST using 1000 instances from source data, GADA using 1000 instances from source data) for the QF-OKT, QF-BB, QF-AF, and BB-WTE pairs.]

Lokabhiram Dwarakanath received the B.Engg degree in Electronics and Communication Engineering from Dr.MGR Engg College, University of Madras, India, and the M.Sc. degree in Enterprise Business Systems from Brunel University, West London, U.K. He is currently pursuing the Ph.D. degree in the Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. His research interests include data science, natural language processing, information systems, cloud computing, machine learning, big data, and social media analytics.

Amirrudin Kamsin is a Senior Lecturer at the Faculty of Computer Science and Information Technology, and the Acting Director and Deputy Director (ODL and Professional Programme) at the University of Malaya Centre for Continuing Education (UMCCed), University of Malaya, Malaysia. He received his BIT (Management) in 2001 from the University of Malaya and his MSc in Computer Animation in 2002 from Bournemouth University, UK. He obtained his PhD in Computer Science from University College London (UCL) in 2014. His research areas include human-computer interaction (HCI), authentication systems, e-learning, mobile applications, serious games, augmented reality, and mobile health services.

Liyana Shuib obtained her Master of Information System (Data Mining) from Universiti Kebangsaan Malaysia in 2005 and a Ph.D. from the University of Malaya, Malaysia, in 2013. She is an Associate Professor at the Department of Information Systems, Faculty of Computer Science & Information Technology and the Deputy Director of Analytics at Academic Strategic Planning Centre, Deputy Vice Chancellor (Academic & International), University of Malaya, Malaysia. She has published a number of journal papers and proceedings locally and internationally. Her research interests include personalization, e-learning, recommender systems, data science, data mining, artificial intelligence applications, and educational technology. She has won more than 20 awards from reputable international innovation competitions. She is also a senior member of IEEE computing society, an active blogger, and presently the principal investigator of multiple research grants in the Faculty.