The International Arab Journal of Information Technology (IAJIT)


A Genetic Algorithm based Domain Adaptation Framework for Classification of Disaster Topic Text Tweets

The ability to post short text and media messages on social media platforms such as Twitter and Facebook plays a major role in the exchange of information following a mass emergency event such as a hurricane, earthquake, or tsunami. Disaster victims, families, and relief operation teams use social media to help and support one another. Despite the benefits these communication media offer, disaster topic related posts (posts that indicate conversations about the disaster event in its aftermath) get lost in the deluge of posts, because the volume of data exchanged surges following a mass emergency event. This hampers the emergency relief effort, which in turn affects the delivery of useful information to disaster victims. Research on emergency coordination via social media has received growing interest in recent years, mainly focusing on developing machine learning-based models that can separate disaster-related topic posts from non-disaster-related topic posts. Of these, supervised machine learning approaches perform well when the source disaster dataset used to train the model and the target disaster dataset are similar. In the real world, however, this similarity may not hold, since different disasters have different characteristics, so models developed using supervised machine learning approaches do not perform well on unseen disaster datasets. Therefore, domain adaptation approaches, which address this limitation by learning classifiers from unlabeled target data in addition to labeled source data, represent a promising direction for social media crisis data classification tasks. Existing domain adaptation techniques for the classification of disaster tweets have been evaluated on single disaster event dataset pairs; self-training is then performed on the source-target dataset pairs by adding the most confidently classified instances in subsequent iterations of training.
This could be improved with better feature engineering. Thus, this research proposes a Genetic Algorithm based Domain Adaptation Framework (GADA) for the classification of disaster tweets. The proposed GADA combines 1) a Hybrid Feature Selection component that uses the Genetic Algorithm (GA) and the Chi-Square Feature Evaluator for feature selection and 2) a Classifier component that uses Random Forest to separate disaster-related posts from noise on Twitter. The framework addresses the lack of labeled data in the target disaster event through this Genetic Algorithm based approach. Experimental results on Twitter datasets corresponding to four disaster domain pairs show that the proposed framework improves the overall performance over previous supervised approaches and significantly reduces the training time compared with previous domain adaptation techniques that do not use the GA for feature selection.
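To make the idea of hybrid feature selection concrete, the following is a minimal sketch and not the authors' implementation. It scores binary term features with the chi-square statistic and lets a small genetic algorithm search for a feature mask. The fitness function (sum of chi-square scores minus a per-feature penalty), the GA settings, and all names such as `chi_square` and `ga_select` are assumptions made for illustration; the Random Forest classifier and the self-training loop are omitted.

```python
import random


def chi_square(present, labels):
    """Chi-square statistic of one binary term feature against binary labels."""
    n = len(labels)
    a = sum(1 for p, y in zip(present, labels) if p and y)          # term present, on-topic
    b = sum(1 for p, y in zip(present, labels) if p and not y)      # term present, off-topic
    c = sum(1 for p, y in zip(present, labels) if not p and y)      # term absent, on-topic
    d = n - a - b - c                                               # term absent, off-topic
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def fitness(mask, scores, penalty):
    """Assumed filter fitness: reward informative features, penalize mask size."""
    return sum(s for s, m in zip(scores, mask) if m) - penalty * sum(mask)


def ga_select(scores, penalty=1.0, pop_size=20, generations=40, p_mut=0.1, seed=0):
    """Search for a 0/1 feature mask with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    n = len(scores)

    def score(mask):
        return fitness(mask, scores, penalty)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = [max(pop, key=score)[:]]          # elitism: keep the best mask
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(pop, 2), key=score)  # binary tournament selection
            p2 = max(rng.sample(pop, 2), key=score)
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = p1[:cut] + p2[cut:]              # one-point crossover
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=score)


# Hypothetical toy data: each column marks which of four tweets contain a term.
labels = [1, 1, 0, 0]                 # 1 = disaster-related, 0 = noise
columns = {
    "flood":  [1, 1, 0, 0],
    "today":  [1, 0, 1, 0],
    "rescue": [1, 1, 0, 1],
}
scores = [chi_square(col, labels) for col in columns.values()]
mask = ga_select(scores)
selected = [term for term, keep in zip(columns, mask) if keep]
```

In a fuller pipeline, the terms in `selected` would define the feature space on which a classifier is trained on labeled source tweets and then applied to unlabeled target tweets.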
