The International Arab Journal of Information Technology (IAJIT), Vol. 22, No. 1, January 2025

Hierarchical Method for Automated Text Documents Classification

Digitalization is no longer a concept the world aspires to apply; it is a reality the world lives in. The shift toward a greener world has reinforced the principle of eliminating hard-copy resources while retaining their digital versions. The immense amount of information held in electronic documents has long opened a wide avenue for research. In parallel, information extraction, text mining, and Natural Language Processing (NLP) are three closely linked fields that have earned a distinct place in the digital world over time. This research introduces a novel method for Arabic document classification. The method assigns multiple tags to each document according to a set of criteria; one of these tags is the document's hierarchical classification, which could play an effective role in the related field. For example, the large body of documents in healthcare systems could lead to the discovery of a new symptom of a disease, since symptoms are known to change continuously over time. Through the generated schema, the proposed method relates old symptoms to new ones, so that their evolution comes as no surprise and there is an opportunity for advance preparation and successful containment. The technical challenges of this study include successfully applying text mining techniques and machine learning. A further, greater challenge is that the processing is applied to Arabic text documents. Arabic is known to be a complex language with its own distinctive nature. The proposed method was applied and compared with known methods, and its effectiveness was confirmed on a classification task with an accuracy of 99.5%.
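The abstract does not spell out how the hierarchical tag is produced. Purely as an illustration of what a top-down hierarchical classification looks like, the minimal Python sketch below picks a top-level category first and then a subcategory within it. Everything in it is an assumption made for illustration: the label tree `TAXONOMY`, the seed keywords, the keyword-overlap scoring, and the function names are hypothetical and are not the paper's method. English tokens are used for readability; Arabic text would additionally need language-specific normalization and stemming.

```python
from collections import Counter

# Hypothetical two-level label tree with a few seed keywords per node.
# Labels and keywords are illustrative only, not taken from the paper.
TAXONOMY = {
    "health": {
        "keywords": {"patient", "symptom", "disease", "fever"},
        "children": {
            "respiratory": {"cough", "breath", "lung"},
            "digestive": {"stomach", "nausea", "diet"},
        },
    },
    "sports": {
        "keywords": {"match", "team", "score", "league"},
        "children": {
            "football": {"goal", "penalty", "striker"},
            "tennis": {"serve", "racket", "set"},
        },
    },
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def overlap(tokens, keywords):
    """Count how many tokens in the document match the node's keywords."""
    counts = Counter(tokens)
    return sum(counts[k] for k in keywords)


def classify(text):
    """Return the path [top_level, sub_level] chosen by top-down scoring."""
    tokens = tokenize(text)
    # Level 1: choose the top-level category with the highest keyword overlap.
    top = max(TAXONOMY, key=lambda name: overlap(tokens, TAXONOMY[name]["keywords"]))
    # Level 2: choose only among the children of the selected category.
    children = TAXONOMY[top]["children"]
    sub = max(children, key=lambda name: overlap(tokens, children[name]))
    return [top, sub]


if __name__ == "__main__":
    print(classify("the patient reported a new symptom a dry cough and short breath"))
```

The point of the top-down design is that the second decision is restricted to the children of the first, so each document receives a path through the hierarchy rather than a single flat label. A real system would replace the keyword overlap with a trained classifier at each level.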
