Determining PolyCystic Ovarian Syndrome Severity from Reddit Posts using Topic Modelling and Association Rule Mining
Social media now plays a vital role in many real-time applications, particularly in healthcare. PolyCystic Ovarian Syndrome (PCOS) is a condition that affects women of reproductive age, typically between 15 and 35. Its symptoms include hormonal issues, irregular periods, weight gain, follicles, infertility, excessive hair growth, hair loss, acne, pimples, dark scars, and depression. Most earlier studies analyzed PCOS from clinical text and health records using machine learning approaches. The main motivation of the proposed work is to predict upcoming PCOS symptoms from current symptoms and to determine the severity of PCOS for Reddit users. To do this, head symptoms are collected from gynecologists and present symptoms are gathered from Reddit users; the collected unstructured data is pre-processed, and PCOS sub-symptoms are extracted using Bag of Words. The sub-symptoms are mapped to head symptoms using Latent Dirichlet Allocation (LDA) for dimensionality reduction. The major issue with this approach is that a single user may report the same type of symptom multiple times. To solve this issue, a novel method called Symptom Segmentation and Grouping Labeled Latent Dirichlet Allocation (SSG_LLDA) is designed to reduce the dimensionality and map social media users' sub-symptoms to head symptoms. Association Rule Mining (ARM) with Apriori is employed to produce frequent symptoms and effective rule sets and to form distinctive symptom patterns. Among several minimum support and confidence settings, 0.02 and 0.1, respectively, deliver the best rule sets and symptom patterns. Based on the rule sets of symptom patterns and combinations, the severity of PCOS is determined for Reddit users. The novelty of this work is that PCOS symptom patterns are constructed from topic modelling results instead of the original data, so the feature dimensionality is reduced and the approach is more scalable.
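The rule-mining step described above can be illustrated with a minimal Apriori sketch. This is not the authors' implementation: the function names `apriori` and `rules` and the five toy user records are hypothetical, and the thresholds read the abstract's 0.02 and 0.1 as minimum support and minimum confidence. Each user is stored as a set of head symptoms, so a symptom reported several times by the same user is counted once.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: each individual symptom.
    current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    k = 1
    while current:
        # Support = fraction of users whose symptom set contains the candidate.
        support = {c: sum(1 for t in transactions if c <= t) / n for c in current}
        level = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(level)
        # Join: merge frequent k-itemsets into (k+1)-itemset candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    found = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = support / frequent[a]
                if confidence >= min_confidence:
                    found.append((a, itemset - a, support, confidence))
    return found

# Hypothetical head symptoms per user (illustrative data only).
users = [
    {"irregular periods", "weight gain", "acne"},
    {"irregular periods", "weight gain"},
    {"irregular periods", "hair loss"},
    {"acne", "hair loss"},
    {"irregular periods", "weight gain", "acne"},
]

frequent = apriori(users, min_support=0.02)
found = rules(frequent, min_confidence=0.1)

for a, c, s, cf in sorted(found, key=lambda r: -r[3])[:5]:
    print(sorted(a), "->", sorted(c), f"support={s:.2f} confidence={cf:.2f}")
```

With thresholds this low, nearly every co-occurring symptom pair yields a rule, so in practice the resulting rule set would still need to be ranked or filtered before being used to grade severity.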
The International Arab Journal of Information Technology, Vol. 21, No. 3, May 2024