The International Arab Journal of Information Technology (IAJIT)


Determining PolyCystic Ovarian Syndrome Severity from Reddit Posts using Topic Modelling and Association Rule Mining

Social media now plays a vital role in many real-time applications, especially in healthcare. PolyCystic Ovarian Syndrome (PCOS) is a condition that affects females of reproductive age, between 15 and 35. Its symptoms include hormonal issues, irregular periods, weight gain, follicles, infertility, excessive hair growth on the skin, hair loss, acne, pimples, dark scars, and depression. Most earlier studies analyzed PCOS from clinical text and health records using machine learning approaches. The main motivation of this work is to predict upcoming PCOS symptoms from current symptoms and to determine the severity of PCOS for Reddit users. To this end, head symptoms are collected from gynecologists and present symptoms are gathered from Reddit users; the collected unstructured data is pre-processed, and PCOS sub-symptoms are extracted using Bag of Words. The sub-symptoms are mapped to head symptoms using Latent Dirichlet Allocation (LDA) for dimensionality reduction. The major issue with that approach is that a single user can experience the same type of symptom multiple times. To solve this issue, a novel method called Symptom Segmentation and Grouping Labeled Latent Dirichlet Allocation (SSG_LLDA) is designed to reduce the dimensionality and map social media users' sub-symptoms to head symptoms. Association Rule Mining (ARM) with Apriori is then employed to produce frequent symptoms and effective rule sets, and to form distinctive symptom patterns. Among the several minimum support and confidence thresholds tested, 0.02 and 0.1, respectively, deliver the best rule sets and symptom patterns. Based on the rule sets of symptom patterns and combinations, the severity of PCOS is determined for Reddit users. The novelty of this work is the construction of PCOS symptom patterns from topic modelling results instead of the original data, so that the dimensionality of the features is reduced and the approach is more scalable.
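The rule-mining step described above can be illustrated with a minimal, self-contained Apriori sketch. It is not the authors' implementation: the per-user symptom sets are invented toy data, and the function names `apriori` and `association_rules` are hypothetical. Only the thresholds (minimum support 0.02, minimum confidence 0.1) come from the text. The sketch assumes each Reddit user has already been reduced to a set of head symptoms by the topic-modelling step, and it stops at rule generation; how rules are turned into a severity level is not shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level-1 candidates: every distinct item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                # Every subset of a frequent itemset is itself frequent.
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append(
                        (antecedent, itemset - antecedent, support, confidence)
                    )
    return rules


# Toy data (assumption): one set of head symptoms per user.
users = [
    {"irregular periods", "weight gain", "acne"},
    {"irregular periods", "weight gain"},
    {"irregular periods", "hair loss"},
    {"weight gain", "acne", "depression"},
    {"irregular periods", "weight gain", "depression"},
]

frequent = apriori(users, min_support=0.02)
rules = association_rules(frequent, min_confidence=0.1)

# Show the strongest rules first.
for antecedent, consequent, support, confidence in sorted(
    rules, key=lambda rule: -rule[3]
)[:5]:
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={support:.2f}", f"confidence={confidence:.2f}")
```

With thresholds this low, almost every co-occurring symptom combination in a small sample becomes a rule, so on real data the resulting rule set would likely need ranking or filtering before it is interpreted.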

