The International Arab Journal of Information Technology (IAJIT)


Clustering with Probabilistic Topic Models on Arabic Texts: A Comparative Study of LDA and K-Means
Probabilistic topic models such as Latent Dirichlet Allocation (LDA) have recently been widely used in text mining tasks such as retrieval, summarization, and clustering across different languages. In this paper, we present a first comparative study of LDA and K-means, two well-known methods for topic identification and clustering respectively, applied to Arabic texts. Our aim is to compare how the morpho-syntactic characteristics of the Arabic language influence the performance of the first method relative to the second. To study different aspects of these methods, the study is conducted on four benchmark document collections, and clustering quality is measured with four well-known evaluation measures: Rand index, Jaccard index, F-measure, and Entropy. The results show that LDA outperforms K-means in most cases.
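The four evaluation measures named above can be illustrated with a minimal Python sketch. It is not the authors' implementation; it assumes each document has one gold class label and one predicted cluster label (for LDA, for example, the index of its most probable topic, which is itself an assumption about how topics are turned into clusters). The function names are hypothetical. The F-measure here is the commonly used class-weighted variant (best F1 over clusters for each class), and Entropy is the size-weighted class entropy of each cluster using the natural logarithm; the paper may use a different normalization.

```python
from collections import Counter
from itertools import combinations
from math import log


def pair_counts(gold, pred):
    """Count document pairs by (same class?, same cluster?)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(gold)), 2):
        same_class = gold[i] == gold[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            ss += 1
        elif same_class:
            sd += 1
        elif same_cluster:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(gold, pred):
    """Fraction of pairs on which classes and clusters agree (needs >= 2 documents)."""
    ss, sd, ds, dd = pair_counts(gold, pred)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_index(gold, pred):
    """Like the Rand index, but pairs separated in both partitions are ignored."""
    ss, sd, ds, _ = pair_counts(gold, pred)
    denom = ss + sd + ds
    # Convention assumed here: no co-grouped pairs at all counts as full agreement.
    return ss / denom if denom else 1.0


def f_measure(gold, pred):
    """Class-size-weighted average of the best F1 each class reaches in any cluster."""
    n = len(gold)
    class_sizes = Counter(gold)
    cluster_sizes = Counter(pred)
    overlap = Counter(zip(gold, pred))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck:
                precision, recall = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


def entropy(gold, pred):
    """Cluster-size-weighted entropy of the class distribution inside each cluster."""
    n = len(gold)
    total = 0.0
    for k, n_k in Counter(pred).items():
        classes = Counter(g for g, p in zip(gold, pred) if p == k)
        h = -sum((m / n_k) * log(m / n_k) for m in classes.values())
        total += (n_k / n) * h
    return total


# Toy example: six documents, two gold classes, two predicted clusters.
gold = ["a", "a", "a", "b", "b", "b"]
pred = [0, 0, 1, 1, 1, 1]
scores = {
    "rand": rand_index(gold, pred),
    "jaccard": jaccard_index(gold, pred),
    "f_measure": f_measure(gold, pred),
    "entropy": entropy(gold, pred),
}
```

Higher is better for the Rand index, Jaccard index, and F-measure, while lower is better for Entropy, so a comparison between two clustering methods has to read the fourth measure in the opposite direction.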


The International Arab Journal of Information Technology, Vol. 13, No. 2, March 2016

Abdessalem Kelaiaia received his Engineer degree from Annaba University, Algeria, in 1996 and his MS degree in Computer Science from Guelma University, Algeria, in 2008. He is currently an Assistant Professor at the University of May 08, Algeria, and is preparing his PhD degree at Annaba University. His current research field is text mining.

Hayet Merouani received her Engineer degree from Annaba University, Algeria, in 1984 and her PhD degree from Robert Gordon University, UK. She is currently a full Associate Professor at Badji Mokhtar University, Annaba. She also leads the Pattern Recognition research group, part of a national research program on breast cancer. Her current work focuses on computer vision, medical imaging, and biometry.