The International Arab Journal of Information Technology (IAJIT)


Preceding Document Clustering by Graph Mining Based Maximal Frequent Termsets Preservation

This paper presents an approach to document clustering. It introduces a novel graph mining based algorithm that finds the frequent termsets present in a document set. The document set is first mapped onto a bipartite graph. Based on the results of the algorithm, the document set is modified to reduce its dimensionality. The Bisecting K-means algorithm is then executed over the modified document set to obtain a set of meaningful clusters. The proposed approach, Clustering preceded by Graph Mining based Maximal Frequent Termsets Preservation (CGFTP), is shown to produce better quality clusters than some classical document clustering algorithms, and the resulting clusters are shown to be easily interpretable. Cluster quality is measured in terms of the F-measure.
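
The preprocessing steps named in the abstract can be illustrated with a minimal sketch. It is not the CGFTP algorithm itself, whose details are not given here: it assumes a plain level-wise search over the bipartite graph, an absolute minimum support count, and that dimensionality reduction means keeping only terms that occur in some maximal frequent termset. All function names and the toy documents are hypothetical.

```python
from itertools import combinations


def build_bipartite(docs):
    """Map documents onto a bipartite graph stored as term -> set of doc ids."""
    term_to_docs = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            term_to_docs.setdefault(term, set()).add(doc_id)
    return term_to_docs


def frequent_termsets(term_to_docs, min_support):
    """Level-wise search (an assumed stand-in for the paper's graph mining).

    The support of a termset is the number of documents adjacent to all of
    its terms, i.e. the size of the intersection of their neighbourhoods.
    """
    level = {frozenset([t]): d for t, d in term_to_docs.items()
             if len(d) >= min_support}
    frequent = dict(level)
    while level:
        next_level = {}
        for (a, docs_a), (b, docs_b) in combinations(level.items(), 2):
            candidate = a | b
            # Only join termsets that differ in exactly one term.
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            shared = docs_a & docs_b
            if len(shared) >= min_support:
                next_level[candidate] = shared
        frequent.update(next_level)
        level = next_level
    return frequent


def maximal_only(frequent):
    """Keep termsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


def reduce_documents(docs, maximal):
    """Drop every term that appears in no maximal frequent termset."""
    keep = set().union(*maximal) if maximal else set()
    return [[t for t in doc if t in keep] for doc in docs]


# Toy document set (hypothetical), each document already tokenised.
docs = [
    ["graph", "mining", "cluster"],
    ["graph", "mining", "text"],
    ["graph", "mining", "cluster", "noise"],
    ["cluster", "text"],
]
frequent = frequent_termsets(build_bipartite(docs), min_support=2)
maximal = maximal_only(frequent)
reduced = reduce_documents(docs, maximal)
```

The reduced documents would then be vectorised and passed to a Bisecting K-means implementation; that step is omitted from the sketch.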

The International Arab Journal of Information Technology, Vol. 16, No. 3, May 2019

$$F(i,j) = \frac{2 \cdot \mathrm{Recall}(i,j) \cdot \mathrm{Precision}(i,j)}{\mathrm{Recall}(i,j) + \mathrm{Precision}(i,j)} \tag{5}$$
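
The F-measure of a class i with respect to a cluster j is the harmonic mean of recall and precision. A minimal sketch follows, assuming the usual clustering definitions, which are not stated in this section: recall is n_ij / n_i and precision is n_ij / n_j, where n_ij is the number of documents of class i in cluster j, n_i the size of class i, and n_j the size of cluster j. The function name and the example counts are hypothetical.

```python
def f_measure(n_ij, n_i, n_j):
    """F-measure of class i against cluster j.

    n_ij: documents of class i that fall in cluster j
    n_i:  documents in class i
    n_j:  documents in cluster j
    (Assumed definitions: recall = n_ij / n_i, precision = n_ij / n_j.)
    """
    if n_ij == 0:
        return 0.0  # no overlap, so recall and precision are both zero
    recall = n_ij / n_i
    precision = n_ij / n_j
    return 2 * recall * precision / (recall + precision)


# Example: 8 of the 10 documents of a class land in a cluster of 16 documents.
score = f_measure(8, 10, 16)  # recall 0.8, precision 0.5
```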

Syed Shah is pursuing a Ph.D. in Computer Engineering at Jamia Millia Islamia, New Delhi, India. He has 7 years of academic experience. His research interests include data mining and big data analytics.

Mohammad Amjad obtained his B.Tech. degree in Computer Engineering from Aligarh Muslim University, Aligarh. He obtained his M.Tech. degree in Information Technology from IP University, New Delhi, and his Ph.D. in Computer Engineering from Jamia Millia Islamia, New Delhi. Dr. Amjad is currently working as an Assistant Professor in the Department of Computer Engineering, Faculty of Engineering & Technology, Jamia Millia Islamia (Central University), New Delhi. He has four years of industry experience and 15 years of teaching experience. He has contributed thirty research papers to reputed journals and to national and international conferences, including conferences in the USA and China. He is actively involved in research and development in the areas of MANET, WSN, software engineering, mobile computing, network security systems and allied areas.