The International Arab Journal of Information Technology (IAJIT)



Weighted Delta Factor Cluster Ensemble Algorithm for Categorical Data Clustering in Data Mining

Although cluster ensemble approaches have emerged as a powerful means of improving the robustness, stability, and quality of individual clustering systems, in most cases they generate the final data partition from incomplete information. The primary ensemble information matrix produced by traditional cluster ensemble approaches captures only cluster-data point relations and leaves many entries unknown. This paper presents an improved analysis of the Link-based Cluster Ensemble (LCE) approach, which addresses this degradation in the quality of the clustering result, and in particular proposes an efficient, novel Weighted Delta Factor Cluster Ensemble (WDFCE) algorithm, which enhances the refined matrix by augmenting the values of the similitude measures between the clusters formed in the bipartite cluster graph. To obtain the final clustering result, a pairwise-similarity consensus method is then used, in which the K-means clustering technique is applied to the similarity measures derived from the Refined Similitude Matrix (RSM). Experimental results on several UCI datasets and a synthetic dataset show that the proposed method consistently outperforms the traditional cluster ensemble techniques and individual clustering algorithms.
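To make the pairwise-similarity consensus step concrete, the following is a minimal sketch under stated assumptions, not the WDFCE algorithm itself. It substitutes a plain co-association matrix (the fraction of base clusterings in which two points share a cluster) for the Refined Similitude Matrix, and it uses a small hand-written K-means with farthest-first initialization. The function names `co_association`, `farthest_first_centers` and `kmeans_on_similarity`, as well as the toy labels, are hypothetical and chosen only for illustration.

```python
import numpy as np

def co_association(base_labels):
    """Pairwise similarity: fraction of base clusterings that put two points together."""
    base_labels = np.asarray(base_labels)      # shape (M clusterings, N points)
    m, n = base_labels.shape
    sim = np.zeros((n, n))
    for labels in base_labels:
        sim += (labels[:, None] == labels[None, :])
    return sim / m

def farthest_first_centers(rows, k):
    """Deterministic initialization: start at row 0, then repeatedly take the farthest row."""
    centers = [rows[0]]
    for _ in range(k - 1):
        d = np.min([((rows - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(rows[int(d.argmax())])
    return np.array(centers)

def kmeans_on_similarity(sim, k, n_iter=100):
    """Lloyd's K-means, treating each row of the similarity matrix as a feature vector."""
    centers = farthest_first_centers(sim, k)
    labels = np.zeros(len(sim), dtype=int)
    for _ in range(n_iter):
        dist = ((sim[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            sim[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy ensemble (assumed data): three base clusterings of six points.
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
sim = co_association(base)
consensus = kmeans_on_similarity(sim, k=2)
print(consensus)
```

In this sketch the third base clustering disagrees with the other two, yet the averaged similarities still separate the first three points from the last three. A refined matrix such as the RSM would replace `sim` with values that also account for relations between clusters.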


The International Arab Journal of Information Technology, Vol. 14, No. 3, May 2017

Sarumathi Sengottaian received the BE degree in Electronics and Communication Engineering from Madras University, Madras, Tamil Nadu, India in 1994 and the ME degree in Computer Science and Engineering from K.S.Rangasamy College of Technology, Namakkal, Tamil Nadu, India in 2007. She is pursuing her PhD in the area of data mining at Anna University, Chennai. She has about 16 years of teaching experience. At present she is working as an Associate Professor in the Information Technology department at K.S.Rangasamy College of Technology. She has published 7 papers in reputed international journals and 2 papers in reputed national journals. She has also presented papers at three international conferences and four national conferences. She has received many cash awards for producing cent percent results in university examinations. She is a life member of ISTE.

Shanthi Natesan received the BE degree in Computer Science and Engineering from Bharathiyar University, Coimbatore, Tamil Nadu, India in 1994 and the ME degree in Computer Science and Engineering from Government College of Technology, Coimbatore, Tamil Nadu, India in 2001. She completed her PhD degree at Periyar University, Salem, in offline handwritten Tamil character recognition. She worked as HOD in the Department of Information Technology at K.S.Rangasamy College of Technology, Tamil Nadu, India from 1994 to 2013, and is currently working as a Professor and Dean in the Department of Computer Science and Engineering at Nandha Engineering College, Erode. She has published 39 papers in reputed international journals and 9 papers in national and international conferences. She has published 2 books. She is supervising 14 research scholars under Anna University, Chennai. She acts as a reviewer for 4 international journals. Her current research interests include document analysis, optical character recognition, pattern recognition, and network security. She is a life member of ISTE.

Sharmila Mathivanan received the BTech and MTech degrees in Information Technology from K.S.Rangasamy College of Technology, affiliated to Anna University, Chennai, Tamil Nadu, India in 2012 and 2014 respectively. At present she is working as an Assistant Professor in the Information Technology Department at M.Kumarasamy College of Engineering, Karur, Tamil Nadu, India. She has published 6 papers in international journals and presented three papers at national-level technical symposia. She is an active member of ISTE. Her research interests include mining medical data, opinion mining, and web mining. Most of her current work involves the development of efficient cluster ensemble algorithms for extracting accurate clusters in high-dimensional databases.