The International Arab Journal of Information Technology (IAJIT)



A Study on Two-Stage Mixed Attribute Data Clustering Based on Density Peaks

A two-stage clustering framework and a clustering algorithm for mixed attribute data based on density peaks and Goodall distance are proposed. First, the subset of numerical attributes of the dataset is clustered, and the result is mapped into a one-dimensional categorical attribute and added to the subset of categorical attributes. The new dataset is then clustered by the density peaks clustering algorithm to obtain the final result. Experiments on three commonly used UCI datasets show that this algorithm can effectively realize mixed attribute clustering and produces better clustering results than the traditional K-prototypes algorithm does. On average, the clustering accuracy on the Acute, Heart, and Credit datasets is 17%, 24%, and 21% higher, respectively, than that of K-prototypes.
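The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which algorithm clusters the numerical attributes, how the cutoff distance is chosen, or which Goodall variant is used. The sketch therefore assumes density peaks clustering with Euclidean distance in the first stage, a percentile-based cutoff, and a simple frequency-based Goodall-style similarity (one of several variants in the literature). All function names, parameters, and the toy data are hypothetical.

```python
import numpy as np


def density_peaks(dist, n_clusters, pct=20.0):
    """Minimal density peaks clustering on a precomputed distance matrix."""
    n = dist.shape[0]
    # Cutoff distance: a percentile of the pairwise distances (assumed heuristic).
    dc = max(np.percentile(dist[np.triu_indices(n, k=1)], pct), 1e-12)
    # Local density rho: Gaussian kernel, excluding each point's self term.
    kernel = np.exp(-(dist / dc) ** 2)
    np.fill_diagonal(kernel, 0.0)
    rho = kernel.sum(axis=1)
    order = np.argsort(-rho, kind="stable")  # indices by decreasing density
    # delta: distance to the nearest point that ranks higher in density.
    delta = np.zeros(n)
    parent = np.full(n, -1)
    delta[order[0]] = dist.max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j
    # Centres: points with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the densest point is always taken as a centre
    centres = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centres] = np.arange(n_clusters)
    # Each remaining point inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


def goodall_distance(cat):
    """Pairwise distance from a Goodall-style similarity (assumed variant):
    a match on value x scores 1 - p2(x), a mismatch scores 0, where
    p2(x) = f(f - 1) / (n(n - 1)) and f is the frequency of x."""
    n, m = cat.shape
    sim = np.zeros((n, n))
    for k in range(m):
        col = cat[:, k]
        _, inverse, counts = np.unique(col, return_inverse=True, return_counts=True)
        p2 = counts * (counts - 1) / (n * (n - 1))
        match = col[:, None] == col[None, :]
        sim += np.where(match, 1.0 - p2[inverse][:, None], 0.0)
    dist = 1.0 - sim / m
    np.fill_diagonal(dist, 0.0)
    return dist


def two_stage_cluster(numeric, categorical, n_stage1, n_clusters):
    # Stage 1: cluster the numerical attributes only.
    diff = numeric[:, None, :] - numeric[None, :, :]
    stage1 = density_peaks(np.linalg.norm(diff, axis=2), n_stage1)
    # Map the stage-1 result to one extra categorical attribute.
    extended = np.column_stack([categorical, stage1.astype(str)])
    # Stage 2: density peaks clustering on the extended categorical data.
    return density_peaks(goodall_distance(extended), n_clusters)


# Toy data (hypothetical): two well-separated groups of four objects each.
numeric = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
categorical = np.array([["x", "p"], ["x", "p"], ["x", "q"], ["x", "p"],
                        ["y", "q"], ["y", "q"], ["y", "p"], ["y", "q"]])
labels = two_stage_cluster(numeric, categorical, n_stage1=2, n_clusters=2)
print(labels)  # the first four and last four objects fall into different clusters
```

In this sketch the stage-1 labels enter the second stage as an ordinary categorical column, so the numerical structure influences the final partition only through attribute matches. Real data would usually also need the numerical attributes to be standardized before the first stage.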


Shihua Liu received his M.S. degree from Zhejiang University of Technology in Hangzhou, China, in 2008, and his Ph.D. degree in control science and engineering from Zhejiang University of Technology in 2018. He is currently an associate professor at Wenzhou Polytechnic. His research interests include machine learning, data mining, and information security.

Hao Zhang is currently an associate professor in the Information Technology Department of Wenzhou Polytechnic. His research interests include data mining, digital forensics, and information security.

Xianghua Liu received her M.S. degree in software engineering from Huazhong University of Science and Technology. She is a lecturer at Wenzhou Polytechnic. Her research interests include web application development and security, and data mining.