The International Arab Journal of Information Technology (IAJIT)



A Method for Finding the Appropriate Number of Clusters

A drawback of almost all partition-based clustering algorithms is that the number of clusters must be specified at the outset. Identifying the true number of clusters in advance is a difficult problem. Several works have studied this issue, but no method is perfect in every case. This paper proposes a method for finding the appropriate number of clusters during the clustering process by constructing an index that indicates the appropriate number of clusters. The index is built from an intra-cluster coefficient and an inter-cluster coefficient. The intra-cluster coefficient reflects the distortion within a cluster; the inter-cluster coefficient reflects the distance among clusters. Both coefficients are computed only from the extremely marginal objects of the clusters. The search for the extremely marginal objects and the construction of the index are integrated into a weighted Fuzzy C-Means (FCM) algorithm, so the index is calculated while the weighted FCM is running. The extended weighted FCM algorithm that integrates this index is called Fuzzy C-Means-Extended (FCM-E). FCM-E not only seeks the clusters but also finds the appropriate number of clusters. The authors evaluate FCM-E on several data sets from the University of California, Irvine (UCI) repository (Iris, Wine, Breast Cancer Wisconsin, and Glass) and compare the results of the proposed method with those of other methods. The results obtained with the proposed method are encouraging.
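The general idea of scoring candidate cluster counts with a ratio of intra-cluster distortion to inter-cluster distance can be illustrated with a minimal sketch. This is not the FCM-E algorithm: the abstract does not define the proposed index or the weighting scheme, so the sketch below assumes plain (unweighted) FCM and substitutes the well-known Xie-Beni validity index, which uses all objects rather than only the extremely marginal ones. The function names `fcm`, `xie_beni`, and `choose_c`, the fuzzifier `m = 2`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def fcm(X, c, m=2.0, max_iter=200, tol=1e-6, seed=0):
    """Plain fuzzy c-means (assumed stand-in for the paper's weighted FCM).

    Returns centers V with shape (c, d) and memberships U with shape (n, c).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships; each row is normalised to sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Centers are membership-weighted means of the data.
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared distances from every object to every center, shape (n, c).
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Standard FCM update: u_ik is proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return V, U


def xie_beni(X, V, U, m=2.0):
    """Xie-Beni index (swapped in for the paper's index); lower is better.

    Numerator: fuzzy intra-cluster distortion.
    Denominator: n times the smallest squared distance between two centers.
    """
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    intra = ((U ** m) * d2).sum()
    center_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(center_d2, np.inf)  # ignore a center's distance to itself
    inter = center_d2.min()
    return intra / (X.shape[0] * inter)


def choose_c(X, c_values=range(2, 7), m=2.0):
    """Run FCM for each candidate c and keep the c with the lowest index."""
    scores = {}
    for c in c_values:
        V, U = fcm(X, c, m=m)
        scores[c] = xie_beni(X, V, U, m=m)
    return min(scores, key=scores.get), scores


# Synthetic example (assumed data): three well-separated 2-D blobs.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([b + rng.normal(scale=0.5, size=(50, 2)) for b in blob_centers])
best_c, scores = choose_c(X)
print(best_c, scores)
```

On data this well separated one would expect the index to favour three clusters, although FCM can converge to a local optimum and the outcome depends on the initialisation. Note that this sketch reruns the clustering once per candidate count, whereas the abstract describes the index as being computed inside the clustering run itself.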


[1] Bezdek J., Ehrlich R., and Full W., FCM: The Fuzzy C-Means Clustering Algorithm, Computers and Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.

[2] Capitaine H. and Frélicot C., A Fuzzy Modeling Approach to Cluster Validity, in Proceedings of IEEE International Conference on Fuzzy Systems, Jeju Island, pp. 462-467, 2009.

[3] Cheong Y. and Lee H., Determining the Number of Clusters in Cluster Analysis, Journal of the Korean Statistical Society, vol. 37, no. 2, pp. 135-143, 2008.

[4] Doan H. and Nguyen T., An Adaptive Method to Determine the Number of Clusters in Clustering Process, in Proceedings of The International Conference on Computer and Information Sciences, Kuala Lumpur, pp. 1-6, 2014.

[5] Hathaway R. and Bezdek J., Recent Convergence Results for the Fuzzy c-Means Clustering Algorithms, Journal of Classification, vol. 5, no. 2, pp. 237-247, 1988.

The International Arab Journal of Information Technology, Vol. 15, No. 4, July 2018

[6] Kalti K. and Mahjoub M., Image Segmentation by Gaussian Mixture Models and Modified FCM Algorithm, The International Arab Journal of Information Technology, vol. 11, no. 1, pp. 11-18, 2014.

[7] Kyrgyzov I., Kyrgyzov O., Maître H., and Campedel M., Kernel MDL to Determine the Number of Clusters, in Proceedings of 5th International Conference Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science, Leipzig, pp. 203-217, 2007.

[8] Nguyen T. and Doan H., An Approach to determine the Number of Clusters for Clustering Algorithms, in Proceedings of 4th International Conference Computational Collective Intelligence. Technologies and Applications, Vietnam, pp. 485-494, 2012.

[9] Pham T., Dimov S., and Nguyen D., Selection of K in K-means Clustering, Journal of Mechanical Engineering Science, vol. 219, no. 1, pp. 103-119, 2005.

[10] Rosenberger C. and Chehdi K., Unsupervised Clustering Method with Optimal Estimation of the Number of Clusters: Application to Image Segmentation, in Proceedings of 15th International Conference on Pattern Recognition, Barcelona, 2000.

[11] Salvador S. and Chan P., Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, in Proceedings of 16th IEEE International Conference on Tools with Artificial Intelligence, Boca Raton, 2004.

[12] Sanguinetti G., Laidler J., and Lawrence N., Automatic Determination of the Number of Clusters using Spectral Algorithms, IEEE Workshop on Machine Learning for Signal Processing, Mystic, 2005.

[13] Shao Q. and Wu Y., A Consistent Procedure for Determining the Number of Clusters in Regression Clustering, Journal of Statistical Planning and Inference, vol. 135, no. 2, pp. 461-476, 2005.

[14] Sugar C. and James G., Finding the Number of Clusters in a Data set: An Information Theoretic Approach, Journal of the American Statistical Association, vol. 98, no. 463, pp. 750-763, 2003.

[15] Sun H., Wang S., and Jiang Q., FCM-Based Model Selection Algorithms for Determining the Number of Clusters, Pattern Recognition, vol. 37, no. 10, pp. 2027-2037, 2004.

[16] Tibshirani R., Walther G., and Hastie T., Estimating the Number of Clusters in a Data Set Via the Gap Statistic, Journal of the Royal Statistical Society, vol. 63, no. 2, pp. 411-423, 2001.

[17] UCI Machine Learning Repository, available at: http://archive.ics.uci.edu/ml/datasets.html, Last Visited, 2013.

[18] Yan M. and Ye K., Determining the Number of Clusters Using the Weighted Gap Statistic, Biometrics, vol. 63, no. 4, pp. 1031-1037, 2007.

[19] Zalik K., Cluster Validity Index for Estimation of Fuzzy Clusters of Different Sizes and Densities, Pattern Recognition, vol. 43, no. 10, pp. 3374-3390, 2010.

[20] Zhao Q., Hautamaki V., and Fränti P., Knee Point Detection in BIC for Detecting the Number of Clusters, in Proceedings of International Conference on Advanced Concepts for Intelligent Vision Systems, France, pp. 664-673, 2008.

Huan Doan received his BSc degree in Mathematics from Hue University of Science, Vietnam, in 1988 and his MSc degree in Computer Science from the University of Information Technology (UIT), Vietnam National University Ho Chi Minh City (VNU-HCM), in 2012. He is currently pursuing a PhD degree in Computer Science at UIT, VNU-HCM. He is also the director of EnterSoft Software Solution Joint Stock Company, Ho Chi Minh City, Vietnam. He has published about 8 research papers in the areas of data mining, artificial intelligence, data analysis, and risk analysis in international and national conferences and journals.

Dinh Nguyen Nguyen is an Associate Professor at the Department of Information Systems, University of Information Technology (UIT), Vietnam National University Ho Chi Minh City (VNU-HCM). He received his BSc degree in Mathematics from Dalat University in 1984, his MSc degree in Information Technology from the University of Science (VNU-HCM) in 1997, and his PhD degree in Information Technology from the Institute of Information Technology (IOIT), Vietnamese Academy of Science and Technology (VAST), in 2004. He has published more than 35 research papers in the areas of databases, data mining, and data analysis in international and national conferences and journals. He is currently supervising 3 PhD students in the areas of data mining and data analysis.