The International Arab Journal of Information Technology (IAJIT)


Comparison of Dimension Reduction Techniques on High Dimensional Datasets

High-dimensional data has become very common with the rapid growth of data stored in databases and other information repositories, which makes clustering such data a pressing problem. Well-known clustering algorithms are not adequate in high-dimensional spaces because of the so-called curse of dimensionality. Dimensionality reduction techniques have therefore been used to obtain more accurate clustering results and to shorten clustering time in high-dimensional spaces. In this work, different dimensionality reduction techniques were combined with the Fuzzy C-Means clustering algorithm, with the aim of reducing the complexity of high-dimensional datasets and generating more accurate clustering results. The clustering results were compared in terms of cluster purity, cluster entropy, and mutual information. The dimension reduction techniques were compared in terms of current Central Processing Unit (CPU) usage, current memory usage, and elapsed CPU time. The experiments showed that the proposed approach produces promising results in high-dimensional space.
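The pipeline summarized above (reduce the dimensionality, cluster with Fuzzy C-Means, then score the clusters) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: PCA stands in for whichever reduction techniques the study used, the synthetic two-class dataset and all function names (`pca_reduce`, `fuzzy_c_means`, `purity_and_entropy`) are assumptions made for the example, and purity and entropy follow their common textbook definitions, which may differ in detail from those used in the paper.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its top principal components (SVD of the centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=200, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means with fuzzifier m; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    # Recompute the centers so they match the final memberships.
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U


def purity_and_entropy(labels_true, labels_pred):
    """Cluster purity and size-weighted cluster entropy (in bits)."""
    classes = np.unique(labels_true)
    n = labels_true.size
    purity, entropy = 0.0, 0.0
    for k in np.unique(labels_pred):
        members = labels_true[labels_pred == k]
        counts = np.array([(members == c).sum() for c in classes])
        p = counts[counts > 0] / members.size
        purity += counts.max() / n
        entropy += (members.size / n) * -np.sum(p * np.log2(p))
    return purity, entropy


# Synthetic data (an assumption for illustration): two Gaussian classes in 50 dimensions.
rng = np.random.default_rng(42)
n_per_class, n_features = 100, 50
X = np.vstack([
    rng.normal(0.0, 1.0, (n_per_class, n_features)),
    rng.normal(5.0, 1.0, (n_per_class, n_features)),
])
y = np.repeat([0, 1], n_per_class)

X_low = pca_reduce(X, n_components=2)            # 50 dimensions -> 2 dimensions
centers, U = fuzzy_c_means(X_low, n_clusters=2)  # soft memberships
hard_labels = U.argmax(axis=1)                   # harden memberships for scoring
purity, entropy = purity_and_entropy(y, hard_labels)
print(f"purity={purity:.3f}, entropy={entropy:.3f}")
```

Higher purity and lower entropy indicate clusters that align better with the known classes. Swapping `pca_reduce` for another reduction method while keeping the clustering and scoring steps fixed is one simple way to set up the kind of comparison the abstract describes.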

The International Arab Journal of Information Technology, Vol. 15, No. 2, March 2018

Kazim Yildiz received his Ph.D. and M.Sc. degrees in Electronic and Computer Education from Marmara University in 2014 and 2010, respectively. He received a B.Sc. degree in Computer and Control Education from Marmara University. From August 2009 to August 2015, he worked as a research assistant. Since August 2015, he has been working as an Assistant Professor in the Computer Engineering department of the Technology Faculty. His current research areas are digital image processing, high dimensional data mining and thermal imaging.

Buket Dogan received the M.S. and Ph.D. degrees in Computer-Control Education from Marmara University in 2001 and 2006, respectively. From 1999 to 2007, she worked as a research assistant. She has been working as an Assistant Professor in the Computer Engineering department of the Technology Faculty. Her research interests include data mining, image processing and adaptive web based educational systems.

Yilmaz Camurcu received the Ph.D. degree in computer education from Marmara University, Istanbul, in 1996. He is a professor of Computer Engineering in the Faculty of Engineering and Architecture at Fatih Sultan Mehmet Waqf University. He is a member of ACM. His current research interests are data mining, intelligent tutoring systems, and medical image processing.