
Received date February 10, 2013

Accepted date March 17, 2014

An Improved Clustering Algorithm for Text Mining: Multi-Cluster Spherical K-Means


Abstract: Thanks to advances in information and communication technologies, there has been a marked increase in the amount of information produced, specifically in the form of text documents. To deal effectively with this “information explosion” problem and make use of the huge volume of text databases, efficient and scalable tools and techniques are indispensable. This study addresses text clustering, one of the most important techniques of text mining, which aims to extract useful information by processing data in textual form. An improved variant of the spherical K-Means (SKM) algorithm, named multi-cluster SKM, is developed for clustering high-dimensional document collections with high performance and efficiency. Experiments on several document data sets show that the new algorithm provides a significant increase in clustering quality without a considerable difference in CPU time usage compared to the SKM algorithm.
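For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of standard spherical K-Means as it is commonly described (unit-length document vectors, cosine similarity, re-normalized centroids), not the multi-cluster variant, whose details the abstract does not give. The function name `spherical_kmeans`, its parameters, the random initialization from existing documents, and the handling of empty clusters are all assumptions made for this sketch.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Baseline spherical k-means on the rows of X (one document vector per row).

    Hypothetical helper for illustration; not the paper's multi-cluster SKM.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Scale every document to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)

    # Assumption: start from k distinct documents chosen at random.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].copy()

    for _ in range(n_iter):
        # Assign each document to the most similar centroid.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            length = np.linalg.norm(total)
            if length > 0:
                # New centroid: sum of member vectors, projected back onto the unit sphere.
                centroids[j] = total / length
            # Otherwise (empty cluster) keep the previous centroid.

    # Final assignment against the last centroids.
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids
```

Calling `spherical_kmeans(X, k)` on a document-term matrix `X` (for example, TF-IDF weights) returns one cluster label per document together with the unit-length centroids. Like other K-Means variants, the result depends on initialization and may be a local optimum.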

Volkan Tunali received the BSc and MSc degrees in computer engineering from Marmara University, Istanbul in 2001 and 2005 respectively. He received the PhD degree in computer and control education from Marmara University in 2012. He became Assistant Professor of the Software Engineering Department at Maltepe University in 2012. His research interests include data mining and knowledge discovery, text mining, information retrieval, and natural language processing. He is a member of ACM.

Turgay Bilgin received the BSc and PhD degrees in computer and control education from Marmara University, Istanbul in 2001 and 2007 respectively. His doctoral thesis was on the mining of high dimensional datasets. He became Assistant Professor of the Software Engineering Department at Maltepe University in 2008. His research interests are high dimensional data mining, web mining, service oriented architecture and web services. He is a member of ACM.

Ali Camurcu received the PhD degree in computer education from Marmara University, Istanbul in 1996. His current research interests are data mining, intelligent tutoring systems, and medical image processing. He is a professor of Computer Engineering in the Faculty of Engineering and Architecture at Fatih Sultan Mehmet Waqf University. He is a member of ACM.