The International Arab Journal of Information Technology (IAJIT)



An Improved Clustering Algorithm for Text Mining: Multi-Cluster Spherical K-Means

Thanks to advances in information and communication technologies, the amount of information produced, particularly in the form of text documents, has increased markedly. To deal effectively with this “information explosion” problem and to utilize the huge volume of text databases, efficient and scalable tools and techniques are indispensable. This study addresses text clustering, one of the most important techniques of text mining, which aims at extracting useful information by processing data in textual form. An improved variant of the spherical K-Means (SKM) algorithm, named multi-cluster SKM, is developed for clustering high-dimensional document collections with high performance and efficiency. Experiments on several document data sets show that the new algorithm provides a significant increase in clustering quality without a considerable difference in CPU time usage compared to the SKM algorithm.
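For orientation, the baseline SKM algorithm that the study improves on can be sketched as follows. This is a minimal illustration of standard spherical K-Means only: the multi-cluster variant is not specified in this abstract and is not reproduced here. The function names, the random-document initialisation, and the dense list representation are assumptions made for the example, not the authors' implementation.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, max_iter=100, seed=0):
    """Baseline spherical K-Means on dense term-weight vectors.

    Returns (labels, centroids). Documents and centroids are kept at unit
    length, so similarity is measured by the cosine of the angle between them.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Assumption: initialise the centroids with k distinct random documents.
    centroids = [list(docs[i]) for i in rng.sample(range(len(docs)), k)]
    labels = [-1] * len(docs)
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: dot(d, centroids[j])) for d in docs
        ]
        if new_labels == labels:
            break  # no document changed cluster, so the partition is stable
        labels = new_labels
        # Update step: sum the members of each cluster, then renormalise.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


# Hypothetical toy data: two documents dominated by term 0, two by term 1.
toy_docs = [[1.0, 0.0, 0.0], [2.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 3.0, 0.2]]
toy_labels, toy_centroids = spherical_kmeans(toy_docs, k=2)
```

Because every vector is normalised, document length does not influence the assignment; only the direction of the term-weight vector matters, which is why this family of algorithms suits sparse, high-dimensional text data.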


Volkan Tunali received the BSc and MSc degrees in computer engineering from Marmara University, Istanbul in 2001 and 2005 respectively. He received the PhD degree in computer and control education from Marmara University in 2012. He became Assistant Professor of the Software Engineering Department at Maltepe University in 2012. His research interests include data mining and knowledge discovery, text mining, information retrieval, and natural language processing. He is a member of ACM.

Turgay Bilgin received the BSc and PhD degrees in computer and control education from Marmara University, Istanbul in 2001 and 2007 respectively. His doctoral thesis was on the mining of high dimensional datasets. He became Assistant Professor of the Software Engineering Department at Maltepe University in 2008. His research interests are high dimensional data mining, web mining, service oriented architecture and web services. He is a member of ACM.

Ali Camurcu received the PhD degree in computer education from Marmara University, Istanbul in 1996. His current research interests are data mining, intelligent tutoring systems, and medical image processing. He is a professor of Computer Engineering in the Faculty of Engineering and Architecture at Fatih Sultan Mehmet Waqf University. He is a member of ACM.