The International Arab Journal of Information Technology (IAJIT), Vol. 14, No. 3, May 2017


Semantic Similarity based Web Document Classification Using Support Vector Machine

With the rapid growth of information on the World Wide Web (WWW), classification of web documents has become important for efficient information retrieval. The relevance of retrieved information can also be improved by considering the semantic relatedness between words, a basic research area in natural language processing, intelligent retrieval, document clustering and classification, and word sense disambiguation. Semantic relationships derived through a web search engine from a huge web corpus can improve the classification of documents. This paper proposes an approach for web document classification that exploits information from both page counts and snippets. To identify the semantic relations between the query words, a lexical pattern extraction algorithm is applied to the snippets. A sequential pattern clustering algorithm is used to form clusters of different patterns. The page count based measures are combined with the clustered patterns to define the features extracted from the word pairs. These features are used to train a Support Vector Machine (SVM) to classify the web documents. Experimental results demonstrate improvements of 5% and 9% in the F1 measure of the classifier on the Reuters-21578 and 20 Newsgroups datasets.
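The feature construction outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the page count based measures, so Jaccard, Dice and pointwise mutual information are assumed here as common choices. The function names, the `max_gap` limit, the `n_pages` index size and the `pattern_vocab` list are all hypothetical. Pattern clustering is omitted: a fixed list of patterns stands in for the clustered patterns, and the resulting vector is what would be passed to an SVM trainer.

```python
import math
import re


def web_jaccard(h_x, h_y, h_xy):
    """Jaccard coefficient from page counts H(X), H(Y) and H(X AND Y)."""
    denom = h_x + h_y - h_xy
    return h_xy / denom if denom > 0 else 0.0


def web_dice(h_x, h_y, h_xy):
    """Dice coefficient from the same three page counts."""
    denom = h_x + h_y
    return 2 * h_xy / denom if denom > 0 else 0.0


def web_pmi(h_x, h_y, h_xy, n_pages):
    """Pointwise mutual information; n_pages is an assumed index size."""
    if min(h_x, h_y, h_xy) <= 0:
        return 0.0
    p_x, p_y, p_xy = h_x / n_pages, h_y / n_pages, h_xy / n_pages
    return math.log2(p_xy / (p_x * p_y))


def lexical_patterns(snippet, word_x, word_y, max_gap=3):
    """Replace the query words with X and Y, then return every token span
    that runs from one placeholder to the other with at most max_gap
    tokens in between."""
    tokens = re.findall(r"\w+", snippet.lower())
    wx, wy = word_x.lower(), word_y.lower()
    tokens = ["X" if t == wx else "Y" if t == wy else t for t in tokens]
    patterns = []
    for i, tok in enumerate(tokens):
        if tok not in ("X", "Y"):
            continue
        other = "Y" if tok == "X" else "X"
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] == other:
                patterns.append(" ".join(tokens[i:j + 1]))
                break
            if tokens[j] == tok:  # same placeholder again: restart from there
                break
    return patterns


def feature_vector(h_x, h_y, h_xy, n_pages, snippets, word_x, word_y,
                   pattern_vocab):
    """Concatenate page count measures with per-pattern frequencies.
    pattern_vocab is a hypothetical stand-in for the clustered patterns."""
    counts = {p: 0 for p in pattern_vocab}
    for snippet in snippets:
        for p in lexical_patterns(snippet, word_x, word_y):
            if p in counts:
                counts[p] += 1
    measures = [
        web_jaccard(h_x, h_y, h_xy),
        web_dice(h_x, h_y, h_xy),
        web_pmi(h_x, h_y, h_xy, n_pages),
    ]
    return measures + [counts[p] for p in pattern_vocab]
```

For a word pair such as "ostrich" and "bird", the snippet "The ostrich is a large bird that lives in Africa" yields the pattern `X is a large Y` under these assumptions. In this sketch the pattern frequencies are appended after the three page count measures, so each word pair is described by one fixed-length vector.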

Kavitha Chinniyan is an Assistant Professor (Senior Grade) in the Department of Computer Science and Engineering at PSG College of Technology, India. She is pursuing research in semantics in large-scale distributed systems. Her areas of interest include semantic web technology, parallel processing and data structures. She has published 5 papers in refereed journals and 4 papers in conferences.

Sudha Gangadharan is a Professor in the CSE Department of PSG College of Technology. She has 20 years of teaching experience. Her areas of interest include distributed systems and software engineering. She has published 5 books, 30 papers in refereed journals and 32 papers in national and international conferences. She has coordinated two AICTE-RPS projects in the area of distributed computing. She is the coordinator of PSG-Yahoo research in grid and cloud computing, Nokia research on big data analytics and Xurmo research in social networking.

Kiruthika Sabanaikam is a postgraduate student of ME Software Engineering in the Department of Computer Science and Engineering at PSG College of Technology, India. Her areas of interest are data mining and semantic web technology.