The International Arab Journal of Information Technology (IAJIT)

A Novel Algorithm for Enhancing Search Results by Detecting Dissimilar Patterns Based on Correlation Method

The dynamic collection and voluminous growth of information on the web pose great challenges for retrieving relevant information. Although most researchers have concentrated on information retrieval and web mining, their focus remains on retrieving similar patterns, leaving aside the dissimilar patterns that are likely to contain outlying data. This paper therefore concentrates on mining web content outliers, that is, extracting the dissimilar web documents from a group of documents of the same domain. Mining web content outliers indirectly helps in promoting business activities and improving the quality of search results. Existing algorithms for web content outlier mining were developed for structured documents, whereas the World Wide Web (WWW) contains mostly unstructured and semi-structured documents. There is therefore a need for a technique that mines outliers in unstructured and semi-structured document types. In this research work, a novel statistical approach based on the correlation method is developed for retrieving relevant web documents through outlier detection. In addition, the method identifies redundant web documents. Removing both redundant and outlying documents improves the quality of search results and caters to user needs. Evaluation of the correlation method using the Normalized Discounted Cumulative Gain (NDCG) measure gives results above 90%. The experimental results show that this methodology gives better results in terms of accuracy, recall and specificity than the existing methodologies.
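The abstract does not spell out how the correlation is computed, so the following is only a minimal sketch of the general idea under assumed choices. It is not the authors' algorithm. The assumptions are: documents are represented as raw term-frequency vectors over a shared vocabulary, similarity is the Pearson correlation coefficient, and the two thresholds (`redundant_at`, `outlier_below`) are hypothetical values picked for illustration. The `ndcg` helper uses one common formulation of NDCG, with linear gain and a log2 rank discount.

```python
import math
from collections import Counter


def term_vector(text, vocab):
    """Raw term-frequency vector of `text` over a fixed vocabulary (assumed representation)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (dev_x * dev_y)


def classify_documents(docs, redundant_at=0.95, outlier_below=0.1):
    """Label each document as 'relevant', 'redundant' or 'outlier'.

    Both thresholds are hypothetical; the abstract does not state any.
    """
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = [term_vector(doc, vocab) for doc in docs]
    labels = []
    for i, vec in enumerate(vectors):
        earlier = [pearson(vec, vectors[j]) for j in range(i)]
        others = [pearson(vec, vectors[j]) for j in range(len(vectors)) if j != i]
        if any(r >= redundant_at for r in earlier):
            # Near-duplicate of a document that was already kept.
            labels.append("redundant")
        elif max(others, default=1.0) < outlier_below:
            # Weakly correlated with every other document in the group.
            labels.append("outlier")
        else:
            labels.append("relevant")
    return labels


def dcg(relevances):
    """Discounted cumulative gain of graded relevances listed in rank order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    sample = [
        "web mining outlier detection",
        "web mining outlier detection",
        "web content mining outlier",
        "cooking pasta recipes tonight",
    ]
    print(classify_documents(sample))
    # → ['relevant', 'redundant', 'relevant', 'outlier']
```

In this toy group, the second document duplicates the first and the fourth shares no terms with the rest, so filtering on the labels would keep only the first and third documents. A ranking of the kept documents could then be scored with `ndcg`, given graded relevance judgments in rank order.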


The International Arab Journal of Information Technology, Vol. 14, No. 1, January 2017

Poonkuzhali Sugumaran has a distinguished career spanning nearly 15 years and is currently Professor and Head of Information Technology at Rajalakshmi Engineering College, Chennai. She obtained her PhD in Computer Science from Anna University. Her areas of specialization are Web Mining, Outlier Mining and Information Retrieval.

Kishore Kumar Ravi is currently working as an Assistant Professor in the Department of Information Technology at Rajalakshmi Engineering College, Chennai. He obtained his M.E. in Computer Science from Anna University. His areas of specialization are Web Mining, Information Retrieval and Service Oriented Computing.

Thirumurugan Shanmugam is currently working as an Assistant Professor in the Department of Information Technology at the College of Applied Science-Sohar, Oman. He obtained his PhD in Computer Science from Anna University. His areas of specialization are Network, Applied Mathematics and Software Reliability Engineering.