Received date November 27, 2014

Accepted date June 1, 2015

An Efficient Web Search Engine for Noisy Free Information Retrieval

Author 1 Department of Computer Science and Engineering, Anna University, India; 2 Department of Computer Science and Engineering, GKM College of Engineering and Technology, India

Keywords #Web content extraction #relevant information #noise data elimination #noisy data cleaner algorithm #URL pattern extractor algorithm

Abstract The rapid growth, highly dynamic nature, and low quality of much of the World Wide Web make it very difficult to retrieve relevant information from the internet during a query search. Various web mining techniques are used to address this issue. The biggest challenge in web mining is removing noisy or unwanted information from a web page, such as banners, video, audio, images, and hyperlinks that are not associated with the user query. To overcome these issues, this paper proposes a novel custom search engine built on two efficient algorithms. The proposed Uniform Resource Locator (URL) Pattern Extractor (UPE) algorithm extracts all relevant index pages from the web and ranks them based on the user query. The Noisy Data Cleaner (NDC) algorithm is then applied to remove unwanted content from the retrieved web pages. The results show that the combined UPE+NDC approach gives very promising results on different datasets, with high precision and recall compared with existing algorithms.
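The kind of noise removal described in the abstract can be illustrated with a minimal sketch. This is not the paper's NDC algorithm, whose rules are not given on this page; it is a simple tag-based filter built on Python's standard `html.parser` module. The set `NOISE_TAGS`, the class `NoiseStrippingParser`, and the function `clean_page` are hypothetical names chosen for illustration, and the choice of which tags count as noise is an assumption. The sketch does not judge whether a hyperlink is related to the user query.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as noise. The paper's actual NDC
# rules are not stated in the abstract, so this set is illustrative only.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer",
              "iframe", "video", "audio"}


class NoiseStrippingParser(HTMLParser):
    """Collects text that is not nested inside any noise element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # how many noise elements are currently open
        self.chunks = []       # text fragments kept as content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when no noise element encloses it.
        if self.noise_depth == 0 and text:
            self.chunks.append(text)


def clean_page(html):
    """Return the text of `html` with content inside noise tags removed."""
    parser = NoiseStrippingParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `clean_page` on a page whose body holds a navigation bar, one paragraph, a script, an image, and a footer returns only the paragraph text. Images carry no text, so they drop out without a dedicated rule.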

The International Arab Journal of Information Technology, Vol. 15, No. 3, May 2018

Pradeep Sahoo received his B.Tech from The Institution of Engineer India, Calcutta, India in 2000. He completed his M.Tech in Computer Science and Engineering at Anna University Chennai, Tamilnadu (India), where he is pursuing his Doctorate degree. He is currently serving as Associate Professor in the Computer Science & Engineering Department at Sai Ram Engineering College, Chennai, Tamilnadu, India. He has participated in a total of 8 national conferences, workshops, and seminars at various institutions in India. He has published 4 papers in international journals and 3 in national journals. His research areas are Data Mining in Pattern Recognition and Content Extraction, and Software Engineering. He holds memberships in ISTE, CSI, IAENG, and IACSIT.

Rajagopalagn Parthasarthy received his Master degree in Applied Mathematics from IIT Madras and obtained his PhD in Computer Science from the University of Madras. He has 40 years of teaching experience at various institutions in India. He is a recognized research supervisor for Anna University, Dr. MGR University, Vels University, Mother Terasa University, and the University of Madras, Tamilnadu, India. He has successfully guided around 25 scholars to their PhD and more than 169 scholars to their M.Phil degree. He has served as Faculty, Visiting Professor, Project Co-ordinator, Resource Person, and Panel Member for various academic institutions and government organizations in India. He is currently serving as a veteran at the Research and Development Cell in the Department of CSE at GKM College of Engineering and Technology, Chennai, India. He has held many administrative positions, such as Chairman, Principal, President, Director, Chief Guest, Dean, Head of Department, Advisor, Member, Convener, and Subject Expert, in various organizations in India. He has written 13 books and published 84 papers in international and national journals in his area. He has received the Life Time Achievement Award, Seva Ratna Award, Distinguished Educationist and A Person of Eminence, Seer Seyai Maamani Award, A Living Legend, Educationalist Born-Noble, and Best Teacher Award from various government and private organizations and institutions. His fields of interest and specialization are Quantitative Techniques, Data Processing and Project Management, Management Information Systems, Programming Languages, Simulation, Text Generation, Cryptography, and Data Mining.