The International Arab Journal of Information Technology (IAJIT)



Focused Crawler Based on Reinforcement Learning and Decaying Epsilon-Greedy Exploration Policy

General search engines serve a diverse user base by returning results across a wide range of topics and content categories on the Internet. Focused Crawlers (FC), in contrast, deliver specialized, targeted results within a particular domain or vertical. The performance of the focused crawler is critical to a vertical search engine, and several approaches to improving it have been applied. We propose an intelligent focused crawler that uses Reinforcement Learning (RL) to prioritize hyperlinks for long-term profit. Our implementation differs from other RL-based works in that it encourages learning at an early stage by using a decaying ϵ-greedy policy to select the next link, which enables the crawler to use the experience it gains to improve its performance and retrieve more relevant pages. As infertility rates rise worldwide, many people need information about infertility issues and the artificial reproduction treatments available. We therefore take the infertility domain as a case study and collect web pages from scratch. We compare the performance of crawling tasks that follow ϵ-greedy and decaying ϵ-greedy policies. Experimental results show that crawlers following a decaying ϵ-greedy policy perform better.
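
As an illustration of the link-selection idea described above, the following is a minimal Python sketch of a decaying ϵ-greedy choice over a crawl frontier. It is not the authors' implementation: the exponential decay schedule, the parameter values, the function names `decayed_epsilon` and `select_link`, and the example URLs and value estimates are all assumptions made for the example. The abstract states only that ϵ decays so that early learning is encouraged.

```python
import random


def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=0.99):
    """Return epsilon for a given crawl step.

    Assumed schedule: exponential decay from eps_start, floored at eps_min.
    """
    return max(eps_min, eps_start * decay ** step)


def select_link(frontier, epsilon, rng=random):
    """Pick the next URL from frontier, a dict of URL -> estimated value.

    With probability epsilon the crawler explores (uniform random URL);
    otherwise it exploits (URL with the highest estimated value).
    """
    if not frontier:
        raise ValueError("frontier is empty")
    if rng.random() < epsilon:
        return rng.choice(list(frontier))
    return max(frontier, key=frontier.get)


# Hypothetical frontier with made-up long-term value estimates.
frontier = {
    "https://example.org/ivf": 0.82,
    "https://example.org/news": 0.10,
    "https://example.org/clinics": 0.55,
}

rng = random.Random(0)  # seeded only to make the demo repeatable
for step in (0, 100, 1000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), select_link(frontier, eps, rng))
```

Under this assumed schedule the crawler explores almost uniformly at first and shifts toward the highest-valued link as ϵ approaches its floor. A fixed ϵ-greedy policy corresponds to passing the same `epsilon` to `select_link` at every step.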
