..............................
..............................
..............................
A MapReduce-based Quick Search Approach on Large Files
String search is an important branch of pattern matching for information retrieval in various fields. In the past four
decades, the research importance has been attached on skipping more unnecessary characters to improve the search
performance, and never taken into consideration on large scale of data. In this paper, two major achievements are contributed.
At first, we propose a Quick Search algorithm for data Stream (QSS) on a single machine to support string search in a large
text file, as opposed to previous researches that limits to a bound memory. For the next, we implement the search algorithm on
MapReduce framework to improve the velocity of retrieving the search results. The experiments demonstrate that our
approach is fast and effective for large files.
[1] Adjeroh D., Bell T., and Mukherjee A., Pattern Matching in Compressed Texts and Images, Now Publishers Incorporated, 2013.
[2] Al-Ssulami A., “Hybrid String Matching Algorithm with A Pivot,” Journal of Information Science, vol. 41, no. 1, pp. 82-88, 2015.
[3] Apostolico A. and Crochemore M., “Optimal Canonization of all Substrings of a String,” Information and Computation, vol. 95, no. 1, pp. 76-95, 1991.
[4] Baeza-Yates R. and Gonnet G., “A New Approach to Text Searching,” Communications of the ACM, vol. 35, no. 10, pp. 74-82, 1992.
[5] Boyer R. and Moore J., “A Fast String Searching Algorithm,” Communications of the ACM, vol. 20, no. 10, pp. 762-772, 1977.
[6] Charras C. and Lecroq T., Handbook of Exact String Matching Algorithms, Colleague Publications, 2004.
[7] Chen L., Cheung D., and Yiu S., “Approximate String Matching in DNA Sequences,” in Proceedings of 8th International Conference on Database Systems for Advanced Applications, Kyoto, pp. 303-310, 2003.
[8] Coit C., Staniford S., and McAlerney J., “Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort,” in Proceedings of DARPA Information Survivability Conference and Exposition, Anaheim, pp. 367- 373, 2001.
[9] Crochemore M. and Rytter W., Text Algorithms, Oxford University Press, 1994.
[10] Dean J. and Ghemawat S., “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[11] Faro S. and Lecroq T., “The Exact Online String Matching Problem: A Review of the Most Recent Results,” ACM Computing Surveys (CSUR), vol. 45, no. 2, pp. 1-42, 2013.
[12] Franek F., Jennings C., and Smyth W., “A simple fast Hybrid Pattern-Matching Algorithm,” in Proceedings of Annual Symposium on Combinatorial Pattern Matching, Jeju Island, pp. 288-297, 2005.
[13] Fredriksson K. and Grabowski S., “Average- Optimal String Matching,” Journal of Discrete Algorithms, vol. 7, no. 4, pp. 579-594, 2009.
[14] Gagie T., Gawrychowski P., Kärkkäinen J., Nekrich Y., and Puglisi S., “LZ77-based Self- Indexing with Faster Pattern Matching,” in Proceedings of Latin American Symposium on Theoretical Informatics, Montevideo, pp. 731- 742, 2014.
[15] Horspool R., “Practical Fast Search in Strings,” Software: Practice and Experience, vol. 10, no. 6, pp. 501-506, 1980.
[16] Karp R. and Rabin M., “Efficient Randomized Pattern-Matching Algorithms,” IBM Journal of Research and Development, vol. 31, no. 2, pp. 249-260, 1987.
[17] Knuth D., Morris J., and Pratt V., “Fast Pattern A MapReduce-based Quick Search Approach on Large Files 797 Matching in Strings,” SIAM Journal of Computing, vol. 6, no. 2, pp. 323-350, 1977.
[18] Kofahi N. and Abusalama A., “A Framework for Distributed Pattern Matching Based on Multithreading,” The International Arab Journal of Information Technology, vol. 9, no. 1, pp. 30- 38, 2012.
[19] Kouzinopoulos C., Michailidis P., and Margaritis K., “Multiple String Matching on A GPU Using Cudas,” Scalable Computing: Practice and Experience, vol. 16, no. 2, pp. 121-138, 2015.
[20] Lin J., Adjeroh D., and Jiang Y., “A Faster Quick Search Algorithm,” Algorithms, vol. 7, no. 2, pp. 253-275, 2014.
[21] Smyth W., Computing Patterns in Strings, Addison-Wesley, 2003.
[22] Sunday D., “A Very Fast Substring Search Algorithm,” Communications of the ACM, vol. 33, no. 8, pp. 132-142, 1990.
[23] White T., Hadoop: the Definitive Guide, O’Reilly Media Inc., 2009. Ye-feng Li received his Ph.D degree from Donghua University in 2015. Currently he is a postdoctoral researcher in the college of computer science in Beijing University of Technology. His research major focuses on the big- data processing and information security technologies. Jia-jin Le is professor and Ph.D supervisor in the college of Computer Science in Donghua University. He is also a senior member of China Computer Federation. His research interests include database and data warehouse, software engineering theory and practice. Mei Wang is a professor and M.S. supervisor in the college of Computer Science in Donghua University. Her primary research interests include database, image semantic analysis, and information retrieval.