The International Arab Journal of Information Technology (IAJIT)

..............................
..............................
..............................


Improved Streaming Quotient Filter: A Duplicate Detection Approach for Data Streams

The unprecedented development and popularization of the Internet, combined with the emergence of a variety of modern applications, such as search engines, online transactions, climate warning systems and so on, enables the worldwide storage of data to grow unprecedented. Efficient storage, management and processing of such huge amounts of data has become an important academic research topic. The detection and removal of duplicate and redundant data from such multi- trillion data, while ensuring resource and computational efficiency, has constituted a challenging area of research.Because of the fact that all the data of potentially unbounded data streams can not be stored, and the need to delete duplicated data as accurately as possible, intelligent approximate duplicate data detection algorithms are urgently required. Many well-known methods based on the bitmap structure, Bloom Filter and its variants are listed in the literature. In this paper, we propose a new data structure, Improved Streaming Quotient Filter (ISQF), to efficiently detect and remove duplicate data in a data stream. ISQF intelligently stores the signatures of elements in a data stream, while using an eviction strategy to provide near zero error rates. We show that ISQF achieves near optimal performance with fairly low memory requirements, making it an ideal and efficient method for repeated data detection. It has a very low error rate. Empirically, we compared ISQF with some existing methods (especially Steaming Quotient Filter (SQF)). The results show that our proposed method outperforms theexisting methods in terms of memory usage and accuracy.We also discuss the parallel implementation of ISQF.


[1] Alon N., Matias Y., and Szegedy M., “The Space Complexity of Approximating the Frequency Moments,” in Proceedings of 28th Annual ACM Symposium on Theory of Computing, Philadelpia, pp. 20-29, 1996.

[2] Babcock B., Babu S., Datar M., Motwani R., and Widom J., “Models and Issues in Data Stream Systems,” in Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Madison, pp. 1- 16, 2002.

[3] Babcock B., Datar M., and Motwani R., “Load Shedding for Aggregation Queries over Data Streams,” in Proceedings of 20th International Conference on Data Engineering, Boston, pp. 350-361, 2004.

[4] Baboescu F. and Varghese G., “Scalable Packet Classification,” in Proceedings of Applications, Technologies, Architectures, and Protocols for Computers Communications, San Diego, pp. 199- 210, 2001.

[5] Bender M., Farach-Colton M., Johnson R., Kraner R., Kuszmaul B., Medjedovic D., Montes P., Shetty P., Spillane R., and Zadok E., “Don't Thrash: How to Cache Your Hash on Ash,” in Proceedings of Ceedings of the VLDB Endoment, Istanbul, pp. 1627-1637, 2012. 012345678m 0.0005 0.001 0.0015 0.002 0.0025 0.003 0.0035 Error rate(%) 776 The International Arab Journal of Information Technology, Vol. 17, No. 5, September 2020

[6] Bloom B., “Space/Time Trade-offs in Hash Coding with Allowable Errors,” Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.

[7] Broder A. and Mitzenmacher M., “Network Applications of Bloom Filters: A Survey,” Internet Mathematics, vol. 1, no. 4, pp. 485-509, 2004.

[8] Borg M., Runeson P., Johansson J., and Mäntylä M., “A Replicated Study on Duplicate Detection: Using Apache Lucene to Search Among Android Defects,” in Proceedings of 8th International Symposium on Empirical Software Engineering and Measurement, Torino, pp. 1-4, 2014.

[9] Chen Y., Kumar A., and Xu J., “A New Design of Bloom Filter for Packet Inspection Speedup,” in Proceedings of 50th Annual IEEE Global Telecommunications Conference, GLOBECOM, Washington, pp. 1-5, 2007.

[10] Chowdhury A., Frieder O., Grossman D., and McCabe M., “Collection Statistics for Fast Duplicate Document Detection,” ACM Transactions on Information Systems, vol. 20, no. 2, pp. 171-191, 2002.

[11] Conrad J., Guo X., and Schriber C., “Online Duplicate Document Detection: Signature Reliability in A Dynamic Retrieval Environment,” in Proceedings of the 12th International Conference on Information and knowledge Management, New Orleans, pp. 443- 452, 2003.

[12] Cormode G. and Muthukrishnan S., “An Improved Data Stream Summary: The Count Min Sketch and Its Applications,” Journal of Algorithms, vol. 55, no. 1, pp. 58-75, 2005.

[13] Deng F. and Rafiei D., “Approximately Detecting Duplicates for Streaming Data Using Stable Bloom Filters,” in Proceedings of the ACM SIGMOD International Conference on Managment of Data, Chicago, pp. 25-36, 2006.

[14] Dharmapurikar S., Krishnamurthy P., and Taylor D., “Longest Prefix Matching Using Bloom Filters,” IEEE/ACM Transactions on Networking, vol. 14, no. 2, pp. 397-409, 2006.

[15] Dharmapurikar S., Krishnamurthy P., Sproull T., and Lockwood J., “Deep Packet Inspection Using Parallel Bloom Filters,” IEEE Micro, vol. 24, no. 1, pp. 52-61, 2004.

[16] Dutta S., Bhattacherjee S., and Narang A., “Twards “Intelligent Compression” in Streams: A Biased Reservoir Sampling Based Bloom Filter Approach,” in Proceedings of the 15th Interntional Conference on Extending Database Technology, Berlin, pp. 228-238, 2012.

[17] Dutta S., Narang A., and Bera S., “Streaming Quotient Filter: A Near Optimal Approximate Duplicate Detection Approach for Data Streams,” Proceedings of the VLDB Endowment, vol. 6, no. 8, pp. 589-600, 2013.

[18] Fan L., Cao P., Almeida J., and Broder A., “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” Computer Communiction Review, vol. 28, no. 4, pp. 254- 265, 1998.

[19] Feng W., Kandlur D., Saha D., and Shin K., “Stochastic Fair Blue: A Queue Management Algorithm for Enforcing Fairness,” in Proceedings of 20th Annual Joint Conference of the IEEE Computer and Communications Societies, Anchorage, pp. 1520-1529, 2001.

[20] Garcia-Molina H., Ullman J., and Widom J., Database System Implementation, Prentice Hall, 2000.

[21] Gehrke J., Korn F., and Srivastava D., “On Computing Correlated Aggregates over Continual Data Streams,” in Proceedings of ACM SIGMOD International Conference on Management of Dta, Santa Barbara, pp. 13-24, 2001.

[22] Golab L., DeHaan D., Demaine E., Lpez-Ortiz A., and Munro J., “Identifying Frequent Items in Sliding Windows over On-Line Packet Streams,” in Proceedings of the 3rd ACM SIGCOMM Coference on Internet Measurement, Miami Beach, pp. 173-178, 2003.

[23] Gupta P. and McKeown N., “Packet Classification on Multiple Fields,” in Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, NY, pp. 147-160, 1999.

[24] Køien G., “A Brief Survey of Nonces and Nonce Usage,” in Proceedings of the 9th International Conference on Emerging Security Information, Systems and Technologies, Iaria Xps Press, pp. 85-91, 2015.

[25] Kumar A., Xu J., Wang J., Spatschek O., and Li L., “Space-Code Bloom Filter for Efficient PerFlow Traffic Measurement,” in Proceedings of IEEE INFOCOM Conference on Computer Communications 20 th Annual Joint Conference of the IEEE Computer and Communications Socities, Hongkong, pp. 1762-1773, 2004.

[26] Lee D. and Hull J., “Duplicate Detection for Symbolically Compressed Documents,” in Prceedings of 5th International Conference on Document Analysis and Recognition, Banglore, pp. 305-308, 1999.

[27] Little M., Shrivastava S., and Speirs N., “Using Bloom Filters to Speed-up Name Lookup in Ditributed Systems,” Computer Journal, vol. 45, no. 6, pp. 645-652, 2002.

[28] Mitzenmacher M., “Compressed Bloom Filters,” IEEE/ACM Transactions on Networking, vol. 10, no. 5, pp. 604-612, 2002.

[29] Reiter M., Anupam V., and Mayer A., “Detecting Hit Shaving in Click-Through Payment Improved Streaming Quotient Filter: A Duplicate Detection Approach for Data Streams 777 Schemes,” in Proceedings of the 3rd Conference on USENIX Workshop on Electronic Commerce, Boston, pp. 155-166, 1998.

[30] Shen H. and Zhang Y., “Improved Approximate Detection of Duplicates for Data Streams over Sliding Windows,” Journal of Computer Science and Technology, vol. 23, no. 6, pp. 973-987, 2008.

[31] Song H., Dharmapurikar S., Turner J., and Lockwood J., “Fast Hash Table Lookup Using Extended Bloom Filter: An Aid to Network Processing,” in Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Achitectures, and Protocols for Computer Comunication, Philadelphia, pp. 181- 192, 2005. Shiwei Che is currently a Ph.D. candidate in the Department of Computer Science and Technology, Harbin Engineering University. He received his M.E. degree in 2010 from the Department of Computer Science and Technology of Xinjiang University, Xinjiang, China. His main research interests include social networks and community detection. Wu Yang received a Ph.D. degree in Computer System Architecture Specialty of Computer Science and Technology School from Harbin Institute of Technology. He is currently a professor and doctoral supervisor of Harbin Engineering University. His main research interests include wireless sensor network, peer-to-peer network and information security. He is a member of ACM and senior member of CCF. Wei Wang received a Ph.D. degree in Computer System Architecture Specialty of Computer Science and Technology School from Harbin Institute of Technology. He is currently an professor in Harbin Engineering University. His main research interests include social networks and community detection.