The International Arab Journal of Information Technology (IAJIT)


An Experimental Based Study to Evaluate the Efficiency among Stream Processing Tools

With advances in internet technology, the volume of data generated every day has grown drastically. Industries such as hospitality, defense, railways, health care, social media, and education create large amounts of raw and processed data of many types, and each has its own reasons to protect its data and regard it as crucial. Such large amounts of data need space in which to be stored and secured; this is what Big Data is. A Data Stream Processing Technology (DSPT) is a core mechanism for collecting and computing over large amounts of data, and for processing raw data into information. There is a variety of DSPTs, such as Apache Spark, Flink, Kafka, Storm, Samza, Hadoop, Atlas.ti, and Cassandra. This paper compares five well-known and widely used open-source big data DSPTs (i.e., Apache Spark, Flink, Kafka, Storm, and Samza). An extensive comparison is performed based on 12 different yet interconnected criteria. A matrix was designed through which five different experiments were executed, and the comparison is based on their results. The paper summarizes an extensive study of open-source big data DSPTs using a practical experimental approach in a well-controlled environment.
