Downloads 776

Views 2k

Cited by 13

Received date April 20, 2017

Accepted date December 18, 2017

Privacy-Preserving for Distributed Data Streams: Towards l-Diversity

Author Mona Mohamed, Sahar Ghanem, and Magdy Nagi

Keywords #k-anonymity #l-diversity #data streams and clustering

Abstract Privacy-preserving data publishing has been studied widely for static data. However, many recent applications generate data streams that are real-time, unbounded, rapidly changing, and distributed in nature. Recently, a few works have addressed k-anonymity and l-diversity for data streams. Their models imply that if the stream is distributed, it is collected at a central site for anonymization. In this paper, we propose a novel distributed model in which distributed streams are first anonymized by distributed (collecting) sites before merging and releasing. Our approach extends Continuously Anonymizing STreaming data via adaptive cLustEring (CASTLE), a cluster-based approach that provides both k-anonymity and l-diversity for centralized data streams. The main idea is for each site to construct its local clustering model and exchange this local view with other sites, so that all sites globally construct approximately the same clustering view. The approach is heuristic in the sense that not every update to the local view is sent; instead, triggering events are selected for exchanging cluster information. Extensive experiments on a real data set are performed to study the introduced Information Loss (IL) under different settings. First, the impact of the different parameters on IL is quantified. Then k-anonymity and l-diversity are compared in terms of messaging cost and IL. Finally, the effectiveness of the proposed distributed model is studied by comparing the introduced IL to the IL of the centralized model (as a lower bound) and to that of a distributed model with no communication (as an upper bound). The experimental results show that the main contributing factor to IL is the number of attributes in the quasi-identifier (50%-75%), whereas the number of sites contributes only about 1%, which demonstrates the scalability of the proposed approach. In addition, providing l-diversity is shown to introduce about a 25% increase in IL compared to k-anonymity. Moreover, a 35% reduction in IL is achieved at a messaging cost (in bytes) of about 0.3% of the data set size.
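To make the quantities in the abstract concrete, the following is a minimal Python sketch of how a single cluster could be checked for k-anonymity and l-diversity, and how its information loss might be scored. It is an illustration under stated assumptions, not the authors' implementation: the `Cluster` class, the "distinct values" reading of l-diversity, and the range-based `information_loss` formula (mean generalized range width divided by domain width, for numeric attributes only) are hypothetical choices made here, since the abstract does not define them.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster; each tuple is (quasi-identifier values, sensitive value)."""
    tuples: list = field(default_factory=list)

    def add(self, qi, sensitive):
        self.tuples.append((tuple(qi), sensitive))

    def is_k_anonymous(self, k):
        # k-anonymity: at least k tuples share the cluster's generalization.
        return len(self.tuples) >= k

    def is_l_diverse(self, l):
        # Assumed "distinct" l-diversity: at least l different sensitive values.
        return len({s for _, s in self.tuples}) >= l

    def generalization(self):
        # Generalize each numeric attribute to the [min, max] range it spans.
        columns = zip(*(qi for qi, _ in self.tuples))
        return [(min(col), max(col)) for col in columns]


def information_loss(cluster, domains):
    """Assumed IL: mean over attributes of generalized width / domain width."""
    ranges = cluster.generalization()
    losses = [
        (hi - lo) / (d_hi - d_lo)
        for (lo, hi), (d_lo, d_hi) in zip(ranges, domains)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    c = Cluster()
    c.add((25, 40), "flu")
    c.add((35, 50), "cold")
    c.add((30, 45), "flu")
    domains = [(0, 100), (0, 100)]  # assumed attribute domains
    print(c.is_k_anonymous(3), c.is_l_diverse(2), information_loss(c, domains))
```

Under these assumptions, adding quasi-identifier attributes adds one more range term per cluster, which is consistent in spirit with the reported sensitivity of IL to the size of the quasi-identifier. The sketch assumes a non-empty cluster and non-degenerate domains.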

References

[1] Aggarwal C. and Yu P., “A Condensation Approach to Privacy Preserving Data Mining,” in Proceedings of International Conference on Extending Database Technology, Heraklion, pp. 183-199, 2004.

[2] Aldeen Y., Salleh M., and Aljeroudi Y., “An Innovative Privacy Preserving Technique for Incremental Datasets on Cloud Computing,” Journal of Biomedical Informatics, vol. 62, pp. 107-116, 2016.

[3] Bayardo R. and Agrawal R., “Data Privacy through Optimal k-Anonymization,” in Proceedings of 21st International Conference on Data Engineering, Tokyo, pp. 217-228, 2005.

[4] Cao J., Carminati B., Ferrari E., and Tan K., “CASTLE: Continuously Anonymizing Data Streams,” IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 3, pp. 337-352, 2011.

[5] Evfimievski A., Srikant R., Agrawal R., and Gehrke J., “Privacy Preserving Mining of Association Rules,” Information Systems, vol. 29, no. 4, pp. 343-364, 2004.

[6] Fung B., Wang K., and Yu P., “Top-Down Specialization for Information and Privacy Preservation,” in Proceedings of International Conference on Data Engineering, Tokyo, pp. 205-216, 2005.

[7] Goryczka S., Xiong L., and Sunderam V., “Secure Multiparty Aggregation with Differential Privacy: A Comparative Study,” in Proceedings of the Joint EDBT/ICDT Workshops, Genoa, pp. 155-163, 2013.

[8] Guo K. and Zhang Q., “Fast Clustering-Based Anonymization Approaches with Time Constraints for Data Streams,” Knowledge-Based Systems, vol. 46, pp. 95-108, 2013.

[Figure: information loss for the Centralized and Distributed w/o comm. models; panels l=6 (x-axis n), n=4 (x-axis l), n=8 (x-axis l).]

[9] LeFevre K., DeWitt D., and Ramakrishnan R., “Incognito: Efficient Full-Domain k-Anonymity,” in Proceedings of ACM SIGMOD International Conference on Management of Data, Baltimore, pp. 49-60, 2005.

[10] LeFevre K., DeWitt D., and Ramakrishnan R., “Mondrian Multidimensional k-Anonymity,” in Proceedings of IEEE 22nd International Conference on Data Engineering, Atlanta, pp. 25, 2006.

[11] Li J., Ooi B., and Wang W., “Anonymizing Streaming Data for Privacy Protection,” in Proceedings of 24th International Conference on Data Engineering, Cancun, pp. 1367-1369, 2008.

[12] Li N., Li T., and Venkatasubramanian S., “t-Closeness: Privacy Beyond K-Anonymity and L-Diversity,” in Proceedings of 23rd IEEE International Conference on Data Engineering, Istanbul, pp. 106-115, 2007.

[13] Lichman M., UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, Last Visited, 2013.

[14] Machanavajjhala A., Kifer D., Gehrke J., and Venkitasubramaniam M., “l-diversity: Privacy Beyond K-Anonymity,” in Proceedings of 22nd International Conference on Data Engineering, Atlanta, pp. 24-35, 2006.

[15] Mohamed M., Nagi M., and Ghanem S., “A Clustering Approach for Anonymizing Distributed Data Streams,” in Proceedings of 11th IEEE International Conference on Computer Engineering and Systems, Cairo, pp. 9-16, 2016.

[16] Mohammadian E., Noferesti M., and Jalili R., “FAST: Fast Anonymization of Big Data Streams,” in Proceedings of International Conference on Big Data Science and Computing, Beijing, pp. 23-30, 2014.

[17] Sweeney L., “Achieving k-Anonymity Privacy Protection Using Generalization and Suppression,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 571-588, 2002.

[18] Sweeney L., “K-anonymity: A Model for Protecting Privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002.

[19] Victor N., Lopez D., and Abawajy J., “Privacy Models For Big Data: A Survey,” International Journal on Big Data Intelligence, vol. 3, no. 1, pp. 61-75, 2016.

[20] Wang W., Li J., Ai C., and Li Y., “Privacy Protection on Sliding Window of Data Streams,” in Proceedings of International Conference on Collaborative Computing: Networking, Applications and Worksharing, New York, pp. 213-221, 2007.

[21] Wang P., Lu J., Zhao L., and Yang J., “B-CASTLE: An Efficient Publishing Algorithm for K-Anonymizing Data Streams,” in Proceedings of 2nd WRI Global Congress on Intelligent Systems, Wuhan, pp. 132-136, 2010.

[22] Yarlagadda A., Jonnalagedda M., and Munaga K., “Clustering Based on Correlation Fractal Dimension Over an Evolving Data Stream,” The International Arab Journal of Information Technology, vol. 15, no. 1, pp. 1-9, 2018.

[23] Zakerzadeh H. and Osborn S., “FAANST: Fast Anonymizing Algorithm for Numerical Streaming Data,” in Proceedings of Data Privacy Management and Autonomous Spontaneous Security, Athens, pp. 36-50, 2011.

[24] Zhang J., Yang J., Zhang J., and Yuan Y., “KIDS: K-Anonymization Data Stream Base on Sliding Window,” in Proceedings of 2nd International Conference on Future Computer and Communication, Wuhan, pp. 311-316, 2010.

[25] Zhou B., Han Y., Pei J., Jiang B., Tao Y., and Jia Y., “Continuous Privacy Preserving Publishing Of Data Streams,” in Proceedings of 12th International Conference on Extending Database Technology: Advances in Database Technology, Saint Petersburg, pp. 648-659, 2009.

The International Arab Journal of Information Technology, Vol. 17, No. 1, January 2020

Mona Mohamed received her BSc and MSc in Computer Science from Faculty of Engineering in Alexandria University, Egypt, in 2005 and 2011, respectively. She is currently a PhD candidate at the Faculty of Engineering in Alexandria, Egypt. Her main research interests are privacy, data mining and distributed systems.

Sahar Ghanem received her BSc and MSc in Computer Science from Faculty of Engineering in Alexandria University, Egypt, in 1994 and 1997, respectively, and PhD in Computer Science from Old Dominion University in VA, USA, in 2004. She is currently an Associate Professor at the Faculty of Engineering in Alexandria, Egypt. Her main research interests are computer networks, network security, performance evaluation and data mining. She has published about 20 scientific publications in international journals and conference proceedings.

Magdy Nagi received his BSc from the Faculty of Engineering Alexandria University Egypt in 1963. He received his PhD from Karlsruhe University, Karlsruhe, Germany in 1975. He is currently a Professor at the Faculty of Engineering Alexandria University. His main research interests are software engineering, DBMS, data mining and grid computing. He has published more than 35 scientific publications in international journals and conference proceedings.

URL: https://iajit.org/paper/2081

ISSN: 2413-9351