The International Arab Journal of Information Technology (IAJIT)


A Differential Geometry Perspective about Multiple Data Streams Preprocessing

In a Multiple Data Streams (MDS) environment, data sources generate data without end. Because the sources differ, the streams of an MDS do not always contain the same number of transactions over the same period. Preprocessing MDS to obtain the same number of samples for each stream is therefore an essential step for many mining tasks. All existing preprocessing methods assume that data arrive simultaneously. However, this assumption may not hold in many real environments, where there are multiple data sources and different ways of generating data. This paper explores the asynchronous issue using differential geometry as a tool. First, we establish a novel stream model called POLAR, an intrinsic surface spanned by time, probability and value. We then propose a preprocessing approach, called COPOLAR, to obtain the same number of samples for each stream of the MDS. COPOLAR first projects the original observations onto POLAR; it then iteratively and incrementally merges the points with the shortest geodesic distance along a geodesic on the surface into the mid-point of that geodesic, until the desired number of points is reached. Experimental results on synthetic and real data show that COPOLAR is effective at maintaining both statistical and vector characteristics.
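The merge step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the POLAR surface or its geodesics, so the sketch assumes each observation is a (time, value) pair and uses the Euclidean distance between neighbouring points as a stand-in for geodesic distance. The function name `reduce_stream` and the sample streams are hypothetical. The sketch also works in batch mode, whereas COPOLAR is described as incremental.

```python
import math


def reduce_stream(points, target):
    """Merge the closest adjacent pair into its midpoint until `target` points remain.

    `points` is a time-ordered list of (time, value) tuples. Euclidean distance
    between neighbours is an assumed stand-in for the geodesic distance on
    POLAR, which the abstract does not define.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    pts = list(points)
    while len(pts) > target:
        # Index of the adjacent pair with the smallest separation.
        i = min(range(len(pts) - 1), key=lambda k: math.dist(pts[k], pts[k + 1]))
        (t0, v0), (t1, v1) = pts[i], pts[i + 1]
        # Replace the pair with its midpoint (straight-line midpoint here).
        pts[i:i + 2] = [((t0 + t1) / 2, (v0 + v1) / 2)]
    return pts


# Two hypothetical streams with different sample counts over the same period.
stream_a = [(0.0, 1.0), (1.0, 1.2), (2.0, 3.0), (3.0, 3.1), (4.0, 5.0)]
stream_b = [(0.0, 2.0), (2.0, 2.5), (4.0, 4.0)]

# Reduce both streams to the size of the shorter one.
n = min(len(stream_a), len(stream_b))
aligned_a = reduce_stream(stream_a, n)
aligned_b = reduce_stream(stream_b, n)
```

After the call, both streams hold `n` points. Only adjacent points are merged, so time order is preserved. Replacing `math.dist` and the midpoint computation with geodesic equivalents would bring the sketch closer to the approach the abstract describes.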


Li Wen-Ping received his DEng degree from the College of Computer Science and Technology, Harbin Engineering University, China. Currently, he is working as Associate Professor in the College of Mathematics Physics and Information Engineering, Jiaxing University, China. His research interests are in the areas of data stream, data mining, privacy preservation and membrane computing.

Yang Jing received her DEng degree from the College of Computer Science and Technology, Harbin Engineering University, China. Currently, she is working as Professor in the College of Computer Science and Technology, Harbin Engineering University, China. Her research interests are in the areas of database, data mining and privacy preservation.

Zhang Jian-Pei received his DEng degree from the College of Computer Science and Technology, Harbin Engineering University, China. Currently, he is working as Professor in the College of Computer Science and Technology, Harbin Engineering University, China. His research interests are in the areas of database, data mining and social network.