The International Arab Journal of Information Technology (IAJIT), Vol. 15, No. 1, January 2018


Missing Values Estimation for Skylines in Incomplete Database

Incompleteness of data is a common problem in many databases, including heterogeneous web databases, multi-relational databases, spatial and temporal databases, and data integration. Incomplete data makes query processing challenging, because providing accurate results that best meet the query conditions over an incomplete database is not a trivial task. Several techniques have been proposed to process queries over incomplete databases. Some of these techniques retrieve the query results based on the existing values rather than estimating the missing values. Such techniques are undesirable in many cases, as the dimensions with missing values might be the important dimensions of the user's query. Besides, the output is incomplete and might not satisfy the user's preferences. In this paper, we propose an approach that estimates missing values in skylines to guide users in selecting the most appropriate skylines from several candidate skylines. The approach mines attribute correlations to generate Approximate Functional Dependencies (AFDs) that capture the relationships between the dimensions. It then identifies the strength of the probability correlations and uses them to estimate the missing values. Finally, the skylines with estimated values are ranked, which ensures that the retrieved skylines are returned in the order of their estimated precision.
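The estimate-then-rank idea summarized above can be illustrated with a minimal sketch. It is not the algorithm of this paper: the abstract does not specify how the AFDs are mined or how the probabilities are computed, so the sketch assumes a single dependency from one dimension to another, uses a simple majority-vote estimate, and ranks by the relative frequency of the chosen value. The function names, the `rating` and `price` dimensions, and the hotel data are all hypothetical.

```python
from collections import Counter, defaultdict


def value_distribution(rows, lhs, rhs):
    """Count rhs values per lhs value over rows where both are present."""
    groups = defaultdict(Counter)
    for row in rows:
        if row[lhs] is not None and row[rhs] is not None:
            groups[row[lhs]][row[rhs]] += 1
    return groups


def afd_confidence(groups):
    """Fraction of rows that agree with the majority rhs value of their group.

    A value of 1.0 means lhs -> rhs holds exactly on the complete rows.
    """
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / total


def estimate_and_rank(rows, lhs, rhs):
    """Estimate missing rhs values and rank the estimated rows by probability."""
    groups = value_distribution(rows, lhs, rhs)
    estimated = []
    for row in rows:
        if row[rhs] is None and row[lhs] in groups:
            counts = groups[row[lhs]]
            value, count = counts.most_common(1)[0]
            estimated.append(({**row, rhs: value}, count / sum(counts.values())))
    # Highest estimated probability first.
    return sorted(estimated, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate skylines; None marks a missing value.
skylines = [
    {"id": "h1", "rating": 4, "price": 120},
    {"id": "h2", "rating": 4, "price": 120},
    {"id": "h3", "rating": 4, "price": 150},
    {"id": "h4", "rating": 3, "price": 80},
    {"id": "h5", "rating": 3, "price": 80},
    {"id": "h6", "rating": 4, "price": None},
    {"id": "h7", "rating": 3, "price": None},
]

groups = value_distribution(skylines, "rating", "price")
print("confidence of rating -> price:", afd_confidence(groups))
for row, probability in estimate_and_rank(skylines, "rating", "price"):
    print(row["id"], row["price"], round(probability, 3))
```

In this toy data, four of the five complete rows agree with the majority price for their rating, so the assumed dependency is approximate rather than exact. The row whose estimate comes from a unanimous group is ranked ahead of the row whose estimate comes from a split group.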


Ali Alwan: is currently an assistant professor at Kulliyyah (Faculty) of Information and Communication Technology, International Islamic University Malaysia (IIUM), Malaysia. He received his Master of Computer Science (2009) and Ph.D in Computer Science (2013) from Universiti Putra Malaysia (UPM), Malaysia. His research interests include preference queries, skyline queries, probabilistic and uncertain databases, query processing and optimization and management of incomplete data, data integration, location-based social networks (LBSN), recommendation systems, and data management in cloud computing.
Hamidah Ibrahim: is currently a full professor at the Faculty of Computer Science and Information Technology, Universiti Putra Malaysia (UPM). She obtained her PhD in computer science from the University of Wales Cardiff, UK in 1998. Her current research interests include databases (distributed, parallel, mobile, biomedical, XML) focusing on issues related to integrity constraints checking, cache strategies, integration, access control, transaction processing, and query processing and optimization; data management in grid and knowledge-based systems.
NurIzura Udzir: is an associate professor at the Faculty of Computer Science and Information Technology, Universiti Putra Malaysia (UPM) since 1998. She received her Bachelor of Computer Science (1996) and Master of Science (1998) from UPM, and her PhD in Computer Science from the University of York, UK (2006). She is a member of IEEE Computer Society. Her areas of specialization are access control, secure operating systems, intrusion detection systems, coordination models and languages, and distributed systems. She is currently the Leader of the Information Security Group at the faculty.
Fatimah Sidi: is currently working as an associate professor in the discipline of Computer Science at the Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia (UPM). She obtained her PhD in management information system from Universiti Putra Malaysia (UPM), Malaysia, in 2008. Her current research interests are Knowledge and Information Management Systems, Data and Knowledge Engineering, and Database and Data Warehouse.