An Approach for Instance Based Schema Matching

Author Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Malaysia

Keywords Schema matching, instance based schema matching, Google similarity, regular expression

Abstract Instance based schema matching is the process of comparing instances from heterogeneous data sources to determine the correspondences between schema attributes. It is an alternative when schema information is unavailable, or is available but of little use for matching. Existing instance based schema matching approaches discover correspondences between schema attributes using different strategies: neural networks, machine learning, information theoretic discrepancy, and rules. Most of these approaches treat all instances, including those with numeric values, as strings, which prevents them from discovering common patterns or performing statistical computation over numeric instances. As a consequence, matches go unidentified, especially for numeric instances. In this paper, we propose an approach that addresses this limitation. Since we exploit only the instances of the schemas for this task, we rely on strategies that combine the strength of Google as a source of web semantics with regular expressions for pattern recognition. The results show that our approach finds 1-1 schema matches with high accuracy, in the range of 93%-99% in terms of Precision (P), Recall (R), and F-measure (F). Furthermore, our approach outperformed the previous approaches even though it uses only a sample of the instances, whereas the previous works consider all instances during matching.
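
The abstract names three ingredients: a Google-based semantic similarity, regular expressions for recognising instance patterns, and evaluation by Precision, Recall, and F-measure. The Python sketch below illustrates each in isolation and should not be read as the authors' implementation. It assumes the Normalized Google Distance of Cilibrasi and Vitanyi as the similarity measure, with hit counts supplied by the caller (no search API is invoked). The pattern table and all function names are hypothetical, chosen only for illustration.

```python
import math
import re


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance computed from page-hit counts.

    fx, fy: hits for each term alone; fxy: hits for both terms together;
    n: assumed size of the index, taken to be larger than any hit count.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # terms never (co-)occur: treat as unrelated
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical pattern table; a real system would need far more patterns.
PATTERNS = {
    "integer": re.compile(r"[+-]?\d+"),
    "decimal": re.compile(r"[+-]?\d+\.\d+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def pattern_class(value):
    """Return the name of the first pattern that matches the whole value."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(value.strip()):
            return name
    return "text"


def dominant_class(sample):
    """Most frequent pattern class among a sample of attribute instances."""
    counts = {}
    for value in sample:
        cls = pattern_class(value)
        counts[cls] = counts.get(cls, 0) + 1
    return max(counts, key=counts.get)


def precision_recall_f(found, truth):
    """Precision, Recall, and F-measure of found matches against true ones."""
    found, truth = set(found), set(truth)
    tp = len(found & truth)  # correctly identified 1-1 matches
    p = tp / len(found) if found else 0.0
    r = tp / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

In a sketch like this, two attributes whose samples share a dominant numeric class (for example, both "decimal") could be compared statistically rather than as strings, while textual instances could be compared through `ngd`. How the paper actually combines the two signals is not stated in the abstract.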

Osama Mehdi received his Bachelor of Computer Science from the University of Babylon, Iraq in 2009 and his MSc by research in computer science and information technology from Universiti Putra Malaysia, Malaysia in 2014. He is currently a lecturer at Al Mustaqbal College University. His research interests include Data Integration, Information Retrieval, Semantic Web, Pattern Recognition, and Large-Scale Data Analysis (Big Data).

Hamidah Ibrahim is currently a professor at the Faculty of Computer Science and Information Technology, Universiti Putra Malaysia. She obtained her PhD in computer science from the University of Wales Cardiff, UK in 1998. Her current research interests include databases (distributed, parallel, mobile, bio-medical, XML), focusing on issues related to integrity constraints checking, cache strategies, integration, access control, transaction processing, and query processing and optimization, as well as data management in grid and knowledge-based systems (e-mail: hamidah.ibrahim@upm.edu.my).

Lilly Affendey received her Bachelor of Computer Science from the University of Agriculture, Malaysia in 1991 and her MSc in Computing from the University of Bradford, UK in 1994. In 2007 she received her PhD in Database Systems from Universiti Putra Malaysia. Her research interests are in Multimedia Database, Content-based Video Retrieval, and Big Data Analytics. She is currently an Associate Professor at Universiti Putra Malaysia.