The International Arab Journal of Information Technology (IAJIT)


Machine Learning based Intelligent Framework for Data Preprocessing

Data preprocessing plays a pivotal role in data mining: it reduces cost by handling inconsistent, incomplete, and irrelevant data through data cleansing, which helps knowledge workers make effective decisions through knowledge extraction. Prevalent techniques are not very effective because they require more manual effort, take longer to process, and achieve lower accuracy, even with constrained data volumes. In this research, a comprehensive, semi-automatic preprocessing framework based on a hybrid of two machine learning techniques, namely Conditional Random Fields (CRF) and Hidden Markov Model (HMM), is devised for data cleansing. The proposed framework is envisaged to be effective and flexible enough to handle data sets of any size. An inconsistent dataset (comprising the customer address directory) of Pakistan Telecommunication Company (PTCL) is used to conduct experiments for training and validating the proposed approach. A small percentage of the semi-cleansed data (the output of preprocessing) is passed to the hybrid of HMM and CRF for learning, and the rest of the data is used for testing the model. Experiments show that the proposed hybrid approach achieves a higher average accuracy of 95.50%, compared with CRF (84.5%) and HMM (88.6%) applied separately.
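The abstract does not state how the HMM and CRF are combined, so the following is only a minimal sketch of the HMM component on its own, assuming address cleansing is treated as token-level sequence labelling. The toy addresses, the label set (HOUSE, STREET, CITY), the smoothing constant, and the names TRAIN, train_hmm, and viterbi are all hypothetical and are not taken from the paper or the PTCL data.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy data: tokenized addresses with hand-assigned labels.
TRAIN = [
    [("house", "HOUSE"), ("12", "HOUSE"), ("street", "STREET"),
     ("5", "STREET"), ("islamabad", "CITY")],
    [("house", "HOUSE"), ("7", "HOUSE"), ("street", "STREET"),
     ("9", "STREET"), ("lahore", "CITY")],
    [("street", "STREET"), ("3", "STREET"), ("karachi", "CITY")],
]


def train_hmm(sequences, alpha=1.0):
    """Estimate log start/transition/emission probabilities by smoothed counting."""
    states = sorted({tag for seq in sequences for _, tag in seq})
    vocab = {tok for seq in sequences for tok, _ in seq} | {"<unk>"}
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for tok, tag in seq:
            emit[tag][tok] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1

    def log_dist(counts, support):
        # Add-alpha smoothing so unseen events keep a non-zero probability.
        total = sum(counts.values()) + alpha * len(support)
        return {k: math.log((counts[k] + alpha) / total) for k in support}

    return {
        "states": states,
        "start": log_dist(start, states),
        "trans": {s: log_dist(trans[s], states) for s in states},
        "emit": {s: log_dist(emit[s], vocab) for s in states},
    }


def viterbi(model, tokens):
    """Return the most likely label sequence for a non-empty token list."""
    states = model["states"]

    def emission(state, tok):
        table = model["emit"][state]
        return table.get(tok, table["<unk>"])  # unseen tokens map to <unk>

    best = [{s: model["start"][s] + emission(s, tokens[0]) for s in states}]
    back = []
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + model["trans"][p][s])
            scores[s] = best[-1][prev] + model["trans"][prev][s] + emission(s, tok)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


MODEL = train_hmm(TRAIN)
LABELS = viterbi(MODEL, ["house", "7", "street", "5", "lahore"])
print(LABELS)
```

In this sketch, a small labelled sample plays the role of the semi-cleansed training portion, and decoding the remaining records corresponds to the testing step described above. A CRF would replace the generative counts with discriminatively trained feature weights.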

Sohail Sarwar is a PhD student at the Department of Computing, Iqra University, Islamabad, Pakistan. His research interest is the application of machine learning techniques to areas such as data mining, software testing, and e-learning.

Zia Ul Qayyum is a Professor in the Department of Computing, Iqra University. His research covers the application of AI techniques in image processing and classification, natural language processing, data preprocessing, semantics, and recommender systems.

Abdul Kaleem is a Master's student at Iqra University. His research interest is applying knowledge engineering to big data. He has a keen interest in building web-based software applications, pattern mining, and business intelligence.