Machine Learning based Intelligent Framework for Data Preprocessing
        
Data preprocessing plays a pivotal role in data mining: by cleansing inconsistent, incomplete and irrelevant data, it reduces cost and helps knowledge workers make effective decisions through knowledge extraction. Prevalent techniques are not very effective because they require more manual effort, take longer to process and achieve lower accuracy, even on constrained data volumes. In this research, a comprehensive, semi-automatic preprocessing framework based on a hybrid of two machine learning techniques, namely Conditional Random Fields (CRF) and Hidden Markov Model (HMM), is devised for data cleansing. The proposed framework is envisaged to be effective and flexible enough to handle data sets of any size. An inconsistent dataset (a customer address directory) of Pakistan Telecommunication Company (PTCL) is used to conduct different experiments for training and validation of the proposed approach. A small percentage of the semi-cleansed data (the output of preprocessing) is passed to the hybrid of HMM and CRF for learning, and the rest of the data is used for testing the model. Experiments show a higher average accuracy of 95.50% for the proposed hybrid approach, compared with CRF (84.5%) and HMM (88.6%) when applied separately.
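
The train-on-a-small-sample, test-on-the-rest protocol described above can be illustrated with a minimal sketch of the HMM component alone. This is not the hybrid HMM and CRF model of the paper, whose details are not given here. The label set (HOUSE, STREET, CITY), the toy address records, the add-one smoothing and all function names are assumptions made purely for illustration; none of them come from the PTCL dataset or the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_records):
    """Count start, transition and emission events from (token, label) sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for record in tagged_records:
        prev = None
        for token, label in record:
            token = token.lower()
            vocab.add(token)
            emit[label][token] += 1
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            prev = label
    # One extra vocabulary slot is reserved for unseen tokens.
    return {"start": start, "trans": trans, "emit": emit,
            "labels": sorted(emit), "vocab_size": len(vocab) + 1}

def _logp(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(model, tokens):
    """Return the most likely label sequence for a list of tokens."""
    if not tokens:
        return []
    labels, n, v = model["labels"], len(model["labels"]), model["vocab_size"]
    tokens = [t.lower() for t in tokens]
    score = {l: _logp(model["start"], l, n) + _logp(model["emit"][l], tokens[0], v)
             for l in labels}
    back = []
    for tok in tokens[1:]:
        new_score, pointers = {}, {}
        for l in labels:
            best = max(labels, key=lambda p: score[p] + _logp(model["trans"][p], l, n))
            new_score[l] = (score[best] + _logp(model["trans"][best], l, n)
                            + _logp(model["emit"][l], tok, v))
            pointers[l] = best
        score = new_score
        back.append(pointers)
    path = [max(labels, key=lambda l: score[l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def token_accuracy(model, records):
    """Fraction of tokens whose predicted label matches the gold label."""
    correct = total = 0
    for record in records:
        pred = viterbi(model, [t for t, _ in record])
        correct += sum(p == g for p, (_, g) in zip(pred, record))
        total += len(record)
    return correct / total if total else 0.0

# Invented toy records: a small labelled portion for training, the rest for testing.
train_records = [
    [("12", "HOUSE"), ("main", "STREET"), ("road", "STREET"), ("lahore", "CITY")],
    [("7", "HOUSE"), ("canal", "STREET"), ("road", "STREET"), ("karachi", "CITY")],
    [("45", "HOUSE"), ("mall", "STREET"), ("road", "STREET"), ("islamabad", "CITY")],
]
test_records = [
    [("9", "HOUSE"), ("main", "STREET"), ("road", "STREET"), ("karachi", "CITY")],
]

model = train_hmm(train_records)
print(viterbi(model, ["9", "main", "road", "karachi"]))
print(token_accuracy(model, test_records))
```

Token-level accuracy is used here only as one plausible reading of the accuracy figures quoted above; the abstract does not state how accuracy was measured.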
Sohail Sarwar is a PhD student at the Department of Computing, Iqra University Islamabad, Pakistan. His domain of interest is the application of machine learning techniques in areas such as data mining, software testing and e-learning.

Zia Ul Qayyum is a Professor in the Department of Computing, Iqra University. His area of research covers the application of AI techniques in image processing and classification, natural language processing, data preprocessing, semantics and recommender systems.

Abdul Kaleem is a Master's student at Iqra University. His research interest is applying knowledge engineering to big data. He has a keen interest in building web-based software applications, pattern mining and business intelligence.
