An Effective Sample Preparation Method for Diabetes Prediction
Diabetes is a chronic disorder caused by a malfunction in carbohydrate metabolism, and it has become a
serious health problem worldwide. Early and accurate detection of diabetes can significantly influence the treatment of
diabetic patients and thus help eliminate the associated side effects. Machine learning is an emerging field of high importance
for providing prognosis and a deeper understanding of the classification of diseases such as diabetes. This study proposes a
high-precision diagnostic system based on a modified k-means clustering technique. First, noisy, uncertain, and inconsistent
data are detected by the new clustering method and removed from the data set. Then, a diabetes prediction model is built
using a Support Vector Machine (SVM). Applying the proposed diagnostic system to the Pima Indians Diabetes (PID) data set
resulted in 99.64% classification accuracy with 10-fold cross-validation. Our analysis shows that the new system
outperforms both SVM alone and the classical k-means algorithm combined with SVM in classification
performance and time consumption. Experimental results indicate that the proposed approach outperforms previous methods.
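The sample-preparation idea described above (cluster the data, then discard samples that look inconsistent before training the classifier) can be illustrated with a minimal sketch. The abstract does not say how k-means was modified, so the code below is not the authors' method: it assumes plain Lloyd's k-means with a deterministic farthest-first initialisation, and it assumes a sample counts as "noisy" when its class label disagrees with the majority label of its cluster. The function names and the toy data are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means; returns one cluster index per point.

    Assumption: centroids start from points[0] and are extended
    farthest-first, so the result is deterministic.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def filter_by_cluster_majority(points, labels, k=2):
    """Return indices of samples whose label matches their cluster's majority label.

    Assumption: disagreement with the cluster majority marks a sample as noisy.
    """
    assignment = kmeans(points, k)
    majority = {}
    for c in set(assignment):
        votes = Counter(l for l, a in zip(labels, assignment) if a == c)
        majority[c] = votes.most_common(1)[0][0]
    return [i for i, (l, a) in enumerate(zip(labels, assignment)) if l == majority[a]]


# Toy data: two well-separated groups, each containing one mislabelled sample.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2)]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

kept = filter_by_cluster_majority(points, labels)
print(kept)  # [0, 1, 2, 4, 5, 6]
```

In a full pipeline, only the retained samples would be passed on to the SVM and evaluated with 10-fold cross-validation; that stage is omitted here.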
Shima Afzali received her B.Sc. degree in computer engineering (software engineering) from the University of Zanjan, Iran, in 2009 and her M.Sc. degree in computer engineering from Gazi University, Turkey, in 2014. She has been working toward the Ph.D. degree in computer science at Victoria University of Wellington, New Zealand, since March 2016. She has been awarded a Victoria Doctoral Scholarship. Her main research areas are machine learning, bioinformatics, and evolutionary computation.
Oktay Yıldız received his M.Sc. degree from the Institute of Science at Gazi University in 2004 and his Ph.D. degree from the Institute of Information Sciences at Gazi University in 2012. He has been with the Computer Engineering Department at Gazi University, Ankara, Turkey, since 2009. His research interests include machine learning and data mining.