The International Arab Journal of Information Technology (IAJIT)


A Novel Approach of Clustering Documents: Minimizing Computational Complexities in

This study addresses the practical problem of managing an academic program's documents in a university environment. In practice, classifying documents from a corpus is challenging when the dataset is large, and the complexity increases when specific document management requirements must be met. This study presents a practical approach to grouping documents based on a content similarity measure. The approach analyzes the performance of state-of-the-art clustering algorithms and builds on Hamiltonian graph properties and a distance function. The distance function measures (1) the content similarity between the documents and (2) the distances between the produced clusters. The proposed algorithm improves the quality of the clusters by applying Hamiltonian graph properties. A significant characteristic of the proposed function is that it determines the document types from the corpus itself; hence, the number of clusters need not be assumed before the algorithm runs. The approach avoids the arbitrary initial choice of k centroids required by the k-means algorithm, reduces computational complexity, and overcomes some limitations of commonly used clustering algorithms. The proposed approach gives information systems developers an effective way to organize documents when designing document management systems.
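The idea of letting a distance function, rather than a preset k, decide how many groups emerge can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it omits the Hamiltonian graph step entirely, and the bag-of-words representation, the cosine distance, the `threshold` value, and all function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def cluster_documents(docs, threshold=0.6):
    """Put each document in the nearest cluster within `threshold`, else open a new one.

    The number of clusters is an output of the procedure, not an input.
    `threshold` is an assumed tuning parameter, not a value from the paper.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [doc indices]}
    for index, text in enumerate(docs):
        vec = term_vector(text)
        best, best_dist = None, threshold
        for cluster in clusters:
            dist = cosine_distance(vec, cluster["centroid"])
            if dist <= best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [index]})
        else:
            # Summed counts point in the same direction as the mean vector,
            # which is all that cosine distance depends on.
            best["centroid"].update(vec)
            best["members"].append(index)
    return clusters


def cluster_distances(clusters):
    """Pairwise cosine distances between the centroids of the produced clusters."""
    return {
        (i, j): cosine_distance(clusters[i]["centroid"], clusters[j]["centroid"])
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    }


sample_docs = [
    "course syllabus for database systems course",
    "database systems course syllabus and schedule",
    "final exam questions for data mining",
    "data mining final exam questions and answers",
]
result = cluster_documents(sample_docs)
print([c["members"] for c in result])
print(cluster_distances(result))
```

With these four sample strings, the two syllabus documents and the two exam documents end up in separate clusters without any cluster count being supplied. The same distance function is reused for both roles named in the abstract: document-to-cluster similarity and cluster-to-cluster distance.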

The International Arab Journal of Information Technology, Vol. 19, No. 4, July 2022

Mohammed Alghobiri is experienced and highly involved in information systems development and implementation, especially experimental and participative approaches. His interests include the management and evaluation of systems development, including software process improvement methods and ERP D&I. Databases, data mining, decision support systems, and electronic government concepts are also among his interests.

Khalid Mohiuddin is recognized as a researcher and an excellent teaching practitioner at the Faculty of Information Systems, King Khalid University, Saudi Arabia. His interdisciplinary research involves information systems management, mobile cloud computing-IoT, edge, mobile edge, fog, AI, 5G, and quality development in higher education.

Mohammed Abdul Khaleel worked as a senior software developer in information systems management. Presently, he serves as a faculty member at the College of Computer Science, King Khalid University, Saudi Arabia. His research interests include data mining, cloud data management, mobile cloud data management, and data management in serverless computing.

Mohammad Islam is a research scholar at King Khalid University, Saudi Arabia. He has extensive experience in information systems research and teaching. His interdisciplinary research interests include business IS, business intelligence, and quality development in higher education.

Samreen Shahwar is a lecturer and research scholar at King Khalid University, Saudi Arabia. Her research interests include information systems management, cloud computing, education learning, and higher education assessment.

Osman Nasr is currently an Assistant Professor in the Department of Management Information Systems, King Khalid University, Saudi Arabia. His research interests include data mining, web-based systems, cloud computing, educational research, and quality development in higher education.