The International Arab Journal of Information Technology (IAJIT), Vol. 17, No. 5, September 2020


Enhanced Latent Semantic Indexing Using Cosine Similarity Measures for Medical Application

The Vector Space Model (VSM) is widely used in data mining and Information Retrieval (IR) systems as a common document representation model. However, the technique faces challenges such as the high dimensionality of the feature space and the semantic looseness of the representation. Latent Semantic Indexing (LSI) was therefore proposed to reduce the feature dimensions and to generate semantically rich features that represent conceptual term-document associations. LSI has been employed effectively in search engines and many other Natural Language Processing (NLP) applications, and researchers continue to seek better performance from it. In this paper, we propose a method that search engines can use to find better-matched content among the retrieved documents. The proposed method introduces a new extension of the LSI technique based on cosine similarity measures. The performance evaluation was carried out on an Arabic-language data collection that contains 800 medical documents with more than 47,222 unique words. The proposed method was assessed using a small test set of five medical keywords. The results show that the proposed method outperforms standard LSI.
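For orientation, the baseline that the abstract compares against, standard LSI with cosine-similarity ranking, can be sketched as follows. This is a minimal illustration under stated assumptions, not the extension proposed in the paper, whose details are not given in the abstract. The function name `lsi_rank`, the toy English term-document matrix, and the choice of k = 2 are all hypothetical; the query is folded into the latent space by projecting it onto the top-k left singular vectors, which is one common variant.

```python
import numpy as np

def lsi_rank(term_doc, query_vec, k):
    """Rank documents against a query in a k-dimensional LSI space.

    term_doc  : (n_terms, n_docs) term-document matrix (e.g. raw counts or TF-IDF)
    query_vec : (n_terms,) query expressed in the same term space
    k         : number of latent dimensions to keep (assumed k <= rank)
    """
    # Truncated SVD: term_doc is approximated by U_k * diag(s_k) * Vt_k.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Documents in latent space: V_k * S_k, which equals term_doc.T @ U_k.
    docs_k = Vt_k.T * s_k                      # shape (n_docs, k)
    # Project the query onto the same basis so both live in one space.
    q_k = query_vec @ U_k                      # shape (k,)

    # Cosine similarity between the query and every document.
    denom = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = np.divide(docs_k @ q_k, denom,
                     out=np.zeros_like(denom), where=denom > 0)
    order = np.argsort(-sims)                  # best match first
    return order, sims

# Hypothetical toy data: 5 terms x 4 documents.
terms = ["heart", "cardiac", "blood", "skin", "rash"]
A = np.array([
    [2, 1, 0, 0],   # heart
    [1, 2, 0, 0],   # cardiac
    [1, 1, 0, 0],   # blood
    [0, 0, 2, 1],   # skin
    [0, 0, 1, 2],   # rash
], dtype=float)

# Single-keyword query: "cardiac".
q = np.zeros(len(terms))
q[terms.index("cardiac")] = 1.0

order, sims = lsi_rank(A, q, k=2)
print("ranking:", order)
print("similarities:", np.round(sims, 3))
```

In this toy matrix the first two documents share the cardiology vocabulary and the last two share the dermatology vocabulary, so with two latent dimensions the single keyword is matched to both cardiology documents rather than only to the one that uses the term most often.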

Professor Fawaz Al-Anzi received his Ph.D. and M.Sc. in Computer Science from Rensselaer Polytechnic Institute, New York, USA, in 1995. He earned his B.Sc. with honors in EE from Kuwait University in 1987. He received the National Research Production Award and the Kuwait University Award. He is the founding dean of the College of Computing Sciences and Engineering. His research interests include data science and engineering, text classification, and speech recognition.

Dia AbuZeina received his Ph.D. in Computer Science and Engineering from King Fahd University of Petroleum and Minerals, Saudi Arabia, in 2011. He received his M.Sc. in information technology from Southern New Hampshire University, Manchester, USA, in 2005. He received his B.Sc. in computer system engineering from Palestine Polytechnic University in 2001. His research interests include speech recognition and text classification for Modern Standard Arabic (MSA).