The International Arab Journal of Information Technology (IAJIT), Volume 14, No. 5, September 2017


Enhanced Clustering-Based Topic Identification of Transcribed Arabic Broadcast News

This research presents an enhanced approach to topic identification of transcribed Arabic broadcast news using clustering techniques. The enhancement includes applying a new stemming technique, “rule-based light stemming,” to balance the negative effects of the stemming errors associated with light stemming and root-based stemming. A new possibilistic-based clustering technique is also applied to evaluate the degree of membership that every transcribed document has with regard to every predefined topic, and hence to detect documents that cause topic confusion and reduce the accuracy of the topic-clustering process. The evaluation showed that rule-based light stemming in combination with the spectral clustering technique achieved the highest accuracy, and that this accuracy increased further after the confusing documents were excluded.
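As a rough illustration of the membership idea described above, the following Python sketch computes a possibilistic degree of membership for one document against each topic and flags the document as confusing when its two highest memberships are close. It is a minimal sketch under stated assumptions, not the implementation evaluated in this research: the membership formula is the typicality form commonly used in possibilistic c-means, and the cosine distance, the `eta` scale values, the fuzzifier `m`, the `margin` threshold, and all function names are hypothetical choices made for the example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two term-weight vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def possibilistic_memberships(doc, centroids, eta, m=2.0):
    """Degree of membership of one document in every topic.

    Uses the possibilistic c-means typicality form
        u_i = 1 / (1 + (d_i^2 / eta_i)^(1 / (m - 1)))
    where d_i is the distance to topic centroid i. Unlike fuzzy
    memberships, these values are not forced to sum to 1.
    eta (per-topic scale) and m (fuzzifier) are illustrative values.
    """
    memberships = []
    for centroid, eta_i in zip(centroids, eta):
        d2 = cosine_distance(doc, centroid) ** 2
        memberships.append(1.0 / (1.0 + (d2 / eta_i) ** (1.0 / (m - 1.0))))
    return memberships


def is_confusing(memberships, margin=0.2):
    """Hypothetical rule: a document is confusing when its two highest
    memberships differ by less than `margin`."""
    top, second = sorted(memberships, reverse=True)[:2]
    return (top - second) < margin


if __name__ == "__main__":
    # Two toy topic centroids in a two-term vocabulary (made-up data).
    topics = [[1.0, 0.0], [0.0, 1.0]]
    scales = [0.5, 0.5]
    for doc in ([1.0, 0.0], [1.0, 1.0]):
        u = possibilistic_memberships(doc, topics, scales)
        print(doc, [round(x, 3) for x in u], "confusing:", is_confusing(u))
```

In this toy setting, a document that uses the terms of both topics equally receives the same membership in each and is flagged, whereas a document aligned with a single topic is not. Excluding flagged documents before measuring clustering accuracy corresponds to the exclusion step mentioned in the abstract, although the exact criterion used in this research may differ.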

Ahmed Jafar obtained his B.Sc. in Computer Science from the Faculty of Information Systems and Computer Science, October 6 University, Egypt, in July 2006. He received his M.Sc. degree in Computer Science from the College of Computing and Information Technology, Arab Academy for Science and Technology and Maritime Transport, Cairo, Egypt, in September 2014. He is currently a teaching assistant at the Faculty of Information Systems and Computer Science, October 6 University.

Mohamed Fakhr received his Ph.D. in Electrical Engineering from the Electrical and Computer Engineering Department, University of Waterloo, Waterloo, Canada, in May 1994. He has been a full professor at the College of Computing and Information Technology, Arab Academy for Science and Technology and Maritime Transport, Cairo, Egypt, since September 2013. His fields of interest are Image Processing, Audio Processing, Pattern Recognition, Sparse Coding, Sparse Recovery, and Machine Learning.

Mohamed Farouk graduated from the Electronics Engineering Department at Cairo University, Egypt, in 1982. He received his M.Sc. and Ph.D. in Engineering Physics in 1988 and 1994, respectively. He is currently a full professor of Engineering Physics at Cairo University. His research interests are in the areas of Acoustic Scattering and Speech Processing. He is also a cross-appointed professor at the Faculty of Information Systems and Computer Science, October 6 University.