Speaker Naming in Arabic TV Programs

Authors Mohamed Lazhar Bellagha, Mounir Zrigui

Keywords Speaker naming, speaker identification, name assignment, name propagation, CNN-LSTM

Abstract Automatic speaker identification is the problem of identifying speakers by their real identities. Previous approaches use textual information as the source of names and try to associate each name with a neighbouring speaker segment using linguistic rules. However, these approaches have limitations that hinder their application to spoken text. Deep learning approaches for natural language processing have recently reached state-of-the-art results, but they require a large amount of annotated data, which is difficult to obtain for the speaker identification task. In this paper, we present two contributions towards integrating deep learning into speaker identification in news broadcasts. First, we build a dataset in which each mentioned speaker name is related to the previous, next, current or another speaker turn. Second, we present our approach to speaker identification using information obtained from the transcription. We use a Long-term Recurrent Convolutional Network for name assignment and integer linear programming for name propagation into the different segments. We evaluate our model on both the assignment and propagation tasks using the test part of the Arabic multi-genre broadcast dataset, which consists of 17 TV programs from Aljazeera. Performance is analysed using the Estimated Global Error Rate (EGER) and the Diarization Error Rate (DER). The proposed method improves performance, achieving a lower EGER of 32.3% and a DER of 8.3%.
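The two stages named in the abstract can be illustrated with a minimal sketch. It assumes that a classifier has already labelled each name mention as previous, current, next or other, and it replaces the integer linear program with a plain majority vote per speaker cluster, so it is not the authors' method. The function names, the input layout and the placeholder names are hypothetical.

```python
from collections import Counter, defaultdict

# Turn offsets implied by three of the four assignment labels in the abstract.
# "other" has no offset: the name belongs to none of the neighbouring turns.
LABEL_OFFSET = {"previous": -1, "current": 0, "next": 1}


def assign_names(turns, mentions):
    """Map each name mention to a turn index using its predicted label.

    turns    -- speaker-cluster id of each turn, in order (hypothetical input)
    mentions -- (turn_index, name, label) tuples, where label stands in for
                the classifier output for that mention (hypothetical input)
    """
    assigned = []
    for turn_index, name, label in mentions:
        offset = LABEL_OFFSET.get(label)
        if offset is None:  # "other": no local assignment
            continue
        target = turn_index + offset
        if 0 <= target < len(turns):
            assigned.append((target, name))
    return assigned


def propagate_names(turns, assigned):
    """Give every turn of a cluster the name most often assigned to it.

    Majority vote, used here as a simple stand-in for the integer linear
    program mentioned in the abstract.
    """
    votes = defaultdict(Counter)
    for target, name in assigned:
        votes[turns[target]][name] += 1
    best = {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}
    # Clusters that never received a name stay unnamed (None).
    return [best.get(cluster) for cluster in turns]


turns = ["A", "B", "A", "B", "C"]
mentions = [
    (0, "Guest Name", "next"),      # speaker in turn 0 introduces the next speaker
    (1, "Host Name", "previous"),   # speaker in turn 1 addresses the previous speaker
    (2, "Guest Name", "next"),
    (3, "Third Person", "other"),   # name of someone outside the neighbouring turns
]
named_turns = propagate_names(turns, assign_names(turns, mentions))
print(named_turns)
```

In this toy input, cluster C receives no name because no mention points at its turn, which is the kind of gap a global propagation step is meant to handle.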

The International Arab Journal of Information Technology, Vol. 19, No. 6, November 2022

Mohamed Lazhar Bellagha is a PhD student at the Higher Institute of Computer Science and Communication Techniques (ISITCom), Hammam Sousse, Tunisia. He is a member of the Research Laboratory in Algebra, Numbers Theory and Intelligent Systems (RLANTIS), Monastir, Tunisia. His areas of interest include speaker identification, machine learning and natural language processing.

Mounir Zrigui is a full professor at the University of Monastir, Tunisia. He received his PhD from the Paul Sabatier University, Toulouse, France in 1987 and his HDR from the Stendhal University, Grenoble, France in 2008. Since 1986, he has been a Computer Science Assistant Professor, first at Brest University, France, and afterwards at the Faculty of Science of Monastir, Tunisia. His research has focused on all aspects of automatic natural language processing (written and oral). He has run many research projects and published many research papers in reputed international journals and conferences.