The International Arab Journal of Information Technology (IAJIT), Vol. 14, No. 1, January 2017


Audiovisual Speaker Identification Based on Lip and Speech Modalities

In this article, we present a bimodal speaker identification method that integrates acoustic and visual features, processing the two audiovisual stream modalities in parallel. We also propose a fusion technique that combines the two modalities to make the final recognition decision. Experiments are conducted on an audiovisual dataset containing the 28 Arabic syllables pronounced by ten speakers. The results show the importance of the visual information provided by the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT), in addition to the audio information carried by the Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) features. Furthermore, artificial neural networks such as the Multilayer Perceptron (MLP) and the Radial Basis Function (RBF) network were investigated and tested successfully on this dataset, yielding good recognition performance with serial concatenation of the acoustic and visual vectors.
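The serial concatenation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature vectors are stand-ins for real MFCC/PLP and DCT/DWT coefficients, and the dimensions chosen here are assumptions.

```python
import numpy as np

def fuse_serial(audio_vec, visual_vec):
    # Serial concatenation of the audio descriptor AD and the visual
    # descriptor VD into one audiovisual feature vector AV = [AD; VD].
    return np.concatenate([audio_vec, visual_vec])

# Stand-in descriptors (dimensions are illustrative assumptions):
# 13 MFCC-like audio coefficients, 16 DCT-like lip coefficients.
rng = np.random.default_rng(0)
audio_vec = rng.normal(size=13)   # placeholder for MFCC/PLP features
visual_vec = rng.normal(size=16)  # placeholder for DCT/DWT lip features

av = fuse_serial(audio_vec, visual_vec)
print(av.shape)  # (29,)
```

The fused vector AV would then be presented to an MLP or RBF classifier; normalizing each modality before concatenation is a common precaution so that neither stream dominates the joint feature space.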

[1] Abushariah M., Ainon R., Zainuddine R., Elshafei M., and Khalifa O., Arabic Speaker-Independent Continuous Automatic Speech Recognition Based on a Phonetically Rich and Balanced Speech Corpus, The International Arab Journal of Information Technology, vol. 9, no. 1, pp. 84-93, 2012.

[2] Brunelli R. and Falavigna D., Person Identification Using Multiple Cues, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 10, pp. 955-966, 1995.

[3] Bregler C. and Konig Y., Eigenlips for Robust Speech Recognition, in Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, pp. 669-672, 1994.

[4] Cetingul H., Erzin E., Yemez Y., and Tekalp A., Multimodal Speaker/Speech Recognition Using Lip Motion, Lip Texture and Audio, Signal Processing, vol. 86, no. 12, pp. 3549-3558, 2006.

[5] Civanlar M. and Chen T., Password-Free Network Security through Joint Use of Audio and Video, in Proceedings of SPIE Photonics, Boston, pp. 120-125, 1996.

[6] Chaudhari U., Ramaswamy G., Potamianos G., and Neti C., Information Fusion and Decision Cascading for Audio-Visual Speaker Recognition Based on Time-Varying Stream Reliability Prediction, in Proceedings of the International Conference on Multimedia and Expo (ICME), pp. 9-12, 2003.

[7] Chakraborty P., Ahmed F., Monirul M., Shahjahan M., and Murase K., An Automatic Speaker Recognition System, in Proceedings of the International Conference on Neural Information Processing, Berlin, Heidelberg, pp. 517-526, 2007.

[8] Chelali F. and Djeradi A., Face Recognition System Based on DCT and Neural Network, in Proceedings of Artificial Intelligence and Pattern Recognition (AIPR-10), Florida, pp. 13-18, 2010.

[9] Chelali F. and Djeradi A., Face Recognition System Using Discrete Cosine Transform Combined with MLP and RBF Neural Networks, International Journal of Mobile Computing and Multimedia Communication (IJMCMC), vol. 4, no. 4, pp. 1-35, 2012.

[10] Eleyan A. and Demirel H., PCA and LDA Based Neural Networks for Human Face Recognition, INTECH Open Access Publisher, 2007.

[11] Frischholz R. and Dieckmann U., BioID: A Multimodal Biometric Identification System, Computer, vol. 33, no. 2, pp. 64-68, 2000.

[12] Gerasimos P., Audio-Visual Automatic Speech Recognition: An Overview, MIT Press, 2004.

[13] Hermansky H., Perceptual Linear Predictive (PLP) Analysis of Speech, Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990.

[14] Husmeier D., Neural Networks for Conditional Probability Estimation, Perspectives in Neural Computing, Springer-Verlag, 1999.

[15] Jourlin P., Luettin J., Genoud D., and Wassner H., Acoustic-Labial Speaker Verification, Pattern Recognition Letters, vol. 18, no. 9, pp. 853-858, 1997.

[16] Joo M., Chen W., and Wu S., High-Speed Face Recognition Based on Discrete Cosine Transform and RBF Neural Networks, IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 679-691, 2005.

[17] Minh D., An Automatic Speaker Recognition System, Digital Signal Processing Mini-Project, Audio Visual Communications Laboratory, Swiss Federal Institute of Technology, Switzerland, pp. 1-14, White Paper, 1996.

[18] Nocedal J., Theory of Algorithms for Unconstrained Optimization, Acta Numerica, vol. 1, pp. 199-242, 1992.

[19] Parizeau M., Le Perceptron Multicouche et Son Algorithme de Rétropropagation des Erreurs (The Multilayer Perceptron and Its Error Backpropagation Algorithm), Technical Report, 2004.

[20] Sumby W. and Pollack I., Visual Contribution to Speech Intelligibility in Noise, Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212-215, 1954.

[21] Shivappa S., Trivedi M., and Rao B., Audiovisual Information Fusion in Human-Computer Interfaces and Intelligent Environments: A Survey, Proceedings of the IEEE, vol. 98, no. 10, pp. 1692-1715, 2010.

[22] Sheela K. and Prasad K., Linear Discriminant Analysis F-Ratio for Optimization of TESPAR and MFCC Features for Speaker Recognition, Journal of Multimedia, vol. 2, no. 6, pp. 34-43, 2007.

[23] Sanderson C. and Paliwal K., Noise Compensation in a Person Verification System Using Face and Multiple Speech Features, Pattern Recognition, vol. 36, no. 2, pp. 293-302, 2003.

[24] Satori H., Hiyassat H., Harti M., and Chenfour N., Investigation Arabic Speech Recognition Using CMU Sphinx System, The International Arab Journal of Information Technology, vol. 6, no. 2, pp. 186-190, 2009.

[25] Senthil G. and Dandapat S., Speaker Recognition under Stressed Condition, International Journal of Speech Technology, vol. 13, no. 3, pp. 141-161, 2010.

[26] Shih F., Chuang C., and Wang P., Performance Comparisons of Facial Expression Recognition in JAFFE Database, International Journal of Pattern Recognition and Artificial Intelligence, vol. 22, no. 3, pp. 445-459, 2008.

[27] Tsuhan C., Audiovisual Speech Processing: Lip Reading and Lip Synchronization, IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, 2001.

[28] Wang Y., Guan L., and Venetsanopoulos A., Kernel Cross-Modal Factor Analysis for Multimodal Information Fusion, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, pp. 2384-2387, 2011.

[29] Wark T. and Sridharan S., Adaptive Fusion of Speech and Lip Information for Robust Speaker Identification, Digital Signal Processing, vol. 11, no. 3, pp. 169-186, 2001.

[30] Zhang D., Automated Biometrics, Springer US, 2000.

Fatma Zohra Chelali received her engineering degree in Electronic Engineering from the University of Science and Technology Houari Boumediene (USTHB), Algiers, Algeria, in 1994. She worked as an assistant teacher at the High School of Aeronautical Technicians (École supérieure des techniciens de l'Aéronautique, ESTA) from 1997 to 2008, and received an academic teaching certificate from the International Institute of Management of Algiers (Institut international de management d'Alger) in 1999. After a year of postgraduate study from 2002 to 2003, she received a magister degree in speech communication in 2006 and a doctorate from the Speech Communication and Signal Processing Laboratory (LCPTS, USTHB, Algiers) in 2012; her thesis addressed audiovisual speaker recognition applied to Arabic phonemes. Since 2007 she has taught courses on electromagnetic waves, transmission lines, and digital electronics in the Telecommunications Department of the Faculty of Electronic Engineering and Computer Science, USTHB. Her interests include audiovisual analysis and recognition, pattern recognition and classification, and speech and image processing.

Amar Djeradi received his engineering degree in Electronics in 1984, a magister degree in applied electronics, and a doctorate in 1992. Since 1985 he has taught various graduate and postgraduate modules, including electronics, television, digital electronics, principal functions of electronics, pattern recognition, and human-machine communication. His current research interests are in speech communication, human-machine communication, multimodal interfaces, and signal analysis.