The International Arab Journal of Information Technology (IAJIT)

A Bimodal Emotion Recognition Algorithm for Audio and Video Based on Emotion Modeling

In audio-video bimodal emotion recognition, audio features and video features come from different modalities and carry different representations and semantic information. Traditional methods rely on the information of a single modality, so the fused features cannot comprehensively represent the emotional state, which leads to poor recognition results and small correlation coefficients. To address this, a bimodal emotion recognition algorithm for audio and video based on emotion modeling is proposed. First, the emotional audio is divided into frames and passed through a Fourier transform to obtain Mel-Frequency Cepstral Coefficient (MFCC) features, frame-level time-domain speech features are extracted, and SoundNet encoding features of the audio are mined. Second, these three feature sets are concatenated to form the complete emotional audio feature representation. Then, a Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) network are used to capture the emotional video features in depth. Finally, cross-modal learning and an attention mechanism are used to integrate the extracted emotion features, and a decision-level fusion network determines the emotion category, completing audio-video bimodal emotion recognition. This design avoids the poor results of single-modality recognition and improves recognition accuracy and reliability. Experimental results show that the proposed algorithm recognizes audio-video bimodal emotions effectively, and the correlation coefficient of the recognition results is large.
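To make the described pipeline concrete, the following is a minimal sketch of the architecture, assuming PyTorch and librosa as the toolchain (neither is specified in the paper). The SoundNet encoder is stubbed with a zero placeholder (soundnet_stub), and the dimensions (d_audio, d_video, d_model), the number of attention heads, and the seven emotion classes are illustrative assumptions, not the authors' settings.

# A minimal sketch of the pipeline from the abstract, not the authors'
# exact implementation. Assumptions (not from the paper): librosa for MFCC
# and frame-level time-domain features, a zero placeholder in place of real
# SoundNet embeddings, and dimensions/classes chosen only for illustration.
import librosa
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def audio_features(wav_path: str) -> torch.Tensor:
    """Concatenate the three audio feature sets: MFCC, frame-level
    time-domain features, and a SoundNet-style embedding (stubbed here)."""
    y, sr = librosa.load(wav_path, sr=16000)
    # 1) MFCCs from framed, Fourier-transformed audio.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # (13, T)
    # 2) Frame-level time-domain features: per-frame energy and
    #    zero-crossing rate (same hop length, so frame counts align).
    rms = librosa.feature.rms(y=y)                               # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)                  # (1, T)
    # 3) Placeholder for SoundNet encoding features; a real system would
    #    run the waveform through a pretrained SoundNet and pool per frame.
    T = mfcc.shape[1]
    soundnet_stub = np.zeros((8, T), dtype=np.float32)           # hypothetical
    feats = np.concatenate([mfcc, rms, zcr, soundnet_stub], axis=0)
    return torch.from_numpy(feats.T).float()                     # (T, d_audio)

class BimodalEmotionNet(nn.Module):
    def __init__(self, d_audio=23, d_video=512, d_model=128, n_classes=7):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        # Video branch: LSTM over per-frame visual features.
        self.video_lstm = nn.LSTM(d_video, d_model, batch_first=True)
        # Cross-modal attention: audio frames attend over video frames.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.audio_head = nn.Linear(d_model, n_classes)
        self.fused_head = nn.Linear(d_model, n_classes)

    def forward(self, audio, video):
        a = self.audio_proj(audio)                     # (B, Ta, d_model)
        v, _ = self.video_lstm(video)                  # (B, Tv, d_model)
        fused, _ = self.cross_attn(a, v, v)            # audio attends to video
        # Decision-level fusion: average the class probabilities of the
        # audio-only branch and the cross-attended branch.
        p_audio = F.softmax(self.audio_head(a.mean(dim=1)), dim=-1)
        p_fused = F.softmax(self.fused_head(fused.mean(dim=1)), dim=-1)
        return (p_audio + p_fused) / 2

Inputs to forward are batched, e.g., model(audio_features("clip.wav").unsqueeze(0), video_feats). The averaging in the last step is only one way to realize decision-level fusion; the paper's fusion network may weight the two branches differently.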
