
A Spatio-Temporal Feature Representation of Multimodal Surveillance Images for Behavioral Recognition
Features obtained from a single behavioral modality cannot accurately express complex learning behaviors. To improve learning behavior recognition, this paper studies a Spatio-Temporal (ST) feature representation method for multimodal surveillance images. Using an improved 3D Convolutional Neural Network (CNN) with Spatio-Temporal Pyramid Pooling (STPP), an attention-based Long Short-Term Memory (LSTM) neural network, and a special orthogonal manifold space network, the RGB spatial features, RGB temporal features, and 3D skeletal features of the surveillance images are extracted channel by channel. An improved dual attention mechanism then integrates the three modal features so that they complement one another, and bounding box regression analysis fuses the ST features of the multimodal surveillance images to obtain the learning behavior recognition results. Experimental results show that the method effectively extracts the ST features of multimodal surveillance images. Under different lighting conditions, the edge information retention of the fused multimodal ST features stays close to 1, indicating effective feature fusion, and the learning behavior recognition accuracy exceeds 96%.
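To make the role of pyramid pooling concrete, the following is a minimal sketch of generic spatio-temporal pyramid pooling on a single feature channel, written in plain Python. It is not the improved STPP layer of this work: the function names (`bin_edges`, `st_pyramid_pool`, `make_volume`), the pyramid levels, and the choice of max pooling are all assumptions made for illustration. The sketch only shows the property that motivates such a layer, namely that clips of different length and resolution are mapped to a feature vector of the same fixed length.

```python
import math
from typing import List, Sequence, Tuple

Volume = List[List[List[float]]]  # one feature channel, indexed [t][y][x]


def bin_edges(size: int, bins: int) -> List[Tuple[int, int]]:
    """Split range(size) into `bins` non-empty windows (they may overlap)."""
    return [
        (math.floor(i * size / bins), math.ceil((i + 1) * size / bins))
        for i in range(bins)
    ]


def st_pyramid_pool(
    volume: Volume,
    levels: Sequence[Tuple[int, int]] = ((1, 1), (2, 2)),  # assumed levels
) -> List[float]:
    """Max-pool a T x H x W volume over each pyramid level and concatenate.

    Each level is (temporal_bins, spatial_bins); the spatial bin count is
    applied to both height and width. The output length depends only on
    `levels`, not on the size of `volume`.
    """
    t_len, h_len, w_len = len(volume), len(volume[0]), len(volume[0][0])
    pooled: List[float] = []
    for t_bins, s_bins in levels:
        for t0, t1 in bin_edges(t_len, t_bins):
            for y0, y1 in bin_edges(h_len, s_bins):
                for x0, x1 in bin_edges(w_len, s_bins):
                    pooled.append(max(
                        volume[t][y][x]
                        for t in range(t0, t1)
                        for y in range(y0, y1)
                        for x in range(x0, x1)
                    ))
    return pooled


def make_volume(t_len: int, h_len: int, w_len: int) -> Volume:
    """Hypothetical test data: each cell holds its own flattened index."""
    return [
        [[float(t * h_len * w_len + y * w_len + x) for x in range(w_len)]
         for y in range(h_len)]
        for t in range(t_len)
    ]


short_clip = make_volume(4, 4, 4)
long_clip = make_volume(9, 6, 7)
# Both clips yield 1 + 2*2*2 = 9 pooled values.
print(len(st_pyramid_pool(short_clip)), len(st_pyramid_pool(long_clip)))  # prints: 9 9
```

In a real network the same pooling would be applied per channel to the output of the last 3D convolution, so that the fully connected layers receive a fixed-size input whatever the clip dimensions.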
The International Arab Journal of Information Technology, Vol. 22, No. 4, July 2025