The International Arab Journal of Information Technology (IAJIT)



Deep Learning-Based Control System for Context-Aware Surveillance Using Skeleton Sequences from IP and Drone Camera Video

Human Activity Recognition (HAR) combined with face recognition is set to play a decisive role in next-generation surveillance systems. This work presents a hybrid methodology that integrates deep learning and machine learning models for recognizing multi-person activities and faces. The work is structured into two parts: face recognition and human activity recognition. For face recognition, faces are detected using the state-of-the-art Multi-Task Cascaded Convolutional Neural Network (MTCNN) model, followed by embedding extraction with the FaceNet model. The extracted embeddings are classified using a Support Vector Machine (SVM) to identify individuals; the SVM classifier achieved a classification accuracy of 0.99. For activity recognition, an ensemble model classifies six activities: walking, standing, sitting, punching, kicking, and crawling. The YOLOv8 large pose model extracts human skeletons, which are then fed into the ensemble machine learning model for classification. The integrated system demonstrates promising performance for real-time surveillance applications that detect and recognize multi-person activities and track individuals. Generation of a summary report is one of the most important phases of this work: the location of each person is stored along with the activity being performed. If an abnormal activity is recorded, the system issues an early warning, supporting better surveillance.
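The two classification stages described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the upstream detectors (MTCNN face detection, FaceNet embedding extraction, YOLOv8 pose estimation) are assumed to have already produced feature vectors, so synthetic data stands in for them here. The embedding dimension (512, as produced by FaceNet), the 17 COCO keypoints per skeleton, and the particular ensemble members are assumptions for illustration.

```python
# Sketch of the two classification stages: (1) SVM on face embeddings,
# (2) an ensemble classifier on flattened skeleton keypoints.
# Synthetic clustered data stands in for real detector outputs.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# --- Stage 1: face identification from 512-d FaceNet-style embeddings ---
n_people, per_person = 5, 40
centers = rng.normal(size=(n_people, 512))          # one cluster per identity
X_face = np.vstack([c + 0.1 * rng.normal(size=(per_person, 512)) for c in centers])
y_face = np.repeat(np.arange(n_people), per_person)

Xtr, Xte, ytr, yte = train_test_split(
    X_face, y_face, test_size=0.25, random_state=0, stratify=y_face)
face_clf = SVC(kernel="linear").fit(Xtr, ytr)
face_acc = accuracy_score(yte, face_clf.predict(Xte))

# --- Stage 2: activity classification from flattened skeletons ---
# A YOLOv8 pose skeleton has 17 COCO keypoints; (x, y) pairs flatten to 34-d.
activities = ["walking", "standing", "sitting", "punching", "kicking", "crawling"]
act_centers = rng.normal(size=(len(activities), 34))
X_act = np.vstack([c + 0.2 * rng.normal(size=(60, 34)) for c in act_centers])
y_act = np.repeat(np.arange(len(activities)), 60)

Xtr2, Xte2, ytr2, yte2 = train_test_split(
    X_act, y_act, test_size=0.25, random_state=0, stratify=y_act)
act_clf = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",                                  # average class probabilities
).fit(Xtr2, ytr2)
act_acc = accuracy_score(yte2, act_clf.predict(Xte2))

print(f"face SVM accuracy: {face_acc:.2f}")
print(f"activity ensemble accuracy: {act_acc:.2f}")
```

In a deployed pipeline, the synthetic arrays would be replaced by real FaceNet embeddings and by per-frame keypoint vectors from the YOLOv8 pose model; the classifier stages themselves are unchanged.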
