The International Arab Journal of Information Technology (IAJIT), Vol. 21, No. 5, September 2024



3D VAE Video Prediction Model with Kullback Leibler Loss Enhancement

Video Prediction (VP) models adopt many techniques to build structures that extract spatiotemporal features and predict future frames. Earlier VP techniques extracted spatial and temporal features in separate models and then fused both to generate the future frame; however, these architectures suffered from design complexity and long prediction times. Consequently, many efforts have aimed at reducing design complexity while still producing good results. This study presents a VP model based on a Three-Dimensional Variational Auto Encoder (3D VAE). The proposed model builds all layers on 3D convolutional layers, which improves the extraction of spatiotemporal information and reduces design complexity. In addition, the Kullback-Leibler Loss (KL Loss) is enhanced by a 3D sampling stage that computes the loss over a 3D latent, helping the 3D encoder extract a better and more appropriate spatiotemporal latent variable; the 3D sampling acts as an effective regularizer in the model. Evaluated on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) and Caltech pedestrian datasets, the proposed model achieves SNR=34.8673 and SSIM=0.9616 with only 5.2 M parameters.
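The abstract's key ingredients, reparameterized sampling over a 3D latent and a KL regularizer computed on that latent, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the latent shape and statistics below are hypothetical placeholders standing in for the 3D encoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3D latent statistics from an encoder, kept spatiotemporal
# as (channels, depth, height, width) rather than flattened to a vector.
mu = rng.standard_normal((4, 2, 8, 8))
log_var = rng.standard_normal((4, 2, 8, 8)) * 0.1

def sample_3d(mu, log_var, rng):
    """Reparameterization trick, z = mu + sigma * eps, elementwise over the 3D latent."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_loss(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over the latent volume."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

z = sample_3d(mu, log_var, rng)   # spatiotemporal latent passed to the 3D decoder
loss = kl_loss(mu, log_var)       # regularization term added to the reconstruction loss
```

Because the sum runs over every element of the 3D latent volume, the KL term penalizes the encoder's full spatiotemporal distribution, which is the regularizing effect the abstract attributes to the 3D sampling stage.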

[1] Asperti A., “Balancing Reconstruction Error and Kullback-Leibler Divergence in Variational Autoencoders,” IEEE Access, vol. 8, pp. 199440-199448, 2020. DOI:10.1109/ACCESS.2020.3034828

[2] Byeon W., Wang Q., Srivastava R., and Koumoutsakos P., “ContextVP: Fully Context-Aware Video Prediction,” in Proceedings of the European Conference on Computer Vision, Munich, pp. 753-769, 2018. https://doi.org/10.1007/978-3-030-01270-0_46

[3] Castrejon L., Ballas N., and Courville A., “Improved Conditional VRNNs for Video Prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, pp. 7608-7617, 2019. https://doi.org/10.48550/arXiv.1904.12165

[4] Cheng Z., Sun H., Takeuchi M., and Katto J., “Deep Convolutional AutoEncoder-based Lossy Image Compression,” in Proceedings of the Picture Coding Symposium, San Francisco, pp. 253-257, 2018. DOI:10.1109/PCS.2018.8456308

[5] Cordts M., Omran M., Ramos S., Rehfeld T., Enzweiler M., Benenson R., Franke U., Roth S., and Schiele B., “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, pp. 3213-3223, 2016. DOI:10.1109/CVPR.2016.350

[6] Courtney L. and Sreenivas R., “Comparison of Spatiotemporal Networks for Learning Video Related Tasks,” arXiv Preprint, arXiv:2009.07338, 2020. DOI:10.48550/arXiv.2009.07338

[7] Desai P., Sujatha C., Chakraborty S., Ansuman S., Bhandari S., and Kardiguddi S., “Next Frame Prediction Using ConvLSTM,” in Proceedings of the 1st International Conference on Artificial Intelligence, Computational Electronics and Communication System, Manipal, pp. 28-30, 2021. DOI:10.1088/1742-6596/2161/1/012024

[8] Dollar P., Wojek C., Schiele B., and Perona P., “Pedestrian Detection: A Benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, pp. 304-311, 2009. DOI:10.1109/CVPR.2009.5206631

[9] Gao Z., Tan C., Wu L., and Li S., “SimVP: Simpler Yet Better Video Prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, pp. 3170-3180, 2022. DOI:10.1109/CVPR52688.2022.00317

[10] Geiger A., Lenz P., Stiller C., and Urtasun R., “Vision Meets Robotics: The KITTI Dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, 2013. DOI:10.1177/0278364913491297

[11] Goliński A., Pourreza R., Yang Y., Sautière G., and Cohen T., “Feedback Recurrent Autoencoder for Video Compression,” in Proceedings of the 15th Asian Conference on Computer Vision, Kyoto, pp. 591-607, 2021. DOI:10.1007/978-3-030-69538-5_36

[12] Gupta A., Tian S., Zhang Y., Wu J., Martín R., and Fei L., “MaskViT: Masked Visual Pre-Training for Video Prediction,” in Proceedings of the 11th International Conference on Learning Representations, Kigali, pp. 1-24, 2022. https://openreview.net/pdf?id=QAV2CcLEDh

[13] Hao Z., Huang X., and Belongie S., “Controllable Video Generation with Sparse Trajectories,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, pp. 7854-7863, 2018. DOI:10.1109/CVPR.2018.00819

[14] Hou X., Sun K., Shen L., and Qiu G., “Improving Variational Autoencoder with Deep Feature Consistent and Generative Adversarial Training,” Neurocomputing, vol. 341, no. 14, pp. 183-194, 2019. https://doi.org/10.1016/j.neucom.2019.03.013

[15] Hu W., Li H. C., Pan L., Li W., Tao R., and Du Q., “Spatial-Spectral Feature Extraction via Deep ConvLSTM Neural Networks for Hyperspectral Image Classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 6, pp. 4237-4250, 2020. DOI:10.1109/TGRS.2019.2961947

[16] Kapoor S., Sharma A., Verma A., Dhull V., and Goyal C., “A Comparative Study on Deep Learning and Machine Learning Models for Human Action Recognition in Aerial Videos,” The International Arab Journal of Information Technology, vol. 20, no. 4, pp. 567-574, 2023. https://doi.org/10.34028/iajit/20/4/2

[17] Kwon Y. and Park M., “Predicting Future Frames using Retrospective Cycle GAN,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, pp. 1811-1821, 2019. DOI:10.1109/CVPR.2019.00191

[18] Liang X., Lee L., Dai W., and Xing E., “Dual Motion GAN for Future-Flow Embedded Video Prediction,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, pp. 1744-1752, 2017. https://doi.org/10.1109/iccv.2017.194

[19] Liu B., Chen Y., Liu S., and Kim H., “Deep Learning in Latent Space for Video Prediction and Compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, pp. 701-710, 2021. DOI:10.1109/CVPR46437.2021.00076

[20] Lotter W., Kreiman G., and Cox D., “Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning,” arXiv Preprint, arXiv:1605.08104, pp. 1-18, 2017. https://doi.org/10.48550/arXiv.1605.08104

[21] Lu W., Cui J., Chang Y., and Zhang L., “A Video Prediction Method Based on Optical Flow Estimation and Pixel Generation,” IEEE Access, vol. 9, pp. 100395-100406, 2021. DOI:10.1109/ACCESS.2021.3096788

[22] Lu Y., Mahesh Kumar K., Seyed Shahabeddin N., and Wang Y., “Future Frame Prediction Using Convolutional VRNN for Anomaly Detection,” in Proceedings of the 16th IEEE International Conference on Advanced Video and Signal Based Surveillance, Taipei, pp. 1-8, 2019. DOI:10.1109/AVSS.2019.8909850

[23] Michelucci U., “An Introduction to Autoencoders,” arXiv:2201.03898v1, 2022. http://arxiv.org/abs/2201.03898

[24] Odaibo S., “Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function,” arXiv Preprint, arXiv:1907.08956, pp. 1-8, 2019. https://doi.org/10.48550/arXiv.1907.08956

[25] Oprea S., Martinez-Gonzalez P., Garcia-Garcia A., Castro-Vargas J., Orts-Escolano S., Garcia-Rodriguez J., and Argyros A., “A Review on Deep Learning Techniques for Video Prediction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 2806-2832, 2022. https://doi.org/10.1109/TPAMI.2020.3045007

[26] Pan J., Wang C., Jia X., Shao J., Sheng L., Yan J., and Wang X., “Video Generation from Single Semantic Label Map,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, pp. 3733-3742, 2019. DOI:10.1109/CVPR.2019.00385

[27] Pratella D., Saadi S., Bannwarth S., Paquis-Fluckinger V., and Bottini S., “A Survey of Autoencoder Algorithms to Pave the Diagnosis of Rare Diseases,” International Journal of Molecular Sciences, vol. 22, no. 19, 2021. DOI:10.3390/ijms221910891

[28] Ranjan N., Bhandari S., Kim Y., and Kim H., “Video Frame Prediction by Joint Optimization of Direct Frame Synthesis and Optical-Flow Estimation,” Computers, Materials and Continua, vol. 75, no. 2, pp. 2615-2639, 2023. DOI:10.32604/cmc.2023.026086

[29] Razali H. and Fernando B., “A Log-Likelihood Regularized KL Divergence for Video Prediction with a 3D Convolutional Variational Recurrent Network,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops, Waikoloa, pp. 209-217, 2021. DOI:10.1109/WACVW52041.2021.00027

[30] Straka Z., Svoboda T., and Hoffmann M., “PreCNet: Next Frame Video Prediction Based on Predictive Coding,” IEEE Transactions on Neural Networks and Learning Systems, 2023. http://arxiv.org/abs/2004.14878

[31] Suzuki M. and Matsuo Y., “A Survey of Multimodal Deep Generative Models,” Advanced Robotics, vol. 36, no. 5-6, pp. 261-278, 2022. DOI:10.1080/01691864.2022.2035253

[32] Villegas R., Yang J., Hong S., Lin X., and Lee H., “Decomposing Motion and Content for Natural Video Sequence Prediction,” arXiv Preprint, arXiv:1706.08033, 2017. http://arxiv.org/abs/1706.08033

[33] Wang Y., Jiang L., Yang M., Li L., Long M., and Fei-Fei L., “Eidetic 3D LSTM: A Model for Video Prediction and Beyond,” in Proceedings of the International Conference on Learning Representations, New Orleans, pp. 1-14, 2019. https://openreview.net/pdf?id=B1lKS2AqtX

[34] Wei R. and Mahmood A., “Recent Advances in Variational Autoencoders with Representation Learning for Biomedical Informatics: A Survey,” IEEE Access, vol. 9, pp. 4939-4956, 2021. DOI:10.1109/ACCESS.2020.3048309

[35] Wu B., Nair S., Martín-Martín R., Fei-Fei L., and Finn C., “Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, pp. 2318-2328, 2021. DOI:10.1109/CVPR46437.2021.00235

[36] Yang Y., Zheng K., Wu C., and Yang Y., “Improving the Classification Effectiveness of Intrusion Detection by Using Improved Conditional Variational Autoencoder and Deep Neural Network,” Sensors, vol. 19, no. 11, 2019. DOI:10.3390/s19112528

[37] Ye X. and Bilodeau G., “A Unified Model for Continuous Conditional Video Prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, pp. 3603-3612, 2023. DOI:10.1109/CVPRW59228.2023.00368