The International Arab Journal of Information Technology (IAJIT), Vol. 22, No. 6, November 2025

Neural Volumetric Representations for Real-Time 3D Scene Reconstruction using Multi-Modal Learning Algorithm

Deep Learning (DL) is a subfield of Machine Learning (ML) whose models are used in various complex fields. DL algorithms are among the most widely used methods for reconstructing 3D scenes from images collected from multiple online sources. Reconstructing 3D scenes from 2D images without losing high-quality pixel detail remains very challenging for existing algorithms, because real scenes contain varying lighting conditions, dynamic components, and occlusions. This paper presents a novel real-time 3D scene reconstruction approach that uses neural volumetric representations combined with a Multi-Modal Learning Algorithm (MMLA). The proposed MMLA improves the volumetric representation of scenes by combining multiple modalities: RGB images, depth sensor data, and Inertial Measurement Unit (IMU) data. To learn complex patterns in 3D scenes, the MMLA combines the DeepVoxels model and the Neural Radiance Fields (NeRF) model into an approach referred to as the Neural Rendering technique. A pre-trained EfficientNet model extracts 3D reconstruction patterns and captures spatial structures, and this knowledge is transferred to the proposed MMLA. The performance of the MMLA is analyzed using 2D images from the ShapeNet dataset. The experimental results show that the proposed MMLA achieves superior performance, with a Mean Squared Error (MSE) of 0.167, a Root Mean Squared Error (RMSE) of 0.50, and a Mean Absolute Error (MAE) of 1.1. These results may differ on other datasets.
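NeRF-style neural volumetric representations produce a pixel colour by compositing the densities and colours predicted at sample points along a camera ray. The function below is a minimal sketch of the standard discrete volume-rendering quadrature used by NeRF; it is not the MMLA implementation, which the abstract does not specify. The name composite_ray and its arguments are hypothetical, and the sketch assumes that a network has already predicted a density and an RGB colour for every sample on the ray.

```python
import math
from typing import Sequence, Tuple

RGB = Tuple[float, float, float]


def composite_ray(
    sigmas: Sequence[float], colors: Sequence[RGB], deltas: Sequence[float]
) -> Tuple[RGB, float]:
    """Hypothetical helper: NeRF-style alpha compositing along one ray.

    sigmas: volume density at each sample (assumed >= 0, ordered near to far)
    colors: RGB radiance predicted at each sample
    deltas: distance between consecutive samples
    Returns the composited RGB colour and the accumulated opacity.
    """
    if not (len(sigmas) == len(colors) == len(deltas)):
        raise ValueError("sigmas, colors and deltas must have the same length")
    transmittance = 1.0  # T_i: fraction of light reaching sample i unoccluded
    out = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha  # contribution of this sample
        for k in range(3):
            out[k] += weight * color[k]
        transmittance *= 1.0 - alpha  # same as multiplying by exp(-sigma * delta)
    # The weights sum to 1 - T_final, i.e. the accumulated opacity.
    return (out[0], out[1], out[2]), 1.0 - transmittance


# Usage: empty space followed by a dense red sample renders as (almost) pure red.
color, opacity = composite_ray([0.0, 50.0], [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)], [0.5, 0.5])
print(color, opacity)
```

Because transmittance is carried forward multiplicatively, samples hidden behind a dense region receive almost no weight, which is how a volumetric representation handles occlusion without explicit surface geometry.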
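The three error figures reported above follow standard definitions. The abstract does not state what quantity the errors are measured over (for example pixels, voxels, or depth values), so the sketch below assumes a flat list of paired predicted and ground-truth values; the function name error_metrics is hypothetical. It also makes two properties of the definitions easy to check: RMSE is the square root of MSE, and RMSE is never smaller than MAE when both are computed over the same residuals.

```python
import math
from typing import Dict, Sequence


def error_metrics(pred: Sequence[float], target: Sequence[float]) -> Dict[str, float]:
    """Hypothetical helper: MSE, RMSE and MAE over paired values.

    Assumes pred and target are flat, equally long sequences of numbers,
    e.g. flattened reconstructed and ground-truth values.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    residuals = [p - t for p, t in zip(pred, target)]
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n  # mean of squared residuals
    mae = sum(abs(r) for r in residuals) / n  # mean of absolute residuals
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}


# Usage: one of three values is off by 2.
print(error_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Squaring the residuals makes MSE and RMSE more sensitive than MAE to a few large errors, which is why the three figures are usually reported together.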
