Ar-CM-ViMETA: Arabic Image Captioning based on Concept Model and Vision-based Multi-Encoder Transformer Architecture
Image captioning is a major artificial intelligence research field that involves visually interpreting an image and producing a corresponding linguistic description. Successful image captioning relies on acquiring as much information as possible from the original image. One essential piece of such information is the topic, or concept, with which the image is associated. Recently, concept modeling has been applied to English image captioning to capture image contexts more fully and exploit them to produce more accurate descriptions. In this paper, a concept-based model is proposed for Arabic Image Captioning (AIC). A novel Vision-based Multi-Encoder Transformer Architecture (ViMETA) is introduced to handle the multiple outputs produced by the concept modeling technique while generating the image caption. The standard BiLingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics have been used to evaluate the proposed model on the Flickr8K dataset with Arabic captions. Furthermore, a qualitative analysis has been conducted to compare the captions produced by the proposed model with the ground-truth descriptions. Based on the experimental results, the proposed model outperformed the related works both quantitatively, under the BLEU and ROUGE metrics, and qualitatively.
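To make the multi-encoder idea concrete, the following is a minimal sketch, not the authors' implementation, of a decoder layer that cross-attends first to vision-encoder features and then to concept-model embeddings before generating the next caption token. All dimensions, module names, and the layer composition are illustrative assumptions.

```python
# Minimal sketch (illustrative only) of a multi-encoder transformer decoder layer:
# the caption decoder attends to two encoder memories, one holding vision features
# (e.g., ViT patch embeddings) and one holding concept embeddings from the concept model.
import torch
import torch.nn as nn

class MultiEncoderDecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.concept_attn = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.drop = nn.Dropout(dropout)

    def forward(self, tgt, vision_mem, concept_mem, tgt_mask=None):
        # Masked self-attention over the partially generated caption.
        x = self.norms[0](tgt + self.drop(self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0]))
        # Cross-attention to the vision-encoder output (image patch features).
        x = self.norms[1](x + self.drop(self.vision_attn(x, vision_mem, vision_mem)[0]))
        # Cross-attention to the concept-encoder output (concept embeddings).
        x = self.norms[2](x + self.drop(self.concept_attn(x, concept_mem, concept_mem)[0]))
        # Position-wise feed-forward network.
        return self.norms[3](x + self.drop(self.ff(x)))

# Toy usage: a batch of 2 captions (10 tokens), 197 vision tokens, 5 extracted concepts.
layer = MultiEncoderDecoderLayer()
tgt = torch.randn(2, 10, 512)
out = layer(tgt, torch.randn(2, 197, 512), torch.randn(2, 5, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Stacking several such layers and adding an output projection over the Arabic vocabulary would yield a caption decoder conditioned on both information sources; other fusion strategies (e.g., concatenating the two memories before a single cross-attention) are equally possible and are not implied by the paper.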