The International Arab Journal of Information Technology (IAJIT)



Optimizing Multimodal RAG Systems for Multilingual Product Support

Multilingual product support requires machine learning systems that integrate text, speech, and visual data and adapt to varied linguistic contexts. Retrieval-Augmented Generation (RAG) frameworks are a promising solution. However, their effectiveness depends heavily on how retrieval strategies and language models are integrated, which remains understudied, particularly for Indic languages and multimodal applications. This paper presents a systematic evaluation of five RAG architectures across seventeen pipeline configurations, combining retrieval methods such as Best Matching 25 (BM25), Dense Passage Retrieval (DPR), Chroma, and Facebook AI Similarity Search (FAISS) with multilingual embedding models including IndicBERT, mT5, and sentence transformers. A curated dataset of 170 engineering manuals, brochures, and presentations was used to replicate real-world troubleshooting scenarios. Among the evaluated approaches, a ColPali-inspired multimodal fusion mechanism, capable of jointly encoding text and images, substantially improved retrieval precision and diagnostic support in complex cases. Evaluation using Recall@5, Mean Reciprocal Rank (MRR), BLEU, ROUGE-L, and mean Average Precision (mAP) shows that hybrid pipelines, particularly Chroma-FAISS with mT5, achieve strong semantic alignment (Recall@5=0.78, ROUGE-L=0.46) while maintaining efficiency. The ColPali-based multimodal RAG further enhances performance, reaching 94% Top-1 retrieval accuracy and user satisfaction above 90%. These results indicate that carefully structured hybrid and multimodal RAG systems can provide accurate, fluent, and inclusive support in real time, and they offer design guidelines applicable to other domains requiring real-time interventions (healthcare, education, technical training, etc.).
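For readers unfamiliar with the retrieval metrics named above, the following is a minimal sketch of how Recall@k and MRR are commonly computed. It is not the evaluation code used in the paper: the abstract does not state the exact metric definitions, and the function names and the toy document identifiers below are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant documents found among the top-k results.

    Assumption: the common set-based definition; some studies instead
    report a hit rate (1 if any relevant document is in the top k).
    """
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: three queries with ranked results and relevant sets.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d9"], {"d2", "d5"}),        # first relevant hit at rank 1
    (["d4"], {"d8"}),                    # no relevant document retrieved
]

mrr = mean_reciprocal_rank(runs)
mean_recall_at_5 = sum(recall_at_k(r, rel, k=5) for r, rel in runs) / len(runs)
```

In this toy example the per-query reciprocal ranks are 0.5, 1.0, and 0.0, and the per-query Recall@5 values are 1.0, 0.5, and 0.0, so both averages come to 0.5.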

 

