The International Arab Journal of Information Technology (IAJIT)



ACLM: Developing a Compact Arabic Language Model

Recent advancements in Large Language Models (LLMs) have transformed Natural Language Processing (NLP). These models have demonstrated unprecedented capabilities in understanding and generating human language. However, their large-scale nature often poses challenges related to computational resource requirements, latency, and deployment, especially in resource-constrained environments. This research focuses on the design, development, and evaluation of an Arabic Small Language Model (SLM), named the Arabic Compact Language Model (ACLM), built to be compact and efficient. ACLM aims to bridge the gap between the high resource demands of existing large-scale models and the practical needs of real-world applications by leveraging high-quality Arabic data. We began with an existing language model, the base variant of the Pre-Trained Transformer for Arabic Language Generation (AraGPT2-base), and further pre-trained it on high-quality Arabic data to enhance its performance while maintaining a compact size. This approach emphasizes data quality over model size, drawing on insights from recent studies that highlight the effectiveness of high-quality data in improving model performance. To evaluate ACLM, we conducted two key assessments: 1) a survey-based evaluation involving three LLMs: ChatGPT (GPT-4o), Gemini Pro, and Command R+, and 2) a perplexity analysis on generated and real-world text. ACLM outperformed AraGPT2-base in 4 out of 5 scenarios. Additionally, ACLM demonstrated superior fluency, achieving a perplexity of 31.74 on generated text compared to 165.28 for AraGPT2-base, and a perplexity of 124.67 on real-world Arabic books, significantly lower than the 2011.88 of AraGPT2-base.
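As a concrete illustration of the perplexity comparison described above, the following is a minimal sketch of how a causal language model can be scored on Arabic text with the Hugging Face transformers library. The checkpoint identifier "aubmindlab/aragpt2-base" and the sample strings are illustrative assumptions, not the models or evaluation set used in this work; perplexity is computed as the exponential of the model's mean per-token cross-entropy loss.

```python
# Minimal perplexity sketch for a causal LM (assumed checkpoint, placeholder texts).
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "aubmindlab/aragpt2-base"  # assumed public AraGPT2-base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    # Score the full sequence; the returned loss is the mean negative
    # log-likelihood per token, so exp(loss) is the perplexity.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Placeholder Arabic snippets, not the paper's evaluation texts.
texts = ["أعلنت وزارة الصحة اليوم عن تسجيل حالات جديدة.", "تأسست مدينة بغداد في القرن الثامن الميلادي."]
scores = [perplexity(t) for t in texts]
print(sum(scores) / len(scores))
```

Averaging the per-passage perplexities in this way allows a lower score to be read as more fluent text under the scoring model.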

Mohamed Alkaoud received the B.S. degree in Computer Science from King Saud University, Riyadh, Saudi Arabia, in 2014; the M.S. degree in Computer Science from the University of California, Davis, Davis, CA, USA, in 2018; and the Ph.D. degree in Computer Science from the University of California, Davis, Davis, CA, USA, in 2021. Since 2021, he has been an Assistant Professor with the Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. His research interests include Artificial Intelligence, Machine Learning, Natural Language Processing, Computational Linguistics, Computer Vision, Digital Humanities, and Education.

Muteb Alsaqoub received the B.S. degree in Computer Science from King Saud University, Riyadh, Saudi Arabia, in 2024. His research interests include Artificial Intelligence and Machine Learning.

Ibrahim Aljodhi received the B.S. degree in Computer Science from King Saud University, Riyadh, Saudi Arabia, in 2024.
His research interests include Artificial Intelligence and Machine Learning.

Abdulrhman Alqadibi received the B.S. degree in Computer Science from King Saud University, Riyadh, Saudi Arabia, in 2024. His research interests include Artificial Intelligence and Machine Learning.

Omar Altammami received the B.S. degree in Computer Science from King Saud University, Riyadh, Saudi Arabia, in 2024. His research interests include Artificial Intelligence and Machine Learning.

Appendix I. A list of the generated examples that we used to calculate perplexity. The ten Arabic passages cover: 1) a Ministry of Health announcement of 500 new coronavirus cases and 400 recoveries, urging citizens to follow precautionary measures and wear masks in public places; 2) a quiet night scene in which a writer sits on his balcony sipping tea and reflecting on the day's events; 3) the founding of Baghdad in the eighth century CE by the Abbasid caliph al-Mansur and its growth into a center of science, culture, and trade; 4) an explanation of photosynthesis, in which leaves absorb sunlight and store the resulting energy in glucose molecules used for growth; 5) a description of how a computer processes digital data with its processor and memory, stores it on hard disks, and displays results on the screen; 6) a dialogue in which Ahmed asks Ali when he will visit and Ali promises to come at the end of the week; 7) a nostalgic passage about longing for shared moments and happy memories; 8) a passage on diligence and hard work as the keys to success in life; 9) instructions for opening a file by right-clicking, choosing "Open with", and selecting the appropriate program from the drop-down menu; and 10) a summer day at the beach with gentle waves, golden sand, children playing, and fresh lemonade under a large umbrella.