The International Arab Journal of Information Technology (IAJIT)

The Future of Protein Sequence Generation: Performance Assessment Insights

Developing a de facto standard method for generating synthetic protein sequences is challenging, yet such a method would strengthen confidence in protein engineering, provide functional insights, and aid target identification. We present a novel Generative Adversarial Network (GAN) framework tailored to protein sequence generation. It uses the softmax function to handle discrete amino acid outputs and integrates biologically informed loss functions to guide sequence plausibility. By adversarially training a generator-discriminator pair with additional guidance from pretrained protein language models, the framework learns to produce full-length protein sequences from random noise. The evaluation shows that the generated sequences achieve over 90% identity with UniProt entries, low Fréchet Inception Distance (FID) scores, high Template Modeling (TM) scores, and preserved secondary structure features, which together indicate strong structural fidelity. These findings demonstrate the model's ability to generate biologically relevant and structurally sound proteins, providing a scalable approach to data augmentation and design in protein science.
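
The abstract mentions two ideas that a small example can make concrete: a softmax over the 20 amino acids at each sequence position, and percent identity as a similarity measure. The sketch below is illustrative only and is not the authors' implementation. The names `softmax`, `decode`, and `percent_identity`, the random logits standing in for a generator's output, and the assumption that the two sequences are already aligned and of equal length are all hypothetical choices made for this example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def softmax(logits, temperature=1.0):
    """Turn one position's raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logit_rows):
    """Read out the most probable residue at each position."""
    residues = []
    for row in logit_rows:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        residues.append(AMINO_ACIDS[best])
    return "".join(residues)


def percent_identity(seq_a, seq_b):
    """Percentage of matching positions in two equal-length, pre-aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical stand-in for generator output: random logits for 12 positions.
rng = random.Random(0)
logits = [[rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS] for _ in range(12)]
generated = decode(logits)
print(generated, percent_identity(generated, generated))
```

During training, the softmax probabilities themselves (rather than the hard argmax) would typically be passed on, since they remain differentiable; a lower `temperature` sharpens the distribution toward a one-hot choice. Identity against database entries in practice requires an alignment step first, which this sketch leaves out.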
