The International Arab Journal of Information Technology (IAJIT)



Improved Taylor Hyperbolic Tangent and Sigmoid Activations for Avoiding Vanishing Gradients in Recurrent Neural Nets

In deep learning, the Hyperbolic Tangent (Tanh) and Sigmoid nonlinear activation functions can capture complex relationships, which makes them well suited to Recurrent Neural Networks (RNNs). The gradients of these activation functions are vital for updating the weights during training. However, both functions are vulnerable to the vanishing gradient problem and are expensive to compute because of their exponent operations: gradients shrink toward zero during backpropagation, which increases training overhead and degrades performance. Although most studies have put forward methods to reduce the exponent operations, no viable solution exists for the gradient issue. Hence, we propose second-order Taylor expansions to realize the Tanh and Sigmoid functions. The Long Short-Term Memory (LSTM) network in particular makes extensive use of these functions, together with a gating mechanism, to control the flow of information and gradients. Consequently, a parallel heterogeneous LSTM network based on the Taylor-expansion Tanh and Sigmoid activation functions and integrated with Bayesian hyperparameter optimization is proposed for multi-step time series prediction. The model's efficacy is evaluated on the benchmark datasets Mackey-Glass Series (MGS), Electricity Transformer Temperature hourly 2 (ETTh2), and coronavirus daily cumulative cases in India: Cumulative Deaths (CD-5 and CD-7), daily New Cases (NC4), and Total Recovery Cases (TRC-8). Its performance is compared with conventional models such as the Auto Regressive Integrated Moving Average (ARIMA), the Tree-based Pipeline Optimization Tool (TPOT) regressor, LSTM, the Gated Recurrent Unit (GRU), the transformer, and the proposed model with the standard Tanh and Sigmoid activations. The analysis reveals that the proposed model achieves remarkable performance in terms of Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Coefficient of Determination (R2 Score) when compared with the existing models.
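To illustrate the idea, low-order Maclaurin truncations of the two activations replace the exponent operations with a few multiplications. The sketch below is a minimal, hypothetical rendering under the assumption of expansion around zero; because both functions have vanishing second-order terms there, the first non-trivial term for Tanh is cubic. The paper's exact coefficients, expansion point, and range handling are not specified here and may differ.

```python
import math

def taylor_tanh(x):
    # Maclaurin truncation of tanh: tanh(x) ~ x - x**3 / 3
    # (tanh is odd, so the quadratic term is zero)
    return x - x**3 / 3.0

def taylor_sigmoid(x):
    # Maclaurin truncation of the sigmoid: sigma(x) ~ 1/2 + x/4
    # (the quadratic term vanishes at x = 0)
    return 0.5 + x / 4.0

# Near zero the polynomial surrogates track the exact, exponent-based
# functions closely while avoiding any call to exp()
for x in (0.0, 0.25, 0.5):
    exact_sigmoid = 1.0 / (1.0 + math.exp(-x))
    print(x,
          abs(taylor_tanh(x) - math.tanh(x)),
          abs(taylor_sigmoid(x) - exact_sigmoid))
```

Note that the polynomial gradients (e.g., 1 - x^2 for the truncated Tanh) do not decay exponentially for moderate inputs, which is the property the abstract invokes against vanishing gradients; outside a bounded interval, however, a truncated polynomial diverges from the saturating originals and would need clamping in practice.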
