The International Arab Journal of Information Technology (IAJIT), Vol. 17, No. 6, November 2020



F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation
The generation of the fundamental frequency (F0) plays an important role in speech synthesis, as it directly influences the naturalness of synthetic speech. In conventional statistical parametric speech synthesis, F0 is predicted frame by frame. This approach is insufficient to represent F0 contours over larger units, especially the tone contours of syllables in tonal languages, which vary with long-term context dependencies. This work proposes a syllable-level F0 model that represents the F0 contour within each syllable using syllable-level F0 parameters comprising sampled F0 points and dynamic features. A Deep Neural Network (DNN) was used to model the relationship between syllable-level contextual features and syllable-level F0 parameters. The proposed model was examined in an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms baseline systems based on hidden Markov models and on DNNs that predict F0 values at the frame level.
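As a rough sketch of the syllable-level representation described in the abstract, the F0 contour of a syllable can be resampled at a fixed number of equally spaced points and augmented with first-order dynamic (delta) features, giving a fixed-length vector that a DNN can predict from syllable-level context. The number of sampling points and the exact delta computation here are illustrative assumptions, not the paper's exact parameterisation:

```python
import numpy as np

def syllable_f0_params(f0_frames, n_points=10):
    """Fixed-length syllable-level F0 parameters (illustrative sketch).

    f0_frames: 1-D array of frame-level (log-)F0 values for one syllable.
    Returns a vector of length 2 * n_points: the contour sampled at
    n_points equally spaced positions, followed by its delta features.
    """
    f0_frames = np.asarray(f0_frames, dtype=float)
    # Normalised time axis of the original contour and the sample positions.
    t_orig = np.linspace(0.0, 1.0, len(f0_frames))
    t_samp = np.linspace(0.0, 1.0, n_points)
    # Linear interpolation resamples syllables of any duration to n_points.
    points = np.interp(t_samp, t_orig, f0_frames)
    # First-order dynamic features via finite differences.
    deltas = np.gradient(points)
    return np.concatenate([points, deltas])
```

Because every syllable maps to the same vector length regardless of its duration, the DNN output layer has a fixed size, which is what makes syllable-level (rather than frame-level) prediction straightforward.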


[1] Amrouche A., Falek L., and Teffahi H., “Design and Implementation of a Diacritic Arabic Text-To-Speech System,” The International Arab Journal of Information Technology, vol. 14, no. 4, pp. 488-494, 2017.

[2] Chomphan S. and Kobayashi T., “Implementation and Evaluation of an HMM-Based Thai Speech Synthesis System,” in Proceedings of the 8th Annual Conference of the International Speech Communication Association, Antwerp, pp. 2849-2852, 2007.

[3] Chomphan S. and Kobayashi T., “Tone Correctness Improvement in Speaker Dependent HMM-Based Thai Speech Synthesis,” Speech Communication, vol. 50, no. 5, pp. 392-404, 2008.

[4] Fujisaki H., Narusawa S., and Maruno M., “Pre-Processing of Fundamental Frequency Contours of Speech for Automatic Parameter Extraction,” in Proceedings of International Conference on Signal Processing, Beijing, pp. 722-725, 2000.

[5] Fujisaki H. and Hirose K., “Analysis of Voice Fundamental Frequency Contours for Declarative Sentences of Japanese,” Journal of the Acoustical Society of Japan, vol. 5, no. 4, pp. 233-242, 1984.

[6] Gandour J., Potisuk S., and Dechongkit S., “Tonal Coarticulation in Thai,” Journal of Phonetics, vol. 22, no. 4, pp. 477-492, 1994.

[7] Goodfellow I., Bengio Y., and Courville A., Deep Learning, MIT Press, 2016.

[8] Janyoi P. and Seresangtakul P., “An Isarn Dialect HMM-Based Text-To-Speech System,” in Proceedings of 2nd International Conference on Information Technology, Nakhonpathom, pp. 1-6, 2017.

[9] Lazaridis A., Potard B., and Garner P., “DNN-Based Speech Synthesis: Importance of Input Features and Training Data,” in Proceedings of International Conference on Speech and Computer, Athens, pp. 193-200, 2015.

[10] Li Y., Tao J., Hirose K., Xu X., and Lai W., “Hierarchical Stress Modeling and Generation in Mandarin for Expressive Text-To-Speech,” Speech Communication, vol. 72, pp. 59-73, 2015.

[11] Masuko T., Tokuda K., Kobayashi T., and Imai S., “HMM-Based Speech Synthesis with Various Voice Characteristics,” The Journal of the Acoustical Society of America, vol. 100, no. 4, pp. 2760, 1996.

[12] Mittrapiyanuruk P., Hansakunbuntheung C., Tesprasit V., and Sornlertlamvanich V., “Issues in Thai Text-to-Speech Synthesis: the NECTEC Approach,” NECTEC Technical Journal, vol. 2, no. 7, pp. 36-47, 2000.

[13] Mnasri Z., Boukadida F., and Ellouze N., “F0 Contour Modeling for Arabic Text-to-Speech Synthesis Using Fujisaki Parameters and Neural Networks,” Signal Processing: An International Journal, vol. 6, no. 4, pp. 352-369.

[14] Morise M., Yokomori F., and Ozawa K., “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877-1884, 2016.

[15] Mukherjee S. and Mandal S., “F0 Modeling in HMM-Based Speech Synthesis System Using Deep Belief Network,” in Proceedings of 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques, Phuket, pp. 1-5, 2014.

[16] Qian Y., Fan Y., Hu W., and Soong F., “On the Training Aspects of Deep Neural Network (DNN) for Parametric TTS Synthesis,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, pp. 3829-3833, 2014.

[17] Qian Y., Soong F., Chen Y., and Chu M., “An HMM-Based Mandarin Chinese Text-To-Speech System,” in Proceedings of International Symposium on Chinese Spoken Language Processing, Singapore, pp. 223-232, 2006.

[18] Ribeiro M. and Clark R., “A Multi-Level Representation of F0 Using the Continuous Wavelet Transform and the Discrete Cosine Transform,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, pp. 4909-4913, 2015.

[19] Sagisaka Y., “On the Prediction of Global F0 Shape for Japanese Text-To-Speech,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 325-328, 1990.

[20] Seresangtakul P. and Takara T., “Analysis of Pitch Contour of Thai Tone Using Fujisaki’s Model,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, pp. 505-508, 2002.

[21] Seresangtakul P. and Takara T., “Synthesis of Polysyllabic Sequences of Thai Tones Using a Generative Model of Fundamental Frequency Contours,” IEEJ Transactions on Electronics Information and Systems, vol. 125, no. 7, pp. 1101-1108, 2005.

[22] Shinoda K. and Watanabe T., “MDL-Based Context-Dependent Subword Modeling for Speech Recognition,” Acoustical Science and Technology, vol. 21, no. 2, pp. 79-86, 2000.

[23] Siriaksornsat P., Thai Dialects, Department of Thai and Oriental Languages, Ramkhamhaeng University, Bangkok, 2011.

[24] Stan A., Yamagishi J., King S., and Aylett M., “The Romanian Speech Synthesis (RSS) Corpus: Building A High Quality HMM-Based Speech Synthesis System Using A High Sampling Rate,” Speech Communication, vol. 53, no. 3, pp. 442-450, 2011.

[25] Taylor P., “Analysis and Synthesis of Intonation Using the Tilt Model,” The Journal of the Acoustical Society of America, vol. 107, no. 3, pp. 1697-1714, 2000.

[26] Teutenberg J., Watson C., and Riddle P., “Modelling and Synthesising F0 Contours with the Discrete Cosine Transform,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, pp. 3973-3976, 2008.

[27] Thangthai A., Thatphithakkul N., Wutiwiwatchai C., Rugchatjaroen A., and Saychum S., “T-Tilt: A Modified Tilt Model for F0 Analysis and Synthesis in Tonal Languages,” in Proceedings of the 9th Annual Conference of the International Speech Communication Association, Brisbane, pp. 2270-2273, 2008.

[28] Tokuda K., Black A., and Zen H., “An HMM-Based Speech Synthesis System Applied to English,” in Proceedings of IEEE Workshop on Speech Synthesis, Santa Monica, pp. 227-230, 2002.

[29] Tokuda K., Masuko T., Miyazaki N., and Kobayashi T., “Hidden Markov Models Based on Multi-Space Probability Distribution for Pitch Pattern Modeling,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, pp. 229-232, 1999.

[30] Tokuda K., Nankaku Y., Toda T., Zen H., Yamagishi J., and Oura K., “Speech Synthesis Based on Hidden Markov Models,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1234-1252, 2013.

[31] Tokuda K., Yoshimura T., Masuko T., Kobayashi T., and Kitamura T., “Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, pp. 1315-1318, 2000.

[32] Tóth B. and Csapó T., “Continuous Fundamental Frequency Prediction With Deep Neural Networks,” in Proceedings of 24th European Signal Processing Conference, Budapest, pp. 1348-1352, 2016.

[33] Wang C., Ling Z., Zhang B., and Dai L., “Multi-Layer F0 Modeling for HMM-Based Speech Synthesis,” in Proceedings of 6th International Symposium on Chinese Spoken Language Processing, Kunming, pp. 1-4, 2008.

[34] Wu Y. and Soong F., “Modeling Pitch Trajectory by Hierarchical HMM with Minimum Generation Error Training,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, pp. 4017-4020, 2012.

[35] Wutiwiwatchai C., Hansakunbuntheung C., Rugchatjaroen A., Saychum S., Kasuriya S., and Chootrakool P., “Thai Text-to-Speech Synthesis: A Review,” Journal of Intelligent Informatics and Smart Technology, vol. 2, no. 2, pp. 1-8, 2017.

[36] Xu Y., “Speech Melody As Articulatorily Implemented Communicative Functions,” Speech Communication, vol. 46, no. 3, pp. 220-251, 2005.

[37] Xu Y. and Wang Q., “Pitch Targets and Their Realization: Evidence from Mandarin Chinese,” Speech Communication, vol. 33, no. 4, pp. 319-337, 2001.

[38] Yin X., Lei M., Qian Y., Soong F., He L., Ling Z., and Dai L., “Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks,” Speech Communication, vol. 76, no. C, pp. 82-92, 2016.

[39] Yoshimura T., Tokuda K., Masuko T., Kobayashi T., and Kitamura T., “Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis,” in Proceedings of 6th European Conference on Speech Communication and Technology, Budapest, pp. 2347-2350, 1999.

[40] Yu K., “Review of F0 Modelling and Generation in HMM Based Speech Synthesis,” in Proceedings of IEEE 11th International Conference on Signal Processing, Beijing, pp. 599-604, 2012.

[41] Ze H., Senior A., and Schuster M., “Statistical Parametric Speech Synthesis Using Deep Neural Networks,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, pp. 7962-7966, 2013.

[42] Zen H., Tokuda K., and Black A., “Statistical Parametric Speech Synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039-1064, 2009.

[43] Zen H., Tokuda K., and Kitamura T., “Reformulating the HMM as a Trajectory Model by Imposing Explicit Relationships Between Static and Dynamic Feature Vector Sequences,” Computer Speech and Language, vol. 21, no. 1, pp. 153-173, 2007.

Pongsathon Janyoi received the B.Sc. and M.S. degrees in Computer Science from Khon Kaen University, Khon Kaen, Thailand, in 2010 and 2015, respectively. He is currently a Ph.D. candidate in the Natural Language and Speech Processing Laboratory, Department of Computer Science, Khon Kaen University. His current research interests include speech synthesis, automatic speech recognition, and machine learning.

Pusadee Seresangtakul received the B.Sc. in Physics from Khon Kaen University and the M.Sc. in Computer Science from Chulalongkorn University, Thailand, in 1986 and 1991, respectively. In 2005, she received a Ph.D. in Interdisciplinary Intelligent Systems Engineering from the Graduate School of Engineering and Science, University of the Ryukyus, Japan. She is currently an assistant professor in the Department of Computer Science, Faculty of Science, Khon Kaen University. Her research interests include NLP, speech processing, machine learning, and artificial intelligence systems.