The International Arab Journal of Information Technology (IAJIT)


Two-Level Classification in Determining the Age and Gender Group of a Speaker

Ergün Yücesoy
This study addresses the classification of speakers by age and gender. Age and gender classes were first examined separately and then combined into a single 7-class classification. Speech signals represented by Mel-Frequency Cepstral Coefficients (MFCC) and delta parameters were converted into Gaussian Mixture Model (GMM) mean supervectors and classified with a Support Vector Machine (SVM). The GMM mean supervectors were formed using the Maximum-A-Posteriori (MAP) adapted GMM-Universal Background Model (UBM) configuration, and the number of components was varied from 16 to 512 to determine the optimum. On the aGender dataset, the system achieved a gender classification accuracy of 99.02% for two classes and 92.58% for three classes, and an age group classification accuracy of 67.03% for females and 63.79% for males. Classifying age and gender together in one step gave an accuracy of 61.46%. A two-level approach is proposed for classifying age and gender together: speakers are first divided into three classes (child, male, and female), and males and females are then classified by age group, yielding a 7-class classification. This two-level approach increased classification accuracy in every case except when 32-component GMMs were used. The highest improvement, 2.45%, was achieved with 64-component GMMs, and an improvement of 0.79% was achieved with 256-component GMMs.
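As a rough illustration of the pipeline summarized in the abstract, the Python sketch below shows how MAP adaptation of UBM means can produce a mean supervector, and how a two-level decision can route a speaker first to child, female, or male and then to an age group. It is not the author's implementation. The function names, the relevance factor of 16, the diagonal-covariance UBM, and the age-group labels are assumptions made for illustration, and the classifiers are passed in as plain callables where the study uses SVMs.

```python
import numpy as np


def map_mean_supervector(frames, weights, means, variances, relevance=16.0):
    """MAP-adapt UBM means to one utterance and stack them into a supervector.

    frames:    (T, D) features (e.g. MFCC + delta), one row per frame
    weights:   (C,)   UBM mixture weights
    means:     (C, D) UBM component means
    variances: (C, D) UBM diagonal covariances (assumed diagonal here)
    relevance: relevance factor (16 is an assumed value, not from the paper)
    """
    # Log-density of every frame under every diagonal Gaussian component.
    diff = frames[:, None, :] - means[None, :, :]              # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                     # (T, C)

    # Turn log scores into per-frame component posteriors, stably.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                                       # soft counts (C,)
    data_mean = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]                     # data vs. prior
    adapted = alpha * data_mean + (1.0 - alpha) * means        # (C, D)
    return adapted.reshape(-1)                                 # length C * D


def two_level_predict(x, gender_clf, age_clf_by_gender):
    """Level 1: child / female / male. Level 2: age group for adults only."""
    gender = gender_clf(x)
    if gender == "child":
        return "child"
    return f"{gender}-{age_clf_by_gender[gender](x)}"


# Toy usage with a 2-component UBM and stand-in classifiers.
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
ubm_vars = np.ones((2, 2))
utterance = np.ones((16, 2))                # 16 identical frames near component 0
supervector = map_mean_supervector(utterance, ubm_weights, ubm_means, ubm_vars)

label = two_level_predict(
    supervector,
    gender_clf=lambda sv: "female",                    # hypothetical level-1 model
    age_clf_by_gender={"female": lambda sv: "senior",  # hypothetical level-2 models
                       "male": lambda sv: "adult"},
)
```

With C components and D-dimensional features the supervector has C * D entries, which is why the number of components (16 to 512 in the study) directly sets the dimensionality seen by the classifier.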

The International Arab Journal of Information Technology, Vol. 18, No. 5, September 2021

Ergün Yücesoy received his BSc, MSc, and PhD degrees from the Department of Computer Engineering, Karadeniz Technical University, Trabzon, Turkey, in 1999, 2004, and 2017, respectively. He is currently an Assistant Professor at Ordu Vocational School, Ordu University, Ordu, Turkey. His research interests include biometric security and machine learning.