The International Arab Journal of Information Technology (IAJIT)



An Effective Framework for Speech and Music Segregation

Speech and music segregation from a single channel is a challenging task due to background interference and the intermingling of voice and music signals. It is of immense importance owing to its utility in a wide range of applications such as music information retrieval, singer identification, and lyrics recognition and alignment. This paper presents an effective method for speech and music segregation. Exploiting the repeating nature of music, we first detect the local repeating structures in the signal using a locally defined window for each segment. After detecting the repeating structures, we extract them and perform separation using a soft time-frequency mask. We then apply an ideal binary mask to enhance speech and music intelligibility. We evaluated the proposed method on the mixture sets at -5 dB, 0 dB, and 5 dB from the Multimedia Information Retrieval 1000-clip (MIR-1K) dataset. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods in terms of Global Normalized Signal-to-Distortion Ratio (GNSDR).
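The separation stage the abstract outlines follows a REPET-style recipe: model the repeating (music) part of the magnitude spectrogram with a per-bin median over period-length segments, cap the model by the mixture, and turn it into a soft time-frequency mask. The sketch below is a minimal single-period illustration of that idea, not the authors' implementation: it assumes a fixed repeating period `period_s`, whereas the paper detects repeating structure locally with a per-segment window, and all parameter values are illustrative.

```python
# Minimal REPET-style voice/music separation sketch (illustrative only).
import numpy as np
from scipy.signal import stft, istft

def separate_voice_music(mix, fs, period_s=1.0, n_fft=1024, hop=256):
    _, _, X = stft(mix, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    V = np.abs(X)                        # magnitude spectrogram
    n_frames = V.shape[1]

    # Repeating period expressed in STFT frames (assumed known here).
    p = max(1, round(period_s * fs / hop))
    n_seg = -(-n_frames // p)            # ceiling division

    # Split the spectrogram into period-length segments and take the
    # per-bin median across segments: content that repeats every period
    # (music) survives, the non-repeating voice is suppressed.
    Vp = np.pad(V, ((0, 0), (0, n_seg * p - n_frames)), mode="edge")
    W = np.median(Vp.reshape(V.shape[0], n_seg, p), axis=1)

    # Tile the repeating model, cap it by the mixture magnitude, and
    # convert it into a soft mask in [0, 1].
    R = np.minimum(np.tile(W, (1, n_seg))[:, :n_frames], V)
    M = R / (V + 1e-10)

    # Optional hard mask: a simple thresholded stand-in for the binary
    # masking step the abstract mentions (a true ideal binary mask would
    # require the clean sources).
    # M = (M > 0.5).astype(float)

    _, music = istft(M * X, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, voice = istft((1 - M) * X, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return voice[:len(mix)], music[:len(mix)]
```
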
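For the evaluation metric, GNSDR is the clip-length-weighted mean of the Normalized SDR, i.e., the SDR gain of the separated voice over the raw mixture. The following sketch shows that bookkeeping with a simplified projection-based SDR; published results are normally computed with the full BSS_EVAL toolbox, which additionally separates interference and artifact terms.

```python
import numpy as np

def sdr(est, ref):
    # Simplified SDR: project the estimate onto the clean reference and
    # treat everything orthogonal to it as distortion.
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e = est - s_target
    return 10 * np.log10(np.sum(s_target**2) / (np.sum(e**2) + 1e-12))

def gnsdr(estimates, references, mixtures):
    # NSDR = SDR(estimate, clean) - SDR(mixture, clean): the improvement
    # gained by separation. GNSDR weights each clip's NSDR by its length.
    num = den = 0.0
    for est, ref, mix in zip(estimates, references, mixtures):
        w = len(ref)
        num += w * (sdr(est, ref) - sdr(mix, ref))
        den += w
    return num / den
```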



Sidra Sajid received her M.Sc. degree in Software Engineering from UET Taxila, Pakistan, in 2018. She is working as an IT Officer in the Primary and Secondary Healthcare Department, Punjab. Her research interests include audio signal processing, classification problems, and machine learning.

Ali Javed received his Ph.D. degree in Computer Engineering from UET Taxila, Pakistan, in 2016. Currently, he is serving as an Assistant Professor in the Software Engineering Department at UET Taxila, Pakistan. His areas of interest are digital image processing, computer vision, and machine learning.

Aun Irtaza completed his Ph.D. at FAST-National University of Computer & Emerging Sciences in 2016. Currently, he is serving as Head of the Computer Science Department at UET Taxila. His research interests include computer vision, pattern analysis, and big data analytics.