The International Arab Journal of Information Technology (IAJIT)

Multi-Level Improvement for a Transcription Generated by Automatic Speech Recognition

In this paper, we propose a novel approach to improving the output of an automatic speech recognition system. The proposed method constructs a search space based on the semantic dependency relations among the words output by the recognition system. It then applies syntactic and phonetic filters to select the most probable hypotheses. To achieve this, several techniques are deployed: word2vec, the Recurrent Neural Network Language Model (RNNLM), and a tagged language model, together with a phonetic pruning system. The results obtained show that the proposed approach improves the accuracy of the system, especially for the recognition of mispronounced words and irrelevant words.
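
As a rough illustration of the two-stage idea described in the abstract (a semantic search space followed by phonetic pruning), a minimal sketch is given below. It is not the authors' implementation: the toy embedding table stands in for word2vec vectors, spelling-level edit distance stands in for a true phoneme-level distance, the syntactic (language model) filter is omitted for brevity, and all names, thresholds, and vector values are hypothetical.

```python
from math import sqrt

# Hypothetical 3-dimensional vectors standing in for trained word2vec embeddings.
EMBEDDINGS = {
    "weather":  (0.90, 0.10, 0.00),
    "forecast": (0.85, 0.15, 0.05),
    "rain":     (0.80, 0.20, 0.10),
    "sun":      (0.88, 0.05, 0.10),
    "train":    (0.10, 0.90, 0.20),
    "brain":    (0.00, 0.20, 0.90),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def levenshtein(a, b):
    """Edit distance between two strings (assumed proxy for phonetic distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def semantic_search_space(context, min_sim=0.9):
    """Vocabulary words that are semantically close to any context word."""
    known = [EMBEDDINGS[c] for c in context if c in EMBEDDINGS]
    return {
        word
        for word, vec in EMBEDDINGS.items()
        if word not in context and any(cosine(vec, c) >= min_sim for c in known)
    }


def propose_corrections(suspect, context, min_sim=0.9, max_dist=2):
    """Keep only semantic candidates that also sound (here: spell) like the suspect word."""
    space = semantic_search_space(context, min_sim)
    return sorted(w for w in space if levenshtein(w, suspect) <= max_dist)


# "train" is the doubtful recognized word; the rest of the utterance talks about weather.
print(propose_corrections("train", ["weather", "forecast"]))  # ['rain']
```

In this toy run, "sun" enters the semantic search space but is pruned because it is too far from "train" in edit distance, while "rain" passes both stages. A fuller version would score the surviving hypotheses with a language model before choosing one.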


The International Arab Journal of Information Technology, Vol. 16, No. 3, May 2019

Heithem Amich received his BSc degree in computer science from the Faculty of Sciences of Monastir, Tunisia and his MSc degree from the Faculty of Mathematical, Physical and Natural Sciences of Tunis, Tunisia. He is a member of the LaTICE Laboratory, Monastir unit (Tunisia). His areas of interest include speech recognition systems, natural language processing, and machine learning.

Mohamed Ben Mohamed received his PhD from the Faculty of Economic Sciences and Management of Sfax, Tunisia. He is a member of the LaTICE Laboratory, Monastir unit (Tunisia). His areas of interest include natural language processing, computer-assisted language learning, and machine learning.

Mounir Zrigui is an associate professor at the University of Monastir, Tunisia. He received his PhD degree from the Paul Sabatier University, Toulouse, France in 1987 and his HDR in computer science from the Stendhal University, Grenoble, France in 2008. He is the head of the Monastir unit of the LaTICE Laboratory. He has more than 25 years of experience, including teaching and research, in all aspects of automatic natural language processing (written and spoken).