A Generic Multimodal Architecture for Integrating Voice and Ink XML Formats

Author Zouheir Trabelsi

Keywords #Multimodal voice/ink applications #speech recognition #online handwriting recognition #mutual disambiguation #VoiceXML #InkXML

Abstract

The acceptance of a standard VoiceXML format has facilitated the development of voice applications, and we anticipate that acceptance of a standard InkXML format will similarly facilitate the development of pen applications. In this paper, we present a multimodal interface architecture that combines standardized voice and ink formats to facilitate the creation of robust and efficient multimodal systems, particularly for noisy mobile environments. The platform provides an interactive Web-based system for generic multimodal application development. By providing mutual disambiguation of input signals and superior error handling, this architecture should broaden the spectrum of users to the general population, including permanently and temporarily disabled users. Integrating VoiceXML and InkXML provides a standard data format that facilitates Web-based development and content delivery. Diverse applications, ranging from complex data entry and text editing to Web transactions, can be implemented on this system. We present a prototype platform and sample dialogues.

Zouheir Trabelsi received his PhD from Tokyo University of Technology and Agriculture, Japan, in the field of computer science, March 1994. From April 1994 until December 1998, he was a computer science researcher at the Central Research Laboratory of Hitachi in Tokyo, Japan. From November 2001 until October 2002, he was a visiting assistant professor at Pace University, New York, USA. Currently, he is an associate professor at the College of Telecommunications, the University of Tunisia. His research areas are mainly multimodal voice and ink systems, human computer interaction, internet/networking hacking and security and the TCP/IP protocols.