Word Prediction via a Clustered Optimal Binary Search Tree

Author Eyas El-Qawasmeh,

Keywords #Bigram #cluster computing #N-gram #unigram #trigram #word frequency #word prediction

Abstract

Word prediction methodologies depend heavily on the statistical approach that uses the unigram, bigram, and the trigram of words. However, the construction of the N-gram model requires a very large size of memory, which is beyond the capability of many existing computers. Beside this, the approximation reduces the accuracy of word prediction. In this paper, we suggest to use a cluster of computers to build an Optimal Binary Search Tree (OBST) that will be used for the statistical approach in word prediction. The OBST will contain extra links so that the bigram and the trigram of the language will be presented. In addition, we suggest the incorporation of other enhancements to achieve optimal performance of word prediction. Our experimental results showed that the suggested approach improves the keystroke saving.

References

[1] Al-Furaih I., Aluru S., Goil S., and Ranka S., “Parallel Construction of Multidimensional Binary Search Trees,” IEEE Trans. on Parallel and Distributed Systems, vol. 11, no. 2, pp. 136- 148, February 2000.

[2] Brown P., Pietra V., DeSouza P., Lai J., and Mercer R., “Class-Based N-Gram Models of Natural Language,” Computational Linguistics, vol. 18, no. 4, pp. 467-479, 1992.

[3] Chen S. and Goodman J., “An Empirical Study of Smoothing Techniques for Language Modeling,” in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, CA, USA, pp. 310-318, June 1996.

[4] Even-Zohar Y. and Roth D., “A Classification Approach to Word Prediction,” in Proceedings of The 1st North American Conference on Computational Linguistics (NAACL' 2000), pp. 124-131, 2000.

[5] Garay-Vitoria N. and Gonzalez-Abascal J., “Intelligent Word-Prediction to Enhance Text Input Rate (A Syntactic Analysis-Based Word- Prediction Aid for People with Severe Motor and Speech Disability),” in Proceedings of International Conference on Intelligent User Interfaces, January 6-9, Orlando, FL, USA, pp. 241-244, 1997.

[6] Gonnet G., Baeza-Yates R., and Snider T., “New Indices for Text: Pat Trees and Pat Arrays,” in Frankes W. and Baeza-Yates R. (Eds), Information Retrieval: Data Structures and Algorithms, Prentice Hall, New Jersey, pp. 66-82, 1993.

[7] Heeman P., “POS Tagging versus Classes in Language Modeling,” in Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, pp. 179-187, August 1998.

[8] Hunnicutt S., “Using Syntactic and Semantic Information in a Word Prediction Aid,” in Proceedings of Europe Conference Speech Commun., Paris, France, vol. 1, pp. 191-193, September 1989.

[9] Jelinek F., “Self-Organized Language Modeling for Speech Recognition,” Technical Report, IBM T. J. Watson Research Center, Continuous Speech Recognition Group, 1985.

[10] Jelinek F. and Mercer R. “Interpolated Estimation of Markov Source Parameters from Sparse Data,” in Proceedings of Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands, North-Holland, pp. 381-397, 1980.

[11] Katz S., “Estimation of Probabilities from Spares Data for the Language Model Component of a Speech Recognizer,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 400-401, 1987.

[12] Koester H. and Levine S., “Modeling the Speed of Text Entry With a Word Prediction Interface,” IEEE Trans. on Rehabilitation Engineering, vol. 2, no. 3, pp. 177-187, September 1994.

[13] Lesher G, Moulton B., and Higginbotham D., “Effects of N-gram Order and Training Text Size on Word Predition,” in Proceedings of (RESNA’99) Annual Conference, Arlington, VA, pp. 52-54, 1999.

[14] Lesher G., Moulton B., and Higginbotham D., “Techniques for Augmenting Scanning Communication,” Augmentative and Alternative Communication, vol. 14, no. 2, pp. 81-101, 1998.

[15] Matthew E., “Syntactic Pre-Processing in Single- Word Prediction for Disabled People,” PhD Thesis, University of Bristol, England, June 1996.

[16] Pereira F., Singer Y., and Tishby N., “Beyond Word N-Grams,” in Proceedings of the Third Workshop on Very Large Corpora, Columbus, Ohio, Massachusetts Institute of Technology, Association for Computational Linguistics, pp. 95-106, 1995.

[17] Srinivas B., “Complexity of Lexical Descriptions and its Relevance to Partial Parsing,” PhD Thesis, University of Pennsylvania, IRCS Report 97-10, 1997.

[18] Wessel F., Ortmanns S., and Ney H., “Implementation of Word Based Statistical Language Models,” in Proceedings of SQEL Workshop on Multi-Lingual Information Retrieval Dialogs, Pilsen, Czech Republic, pp. 55-59, April 1997. Word Prediction via a Clustered Optimal Binary Search Tree 141 Eyas El-Qawasmeh received his BSc degree in computer science in 1985 from Yarmouk University, Jordan. He then joined the Yarmouk University as teaching assistant in the Computer Science Department. In 1992, he joined the George Washington University, Washington DC, USA where he obtained his MSc and PhD degrees in software and systems in 1994 and 1997, respectively. In 2001, he joined George Washington University as visiting researcher. Currently, he is an assistant professor in the Department of the Computer Science and Information Systems at Jordan University of Science and Technology, Jordan. His areas of interest include multimedia databases, information retrieval, and object-oriented. He is a member of the ACM and IEEE.