ABSTRACT
In this paper, we present a novel Chinese language model and study its applications, in particular Chinese pinyin-to-character conversion. In the new model, each word is associated with a supporting context, constructed by mining the frequent sets of nearby phrases together with their distances to the word. Such information was usually overlooked in previous n-gram models and their variants. We apply the model to Chinese pinyin-to-character conversion and find that it offers a better solution for Chinese input. In our evaluation, the model achieves lower perplexity and higher prediction accuracy than the state-of-the-art n-gram Markov model for Chinese.
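The idea of a supporting context can be illustrated with a minimal sketch. It is not the authors' algorithm: it counts single (context word, distance) pairs rather than mining full frequent sets, and the function name `mine_supporting_context`, the parameters `window` and `min_support`, and the toy pre-segmented corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def mine_supporting_context(sentences, window=3, min_support=2):
    """Map each word to the (context_word, offset) pairs that occur
    with it at least `min_support` times within `window` positions.

    Simplification: only single pairs are counted, not frequent sets
    of several nearby phrases as described in the abstract.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # The offset j - i is the signed distance to the word.
                    counts[word][(tokens[j], j - i)] += 1
    # Keep only pairs that meet the support threshold.
    return {
        word: {pair: c for pair, c in ctx.items() if c >= min_support}
        for word, ctx in counts.items()
    }


# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["我", "喜欢", "北京", "烤鸭"],
    ["他", "喜欢", "北京", "烤鸭"],
    ["我", "去", "北京"],
]
context = mine_supporting_context(corpus, window=2, min_support=2)
```

In this sketch, a pair such as `("喜欢", -2)` for the word `烤鸭` records that `喜欢` appeared two positions to the left, which is the kind of distance information a plain n-gram window does not keep separately.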
Index Terms
- A novel statistical Chinese language model and its application in pinyin-to-character conversion