poster

A novel statistical Chinese language model and its application in pinyin-to-character conversion

Authors:
Bo Lin (Nanyang Technological University, Singapore)
Jun Zhang (Nanyang Technological University, Singapore)

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management, October 2008, Pages 1433–1434
https://doi.org/10.1145/1458082.1458318

Published: 26 October 2008

ABSTRACT

In this paper, we present a novel Chinese language model and study its applications, in particular Chinese pinyin-to-character conversion. In the new model, each word is associated with a supporting context constructed by mining the frequent sets of nearby phrases and their distances to the word. Such information was usually overlooked in previous n-gram models and their variants. We apply the model to Chinese pinyin-to-character conversion and find that it offers a better solution to Chinese input. In our evaluation, the model has lower perplexity and higher prediction accuracy than the state-of-the-art n-gram Markov model for Chinese.
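
To make the idea of a supporting context concrete, the following is a minimal sketch, not the authors' algorithm. It assumes pre-segmented sentences and simplifies "frequent sets" to single (neighbor, distance) pairs that occur at least `min_support` times. The function names, the `window` and `min_support` parameters, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def mine_supporting_context(sentences, window=3, min_support=2):
    """Map each word to the (neighbor, offset) pairs seen at least min_support times.

    Simplification (assumption): single pairs stand in for frequent sets.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Offset is signed: negative means the neighbor precedes the word.
                    counts[word][(tokens[j], j - i)] += 1
    return {
        word: {pair: n for pair, n in ctx.items() if n >= min_support}
        for word, ctx in counts.items()
    }


def context_score(candidate, left_tokens, support, window=3):
    """Sum the support counts of left-context tokens matching the candidate's mined pairs."""
    ctx = support.get(candidate, {})
    score = 0
    recent = left_tokens[-window:]
    for k, tok in enumerate(reversed(recent), start=1):
        score += ctx.get((tok, -k), 0)
    return score


# Hypothetical toy corpus of pre-segmented sentences.
corpus = [
    ["我", "喜欢", "他", "的", "书"],
    ["她", "喜欢", "他", "的", "画"],
    ["她", "是", "学生"],
    ["她", "是", "老师"],
]
support = mine_supporting_context(corpus)

# 他 and 她 share the pinyin "ta"; choose the candidate best supported by the left context.
best = max(["他", "她"], key=lambda w: context_score(w, ["我", "喜欢"], support))
```

In this toy setup the pair (喜欢, -1) supports 他 twice, so 他 is selected for the context 我 喜欢. A real system would presumably combine such evidence with n-gram probabilities rather than use raw counts alone.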

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information systems applications

Published in
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008, 1562 pages
ISBN: 9781595939913
DOI: 10.1145/1458082
General Chair: James G. Shanahan (Church and Duncan Group Inc, USA)
Program Chairs: Sihem Amer-Yahia (Yahoo! Research, USA), Ioana Manolescu (INRIA, France), Yi Zhang (University of California, Santa Cruz, USA), David A. Evans (JustSystems Evans Research, USA), Alek Kolcz (Microsoft Live Labs, USA), Key-Sun Choi (KAIST, Korea), Abdur Chowdury (Twitter, USA)
Copyright © 2008 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
chinese language model
pinyin-to-character conversion
word-context support
Qualifiers
- poster
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%