Chinese Lexical Database
Covers over 500,000 entries
Simplified and Traditional Chinese
Optimized for NLP applications
Overview
The CJKI Chinese Lexical Database (CLD) is a comprehensive monolingual lexical database specifically designed for NLP applications. It consists of two modules, Simplified Chinese (SC) and Traditional Chinese (TC), with about 250,000 entries in each module covering general vocabulary, technical terms, and important proper nouns.
A unique feature of CLD is that the readings (pinyin and zhuyin) take into account the differences in pronunciation between the PRC and Taiwan. For example, SC 危险 wēixiǎn ‘dangerous’ is TC 危險 wéixiǎn. Furthermore, the TC module is not merely a code-conversion equivalent of the SC module; it has been carefully proofread to ensure accuracy on both the orthographic and lexemic levels.
For example, SC 出租车 chūzūchē ‘taxi’ has the TC lexemic equivalent 計程車 jìchéngchē, rather than the orthographic equivalent 出租車 that character-by-character conversion would produce. Developed by CJKI’s team of Chinese specialists over many years, CLD is a significant contribution to the field of Chinese lexicography and information processing.
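The difference between orthographic and lexemic conversion can be illustrated with a minimal sketch. The two tables below are tiny hypothetical stand-ins built from the examples above; they do not reflect CLD's actual data format or API, and the function names are assumptions made for illustration only.

```python
# Hypothetical toy tables; NOT actual CLD data or its file format.
# Character-level (orthographic) SC -> TC mapping.
ORTHOGRAPHIC = {"车": "車", "险": "險"}
# Word-level (lexemic) SC -> TC mapping for words that differ by region.
LEXEMIC = {"出租车": "計程車"}


def to_orthographic(sc_word: str) -> str:
    """Convert character by character; unmapped characters pass through."""
    return "".join(ORTHOGRAPHIC.get(ch, ch) for ch in sc_word)


def to_lexemic(sc_word: str) -> str:
    """Prefer a word-level equivalent; fall back to character conversion."""
    return LEXEMIC.get(sc_word, to_orthographic(sc_word))


print(to_orthographic("出租车"))  # character-by-character result
print(to_lexemic("出租车"))       # word-level result
```

In this sketch the word-level table is consulted first, so a word with a regional equivalent is never reduced to plain code conversion, while all other words still receive the orthographic form.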
Main Features
Phonological information
Such as pinyin, zhuyin, and IPA
Semantic classification codes
Such as type of proper noun
Grammatical information
Such as POS and adjacency attributes
Morphological information
Such as derivational affixes and binding valency codes
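As a rough illustration of how an application might hold such attributes in memory, the sketch below defines a simple record type. The class name and every field name are hypothetical and chosen for this example; CLD's actual schema is not described here, so fields with no known value are left empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexicalEntry:
    """Hypothetical in-memory record; field names are illustrative only."""
    headword: str
    pinyin: str
    zhuyin: Optional[str] = None          # phonological information
    ipa: Optional[str] = None             # phonological information
    pos: Optional[str] = None             # grammatical information
    semantic_class: Optional[str] = None  # e.g. type of proper noun
    affix_type: Optional[str] = None      # morphological information


# The SC and TC entries for 'dangerous' carry different readings.
sc_entry = LexicalEntry(headword="危险", pinyin="wēixiǎn")
tc_entry = LexicalEntry(headword="危險", pinyin="wéixiǎn")
print(sc_entry.pinyin, tc_entry.pinyin)
```

Keeping the SC and TC entries as separate records, rather than deriving one from the other, mirrors the point that readings can differ between the two modules.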
Practical Applications
CLD is being used by major IT companies to enhance their Chinese morphological analysis technology and is especially suitable for natural language processing (NLP) applications, such as:
Segmentation and tokenization
Named-entity recognition
Input method editors
Morphological analysis
Information retrieval
Part-of-speech tagging
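To show how a lexicon can support the first of these applications, here is a minimal sketch of dictionary-based segmentation using forward maximum matching, a common baseline technique. It is not a description of how CLD's customers implement segmentation; the function name, the max_len parameter, and the three-word lexicon are assumptions for illustration.

```python
def max_match(text: str, lexicon: set, max_len: int = 4) -> list:
    """Greedy forward maximum matching over a word list.

    At each position, take the longest substring (up to max_len
    characters) found in the lexicon; a single character is always
    accepted as a fallback so the loop makes progress.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real lexicon would be far larger.
toy_lexicon = {"出租", "出租车", "危险"}
print(max_match("出租车危险", toy_lexicon))
```

Because the longest match wins, 出租车 is kept whole even though the shorter word 出租 is also in the list. Greedy matching of this kind depends heavily on lexicon coverage, which is why a large, accurate word list matters for this task.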
Related Resources
Japanese Lexical Database
Monolingual general vocabulary for NLP applications
Korean Lexical Database
Monolingual general vocabulary for NLP applications
Chinese Hanyu Pinyin Database
Accurate hanyu pinyin data including technical terms and proper nouns