Chinese Lexical Database

Covers over 500,000 entries

Simplified and Traditional Chinese

Optimized for NLP applications

Overview

The CJKI Chinese Lexical Database (CLD) is a comprehensive monolingual lexical database specifically designed for NLP applications. It consists of two modules, Simplified Chinese (SC) and Traditional Chinese (TC), with about 250,000 entries in each module covering general vocabulary, technical terms, and important proper nouns.

A unique feature of CLD is that the readings (pinyin and zhuyin) take into account the differences in pronunciation between the PRC and Taiwan. For example, SC 危险 wēixiǎn ‘dangerous’ corresponds to TC 危險 wéixiǎn. Furthermore, the TC module is not merely a code-conversion equivalent of the SC module, but has been carefully proofread to ensure accuracy on both the orthographic and lexemic levels. For example, SC 出租车 chūzūchē ‘taxi’ has the lexemic equivalent 計程車 jìchéngchē in TC, rather than the orthographic equivalent 出租車. Developed by CJKI’s team of Chinese specialists over many years, CLD is a significant contribution to the field of Chinese lexicography and information processing.
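The difference between orthographic (character-level) and lexemic (word-level) conversion can be illustrated with a minimal Python sketch. The tables and function names below are hypothetical toy data built only from the examples above; they do not reflect the actual CLD format, field names, or contents.

```python
# Hypothetical toy tables for illustration only; NOT the real CLD schema or data.

# Character-level SC -> TC mapping (pure code conversion).
CHAR_MAP = {"车": "車", "险": "險"}

# Word-level lexemic equivalents, for words where Taiwan usage prefers a different term.
LEXEMIC_MAP = {"出租车": "計程車"}

# Region-specific pinyin readings, keyed by (written form, region).
READINGS = {
    ("危险", "PRC"): "wēixiǎn",
    ("危險", "Taiwan"): "wéixiǎn",
}


def orthographic_tc(sc_word: str) -> str:
    """Convert character by character; unmapped characters are kept unchanged."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in sc_word)


def lexemic_tc(sc_word: str) -> str:
    """Prefer a word-level equivalent; fall back to character conversion."""
    return LEXEMIC_MAP.get(sc_word, orthographic_tc(sc_word))


if __name__ == "__main__":
    print(orthographic_tc("出租车"))  # character-level result
    print(lexemic_tc("出租车"))       # word-level result
    print(READINGS[(lexemic_tc("危险"), "Taiwan")])
```

In this sketch the word-level table is consulted first, so a term with a distinct Taiwan equivalent is never produced by character substitution alone; a real system would need far larger tables and context-sensitive handling of one-to-many character mappings.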

Main Features

Phonological information

Such as pinyin, zhuyin, and IPA

Semantic classification codes

Such as type of proper noun

Grammatical information

Such as POS and adjacency attributes

Morphological information

Such as derivational affixes and binding valency codes


Practical Applications

CLD is being used by major IT companies to enhance their Chinese morphological analysis technology and is especially suitable for natural language processing (NLP) applications, such as:

Segmentation and tokenization

Named-entity recognition

Input method editors

Morphological analysis

Information retrieval

Part-of-speech tagging

Related Resources

JLD

Japanese Lexical Database

Monolingual general vocabulary for NLP applications

KLD

Korean Lexical Database

Monolingual general vocabulary for NLP applications

CHD

Chinese Hanyu Pinyin Database

Accurate hanyu pinyin data including technical terms and proper nouns