DeepLEX: Lexical Resources for Deep Learning
The CJK Dictionary Institute (CJKI) is actively developing very large-scale lexical resources, referred to as DeepLEX Resources, to support Deep Learning technologies in such diverse areas as named entity recognition (NER), cybersecurity, neural machine translation (NMT), and speech technology.
Selected Resources
Chinese Personal Name Variants
7.6 million Chinese personal names and their romanized variants
Japanese Orthographic Database
Orthographic variants for core Japanese vocabulary, covering 126,000 entries
Japanese Personal Name Variants
3.5 million Japanese personal names and their romanized variants
Japanese-Multilingual Place Names and POIs
A multilingual database of 3.1 million Japanese and Western place names and POIs
Arabic Full-Form Lexicon
530 million entries, including all inflected, declined, and conjugated forms
Database of Arabic Names
6.5 million Arabic personal names and their romanized variants
Use Cases
Named Entity Recognition (NER)
NER traditionally relies on rule-based approaches, but in data-rich domains such as romanized personal name variants in Chinese and Arabic, these approaches do not always achieve adequate recall and precision. Integrating comprehensive, hard-coded lexicons covering tens or hundreds of millions of entries, such as those provided by CJKI, offers the most practical path to high accuracy.
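As a rough illustration of how such a lexicon can be put to work, the following Python sketch performs greedy longest-match lookup of tokenized text against a name lexicon; the file path, format, and example names are placeholders, not actual DeepLEX entries.

```python
# Minimal gazetteer-based entity matcher: scans tokenized text for the
# longest spans found in a name lexicon. The lexicon file path and its
# one-name-per-line format are hypothetical placeholders.

def load_lexicon(path):
    """Load one entity name per line into a set of token tuples."""
    lexicon = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            name = line.strip()
            if name:
                lexicon.add(tuple(name.split()))
    return lexicon

def match_entities(tokens, lexicon, max_len=5):
    """Return (start, end, surface) spans using greedy longest match."""
    spans, i = [], 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in lexicon:
                spans.append((i, i + length, " ".join(candidate)))
                i += length
                break
        else:
            i += 1
    return spans

if __name__ == "__main__":
    lexicon = {("Abd", "al-Rahman"), ("Li", "Xiaolong")}
    tokens = "Reports mention Abd al-Rahman and Li Xiaolong by name".split()
    print(match_entities(tokens, lexicon))
```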
Neural Machine Translation (NMT)
NMT performs poorly on low-frequency content words, especially named entities. Integrating DeepLEX data into NMT systems can substantially improve translation accuracy.
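One common way to inject lexical resources into an NMT pipeline is placeholder substitution: known names are masked before translation and their target-language forms are restored from a bilingual lexicon afterwards. The sketch below assumes a generic translate() callable and an invented two-entry lexicon; it is not a description of any specific DeepLEX integration.

```python
# Hedged sketch of placeholder-based entity handling around an NMT system.
# Names found in a bilingual lexicon are masked before translation and
# restored from the lexicon afterwards, so the NMT model never has to
# translate rare names itself. `translate` stands in for any NMT backend.

# Hypothetical bilingual lexicon mapping source-language names to target forms.
NAME_LEXICON = {
    "山田太郎": "Taro Yamada",
    "李小龍": "Bruce Lee",
}

def mask_entities(text, lexicon):
    """Replace known names with numbered placeholders; remember the mapping."""
    slots = {}
    for i, (src, tgt) in enumerate(lexicon.items()):
        if src in text:
            token = f"__ENT{i}__"
            text = text.replace(src, token)
            slots[token] = tgt
    return text, slots

def unmask_entities(text, slots):
    """Put the target-language names back into the translated output."""
    for token, tgt in slots.items():
        text = text.replace(token, tgt)
    return text

def translate_with_lexicon(text, translate, lexicon=NAME_LEXICON):
    masked, slots = mask_entities(text, lexicon)
    translated = translate(masked)  # call into any NMT backend here
    return unmask_entities(translated, slots)

if __name__ == "__main__":
    def fake_translate(s):
        return s  # stand-in for a real NMT call
    print(translate_with_lexicon("山田太郎が来た", fake_translate))
```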
Cybersecurity
Large-scale entity lexicons can also play a major role in cybersecurity, but extraction models tend to overlook entities specific to the cybersecurity domain, such as the names of hackers and viruses. Cybersecurity can benefit significantly both from traditional CRF-based NER using ordinary entity lexicons and from security entity lexicons fine-tuned to specific entities.
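For CRF-based NER, lexicon membership is typically exposed to the model as an extra feature per token. The following hedged sketch builds per-token feature dictionaries of the kind accepted by CRF toolkits such as sklearn-crfsuite; the security lexicon entries are invented examples, not DeepLEX data.

```python
# Hedged sketch: turning lexicon membership into features for CRF-based NER.
# Each token gets a feature dict; a gazetteer flag records whether the token
# appears in a (hypothetical) security entity lexicon.

SECURITY_LEXICON = {"emotet", "wannacry", "mirai"}  # illustrative entries

def token_features(tokens, i, lexicon=SECURITY_LEXICON):
    """Features for token i, including a gazetteer flag from the lexicon."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "in_security_lexicon": word.lower() in lexicon,
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<s>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

if __name__ == "__main__":
    sent = "Analysts linked the outbreak to WannaCry".split()
    for feats in sentence_features(sent):
        print(feats)
```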
Regularization
Regularization algorithms must perform well not only on training data but also on unseen input such as orthographic variants of named entities. Large-scale entity lexicons can significantly enhance accuracy by compressing vector data and computing meaningful values for each variant.
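One way to read "compressing vector data" is that all orthographic variants of an entity resolve to a single canonical key, so a model maintains one vector per entity rather than one per spelling. The sketch below illustrates that idea under this assumption, with an invented three-entry variant table.

```python
# Hedged sketch of variant compression: orthographic variants map to one
# canonical entity key, so only a single vector is stored per entity.
# The variant table and canonical forms are illustrative examples.

import numpy as np

# Hypothetical variant lexicon: surface form -> canonical entity key.
VARIANTS = {
    "渡邊": "渡辺",
    "渡邉": "渡辺",
    "watanabe": "渡辺",
}

class EntityEmbeddings:
    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.vectors = {}  # one vector per canonical form only

    def canonical(self, surface):
        return VARIANTS.get(surface, surface)

    def vector(self, surface):
        key = self.canonical(surface)
        if key not in self.vectors:
            self.vectors[key] = self.rng.normal(size=self.dim)
        return self.vectors[key]

if __name__ == "__main__":
    emb = EntityEmbeddings()
    # All variants resolve to the same canonical vector.
    print(np.allclose(emb.vector("渡邊"), emb.vector("watanabe")))  # True
```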
Pre-trained Models
Building pre-trained word association models from our DeepLEX Resources and combining them with other resources, such as annotated corpora, can lead to satisfactory results, especially for morphologically complex languages like Arabic.
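As a simple illustration, lexicon entries can be appended to a training corpus as extra short "sentences" before fitting a word embedding model. The sketch below uses gensim's Word2Vec as one possible choice of word association model; the tiny corpus and name list are illustrative stand-ins for real DeepLEX data.

```python
# Hedged sketch of combining lexicon entries with a corpus to build a
# word association (embedding) model. The data here is invented and the
# model choice (Word2Vec) is just one convenient example.

from gensim.models import Word2Vec

# Stand-in for lexicon entries (in practice, loaded from a name database).
name_entries = [["abd", "al-rahman"], ["umm", "kulthum"]]

corpus = [
    "the minister met abd al-rahman in cairo".split(),
    "abd al-rahman announced the agreement".split(),
    "a recording of umm kulthum was played".split(),
]

# Treat each lexicon entry as an extra short training "sentence".
training_data = corpus + name_entries

model = Word2Vec(sentences=training_data, vector_size=50,
                 window=5, min_count=1, epochs=50)
print(model.wv.most_similar("al-rahman", topn=3))
```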
Reference Documents
DeepLEX: Lexical Resources for Deep Learning
White Paper (English)
DeepLEX: Dictionary Databases for Deep Learning
White Paper (Japanese)
DeepLEX: Lexical Resources for Deep Learning
White Paper (Chinese)