LRAG Glossary Generator

Keyword Extraction

Glossary Generation

Prompt Augmentation

Overview

Large language models (LLMs) such as GPT-4 are now widely used for machine translation (MT), often surpassing neural machine translation (NMT) systems such as Google Translate and DeepL. However, LLMs frequently mistranslate proper nouns and technical terms because their training data covers such terms insufficiently. One solution is Retrieval-Augmented Generation (RAG), in which the necessary data is retrieved from an external source and supplied to the model at inference time.

Our institute has developed a novel method for integrating source-text-specific glossaries into LLM systems, referred to as Lexical Retrieval-Augmented Generation (LRAG). The method is implemented in the LRAG Glossary Generator plug-in module, which retrieves data from our large-scale multilingual databases of proper nouns and technical terms, known as the LRAG Databases. These databases can be supplemented by a customized User Dictionary.

The generator gives the LLM access to large-scale multilingual terminology databases, reducing translation errors without requiring the model to be fine-tuned. To read the full white paper, click here.
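
To make the idea of supplying terminology at inference time concrete, the following is a minimal sketch of how a glossary might be folded into a translation prompt. The function name `build_augmented_prompt`, the prompt wording, and the sample glossary entry are all assumptions made for illustration; they are not the actual format produced by the LRAG Glossary Generator.

```python
def build_augmented_prompt(source_text, glossary,
                           src_lang="Japanese", tgt_lang="English"):
    """Return a translation prompt with glossary entries placed before the text.

    `glossary` maps source terms to their preferred translations.
    The prompt layout here is hypothetical, not the LRAG format.
    """
    lines = [f"Translate the following {src_lang} text into {tgt_lang}."]
    if glossary:
        lines.append("Use these term translations exactly as given:")
        for term, translation in glossary.items():
            lines.append(f"- {term} = {translation}")
    lines.append("")
    lines.append("Text:")
    lines.append(source_text)
    return "\n".join(lines)


# Illustrative glossary entry (not taken from the LRAG Databases).
glossary = {"東京都": "Tokyo Metropolis"}
prompt = build_augmented_prompt("東京都は日本の首都です。", glossary)
print(prompt)
```

Because the glossary travels with the prompt, the model itself is left unchanged; only the input differs from one source text to the next.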

Distinctive Features

The LRAG Glossary Generator offers unique features that enable highly efficient glossary generation, such as:

  • Keywords such as technical terms and proper nouns are automatically extracted.
  • The user can specify the domain, or the domain is inferred automatically.
  • Real-time access to the LRAG Databases, which consist of tens of millions of entries.
  • Multiple translation equivalents are prioritized by context.
  • Optional User Dictionaries can be added.
  • An Augmented Prompt including the LRAG Glossary is automatically generated.
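
The features above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the generator's actual implementation: `TERM_DB` is a toy stand-in for the LRAG Databases, keyword extraction is approximated by greedy longest-match lookup, and contextual prioritization is approximated by matching a user-specified domain label (automatic domain inference is omitted). All names and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    target: str  # translation equivalent
    domain: str  # domain label attached to this equivalent


# Toy stand-in for the LRAG Databases (hypothetical data).
TERM_DB = {
    "cell": [Entry("細胞", "biology"), Entry("セル", "computing"),
             Entry("電池", "electronics")],
    "stem cell": [Entry("幹細胞", "biology")],
}


def extract_keywords(text, known_terms):
    """Greedy longest-match extraction of known terms from the text."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    max_len = max((len(t.split()) for t in known_terms), default=0)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so "stem cell" beats "cell".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in known_terms:
                if candidate not in found:
                    found.append(candidate)
                i += n
                break
        else:
            i += 1  # no known term starts at this token
    return found


def generate_glossary(text, domain, user_dict=None):
    """Map each extracted term to its equivalents, best match first."""
    user_dict = user_dict or {}
    known = set(TERM_DB) | set(user_dict)
    glossary = {}
    for term in extract_keywords(text, known):
        if term in user_dict:
            # User Dictionary entries override database entries.
            glossary[term] = [user_dict[term]]
            continue
        # Stable sort: equivalents in the requested domain come first.
        ranked = sorted(TERM_DB[term], key=lambda e: e.domain != domain)
        glossary[term] = [e.target for e in ranked]
    return glossary


text = "The stem cell divides. Each cell has a nucleus."
print(generate_glossary(text, domain="biology"))
print(generate_glossary(text, domain="computing"))
```

Changing the domain reorders the equivalents for an ambiguous term such as "cell", while a User Dictionary entry, when present, replaces the database candidates for that term outright.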

Use Cases

The LRAG Glossary Generator offers significant benefits to various types of users: 

  • Individuals, such as translators, can use the tool's customizable glossaries to improve accuracy and reliability in both personal and professional tasks.
  • Language Service Providers (LSPs) can leverage the extensive multilingual databases, LRAG Glossaries, and customer-specific dictionaries to generate more consistent and accurate translation drafts for post-editing. The tool’s real-time access and integration with machine translation (MT) systems enable LSPs to reduce the time and effort required to produce high-quality translations.
  • LLM developers can use the LRAG Glossary Generator to adapt models to specific domains by incorporating domain-specific glossaries and user dictionaries, without the need for extensive retraining. Developers can also use the vast LRAG Databases as training data, enhancing translation solutions with domain-specific terminology.

LRAG Databases

CJKI has several large-scale dictionary and lexical databases that can be repurposed for use in LLM MT systems, including:

  • CNV (Chinese Personal Name Variants): 7.6 million Chinese personal names and their romanized variants
  • JOD (Japanese Orthographic Database): orthographic variants for core Japanese vocabulary, covering 126,000 entries
  • JNV (Japanese Personal Name Variants): 3.5 million Japanese personal names and their romanized variants
  • JMP (Japanese-Multilingual Place Names and POIs): a 3.1-million-entry multilingual database of Japanese and Western place names
  • DAN (Database of Arabic Names): 6.5 million Arabic personal names and their romanized variants
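
Several of these databases pair names with their romanized variants. As a rough illustration of why that pairing is useful, the sketch below builds a reverse index from normalized romanizations back to the native form, so that differently spelled variants resolve to the same name. The sample entry, the `normalize` rule, and the index layout are assumptions for illustration only and do not reflect the actual schema of the CJKI databases.

```python
import unicodedata


def normalize(romanized):
    """Case-fold and strip diacritics so 'Ōno' and 'ono' compare equal."""
    decomposed = unicodedata.normalize("NFKD", romanized)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()


# Illustrative entry: one Japanese surname and common romanizations.
NAME_VARIANTS = {"大野": ["Ōno", "Ono", "Ohno", "Oono"]}


def build_variant_index(variants):
    """Map each normalized romanization to the set of native forms."""
    index = {}
    for native, forms in variants.items():
        for form in forms:
            index.setdefault(normalize(form), set()).add(native)
    return index


index = build_variant_index(NAME_VARIANTS)
print(index.get(normalize("OHNO")))
print(index.get(normalize("Ōno")))
```

Values are sets because a single romanization can correspond to more than one native spelling, which is one reason contextual prioritization matters when translating names.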