Parallel Annotated Synthetic Corpora
Perfect multilingual alignment
Accurate and idiomatic translations
Rich set of annotation tags
Overview
The goal of the Parallel Annotated Synthetic Corpora (PASC) project is to create large-scale synthetic corpora for natural language processing (NLP) applications, including neural machine translation (NMT) and generative AI. Synthetic data mimics natural language and is especially useful for training machine learning models when real data is scarce or expensive to obtain. Synthetic corpora show great promise for improving machine translation quality.
The PASC project aims to create synthetic corpora using supervised generation techniques. Unlike augmented corpora, which expand existing corpora, PASC constructs synthetic corpora from scratch using predefined sentence templates, ensuring strict adherence to linguistic rules. This meticulous approach yields precise translations, accurate alignment, grammatical annotation, accurate phonemic transcriptions, and more.
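As a rough illustration of template-based generation, the sketch below fills a pair of aligned English and Japanese templates with one name record. It is not PASC's actual pipeline: the `Name` class, the `TEMPLATES` list, and the `generate` function are hypothetical names introduced here, and the two templates are simply modelled on sample rows 0002-01 and 0002-02.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Name:
    """One name record with aligned English and Japanese forms (hypothetical schema)."""
    en_given: str
    en_surname: str
    ja_given: str
    ja_surname: str


# Hypothetical templates modelled on sample rows 0002-01 and 0002-02.
# Square brackets mark the named-entity span, as in the sample tables.
TEMPLATES = [
    ("01",
     "My full name is [{en_given} {en_surname}].",
     "私の姓名は[{ja_surname}・{ja_given}]です。"),
    ("02",
     "[{en_given}] is my given name and [{en_surname}] is my surname.",
     "[{ja_given}]は私の名前で、[{ja_surname}]は私の苗字です。"),
]


def generate(entry_id: str, name: Name) -> list[tuple[str, str, str]]:
    """Fill every template pair with the same name record.

    Returns (row ID, English sentence, Japanese sentence) tuples, so each
    English sentence is aligned with its Japanese counterpart by construction.
    """
    fields = asdict(name)
    return [
        (f"{entry_id}-{suffix}", en.format(**fields), ja.format(**fields))
        for suffix, en, ja in TEMPLATES
    ]


if __name__ == "__main__":
    michael = Name("Michael", "Owen", "マイケル", "オーウェン")
    for row in generate("0002", michael):
        print(" | ".join(row))
```

Because both sides of a row come from the same record and the same template pair, alignment never has to be inferred after the fact. Real templates would also need to handle language-specific details such as name order and particles, which this sketch hard-codes.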
PASC includes very large-scale databases consisting of tens to hundreds of millions of entries for each domain. Currently, it focuses on named entities, especially personal names, place names, and points of interest for CJK languages and Arabic, to be followed by technical terms. Its distinctive features include full alignment, translation accuracy, accurate transcriptions, multilingual formats, full annotation, and consistency.
ID | ENGLISH | JAPANESE |
---|---|---|
0002-01 | My full name is [Michael Owen]. | 私の姓名は[オーウェン・マイケル]です。 |
0002-02 | [Michael] is my given name and [Owen] is my surname. | [マイケル]は私の名前で、[オーウェン]は私の苗字です。 |
0002-03 | I’m called [Michael Owen]. | [オーウェン・マイケル]と言います。 |
0002-04 | Both [Michael] and [Owen] are personal names. | [オーウェン]と[マイケル]は両方とも人名です。 |
0002-05 | [Michael Owen] is my full name. | [オーウェン・マイケル]とは私のフルネームです。 |
0002-06 | [Michael Owen] is what’s written on my ID. | 旅券に記載されている姓名は[オーウェン・マイケル]です。 |
0002-07 | I’ve never heard of anyone called [Michael Owen]. | [オーウェン・マイケル]と言う人のことを聞いたことがない。 |
0002-08 | I go by the name [Michael Owen]. | [オーウェン・マイケル]と言う名前で呼ばれています。 |
0002-09 | Do you know of anyone who goes by the name of [Michael Owen]? | [オーウェン・マイケル]という人を知っていますか。 |
ID | JAPANESE | ENGLISH |
---|---|---|
0030-01 | 私の姓名は[森隆大]です。 | My full name is [Takahiro Mori]. |
0030-02 | [隆大]は私の名前で、[森]は私の苗字です。 | [Takahiro] is my given name and [Mori] is my surname. |
0030-03 | [森隆大]と言います。 | I’m called [Takahiro Mori]. |
0030-04 | [森]と[隆大]は両方とも人名です。 | Both [Takahiro] and [Mori] are personal names. |
0030-05 | [森隆大]とは私のフルネームです。 | [Takahiro Mori] is my full name. |
0030-06 | 旅券に記載されている姓名は[森隆大]です。 | [Takahiro Mori] is what’s written on my ID. |
0030-07 | [森隆大]と言う人のことを聞いたことがない。 | I’ve never heard of anyone called [Takahiro Mori]. |
0030-08 | [森隆大]と言う名前で呼ばれています。 | I go by the name [Takahiro Mori]. |
0030-09 | [森隆大]という人を知っていますか。 | Do you know of anyone who goes by the name of [Takahiro Mori]? |
ID | CHINESE | ENGLISH |
---|---|---|
0040-01 | 我的姓名是[张小东]。 | My full name is [Xiaodong Zhang]. |
0040-02 | [小东]是我的名字,[张]是我的姓。 | [Xiaodong] is my given name and [Zhang] is my surname. |
0040-03 | 我叫[张小东]。 | I’m called [Xiaodong Zhang]. |
0040-04 | [小东]和[张]都是人名。 | Both [Xiaodong] and [Zhang] are personal names. |
0040-05 | [张小东]是我的姓名。 | [Xiaodong Zhang] is my full name. |
0040-06 | 我的身份证上的姓名是[张小东]。 | [Xiaodong Zhang] is what’s written on my ID. |
0040-07 | 我从未听过叫[张小东]的人。 | I’ve never heard of anyone called [Xiaodong Zhang]. |
0040-08 | 我叫[张小东]。 | I go by the name [Xiaodong Zhang]. |
0040-09 | 你知道叫[张小东]的人吗? | Do you know of anyone who goes by the name of [Xiaodong Zhang]? |
ID | KOREAN | ENGLISH |
---|---|---|
0050-01 | 저의 성명은 [김지영]입니다. | My full name is [Jiyeong Gim]. |
0050-02 | [지영]은 저의 이름이고, [김]은 저의 성입니다. | [Jiyeong] is my given name and [Gim] is my surname. |
0050-03 | 저는 [김지영]이라고 합니다. | I’m called [Jiyeong Gim]. |
0050-04 | [지영]과 [김]은 모두 다 인명입니다. | Both [Jiyeong] and [Gim] are personal names. |
0050-05 | [김지영]은 저의 성명입니다. | [Jiyeong Gim] is my full name. |
0050-06 | 저의 신분증의 이름은 [김지영]입니다. | [Jiyeong Gim] is what’s written on my ID. |
0050-07 | [김지영]이라는 이름은 들어본 적이 없습니다. | I’ve never heard of anyone called [Jiyeong Gim]. |
0050-08 | 저는 [김지영]이라고 합니다. | I go by the name [Jiyeong Gim]. |
0050-09 | [김지영]이라는 분을 아시나요? | Do you know of anyone who goes by the name of [Jiyeong Gim]? |
ID | ARABIC | ENGLISH |
---|---|---|
0060-01 | اسمي الكامل هو [محمد العبدي] | My full name is [Mohammed Al-Abadi]. |
0060-02 | [محمد] هو اسمي الاول، و [العبدي] هو اسمي العائلي | [Mohammed] is my first name and [Al-Abadi] is my family name. |
0060-03 | أنا أدعى [محمد العبدي] | I’m called [Mohammed Al-Abadi]. |
0060-04 | [محمد] و[العبدي] كلاهما أسماء شخصية | Both [Mohammed] and [Al-Abadi] are personal names. |
0060-05 | [محمد العبدي] هو اسمي الكامل | [Mohammed Al-Abadi] is my full name. |
0060-06 | الإسم المدرج في بطاقة هويتي هو [محمد العبدي] | The name listed on my ID card is [Mohammed Al-Abadi]. |
0060-07 | لم أسمع عن أحد يدعى [محمد العبدي] | I haven’t heard of anyone called [Mohammed Al-Abadi]. |
0060-08 | أنا ألقب ب [محمد العبدي] | I go by the name [Mohammed Al-Abadi]. |
0060-09 | هل تعرف شخصا يلقب بـ[محمد العبدي]؟ | Do you know of anyone who goes by the name [Mohammed Al-Abadi]? |
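The bracketed spans in the samples above can be read back programmatically. The following is a minimal sketch, assuming rows are stored as pipe-delimited text exactly as shown and that square brackets are used only for named-entity tags; `parse_row`, `extract_entities`, and `same_entity_count` are hypothetical helper names, not part of any PASC tooling.

```python
import re

# Matches one bracketed span such as [Michael Owen] and captures its contents.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def parse_row(row: str) -> tuple[str, str, str]:
    """Split a pipe-delimited sample row into (ID, source sentence, target sentence)."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    entry_id, source, target = cells[:3]
    return entry_id, source, target


def extract_entities(sentence: str) -> list[str]:
    """Return the bracketed entity strings in the order they appear."""
    return BRACKETED.findall(sentence)


def same_entity_count(row: str) -> bool:
    """Sanity check: both sides of a row should tag the same number of entities."""
    _, source, target = parse_row(row)
    return len(extract_entities(source)) == len(extract_entities(target))


if __name__ == "__main__":
    row = "0040-02 | [小东]是我的名字,[张]是我的姓。 | [Xiaodong] is my given name and [Zhang] is my surname. |"
    entry_id, source, target = parse_row(row)
    print(entry_id, extract_entities(source), extract_entities(target))
    print(same_entity_count(row))
```

The check compares counts only. Entity order can differ between languages (row 0002-04 lists the given name first in English and the surname first in Japanese), so pairing spans by position would not be reliable without extra annotation.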
Practical Applications
PASC can enhance the quality of language models and NLP algorithms for various applications, such as:
Neural Machine Translation
Automatic Speech Recognition
Text-to-Speech
Reference Documents
Related Resources
Chinese Personal Name Variants
Over 7 million Chinese and non-Chinese names and romanized variants
Database of Arabic Names
6.5 million Arabic personal names and their romanized variants
Japanese Personal Name Variants
Japanese personal names and their romanized variants