Linguistic and Technical Documents

This page brings together some linguistic and technical documents written by Jack Halpern, aimed at introducing the CJK languages, in addition to Arabic, with emphasis on the linguistic issues to be addressed in developing both CJK and Arabic linguistic tools.

Japanese Information Processing

A paper co-authored by Masahito Takahashi, Toshifumi Tanabe, Kosho Shudo, and Jack Halpern on JMWEL, a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based NLP applications such as machine translation and information retrieval. Presented at EUROPHRAS 2019: Computational and Corpus-based Phraseology in Malaga, Spain, in September 2019.

This paper, presented at the TAUS Executive Forum Tokyo 2017, looks at the linguistic issues related to orthographic variation, showing how Very Large-scale Lexical Resources (VLSLR) can significantly enhance the accuracy of NLP tools, with a focus on machine translation (MT), named entity recognition (NER), and named entity translation (NET). See the slide show.

This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, and on how lexical resources such as personal name dictionaries can be used to enhance the accuracy of name transliteration systems.

Introduces The CJKI Chinese Learner’s Dictionary, designed to satisfy the needs of learners and to overcome the shortcomings of existing Chinese dictionaries. Presented at ASIALEX 2011. See also the slide show.
A linguistic description of the principal challenges to be overcome by developers of CJK NLP applications, this paper was presented at workshops of COLING/ACL 2006 in Sydney as well as other conferences.

Presented at the 11th Oriental COCOSDA Workshop held in Kyoto in 2008, this paper summarizes the complex allophonic variations that need to be considered in developing Japanese speech technology applications, and introduces the 130,000-entry Japanese Phonetic Database (JPD) developed by CJKI.

A linguistic description of the principal challenges to be overcome by developers of Japanese speech technology and the role of phonological databases.
Presented at COLING 2002 (Taipei), this paper analyzes the linguistic issues of CJK orthographic variation, including Japanese, and discusses why lexical databases should play a central role in NLP.
This paper analyzes in detail the linguistic issues related to orthographic variation in Japanese, and discusses advanced information retrieval technologies such as cross-script and cross-orthographic searching for use in intelligent IR.
The highly irregular orthography and morphological complexity of Japanese pose formidable challenges to software developers. This report focuses on orthographic variation and analyzes the linguistic issues in developing Japanese linguistic tools.
Explains the subtle distinctions between the numerous homophones in Japanese, and shows why homophone processing deserves special attention in Japanese information retrieval.

Describes the linguistic issues to be addressed by advanced Japanese information retrieval technologies, focusing on cross-language and cross-synonym searching.

Describes the derivational affixes and binding valency in our Japanese lexical database, particularly useful for disambiguating Japanese lexemes in such applications as search engine query processing.

Mobile Language Learning

Enhancing Mobile Learning by Linking Japanese Dictionary Apps

This paper, presented at eLex 2019 in Sintra, Portugal, describes how four mobile apps exploit the unique features of the mobile platform to help learners study Japanese effectively in previously unavailable ways. (Abstract | Presentation)

Groundbreaking Mobile Technology to Enhance Chinese and Japanese Language Learning

This paper, presented at ACLL2017: The Asian Conference on Language Learning in Kobe, Japan, describes our groundbreaking Libera platform that combines the strengths of traditional bilingual parallel texts with the educational potential of the smart tablet platform. (Abstract | Presentation)

Exploiting Mobile Technology to Enhance EFL
This paper, presented at the JALT2016 Annual Conference in Nagoya, describes our groundbreaking Libera platform that combines the strengths of traditional bilingual parallel texts with the educational potential of the smart tablet platform. (Abstract | Presentation)
Exploiting Mobile-Assisted Language Learning Technology to Enhance Japanese Language Education

This poster presentation, given at the 2016 Pacific Second Language Research Forum in Tokyo, describes two applications that leverage mobile technology to help learners study Japanese more effectively than ever before. (Abstract | Poster)

A workshop presentation sponsored by Kodansha USA was given at the 2015 ACTFL Annual Convention and World Languages Expo in San Diego, CA.

Presented at the 11th Oriental COCOSDA Workshop held in Kyoto in 2008, this paper summarizes the complex allophonic variations that need to be considered in developing Japanese speech technology applications, and introduces the 130,000-entry Japanese Phonetic Database (JPD) developed by CJKI.

The Japanese Language

This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.

The aim of this presentation, given at the Second Wordnet Bahasa Workshop in Singapore in January 2016, was to examine several key issues in pedagogical lexicography both from the lexicographer’s and from the kanji learner’s points of view, focusing on compilation and design innovations that increase learner usability. (Abstract | Presentation)

A fairly detailed introduction to the Japanese writing system, including the birth of the Chinese characters, the function of kanji in Japanese, and a description of the various scripts used in Japanese.

Presented at Euralex ’94, this paper describes how we began to develop DESK, our comprehensive CJK lexical databases, on the basis of the New Japanese-English Character Dictionary.

A detailed introduction to the hiragana, katakana, and romaji scripts, which together with kanji constitute the complex Japanese writing system.

Describes the principal word-formation processes in Japanese, with special emphasis on the function of kanji as word elements and bound affixes.

Chinese Information Processing

This paper looks at the linguistic issues related to orthographic variation, showing how Very Large-scale Lexical Resources (VLSLR) can significantly enhance the accuracy of NLP tools, with a focus on information retrieval (IR), named entity recognition (NER), and named entity translation (NET).

This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, and on how lexical resources such as personal name dictionaries can be used to enhance the accuracy of name transliteration systems. See also the slide show.

This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.

Introduces The CJKI Chinese Learner’s Dictionary, designed to satisfy the needs of learners and to overcome the shortcomings of existing Chinese dictionaries. Presented at ASIALEX 2011. See also the slide show.
A linguistic description of the principal challenges to be overcome by developers of Chinese NLP applications.
This paper analyzes the linguistic issues of CJK orthographic variation, and discusses why lexical databases should play a central role in disambiguation.

Presented at several international conferences, this academic paper presents an in-depth analysis of the linguistic and technical issues related to converting Simplified Chinese to/from Traditional Chinese.

This report focuses on the complexities of orthographic variation in Chinese, analyzes the linguistic issues in developing Chinese linguistic tools, and describes the major differences between Traditional and Simplified Chinese.
Traditional Chinese does not have a stable orthography. This short document describes the various types of character form variants and how they relate to each other.

Korean Information Processing

This paper analyzes the linguistic issues of CJK orthographic variation, including Korean, and discusses why lexical databases should play a central role in NLP.
This report focuses on Korean orthographic variation and analyzes the linguistic issues to be addressed when developing Korean linguistic tools, especially intelligent information retrieval tools.

Arabic Information Processing

This paper, co-written with Yannis Haralambous and accepted for presentation at The 4th Workshop on Arabic Corpus Linguistics (WACL-4), focuses on the strategies utilized to compile ArabLEX and DiaLEX as a methodological framework for creating comprehensive Arabic lexical resources and full-form lexicons for other dialects.

This paper describes ArabLEX, a full-form lexicon specifically designed to support NLP applications such as morphological analysis, machine translation, named entity recognition (NER), morphological generation, and speech technology.

A white paper about our ArabLEX database, a comprehensive Arabic lexical resource that provides a rich set of grammatical, morphological and phonological features.

This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, and on how lexical resources such as personal name dictionaries can be used to enhance the accuracy of name transliteration systems. See also the slide show.

This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.

This presentation, given at ASIALEX2016 in the Philippines, describes three bilingual learner’s dictionaries. (Abstract | Presentation slides)

A panel discussion organized by our director Jack Halpern for the Middle East Studies Association (MESA) 2014 Annual Meeting focused on methodologies to create pedagogically effective language learning and dictionary applications by harnessing the vast potential of the mobile platform. View Mr. Halpern’s presentation abstract here.

Discusses key issues related to the selection of headwords in Arabic dictionaries, in particular learner’s dictionaries, and briefly touches on criteria for selecting word senses.
Presented at the 2012 International Conference on Asian Languages Processing (Hanoi), this paper describes some of the methodology used in compiling two innovative Arabic learner’s dictionaries fine-tuned to the special needs of learners that present abundant lexicographic information in a user-friendly manner.
Introduces a new type of Arabic-English dictionary and smartphone app fine-tuned to the special needs of learners, and describes the ultimate verb conjugator smartphone app that provides instant access to verb conjugation paradigms.

An innovative phonemic transcription system developed mainly for ease of use by learners of Modern Standard Arabic, with several unique features including an indication of word stress and vowel neutralization. Presented at the Towards A Transliteration Standard of Arabic: Challenges and Solutions conference in Abu Dhabi in 2009. See also the slide show.

This paper describes the techniques used to compile the Database of Arabic Names (DAN), the world’s largest Arab name resource containing millions of names and their variants. Presented at the 2nd International Conference on Arabic Language Resources and Tools in Cairo in 2009.
This paper presents word stress and neutralization rules that are both linguistically accurate and pedagogically useful based on how spoken MSA is actually pronounced. Presented at the 2nd International Conference on Arabic Language Resources and Tools in Cairo in 2009.

This paper analyzes the principal linguistic issues of Arabic and CJK orthographic variation and argues that linguistic knowledge supported by large-scale lexical databases is essential for accurate disambiguation. Presented at LREC 2008.

This paper, presented at The Second Workshop on Computational Approaches to Arabic Script-based Languages (CAASL2) held at Stanford University, focuses on the linguistic issues encountered in developing unique systems for the automatic romanization of Arabic names and the arabization of non-Arabic names, including systems that can arabize CJK names directly.

Others

This paper explores the evidence supporting the continued existence and unique benefits of paper dictionaries for language learners and enthusiasts.

Parallel Annotated Synthetic Corpora (PASC)

The Parallel Annotated Synthetic Corpora (PASC) project focuses on creating comprehensive synthetic corpora for various applications in natural language processing and speech translation. By providing fully aligned and accurate synthetic corpora along with precise annotations, the quality of language models, including Neural Machine Translation, Automatic Speech Recognition, and Text to Speech, can be enhanced. (White Paper | Summary | Data Sample)

This paper, presented at the Collocations in Lexicography: existing solutions and future challenges workshop at eLex 2019 in Sintra, Portugal, discusses some of the fundamental principles for the selection of headwords in bilingual dictionaries. (Abstract | Presentation)

This paper, presented at The 4th Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2019) in Malaga, Spain, discusses the fundamental principles for identifying and selecting MWUs for inclusion in bilingual dictionaries, both for humans and for MT systems (MT lexicons).

Describes the principal word-formation processes in English, and demonstrates that word segmentation in English, contrary to popular belief, is far from trivial.
Criteria for Inclusion of Multiword Lexical Units in Dictionaries
Coming Soon.
European and Semitic languages
Coming Soon. A series of reports describing the features of the major European and Semitic languages, focusing on orthographic variation, and describing the linguistic issues to be addressed in developing linguistic tools.