Linguistic and Technical Documents
This page brings together linguistic and technical documents written by Jack Halpern that introduce the CJK languages and Arabic, with emphasis on the linguistic issues to be addressed in developing both CJK and Arabic linguistic tools.
Japanese Information Processing
A paper co-authored by Masahito Takahashi, Toshifumi Tanabe, Kosho Shudo, and Jack Halpern on JMWEL, a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based NLP applications such as machine translation and information retrieval. Presented at EUROPHRAS 2019: Computational and Corpus-based Phraseology in Malaga, Spain, in September 2019.
This paper, presented at the TAUS Executive Forum Tokyo 2017, looks at the linguistic issues related to orthographic variation, showing how Very Large-scale Lexical Resources (VLSLR) can significantly enhance the accuracy of NLP tools, with a focus on machine translation (MT), named entity recognition (NER), and named entity translation (NET). See the slide show.
This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems.
Presented at the 11th Oriental COCOSDA Workshop held in Kyoto in 2008, this paper summarizes the complex allophonic variations that need to be considered in developing Japanese speech technology applications, and introduces the 130,000-entry Japanese Phonetic Database (JPD) developed by CJKI.
Describes the linguistic issues to be addressed by advanced Japanese information retrieval technologies, focusing on cross-language and cross-synonym searching.
Mobile Language Learning
Enhancing Mobile Learning by Linking Japanese Dictionary Apps
This paper, presented at eLex 2019 in Sintra, Portugal, describes how four mobile apps exploit the unique features of the mobile platform to help learners study Japanese effectively in previously unavailable ways. (Abstract | Presentation)
Groundbreaking Mobile Technology to Enhance Chinese and Japanese Language Learning
This paper, presented at ACLL2017: The Asian Conference on Language Learning in Kobe, Japan, describes our groundbreaking Libera platform, which combines the strengths of traditional bilingual parallel texts with the educational potential of the smart tablet platform. (Abstract | Presentation)
Exploiting Mobile Technology to Enhance EFL
Exploiting Mobile-Assisted Language Learning Technology to Enhance Japanese Language Education
Presented at the 11th Oriental COCOSDA Workshop held in Kyoto in 2008, this paper summarizes the complex allophonic variations that need to be considered in developing Japanese speech technology applications, and introduces the 130,000-entry Japanese Phonetic Database (JPD) developed by CJKI.
The Japanese Language
This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.
The aim of this presentation, given at the Second Wordnet Bahasa Workshop in Singapore in January 2016, was to examine several key issues in pedagogical lexicography both from the lexicographer’s and from the kanji learner’s points of view, focusing on compilation and design innovations that increase learner usability. (Abstract | Presentation)
Presented at Euralex ’94, this paper describes how we began to develop DESK, our comprehensive CJK lexical databases, on the basis of the New Japanese-English Character Dictionary.
A detailed introduction to the hiragana, katakana, and romaji scripts, which together with kanji constitute the complex Japanese writing system.
Chinese Information Processing
This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems. See also the slide show.
This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.
Presented at several international conferences, this academic paper presents an in-depth analysis of the linguistic and technical issues related to converting Simplified Chinese to/from Traditional Chinese.
Korean Information Processing
Arabic Information Processing
This paper, co-written with Yannis Haralambous and accepted for presentation at The 4th Workshop on Arabic Corpus Linguistics (WACL-4), focuses on the strategies utilized to compile ArabLEX and DiaLEX as a methodological framework for creating comprehensive Arabic lexical resources and full-form lexicons for other dialects.
This paper describes ArabLEX, a full-form lexicon specifically designed to support NLP applications such as morphological analysis, machine translation, named entity recognition (NER), morphological generation, and speech technology.
This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems. See also the slide show.
This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.
This presentation at ASIALEX2016 in the Philippines describes three bilingual learner’s dictionaries. (Abstract | Presentation slides)
A panel discussion organized by our director Jack Halpern for the Middle East Studies Association (MESA) 2014 Annual Meeting focused on methodologies to create pedagogically effective language learning and dictionary applications by harnessing the vast potential of the mobile platform. View Mr. Halpern’s presentation abstract here.
An innovative phonemic transcription system developed mainly for ease of use by learners of Modern Standard Arabic, with several unique features including an indication of word stress and vowel neutralization. Presented at the Towards A Transliteration Standard of Arabic: Challenges and Solutions conference in Abu Dhabi in 2009. See also the slide show.
This paper analyzes the principal linguistic issues of Arabic and CJK orthographic variation and argues that linguistic knowledge supported by large-scale lexical databases is essential for accurate disambiguation. Presented at LREC 2008.
Others
This paper explores the evidence supporting the continued existence and unique benefits of paper dictionaries for language learners and enthusiasts.
Parallel Annotated Synthetic Corpora (PASC)
The Parallel Annotated Synthetic Corpora (PASC) project focuses on creating comprehensive synthetic corpora for various applications in natural language processing and speech translation. By providing fully aligned and accurate synthetic corpora along with precise annotations, the quality of language models, including Neural Machine Translation, Automatic Speech Recognition, and Text to Speech, can be enhanced. (White Paper | Summary | Data Sample)
This paper, presented at the Collocations in Lexicography: existing solutions and future challenges workshop at eLex 2019 in Sintra, Portugal, discusses some of the fundamental principles for the selection of headwords in bilingual dictionaries. (Abstract | Presentation)
This paper, presented at The 4th Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2019) in Malaga, Spain, discusses the fundamental principles for identifying and selecting MWUs for inclusion in bilingual dictionaries, both for humans and for MT systems (MT lexicons).