ELSNET-list archive

Category:   E-SSchool
Subject:   NSF-supported Summer Internships
From:   Fred Jelinek
Email:   jelinek_(on)_clsp.jhu.edu
Date received:   11 Feb 2003
Deadline:   15 Feb 2003

Dear Colleague:

The Center for Language and Speech Processing at the Johns Hopkins University is offering a unique summer internship opportunity, which we would like you to bring to the attention of your best students in the current junior class. Preliminary applications for these internships are due at the end of this week.

This internship is unique in the sense that the selected students will participate in cutting-edge research as full members alongside leading scientists from industry, academia, and the government. What makes the internship exciting is the exposure it gives undergraduate students to the emerging fields of language engineering, such as automatic speech recognition (ASR), natural language processing (NLP) and machine translation (MT). We are specifically looking to attract new talent into the field and, as such, do not require the students to have prior knowledge of language engineering technology.

Please take a few moments to nominate suitable bright students for this internship. On-line applications for the program can be found at http://www.clsp.jhu.edu/ along with additional information regarding plans for the 2003 Workshop and information on past workshops. The application deadline is February 15, 2003.

If you have questions, please contact us by phone (410-516-4237), e-mail (sec_(on)_clsp.jhu.edu) or via the Internet at http://www.clsp.jhu.edu

Sincerely,

Frederick Jelinek
J.S. Smith Professor and Director

------------------------------------------------------------------------
Team Project Descriptions for this Summer
------------------------------------------------------------------------

1. Syntax for Statistical Machine Translation

In recent evaluations of machine translation systems, statistical systems based on probabilistic models have outperformed classical approaches based on interpretation, transfer, and generation. Nonetheless, the output of statistical systems often contains obvious grammatical errors.
This can be attributed to the fact that syntactic well-formedness is influenced only by local n-gram language models and simple alignment models. We aim to integrate syntactic structure into statistical models to address this problem. A very convenient and promising approach for this integration is the maximum entropy framework, which allows us to integrate many different knowledge sources into an overall model and to train the combination weights discriminatively. This approach will allow us to extend a baseline system easily by adding new feature functions.

The workshop will start with a strong baseline -- the alignment template statistical machine translation system that obtained the best results in the 2002 DARPA MT evaluations. During the workshop, we will incrementally add new features representing syntactic knowledge that deal with specific problems of the underlying baseline.

We want to investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions might test if a certain constituent occurs in the source and the target language parse tree. More sophisticated features will be derived from an alignment model where whole sub-trees in source and target can be aligned node by node. We also plan to investigate features based on projection of parse trees from one language onto strings of another, a useful technique when parses are available for only one of the two languages. We will extend previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic.

We will work with the Chinese-English data from the recent evaluations, as large amounts of sentence-aligned training corpora, as well as multiple reference translations, are available. This will also allow us to compare our results with the various systems participating in the evaluations.
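As an illustration of the maximum entropy (log-linear) combination described above, the following is a minimal sketch and not the workshop system. The feature functions `length_ratio` and `has_verb`, the weight values, and the toy sentences are all hypothetical; in the setting described, the weights would be trained discriminatively rather than set by hand.

```python
import math

# Hypothetical feature functions h_m(source, target). They are illustrative
# only and are not taken from the alignment template baseline.
def length_ratio(source, target):
    """Penalize targets whose token count differs from the source."""
    return -abs(len(source) - len(target))

def has_verb(source, target):
    """Toy binary 'syntactic' feature: 1.0 if the target contains a known verb."""
    verbs = {"is", "are", "was", "were", "has", "have"}
    return 1.0 if any(word in verbs for word in target) else 0.0

FEATURES = [length_ratio, has_verb]
WEIGHTS = [0.5, 2.0]  # assumed values; normally learned from data

def score(source, target):
    """Weighted sum of feature values: sum over m of lambda_m * h_m."""
    return sum(w * h(source, target) for w, h in zip(WEIGHTS, FEATURES))

def posterior(source, candidates):
    """Normalize exp(score) over a finite list of candidate translations."""
    scores = [score(source, c) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_translation(source, candidates):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(source, c))

if __name__ == "__main__":
    source = ["ta", "shi", "xuesheng"]  # toy romanized source sentence
    candidates = [["he", "student"], ["he", "is", "a", "student"]]
    print(best_translation(source, candidates))
    print(posterior(source, candidates))
```

Adding a new knowledge source in this sketch means appending one function to `FEATURES` and one weight to `WEIGHTS`, which is the kind of easy extension the framework is meant to offer.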
In addition, annotation is underway on a Chinese-English parallel tree-bank. We plan to evaluate the improvement of our system using both automatic metrics for comparison with reference translations (BLEU and NIST) and subjective evaluations of adequacy and fluency. We hope both to improve machine translation performance and to advance the understanding of how linguistic representations can be integrated into statistical models of language.

------------------------------------------------------------------------

2. Semantic Analysis Over Sparse Data

The aim of the task is to verify the feasibility of a machine learning-based semantic approach to the data sparseness problem that is encountered in many areas of natural language processing such as language modeling, text classification, question answering and information extraction. The suggested approach takes advantage of several technologies for supervised and unsupervised sense disambiguation that have been developed in the last decade and of several resources that have been made available.

The task is motivated by the fact that current language processing models are considerably affected by sparseness of training data, and current solutions, like class-based approaches, do not elicit appropriate information: the semantic nature and linguistic expressiveness of automatically derived word classes are unclear. Many of these limitations originate from the fact that fine-grained automatic sense disambiguation is not applicable on a large scale.

The workshop will develop a weakly supervised method for sense modeling (i.e. reduction of possible word senses in corpora according to their genre) and apply it to a huge corpus in order to coarsely sense-disambiguate it. This can be viewed as an incremental step towards fine-grained sense disambiguation.
The resulting semantic repository, as well as the techniques developed, will be made available as resources for future work on language modeling, semantic acquisition for text extraction, question answering, summarization, and most other natural language processing tasks.

------------------------------------------------------------------------

3. Dialectal Chinese Speech Recognition

In addition to Mandarin (Northern China), there are eight major dialectal regions in China: Wu (Southern Jiangsu, Zhejiang, and Shanghai), Yue (Guangdong, Hong Kong, Nanning Guangxi), Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan), Hakka (Meixian Guangdong, Hsin-chu Taiwan), Xiang (Hunan), Gan (Jiangxi), Hui (Anhui), and Jin (Shanxi). These dialects can be further divided into more than 40 sub-categories. Although the Chinese dialects share a written language and standard Chinese (Putonghua) is widely spoken in most regions, speech is still strongly influenced by the native dialects. This great linguistic diversity poses problems for automatic speech and language technology.

Automatic speech recognition relies to a great extent on the consistent pronunciation and usage of words within a language. In Chinese, word usage, pronunciation, syntax, and grammar vary depending on the speaker's dialect. As a result, speech recognition systems constructed to process standard Chinese (Putonghua) perform poorly for the great majority of the population.

The goal of our summer project is to develop a general framework to model phonetic, lexical, and pronunciation variability in dialectal Chinese automatic speech recognition tasks. The baseline system is a standard Chinese recognizer. The goal of our research is to find suitable methods that employ dialect-related knowledge and training data (in relatively small quantities) to modify the baseline system to obtain a dialectal Chinese recognizer for the specific dialect of interest.
For practical reasons, during the summer we will focus on one specific dialect, for example the Wu dialect or the Chuan dialect. However, the techniques we intend to develop should be broadly applicable.

Our project will build on established ASR tools and systems developed for standard Chinese. In particular, our previous studies in pronunciation modeling have established baseline Mandarin ASR systems along with their component lexicons and language model collections. However, little previous work and few resources are available to support research in Chinese dialect variation for ASR. Our pre-workshop effort will therefore focus on further infrastructure development:

* Dialectal Lexicon Construction. We will establish an electronic dialect dictionary for the chosen dialect. The lexicon will be constructed to represent both standard and dialectal pronunciations.

* Dialectal Chinese Database Collection. We will set up a dialectal Chinese speech database with canonical pinyin level and dialectal pinyin level transcriptions. The database could contain two parts: read speech and spontaneous speech. For the spontaneous speech part, the generalized initial/final (GIF) level transcription should also be included.

Our effort at the workshop will be to employ these materials to develop ASR system components that can be adapted from standard Chinese to the chosen dialect. Emphasis will be placed on developing techniques that work robustly with relatively little (or even no) dialect data. Research will focus primarily on acoustic phenomena, rather than syntax or grammatical variation, which we intend to pursue after establishing baseline ASR experiments.

------------------------------------------------------------------------

4. Confidence Estimation for Natural Language Applications

Significant progress has been made in natural language processing (NLP) technologies in recent years, but most still do not match human performance.
Since many applications of these technologies require human-quality results, some form of manual intervention is necessary. The success of such applications therefore depends heavily on the extent to which errors can be automatically detected and signaled to a human user. In our project we will attempt to devise a generic method for NLP error detection by studying the problem of Confidence Estimation (CE) in NLP results within a Machine Learning (ML) framework.

Although widely used in Automatic Speech Recognition (ASR) applications, this approach has not yet been extensively pursued in other areas of NLP. In ASR, error recovery is entirely based on confidence measures: results with a low level of confidence are rejected and the user is asked to repeat his or her statement. We argue that a large number of other NLP applications could benefit from such an approach. For instance, when post-editing MT output, a human translator could revise only those automatic translations that have a high probability of being wrong. Apart from improving user interactions, CE methods could also be used to improve the underlying technologies. For example, bootstrap learning could be based on outputs with a high confidence level, and NLP output re-scoring could depend on probabilities of correctness.

Our basic approach will be to use a statistical ML framework to post-process NLP results: an additional ML layer will be trained to discriminate between correct and incorrect NLP results and compute a confidence measure (CM) that is an estimate of the probability of an output being correct. We will test this approach on a statistical MT application using a very strong baseline MT system. Specifically, we will start off with the same training corpus (Chinese-English data from recent NIST evaluations) and baseline system as the Syntax for Statistical Machine Translation team.
During the workshop, we will investigate a variety of confidence features and test their effects on the discriminative power of our CM using Receiver Operating Characteristic (ROC) curves. We will investigate features intended to capture the amount of overlap, or consensus, among the system's n-best translation hypotheses, features focusing on the reliability of estimates from the training corpus, ones intended to capture the inherent difficulty of the source sentence under translation, and those that exploit information from the base statistical MT system. Other themes for investigation include a comparison of different ML frameworks such as Neural Nets or Support Vector Machines, and a determination of the optimal granularity for confidence estimates (sentence-level, word-level, etc.).

Two methods will be used to evaluate final results. First, we will perform a re-scoring experiment where the n-best translation alternatives output by the baseline system will be re-ordered according to their confidence estimates. The results will be measured using the standard automatic evaluation metric BLEU, and should be directly comparable to those obtained by the Syntax for Statistical Machine Translation team. We expect this to lead to many insights about the differences between our approach and theirs. Another method of evaluation will be to estimate the tradeoff between final translation quality and amount of human effort invested, in a simulated post-editing scenario.
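The two evaluation ideas in project 4, ROC analysis of a confidence measure and re-ordering of n-best hypotheses by confidence, can be sketched as follows. This is only an illustration under stated assumptions: the confidence values are taken as given (the ML layer that would produce them is not modeled), the correctness labels are assumed to come from some human or automatic judgment, and the function names `roc_points`, `area_under_curve` and `rerank` are hypothetical.

```python
def roc_points(confidences, labels):
    """Sweep an acceptance threshold from high to low confidence.

    labels: 1 = output judged correct, 0 = judged incorrect (assumed given).
    Returns a list of (false_positive_rate, true_positive_rate) pairs.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect output")
    ranked = sorted(zip(confidences, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (conf, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != conf:
            points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve (1.0 means perfect separation)."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def rerank(nbest, confidence):
    """Re-order n-best hypotheses by descending confidence estimate."""
    return sorted(nbest, key=confidence, reverse=True)

if __name__ == "__main__":
    # Toy data: four outputs with made-up confidences and correctness labels.
    confidences = [0.9, 0.8, 0.3, 0.1]
    labels = [1, 0, 1, 0]
    curve = roc_points(confidences, labels)
    print(curve)
    print(area_under_curve(curve))

    # Toy n-best list with made-up confidence estimates per hypothesis.
    estimates = {"hyp a": 0.2, "hyp b": 0.9, "hyp c": 0.5}
    print(rerank(list(estimates), estimates.get))
```

A confidence feature that helps should push the curve toward the upper left corner and raise the area; comparing areas before and after adding a feature is one simple way to read "discriminative power" off the ROC curve.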
