Second Frederick Jelinek Memorial Summer Workshop

Research Groups

  • Continuous Wide-Band Machine Translation

    Continuous space models of language (CSMs) have recently been shown to be highly successful in text processing, computer vision, and speech processing. CSMs are compelling for machine translation since they permit a diverse variety of contextual information to be considered when making local (e.g., lexical and morphological) and global (e.g., sentence structure) translation decisions. They thus promise to make machine translation practical with fewer examples of parallel sentences by leveraging limited parallel data more effectively. Beyond this, their flexibility allows representations to be learned from other data sources such as monolingual corpora (thus improving performance in low-resource settings), and allows translation decisions to be conditioned more easily on context (thus improving the state of the art in high-resource scenarios). At a high level, our workshop has the following goals: (i) develop a suite of tools to make experimentation with continuous space translation models practical; (ii) demonstrate their effectiveness in low-resource translation scenarios; and (iii) develop models that condition on non-local context -- in particular, discourse structure -- to improve the state of the art in high-resource scenarios.

    Scientifically, our modeling efforts will focus on what we call wide-band translation, reflecting the fact that translation decisions can be modeled locally (at the word and phrase level), at the sentence level, and at the discourse level. Current models for translation emphasize sentence-level models, relying on heuristically learned word- and phrase-level models, and ignoring the document context entirely. While it is reasonable to assume that a document translation that consists of good sentence translations will generally be effective, wide-band translation suggests that each level should be modeled independently and then merged into a single global model. This approach enables the translation problem to be decomposed into subproblems that can be addressed independently, using customized resources and models. Additionally, by controlling the number of outputs generated by the local model, the computational complexity of performing inference in subsequent models can be carefully controlled to balance translation quality and decoding speed. We discuss the research focus of each of these levels in turn.

    Local translation modeling. Local translation modeling refers to making decisions about the likely translation possibilities for words and phrases in the input -- in the context of the document and sentence they occur in. These are then pieced together by subsequent models to build complete translations. In current statistical translation models, the translation options available for each word or phrase are limited to those observed in a parallel corpus: the available translations for some input must exactly match words and phrases seen in the parallel training data. Local translation modeling seeks to extend the inventory of phrases available to later models by synthesizing translation options that would not be available to the sentence-level model in a traditional approach -- when such local translations are missing, building a translation is like putting together a puzzle with missing pieces. To illustrate how this can go wrong, in the figure to the right, the translation options necessary to generate the plausible translation I saw the cats are missing. Our modeling efforts here will construct word and phrasal translations by modeling morphological processes (enabling us to model translations with units generally much smaller than can be effectively used in sentence-level translation models), modeling the dependence of translation decisions on the syntactic and discursive contexts they occur in, and generating lexical translation options from word embeddings learned from monolingual corpora (thus minimizing the amount of parallel data necessary to perform translation).

    Sentential translation modeling. Equipped with an inventory of translation options for subspans of the input, the task of generating a sentence translation is putting these pieces together in a sensible order. This is a deceptively complex problem: reordering phrasal units appropriately requires modeling interactions between the local decisions, formulating a distribution over word/phrase permutations, and searching this space. Dynamic programming search algorithms from context-free parsing enable a large subset (exponential in the length) of the permutation space to be searched in polynomial time; however, modeling this space is notoriously difficult since the relevant features for determining word order are not obvious. Our insight here is to exploit the fact that when local translation decisions are combined, their meanings should compose to form a semantically coherent whole. We will operationalize this using compositional models of vector semantics that have been proposed in previous work for monolingual applications: each reordering decision (or, in grammar-based models, each nonterminal substitution that drives reordering) is made by using the composed vector representation to predict the appropriateness of the composed unit in a translation. See the figure to the left. This reduces the reordering problem to a series of nested binary classification decisions which can be trained using parallel data. Since the sizes of the constituent and composed vector representations are fixed, statistical efficiency can be maintained, even in low-resource scenarios.
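
    As an illustration of this reduction, the sketch below composes two constituent vectors and scores a binary keep-vs-swap reordering decision with a logistic classifier. The composition function, dimensions, and parameters are invented for illustration; in the proposed models they would be trained on parallel data.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # dimensionality of the phrase vectors (illustrative)

# Hypothetical parameters: a composition matrix and a reordering classifier,
# which in the proposed models would be learned from parallel data.
W = rng.normal(scale=0.1, size=(d, 2 * d))   # composes two child vectors
w_reorder = rng.normal(scale=0.1, size=d)    # scores keep vs. swap

def compose(u, v):
    """Compose two constituent vectors into a single parent vector."""
    return np.tanh(W @ np.concatenate([u, v]))

def p_swap(u, v):
    """Probability that the two constituents should be swapped in the
    target-language order: one nested binary classification decision."""
    z = compose(u, v)
    return 1.0 / (1.0 + np.exp(-w_reorder @ z))

# Vectors for two adjacent translated spans; the parent vector can itself
# feed the next (outer) reordering decision, giving the nested structure.
u, v = rng.normal(size=d), rng.normal(size=d)
parent = compose(u, v)
swap_prob = p_swap(u, v)
```

    Because the parent vector has the same fixed dimensionality as its children, the same classifier applies at every level of nesting.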

    A second challenge related to global sentence modeling that must be addressed when modeling sentences with continuous space variables is decoding. The standard dynamic programming approaches taken in the literature are rendered ineffective since recombination is no longer possible. For these models our starting point will be a reranking framework, but during the course of the workshop, more efficient decoding algorithms will be developed. Beyond its application to identifying the best translation, an efficient decoding algorithm will facilitate the development of alternative training objective functions; part of the workshop effort will look at various options, including local cross-entropy training, global margin maximisation, and data reconstruction error using auto-encoders. We anticipate that the choice of training objective will be important for learning in situations where there is considerable derivational ambiguity, as occurs with modern state-of-the-art phrase-based and grammar-based decoders in high-resource translation settings.

    Discourse translation modeling.
    Conventional approaches to machine translation translate sentences independently of one another and without reference to the discourse context they occur in. This is a serious limitation. For example, in languages with relatively free word order, the information status of syntactic constituents tends to determine where they should appear. Thus, our final scientific contribution will be to explore the incorporation of discourse context in translation modeling. A serious challenge in modeling discourse dependency is determining what the relevant representation of discursive context is: while lexical and syntactic features can be effectively captured with hand-engineered features accessible via standard NLP tools, linguistic theory offers strongly divergent accounts of the appropriate representation of discourse. A key aspect of the workshop will be comparing different feature sets determined by different discourse parsing formalisms, as well as purely unsupervised acquisition of extra-sentential features via convolutional neural networks. Refer to the figure to the left for an illustration of a translation architecture that incorporates extra-sentential information as a pair of vectors in the source and target languages.

    Languages and resources. We will target translation between English and three typologically representative languages: German, Czech, and Malagasy, an Austronesian language spoken by 18 million people, mostly in Madagascar. Czech and Malagasy word order is determined largely by the information status of NPs, making them an ideal test case for our models. All three have rich morphological paradigms, to which the naive lexical assumptions made in conventional translation models are poorly suited. Malagasy is also representative of the low-resource scenario, where we must learn our models from limited parallel data. All three have standard parallel corpora and machine translation test sets available with standard baselines. Additionally, Czech has a discourse treebank, enabling us to explicitly model discourse relations in a non-English source language. Finally, discourse analysis has been most widely studied in English, and Georgia Tech's state-of-the-art RST parser will be used in this workshop.

    Why now? This is an optimal time to pursue the development of a suite of common tools for continuous space language models. Existing toolkits for machine translation (cdec, Moses) are in wide use, and generic tools for developing training and inference algorithms with continuous space models are likewise available (Theano, CNTK, Caffe). From an engineering perspective, they need only be combined. In terms of research interest, there are numerous research groups around the world exploring continuous space models for translation, and this workshop will bring these groups together to avoid inefficient duplication of effort. This workshop is an opportunity to shape and spur future research in this area, and to popularise the importance and academic challenges of low-resource translation.

    Expected impact. The expected impact of this workshop will be an open-source toolkit for training and decoding with continuous space language and translation models, a reference reimplementation of standard continuous space translation models, publications showing the effect of our wide-band translation formulation and of specific topics in translation modeling (at the local and sentential levels), a series of publications on the role of discourse in translation, and papers on low-resource translation using continuous space models with extremely limited parallel data resources.

    Team Members

    Chris Dyer, Carnegie Mellon University
    Senior Members
    Trevor Cohn, Melbourne University
    Kevin Duh, Nara Institute of Science & Technology
    Jacob Eisenstein, Georgia Institute of Technology
    Kaisheng Yao, Microsoft Research
    Graduate Students
    Yangfeng Ji, Georgia Institute of Technology
    Austin Matthews, Carnegie Mellon University
    Ekaterina Vylomova, Melbourne University
    Bahman Yari Saeed Khanloo, Monash University
  • Far-Field Enhancement and Recognition in Mismatched Settings


    Based on the recent success of automatic speech recognition (ASR) for mobile applications, noise robustness of ASR in the real world has become a very important technical issue. ASR systems will soon be expected to function in a variety of conditions -- gaming (Kinect), personal assistants (Amazon Echo), meeting recognition, and distant wire-taps, to name a few. Traditional application scenarios tend to use the same microphone and channel conditions in training and at test time. Efforts so far have focused on developing techniques, such as microphone arrays, source separation, speech enhancement (SE), and ASR, that work in a given specific setting. Here the setting consists of the configuration (i.e., the number of mics and their geometry) and the environment (i.e., room noise and reverberation). Such approaches tend to over-fit the system to the training setting, and do not generalize well to mismatched or unseen settings.

    We propose tackling this challenging problem using cutting-edge machine learning techniques based around three themes. The first theme is the embedding of generative model-based strategies into a deep learning framework using deep unfolding [Hershey et al., 2014]. Conventional generative model strategies such as adaptation [Gales, 1998], uncertainty decoding [Barker et al., 2005], and variational inference [Rennie et al., 2010; Watanabe et al., 2004] allow us to use physical problem constraints to guide an adaptation process, such as inferring the acoustic channel parameters, in order to generalize to new acoustic configurations. Nevertheless, we expect that such methods can be even more powerful when incorporated into a deep learning framework, in which the adaptation model itself can be discriminatively trained to produce more accurate estimates of the signals of interest. The second theme is the augmentation of training data based on existing databases to provide better coverage of unexpected conditions [Cui et al., 2014]. There are many factors of variation in far-field ASR, including noise types, microphone configuration, and room acoustics. By considering the acoustic and physiological constraints of the data generation, however, we can construct stochastic generative processes with few degrees of freedom from which we can efficiently sample multiple instances of training data, enabling multi-condition training. The third theme is the exploitation of multi-task learning methodologies [Seltzer & Droppo, 2013] for ASR and SE, now that, in the context of deep networks, the mathematical formalism describing enhancement and ASR can be identical [Mohamed et al., 2012].


    I. Probabilistic model-based methods, for example: 1) self-calibrating mic arrays using ASR models; 2) dereverberation using model-based approaches, e.g., [Nakatani et al., 2011]; 3) model-based speech enhancement, including non-negative matrix factorization (NMF) and its generalizations (e.g., multichannel NMF [Ozerov & Fevotte, 2010]). These methods may be loosely integrated with ASR via lattice-based methods [Mandel & Narayanan, 2014; Carmona et al., 2013] or tightly integrated inside the speech decoder, if possible.

    II. Data augmentation. We will exploit several data augmentation techniques for deep networks to cover speaker variations based on linear transformations or vocal tract length mapping between speakers [Cui et al., 2014], and extend these ideas to multiple configurations and environments to generate an augmented training data set, increasing generalization of the models.

    III. Deep network methods for integrating ASR acoustic modeling and enhancement, in a multi-task learning framework, for example, long short-term memory (LSTM) recurrent neural networks (RNNs), bi-directional LSTMs, convolutional networks, pooling across microphones, pooling across beam directions, and so on.

    IV. Deep unfolding of model-based methods, a hybrid of I. and III. For example, we can derive a novel deep network architecture (as in III.) whose layers emulate the computations performed in the iterations of (for example) a variational algorithm for model-based noise and reverberation compensation (as in I.) using the framework in [Hershey et al., 2014].
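
    As a concrete illustration of the data augmentation in II., the sketch below samples one augmented utterance by convolving a clean signal with a room impulse response and adding noise at a randomly drawn SNR. The toy impulse response, noise, and SNR range are invented stand-ins for real recorded RIR and noise databases:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(clean, rirs, noises, snr_db_range=(0.0, 20.0)):
    """Sample one augmented utterance: random room impulse response,
    random noise segment, random SNR. All inputs are 1-D float arrays."""
    rir = rirs[rng.integers(len(rirs))]
    noise = noises[rng.integers(len(noises))]
    reverberant = np.convolve(clean, rir)[: len(clean)]
    snr_db = rng.uniform(*snr_db_range)
    # Scale the noise to hit the sampled SNR.
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise[: len(clean)] ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + scale * noise[: len(clean)]

# Toy inputs standing in for real recordings.
clean = rng.normal(size=16000)
rirs = [np.exp(-np.linspace(0, 8, 400))]   # decaying "room" tail
noises = [rng.normal(size=32000)]
noisy = augment(clean, rirs, noises)
```

    Drawing many such samples per clean utterance yields a multi-condition training set spanning configurations and environments never recorded directly.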

    Task design

    Speech data will be embedded in continuous audio backgrounds with natural context and continuity constraints. We will combine existing distant-talk/noise-robust ASR tasks, having different configurations and environments, based on the CHiME series [Barker et al., 2013] (including new six-channel CHiME-3 data), AMI [Hain et al., 2006], REVERB [Kinoshita et al., 2013], and ASpIRE databases. We will also prepare augmented training data based on the data augmentation techniques described in II. Additional data can be recorded if necessary, for example, using instrumented meeting rooms and mobile mic arrays.

    Software platform and outcome

    We will assemble a publicly available state-of-the-art ASR baseline connecting several SE techniques. The SE techniques will include conventional tools such as beamforming, dereverberation, and echo cancellation, and advanced tools such as non-negative matrix factorization, spectrum mask estimation, and RNN-based speech enhancement. ASR tools include Kaldi [Povey et al., 2011] for core training and decoding based on tandem bottleneck [Grezl et al., 2007] and DNN acoustic modeling with sequence training [Vesely et al., 2013]. The MSR Computational Network Toolkit (CNTK) [Yu et al., 2014] and Theano [Bergstra et al., 2010] can be used for novel architectures, including deep unfolding. The outcome of the project will include a far-field speech recognition toolkit and software for data augmentation.

    Team Members

    John Hershey, Mitsubishi Electric Research Laboratory
    Senior Members
    Jon Barker, Sheffield University
    Martin Karafiat, Brno University of Technology
    Michael Mandel, Ohio State University
    Shinji Watanabe, Mitsubishi Electric Research Laboratory
    Graduate Students
    Vijay Peddinti, Johns Hopkins University
    Pawel Swietojanski, Edinburgh University
    Karel Vesely, Brno University of Technology
  • Probabilistic Transcription of Languages with No Native-Language Transcribers

    Speech technology has the potential to provide database access, simultaneous translation, and text/voice messaging services to anybody, in any language, dramatically reducing linguistic barriers to economic success. To date, speech technology has failed to achieve its potential, because successful speech technology requires very large labeled corpora. Current methods require about 1000 hours of transcribed speech per language, transcribed at a cost of about 6000 hours of human labor; the human transcribers must be computer-literate, and they must be native speakers of the language being transcribed. In many languages, the idea of recruiting hundreds of computer-literate native speakers is impractical, sometimes even absurd. We propose to develop probabilistic transcription methods capable of generating speech training data in languages with no native-language transcribers. Specifically we propose a diversity coding scheme based on three transcription methods:

    (1) MISMATCHED ASR: Automatic speech recognizers pre-trained in languages other than the one being transcribed, sharing a global IPA phone set, are used to produce multiple parallel transcriptions.

    (2) MISMATCHED CROWDSOURCING: Human crowd workers who don't speak the target language are asked to transcribe it as if it were a sequence of nonsense syllables.

    (3) EEG DISTRIBUTION CODING: Humans who do not speak the language of interest are asked to listen to its extracted syllables, and their EEG responses are interpreted as a probability mass function over possible IPA phonetic transcriptions of the speech.

    Mismatched crowdsourcing is the process of asking people who don't speak a language to transcribe it, e.g., as a sequence of nonsense syllables. Mismatched crowdsourcing can be viewed as a kind of lossy communication channel: some of the information in the original signal has been systematically deleted by the untrained ears of the transcriber, but critically, some of the information is still available, even in the mismatched transcription. It is possible to estimate the channel substitution probabilities and, using the estimated probabilities, to find the most probable source message. Preliminary experiments demonstrate 96%-correct reconstruction of Hindi phoneme sequences based on nonsense syllable transcriptions by non-Hindi-speaking crowd workers.
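
    The noisy channel view can be made concrete with a toy decoder: given estimated per-phone confusion probabilities and a prior over source phones, choose the most probable source phone for each transcribed symbol. The three-phone inventory, the probabilities, and the position-independence assumption are all invented for illustration; the actual system estimates these quantities from data and decodes over full phone sequences with finite-state methods.

```python
import numpy as np

# Toy noisy-channel model over a 3-phone inventory (all numbers invented).
src_phones = ["a", "i", "u"]
prior = np.array([0.5, 0.3, 0.2])            # P(source phone)
# conf[s, h] = P(transcriber heard phone h | source phone s)
conf = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])

def decode(heard_seq):
    """Most probable source phone for each heard symbol, assuming
    independent positions: argmax_s P(s) * P(h | s)."""
    out = []
    for h in heard_seq:
        j = src_phones.index(h)
        posterior = prior * conf[:, j]
        out.append(src_phones[int(np.argmax(posterior))])
    return out

recovered = decode(["i", "a", "u"])
```

    Replacing the per-position argmax with shortest-path search over a composed FST generalizes this to substitution, insertion, and deletion channels.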

    EEG distribution coding is a proposed new method that interprets the pre-categorical electrical evoked potentials of untrained listeners (measured by an electro-encephalograph or EEG) as a posterior probability distribution over the phone set of the foreign language. Transcribers, in this scenario, are speakers of English whose EEG responses to English speech have been previously recorded. From their responses to English speech, an English-language EEG phone recognizer is trained. In order to transcribe non-English speech, the speech is played to these listeners. The vector of posterior probabilities p(English phone|EEG) is computed, for all English phones, and is interpreted as an index into possible non-English phone strings; e.g., an ambiguous posterior probability vector is interpreted as evidence of a phone that does not exist in English.

    In the workshop we propose to use active learning to select the waveforms whose mismatched crowdsourcing and/or EEG distribution coding would be most informative. Specifically, Mismatched ASR (automatic speech recognition using phone recognizers in a variety of languages not including the language of the utterance) can be used to generate information about the possible target-language phonetic transcription of the utterance. Noisy channel methods, similar to those of mismatched crowdsourcing (Fig. 2), can be used to decode the target-language phone string. Two models trained for perfect recall (G: the "general model") and perfect precision (S: the "specific model") can be compared, and waveforms demonstrating the greatest S-G difference are retranscribed using mismatched crowdsourcing and/or EEG distribution coding.
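
    A minimal sketch of the S-G selection criterion, under the simplifying (and here hypothetical) assumption that each model produces a set of candidate transcriptions per waveform and that disagreement is measured by the symmetric difference of those sets:

```python
def sg_difference(general_hyps, specific_hyps):
    """Size of the symmetric difference between the hypothesis sets of a
    high-recall model G and a high-precision model S for one waveform."""
    g, s = set(general_hyps), set(specific_hyps)
    return len(g ^ s)

def select_for_retranscription(utterances, k):
    """Rank waveforms by S-G disagreement; the top k would be sent for
    mismatched crowdsourcing and/or EEG distribution coding. `utterances`
    maps an utterance id to a (G hypotheses, S hypotheses) pair."""
    ranked = sorted(utterances, key=lambda u: -sg_difference(*utterances[u]))
    return ranked[:k]

# Toy example: utt2's models disagree more, so it is selected first.
utts = {"utt1": (["a b", "a p"], ["a b"]),
        "utt2": (["k a", "g a", "k o"], ["t a"])}
chosen = select_for_retranscription(utts, 1)
```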

    Workshop deliverables:

    (1) SOFTWARE: Extensions to OpenFST that permit estimation of foreign-language phone confusion probabilities that can be used, e.g., in a noisy channel model of mismatched crowdsourcing and/or mismatched ASR.

    (2) ALGORITHMS: Methods that transcribe speech from EEG.

    (3) SCIENCE: Significant scientific publications are expected, e.g., the first-ever study of EEG correlates of foreign-language phone classification.

    (4) MODELS: Transcribed speech and trained audio speech-to-text in 70 languages. Currently we plan to use the 70 languages whose podcasts are published by the Special Broadcasting Service of Australia; among those 70, we believe that transcribed speech and trained speech-to-text models do not currently exist in Armenian, Assyrian, Bangla, Bosnian, Bulgarian, Burmese, Croatian, Dinka, Estonian, Fijian, Gujarati, Hmong, Hungarian, Kannada, Khmer, Kurdish, Lao, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Maltese, Maori, Nepali, Punjabi, Samoan, Sinhalese, Slovak, Slovenian, Somali, Tigrinya, Tongan, or Ukrainian.

    (5) EVALUATION METRICS: trained audio speech-to-text will be evaluated in one of two ways, depending on whether or not we have access to at least one native speaker of the language being evaluated. If a native speaker is available, that native speaker will be asked to transcribe a small test corpus, on which we will compute word error rate. If not, speech-to-text will be evaluated based on its training criteria, viz., likelihood of the audio given the phone string and/or entropy of the phone string given the audio.
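
    When a native-speaker transcription is available, word error rate reduces to the standard dynamic-programming edit distance over words; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the reference length."""
    r, h = reference.split(), hypothesis.split()
    # Standard edit-distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat"))   # 0.0
print(wer("the cat sat", "a cat sat down"))
```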

    Team Members

    Mark Hasegawa-Johnson, University of Illinois
    Senior Members
    Preethi Jyothi, University of Illinois
    Edmund Lalor, Trinity College, Dublin
    Adrian KC Lee, University of Washington
    Majid Mirbagheri, University of Washington
    Graduate Students
    Amit Das, University of Illinois
    Giovanni di Liberto, Trinity College, Dublin
    Bradly Ekin, University of Washington
    Hao Tang, Toyota Technological Institute
  • Structured Computational Network Architectures for Robust ASR

    Despite the significant progress made in the past several years in automatic speech recognition (ASR), ASR performance under low-SNR, reverberant, and multi-speaker conditions is still far from satisfactory, especially when the testing and training conditions do not match, which is almost always the case in real-world situations.

    We perceive that, to solve the robustness problem, the desired system should have the following capabilities:

    • Classification -- because the goal of ASR is to classify the input audio signal to word sequences.
    • Adaptation -- so that the behavior of the system can automatically change under different (most likely mismatched) environments or conditions to achieve the best performance.
    • Prediction -- in order to provide information to guide the adaptation process and to reduce mismatch.
    • Generation -- so that the system can predict not only labels (e.g., speaker, phone) but also features, and can focus on specific parts of the signal, which is especially important when there are overlapping speakers.

    The behavior of prediction, adaptation, generation, and classification is widely observed in human speech recognition (HSR). For example, listeners may guess what a speaker will say next and wait to confirm their guess. They may adjust their listening effort by predicting the speaking rate and noise condition from the current information, or predict and adjust the letter-to-sound mapping based on the speaker's current pronunciation. They may even predict what the next pronunciation will sound like and focus their attention on only the relevant part of the audio signal.

    These capabilities are integrated in a single dynamic system in HSR. This dynamic system can be approximated at the functional level and described as a recurrent neural network (RNN) and implemented using the computational network toolkit (CNTK), which supports training and decoding of arbitrary RNNs.

    A general simplified description of the model is:

        p_t = g_p(x_{t-n+1}, ..., x_t, h_{t-1})
        h_t = g_h(x_t, p_t, h_{t-1})
        y_t = g_y(h_t)

    where x_t, p_t, h_t, and y_t are the noisy speech input, prediction result, hidden state, and output at time t, respectively; n is the window of frames the prediction is based on; and g_p, g_y, and g_h are the nonlinear functions that make the prediction, make the classification, and estimate the hidden state, respectively. The predicted information can include, but is not limited to, future phones, speaker code, noise code, device code, speaking rate, masks, and even the expected feature. The hidden state can carry over long- and short-term history information to the model.
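
    One time step of such a prediction-adaptation-classification recurrence can be sketched as follows; the dimensions and the simple tanh/softmax parameterizations are invented stand-ins for the actual network functions, not the workshop's model.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dh, dp, dy, n = 5, 16, 4, 10, 3  # feature/hidden/prediction/output dims, window

# Hypothetical weight matrices for the three nonlinear functions.
Wp = rng.normal(scale=0.1, size=(dp, n * dx + dh))   # prediction
Wh = rng.normal(scale=0.1, size=(dh, dx + dp + dh))  # hidden-state update
Wy = rng.normal(scale=0.1, size=(dy, dh))            # classification

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x_window, x_t, h_prev):
    """One time step: predict auxiliary info from a window of frames and
    the previous hidden state, update the hidden state, then classify."""
    p_t = np.tanh(Wp @ np.concatenate([x_window.ravel(), h_prev]))
    h_t = np.tanh(Wh @ np.concatenate([x_t, p_t, h_prev]))
    y_t = softmax(Wy @ h_t)
    return p_t, h_t, y_t

h = np.zeros(dh)
xs = rng.normal(size=(8, dx))          # toy frame sequence
for t in range(n, len(xs)):
    p, h, y = step(xs[t - n:t], xs[t], h)
```

    Training would backpropagate through both the prediction and the classification outputs, matching the multi-task objective described below.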

    The benefit of predicting and exploiting the auxiliary information can be shown using a simple example. Assume at training time we learn y = f(x), where f is the function learned, y is the label, and x is the input feature. At test time, x is corrupted and becomes x + c, where c is a constant channel distortion. As a result, f(x + c) ≠ y and we get degraded performance. However, if auxiliary information a, such as the noise level, channel distortion, and speaker id, is estimated and modeled during training, we instead learn y = g(x, a), where g is a function learned through training and into which we have put structure. At test time the auxiliary information is estimated, and the distortion of the feature can thus be compensated automatically, since g(x + c, a) = y when the estimated a captures the distortion c.
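
    This argument can be checked numerically. The linear forms of f and g below are an invented toy case, not the actual network:

```python
# Toy illustration: the label is a linear function of the clean feature.
w_true = 2.0
f = lambda x: w_true * x          # model learned on clean training data

x = 1.5                           # clean test feature; true label is 3.0
c = 0.4                           # unseen constant channel offset at test time
degraded = f(x + c)               # ~3.8, not 3.0: performance degrades

# A structured model that also receives the estimated auxiliary info a
# (here, the channel offset itself) can undo the distortion internally.
g = lambda x_obs, a: w_true * (x_obs - a)
compensated = g(x + c, c)         # ~3.0: distortion compensated
```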

    Fig. 1 is a concrete example of such a dynamic system. Conventional ASR systems contain only the classification (right) component and thus cannot adapt to new environments. Slightly more advanced techniques estimate some auxiliary information independently from the rest of the network (often requiring multiple passes over the utterance) and use it as side information to guide adaptation. In our proposed approach, shown in Fig. 1, the auxiliary information is predicted by an integrated prediction (left) component using past frames. The predicted auxiliary information helps to improve the classification component's accuracy, and the classification component's output in turn helps to boost the performance of the prediction component. The classification component's behavior thus adapts automatically based on the predicted information; at the same time, the prediction component's behavior adapts automatically based on the information fed back from the classification component.

    To train the model, we need labels for auxiliary information (many of which can be automatically generated) as well as the text transcription. The model will be trained to optimize multiple objectives (sometimes called multi-task learning), including both the final classification accuracy and the prediction accuracy.

    We have implemented and investigated a very preliminary version of the model, in which only the next phone is predicted and used as the auxiliary information, and the predicted information is simply concatenated with the raw input feature, using CNTK and the Argon decoder. Table 1 compares this preliminary model (PAC-RNN) with state-of-the-art systems. From the table we can observe that the proposed approach is very promising.

    To further advance this line of research, we need to design computational network architectures that can predict other information such as speaker code, noise level, speech mask (to track a specific speaker), and speaking rate. To exploit multiple sources of auxiliary information we also need to develop more advanced structured adaptation techniques, in which the network's parameters are jointly learned with different combinations of factors and with the rest of the system. We will evaluate our model on the meeting transcription or YouTube (if data are available) task, which features noisy far-field microphones, reverberation, and overlapping speech.

    We expect to develop novel structured computational network architectures for robust speech recognition and shape a new research direction in attacking low-SNR, noisy, and mismatched ASR. Besides the technology advancement, we will also contribute to open source software. More specifically, we will integrate Kaldi with CNTK so that Kaldi gains the ability to build arbitrary computational networks, including all existing architectures such as DNNs, RNNs, LSTMs, and CNNs, as well as models not yet conceived. This tool integration alone would help move the field forward significantly.

    Team Members

    Dong Yu, Microsoft Research
    Senior Members
    Liang Lu, Edinburgh University
    Khe Chai Sim, National University of Singapore
    Graduate Students
    Souvik Kundu, National University of Singapore
    Tian Tan, Shanghai Jiao Tong University

WS'14 Summer Workshop

Research Groups

  • ASR Machines That Know When They Do Not Know

    Goal: Development of ASR systems that can successfully deal with new, unexpected data ("systems that know when they do not know" or "getting rid of unknown unknowns")

    To constrain the problem and provide resources for team members with different ideas, the core problem is stated as: Given a classifier that yields a frame-based vector of posterior probabilities for speech sounds of interest, predict the accuracy of these estimates without knowing the correct probabilities on test data but knowing performance of the classifier on the training data.

    The main attack on this problem will be through multi-stream processing, where many parallel and partially redundant processing streams are derived from the information-bearing data. This approach should be effective in many practical situations where unexpected signal distortions negatively affect only some of the processing streams, while the remaining streams can still be used to extract the targeted information. The technique needs to be unsupervised, since the ground truth on the unknown data is not known, and fast, since new unexpected data must be dealt with promptly.
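
    A minimal sketch of this idea, using average posterior entropy as one simple unsupervised confidence proxy (an illustrative choice, not the workshop's committed criterion) to decide which streams to fuse:

```python
import numpy as np

rng = np.random.default_rng(0)

def stream_confidence(posteriors):
    """Unsupervised per-stream confidence: negative mean entropy of the
    frame posteriors (peaked posteriors imply higher confidence)."""
    p = np.clip(posteriors, 1e-12, 1.0)
    ent = -(p * np.log(p)).sum(axis=1)    # entropy per frame
    return -ent.mean()

def fuse(streams, k):
    """Average the posteriors of the k most confident streams, dropping
    the streams that the distortion appears to have corrupted."""
    ranked = sorted(streams, key=stream_confidence, reverse=True)
    return np.mean(ranked[:k], axis=0)

# Toy data: 3 streams of 20 frames x 10 classes; the first stream is
# "corrupted" (near-uniform posteriors), the other two are peaked.
T, C = 20, 10
flat = np.full((T, C), 1.0 / C)
peaked = rng.dirichlet(np.full(C, 0.1), size=T)
fused = fuse([flat, peaked, rng.dirichlet(np.full(C, 0.1), size=T)], k=2)
```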

    To date, research at JHU has produced band-limited artificial neural net based processing streams for recognition of noisy speech, and a couple of techniques for estimating classifier performance from the temporal dynamics of classifier outputs. JHU will provide its multistream experimental system with 31 processing streams based on independent artificial neural net classifiers. Initial results on recognition of noise-corrupted TIMIT have already been obtained and will serve as a baseline. We will also provide the true accuracies for all processing streams, which will serve as the ideal targets of our efforts.

    Team Members

    Hynek HermanskyJohns Hopkins University
    Senior Members
    Lukas BurgetBrno University of Technology
    Jordan CohenSpelamode Consulting
    Naomi FeldmanUniversity of Maryland
    Tetsuji OgawaWaseda University
    Richard RoseMcGill University
    Richard SternCarnegie Mellon University
    Graduate Students
    Matthew MaciejewskiCarnegie Mellon University
    Harish MallidiJohns Hopkins University
    Anjali MenonCarnegie Mellon University
    Vijayaditya PeddintiJohns Hopkins University
    Matthew WiesnerMcGill University
    Affiliate Members
    Eleanor ChodroffJohns Hopkins University
    Emmanuel DupouxLaboratoire de Sciences Cognitives et Psycholinguistique
    John GodfreyJohns Hopkins University
    Sanjeev KhudanpurJohns Hopkins University
  • Cross-Lingual Abstract Meaning Representations (CLAMR) for Machine Translation

    Goal: Explore the potential benefit of Abstract Meaning Representation to semantics-based statistical machine translation.

    This team will explore several facets of using Abstract Meaning Representations (AMRs): analyzing and generating them, matching parallel AMRs from source and target languages, graph learning of AMRs (GLAMR), and determining semantically equivalent AMRs that can provide a greater range of matching options. The team leverages several closely related research traditions, including the Czech Tectogrammatical approach, ISI's AMR prototyping, and longstanding syntactic and semantic modeling at Boulder, Brandeis, Rochester and elsewhere, all of which benefit from the availability of treebanks, PropBanks, and other richly annotated linguistic resources as represented by SemLink.

    Both Chinese/English and Czech/English corpora and test sets have been prepared for use in the summer and beyond. The team aims to reduce English bias from existing AMRs so that cross-linguistic AMRs can be as compatible as possible. Since they are graphs, traditional tree-matching approaches used for machine translation must be extended to graph matching, requiring new approaches to make the problem tractable. The team hopes to address the question of whether or not any graph matching obstacles can be overcome by generating alternative AMRs that are semantically equivalent. Team members are also interested in knowing whether or not insights gained this summer will provide concrete measurable improvements to either AMR parsing or AMR generation - both key steps in an AMR-based machine translation system.
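    To make the graph-matching problem concrete: AMRs are standardly compared by the overlap of their relation triples, as in the Smatch metric. The toy sketch below (our illustration, not workshop code) scores two AMRs as triple sets, assuming node variables are already aligned; full Smatch must additionally search over variable alignments, which is where the tractability questions arise:

```python
def amr_triple_f1(gold_triples, predicted_triples):
    """F1 over AMR relation triples of the form (node, relation, node_or_concept),
    assuming node variables are pre-aligned (full Smatch searches alignments)."""
    gold, pred = set(gold_triples), set(predicted_triples)
    if not gold or not pred:
        return 0.0
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)   # fraction of predicted triples that are correct
    recall = matched / len(gold)      # fraction of gold triples recovered
    return 2 * precision * recall / (precision + recall)
```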

    The team's investigations are organized along three named, intertwined threads:

    • GLAMR, or Graph Languages for AMRs, entails looking at AMR pairs to see what operations are needed to ensure accurate mappings and meaning-transfer from one language to another.
    • MATRIX, or Meaning in AMRs and Tectogrammatical Representation Interchange, entails reformatting AMRs automatically to semantically equivalent representations in search of better cross-linguistic matching.
    • PARSE entails automatically parsing English, Chinese and Czech sentences into AMRs.

    Team Members

    Martha PalmerUniversity of Colorado
    Senior Members
    Ondrej BojarCharles University in Prague
    David ChiangUniversity of Southern California
    Frank DrewesUmea University
    Daniel GildeaUniversity of Rochester
    Jan HajicCharles University in Prague
    Adam LopezJohns Hopkins University
    Giorgio SattaUniversity of Padua
    Zdenka UresovaCharles University in Prague
    Graduate Students
    Wei-Te ChenUniversity of Colorado
    Ondrej DusekCharles University in Prague
    Jeffrey FlaniganCarnegie Mellon University
    Tim O'GormanUniversity of Colorado
    Xiaochang PengUniversity of Rochester
    Martin PopelCharles University in Prague
    Aditya RenduchintalaJohns Hopkins University
    Naomi SaphraJohns Hopkins University
    Chuan WangBrandeis University
    Yuchen ZhangBrandeis University
    Affiliate Members
    Silvie CinkovaCharles University in Prague
    Sanjeev KhudanpurJohns Hopkins University
    James PustejovskyBrandeis University
    Roman SudarikovCharles University in Prague
  • Probabilistic Representations of Linguistic Meaning (PRELIM)

    This team will first undertake an open-ended and substantive deliberation of meaning representations for linguistic processing, and then focus on a pragmatic problem in semantic processing by machines.

    Goal 1: Deliberate upon representations of linguistic meaning. "Deep" natural-language understanding will eventually need more sophisticated semantic representations. What representations should we be using in 10 years? How will they relate to non-linguistic processing? How can we start to recover them from text or other linguistic resources?

    Linguists currently rely on modal logic as the foundation of semantics. However, semantics and knowledge representation must connect to reasoning and pragmatics, which are increasingly regarded by the AI and cognitive science communities as involving probabilistic inference and not just logical inference. Can we find a probabilistic foundation to integrate the whole enterprise? What is the role of probability distributions over semantic representations and within semantic representations?

    The team includes leaders from multiple communities -- linguistics, natural language processing, machine learning, and computational cognitive science. We hope to make progress toward an acceptable theory by integrating the constraints and formal techniques contributed by all of these communities.

    This week-long immersive exercise, which will take a broad perspective on meaning and its representation, is expected to inform the long-term thinking of all workshop participants, even as they pursue near-term practical uses of meaning representations.

    Goal 2: Explore semantic/proto-roles, from both a theoretical and an empirical perspective.

    This research is motivated by linguists such as David Dowty, who have considered the meta-question of which (if any) of the semantic role theories espoused in the literature are well founded. Instead of the traditional coarse labels, the team will create a binary feature-structure representation by collecting human responses to questions about proto-role properties (e.g., "does the subject of this verb have a causal role in the event?" or "does the object of this verb change location as a result of the event?").
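    Dowty's proposal can be sketched in toy form (the property names below are illustrative, not the team's actual annotation schema): represent each argument by the set of proto-role properties it exhibits, and predict that the argument scoring highest on proto-agent properties surfaces as the grammatical subject:

```python
# Illustrative proto-role property inventories, simplified from Dowty (1991)
PROTO_AGENT = {"volitional", "sentient", "causes_event", "moves_independently"}
PROTO_PATIENT = {"changes_state", "causally_affected", "stationary"}

def predicted_subject(arguments):
    """Simplified Dowty-style argument selection: the argument with the
    highest proto-agent minus proto-patient property count is predicted
    to be realized as subject.

    `arguments` maps argument names to sets of property labels."""
    def agent_score(props):
        return len(props & PROTO_AGENT) - len(props & PROTO_PATIENT)
    return max(arguments, key=lambda name: agent_score(arguments[name]))
```

    The classifier described below would predict exactly such binary properties from text, rather than assuming they are given.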

    From a computational perspective, the team will build a classifier for automatic prediction of these binary features, adapting recent work led by Van Durme on models for PropBank semantic role classification. They will also perform corpus-based studies of how these feature structures correlate with existing resources such as FrameNet and PropBank. PropBank is a precursor to the current work on AMR, which should lead to interesting discussions with the CLAMR team.

    Team Members

    Jason EisnerJohns Hopkins University
    Benjamin Van DurmeJohns Hopkins University
    Senior Members
    Oren EtzioniUniversity of Washington, Allen Institute
    Craig HarmanJohns Hopkins University
    Shalom LappinKing's College London
    Staffan LarssonUniversity of Gothenburg
    Dan LassiterStanford University
    Percy LiangStanford University
    David McAllesterToyota Technological Institute at Chicago
    James PustejovskyBrandeis University
    Kyle RawlinsJohns Hopkins University
    Graduate Students
    Nicholas AndrewsJohns Hopkins University
    Frank FerraroJohns Hopkins University
    Drew ReisingerJohns Hopkins University
    Darcey RileyJohns Hopkins University
    Rachel RudingerJohns Hopkins University

WS'13 Summer Workshop

Research Groups

  • Speaker and Language Recognition (June 22 - July 24)

    In the summer of 2013, CLSP hosted a 4-week workshop to explore new challenges in speaker and language recognition. A group of 16 international researchers came together to collaborate in a set of research areas described below. The workshop was motivated by the successful outcomes of the 2008 CLSP summer workshop and the BOSARIS workshops of 2010 and 2012.

    The workshop was sponsored by the individual funds of the participants and by Google Research.

    Research areas

  • Domain adaptation for speaker recognition
  • + Motivation: Advances in subspace modeling, specifically the i-vector approach, have demonstrated dramatic and consistent improvement in speaker recognition performance on the NIST speaker recognition evaluations over the past 4 years. However, these techniques are highly dependent on access to large amounts of labeled training data, from thousands of speakers each making tens of calls, to train the hyper-parameters (UBM, total-variability matrix, within- and between-class covariance matrices). The archive of past LDC data collections has provided such data for the NIST SREs and has been used effectively. However, it is highly unrealistic to expect such a large set of labeled data from matched conditions when applying a speaker recognition system to a new application. Thus there is a need to focus research efforts on how to use unlabeled data for adapting and applying i-vector speaker recognition systems.

    + Resources:

    - Domain Adaptation Challenge (DAC) description [pdf].

    - If you want to start from the audio (assuming you have the data), here are the lists of files for the DAC.

    - If you want to start from i-vectors.

    - Matlab script that shows how to use the i-vectors [example_DAC_cosine.m].

    - Running a Gaussian PLDA system (like this) on the i-vectors above produces the following results.
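    For readers without Matlab, a rough Python analogue of cosine scoring is sketched below (our own illustration, not the released script): i-vectors are rows of a NumPy array, and the centering/whitening statistics can be estimated from unlabeled new-domain i-vectors, which is the simplest form of the domain adaptation discussed above.

```python
import numpy as np

def estimate_whitening(ivectors):
    """Estimate a mean and whitening transform from (possibly unlabeled,
    possibly new-domain) i-vectors stacked as rows."""
    mean = ivectors.mean(axis=0)
    cov = np.cov(ivectors - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-10))) @ vecs.T
    return mean, W

def whiten(ivectors, mean, W):
    """Center and whiten i-vectors with the estimated statistics."""
    return (ivectors - mean) @ W

def cosine_scores(enroll, test):
    """Cosine similarity between length-normalized enrollment and test
    i-vectors; returns an (n_enroll x n_test) score matrix."""
    e = enroll / np.linalg.norm(enroll, axis=1, keepdims=True)
    t = test / np.linalg.norm(test, axis=1, keepdims=True)
    return e @ t.T
```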

    + List of publications:

    - Stephen Shum, Douglas Reynolds, Daniel Garcia-Romero, and Alan McCree, "Unsupervised Clustering Approaches for Domain Adaptation in Speaker Recognition Systems", Odyssey, 2014.

    - Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer, and Carlos Vaquero, "Unsupervised Domain Adaptation for I-Vector Speaker Recognition", Odyssey, 2014.

  • Unsupervised score calibration
  • + Motivation: When a speaker recognizer is deployed in a new environment, which may differ from previously seen environments w.r.t. factors like language, demographics, vocal effort, noise level, microphone, transmission channel, duration, etc., the behaviour of the scores may change. Although the scores can still be expected to discriminate between targets and non-targets in the new environment, score distributions could change between environments. If scores are to be used to make hard decisions, then we need to calibrate the scores for the appropriate environment. To date, most work on calibration has made use of supervised data. Here, we explore the problem of calibration where our only resource is a large database of completely unsupervised scores.
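    One generic way to approach this (a hedged sketch of the idea, not necessarily the method the team settled on) is to fit a two-component Gaussian mixture to the unlabeled scores with EM, interpret the higher-mean component as target trials, and read calibrated log-likelihood ratios off the fitted model:

```python
import numpy as np

def fit_two_gaussians(scores, iters=100):
    """EM for a 1-D two-component Gaussian mixture on unlabeled scores.
    The higher-mean component is later interpreted as target trials."""
    s = np.asarray(scores, float)
    mu = np.percentile(s, [25, 75]).astype(float)   # spread-apart init
    sigma = np.array([s.std(), s.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for each score
        ll = -0.5 * ((s[:, None] - mu) / sigma) ** 2 - np.log(sigma) + np.log(pi)
        ll -= ll.max(axis=1, keepdims=True)
        r = np.exp(ll)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and standard deviations
        n = r.sum(axis=0)
        pi = n / n.sum()
        mu = (r * s[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (s[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-9
    return pi, mu, sigma

def calibrated_llr(score, mu, sigma):
    """Target/non-target log-likelihood ratio under the fitted mixture,
    assuming targets score higher than non-targets."""
    tgt = int(np.argmax(mu))
    non = 1 - tgt
    def log_gauss(x, m, sd):
        return -0.5 * ((x - m) / sd) ** 2 - np.log(sd)
    return float(log_gauss(score, mu[tgt], sigma[tgt])
                 - log_gauss(score, mu[non], sigma[non]))
```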

    + List of publications:


  • Deep neural networks for language recognition
  • + Motivation: Deep neural networks (DNNs) have recently proved successful in challenging machine learning applications such as acoustic modelling, visual object recognition, and many others, especially when large amounts of training data are available. Motivated by those results, and by the discriminative nature of DNNs, which could complement the generative i-vector approach, we adapt DNNs to work at the acoustic frame level to perform language identification. In particular, in this work we build, explore, and experiment with several DNN configurations, and compare the results with several state-of-the-art i-vector based systems trained on the same acoustic features.
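    At test time a frame-level classifier must still produce one decision per utterance; a common recipe (our sketch, the cited paper's exact scoring may differ) is to average frame log-posteriors over the segment and pick the best language:

```python
import numpy as np

def segment_language_id(frame_posteriors):
    """Fuse frame-level DNN language posteriors (frames x languages)
    into a single segment-level decision by averaging log-posteriors
    over all frames and taking the argmax."""
    logp = np.log(np.clip(frame_posteriors, 1e-12, 1.0))
    return int(np.argmax(logp.mean(axis=0)))
```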

    + List of publications:

    - Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pedro Moreno, "Automatic Language Identification Using Deep Neural Networks", ICASSP, 2014.

  • JFA-based front ends for speaker recognition
  • + Motivation: Overcome some of the limitations of the i-vector representation of speech segments by exploiting Joint Factor Analysis (JFA) as an alternative feature extractor. The work addresses both text-independent and text-dependent speaker recognition.

    + List of publications:

    - Patrick Kenny, Themos Stafylakis, Jahangir Alam, Pierre Ouellet, "JFA-Based Front Ends for Speaker Recognition", ICASSP, 2014.

  • Vector Taylor Series (VTS) for i-vector extraction
  • + Motivation: i-vector speaker recognition systems achieve good performance in clean environments. The goal is to adapt the i-vector approach to noisy conditions, where system accuracy degrades. Our solutions are based on the vector Taylor series (VTS) and the unscented transform (UT). We have adopted the simplified VTS recently proposed by Yun Lei et al. (2013), and studied a new approach based on the UT that allows more accurate modelling of the nonlinearities. The latter is especially useful at very low SNRs.
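    The core of VTS can be illustrated in one log-spectral dimension (a toy sketch; the real model works on feature vectors and also includes a channel term): noisy speech y relates to clean speech x and noise n nonlinearly, and VTS replaces that nonlinearity with its first-order Taylor expansion around the model means:

```python
import numpy as np

def noisy_log_spectrum(x, n):
    """Exact (nonlinear) relation between clean speech x, noise n and
    noisy speech y in the log-spectral domain: y = x + log(1 + e^(n-x))."""
    return x + np.log1p(np.exp(n - x))

def vts_approx(x, n, x0, n0):
    """First-order Vector Taylor Series expansion of the relation above
    around the expansion point (x0, n0), e.g. the GMM component means."""
    y0 = noisy_log_spectrum(x0, n0)
    g = 1.0 / (1.0 + np.exp(x0 - n0))   # dy/dn at the expansion point
    # dy/dx at the expansion point is (1 - g)
    return y0 + (1.0 - g) * (x - x0) + g * (n - n0)
```

    Because the expansion is linear in x and n, Gaussian models of clean speech and noise stay Gaussian under it, which is what makes the adaptation tractable.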

    + List of publications:

    - David Martinez, Lukas Burget, Themos Stafylakis, Yun Lei, Patrick Kenny, and Eduardo Lleida, "Unscented Transform for iVector-Based Noisy Speaker Recognition", ICASSP, 2014.

    Team Members

    Affiliate Members
    Hagai AronowitzIBM, Israel
    Niko BrummerAGNITIO, South Africa
    Lukas BurgetBrno University of Technology, Czech Republic
    Sandro CumaniBrno University of Technology, Czech Republic
    Najim DehakMIT, USA
    Daniel Garcia-RomeroJohns Hopkins University, USA
    Javier Gonzalez DominguezUniversidad Autonoma de Madrid, Spain
    Patrick KennyCRIM, Canada
    Ignacio Lopez MorenoGoogle, USA
    David MartinezUniversidad de Zaragoza, Spain
    Oldrich PlchotBrno University of Technology, Czech Republic
    Themos StafylakisCRIM, Canada
    Albert SwartAGNITIO, South Africa
    Carlos VaqueroAGNITIO, Spain
    Karel VeselyBrno University of Technology, Czech Republic
    Jesus VillalbaUniversidad de Zaragoza, Spain

WS'12 Summer Workshop

Research Groups

  • Complementary Evaluation Measures for Speech Transcription

    The classical performance measure of a speech recognizer, a.k.a. speech transcriber, has been word-error rate (WER). This measure dates back to the days when ASR was regarded as a task in its own right, and the goal was simply to produce the same perfect transcriptions that humans can (in principle) generate from listening to speech. We know better now: it seems, in fact, very unlikely that anyone would be interested in reading as much as a single-page transcript of colloquial, spontaneous speech, even if it were transcribed perfectly. What people want is to search through speech, summarize speech, translate speech, etc. And computers can now store enough digitized audio to make these derivative tasks more direct.
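    For reference, WER is computed as (substitutions + deletions + insertions) divided by the number of reference words, via a minimum edit distance alignment between reference and hypothesis; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / N,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

    Note that WER can exceed 100% when the hypothesis contains many insertions, one of several ways the measure resists intuitive interpretation.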

    Underneath the hood of any one of these "real" tasks, however, is a speech recognizer, or at least some components of one, which generates word hypotheses that are numerically scored. Even very flawed transcripts are a valuable source of features on which to train spoken language processing applications, even if we would be too embarrassed to show them to anyone. How, then, do we evaluate the output of these components? According to recent HCI research, WER simply does not work. Dumouchel showed that manually correcting a transcript with more than 20% WER is actually harder than starting over from scratch, and yet Munteanu et al. showed that transcripts with WERs of as much as 25% are statistically significantly better than no transcript at all on a common lecture-browsing task for university students. Transcripts with WERs as bad as 46% have proven to be a useful source of features for speech summarization systems, at least according to the very flawed standards of current summarization evaluations; but it is also clear that those same standards often label poor summaries as very good, because of the lack of the higher-level organization or goal orientation that people expect from summaries, and the extent to which WER affects this remains unclear. The University of Toronto's computational linguistics lab, which specializes in HCI-style experimental design for spoken language interfaces, is currently conducting a large human-subject study of speech summarizers, in order to evaluate summary quality in a more ecologically valid fashion.

    With this experience at hand, this proposed workshop would focus on measures of transcript quality that are complementary to WER in the ASR- and summarization-related tasks of: (1) rapid message assessment, in which an accurate gist of a small spoken message must be formed in order to make a rapid decision, such as in military and emergency rescue contexts; and (2) decision-support for meetings, in which very interactive spoken negotiations between multiple participants must be distilled into a set of promises, deliverables and dates. Our intention is to bring our experience with human-subject experimentation on ASR applications together with recent advances in semantic distance measures as well as statistical parsing to formulate complementary objective functions to WER that can be computed without human-subject trials and employed to turn around better message-assessment and decision-support systems through periodic testing on development data.

    Recent work on alternative metrics for statistical machine translation could be cited as parallel to the present proposal, but there is an important distinction. The recent SMT metrics work has focussed on a principle that, in HCI research, is called _construct validity_. BLEU scores have no construct validity, it is claimed, because an SMT development team can "game" an evaluation by seeking better BLEU scores even if it adversely affects the true quality of the translation (whatever that is). Our proposal seeks to remedy a more fundamental problem with speech transcription evaluation that is sometimes called _ecological validity_. Stipulating for the moment that WER is the perfect construct for measuring success in speech transcription, the current value of ASR systems does not derive from their success at generating transcriptions at any level of quality. Not only do we have the wrong measure, but we're also measuring the wrong task. Rapid message assessment and decision-support for meetings are ecologically valid tasks. (In case you were wondering, there has been no serious consideration of how ecologically valid our evaluation of automatically translated documents is either, but the recent work on MT metrics has addressed more acutely perceived concerns about the construct invalidity of BLEU scores).

    The current format of JHU workshops makes it extremely difficult to conduct any sort of corroborative human-subject trial during the actual six-week workshop itself, so we anticipate that during the six-month period prior to the event, workshop participants would pool together speech corpora in the two above-mentioned domains together with a few state-of-the-art ASR systems in order to collect transcripts of varying WER rates. Again prior to the workshop, we would then conduct the human-subject trials necessary to establish an ecologically valid gold standard for human participation in these tasks that would serve, among other purposes, to differentiate transcripts of roughly the same WER that are not at all the same with respect to how well they enable the task at hand. The six-week period of the workshop would then be devoted to experimentation with objective functions based on NLP techniques, as well as improvement of an existing system in at least one of these two tasks, to demonstrate the benefit of this alternative evaluation scheme on an actual piece of spoken language technology.

    Team Members

    Senior Members
    Benoit FavreUniversité de Marseille
    Gerald PennUniversity of Toronto
    Stephen TratzDepartment of Defense
    Clare VossDepartment of Defense
    Graduate Students
    Siavash KazemianUniversity of Toronto
    Adam LeeCity University of New York
    Undergraduate Students
    Kyla CheungColumbia University
    Dennis OcheiDuke University
    Affiliate Members
    Yang LiuUniversity of Texas at Dallas
    Cosmin MunteanuNational Research Council Canada
    Ani NenkovaUniversity of Pennsylvania
    Frauke ZellerWilfrid Laurier University
  • Domain Adaptation in Statistical Machine Translation

    Introduction: Statistical machine translation (SMT) systems perform poorly when applied to new domains.  This degradation in quality can be as much as ⅓ of the original system's performance; the Figure below provides a small qualitative example, and illustrates that unknown words (copied verbatim) and incorrect translations are major sources of errors.  When parallel data is plentiful in a new domain, the primary challenge becomes that of scoring good translations higher than bad translations.  This is often accomplished using either mixture models that downweight the contribution of old-domain corpora, or subsampling techniques that attempt to force the translation model to pay more attention to new-domain-like sentences.  A more sophisticated approach recently demonstrated that phrase-level adaptation can perform better.  However, these approaches are still less sophisticated than state-of-the-art domain adaptation (DA) techniques from the machine learning community.  Such techniques have not been applied to SMT, likely due to the mismatch between SMT models and the classification setting that dominates the DA literature.  The Phrase Sense Disambiguation (PSD) approach to translation, which treats SMT lexical choice as a classification task, allows us to bridge this gap: classification-based DA techniques can be applied to PSD to improve translation scoring.  Unfortunately, this is not enough when only comparable data exists in the new domain.  Here, we face the additional challenge of identifying unseen words, as well as unknown word senses of seen words, and finding potential translations for these lexical entries. Once we have identified potential translations, we still need to score them, and the techniques we develop for the parallel-data case will apply directly.



    Old Domain

    Original German text: wenn das geschieht, würden die serben aus dem nordkosovo wahrscheinlich ihre eigene unabhängigkeit erklären.

    Human translation: if that happens, the serbs from north kosovo would probably declare their own independence.

    SMT output: if that happens, it is likely that the serbs of north kosovo would declare their own independence.

    New Domain (Medical)

    Original German text: darreichungsform : weißes pulver und klares , farbloses lösungsmittel zur herstellung einer injektionslösung

    Human translation: pharmaceutical form : white powder and clear , colourless solvent for solution for injection

    SMT output: darreichungsform : white powder and clear , pale solvents to establish a injektionslösung

    Figure: Output of an SMT system. The first example is from the system's old training domain, the second from an unseen new domain. Incorrect translations were highlighted in red in the original figure: the two German words left in the SMT output are unknown to the system, while the two incorrect English words are word sense errors.


    Goals:

    1. Understand domain divergence in parallel data and how it affects SMT models, through analysis of carefully defined test beds that will be released to the community.

    2. Design new SMT lexical choice models to improve translation quality across domains in two settings:

       a. When new domain parallel data is available, we will leverage existing machine learning algorithms to adapt PSD models, and explore a rich space of context features, including document-level context and morphological features.

       b. When we only have comparable data in the new domain, we will learn training examples for PSD by identifying new translations for new senses.

    Approach:  While BLEU scores suggest that SMT lexical choice is often incorrect outside of the training domain, we do not yet fully understand the sources of translation error for different domains, languages and data conditions. In a DA setting without new parallel data, we have identified unseen words and senses as the main sources of error in many new domains, by analyzing impacts on BLEU. We will conduct similar analyses for the setting with new parallel data.  We will also consider sources of error like word alignment or decoding. We will exploit parallel text to better understand differences between general and domain-specific phrase usage, and their impact on SMT.

    We can learn differences between general language terms, domain-specific terms, and domain-specific usages of general terms, by using their translations as a sense annotation. This is a complex task, since domain shifts are not the only cause of translation ambiguity. For instance, in English to French translation, "run" is usually translated in the computer domain as "éxécuter", and in the sports domain as "courir"; but other senses (such as "diriger", "to manage") can appear in many domains. Sense distinctions also depend on language pairs, which suggests that comparable data in the input language truly is necessary.  For example, consider the English words "virus" and "window."  When translating into French, regardless of whether one is in a general domain or a computer domain, they are translated the same way: as "virus" and "fenêtre", respectively.  However, when translating into Japanese, the domain matters.  In a general domain, they are respectively translated as "病原体" and "窓"; but in a computer domain they are transliterated.

    To build SMT systems that are adapted to a new domain, we first consider the setting with parallel data from the new domain.  A baseline translation approach we will leverage explicitly models the domain-specificity of phrase pair types to re-estimate translation probabilities. Rather than using static mixtures of old and new translation probabilities, this approach learns phrase-pair-specific mixture weights based on a combination of features reflecting the degree to which each old-domain phrase pair belongs to general language (e.g., frequencies, "centrality" of old model scores), and its similarity to the new domain (e.g., new model scores, OOV counts). By moving to a PSD translation model, we can attempt much more sophisticated adaptation, and better model the entire spectrum between general and domain-specific senses.  In PSD, based on training data extracted from word-aligned parallel data, a classifier scores each phrase pair in the lexicon, using evidence from the input-language context. Although there are certainly non-lexical effects of domain shift, we will focus on the lexicon, which is the most fruitful target given our past experience.

    With parallel data, our work will focus on adapting PSD to new domains in order to learn better scores for lexical selection. First, we will design adaptation algorithms for PSD, by applying existing learning techniques for DA.  Such approaches typically have two goals: (1) to reduce the reliance of the learned model on aspects that are specific to the old domain (and hence will be unavailable at test time), and (2) to use correlations between related old-domain examples and new-domain examples to "port" parameters learned on the old to the new domain. Such techniques can be directly applied to the PSD translation model, using large context as features. Second, we will determine what features are most important for this task.  We can limit ourselves to local contexts like in past work, or can use much larger contexts (the paragraph, or perhaps the entire document) to build better models. In addition, we will use morphological features to tackle the data sparsity issues that arise when dealing with small amounts of new domain data.
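    One concrete instance of the "existing learning techniques for DA" mentioned above is the feature-augmentation method of Daumé III (2007): each example's features are copied into shared, old-domain-only and new-domain-only blocks, so a linear classifier (such as a PSD model) can learn which feature weights transfer across domains and which are domain-specific. A minimal sketch (our illustration, not the team's code):

```python
import numpy as np

def augment(features, domain):
    """'Frustratingly easy' domain adaptation (Daume III, 2007):
    map a feature vector x to (x, x, 0) for old-domain examples and
    (x, 0, x) for new-domain ones.  A linear model over the augmented
    space learns one shared weight block plus one block per domain."""
    zeros = np.zeros_like(features)
    if domain == "old":
        return np.concatenate([features, features, zeros])
    return np.concatenate([features, zeros, features])
```

    Any off-the-shelf classifier can then be trained on the pooled, augmented examples from both domains with no further changes.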

    With only comparable text, we must spot phrases with new senses, identify their translations, and learn to score them.  We will attack the identification challenge using context-based language models (n-gram or topic models) to identify new usages.  For example, in the computer domain, one can observe that "window" still appears on the English side, but "窓" (the general domain word for "window") has disappeared in Japanese, indicating a potential new sense.  For identifying translations we will study dictionary mining or active learning.  The scoring problem can be addressed exactly as before.  While finding new senses and translations is a challenging problem even in a single domain, we believe that differences that might get lost in a single domain with plentiful data will be more apparent in an adaptation setting.
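    The "window" example suggests a simple first-pass detector (a crude sketch of one possible signal, not the team's final method): rank words by how much their smoothed relative frequency grows in the new-domain comparable text relative to the old domain:

```python
from collections import Counter
import math

def sense_shift_candidates(old_tokens, new_tokens, top_k=5):
    """Rank new-domain words by the log-ratio of their add-one-smoothed
    relative frequencies in new vs. old domain text -- a crude signal
    that a word may carry a new, domain-specific sense needing a new
    translation."""
    old_c, new_c = Counter(old_tokens), Counter(new_tokens)
    vocab = set(old_c) | set(new_c)
    n_old = sum(old_c.values()) + len(vocab)
    n_new = sum(new_c.values()) + len(vocab)
    def score(word):
        return math.log(((new_c[word] + 1) / n_new) / ((old_c[word] + 1) / n_old))
    return sorted(new_c, key=score, reverse=True)[:top_k]
```

    Candidates flagged this way would still need their translations mined and scored, as described above.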

    Evaluation: We will create standard experimental conditions for domain adaptation in SMT and make all resources available to the community. We will consider three very different domains with which we have past experience: medical texts, movie subtitles and scientific texts. We will focus on French-English data, since our team includes native speakers of these two languages.

    We will evaluate the performance of all adapted and non-adapted translation systems using standard automatic metrics of translation quality such as BLEU and Meteor. However, we strongly suspect that these generic metrics do not adequately capture the impact of adaptation on domain-specific vocabulary, and we will investigate how to evaluate domain-specific translation quality in a more directly interpretable way. We will study lexical choice accuracy (automatically checking whether a translation predicted by PSD using source context is correct) using gold standard annotations. We will evaluate extracting this knowledge by manually correcting automatic word-alignments and also by using terminology extraction techniques (e.g., finding translations of the keywords in scientific texts, etc).

    Organization: Before the workshop, we will collect and process all necessary data, train language models, topic models and baseline SMT and PSD systems. During the workshop, we will focus exclusively on data analysis, design and evaluation of new algorithms.

    Conclusion: Domain mismatch is a significant challenge for statistical machine translation. Our proposed work will elucidate this problem through careful data analysis, will provide test beds for future research, will close the gap between statistical domain adaptation and statistical machine translation, and will improve translation quality through novel methods for identifying new senses from comparable corpora.

    Team Members

    Senior Members
    Marine CarpuatNational Research Council Canada
    Hal Daumé IIIUniversity of Maryland
    Alexander FraserUniversity of Stuttgart
    Chris QuirkMicrosoft Research
    Graduate Students
    Fabienne BrauneUniversity of Stuttgart
    Ann CliftonSimon Fraser University
    Ann IrvineJohns Hopkins University
    Jagadeesh JagarlamudiUniversity of Maryland
    John MorganArmy Research Laboratory
    Majid RazmaraSimon Fraser University
    Ales TamchynaCharles University
    Undergraduate Students
    Katharine HenryUniversity of Chicago
    Rachel RudingerYale University
    Affiliate Members
    George FosterNational Research Council Canada
  • Towards a Detailed Understanding of Objects and Scenes in Natural Images

    Final Report

    As a human glances at an object, for example an apple, a building, or a rifle, he/she is immediately aware of many of the object's qualities. For instance, the apple may be red or green, the building exterior may be reflective (glass) or dull (concrete), and the rifle may be made of metal or plastic. These properties or attributes can be used to describe the objects (e.g., differentiating a green apple from a red one), to further qualify them (e.g., a plastic rifle is probably a toy), or to improve discrimination (e.g., an object in the shape of a cat but made of stone is probably not an animal, but a statue). By contrast, even the best systems for artificial vision have a much more limited understanding of objects and scenes. For instance, state-of-the-art object detectors model objects as distributions of simple features which capture blurred statistics of the objects' two-dimensional shape. Colour, material, texture, and most of the other object attributes are likely ignored in the process.

    The objective of this workshop is to develop novel methods to reliably extract a diverse set of attributes from images, and to use them to improve the accuracy, informativeness, and interpretability of object models. The goal is to combine advances in discrete-continuous optimisation, machine learning, and computer vision to significantly advance our understanding of visual attributes and produce new state-of-the-art methods for their extraction. Inspired by popular features, easy-to-use open-source software for the extraction of the new attributes will be released, with the goal of commoditising the use of attributes in computer vision.

    Due to their significance, visual attributes have become an increasingly popular topic of research. Nevertheless, results have so far been limited. For example, while most attributes are associated with a given object or object part, and therefore have local scope, some methods treat them as global image properties. As a consequence, such methods are more likely to deduce the presence of attributes from correlated objects (e.g., there is a car and hence metal) instead of detecting the attributes as such (e.g., this region looks chrome, indicating that there may be a car or another metallic object). Other methods roughly localise attributes by bounding boxes, for instance from regions obtained by detecting the object of interest first. In this case, attributes may be useful to qualify objects a posteriori, but are not an integral part of the object model during detection. Finally, all these methods use canned features and models, which are probably suboptimal for a large number of attributes, as these are visually subtle properties (for example, detecting chrome requires finding reflections). The work of this six-week workshop will focus on four areas: (i) identifying and systematising visual attributes, (ii) collecting annotated data for learning and evaluating attribute models, (iii) exploring novel learning and inference techniques to better extract a diverse set of attributes, and (iv) evaluating the new representations in canonical tasks such as object detection on international benchmark data. These areas are detailed next.

    Attributes. Attributes may refer to any of a large number of very different concepts, including colours, textures, materials, two- or three-dimensional shapes, object parts, and relations. While these attributes are often treated equally in terms of modelling and detection, they share little beyond being localised in an image. In preparation for the workshop, attributes will be roughly subdivided by expected modelling requirements, abstraction levels, and visualness, identifying prototypical attributes for each class. By focusing on these prototypes, most of the key issues in attribute extraction can be investigated while keeping the scope manageable for the relatively short time available.

    Data. High-quality data and annotations have often been instrumental in advances in computer vision. For example, the introduction of the PASCAL VOC challenge data significantly boosted the performance of object detectors. In preparation for the workshop, in a collaborative effort using Amazon's Mechanical Turk, the existing PASCAL VOC dataset will be extended with annotations for the selected attributes, including their localisation in images. By extending an established dataset, this effort can be expected to be useful to the computer vision community at large.

    Modeling, inference, and learning. The core of the workshop, spanning the six weeks, will be the development, learning, and evaluation of models for visual attributes. Modelling and inference will be based on novel ideas in discrete-continuous optimisation. The goal is to decompose efficiently an image into a set of regions characterised by semantically meaningful but visually subtle attributes. Unfortunately, the expressive power of standard segmentation methods such as Markov Random Fields (MRFs) is severely limited by their inability to capture the appearance of segments as a whole. MRFs, for example, simply add a smoothness prior to evidence that is otherwise pooled locally and independently at each image pixel. Therefore, these algorithms are usually considered adequate for recognizing stuff, i.e. homogeneous patterns which do not have a characteristic shape, such as grass, sky or wood, but cannot infer holistic models of the regions, which leaves out cues such as gradients, low-rank textures, repeated patterns, and shapes that can be essential in the recognition of certain attributes. For example, recognising chrome requires analysing the overall structure of a region to identify reflections, something that a standard MRF cannot achieve. Discriminating between different instances of the same attribute is also difficult: for example, in order to differentiate horizontal black-and-white stripes from vertical yellow-and-green ones, one would have to instantiate a new MRF label for each case, an approach that does not scale.
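
    To make the MRF limitation concrete, the standard pairwise energy can be sketched as follows. This is a minimal illustration with an invented 2x2 grid, labels, and costs; it shows that the smoothness (Potts) term only compares neighbouring labels pairwise and never examines a segment as a whole:

```python
import numpy as np

def mrf_energy(labels, unary, smoothness=1.0):
    """Pairwise MRF energy for image labelling: per-pixel unary costs
    plus a Potts smoothness penalty on 4-connected neighbours."""
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    # Unary term: cost of the chosen label at each pixel.
    energy = unary[rows, cols, labels].sum()
    # Pairwise Potts term: penalise label disagreements between neighbours.
    energy += smoothness * (labels[:, 1:] != labels[:, :-1]).sum()
    energy += smoothness * (labels[1:, :] != labels[:-1, :]).sum()
    return energy

# Toy 2x2 image with 2 labels: label 1 costs 1 everywhere, label 0 is free.
unary = np.zeros((2, 2, 2))
unary[:, :, 1] = 1.0
labels = np.array([[0, 0], [0, 1]])
# One pixel pays unary cost 1.0; two neighbour pairs disagree (cost 2.0).
print(mrf_energy(labels, unary))  # 3.0
```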

    While segmentation methods have been combined with holistic top-down models before, for example to propose possible object locations, refine object localization, or both, simultaneously segmenting the image and estimating non-trivial models for each segment has led to intractable energy functions, difficult to optimize even approximately. Nevertheless, there exist a few examples that carry out this program efficiently, at least in special cases. In particular, in the workshop powerful techniques inspired by the combination of MRF and sampling techniques such as RANSAC will be explored.

    In terms of preparatory work, the VLFeat library will be adopted as a simple-to-use toolkit for basic image processing and feature extraction. Additional software implementing the continuous-discrete optimisation techniques discussed above will be made available to the workshop participants. During the workshop, these techniques will be used to design, learn, and test new and more advanced attribute models. The synergy between the team members, each with particular expertise in various computer vision areas, is likely to be key to success.

    Evaluation. Throughout the workshop, attribute models will be evaluated in terms of retrieval performance on the annotated PASCAL VOC data. Starting from week four, attributes will be tested as an additional cue in learning object category models in PASCAL VOC. Extensible state-of-the-art software will be provided to the participants to make this a plug-and-play operation. Both the accuracy and interpretability of the new detectors will be evaluated (for example, we expect to learn automatically that cars are often chrome and have windows made of glass).

    Team Members

    Senior Members
    Matthew Blaschko, Ecole Centrale Paris
    Iasonas Kokkinos, Ecole Centrale Paris
    Subhransu Maji, Toyota Technological Institute at Chicago
    Esa Rahtu, University of Oulu, Finland
    Ben Taskar, University of Pennsylvania
    Andrea Vedaldi, University of Oxford
    Graduate Students
    Ross Girshick, University of Chicago
    Siddarth Mahendran, Johns Hopkins University
    Karen Simonyan, University of Oxford
    Undergraduate Students
    Sammy Mohamed, Stony Brook University
    Naomi Saphra, Carnegie Mellon University
    Affiliate Members
    Juho Kannala, University of Oulu, Finland
  • Zero Resource Speech Technologies and Models of Early Language Acquisition

    The unsupervised (zero resource) discovery of linguistic structure from speech is generating a lot of interest in two largely disjoint communities: the machine learning community is more and more interested in deploying language and speech technologies for languages and dialects with limited or no linguistic resources, while the cognitive science community (psycholinguists, linguists, neurolinguists) wants to understand the mechanisms by which infants spontaneously discover linguistic structure. The aim of this workshop is to bring together a team of researchers and graduate students from these two communities to engage in mutual presentations and discussion of current and future issues. Specifically, this workshop has two aims: 1) identifying key issues and problems of interest to both communities, and 2) setting up standardized, common resources for comparing the different approaches to solving these problems (databases, evaluation criteria, software, etc.). In this workshop, we will focus mainly but not exclusively on the discovery of two levels of linguistic structure: phonetic units and word-like units. We are well aware that the definition of these levels, as well as their segregation from the rest of the linguistic system, is itself a matter of debate, and we welcome discussion of these issues as well.

    Dates: Monday July 16 - Friday July 27

    July 16th (Krieger 205): Kick-Off Symposium, Day 1

    Morning 10:00a-12:30p

    10:00a: Welcome and Overview of objectives [video]

    10:30a: Aren Jansen (JHU/HLTCOE): Overview of Zero Resource Technology [pdf][video]

    11:30a: Dan Swingley (U. Penn): Overview of Early Language Acquisition [pdf][video]

    Afternoon 2:30-5:30p

    2:30p: Mark Johnson (Macquarie U.): Overview of Bayesian Approaches [pdf][video]

    3:30p: Bill Idsardi (U. Md): Clustering Techniques for Phonetic Categories and their Implications for Phonology [pdf][video]

    4:30p: Emmanuel Dupoux (Ecole Normale Superieure): Modeling Language Bootstrapping: Results & Challenges [video]

    July 17th (Krieger 205): Kick-Off Symposium, Day 2

    Morning 9:00a-12:30p

    9:00a: Naomi Feldman (U. Md): Using Bayesian Approaches to Study Human Sound and Word Learning [pdf][video]

    9:50a: Sharon Goldwater (U. Edinburgh): From Sounds to Words: Bayesian Modeling of Early Language Acquisition [pdf][video]

    10:50a: Sanjeev Khudanpur (JHU): Hybrid Dynamical System Models for Signal Segmentation and Labeling [video]

    11:40a: Aren Jansen (JHU/HLTCOE): Towards a Speaker Invariant Representation of Speech [pdf][video]

    Afternoon 2:00p-5:30p

    2:00p: Ian McGraw (MIT): Learning the Lexicon: A Pronunciation Mixture Model [pdf][video]

    2:50p: Rick Rose (McGill): Combining Low and High Resource Acoustic Modeling in Spoken Term Detection [pdf][video]

    3:50p: Hynek Hermansky (JHU): Dealing with Previously Unseen Unknowns in the Recognition of Speech [pdf][video]

    4:40p: Ken Church (IBM): OOVs, Pseudo-truth, and Zero Resource Methods [pdf][video]

    For select symposium abstracts, click here.

    July 18th (NEB 225): Organization: Data, metrics, subprojects/subteams

    July 19th-26th (NEB 225): Collaboration period, informal talks

    July 27th (Hackerman B17): Final Presentation at 2:00p [pdf] [video1] [video2]

    Informal Presentations

    Benjamin Börschinger: Particle Filtering for Word Segmentation [pdf1][pdf2]

    Mark Johnson: Bayesian Methods Tutorial [pdf]

    Hynek Hermansky: Acoustic Processing Tutorial [pdf]

    Shinji Watanabe: Integrated Bayesian Unsupervised Acoustic, Lexical, and Language Models [pdf]

    Pascal Clark: Rhythmic Demodulation for Zero-Resource Speech Recognition [pdf]

    Jason Eisner: Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model [pdf]

    Team Members

    Senior Members
    Ken Church, IBM Research
    Pascal Clark, Johns Hopkins University, HLTCOE
    Emmanuel Dupoux, Ecole Normale Superieure
    Naomi Feldman, University of Maryland
    Sharon Goldwater, University of Edinburgh
    Hynek Hermansky, Johns Hopkins University
    Aren Jansen, Johns Hopkins University, HLTCOE
    Mark Johnson, Macquarie University
    Sanjeev Khudanpur, Johns Hopkins University
    Ian McGraw, Massachusetts Institute of Technology
    Richard Rose, McGill University
    Graduate Students
    Erin Bennett, University of Maryland
    Benjamin Börschinger, Macquarie University
    Justin Chiu, Carnegie Mellon University
    Ewan Dunbar, University of Maryland
    Abdallah Fourtassi, Ecole Normale Superieure
    David Harwath, Massachusetts Institute of Technology
    Keith Levin, Johns Hopkins University
    Atta Norouzian, McGill University
    Vijay Peddinti, Johns Hopkins University
    Rachel Richardson, University of Maryland
    Thomas Schatz, Ecole Normale Superieure
    Yuriy Shames, Johns Hopkins University
    Samuel Thomas, Johns Hopkins University
    Affiliate Members
    Jason Eisner, Johns Hopkins University
    Steven Greenberg, Transparent Language
    Timothy (TJ) Hazen, MIT Lincoln Laboratory
    Florian Metze, Carnegie Mellon University
    Mike Seltzer, Microsoft Research
    Dan Swingley, University of Pennsylvania
    Balakrishnan Varadarajan, Google
    Shinji Watanabe, Mitsubishi Electric Research Laboratories

WS'11 Summer Workshop

Research Groups

  • An Exploration of How to Learn from Visually Descriptive Text

    This workshop will involve learning to identify visually descriptive text, parsing this text and extracting statistical models, and using these models to 1) learn how people describe the world and 2) build more relevant recognition systems in computer vision. It should be an exciting opportunity to deal with large-scale text and image data, be exposed to cutting-edge techniques in computer vision, and interactively develop new strategies on the boundary between NLP and computer vision. Specific types of work will include data collection, parsing, using Amazon's Mechanical Turk, building and using probabilistic models, and work on applications including image parsing, retrieval, and automatic sentence generation from images.

    Team Members

    Senior Members
    Alexander Berg, Stony Brook University
    Tamara Berg, Stony Brook University
    Hal Daumé III, University of Maryland
    Graduate Students
    Amit Goyal, University of Maryland
    Xufeng Han, Stony Brook University
    Margaret Mitchell, University of Aberdeen
    Karl Stratos, Columbia University
    Kota Yamaguchi, Stony Brook University
    Undergraduate Students
    Jesse Dodge, University of Washington
    Alyssa Mensch, Massachusetts Institute of Technology
    Affiliate Members
    Yejin Choi, Stony Brook University
    Julia Hockenmaier, University of Illinois at Urbana-Champaign
    Erik Learned-Miller, University of Massachusetts, Amherst
    Alan Qi, Purdue University
  • Confusion-based Statistical Language Modeling for Machine Translation and Speech Recognition

    How can we decide that one sentence is more likely in a language than another sentence, especially if those sentences have never been seen before in their entirety? Why would we want to? The answer to the second question is that many natural language applications -- machine translation, automatic speech recognition -- produce a multitude of possible sentences as the output (of translation or recognition), and the likelihood of the resulting sentences in the language is a key way to choose between them. New methods for answering the first question are the topic of this summer workshop project. For the same "true" output, the set of competing outputs ('confusions') depends on the application: for speech recognition, the confusions typically sound similar (such as 'their' and 'there'); in machine translation, the confusions depend on ambiguities that arise in the translation process for a particular language pair (different for, say, Chinese and German when translating into English). In this project, we will investigate techniques to automatically generate possible confusions for a particular task and learn statistical models of language from such confusions. These models can then be used to do a better job of choosing which of the alternative outputs of a particular system is best. This project is a chance to work on cutting-edge speech and natural language applications and get your hands dirty under the hood of state-of-the-art systems, while trying to make them better.
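
    As an illustration of the reranking setting (with a toy two-sentence corpus and a deliberately simple add-one-smoothed bigram model, not the confusion-based models the project will build), a language model can pick between 'their' and 'there' confusions like so:

```python
from collections import Counter

def train_bigram(corpus):
    """Add-one-smoothed bigram probabilities from a tiny corpus."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        vocab.update(toks)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    def prob(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))
    return prob

def rerank(confusions, prob):
    """Pick the candidate sentence the bigram language model scores highest."""
    def score(sent):
        toks = ["<s>"] + sent.split()
        p = 1.0
        for pair in zip(toks, toks[1:]):
            p *= prob(*pair)
        return p
    return max(confusions, key=score)

corpus = ["they went to their house", "the house is over there"]
prob = train_bigram(corpus)
best = rerank(["they went to there house", "they went to their house"], prob)
print(best)  # they went to their house
```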

    Final Presentation

    Team Members

    Senior Members
    Sanjeev Khudanpur, CLSP
    Chris Callison-Burch, CLSP
    Dan Bikel, Google
    Keith Hall, Google
    Philipp Koehn, University of Edinburgh
    Brian Roark, Oregon Health and Science University
    Kenji Sagae, University of Southern California
    Graduate Students
    Puyang Xu, CLSP
    Charley Chan, CLSP
    Yuan Cao, CLSP
    Eva Hasler, University of Edinburgh
    Maider Lehr, Oregon Health and Science University
    Emily Tucker, Oregon Health and Science University
    Undergraduate Students
    Nathan Glenn, Brigham Young University
    Darcey Riley, University of Rochester
    Affiliate Members
    Damianos Karakos, CLSP
    Adam Lopez, CLSP
    Zhifei Li, Google
    Matt Post, Johns Hopkins University
    Murat Saraclar, Boğaziçi University
    Izhak Shafran, Oregon Health and Science University
  • New Parameterization for Emotional Speech Synthesis

    This project's goal is to improve the quality of speech synthesis for spoken dialog systems and speech-to-speech translation systems. Instead of just producing natural sounding high quality speech output from raw text, we will investigate how to make the output speech be stylistically appropriate for the text. For example speech output for the sentence "There is an emergency, please leave the building, now" requires a different style of delivery from a sentences like "Are you hurt?". We will use both speech recorded from actors, and natural examples of emotional speech. Using new articulatory feature extraction techniques, and novel machine learning techniques we will build emotional speech synthesis voices and test them with both objective and subjective measures. This will also require developing new techniques for evaluating our results using crowdsourcing in an efficient way.

    Team Members

    Senior Members
    Alan Black, Carnegie Mellon University
    Tim Bunnell, University of Delaware
    Florian Metze, Carnegie Mellon University
    Kishore Prahallad, IIIT Hyderabad
    Stefan Steidl, ICSI at Berkeley
    Graduate Students
    Prasanna Kumar, Carnegie Mellon University
    Tim Polzehl, Technical University of Berlin
    Undergraduate Students
    Daniel Perry, University of California, Los Angeles
    Caroline Vaughn, Oberlin College
    Affiliate Members
    Eric Fosler-Lussier, Ohio State University
    Karen Livescu, Toyota Technological Institute at Chicago

WS'10 Summer Workshop

Research Groups

  • Localizing Objects and Actions in Videos with the Help of Accompanying Text

    Multimedia content is a growing focus of search and retrieval, personalization, categorization, and information extraction. Video analysis allows us to find both objects and actions in video, but recognition of a large variety of categories is very challenging. Any text accompanying the video, however, can be very good at describing objects and actions at a semantic level, and often outlines the salient information present in the video. Such textual descriptions are often available as closed captions, transcripts or program notes. In this inter-disciplinary project, we will combine natural language processing, computer vision and machine learning to investigate how the semantic information contained in textual sources can be leveraged to improve the detection of objects and complex actions in video. We will parse the text to obtain verb-object dependencies, use lexical knowledge-bases to identify words that describe these objects and actions, use web-wide image databases to get exemplars of the objects and actions, and build models that can detect where in the video the objects and actions are localized.
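
    The verb-object extraction step can be sketched as follows. The (head, relation, dependent) triple format and the "dobj" label are illustrative assumptions; the actual parser output format may differ:

```python
def verb_object_pairs(dependencies):
    """Collect (verb, direct object) pairs from dependency triples
    of the form (head, relation, dependent)."""
    return [(head, dep) for head, rel, dep in dependencies if rel == "dobj"]

# Toy parse of "the man throws the ball to the dog"
deps = [("throws", "nsubj", "man"),
        ("throws", "dobj", "ball"),
        ("throws", "prep_to", "dog")]
print(verb_object_pairs(deps))  # [('throws', 'ball')]
```

Pairs like ('throws', 'ball') would then be looked up in lexical knowledge bases and matched against exemplars mined from web-wide image databases.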


    Final Report

    Final Presentation | Video

    Team Members

    Senior Members
    Cornelia Fermueller, University of Maryland
    Jana Kosecka, George Mason University
    Jan Neumann, StreamSage/Comcast
    Evelyne Tzoukermann, StreamSage
    Graduate Students
    Rizwan Chaudhry, Johns Hopkins University
    Yi Li, University of Maryland
    Ben Sapp, University of Pennsylvania
    Gautam Singh, George Mason University
    Ching Lik Teo, University of Maryland
    Xiaodong Yu, University of Maryland
    Undergraduate Students
    Francis Ferraro, University of Rochester
    He He, Hong Kong Polytechnic University
    Ian Perera, University of Pennsylvania
    Affiliate Members
    Yiannis Aloimonos, University of Maryland
    Greg Hager, Johns Hopkins University
    Rene Vidal, Johns Hopkins University
  • Models of Synchronous Grammar Induction for SMT

    The last decade of research in Statistical Machine Translation (SMT) has seen rapid progress. The most successful methods have been based on synchronous context free grammars (SCFGs), which encode translational equivalences and license reordering between tokens in the source and target languages. Yet, while closely related language pairs can be translated with a high degree of precision now, the result for distant pairs is far from acceptable. In theory, however, the "right" SCFG is capable of handling most, if not all, structurally divergent language pairs. So we propose to focus on the crucial practical aspects of acquiring such SCFGs from bilingual text. We will take the pragmatic approach of starting with existing algorithms for inducing unlabelled SCFGs (e.g. the popular Hiero model), and then using state-of-the-art hierarchical non-parametric Bayesian methods to iteratively refine the syntactic constituents used in the translation rules of the grammar, hoping to approach, in an unsupervised manner, SCFGs learned from massive quantities of manually "tree-banked" parallel text.
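
    To illustrate what an SCFG rule encodes, consider an invented Hiero-style rule (not one induced by the proposed method): French negation, X -> <ne X1 pas, do not X1>, reorders the translated subphrase around the negation particles. Applying the target side with the linked nonterminal already translated looks like this:

```python
def apply_rule(tgt_pattern, fillers):
    """Substitute already-translated subphrases for the linked
    nonterminals X1, X2, ... on the target side of a synchronous rule."""
    out = []
    for tok in tgt_pattern:
        if tok.startswith("X") and tok[1:].isdigit():
            out.extend(fillers[int(tok[1:]) - 1])
        else:
            out.append(tok)
    return out

# Rule X -> <ne X1 pas, do not X1> applied with X1 translated as ['eat']:
# 'ne mange pas' -> 'do not eat'
print(apply_rule(["do", "not", "X1"], [["eat"]]))  # ['do', 'not', 'eat']
```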


    Final Presentation: First Session | Second Session | Video

    Team Members

    Senior Members
    Phil Blunsom, University of Oxford
    Trevor Cohn, University of Sheffield
    Chris Dyer, University of Maryland
    Jonathan Graehl, USC/ISI
    Adam Lopez, University of Edinburgh
    Graduate Students
    Ziyuan Wang, CLSP
    Jan Botha, University of Oxford
    Vladimir Eidelman, University of Maryland
    ThuyLinh Nguyen, Carnegie Mellon University
    Undergraduate Students
    Olivia Buzek, University of Maryland
    Desai Chen, Carnegie Mellon University
  • Speech Recognition with Segmental Conditional Random Fields

    The goal of this workshop group is to advance the state-of-the-art in core speech recognition by developing new kinds of features for use in a Segmental Conditional Random Field (SCRF). The recently proposed SCRF approach [Zweig & Nguyen, 2009] generalizes Conditional Random Fields to operate at the segment level, rather than at the traditional frame level. In this approach, every segment is labeled directly with a word. Then, features are extracted which each measure some form of consistency between the underlying audio and the word hypothesis for a segment. These are combined in a log-linear model to produce the posterior probability of a word sequence given the audio. Previous work has used features based on the detection of phoneme and multi-phone units in the audio input. For example, one feature is the edit distance between the observed phoneme sequence in a segment, and that expected based on the hypothesis. The log-linear model embodied by the SCRF has the key advantage of being able to combine numerous, possibly redundant features in a coherent way; thus we have a very convenient way of improving performance by adding large numbers of complementary features.
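
    A stripped-down sketch of the edit-distance feature inside a log-linear model (toy pronunciations and a single feature; the actual SCRF combines many such features with learned weights):

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def word_posteriors(observed, lexicon, weight=1.0):
    """Log-linear model with one feature: negative edit distance between
    the detected phoneme string and each word's pronunciation."""
    scores = {w: math.exp(-weight * edit_distance(observed, pron))
              for w, pron in lexicon.items()}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

# Toy lexicon: "cat" matches the detected phonemes exactly (distance 0),
# "cap" differs by one phoneme (distance 1), so "cat" gets more mass.
lexicon = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]}
post = word_posteriors(["k", "ae", "t"], lexicon)
```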

    The work being done in the workshop revolves around extracting new acoustic features that can leverage the segmental approach. Professor Van Compernolle and Dr. Demuynck from Leuven University in Belgium are extending previous work in template based ASR [Wachter et al. 2007, Demange & Van Compernolle 2009] to find highly informative features based on template matching. A second line of research revolves around the use of coherent modulation features [Clark & Atlas 2009], and is being explored by Prof. Les Atlas from the University of Washington, and his student Pascal Clark. Professor Fei Sha and his student Meihong Wang, from the University of Southern California, are studying the use of deep-learning-based features. Finally, Dr. Geoffrey Zweig and Dr. Patrick Nguyen from Microsoft Research are working on integrating these and other features into the SCARF toolkit for segmental CRF based speech recognition.


    Final Report

    Final Presentation | Video

    Team Members

    Senior Members
    Damianos Karakos, CLSP
    Les Atlas, University of Washington
    Kris Demuynck, University of Leuven
    Patrick Nguyen, Microsoft
    Fei Sha, University of Southern California
    Dirk van Compernolle, University of Leuven
    Geoffrey Zweig, Microsoft
    Graduate Students
    Samuel Thomas, CLSP
    Pascal Clark, University of Washington
    Gregory Sell, Stanford University
    Meihong Wang, University of Southern California
    Undergraduate Students
    Samuel Bowman, University of Chicago
    Justine Kao, Stanford University
    Affiliate Members
    Hynek Hermansky, CLSP

WS'09 Summer Workshop

Research Groups

  • Low Development Cost, High Quality Speech Recognition for New Languages and Domains

    The cost of developing speech-to-text systems for new languages and domains is dominated by the need to transcribe a large quantity of data. We aim to significantly reduce this cost.

    In the speaker identification community, limitations on the amount of enrollment data per speaker are dealt with by adapting a "Universal Background Model" (UBM) to the observations from a given speaker. Subspace-based techniques can be used in this process to reduce the number of speaker-specific parameters that must be trained from the enrollment data. One approach that has been successful in achieving this goal is factor analysis. We have recently performed experiments on speech recognition showing that this factor-analysis-based approach can beat state-of-the-art techniques. The improvements are particularly large when the amount of training data is small, e.g. a 20% relative improvement on a Maximum Likelihood trained, fully adapted Broadcast News system with 50 hours of training data. The smaller number of parameters of the UBM-based system means that less training data is needed. Another advantage of the UBM framework is that it allows natural parameter tying across domains and languages, which should further reduce the amount of training data needed when migrating to a new language. We anticipate particularly large reductions in WER when training on small amounts of language-specific data, e.g. a few hours.
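
    The adaptation idea can be sketched with relevance-MAP updates of the component means, as used in UBM-based speaker identification (a simplified sketch with invented numbers; the workshop's subspace and factor-analysis methods are more elaborate):

```python
import numpy as np

def map_adapt_means(ubm_means, data, resp, r=16.0):
    """Relevance-MAP update of GMM component means:
        new_mean_k = a_k * sample_mean_k + (1 - a_k) * ubm_mean_k,
    with a_k = n_k / (n_k + r), so components that see little
    adaptation data stay close to the universal background model."""
    n = resp.sum(axis=0)                      # soft counts per component
    first = resp.T @ data                     # first-order statistics
    sample_mean = first / np.maximum(n, 1e-10)[:, None]
    a = (n / (n + r))[:, None]
    return a * sample_mean + (1 - a) * ubm_means

ubm_means = np.array([[0.0], [5.0]])
data = np.array([[1.0], [1.0]])               # two frames near component 0
resp = np.array([[1.0, 0.0], [1.0, 0.0]])     # responsibilities (frames x comps)
adapted = map_adapt_means(ubm_means, data, resp)
# Component 0 shifts slightly toward 1.0; component 1, which received
# no data, keeps its UBM mean of 5.0.
```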

    The UBM-based framework for speech recognition is scientifically interesting as it represents a unification of speech recognition and speaker identification techniques. Speaker identification techniques, which were originally based on those used for speech recognition, have been extended in recent years by the Universal Background Model and the factor analysis approach. Our approach brings those ideas back into speech recognition, and we anticipate that the techniques developed may in turn improve speaker verification performance (however, that is not the focus of this workshop). The purpose of the workshop would be to bring top speech recognition and speaker identification researchers together to work on this technique, which straddles the two fields, and we would apply it to speech recognition for under-resourced languages; however, the techniques developed would have much wider applicability.

    Since a workable approach to applying UBMs to speech recognition has already been devised, the pre-workshop phase would be able to focus on preparing data, building baseline systems, and coding the existing UBM-based approach within an open-source framework based on the HTK toolkit for eventual release. During the workshop we can focus on optimizing and extending the techniques used in UBM-based modeling, studying cross-language effects, developing tools to reduce the labor of building a pronunciation dictionary, and packaging our setup for use by others.

    The approach we intend to pursue is of enormous scientific interest for both speech recognition and speaker identification, as it concerns the core modeling approach used in both communities. We will make the tools we develop available and easy to use even for non-experts, so our work should have direct benefits for those who need to build effective speech recognition systems, as well as serving research and educational purposes. Given our positive initial results, this will be valuable regardless of the outcome of our experiments during the workshop.

    Final Presentation
    Final Report

    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Lukas Burget, Brno University of Technology
    Nagendra Kumar Goel, Apptek Inc.
    Dan Povey, Microsoft
    Richard Rose, McGill University
    Graduate Students
    Samuel Thomas, CLSP
    Arnab Ghoshal, Johns Hopkins University
    Petr Schwarz, Brno University of Technology
    Undergraduate Students
    Mohit Agarwal, IIIT Allahabad
    Pinar Akyazi, Boğaziçi University
  • Parsing the Web: Large-Scale Syntactic Processing

    Scalable syntactic processing will underpin the sophisticated language technology needed for next generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing, because of its scale and variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed wiki data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks which require fine-grained syntax/semantics integration.

    The focus of the proposal will be the C&C parser [1], a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of only a few parsers currently available with the potential to produce detailed, accurate analyses at the scale we are considering.

    Weekly Update Slides: Week 1, Week 2, Week 3
    Final Presentation
    Final Report

    Team Members

    Senior Members
    Stephen Clark, University of Cambridge
    Ann Copestake, University of Cambridge
    James R. Curran, University of Sydney, Australia
    Graduate Students
    Byung Gyu Ahn, CLSP
    James Haggerty, University of Sydney
    Aurelie Herbelot, University of Cambridge
    Yue Zhang, University of Oxford
    Undergraduate Students
    Tim Dawborn, University of Sydney
    Jonathan Kummerfeld, University of Sydney
    Jessika Roesner, University of Texas at Austin
    Curt Van Wyk, Northwestern University
  • Unsupervised Acquisition of Lexical Knowledge from N-Grams

    The overall performance of machine-learned NLP systems is often ultimately determined by the size of the training data rather than the learning algorithms themselves [Banko and Brill 2001]. The web undoubtedly offers the largest textual data set. Previous research using the web as a corpus has mostly relied on search engines to obtain the frequency counts and/or contexts of given phrases [Lapata & Keller 2005]. Unfortunately, this is hopelessly inefficient when building large-scale lexical resources.

    We propose to build a system for acquiring lexical knowledge from n-gram counts of web data. Since multiple occurrences of the same string are collapsed into a single entry, the n-gram data is considerably smaller than the original text. Since most lexical learning algorithms collect data only from small windows of text anyway, the n-gram data can provide the statistics needed for these learning tasks in a much more compact and efficient fashion. N-gram counts may appear to be a rather impoverished data source; however, a surprisingly large variety of knowledge can be mined from them. For example, consider the referents of the pronoun 'his' in the following sentences:

    1. John needed his friends
    2. John needed his support
    3. John offered his support

    The fact that (1) and (3) have a different coreference relationship than (2) seems to hinge on a piece of 'world knowledge': one never needs one's own support (since one already has it). [Bergsma and Lin, 2006] showed that such seemingly 'deep' world knowledge can actually be obtained from shallow POS-tagged n-gram statistics.
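    The kind of statistic involved can be illustrated with a toy sketch (this is not the actual Bergsma and Lin system, and all counts below are invented): compare how often a "V his own N" pattern occurs relative to the plain "V his N" pattern. If people rarely say "needed his own support", the possessor in "John needed his support" is probably not John.

    ```python
    # Toy sketch of pronoun-coreference decisions from n-gram counts.
    # Hypothetical counts of POS-generalized patterns mined from web n-grams;
    # a high "his own" ratio supports a coreferent (subject-possessor) reading.
    ngram_counts = {
        ("needed", "his own", "friends"): 520,
        ("needed", "his", "friends"): 4100,
        ("needed", "his own", "support"): 3,   # one rarely needs one's own support
        ("needed", "his", "support"): 3900,
        ("offered", "his own", "support"): 410,
        ("offered", "his", "support"): 5200,
    }

    def coreference_score(verb, noun, smoothing=1.0):
        """Ratio of 'V his own N' to 'V his N' counts; a low ratio suggests
        the possessor is NOT the subject (non-coreferent reading)."""
        own = ngram_counts.get((verb, "his own", noun), 0) + smoothing
        plain = ngram_counts.get((verb, "his", noun), 0) + smoothing
        return own / plain

    def is_coreferent(verb, noun, threshold=0.01):
        return coreference_score(verb, noun) >= threshold

    print(is_coreferent("needed", "friends"))   # True
    print(is_coreferent("needed", "support"))   # False
    print(is_coreferent("offered", "support"))  # True
    ```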

    Final Report

    Team Members

    Senior Members
    Ken Church, Microsoft
    Heng Ji, CUNY
    Dekang Lin, Google
    Satoshi Sekine, New York University
    Graduate Students
    Kailash Patil, CLSP
    Shane Bergsma, University of Alberta
    Kapil Dalwani, Johns Hopkins University
    Sushant Narsale, Johns Hopkins University
    Emily Pitler, University of Pennsylvania
    Undergraduate Students
    Rachel Lathbury, University of Virginia
    Vikram Rao, Cornell University

WS'08 Summer Workshop

Research Groups

  • Multilingual Spoken Term Detection: Finding and Testing New Pronunciations

    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Jim Baker, Johns Hopkins University
    Martin Jansche, Google
    Michael Riley, Google
    Murat Saraclar, Bogazici University
    Abinav Sethy, IBM
    Richard Sproat, University of Illinois
    Patrick Wolfe, Harvard University
    Graduate Students
    Arnab Ghoshal, Johns Hopkins University
    Kristy Hollingshead, Oregon Health & Science University
    Christopher White, Johns Hopkins University
    Undergraduate Students
    Erica Cooper, Massachusetts Institute of Technology
    Affiliate Members
    Mona Diab, Columbia University
    Bhuvana Ramabhadran, IBM
  • Robust Speaker Recognition Over Varying Channels

    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Niko Brummer, Spescom DataVoice
    Patrick Kenny, Centre de Recherche en Informatique de Montreal
    Jason Pelecanos, IBM
    Douglas Reynolds, MIT Lincoln Labs
    Robbie Vogt, Queensland University of Technology
    Graduate Students
    Fabio Castaldo, Polytechnic University of Turin
    Najim Dehak, Ecole de Technologie Superieure
    Reda Dehak, EPITA
    Ondrej Glembek, Brno University of Technology
    Zahi Karam, Massachusetts Institute of Technology
    Undergraduate Students
    Ciprian Constantin Costin, The Alexandru Ioan Cuza University
    Valiantsina Hubeika, Brno University of Technology
    Elly (Hye Young) Na, George Mason University
    John Noecker Jr., Duquesne University
    Affiliate Members
    Sachin Kajarekar, SRI International
    Nicolas Scheffer, SRI International
  • Vocal Aging Explained by Vocal Tract Modeling

    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Peter Beyerlein, University of Applied Sciences Wildau
    Elmar Noeth, University of Erlangen
    Georg Stemmer, Siemens AG
    Graduate Students
    Puyang Xu, CLSP
    Andrew Cassidy, Johns Hopkins University
    Eva Lasarcyk, Saarland University
    Blaise Potard, LORIA
    Werner Spiegel, University of Erlangen
    Undergraduate Students
    Young Chol Song, Stony Brook University
    Varada Kolhatkar, University of Minnesota, Duluth
    Stephen Shum, University of California, Berkeley
    Affiliate Members
    Andreas Andreou, CLSP
    Nemala Sridhar Krishna, Johns Hopkins University

WS'07 Summer Workshop

Research Groups

  • Exploiting Lexical & Encyclopedic Resources For Entity Disambiguation

    Entity disambiguation is the problem of determining whether two mentions of entities refer to the same object: e.g., deciding whether the entity called "Jim Clark" in one document is the same as the entity called "Jim Clark" in another document. To do this accurately, it is necessary to extract from these documents descriptions of the entities that are as exhaustive and accurate as possible. This in turn requires 'tracking' these entities in each document - identifying all or most of their mentions - and collecting their properties, particularly those that best discriminate between individuals.

    The goal of the workshop is to further the state of the art in entity disambiguation by developing better techniques for tracking entities and for extracting their properties. A particular focus will be improving entity tracking by using lexical and encyclopedic knowledge extracted both from structured lexical databases and from semi-structured repositories such as Wikipedia. Lack of such knowledge is one of the main problems with current entity tracking methods, which typically cannot detect that 'the Packwood proposal' and 'the Packwood plan' in the following example refer to the same entity.

    • [The Packwood proposal] would reduce the tax depending on how long an asset was held. It also would create a new IRA that would shield from taxation the appreciation on investments made for a wide variety of purposes, including retirement, medical expenses, first-home purchases and tuition.
    • A White House spokesman said President Bush is "generally supportive" of [the Packwood plan].

    Methods to be used include text mining techniques (supervised and unsupervised) to extract object properties; better machine learning techniques to improve entity tracking (e.g., using tree kernels); methods for extracting knowledge from WordNet, semantic role labellers, and Wikipedia; and clustering methods for entity disambiguation.
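    The clustering step can be pictured with a minimal sketch (not the workshop's actual system; names, properties, and the similarity threshold are invented): represent each mention by its extracted properties and group mentions greedily by property overlap.

    ```python
    # Toy sketch: cluster entity mentions by Jaccard overlap of their
    # extracted properties, so mentions of the same individual group together.

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def cluster_mentions(mentions, threshold=0.2):
        """Greedy clustering: a mention joins the first cluster whose pooled
        properties overlap enough with its own, else it starts a new cluster."""
        clusters = []  # each cluster is a list of (name, properties) pairs
        for name, props in mentions:
            for cluster in clusters:
                pooled = set().union(*(p for _, p in cluster))
                if jaccard(props, pooled) >= threshold:
                    cluster.append((name, props))
                    break
            else:
                clusters.append([(name, props)])
        return clusters

    mentions = [
        ("Jim Clark", {"netscape", "founder", "silicon valley"}),
        ("Jim Clark", {"netscape", "entrepreneur"}),
        ("Jim Clark", {"formula one", "driver", "scotland"}),
    ]
    clusters = cluster_mentions(mentions)
    print(len(clusters))  # 2: the Netscape founder vs. the racing driver
    ```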

    Click here for technical details

    Team Members

    Senior Members
    Ron Artstein, University of Essex
    David Day, MITRE
    Jason Duncan, Department of Defense
    Alessandro Moschitti, University of Trento
    Massimo Poesio, University of Essex and University of Trento
    Xiaofeng Yang, Institute for Infocomm Research, Singapore
    Graduate Students
    Jason Smith, CLSP
    Robert Hall, University of Massachusetts
    Simone Ponzetto, EML Research
    Yannick Versley, University of Tubingen
    Michael Wick, University of Massachusetts
    Undergraduate Students
    Vladimir Eidelman, Columbia University
    Alan Jern, University of California Los Angeles
    Brett Shwom, New York University
    Affiliate Members
    Walter Daelmans, University of Antwerp
    Claudio Giuliano, FBK-IRST
    Janet Hitzeman, MITRE
    Veronique Hoste, University of Antwerp
    Emily Jamison, Ohio
    Mijail Kabadjov, Edinburgh University
    Gideon Mann, University of Massachusetts
    Sameer Pradhan, BBN
    Michael Strube, EML Research
  • Recovery from Model Inconsistency in Multilingual Speech Recognition

    Current ASR systems have difficulty handling unexpected words, which are typically replaced by acoustically acceptable, high-prior-probability words. Identifying the parts of the message where such a replacement could have happened may allow for corrective strategies.

    We aim to develop data-guided techniques that would yield unconstrained estimates of the posterior probabilities of sub-word classes employed in the stochastic model solely from the acoustic evidence, i.e., without the use of higher-level language constraints.

    These posterior probabilities could then be compared with the constrained estimates of posterior probabilities derived under the constraints implied by the underlying stochastic model. Parts of the message where a significant mismatch between these two probability distributions is found should be re-examined and corrective strategies applied.
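    The comparison can be sketched minimally as follows. This is an illustration, not the workshop's system: it assumes per-frame posteriors over a small phone inventory, uses symmetric KL divergence as one plausible mismatch measure, and all numbers and the threshold are invented.

    ```python
    # Toy sketch: flag frames where model-constrained and unconstrained
    # posterior estimates disagree, as candidates for corrective strategies.
    import math

    def sym_kl(p, q, eps=1e-10):
        """Symmetric KL divergence between two discrete distributions."""
        kl = lambda a, b: sum(ai * math.log((ai + eps) / (bi + eps))
                              for ai, bi in zip(a, b))
        return 0.5 * (kl(p, q) + kl(q, p))

    def flag_mismatch(unconstrained, constrained, threshold=1.0):
        """Return indices of frames where the two posterior streams disagree."""
        return [t for t, (p, q) in enumerate(zip(unconstrained, constrained))
                if sym_kl(p, q) > threshold]

    # Three frames over a 3-phone inventory; frame 1 disagrees strongly,
    # e.g. because the decoder forced a high-prior in-vocabulary word.
    unconstrained = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
    constrained   = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
    print(flag_mismatch(unconstrained, constrained))  # [1]
    ```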

    This may allow for development of systems that are able to indicate when they "do not know" and eventually may be able to "learn-as-you-go" in applications encountering new situations and new languages.

    During the 2007 Summer Workshop we intend to focus on the detection and description of out-of-vocabulary and mispronounced words in the six-language CallHome database. Additionally, in order to describe the suspect parts of the message, we will work on a language-independent recognizer of speech sounds that could be applied to the phonetic transcription of identified suspect parts of the recognized message.

    Click here for technical details.

    Team Members

    Senior Members
    Sanjeev Khudanpur, CLSP
    Hynek Hermansky, CLSP
    Lukas Burget, Brno University of Technology
    Chin-Hui Lee, Georgia Institute of Technology
    Haizhong Li, Institute for Infocomm Research
    Jon Nedel, Department of Defense
    Geoffrey Zweig, Microsoft
    Graduate Students
    Ariya Rastrow, CLSP
    Pavel Matejka, Brno University of Technology
    Petr Schwartz, Brno University of Technology
    Rong Tong, Nanyang Technological University
    Chris White, CLSP
    Undergraduate Students
    Mirko Hannemann, Magdeburg University, Germany
    Sally Isaacoff, University of Michigan
    Puneet Sahani, NSIT, Delhi University

WS'06 Summer Workshop

Research Groups

  • Articulatory Feature-based Speech Recognition

    Prevailing approaches to automatic speech recognition (hidden Markov models, finite-state transducers) are typically based on the assumption that a word can be represented as a single sequence of phonetic states. However, the production of a word involves the simultaneous motion of several articulators, such as the lips and tongue, which may move asynchronously and may not always reach their target positions. This may be more naturally and parsimoniously modeled using multiple streams of hidden states, each corresponding to an articulatory feature (AF). Recent theories of phonology support this idea, representing words using multiple streams of sub-phonetic features, which may be either directly related to the articulators or more abstract (e.g. manner and place). In addition, factoring the observation model of a recognizer into multiple factors, each corresponding to a different AF, may allow for savings in training data. Finally, such an approach can be naturally applied to audio-visual speech recognition, in which the asynchrony between articulators is particularly striking; and multilingual speech recognition, which may leverage the universality of some AFs across languages.

    This project will explore the large space of possible AF-based models for automatic speech recognition, on both audio-only and audio-visual tasks. While a good deal of previous work has investigated various components of such a recognizer, such as AF classifiers and AF-based pronunciation models, little effort has gone into building complete, fully AF-based recognizers. Our models will be represented as dynamic Bayesian networks. This is a natural framework for modeling processes with inherent factorization of the state space, and allows for investigation of a large variety of models using universal training and decoding algorithms.
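    The core modeling idea (multiple feature streams that may desynchronize, rather than one phone sequence) can be pictured with a toy sketch. This is an illustration only: the feature targets are hypothetical, and a real DBN would attach probabilities to these joint states rather than merely enumerate them.

    ```python
    # Toy sketch: a word as two articulatory-feature streams (manner, place)
    # whose positions may differ by at most one step, instead of a single
    # synchronous phone sequence. Enumerate the allowed joint states.
    from itertools import product

    def joint_states(manner_targets, place_targets, max_async=1):
        """All joint (manner, place) states whose stream indices differ by
        at most max_async positions."""
        return [(m, p) for (i, m), (j, p)
                in product(enumerate(manner_targets), enumerate(place_targets))
                if abs(i - j) <= max_async]

    # Hypothetical feature targets for the word "pan": /p/ /ae/ /n/
    manner = ["stop", "vowel", "nasal"]
    place = ["labial", "low", "alveolar"]
    states = joint_states(manner, place)
    # 7 joint states: the 3 synchronous ones plus 4 desynchronized ones.
    print(len(states))
    ```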

    Find Details about the plans and progress of this project here and here.

    Team Members

    Senior Members
    Nash Borges, DoD
    Ozgur Cetin, ICSI
    Mark Hasegawa-Johnson, UIUC
    Simon King, University of Edinburgh
    Karen Livescu, MIT
    Graduate Students
    Chris Bartels, University of Washington
    Art Kantor, UIUC
    Partha Lal, University of Edinburgh
    Lisa Yung, Johns Hopkins University
    Undergraduate Students
    Ari Bezman, Dartmouth
    Stephen Dawson-Haggerty, Harvard
    Bronwyn Woods, Swarthmore
  • Open Source Toolkit for Statistical Machine Translation

    The objective of this JHU Workshop is the development of novel methods for statistical machine translation that improve the state of the art, specifically factored translation models, and lattice-based decoding methods. As part of this workshop, we will implement these techniques and distribute them in an open source toolkit.

    We propose to extend phrase-based statistical machine translation models using a factored representation. Current statistical MT approaches represent each word simply as its textual form. A factored translation approach replaces this representation with a feature vector for each word derived from a variety of information sources. These features may be the surface form, lemma, stem, part-of-speech tag, morphological information, syntactic, semantic or automatically derived categories, etc. This representation is then used to construct statistical translation models that can be combined together to maximize translation quality.
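    The factored representation can be sketched minimally (factor values here are illustrative, and a real system would estimate full translation tables over these factors):

    ```python
    # Toy sketch: each word is a tuple of factors rather than a bare surface
    # string, so models can be estimated over any factor combination.
    from collections import namedtuple

    Word = namedtuple("Word", ["surface", "lemma", "pos"])

    sentence = [
        Word("houses", "house", "NNS"),
        Word("were", "be", "VBD"),
        Word("built", "build", "VBN"),
    ]

    # Translation models can then condition on, e.g., lemma+POS, which
    # generalizes better from sparse parallel data than surface forms alone:
    keys = [(w.lemma, w.pos) for w in sentence]
    print(keys[0])  # ('house', 'NNS')
    ```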

    We also propose to extend current MT decoding methods to process multiple, ambiguous hypotheses in the form of an input lattice. A lattice representation allows an MT system to arbitrate between multiple ambiguous hypotheses from upstream processing so that the best translation can be produced. During the workshop we will implement lattice decoding and run experiments with errorful ASR input. We will compare different lattice-based strategies against single-hypothesis input results.
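    Lattice decoding can be pictured with a minimal dynamic-programming sketch over a toy lattice. This is an illustration only: all words and scores are invented, and a real decoder would combine many more feature scores and handle reordering.

    ```python
    # Toy sketch: pick the best path through an ASR lattice by combining
    # the ASR arc score with a per-word translation-model score.
    import math

    # Lattice arcs: (from_state, to_state, word, ASR log-probability)
    arcs = [
        (0, 1, "recognize", math.log(0.6)),
        (0, 1, "wreck a nice", math.log(0.4)),
        (1, 2, "speech", math.log(0.7)),
        (1, 2, "beach", math.log(0.3)),
    ]
    # Hypothetical per-word translation-model log scores:
    tm = {"recognize": -0.1, "wreck a nice": -2.0, "speech": -0.2, "beach": -1.5}

    def best_path(arcs, start=0, final=2):
        """Viterbi over the lattice DAG (state ids are topologically ordered)."""
        best = {start: (0.0, [])}  # state -> (score, word sequence)
        for f, t, w, lp in sorted(arcs, key=lambda a: a[0]):
            if f in best:
                score = best[f][0] + lp + tm[w]
                if t not in best or score > best[t][0]:
                    best[t] = (score, best[f][1] + [w])
        return best[final][1]

    print(best_path(arcs))  # ['recognize', 'speech']
    ```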

    Find Details about the plans and progress of this project here and here.

    Team Members

    Senior Members
    Chris Callison-Burch, CLSP
    Nicola Bertoldi, ITC-IRST
    Marcello Federico, ITC-IRST
    Philipp Koehn, University of Edinburgh
    Wade Shen, Lincoln Labs
    Graduate Students
    Ondrej Bojar, Charles University
    Brooke Cowan, MIT
    Chris Dyer, University of Maryland
    Hieu Hoang, University of Edinburgh
    Richard Zens, Aachen University
    Undergraduate Students
    Alexandra Constantin, Williams College
    Evan Herbst, Cornell
    Christine Corbett Moran, MIT
  • WS06 Post Workshop Research

    At the conclusion of each workshop, the student participants are invited and encouraged to compete for funding to continue a research project at their home institution during the coming academic year. Their proposals are presented as part of the closing reports of the research teams. An independent panel of three or four experts is appointed as reviewers of the proposals. At the conclusion of the presentations, they convene to discuss the student proposals and make a recommendation to CLSP on which projects they feel should be funded. The winners are announced at the closing dinner of the workshop. Winners are requested to submit a formal proposal, including a financial proposal, to CLSP, upon their return to school. The proposals are reviewed by the CLSP director and awarded based on available funding.



    Lance Ramshaw (BBN)
    Andreas Stolcke (SRI)
    Mari Ostendorf (University of Washington)

    Evaluation Guidelines For Judges

WS'05 Summer Workshop

Research Groups

  • Parsing and Spoken Structural Event Detection

    Even though speech-recognition accuracy has improved significantly over the past 10 years, these systems do not currently generate/model structural information (meta-data) such as sentence boundaries (e.g., periods) or the form of a disfluency (e.g., in "I want [to go] * {I mean} meet with Fred", "to go" is an edit, which is signaled by an interruption point indicated as *, together with the edit term "I mean"). Automatic detection of these phenomena would simultaneously improve parsing accuracy and provide a mechanism for cleaning up transcriptions for the downstream text processing modules. Similarly, constraints imposed by text processing systems such as parsers can be used to assign certain types of meta-data for correct identification of disfluencies.

    The goal of this workshop is to investigate the enrichment of speech recognition output using parsing constraints and the improvement of parsing accuracy due to speech recognition enrichment. We will investigate the following questions: (1) How does the incorporation of syntactic knowledge affect sentence boundary and disfluency detection accuracy? (2) How does the availability of more accurate sentence boundaries and disfluency annotation affect parsing accuracy? This workshop project is interdisciplinary bringing together researchers from the speech recognition and natural language processing communities. The undergraduates on this project will be exposed to research that spans these two important areas, and will gain experience on approaches to interfacing between technologies in these two areas.

    Click here for technical details

    Team Members

    Senior Members
    Bonnie Dorr, University of Maryland
    John Hale, Michigan State University
    Mary Harper, Purdue University
    Brian Roark, Oregon Health and Sciences University
    Izhak Shafran, Johns Hopkins University
    Graduate Students
    Matt Lease, Brown University
    Yang Liu, ICSI
    Matt Snover, University of Maryland
    Lisa Yung, Johns Hopkins University
    Undergraduate Students
    Anna Krasnyanskaya, UCLA
    Robin Stewart, Williams
  • Parsing Arabic Dialects

    Problem Definition: The proposed project will tackle the problem of parsing Arabic dialects. Parsing is an important component in many advanced NLP systems, and has also been shown to be useful for language modeling for ASR. As is well known, Arabic exhibits diglossia, i.e., the coexistence of two forms of language, a high variety with standard orthography and sociopolitical clout which is not natively spoken by anyone (Modern Standard Arabic, MSA) and low varieties that are primarily spoken and lack writing standards (Arabic dialects). The dialects and MSA form a continuum of variation at the lexical, phonological, morphological, and syntactic levels.

    There are important resources currently available for MSA, with much ongoing NLP work; for example, there are several syntactic and semantic parsers for MSA. However, Arabic dialect resources and NLP research are still in their infancy. There are linguistic studies of Arabic dialectal syntax, but there is no language engineering work (such as computational grammars). There are no parallel written corpora between any of the dialects and any other language, including MSA. Thus, most of the techniques developed for parsing that exploit supervised (in the canonical sense) machine learning do not apply, since there is not sufficient annotated data to learn from. We would like to leverage existing resources and tools for MSA in order to parse Arabic dialects using both symbolic techniques and machine learning approaches.


    • General NLP research: We will investigate how to leverage available syntactic resources for families of resource-poor languages.
    • Tools: we will create standard tools, i.e. parsers with compatible tokenization and morphological analysis components, for the processing of Arabic (MSA and dialects). These can be used in applications such as dialect translation, information retrieval, information extraction from speech data, dialect transcription, language modeling for ASR, and semantic parsing of Arabic dialects.
    • Resources: we will create standards for the transcription of Arabic dialects, as well as grammars and small corpora and lexica.

    Click here for technical details

    Team Members

    Senior Members
    David Chiang, University of Maryland
    Mona Diab, Columbia University
    Nizar Habash, Columbia University
    Rebecca Hwa, University of Pittsburgh
    Owen Rambow, Columbia University
    Khalil Sima'an, University of Amsterdam
    Graduate Students
    Roger Levy, Stanford University
    Carol Nichols, University of Pittsburgh
    Undergraduate Students
    Vincent Lacey, Georgia Tech
    Safiullah Shareef, Johns Hopkins University
  • Statistical Machine Translation by Parsing
    Machine translation (MT) is more important than ever. The quality of MT output has increased substantially in recent years, due to more sophisticated utilization of statistical learning methods and objective evaluation methods. However, statistical MT (SMT) systems often generate "word salad," where the output may contain many correct words but in the wrong order, making it hard to understand. We propose to investigate a new approach to SMT that has models of word order at its core, in contrast to other syntax-based approaches. Models that integrate word order more directly promise to greatly improve the readability of translations. Our research will simultaneously focus on two language pairs -- English/French and English/Arabic -- thus demonstrating the generality of the approach. In addition to improved MT, goals of the workshop include training students to contribute to MT and NLP research for years to come, and a complete easy-to-use reference implementation for worldwide distribution.


    Click here for workshop results

    Team Members

    Senior Members
    Stephen Clark, Oxford University
    Keith Hall, Johns Hopkins University
    Mary Hearne, Dublin City University
    Dan Melamed, New York University
    Andy Way, Dublin City University
    Dekai Wu, Hong Kong University of Science and Technology
    Graduate Students
    Marine Carpuat, Hong Kong University of Science and Technology
    Markus Dreyer, Johns Hopkins University
    Declan Groves, Dublin City University
    Yihai Shen, Hong Kong University of Science and Technology
    Ben Wellington, New York University
    Undergraduate Students
    Andrea Burbank, Stanford University
    Pamela Fox, University of Southern California

WS'04 Summer Workshop

Research Groups

  • Dialectal Chinese Speech Recognition

    There are eight major dialectal regions in addition to Mandarin (Northern China) in China, including Wu (Southern Jiangsu, Zhejiang, and Shanghai), Yue (Guangdong, Hong Kong, Nanning Guangxi), Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan), Hakka (Meixian Guangdong, Hsin-chu Taiwan), Xiang (Hunan), Gan (Jiangxi), Hui (Anhui), and Jin (Shanxi). These dialects can be further divided into more than 40 sub-categories. Although the Chinese dialects share a written language and standard Chinese (Putonghua) is widely spoken in most regions, speech is still strongly influenced by the native dialects. This great linguistic diversity poses problems for automatic speech and language technology. Automatic speech recognition relies to a great extent on the consistent pronunciation and usage of words within a language. In Chinese, word usage, pronunciation, and syntax and grammar vary depending on the speaker's dialect. As a result, speech recognition systems constructed to process standard Chinese (Putonghua) perform poorly for the great majority of the population.

    The goal of our summer project is to develop a general framework to model phonetic, lexical, and pronunciation variability in dialectal Chinese automatic speech recognition tasks. The baseline system is a standard Chinese recognizer. The goal of our research is to find suitable methods that employ dialect-related knowledge and training data (in relatively small quantities) to modify the baseline system to obtain a dialectal Chinese recognizer for the specific dialect of interest. For practical reasons during the summer, we will focus on one specific dialect, for example the Wu dialect or the Chuan dialect. However the techniques we intend to develop should be broadly applicable.

    Our project will build on established ASR tools and systems developed for standard Chinese. In particular, our previous studies in pronunciation modeling have established baseline Mandarin ASR systems along with their component lexicons and language model collections. However, little previous work or resources are available to support research in Chinese dialect variation for ASR. Our pre-workshop will therefore focus on further infrastructure development:

    • Dialectal Lexicon Construction. We will establish an electronic dialect dictionary for the chosen dialect. The lexicon will be constructed to represent both standard and dialectal pronunciations.
    • Dialectal Chinese Database Collection. We will set up a dialectal Chinese speech database with canonical pinyin level and dialectal pinyin level transcriptions. The database could contain two parts: read speech and spontaneous speech. For the spontaneous speech part, the generalized initial/final (GIF) level transcription should also be included.
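    The dialectal lexicon described above might be sketched as a dictionary that stores both standard and dialectal pronunciations per word, so that a baseline recognizer's dictionary can be augmented with variants. The entries and variant pronunciations below are invented for illustration, not actual dialect data.

    ```python
    # Toy sketch of a dialect lexicon with standard (Putonghua) and
    # dialectal pronunciation variants for each word. Entries are invented.
    lexicon = {
        # word: {"standard": [pinyin syllables], "dialect": [variant pronunciations]}
        "上海": {"standard": ["shang4", "hai3"], "dialect": [["zang4", "he3"]]},
        "我们": {"standard": ["wo3", "men5"],  "dialect": [["a1", "la1"]]},
    }

    def pronunciations(word):
        """All pronunciation variants for a word: standard first, then dialectal."""
        entry = lexicon[word]
        return [entry["standard"]] + entry["dialect"]

    print(len(pronunciations("上海")))  # 2
    ```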

    Our effort at the workshop will be to employ these materials to develop ASR system components that can be adapted from standard Chinese to the chosen dialect. Emphasis will be placed on developing techniques that work robustly with relatively small (or even no) dialect data. Research will focus primarily on acoustic phenomena, rather than syntax or grammatical variation, which we intend to pursue after establishing baseline ASR experiments.

    Opening Day Presentation, July 6, 2004
    Progress Report, July 28, 2004
    Final Presentation, August 15, 2004
    Final Report

    Team Members

    Senior Members
    Liang Gu, IBM
    Dan Jurafsky, Stanford University
    Izhak Shafran, Johns Hopkins University
    Richard Sproat, University of Illinois
    Feng (Thomas) Zhang, Tsinghua University
    Graduate Students
    Jing Li, Tsinghua University
    Yi Su, Johns Hopkins University
    Stavros Tsakalidis, Johns Hopkins University
    Yanli Zhang, University of Illinois
    Haolang Zhou, Johns Hopkins University
    Undergraduate Students
    Philip Bramsen, MIT
    David Kirsch, Lehigh University
  • Joint Visual-Text Modeling

    There has been a renewed spurt of research activity in multimedia information retrieval. This may be partly attributed to the emergence of a NIST-sponsored video analysis track at TREC, coinciding with renewed interest from industry and government in developing techniques for mining multimedia data.

    Traditionally, multimedia retrieval has been viewed as a system-level combination of text- and speech-based retrieval techniques and image content-based retrieval techniques. It is our hypothesis that such system-level integration limits the exploitation of mutually informative and complementary cues in the different modalities. In addition, prevailing techniques for retrieval of images and speech differ vastly, and this further inhibits truly cohesive interaction between these systems for multimedia information retrieval. For instance, if the query words have been incorrectly recognized by the ASR system, then speech-based retrieval systems may fail to retrieve relevant video shots. Current systems back off to image content-based searches, and since image retrieval systems perform poorly at finding images related only by semantics, the overall performance of such late-fusion systems is poor. This situation is exacerbated in cross-language information retrieval, where there is an additional degradation in the ASR transcripts resulting from subsequent machine translation.

    We propose to investigate a unified approach to multimedia information retrieval. We proceed by discretizing the visual information in videos using blobs (Carson, 1997). Contrary to the simplistic representation incorrectly suggested by the nomenclature, blobs robustly capture key features of image regions or segments that are fundamental to object recognition: shape, texture, and color information. The discretization then permits us to view multimedia retrieval as a task of retrieving documents comprising these visual tokens and words. This represents a generalization of statistical text retrieval models in IR to statistical multimedia retrieval models. With joint visual-text modeling, we can better represent the relationships between words and the associated image cues. In cases where the speech transcript may be inaccurate, the visual part of the document can now be related to the query terms.
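    The generalization can be sketched minimally: treat a video shot as one document of mixed word and visual ("blob") tokens and rank by unigram query likelihood, in the spirit of Ponte and Croft (1998). This is an illustration only; the documents, tokens, and smoothing weight are invented.

    ```python
    # Toy sketch: query-likelihood retrieval over documents that mix
    # text tokens and discretized visual ("blob") tokens.
    import math
    from collections import Counter

    docs = {
        "shot1": ["plane", "takeoff", "blob_sky", "blob_metal", "blob_sky"],
        "shot2": ["beach", "waves", "blob_sand", "blob_water"],
    }

    def query_likelihood(query, doc_tokens, collection, lam=0.5):
        """Jelinek-Mercer smoothed unigram log-likelihood of the query
        (query terms must occur somewhere in the collection)."""
        d, c = Counter(doc_tokens), Counter(collection)
        dn, cn = sum(d.values()), sum(c.values())
        return sum(math.log(lam * d[t] / dn + (1 - lam) * c[t] / cn)
                   for t in query)

    collection = [t for toks in docs.values() for t in toks]
    # A query can mix text terms and visual tokens from an example image:
    query = ["plane", "blob_sky"]
    ranked = sorted(docs, key=lambda k: query_likelihood(query, docs[k], collection),
                    reverse=True)
    print(ranked[0])  # shot1
    ```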

    The infrastructure for visual tokenization will be developed prior to the workshop and will not be a focus of the workshop. During the workshop, the focus of the team will be on novel techniques for joint modeling of visual and text information. We will investigate a variety of techniques incorporating approaches suggested by Berger and Lafferty (1999), Ponte and Croft (1998), and Duygulu et al. (2002). In particular, methods that handle visual tokens in the same manner as word tokens will be investigated.

    All techniques will be evaluated using the NIST 2003 Video TREC benchmark-test corpus and queries. Information retrieval systems using the joint modeling approach will be compared with late fusion of unimodal systems of identical design (e.g. machine translation based retrieval systems, LSI based systems for both word tokens and visual tokens). We will investigate whether our newly proposed (joint) early fusion approach is indeed beneficial and compare its performance with a state-of-the-art Video TREC multimodal system.

    This workshop offers a unique opportunity to bring together experts from distinct disciplines. The team is diverse with members from industrial research and academic backgrounds and with multimedia processing and language modeling/machine translation expertise. This collaboration will result in a first demonstration of joint visual-text modeling for multimedia retrieval. The workshop will also allow graduate students to develop skills and new research directions, and introduce undergraduate students to cutting edge research in the area of multimedia search.

    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Sanjeev Khudanpur, CLSP
    Pinar Duygulu, Bilkent University, Turkey
    Giri Iyengar, IBM TJ Watson Research Center
    Dietrich Klakow, University of Saarland, Germany
    Harriet Nock, IBM TJ Watson Research Center
    R. Manmatha, University of Massachusetts
    Graduate Students
    Shaolei Feng, University of Massachusetts
    Pavel Ircing, University of West Bohemia
    Brock Pytlik, JHU
    Paola Virga, JHU
    Undergraduate Students
    Desislava Petkova, Mount Holyoke
    Matthew Krause, Georgetown
  • Landmark Based Speech Recognition

    We seek to bring together new ideas from linguistics (especially nonlinear phonology) with new ideas from artificial intelligence (especially graphical models and support vector machines) in order to better match human speech recognition performance. Specifically, we will focus on two aspects of human speech communication that are not well modeled by current ASR:

    (1) Asynchrony between manner of articulation (the type of a sound, e.g., stop, fricative, vowel) and place of articulation (the shape of the lips and tongue when a sound is produced, e.g., lip rounding, position of the tongue tip). Asynchrony will be modeled by creating a dictionary of dynamic Bayesian networks, one for each word in the English language, each designed to learn word-dependent synchronization between manner and place. Specifically, the pronunciation model will consist of a graphical model of each word, with separate hidden state streams representing the distinctive features of manner, place, and voicing. Arcs in the graphical model will explicitly represent learnable approximate synchronization relationships between the different distinctive feature tiers.

    (2) Extra attention to acoustic phonetic landmarks: consonant releases, consonant closures, and syllable nuclei. During the lip closure of a [p], there is no sound: in order to determine that the stop is a [p], a human listener must pay special attention to the 50 ms immediately before stop closure and immediately after release. Current ASR systems pay attention uniformly to the signal at all times. This workshop will develop discriminative classifiers (support vector machines) that detect and classify perceptually important acoustic phonetic landmarks. Methods will be developed to integrate the discriminant computation of the support vector machine with the generative probabilistic framework of the graphical models. Specifically, one of the methods we propose to test is a set of low-dimensional Gaussian likelihood functions, synchronized with the landmarks specified by a graphical model, and observing the discriminant output scores of the support vector machine.
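    The last idea (a Gaussian likelihood observing SVM discriminant scores) can be sketched in miniature. This is an illustration, not the proposed system: the Gaussian parameters and the score are invented, and a real system would attach such likelihoods to landmark states inside the graphical model.

    ```python
    # Toy sketch: score an SVM's discriminant output at a hypothesized
    # landmark with 1-D class-conditional Gaussian likelihoods, yielding a
    # log-likelihood ratio usable inside a generative model.
    import math

    def gaussian_loglik(x, mean, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

    # Hypothetical class-conditional Gaussians over SVM discriminant scores:
    landmark_models = {
        "stop_release": (1.2, 0.5),   # (mean, variance) when landmark present
        "background":   (-1.0, 1.0),
    }

    def rescore(svm_score):
        """Log-likelihood ratio: landmark present vs. background."""
        return (gaussian_loglik(svm_score, *landmark_models["stop_release"])
                - gaussian_loglik(svm_score, *landmark_models["background"]))

    print(rescore(1.5) > 0)   # positive evidence for the landmark
    print(rescore(-2.0) > 0)  # negative evidence
    ```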

    Evaluation experiments will use the proposed system to re-score lattices generated by a state-of-the-art speech recognizer on the Switchboard test corpus. The baseline for comparison will be the maximum-likelihood path through the lattice before rescoring, i.e., based on acoustic and language model scores computed by a state-of-the-art speech recognizer. The summer's effort will be considered a success if we are able to augment the acoustic model scores in a way that reduces the word error rate of the maximum-likelihood path after rescoring.
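    The rescoring setup can be sketched as a best-path search over a word lattice, with a weighted landmark term added to the baseline acoustic and language model scores (a toy illustration; the edge fields and weighting scheme are assumptions, not the workshop's actual implementation):

```python
def best_path(edges, start, end, weight=1.0):
    """Highest-scoring path through a word lattice (a DAG).

    edges: list of (src, dst, word, ac_score, lm_score, landmark_score)
    weight: scale on the rescoring (landmark) term; weight=0.0 gives
    the baseline maximum-likelihood path.
    Nodes are assumed topologically ordered by their integer ids.
    """
    best = {start: (0.0, [])}
    for src, dst, word, ac, lm, lmk in sorted(edges):
        if src not in best:
            continue
        score = best[src][0] + ac + lm + weight * lmk
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    return best[end]
```

    Comparing the path at weight=0.0 against the path with the landmark term enabled is exactly the before/after-rescoring comparison described above.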

    The proposed research builds on results published by the workshop participants. Juneja and Espy-Wilson have demonstrated that support vector machines trained to detect and classify acoustic phonetic landmarks achieve 80% correct recognition in TIMIT of the six English manner class categories, using a total of only 160 trainable parameters. For comparison, a manner class recognizer consisting of six HMMs, each with three 13-mixture states observing a 48-dimensional vector (total: 22716 parameters), achieves manner class recognition accuracy of 74%. Livescu and Glass have demonstrated improved phoneme recognition in TIMIT using a distinctive-feature-based pronunciation model consisting of five streams per word. Lattice rescoring has been demonstrated in an oracle experiment by Hasegawa-Johnson. Using Greenberg's transcriptions of the WS97 Switchboard sub-corpus, his experiment demonstrated that perfect knowledge of both manner and place distinctive features is sufficient for a 12% relative word error rate reduction in the maximum-likelihood path through a set of recognition lattices. The proposed effort will integrate these existing methods using new training scripts and lattice rescoring programs, and will test these ideas for improving word error rate.
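    The 22,716-parameter figure is consistent with one plausible accounting (an assumption on our part, not the authors' stated breakdown): diagonal-covariance Gaussians with a mean, a variance, and a weight per mixture component, plus one transition parameter per state:

```python
# One plausible accounting of the 22,716-parameter HMM baseline
# (an assumption, not the authors' stated breakdown):
hmms, states, mixtures, dim = 6, 3, 13, 48
per_mixture = dim + dim + 1          # diagonal mean, diagonal variance, weight
gaussian_params = hmms * states * mixtures * per_mixture   # 22698
transition_params = hmms * states    # one transition parameter per state
total = gaussian_params + transition_params
print(total)  # → 22716
```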

    Final Report

    closing remarks

    Landmark-Based Speech Recognition: Report of the Working Group, Mark Hasegawa-Johnson

    opening remarks

    Opening Day Presentation: July 6, 2004
    Status Report: July 21, 2004

    seminar information

    Knowledge Acquisition from a Large Number of Sources by Jim Baker
    Undergraduate Seminars


    Periodic vector toolkit: Apply SVMs and neural nets to HTK files
    Transcription corrections and re-alignment for WS96/WS97


    NAACL Laboratory on Landmark Classification: June 30, 2004

    preliminary activities

    Online Discussion, 11/2003-6/2004, Excerpts
    First Planning Meeting: Notes and Slides
    Second Planning Meeting: Notes

    Team Members

    Senior Members
    Jim Baker, Carnegie Mellon University
    Steve Greenberg, Berkeley
    Mark Hasegawa-Johnson, University of Illinois
    Katrin Kirchhoff, University of Washington
    Jennifer Muller, Department of Defense
    Kemal Sonmez, SRI
    Graduate Students
    Sarah Borys, University of Illinois
    Ken Chen, University of Illinois
    Amit Juneja, University of Maryland
    Karen Livescu, MIT
    Vidya Mohan, JHU
    Undergraduate Students
    Emily Coogan, University of Illinois
    Tianyu Wang, Georgia Tech

WS'03 Summer Workshop

Research Groups

  • Confidence Estimation for Natural Language Applications

    Significant progress has been made in natural language processing (NLP) technologies in recent years, but most still do not match human performance. Since many applications of these technologies require human-quality results, some form of manual intervention is necessary.
    The success of such applications therefore depends heavily on the extent to which errors can be automatically detected and signaled to a human user. In our project we will attempt to devise a generic method for NLP error detection by studying the problem of Confidence Estimation (CE) in NLP results within a Machine Learning (ML) framework.

    Although widely used in Automatic Speech Recognition (ASR) applications, this approach has not yet been extensively pursued in other areas of NLP. In ASR, error recovery is entirely based on confidence measures: results with a low level of confidence are rejected and the user is asked to repeat his or her statement. We argue that a large number of other NLP applications could benefit from such an approach. For instance, when post-editing MT output, a human translator could revise only those automatic translations that have a high probability of being wrong. Apart from improving user interactions, CE methods could also be used to improve the underlying technologies. For example, bootstrap learning could be based on outputs with a high confidence level, and NLP output re-scoring could depend on probabilities of correctness.

    Our basic approach will be to use a statistical Machine Learning (ML) framework to post-process NLP results: an additional ML layer will be trained to discriminate between correct and incorrect NLP results and compute a confidence measure (CM) that is an estimate of the probability of an output being correct. We will test this approach on a statistical MT application using a very strong baseline MT system. Specifically, we will start off with the same training corpus (Chinese-English data from recent NIST evaluations), and baseline system as the Syntax for Statistical Machine Translation team.
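    As a sketch of such an ML layer (a minimal logistic-regression confidence estimator, not the specific model the team will build; all names hypothetical), the following trains on labeled correct/incorrect outputs and returns an estimated probability of correctness:

```python
import math

def train_cm(X, y, epochs=500, lr=0.1):
    """Train a logistic-regression confidence layer on NLP outputs.

    X: list of feature vectors, one per system output
    y: 1 if the output was correct, 0 otherwise
    Returns weights, with the bias as the last component.
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                      # gradient of the log-loss
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def confidence(w, x):
    """Estimated probability that the output with features x is correct."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))
```

    In the post-editing scenario described above, outputs whose confidence falls below a threshold would be routed to the human translator.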

    During the workshop we will investigate a variety of confidence features and test their effects on the discriminative power of our CM using Receiver Operating Characteristic (ROC) curves. We will investigate features intended to capture the amount of overlap, or consensus, among the system's n-best translation hypotheses, features focusing on the reliability of estimates from the training corpus, ones intended to capture the inherent difficulty of the source sentence under translation, and those that exploit information from the base statistical MT system. Other themes for investigation include a comparison of different ML frameworks such as Neural Nets or Support Vector Machines, and a determination of the optimal granularity for confidence estimates (sentence-level, word-level, etc).
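    The ROC analysis of a confidence measure can be sketched as follows (a simplified version that assumes distinct scores; tied scores would need diagonal steps):

```python
def roc_points(scores, labels):
    """ROC curve points (FPR, TPR) from confidence scores and 0/1 labels."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for s, l in pairs:          # sweep the threshold from high to low
        if l:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

    A confidence measure with no discriminative power gives an area near 0.5; perfect separation of correct from incorrect outputs gives 1.0.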

    Two methods will be used to evaluate final results. First, we will perform a re-scoring experiment where the n-best translation alternatives output by the baseline system will be re-ordered according to their confidence estimates. The results will be measured using the standard automatic evaluation metric BLEU, and should be directly comparable to those obtained by the Syntax for Statistical Machine Translation team. We expect this to lead to many insights about the differences between our approach and theirs. Another method of evaluation will be to estimate the tradeoff between final translation quality and amount of human effort invested, in a simulated post-editing scenario.


    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    George Foster, University of Montreal
    Simona Gandrabur, University of Montreal
    Cyril Goutte, Xerox
    Graduate Students
    Erin Fitzgerald, JHU
    Lidia Mangu, IBM
    Alberto Sanchis, University of Valencia
    Nicola Ueffing, RWTH Aachen
    Undergraduate Students
    John Blatz, Princeton
    Alex Kulesza, Harvard
  • Semantic Analysis Over Sparse Data

    The aim of the task is to verify the feasibility of a machine learning-based semantic approach to the data sparseness problem that is encountered in many areas of natural language processing such as language modeling, text classification, question answering and information extraction.
    The suggested approach takes advantage of several technologies for supervised and unsupervised sense disambiguation that have been developed in the last decade and of several resources that have been made available.

    The task is motivated by the fact that current language processing models are considerably affected by sparseness of training data, and current solutions, like class-based approaches, do not elicit appropriate information: the semantic nature and linguistic expressiveness of automatically derived word classes is unclear. Many of these limitations originate from the fact that fine-grained automatic sense disambiguation is not applicable on a large scale.

    The workshop will develop a weakly supervised method for sense modeling (i.e. reduction of possible word senses in corpora according to their genre) and apply it to a huge corpus in order to coarsely sense-disambiguate it. This can be viewed as an incremental step towards fine-grained sense disambiguation. The created semantic repository as well as the developed techniques will be made available as resources for future work on language modeling, semantic acquisition for text extraction, question answering, summarization, and most other natural language processing tasks.


    Team Members

    Senior Members
    Roberto Basili, University of Rome
    Kalina Bontcheva, University of Sheffield
    Hamish Cunningham, University of Sheffield
    Louise Guthrie, University of Sheffield
    Fabio Zanzotto, University of Rome
    Graduate Students
    Jia Cui, JHU
    David Guthrie, University of Sheffield
    Jerry Liu, Columbia
    Klaus Macherey, University of Aachen
    Undergraduate Students
    Kristiyan Haralambiev, University of Sofia
    Cassia Martin, Harvard
    Affiliate Members
    Marco Cammisa, University of Rome
    Martin Holub, Charles University
  • Syntax for Statistical Machine Translation

    In recent evaluations of machine translation systems, statistical systems based on probabilistic models have outperformed classical approaches based on interpretation, transfer, and generation. Nonetheless, the output of statistical systems often contains obvious grammatical errors. This can be attributed to the fact that syntactic well-formedness is influenced only by local n-gram language models and simple alignment models. We aim to integrate syntactic structure into statistical models to address this problem. A very convenient and promising approach for this integration is the maximum entropy framework, which makes it possible to integrate many different knowledge sources into an overall model and to train the combination weights discriminatively. This approach will allow us to extend a baseline system easily by adding new feature functions.
    The workshop will start with a strong baseline -- the alignment template statistical machine translation system that obtained the best results in the 2002 DARPA MT evaluations. During the workshop, we will incrementally add new features representing syntactic knowledge that deal with specific problems of the underlying baseline. We want to investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions might test if a certain constituent occurs in the source and the target language parse tree. More sophisticated features will be derived from an alignment model where whole sub-trees in source and target can be aligned node by node. We also plan to investigate features based on projection of parse trees from one language onto strings of another, a useful technique when parses are available for only one of the two languages. We will extend previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic.
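    The maximum entropy combination described above amounts to a log-linear score over feature functions. A toy reranking sketch (the weights and feature names are illustrative only, not the system's actual features):

```python
def loglinear_score(weights, features):
    """Log-linear (maximum entropy) model: score = sum_i lambda_i * h_i."""
    return sum(weights[name] * value for name, value in features.items())

def rerank(weights, candidates):
    """Pick the candidate translation with the highest model score.

    candidates: list of (translation, feature_dict) pairs, where the
    feature dict mixes baseline scores (LM, TM) with new syntactic
    feature functions, e.g. a binary constituent-match feature.
    """
    return max(candidates, key=lambda c: loglinear_score(weights, c[1]))
```

    With the syntax weight at zero the model reduces to the baseline; training that weight discriminatively can flip the ranking toward a syntactically well-formed candidate.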

    We will work with the Chinese-English data from the recent evaluations, as large amounts of sentence-aligned training corpora, as well as multiple reference translations are available. This will also allow us to compare our results with the various systems participating in the evaluations. In addition, annotation is underway on a Chinese-English parallel tree-bank. We plan to evaluate the improvement of our system using both automatic metrics for comparison with reference translations (BLEU and NIST) as well as subjective evaluations of adequacy and fluency. We hope both to improve machine translation performance and advance the understanding of how linguistic representations can be integrated into statistical models of language.


    Team Members

    Senior Members
    Sanjeev Khudanpur, CLSP
    Daniel Gildea, University of Pennsylvania
    Franz Och, USC/ISI
    Anoop Sarkar, Simon Fraser University
    Kenji Yamada, Xerox
    Graduate Students
    Alexander Fraser, USC/ISI
    Shankar Kumar, JHU
    Libin Shen, University of Pennsylvania
    David Smith, JHU
    Undergraduate Students
    Katherine Eng, Stanford
    Viren Jain, University of Pennsylvania
    Jin Zhen, Mt. Holyoke

WS'02 Summer Workshop

Research Groups

  • Generation in the context of MT

    Let's imagine a system for translating a sentence from a foreign language (say Arabic) into your native language (say English). Such a system works as follows. It analyzes the foreign-language sentence to obtain a structural representation that captures its essence, i.e., "who did what to whom where." It then translates (or transfers) the actors, actions, etc. into words in your language while "copying over" the deeper relationships between them. Finally it synthesizes a syntactically well-formed sentence that conveys the essence of the original sentence.
    Each step in this process is a hard technical problem, to which the best known solutions are either not adequate for applications, or good enough only in narrow application domains, failing when applied to other domains. This summer, we will concentrate on improving one of these three steps, namely the synthesis (or generation).

    The target language for generation will be English, and the source languages to the MT system will be of a completely different type (Arabic and Czech). We will further assume that the transfer step produces a fairly deeply analyzed sentence structure. The incorporation of this deep analysis makes the whole approach very novel: so far no large-coverage translation system has tried to operate with such a structure, and the application to very diverse languages makes it an even more exciting enterprise!

    Within the generation process, we will focus on the structural (syntactic) part, assuming that a morphological generation module exists to complete the generation process, and will be added to the suite so as to be able to evaluate the final result, namely, the goodness of the plain English text coming out of the system. Statistical methods will be used throughout.

    A significant part of the workshop preparation will be devoted to assembling and running a simplified MT system from Arabic/Czech to English (up to the syntactic structure level), in order to have realistic training data for the workshop project. As a consequence, we will not only understand and solve the generation problem, but also learn the mechanics of an end-to-end MT system, preparing team members to work on other parts of the MT system in the future.


    Team Members

    Senior Members
    Jason Eisner, CLSP
    Bonnie Dorr, University of Maryland
    Jan Hajic, Charles University
    Gerald Penn, University of Toronto
    Dragomir Radev, University of Michigan
    Owen Rambow, University of Pennsylvania
    Graduate Students
    Martin Cmejrek, Charles University
    Yuan Ding, University of Pennsylvania
    Undergraduate Students
    Terry Koo, Stanford
    Kristen Parton, Stanford
  • Novel Speech Recognition Models for Arabic

    Previous research on large-vocabulary automatic speech recognition (ASR) has mainly concentrated on European and Asian languages. Other language groups have been explored to a lesser extent, for instance Semitic languages like Hebrew and Arabic. These languages possess certain characteristics which present problems for standard ASR systems. For example, their written representation does not contain most of the vowels present in the spoken form, which makes it difficult to utilize textual training data. Furthermore, they have a complex morphological structure, which is characterized not only by a high degree of affixation but also by the interleaving of vowel and consonant patterns (so-called "non-concatenative morphology"). This leads to a large number of possible word forms, which complicates the robust estimation of statistical language models.
    In this workshop group we aim to develop new modeling approaches to address these and related problems, and to apply them to the task of conversational Arabic speech recognition. We will develop and evaluate a multi-linear language model, which decomposes the task of predicting a given word form into predicting more basic morphological patterns and roots. Such a language model can be combined with a similarly decomposed acoustic model, which necessitates new decoding techniques based on modeling statistical dependencies between loosely coupled information streams. Since one pervading issue in language processing is the tradeoff between language-specific and language-independent methods, we will also pursue an alternative control approach which relies on the capabilities of existing, language-independent recognition technology. Under this approach no morphological analysis will be performed and all word forms will be treated as basic vocabulary units. Furthermore, acoustic model topologies will be used which specify short vowels as optional rather than obligatory elements, in order to facilitate the use of text documents as language model training data. Finally, we will investigate the possibility of using large, generally available text and audio sources to improve the accuracy of conversational Arabic speech recognition.
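    A minimal sketch of the multi-linear idea follows; the toy concatenative decomposition and the independence assumption are simplifications of the coupled-stream model described above (real Arabic morphology is non-concatenative, and all names here are hypothetical):

```python
def multilinear_prob(word, history, p_root, p_pattern, decompose):
    """P(word | history) factored into root and pattern predictions.

    decompose(word) -> (root, pattern); p_root and p_pattern are
    conditional models keyed by (unit, decomposed history).
    Assumes the two streams are independent given the history,
    a simplification of the coupled-stream model.
    """
    root, pattern = decompose(word)
    hist_roots = tuple(decompose(w)[0] for w in history)
    hist_pats = tuple(decompose(w)[1] for w in history)
    return (p_root.get((root, hist_roots), 1e-9)
            * p_pattern.get((pattern, hist_pats), 1e-9))
```

    Because roots and patterns each form far smaller inventories than full word forms, their conditional distributions can be estimated more robustly from limited data.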

    Visit original website

    Final report [PDF]

    Team Members

    Senior Members
    Jeff Bilmes, University of Washington
    John Henderson, MITRE
    Katrin Kirchhoff, University of Washington
    Pat Schone, DoD
    Rich Schwartz, BBN Technologies
    Graduate Students
    Sourin Das, JHU
    Gang Ji, University of Washington
    Mohamed Noamany, BBN Technologies
    Undergraduate Students
    Melissa Egan, Pomona College
    Feng He, Swarthmore College
  • SuperSID: Exploiting High-level Information for High-performance Speaker Recognition

    Identifying individuals based on their speech is an important component technology in many applications, be it automatically tagging speakers in the transcription of a board-room meeting (to track who said what), user verification for computer security, or picking out a known terrorist or narcotics trader among millions of ongoing satellite telephone calls.
    How do we recognize the voices of the people we know? Generally, we use multiple levels of speaker information conveyed in the speech signal. At the lowest level, we recognize a person based on the sound of his/her voice (e.g., low/high pitch, bass, nasality, etc.). But we also use other types of information in the speech signal to recognize a speaker, such as a unique laugh, particular phrase usage, or speed of speech among other things.

    Most current state-of-the-art automatic speaker recognition systems, however, use only the low-level sound information (specifically, very short-term features based on purely acoustic signals computed on 10-20 ms intervals of speech) and ignore higher-level information. While these systems have shown reasonably good performance, there is much more information in speech which can be used, potentially greatly improving accuracy and robustness.

    In this workshop we will look at how to augment the traditional signal-processing based speaker recognition systems with such higher-level knowledge sources. We will be exploring ways to define speaker-distinctive markers and create new classifiers that make use of these multi-layered knowledge sources. The team will be working on a corpus of recorded telephone conversations (Switchboard I and II corpora) that have been transcribed both by humans and by machine and have been augmented with a rich database of phonetic and prosodic features. A well-defined performance evaluation procedure will be used to measure progress and utility of newly developed techniques.


    Team Members

    Senior Members
    Walter Andrews, DoD
    Joe Campbell, MIT Lincoln Laboratory
    Jiri Navratil, IBM
    Barbara Peskin, ICSI
    Doug Reynolds, MIT Lincoln Laboratory
    Graduate Students
    Andre Adami, OGI
    Qin Jin, Carnegie Mellon University
    David Klusacek, Charles University
    Undergraduate Students
    Joy Abramson, York University
    Radu Mihaescu, Princeton University
  • Weakly Supervised Learning for Wide-Coverage Parsing

    Before a computer can try to understand or translate a human sentence, it must identify the phrases and diagram the grammatical relationships among them. This is called parsing.
    State-of-the-art parsers correctly guess over 90% of the phrases and relationships, but make some errors on nearly half the sentences analyzed. Many of these errors distort any subsequent automatic interpretation of the sentence.

    Much of the problem is that these parsers, which are statistical, are not "trained" on enough example parses to know about many of the millions of potentially related word pairs. Human labor can produce more examples, but still too few by orders of magnitude.

    In this project, we seek to achieve a quantum advance by automatically generating large volumes of novel training examples. We plan to bootstrap from up to 350 million words of raw newswire stories, using existing parsers to generate the new parses together with confidence measures.

    We will use a method called co-training, in which several reasonably good parsing algorithms collaborate to automatically identify one another's weaknesses (errors) and to correct them by supplying new example parses to one another. This accuracy-boosting technique has widespread application in other areas of machine learning, natural language processing and artificial intelligence.

    Numerous challenges must be faced: how do we parse 350 million words of text in less than a year (we have 6 weeks)? How to use partly incompatible parsers to train one another? Which machine learning techniques scale up best? What kind of grammars, probability models, and confidence measures work best? The project will involve a significant amount of programming, but the rewards should be high.
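    The co-training loop itself is simple to sketch; the train/predict-with-confidence interface below is hypothetical, standing in for the actual parsers:

```python
def cotrain(model_a, model_b, labeled, unlabeled, threshold, rounds=3):
    """Co-training sketch: two parsers label data for each other.

    model_a / model_b: objects with .train(examples) and
    .predict(x) -> (label, confidence); a hypothetical interface.
    Each round, every model adds the other's confident predictions
    to its own training pool.
    """
    pool_a, pool_b = list(labeled), list(labeled)
    for _ in range(rounds):
        model_a.train(pool_a)
        model_b.train(pool_b)
        remaining = []
        for x in unlabeled:
            la, ca = model_a.predict(x)
            lb, cb = model_b.predict(x)
            if ca >= threshold:
                pool_b.append((x, la))   # A teaches B
            elif cb >= threshold:
                pool_a.append((x, lb))   # B teaches A
            else:
                remaining.append(x)
        unlabeled = remaining
    return model_a, model_b
```

    The key design choice is the confidence threshold: set too low, the parsers teach each other their errors; set too high, almost no new training examples are generated.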


    Team Members

    Senior Members
    Rebecca Hwa, University of Maryland
    Miles Osborne, University of Edinburgh
    Anoop Sarkar, University of Pennsylvania
    Mark Steedman, University of Edinburgh
    Graduate Students
    Stephen Clark, University of Edinburgh
    Julia Hockenmaier, University of Edinburgh
    Paul Ruhlen, JHU
    Undergraduate Students
    Steven Baker, Cornell University
    Jeremiah Crim, JHU

WS'01 Summer Workshop

Research Groups

  • Automatic Summarization of Multiple (Multilingual) Documents

    Imagine the following situation. Your favorite search engine finds out that in addition to the 200 documents in English that match your query, another 200 are also relevant, but they are all in Chinese. Suppose also that your Chinese is not all that good.

    Imagine now that you have a sophisticated search engine that will actually automatically extract the most useful information from all the Chinese (as well as English) documents and summarize it for you in English so that you don't have to read the entire documents! The goal of this six-week project is to study this integration of cross-lingual information retrieval and subsequent multi-document summarization.

    To conduct a scientific and thorough study, we will build, in the months before the workshop, a parallel corpus of Chinese and English news stories consisting of:

    9000 pairs of news articles that are manual translations of one another
    30-50 queries in both languages
    Both manually and automatically translated document relevance judgments for the queries (i.e. an answer key of documents which are truly relevant to each query)
    Sentence relevance judgments for the top-ranked documents for each query (i.e. the sentences in a document which best summarize the contents of the entire document)

    In such an environment, a user can type in a query in English and obtain ranked translated summaries of potentially relevant documents in English as well as Chinese.

    We will investigate the quality of cross-language retrieval and summarization using a rank correlation metric, and correlate that evaluation technique with other evaluation methods such as precision/recall and relative utility.

    At the end of the summer, we will release a publicly available toolkit for cross-lingual summarization and evaluation. The toolkit will implement (at arbitrary compression rates) multiple summarization algorithms, such as position-based, TF*IDF, largest common subsequence, and keywords. The methods for evaluating the quality of the summaries will be both intrinsic (such as percent agreement, precision/recall, and relative utility) and extrinsic (document rank).
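    As one example of the algorithms to be included, a TF*IDF extractive summarizer at a given compression rate can be sketched as follows (a toy version; the toolkit's actual implementation will differ):

```python
import math
from collections import Counter

def tfidf_summary(sentences, compression=0.3):
    """Extractive summary: rank sentences by average TF*IDF of their words.

    Treats each sentence as a document for IDF purposes; keeps the top
    fraction given by the compression rate, in original order.
    """
    docs = [s.lower().split() for s in sentences]
    df = Counter(w for d in docs for w in set(d))   # document frequency
    n = len(docs)
    def score(d):
        tf = Counter(d)
        return sum(tf[w] * math.log(n / df[w]) for w in tf) / len(d)
    k = max(1, int(len(sentences) * compression))
    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    keep = sorted(ranked[:k])                       # restore document order
    return [sentences[i] for i in keep]
```

    Sentences made of words that are rare across the collection score highest, which is why distinctive, content-bearing sentences survive the compression.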


    Team Members

    Senior Members
    Wai Lam, Chinese University of Hong Kong
    Dragomir Radev, University of Michigan
    Horacio Saggion, University of Sheffield
    Simone Teufel, Columbia University
    Graduate Students
    Danyu Liu, Chinese University of Hong Kong
    Hong Qi, University of Michigan
    Undergraduate Students
    John Blitzer, Cornell University
    Arda Celebi, Bilkent University
  • Discriminatively Structured Dynamic Graphical Models for Speech Recognition

    In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) training, these methods assume a fixed statistical modeling structure, and then optimize only the associated numerical parameters (such as means, variances, and transition matrices). Such is also the state of typical structure learning and model selection procedures in statistics, where the goal is to determine the structure (edges and nodes) of a graphical model (and thereby the set of conditional independence statements) that best describes the data.

    In this project, we explore the novel and significantly different methodology of discriminative structure learning. Here, the fundamental dependency relationships between random variables in a probabilistic model are learned in a discriminative fashion, and are learned separately and in isolation from the numerical parameters. The resulting independence properties of the model might in fact be wrong with respect to the true model, but are made only for the sake of optimizing classification performance. In order to apply the principles of structural discriminability, we adopt the framework of graphical models, which allows an arbitrary set of random variables and their conditional independence relationships to be modeled at each time frame.

    We also present results using a new graphical modeling toolkit (GMTK). Using GMTK and discriminative structure-learning heuristics, the results presented herein indicate that significant gains result from discriminative structural analysis of both conventional MFCC and novel AM-FM features on the Aurora continuous digits task. Lastly, we present results using GMTK on several other tasks: an IBM audio-video corpus, preliminary results on the SPINE-1 data set using hidden noise variables, hidden articulatory modeling, and the use of interpolated language models represented by graphs within GMTK.


    Team Members

    Senior Members
    Jeff Bilmes, University of Washington
    Yigal Brandman, Phonetact, Inc.
    Kirk Jackson, DoD
    Tom Richardson, University of Washington
    Eric Sandness, SpeechWorks
    Geoff Zweig, IBM
    Graduate Students
    Karim Filali, University of Washington
    Karen Livescu, MIT
    Peng Xu, JHU
    Undergraduate Students
    Eva Holtz, Harvard
    Jeremiah Torres, Stanford

WS'00 Summer Workshop

Research Groups

  • Audio-Visual Speech Recognition

    It is well known that humans have the ability to lip-read: we combine audio and visual information in deciding what has been spoken, especially in noisy environments. A dramatic example is the so-called McGurk effect, where a spoken sound /ga/ is superimposed on the video of a person uttering /ba/. Most people perceive the speaker as uttering the sound /da/.
    We will strive to achieve automatic lip-reading by computers, i.e., to make computers recognize human speech even better than is now possible from the audio input alone, by using the video of the speaker's face. There are many difficult research problems on the way to succeeding in this task, e.g., tracking the speaker's head as it moves in the video frame, identifying the type of lip movement, guessing the spoken words independently from the video and the audio, and combining the information from the two signals to make a better guess of what was spoken. In the summer, we will focus on a specific problem: how best to combine the information from the audio and video signals.

    For example, using visual cues to decide whether a person said /ba/ rather than /ga/ can be easier than making the decision based on audio cues, which can sometimes be confusing. On the other hand, deciding between /ka/ and /ga/ is more reliably done from the audio than the video. Therefore our confidence in the audio-based and video-based hypotheses depends on the kinds of sounds being confused. We will invent and test algorithms for combining the automatic speech classification decisions based on the audio and visual stimuli, resulting in audio-visual speech recognition that significantly improves the traditional audio-only speech recognition performance.
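    A common way to realize such confidence-dependent combination is weighted log-likelihood fusion with stream exponents. The sketch below (toy scores; a single global weight rather than the class-dependent weighting discussed above) shows how the fused decision can differ from the audio-only one:

```python
def av_fuse(audio_loglik, video_loglik, audio_weight):
    """Weighted log-likelihood fusion of audio and video streams.

    audio_loglik / video_loglik: dict phone -> log-likelihood.
    audio_weight in [0, 1]: confidence in the audio stream (e.g.
    lowered in noise); the video stream gets the complement.
    Returns the phone with the highest fused score.
    """
    fused = {ph: audio_weight * audio_loglik[ph]
                 + (1.0 - audio_weight) * video_loglik[ph]
             for ph in audio_loglik}
    return max(fused, key=fused.get)
```

    With the weight at 1.0 the decision is audio-only; lowering it lets strong visual evidence (e.g. a clearly closed lip for /ba/) override a noisy audio score.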


    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Andreas Andreou, CLSP
    Juergen Luettin, IDIAP
    Iain Matthews, HCII, CMU
    Chalapathy Neti, IBM
    Gerasimos Potamianos, IBM
    Graduate Students
    Herve Glotin, ICP-Grenoble, France
    Undergraduate Students
    Azad Mashari, University of Toronto
    June Sison, UC Santa Cruz
  • Mandarin-English Information

    Our globally interconnected world increasingly demands technologies to support on-demand retrieval of relevant information in any medium and in any language. If we search the web for, say, the loss of life in an earthquake in Turkey, by entering keywords in English, the most relevant stories are likely to be in Turkish or even Greek. Furthermore, the latest information may be in the form of audio files of the evening's news. One would like to be able first to find such information and then to translate it to English. Finding such information is beyond the capabilities of most commercially available search engines; good automatic translation is even harder. In this project, we will extend the state-of-the-art for searching audio and on-line text in one language for a user who speaks another language.
    A very large corpus of concurrent Mandarin and English textual and spoken news stories is available for conducting such research. These textual and spoken documents in both languages will be automatically indexed; in case of spoken documents, this will involve automatic speech recognition. Given a query in either language, we will then investigate systems that retrieve relevant documents in both languages for the user. Such cross-lingual and cross-media (CLCM) information retrieval is a novel problem with many technical challenges. Several schemes for recognizing the audio, indexing the text, and for estimating translation models to match queries in one language with documents in another language will be investigated in the summer. Applications of this research include audio and video browsing, spoken document retrieval, automated routing of information, and automatically alerting the user when special events occur.
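    One of the schemes alluded to, matching queries in one language against documents in another via an estimated translation model, can be sketched as follows (the probabilities and term names are illustrative, not drawn from the actual corpus):

```python
import math
from collections import Counter

def clir_score(query_terms, doc_terms, trans_prob, smoothing=1e-4):
    """Rank a document for a query posed in another language.

    trans_prob[(q, t)]: probability that document term t translates
    to query term q, from an estimated translation model.
    Scores by the log of the translation-smoothed unigram likelihood.
    """
    tf = Counter(doc_terms)
    total = len(doc_terms)
    score = 0.0
    for q in query_terms:
        # P(q | doc) = sum over doc terms of P(q | t) * P(t | doc)
        p = sum(trans_prob.get((q, t), 0.0) * tf[t] / total for t in tf)
        score += math.log(p + smoothing)
    return score
```

    Documents containing likely translations of the query terms score higher, even though query and document share no surface vocabulary.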

    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Sanjeev KhudanpurCLSP
    Erika GramsAdvanced Analytic Tools
    Gina-Anne LevowUniversity of Maryland
    Helen MengCUHK
    Douglas OardUniversity of Maryland
    Patrick SchoneDepartment of Defense
    Hsin-Min WangAcademia Sinica, Taiwan
    Graduate Students
    Berlin ChenAcademia Sinica, Taiwan
    Wai-Kit LoCUHK
    Jianqiang WangUniversity of Maryland
    Undergraduate Students
    Karen TangPrinceton University
  • Pronunciation Modeling of Mandarin Casual Speech

    When people speak casually in daily life, they are not consistent in their pronunciation. In listening to such casual speech, it is quite common to find many different pronunciations of individual words. Current automatic speech recognition systems can reach word accuracies above 90% when evaluated on carefully produced standard speech, but in recognizing casual, unplanned speech, performance drops to 75% or even lower. There are many reasons for this. In casual speech, one phoneme can shift to another. In Mandarin, for example, the initial /sh/ in "wo shi" ("I am") is often pronounced weakly and shifts into an /r/. In other cases, sounds are dropped. In Mandarin, phonemes such as /b/, /p/, /d/, /t/, and /k/ are often reduced and as a result are often recognized as silence. These problems are especially severe in Mandarin casual speech, since most Chinese are non-native Mandarin speakers. Chinese languages such as Cantonese are as different from standard Mandarin as French is from English. As a result, there is even larger pronunciation variation due to the influence of speakers' native languages.
    We propose to study and model such pronunciation differences in casual speech using interviews found in Mandarin news broadcasts. We hope to include experienced researchers from both China and the US in the areas of pronunciation modeling, Mandarin speech recognition, and Chinese phonology.

    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    William ByrneCLSP/JHU
    Pascale FungHKUST
    Terri KammDepartment of Defense
    Tom ZhengTsinghua University
    Graduate Students
    Zhanjiang SongTsinghua University
    Veera VenkatramaniCLSP/JHU
    Liu YiHKUST
    Undergraduate Students
    Umar RuhiUniversity of Toronto
  • Reading Comprehension

    Building a computer system that can acquire information by reading texts has been a longstanding goal of computer science. Consider designing a computer system that can take the following third-grade reading comprehension exam.
    How Maple Syrup is Made
    Maple syrup comes from sugar maple trees. At one time, maple syrup was used to make sugar. This is why the tree is called a "sugar" maple tree. Sugar maple trees make sap. Farmers collect the sap. The best time to collect sap is in February and March. The nights must be cold and the days warm. The farmer drills a few small holes in each tree. He puts a spout in each hole. Then he hangs a bucket on the end of each spout. The bucket has a cover to keep rain and snow out. The sap drips into the bucket. About 10 gallons of sap come from each hole.
    1. Who collects maple sap? (Farmers)
    2. What does the farmer hang from a spout? (A bucket)
    3. When is sap collected? (February and March)
    4. Where does the maple sap come from? (Sugar maple trees)
    5. Why is the bucket covered? (to keep rain and snow out)

    Such exams measure understanding by asking a variety of questions. Different types of questions probe different aspects of understanding.

    Existing techniques currently earn roughly a 40% grade: still failing, but encouraging. We will investigate methods by which a computer can understand the text better, and we hope that by the end of the workshop the computer will be ready to move on to the fourth grade!


    Final Group Presentation

    Final Presentation Video

    Team Members

    Senior Members
    Eric BreckMITRE
    Marc LightMITRE
    Ellen RiloffUniversity of Utah
    Mats RoothUniversity of Stuttgart
    Graduate Students
    Gideon MannJHU
    Mike ThelenUniversity of Utah
    Undergraduate Students
    Pranav AnandHarvard University
    Brianne BrownBryn Mawr College

WS'99 Summer Workshop

Research Groups

  • Normalization of Non-Standard Words

    Real text contains a variety of "non-standard" token types, such as digit sequences; words, acronyms and letter sequences in all capitals; mixed-case words (WinNT, SunOS); abbreviations; Roman numerals; URLs and e-mail addresses. Many of these kinds of elements are pronounced according to principles quite different from those governing ordinary words. Furthermore, many items have more than one plausible pronunciation, and the correct one must be disambiguated from context: IV could be "four", "fourth", "the fourth", or "I.V."

    Normalizing or rewriting such text using ordinary words is an important issue for several applications. For instance, an essential feature of natural human-computer interfaces is that the computer be capable of responding with spoken replies or comments. A Text-to-Speech module synthesizes the spoken response from such text input and must be able to render such items appropriately into speech. In Automatic Speech Recognition nonstandard types cause problems for training acoustic as well as language models. More sophisticated text normalization will be an important tool for utilizing the vast amounts of on-line text resources. Normalized text is likely to be of specific benefit in information extraction applications.

    This project will apply language modeling techniques to the creation of wide-coverage models for disambiguating non-standard words in English. Its aim is to create (1) a publicly available corpus of tagged examples, plus a publicly available taxonomy of cases to be considered, and (2) a set of tools representing the state of the art in text normalization for English.
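    As an illustration of the kind of taxonomy and disambiguation involved, the sketch below classifies tokens into a few of the categories named above and expands one of them. The category labels and rules are invented for illustration and are far coarser than an actual normalization taxonomy.

```python
import re

# Illustrative non-standard-word categories (invented labels, coarse rules).
def classify_token(tok):
    """Assign a rough non-standard-word category to a token (sketch)."""
    if re.fullmatch(r"\d+", tok):
        return "NUM"    # digit sequence: "1999"
    if re.fullmatch(r"[IVXLCDM]+", tok):
        return "ROMAN"  # Roman numeral: "IV" (note: "CD" is ambiguous with a letter sequence)
    if re.fullmatch(r"[A-Z]{2,}", tok):
        return "LSEQ"   # letter sequence read letter by letter: "URL"
    if re.fullmatch(r"[A-Za-z]*[a-z][A-Z][A-Za-z]*", tok):
        return "MIXED"  # mixed-case word: "WinNT"
    return "ORD"        # ordinary word

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_value(tok):
    """Expand a Roman numeral to its integer value ("IV" -> 4)."""
    total = 0
    for a, b in zip(tok, tok[1:] + " "):
        v = ROMAN[a]
        # Subtractive notation: a smaller numeral before a larger one negates.
        total += -v if b != " " and v < ROMAN[b] else v
    return total
```

    Context would still have to decide whether "IV" should be read as "four", "fourth", "the fourth", or "I.V.", which is where the statistical disambiguation comes in.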


    Team Members

    Senior Members
    Alan BlackUniversity of Edinburgh, CSTR
    Stanley ChenCMU
    Mari OstendorfBoston University
    Richard SproatAT&T Labs
    Graduate Students
    Shankar KumarCLSP
    Undergraduate Students
    Christopher RichardsWilliams
  • Statistical Machine Translation

    Automatic translation from one human language to another using computers, better known as machine translation (MT), is a longstanding goal of computer science. In order to be able to perform such a task, the computer must "know" the two languages --- synonyms for words and phrases, grammars of the two languages, and semantic or world knowledge. One way to incorporate such knowledge into a computer is to use bilingual experts to hand-craft the necessary information into the computer program. Another is to let the computer learn some of these things automatically by examining large amounts of parallel text: documents which are nearly exact translations of each other. The Canadian government produces one such resource, for example, in the form of parliamentary proceedings which are recorded in both English and French.

    Recently, statistical data analysis has been used to gather MT knowledge automatically, from parallel bilingual text. The techniques have unfortunately not been disseminated to the scientific community in very usable form, and new follow-on ideas have not developed rapidly. In pre-workshop activity, we plan to reconstruct a baseline statistical MT system for distribution to all researchers, and to use it as a platform for workshop experiments. These experiments will include working with morphology, online dictionaries, widely available monolingual texts, and syntax. The goal will be to improve the accuracy of the baseline and/or achieve the same accuracy with only limited parallel corpora. We will work with the French-English Hansard data as well as with a new language, perhaps Czech or Chinese.
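    The kind of MT knowledge that can be gathered automatically from parallel text is illustrated below with a minimal EM learner for word translation probabilities, in the spirit of the early IBM word-alignment models. The toy bitext, initialization, and iteration count are for illustration only; real systems add null alignments and train on far larger corpora.

```python
from collections import defaultdict

def model1(bitext, iterations=10):
    """EM estimation of word translation probabilities t(f | e) (IBM-Model-1 style)."""
    t = defaultdict(lambda: 1.0)  # uniform-ish initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)
        for e_sent, f_sent in bitext:
            for f in f_sent:
                # E-step: distribute each foreign word over its possible sources.
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t

# Toy parallel fragments (invented):
bitext = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "house"], ["une", "maison"])]
t = model1(bitext)
# After EM, t[("maison", "house")] dominates t[("maison", "the")]:
# co-occurrence statistics alone pick out the right pairing.
```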


    Team Members

    Senior Members
    David YarowskyCLSP
    Kevin KnightUSC/ISI
    John LaffertyCMU
    Dan MelamedWest Group
    David PurdyDoD
    Graduate Students
    Yaser Al-OnaizanUSC/ISI
    Jan CurinCharles Univ., CR
    Franz OchRWTH Aachen
    Undergraduate Students
    Noah SmithCLSP
    Michael JahrStanford
  • Topic-Based Novelty Detection

    Computers are increasingly used to manage the large volumes of news and information now available in electronic form. The task of the computer is to organize the incoming data into segments or stories that are related, and to index them in a way that makes it easier for the user to digest them.

    A key problem in digesting new data is deciding which parts contain redundant information, so that attention can be focused on the new material. This project proposes to investigate the problem of analyzing newly arrived news stories for two purposes: (1) to decide if a story discusses an event or topic that has not been seen earlier (first story detection); and (2) to identify, within a sequence of stories on the same pre-defined topic, which portions of subsequent stories contain new information, and to determine the new named entities that are central to the topic (within-topic novelty detection). The project will focus on extending and combining Information Retrieval and Natural Language Processing extraction techniques to address these questions. Specifically, the team will look at identifying who/where/when entities and how to use them in Information Retrieval and other language modeling approaches to this problem. An important component of the proposed project is investigating the impact on detection results of using (degraded) text put out by a speech recognition system. The evaluation of the project's results will be based on established measures from the Topic Detection and Tracking (TDT) initiative in the case of first story detection, and on the accuracy of aligning predicted new text with actual new information (as identified by human experts prior to the workshop) in the case of novelty detection.


    Team Members

    Senior Members
    James AllanUMass
    Hubert JinBBN
    Martin RajmanEPFL
    Charles WayneDoD
    Graduate Students
    Daniel GildeaICSI
    Victor LavrenkoUMass
    Undergraduate Students
    David CaputoYale
    Rose HobermanUTexas
  • Toward Language-Independent Acoustic Modeling

    The state of the art in automatic speech recognition (ASR) has advanced considerably for those languages for which large amounts of data are available to build the ASR system. Obtaining such data is usually very difficult, as it includes tens of hours of recorded speech along with accurate transcriptions, an on-line dictionary or lexicon which lists how words are pronounced in terms of elementary sound units such as phonemes, and on-line text resources. The text resources are used to train a language model which helps the recognizer anticipate likely words, the dictionary tells the recognizer how a word will sound in terms of phonemes when it is spoken, and the speech recordings are used to learn the acoustic signal pattern for each phoneme, resulting in a hierarchy of models which work together to recognize successive spoken words. Relatively little research has been done on building speech recognition systems for languages for which such data resources are not available --- a situation which unfortunately holds for all but a few languages of the world.

    This project will investigate the use of speech from diverse source languages to build an ASR system for a single target language. We will study promising modeling techniques to develop ASR systems in languages for which large amounts of training data are not available. We intend to pursue three themes. The first concerns the development of algorithms to map pronunciation dictionary entries in the target language to elements in the dictionaries of the source languages. The second theme will be Discriminative Model Combination (DMC) of acoustic models in the individual source languages for recognition of speech in the target language. The third theme will be development of clustering and adaptation techniques to train a single set of acoustic models using data pooled from the available source languages. The goal is to develop Czech Broadcast News (BN) transcription systems using a small amount of Czech adaptation data to augment training data available in English, Spanish, and Mandarin. The best data for this modeling task would be natural, unscripted speech collected on a quiet, wide-band acoustic channel. News broadcasts are a good source of such speech and are fairly easily obtained. Broadcast news data of other source or target languages, possibly German or Russian, will be used if they become available in a suitable amount and quality.
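    The Discriminative Model Combination theme scores a hypothesis with a weighted log-linear combination of the individual source-language models, with the weights trained discriminatively. A minimal sketch of the scoring step follows; the model names, log-likelihoods, and weights are placeholders, not values from the workshop.

```python
def combined_log_score(log_scores, weights):
    """Log-linear combination of per-model log-likelihoods (DMC-style sketch).

    log_scores: dict model_name -> log P(acoustics | hypothesis) under that model
    weights:    dict model_name -> discriminatively trained exponent
    """
    return sum(weights[m] * log_scores[m] for m in log_scores)

# Toy scores for one hypothesis from three source-language acoustic models:
log_scores = {"english": -120.0, "spanish": -130.0, "mandarin": -125.0}
weights = {"english": 0.5, "spanish": 0.2, "mandarin": 0.3}  # need not sum to 1
score = combined_log_score(log_scores, weights)  # = -123.5
```

    During recognition, competing hypotheses would be ranked by this combined score, so a source language whose models transfer well to the target language can receive a larger weight.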


    Team Members

    Senior Members
    Sanjeev KhudanpurCLSP
    Peter BeyerleinPRL
    William ByrneCLSP/JHU
    John MorganWest Point
    Joe PiconeMiss. State
    Graduate Students
    Juan HuertaCMU
    Nino PeterekCharles Univ., CR
    Undergraduate Students
    Bhaskara MarthiUToronto
    Wei WangRice

WS'98 Summer Workshop

Research Groups

  • Core Natural Language Processing Technology Applicable to Multiple Languages

    Syntactic analysis is one of the crucial ingredients of natural language understanding. When we hear a sentence such as "I saw John," we identify saw as the main verb or event in the sentence, I as the subject doing the seeing, and John as the object being seen. While this example is simple, things become complex very quickly as the sentence to be understood grows longer. A common problem that arises is ambiguity, e.g., in the sentence "I saw the man with a telescope," either the seen man had a telescope or a telescope was used to see the man. Even for moderately long sentences, tens or hundreds of thousands of distinct analyses are possible. Yet automatic syntactic analysis based on statistical methods has been quite successful for English: state-of-the-art parsers correctly extract 90% of the dependencies from newspaper text such as the Wall Street Journal. This is done by annotating a large number of sentences by hand and building a statistical model which estimates how frequently a particular analysis is encountered. This model then ranks the various analyses of a new sentence by likelihood and efficiently computes the most likely one. Most state-of-the-art parsing models make heavy use of lexical information in choosing an analysis. While much is known about parsing English text, it is easy to see that parsing a highly inflective language or a free-word-order language such as Czech adds a new dimension of difficulty. The inflective nature means that the vocabulary as seen by a computer appears huge, because each inflectional form is a distinct word. Unlike English, Czech does not obey the subject-verb-object ordering of words as in "Peter sells cars." The identification of subjects or objects is often via their inflectional forms, and discourse plays a role in syntactic analysis. Participants in this project plan to explore techniques of syntactic analysis, both known and new, which utilize inflectional information to deal with the free word order. The techniques developed here for Czech newspaper text are expected to be useful for Polish, Russian, Serbo-Croatian and other Slavic languages, and for languages such as Spanish, German and Italian, which exhibit inflectional and free-word-order behavior to smaller degrees.
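    The idea of ranking analyses by likelihood can be illustrated with the telescope sentence: each analysis is a product of rule probabilities, and the parser prefers the analysis with the higher score. The rules and probabilities below are invented purely for illustration.

```python
import math

# Made-up rule probabilities for two analyses of
# "I saw the man with a telescope" (log domain to avoid underflow).
rule_logprob = {
    "VP -> V NP PP": math.log(0.2),  # the telescope was used to see
    "VP -> V NP":    math.log(0.5),
    "NP -> NP PP":   math.log(0.1),  # the seen man had the telescope
}

def parse_logprob(rules):
    """Score an analysis as the sum of its rules' log probabilities."""
    return sum(rule_logprob[r] for r in rules)

instrument = parse_logprob(["VP -> V NP PP"])              # telescope as instrument
attachment = parse_logprob(["VP -> V NP", "NP -> NP PP"])  # telescope with the man
# Under these toy numbers the instrument reading scores higher.
```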

    Team Members

    Senior Members
    Eric BrillCLSP/JHU
    Jan HajicCharles Univ., CR
    Doug JonesDoD
    Lance RamshawBBN
    Graduate Students
    Michael CollinsUPenn
    Barbora HladkaCharles Univ., CR
    Christoph TillmannLehrstuhl, Aachen
    Daniel ZemanCharles Univ., CR
    Undergraduate Students
    Cynthia KuoStanford
    Oren SchwartzUPenn
  • Dynamic Segmental Models of Speech Coarticulation

    Automatic speech recognition has achieved significant success by using powerful and complex models for representing and interpreting the speech (acoustic) signal. However, these models require unreasonable amounts of training data. Some researchers think that the nature and fundamental philosophy of current acoustic-phonetic modelling methods, such as hidden Markov models, are inappropriate. Participants in this project plan to explore a different way of thinking about the nature of speech patterns. Their proposed model has a long history in speech science, but it has yet to be successfully applied to automatic speech recognition. The speech signal can be thought of as being generated by a relatively low-dimensional system, namely our articulatory organs, moving slowly relative to the variations of the signal picked up by a microphone. The proposed computational model consists of a linear dynamical process describing smooth movement of the vocal tract resonance, which flows from one phonetic unit to another, with the observed features of the acoustic signal being a nonlinear function of this process. Vocal tract resonance is a characteristic of the vocal tract related to the familiar notion of formants; it corresponds roughly to the formants for vocalic sounds, and though it may not correspond to spectral peaks for consonants, it changes smoothly through them as the configuration of the articulators changes. The participating researchers expect that this model will be robust even for modest amounts of training data, due to its compactness. Computational techniques they plan to use in this project include nonlinear regression, multilayer perceptrons and Kalman filtering.
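    With a linear observation function, the smooth hidden trajectory described above can be tracked by a standard Kalman filter; the one-dimensional sketch below uses invented parameters, and the nonlinear observation model mentioned in the description (e.g. a multilayer perceptron) would instead call for an extended Kalman filter.

```python
# One-dimensional Kalman filter tracking a slowly moving hidden resonance
# frequency from noisy acoustic observations. All parameters are illustrative.
def kalman_track(observations, x0=500.0, p0=1e4,
                 a=0.98, q=25.0, h=1.0, r=400.0):
    """a: state transition, q: process noise, h: observation gain, r: obs noise."""
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: the hidden resonance drifts smoothly.
        x, p = a * x, a * p * a + q
        # Update: fold in the noisy acoustic measurement.
        k = p * h / (h * p * h + r)  # Kalman gain
        x = x + k * (z - h * x)
        p = (1 - k * h) * p
        estimates.append(x)
    return estimates
```

    Fitting the transition and observation parameters per phonetic unit, rather than fixing them as here, is where the training data comes in.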

    Team Members

    Senior Members
    John BridleDragon UK
    Li DengWaterloo
    Joe PiconeMiss. State
    Hywel RichardsDragon UK
    Mike SchusterNara, Japan
    Graduate Students
    Terri KammCLSP
    Jeff MaWaterloo
    Undergraduate Students
    Sandi PikeBrown
    Roland ReaganCMU
  • Rapid Speech Recognizer Adaptation for New Speakers

    Humans have little difficulty recognizing speech in noisy environments, speech distorted by having passed through an unknown channel, or speech from nonnative speakers. We adapt to the characteristics of the new speech, often after hearing only a few seconds of it. Adaptation techniques have been developed for automatic speech recognizers which attempt to similarly compensate for differences between the speech on which the system was trained and the speech which it has to recognize. However, several minutes of speech from the new speaker or environment have to be provided to the system to obtain any significant improvement in recognition performance. An automatic speech recognition system employs a number of models for small segments of speech sounds such as phonemes. Simply put, transforming each of these models requires that a sufficient number of samples of each segment be seen from the new speaker. When a small amount of new speech is heard, humans are able to exploit relationships between various sounds, so that having heard a few of them in the distorted environment is adequate to adjust for the unheard ones as well. In automatic systems, therefore, if sufficient speech is not available to adapt all the models individually, some method must be devised to transform the models of the unheard or insufficiently heard segments based on the heard ones. The participants in this project plan to relax the commonly used remedy of tying, i.e., forcing to be identical, the transformations of the models of related speech units. They instead plan to study the dependencies between the speech units, so that the model transformation for one unit influences, but is not necessarily identical to, the transformation for another unit. They plan to use this knowledge to transform each model individually without requiring a large sample of each speech segment for adaptation. Modelling techniques they plan to employ include covariance models such as Markov random fields and dependency trees.
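    The commonly used tying remedy, which this project aims to move beyond, can be sketched as follows: all phone models in a class share a single adaptation shift estimated from whatever adaptation data is available, so even unheard phones get updated. The phone classes, model means, and adaptation data below are invented for illustration.

```python
# Invented phone classes and trained model means (one-dimensional for clarity).
phone_class = {"p": "stop", "t": "stop", "k": "stop", "a": "vowel", "i": "vowel"}
trained_mean = {"p": 1.0, "t": 1.2, "k": 0.9, "a": 3.0, "i": 3.4}

def tied_adaptation(adaptation_samples):
    """adaptation_samples: phone -> observed feature values from the new speaker."""
    shift_sum, shift_n = {}, {}
    for phone, values in adaptation_samples.items():
        cls = phone_class[phone]
        observed_shift = sum(values) / len(values) - trained_mean[phone]
        # Pool shifts within a class, weighted by sample count.
        shift_sum[cls] = shift_sum.get(cls, 0.0) + observed_shift * len(values)
        shift_n[cls] = shift_n.get(cls, 0) + len(values)
    class_shift = {c: shift_sum[c] / shift_n[c] for c in shift_sum}
    # Every phone in a class gets the same shift, heard or not.
    return {ph: trained_mean[ph] + class_shift.get(cls, 0.0)
            for ph, cls in phone_class.items()}
```

    The project's alternative would let the shift observed for /p/ merely influence, rather than dictate, the shift applied to /t/ and /k/.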

    Team Members

    Senior Members
    Sanjeev KhudanpurCLSP
    Sid BerkowitzDoD
    Enrico BocchieriAT&T
    William ByrneCLSP/JHU
    Vassilis DigalakisTUC
    Ashvin KannanNuance
    Ananth SankarSRI
    Graduate Students
    John McDonoughCLSP
    Costas BoulisTUC
    Undergraduate Students
    Heather CollierWVU
    Adrian CorduneanuToronto

WS'97 Summer Workshop

Research Groups

  • Acoustic Processing/Modeling Group: Exploring the Time Dimension at Different Scales

    In the 1997 JHU/CLSP workshop (WS97) our group revisits the acoustic processor architecture employed in state-of-the-art, large vocabulary, continuous speech recognition systems. We investigate data-driven processing paradigms, exploring techniques at different context scales. At short time scales (~10 ms) we investigate the non-linear frequency mapping known as the Mel scale. At medium time scales (context ~100 ms) we investigate linear discriminant and heteroscedastic discriminant transforms. At time scales with longer context (~1000 ms) we explore feature-trajectory filtering. At even longer time scales (~500 ms to 4 s) we experiment with adaptive cepstrum bias normalization techniques.
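    At the longest scale, adaptive cepstrum bias normalization can be sketched as running-mean subtraction: slowly varying channel effects appear as a near-constant offset in the cepstral domain and are cancelled by tracking and removing it. The decay constant below is illustrative, not the workshop's setting.

```python
# Adaptive cepstral bias removal: subtract a running estimate of the
# cepstral mean so slowly varying channel effects (~0.5-4 s) cancel out.
def remove_cepstral_bias(frames, decay=0.995):
    """frames: list of cepstral vectors (lists of floats), e.g. 10 ms apart."""
    bias = list(frames[0])
    normalized = []
    for frame in frames:
        # Exponentially weighted running mean per cepstral coefficient.
        bias = [decay * b + (1 - decay) * c for b, c in zip(bias, frame)]
        normalized.append([c - b for c, b in zip(frame, bias)])
    return normalized
```

    A constant offset added to every frame leaves the normalized output (asymptotically) unchanged, which is exactly the channel invariance being sought.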

    The results of our investigation are very encouraging and are summarized in the online final reports and papers.


    Team Members

    Senior Members
    Andreas AndreouCLSP
    Hynek HermanskyCLSP
    Juergen LuettinIDIAP
    Yasuhiro MinamiNTT Human Interface Labs
    Christian WellekensEurecom
    Graduate Students
    Terri KammCLSP
    Daniel FainCalTech
  • Discourse LM


    Team Members

    Senior Members
    Dan JurafskyUC, Boulder
    Marie MeteerBBN
    Liz ShribergSRI
    Andreas StolckeSRI
    Paul TaylorEdinburgh
    Carol Van Ess-DykemaDoD
    Graduate Students
    Becky BatesBU
    Noah CoccaroUC, Boulder
    Rachel MartinJHU
    Klaus RiesCMU
  • Pronunciation Modelling

    Our goal is to model the extensive pronunciation variation found in the Switchboard corpus, likely an important factor in the difficulty current ASR systems have on this conversational speech task. In contrast to previous efforts, we will use the recently created ICSI hand-labeled phonetic transcriptions of Switchboard as the target data of our modeling. This new corpus potentially contains a wealth of information about pronunciation in conversational speech. We will use relevant phonological, prosodic, syntactic, and discourse information as the source data of our modeling including baseform pronunciation of words, lexical stress, pitch accent, and segmental durations. We will map from source to target by various stochastic and rule-based methods including statistical decision trees, rewrite rules, and MMI. The initial measure of performance will be the reduction of the conditional entropy of the target ICSI transcriptions given the source linguistic information. Next, these mappings will be used in a speech recognizer to create alternative pronunciations in context and word error rate will be measured. As time permits, the pronunciation models created above will be used to transcribe automatically a portion of the speech corpus and then the acoustic models will be re-estimated based on these transcriptions. We will also explore generating constrained automatic alignments of all of the data as an alternative to the ICSI data.
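    The initial performance measure, reduction of the conditional entropy of the target transcription given the source linguistic information, can be computed from joint counts as in the sketch below. The phone alignment data is a toy invention; the real computation runs over the ICSI hand-labeled transcriptions and the full set of source features.

```python
import math
from collections import defaultdict

def conditional_entropy(pairs):
    """H(surface | source) in bits, from (source, surface) observation pairs."""
    joint = defaultdict(int)
    marginal = defaultdict(int)
    total = 0
    for src, tgt in pairs:
        joint[(src, tgt)] += 1
        marginal[src] += 1
        total += 1
    h = 0.0
    for (src, tgt), c in joint.items():
        p_joint = c / total          # P(src, tgt)
        p_cond = c / marginal[src]   # P(tgt | src)
        h -= p_joint * math.log2(p_cond)
    return h

# Toy alignment: baseform /t/ surfaces as [t] or is deleted ("-"); /s/ is stable.
pairs = [("t", "t"), ("t", "t"), ("t", "-"), ("t", "-"), ("s", "s"), ("s", "s")]
# H = 2/3 bit: /t/ is fully unpredictable (1 bit) on 2/3 of the tokens.
```

    A pronunciation model that also conditioned on, say, lexical stress or speaking rate would aim to drive this entropy down.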

    Team Members

    Senior Members
    Sanjeev KhudanpurCLSP
    Bill ByrneCLSP/JHU
    Michael RileyAT&T Labs
    Chuck WootersDoD
    George ZavaliagkosBBN
    Graduate Students
    Murat SaraclarCLSP
    Michael FinkeCMU
    Harriet NockCambridge
  • Syllable Based Speech Processing

    Goal: To explore the use of syllable-based acoustical information in speech recognition.

    Motivation: For at least a decade now the triphone model has been the dominant method of modeling speech acoustics. But triphones are relatively inefficient decompositional models, and they have become ponderous and resistant to integration of new information such as temporal characteristics and dependencies. The resulting tension between the desire to add model detail and the burden of doing so has caused researchers to look for alternatives to the triphone model. The syllable is one such alternative, appealing because of its close connection to articulation, its integration of coarticulation, and the possibility for relatively compact statistical coverage.


    Team Members

    Senior Members
    Andres CorradaDragon
    George DoddingtonDoD
    Joe PiconeMiss. State
    Barbara WheatleyDoD
    Graduate Students
    Vaibhava GoelCLSP
    Mark OrdowskiCLSP
    Arvin GanapathirajuMiss. State
    Katrin KirchhoffICSI

WS'96 Summer Workshop

Research Groups

  • Automatic Learning of Word Pronunciation from Data

    Today's recognizers are based on single pronunciations for most words. Certain types of pronunciation variation (phone deletion/reduction, dialect) are impossible to model at the acoustic level. The goal of this project is to learn models of word pronunciation automatically from data. For the Switchboard and CallHome corpora, a small number of words make up a large fraction of the total words spoken; the pronunciations of these frequent, often misrecognized words will be learned first. Pronunciation variants will no longer be treated as mutually independent, i.e., under the assumption that any speaker would choose one of the given variants with a given probability, independent of related choices made by that speaker in the same conversation.

    Team Members

    Senior Members
    Sanjeev KhudanpurCLSP
    Charles GallesDoD
    Yu-Hung KaoTI
    Steven WegmannDragon
    Mitch WeintraubSRI
    Graduate Students
    Murat SaraclarCLSP
    Eric FoslerICSI
  • Dependency Modeling

    The goal of this project is to build an improved speech language model for conversational English that explores non-local lexical dependencies between words.

    Team Members

    Senior Members
    David EngleDoD
    Victor JimenezValencia
    Harry PrintzIBM
    Eric RistadPrinceton
    Roni RosenfeldCMU
    Andreas StolckeSRI
    Dekai WuHong Kong HKUST
    Graduate Students
    Ciprian ChelbaCLSP
    Lidia ManguCLSP
  • Modeling Systematic Variations in Pronunciation via a Language-Dependent Hidden Speaking Mode

    The goal of this project is to utilize the results of conversational discourse analysis (e.g., prosody) to model pronunciation variations, and, more specifically, to investigate the use of a hidden speaking mode to represent systematic variations that are correlated with the word sequence (predictable from syntactic structure and thus includable into the language model). Possible systematic factors affecting pronunciation differences are local factors, such as coarticulation effects across word boundaries, linguistic factors, such as phrasing and focus (which are cued by prosodic markers that affect the acoustic realization of a word), speaking rate (which may be associated with discourse structure), and global factors, such as dialect.

    Team Members

    Senior Members
    Bill ByrneJHU
    Mari OstendorfBU
    Ken RossBBN
    Liz ShribergSRI
    David TalkinEntropic
    Alex WaibelCMU
    Barbara WheatleyDoD
    Graduate Students
    Asela GunarwardanaCLSP
    Michiel BacchianiBU
    Michael FinkeCMU
    Sam RoweisCaltech
    Torsten ZeppenfeldCMU
  • Speech Data Modeling

    This project is concerned with speech and speaker variations as they affect signal processing and acoustic modeling:
    Signal processing will involve non-linear speaker and channel adaptation by finding a common low-dimensional mapping of training data, based on the J-RASTA signal processing approach. The multi-band recognition paradigm will also be explored and exploited.
    Variability as a function of global and local speaking rate will be incorporated into the acoustic model. We will exploit the discriminant HMM technology developed by Bourlard and Morgan, using transition probabilities that depend on the acoustics (and, in this case, on rate).

    Team Members

    Senior Members
    Hynek HermanskyCLSP
    Herve BourlardFaculte Polytechnique de Mons (BE) / ICSI
    Jordan CohenIDA
    Nelson MorganICSI
    Christophe RisFaculte Polytechnique de Mons (BE)
    Graduate Students
    Mark OrdowskiCLSP
    Nikki MirghaforiICSI
    Sangita TibrewalaOregon Graduate Institute

LM'95 Summer Research Workshop

Research Groups

  • Fast Sparse Data Training/Portability Group

    Team Members

    Senior Members
    Satya DharanipragadaJHU
    Herman NeyRWTH Aachen
    John PrangeDoD
    Andreas StolckeSRI
    Mitch WeintraubSRI
    Graduate Students
    Sanjeev KhudanpurCLSP
    Yaman AksuJHU
    Affiliate Members
    Fred JelinekCLSP
    Liz ShribergSRI
  • Language Modeling for Conversational Speech Recognition


    1) Creating language models that are attuned to spontaneous conversations.
    "There is more to conversational speech than Dysfluencies."

    2) Beyond mere word-based transcription:
    identifying and cleaning up disfluencies.
    transforming the output into "simplified English".

    Project Page & Team Report

    Team Members

    Senior Members
    Rajeev AgarwalTI
    Bill ByrneJHU
    Mark LibermanUPenn
    Roni RosenfeldCMU
    Liz ShribergSRI
    Jack UnverferthDoD
    Enrique VidalValencia-Spain
    Graduate Students
    Dimitra VergyriCLSP
    Rukmini IyerBBN
  • Language Modeling Issues for Spanish Language Large Vocabulary Continuous Speech Recognition

    The goal of this workshop project is to explore language modeling techniques to improve recognition of unrestricted, conversational Spanish over telephone channels. The basic training and test data will be from the Spanish-language component of the Linguistic Data Consortium's CallHome corpus. This is a corpus of transcribed telephone conversations. Text corpora, as well as other sources of transcribed speech, will be available.

    We will be starting with a baseline Spanish speech recognizer built with BBN's Byblos speech recognition system. The workshop will be provided with N-best and/or lattice outputs from this recognizer. We will endeavor to develop and evaluate language models that improve on the baseline performance level. In particular, it will be desirable to exploit specific aspects of the Spanish language to improve the performance of the recognizer. The N-best lists and lattices will provide one means for evaluating our ideas, and perplexity measurements another. Our progress will also be measured by our improved understanding of how language characteristics should influence the choice of a language model for recognition.
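    One of the two evaluation measures mentioned, perplexity, is computed from the probabilities the language model assigns to each successive word of held-out text; the probabilities below are a toy stand-in for real model output.

```python
import math

def perplexity(word_probs):
    """Perplexity from the model probability assigned to each test word."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    # 2 to the power of the average negative log-probability per word.
    return 2 ** (-log_sum / n)

# A model that assigns probability 1/4 to every word has perplexity 4:
# it is as uncertain as a uniform choice among 4 words.
```

    Lower perplexity on CallHome Spanish would indicate a language model better matched to conversational Spanish.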


    Team Members

    Senior Members
    German Bordel
    Pierre Dupont
    Herb Gish
    Jose Oncina
    Carol Van Ess-Dykema
    Graduate Students
    Lin Chase
    Eric Wheeler
  • Phrase Structure Language Models

    The goal is to develop language models for improving the accuracy in recognizing conversational speech. We want to explore the use of phrase structure (possibly including syntactic lexical information such as morphology, part-of-speech tags, etc.) to improve on the infamous trigram language model. Specifically, we would like to explore parsing-based models for the prediction of the next word.
    We expect to use the various available treebanks (Wall Street Journal, Brown Corpus) for written text but we need a treebank for conversational speech. Specifically, we want one million words of Switchboard marked for disfluency and surface structure similar to the WSJ Treebank.
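    The trigram baseline this project aims to improve predicts each word from the two preceding ones. A minimal maximum-likelihood version is sketched below on toy data; any real system adds smoothing and backoff, omitted here.

```python
from collections import defaultdict

def train_trigram(sentences):
    """Train a maximum-likelihood trigram model from tokenized sentences."""
    counts, context = defaultdict(int), defaultdict(int)
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]  # pad with start/end markers
        for w1, w2, w3 in zip(toks, toks[1:], toks[2:]):
            counts[(w1, w2, w3)] += 1
            context[(w1, w2)] += 1
    def prob(w1, w2, w3):
        """P(w3 | w1 w2) by relative frequency (0 for unseen contexts)."""
        return counts[(w1, w2, w3)] / context[(w1, w2)] if context[(w1, w2)] else 0.0
    return prob

p = train_trigram([["i", "saw", "john"], ["i", "saw", "mary"]])
# p("i", "saw", "john") = 0.5; p("<s>", "i", "saw") = 1.0
```

    A parsing-based model would instead condition the prediction on the phrase structure built over the whole preceding history, rather than on just the last two words.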

    Team Members

    Senior Members
    David HarrisDOD
    Steve LoweDragon
    Srinivasa RaoIBM
    Eric RistadPrinceton
    Salim RoukosIBM
    Graduate Students
    Xiaoqiang LuoCLSP