arXiv

Title:
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

Abstract:
Generative Error Correction (GEC) has emerged as a powerful post-processing method to enhance the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon amplifies with named entities (NEs), where, in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC training dataset with synthetic data generated by prompting LLMs and text-to-speech models, thereby simulating additional errors from which the model can learn. For OOD scenarios, we simulate test-time errors from new domains similarly and in an unsupervised fashion. Additionally, to better handle named entities, we introduce retrieval-augmented correction by augmenting the input with entities retrieved from a database. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8% -- 30% relative WER improvements in ID and 10% -- 33% improvements in OOD settings.

2024.12.30

INTERSPEECH 2024

Title:
Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation

Abstract:
Language-agnostic many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language. These models do not need source language identification, which improves user experience. In some cases, the input language can be given or estimated. Our goal is to use this additional language information while preserving the quality of the other languages. We accomplish this by introducing a simple and effective linear input network. The linear input network is initialized as an identity matrix, which ensures that the model can perform as well as, or better than, the original model. Experimental results show that the proposed method can successfully enhance the specified language, while keeping the language-agnostic ability of the many-to-one ST models.

2024.12.30

ICASSP 2024

Title:
Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

Abstract:
The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions. This has made offering translations in multiple languages essential for user applications. Traditional approaches to automatic speech recognition (ASR) and speech translation (ST) have often relied on separate systems, leading to inefficiencies in computational resources, and increased synchronization complexity in real time. In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. We introduce a novel method for joint token-level serialized output training based on timestamp information to effectively produce ASR and ST outputs in the streaming setting. Experiments on {it,es,de}-en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.

2024.12.30

ICASSP 2024

Title:
Diarist: Streaming Speech Translation with Speaker Diarization

Abstract:
End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates tokenlevel serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality while taking SD accuracy into account. Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate the research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.

2024.12.30

ASRU 2023

Title:
Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Abstract:
Simultaneous Speech-to-Text translation serves a critical role in real-time crosslingual communication. Despite the advancements in recent years, challenges remain in achieving stability in the translation process, a concern primarily manifested in the flickering of partial results. In this paper, we propose a novel revision-controllable method designed to address this issue. Our method introduces an allowed revision window within the beam search pruning process to screen out candidate translations likely to cause extensive revisions, leading to a substantial reduction in flickering and, crucially, providing the capability to completely eliminate flickering. The experiments demonstrate the proposed method can significantly improve the decoding stability without compromising substantially on the translation quality.

2024.12.30

ASRU 2023

Title:
Building High-Accuracy Multilingual ASR With Gated Language Experts and Curriculum Training

Abstract:
We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring user input for language identification (LID) during inference. Our method incorporates a gating mechanism and LID loss to enable transformer experts to learn language-specific information. Linear experts are applied on joint network to stabilize training. The curriculum training scheme leverages LID to guide gated experts in improving their respective language-specific performance. Experimental results on an English and Spanish bilingual task show significant average relative word error reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Our models even perform similarly to upper-bound models with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals similar advantages to those seen in the bilingual models, highlighting its ease of extension to multiple languages.

2024.12.30

ASRU 2023

Title:
Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

Abstract:
In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.

2024.12.30

ASRU 2023

Title:
A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

Abstract:
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language. The backbone of SM2 is Transformer Transducer, which has high streaming capability. Instead of human labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has the truly zero-shot capability when expanding to new target languages, yielding high quality ST results for {source-speech, target-text} pairs that are not seen during training.

2024.12.30

INTERSPEECH 2023

Title:
LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation using Neural Transducers

Abstract:
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. It is thus possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and do not require source language identification (i.e. language-agnostic). In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Based on the transducer model structure, we propose four methods, a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for encoder, and connectionist temporal classification regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also reaches the performances of monolingual ASR and bilingual ST models.

2024.12.30

ICASSP 2023

Title:
Data2vec-SG: Improving Self-supervised Learning Representations for Speech Generation Tasks

Abstract:
Self-supervised learning has been successfully applied to various speech recognition and understanding tasks. However, for generative tasks such as speech enhancement and speech separation, most self-supervised speech representations did not show substantial improvements. To deal with this problem, in this paper, we propose data2vec-SG (Speech Generation), which is a teacher-student learning framework that addresses speech generation tasks. Our data2vec-SG introduces a reconstruction module into data2vec and enforces the representations to contain not only the semantic information but also the acoustic knowledge to generate clean speech waveforms. Experimental results demonstrate that the proposed framework boosts the performance of various speech generation tasks including speech enhancement, speech separation, and packet loss concealment. Meanwhile, the learned representation is also capable of helping other downstream tasks, which is demonstrated by the good performance in the speech recognition task in both clean and noisy conditions.

2023.02.26

ICASSP 2023

Title:
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition

Abstract:
Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker applications, with limited investigations in multi-talker cases, especially for streaming scenarios. In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion. We first observe that conventional SSL techniques do not work well on this task due to the poor representation of overlapping speech. We then propose a novel SSL training objective, referred to as bi-label masked speech prediction, which explicitly preserves representations of all speakers in overlapping speech. We investigate various aspects of the proposed system including data configuration and quantizer selection. The proposed SSL setup achieves substantially better word error rates on the LibriSpeechMix dataset.

2023.02.26

INTERSPEECH 2022

Title:
Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Abstract:
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce it to streaming end-to-end speech translation (ST), which aims to convert audio signals to texts in other languages directly. Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve the modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates texts of multiple languages at the same time. Experimental results on a large-scale 50 thousand (K) hours pseudo-labeled training set show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming cascaded ST for English-German translation.

2022.03.29

INTERSPEECH 2022

Title:
Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Abstract:
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition. In this paper, we study which factor leads to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to SV task is from a combination of mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.

2022.03.29

arXiv

Title:
A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Abstract:
This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model. The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation, but employs a Conformer encoder instead of the BLSTM network. The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling. The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus. Coupled with utterance-wise normalization and speaker adaptation, our model achieves 6.25% word error rate, which outperforms the previous best system by 8.4% relatively. In addition, the proposed Conformer-based model is 18.3% smaller in model size and reduces total training time by 79.6%.

2022.03.29

ICASSP 2022

Title:
Continuous Speech Separation with Recurrent Selective Attention Network

Abstract:
While permutation invariant training (PIT) based continuous speech separation (CSS) significantly improves the conversation transcription accuracy, it often suffers from speech leakages and failures in separation at "hot spot" regions because it has a fixed number of output channels. In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting. In addition, we propose a novel block-wise dependency extension of RSAN by introducing dependencies between adjacent processing blocks in the CSS framework. It enables the network to utilize the separation results from the previous blocks to facilitate the current block processing. Experimental results on the LibriCSS dataset show that the RSAN-based CSS (RSAN-CSS) network consistently improves the speech recognition accuracy over PIT-based models. The proposed block-wise dependency modeling further boosts the performance of RSAN-CSS.

2021.10.30

INTERSPEECH 2021

Title:
Multitask Training with Text Data for End-to-End Speech Recognition

Abstract:
We propose a multitask training method for attention-based end-to-end speech recognition models to better incorporate language level information. We regularize the decoder in a sequence-to-sequence architecture by multitask training it on both the speech recognition task and a next-token prediction language modeling task. Trained on either the 100 hour subset of LibriSpeech or the full 960 hour dataset, the proposed method leads to an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. Analyses of sample output sentences and the word error rate on rare words demonstrate that the proposed method can incorporate language level information effectively.

2021.06.03

IEEE/ACM TASLP 2020

Title:
Multi-Microphone Complex Spectral Mapping for Utterance-wise and Continuous Speaker Separation

Abstract:
We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for offline utterance-wise and block-online continuous speaker separation in reverberant conditions, aiming at both speaker separation and dereverberation. Assuming a fixed array geometry between training and testing, we train deep neural networks (DNN) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online continuous speaker separation (CSS). Although our system is trained on simulated room impulse responses (RIR) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.

2021.06.03

arXiv

Title:
Efficient End-to-End Speech Recognition Using Performers in Conformers

Abstract:
On-device end-to-end speech recognition poses a high requirement on model efficiency. Most prior works improve the efficiency by reducing model sizes. We propose to reduce the complexity of model architectures in addition to model sizes. More specifically, we reduce the floating-point operations in conformer by replacing the transformer module with a performer. The proposed attention-based efficient end-to-end speech recognition model yields competitive performance on the LibriSpeech corpus with 10 millions of parameters and linear computation complexity. The proposed model also outperforms previous lightweight end-to-end models by about 20% relatively in word error rate.

2020.11.10

IEEE/ACM TASLP 2020

Title:
Speaker Separation Using Speaker Inventories and Estimated Speech

Abstract:
We propose speaker separation using speaker inventories and estimated speech (SSUSIES), a framework leveraging speaker profiles and estimated speech for speaker separation. SSUSIES contains two methods, speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES). SSUSI performs speaker separation with the help of speaker inventory. By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches. SSUES is a widely applicable technique that can substantially improve speaker separation performance using the output of first-pass separation. We evaluate the models on both speaker separation and speech recognition metrics.

2020.05.13

IEEE/ACM TASLP 2020

Title:
Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

Abstract:
This study proposes a complex spectral mapping approach for single- and multi-channel speech enhancement, where deep neural networks (DNNs) are used to predict the real and imaginary (RI) components of the direct-path signal from noisy and reverberant ones. The proposed system contains two DNNs. The first one performs single-channel complex spectral mapping. The estimated complex spectra are used to compute a minimum variance distortion-less response (MVDR) beamformer. The RI components of beamforming results, which encode spatial information, are then combined with the RI components of the mixture to train the second DNN for multi-channel complex spectral mapping. With estimated complex spectra, we also propose a novel method of time-varying beamforming. State-of-the-Art performance is obtained on the speech enhancement and recognition tasks of the CHiME-4 corpus. More specifically, our system obtains 6.82%, 3.19% and 1.99% word error rates (WER) respectively on the single-, two-, and six-microphone tasks of CHiME-4, significantly surpassing the current best results of 9.15%, 3.91% and 2.24% WER.

2020.01.09

ASRU 2019

Title:
Speech Separation Using Speaker Inventory

Abstract:
Overlapped speech is one of the main challenges in conversational speech applications such as meeting transcription. Blind speech separation and speech extraction are two common approaches to this problem. Both of them, however, suffer from limitations resulting from the lack of abilities to either leverage additional information or process multiple speakers simultaneously. In this work, we propose a novel method called speech separation using speaker inventory (SSUSI), which combines the advantages of both approaches and thus solves their problems. SSUSI makes use of a speaker inventory, i.e. a pool of pre-enrolled speaker signals, and jointly separates all participating speakers. This is achieved by a specially designed attention mechanism, eliminating the need for accurate speaker identities. Experimental results show that SSUSI outperforms permutation invariant training based blind speech separation by up to 48% relatively in word error rate (WER). Compared with speech extraction, SSUSI reduces computation time by up to 70% and improves the WER by more than 13% relatively.

2019.09.16

INTERSPEECH 2019

Title:
Large Margin Training for Attention Based End-to-End Speech Recognition

Abstract:
End-to-end speech recognition systems are typically evaluated using the maximum a posterior criterion. Since only one hypothesis is involved during evaluation, the ideal number of hypotheses for training should also be one. In this study, we propose a large margin training scheme for attention based end-to-end speech recognition. Using only one training hypothesis, the large margin training strategy achieves the same performance as the minimum word error rate criterion using four hypotheses. The theoretical derivation in this study is widely applicable to other sequence discriminative criteria such as maximum mutual information. In addition, this paper provides a more succinct formulation of the large margin concept, paving the road towards a better combination of support vector machine and deep neural network.

2019.07.05

INTERSPEECH 2019

Title:
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling

Abstract:
Monaural speech enhancement has made dramatic advances in recent years. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem and propose a distortion-independent acoustic modeling scheme. Experimental results show that the distortion-independent acoustic model is able to overcome the distortion problem. Moreover, it can be used with various speech enhancement models. Both the distortion-independent and a noise-dependent acoustic model perform better than the previous best system on the CHiME-2 corpus. The noise-dependent acoustic model achieves a word error rate of 8.7%, outperforming the previous best result by 6.5% relatively.

2019.07.05

INTERSPEECH 2019

Title:
Enhanced Spectral Features for Distortion-Independent Acoustic Modeling

Abstract:
It has recently been shown that a distortion-independent acoustic modeling method is able to overcome the distortion problem caused by speech enhancement. In this study, we improve the distortion-independent acoustic model by feeding it with enhanced spectral features. Using enhanced magnitude spectra, the automatic speech recognition (ASR) system achieves a word error rate of 7.8% on the CHiME-2 corpus, outperforming the previous best system by more than 10% relatively. Compared with the corresponding enhanced waveform signal based system, systems using enhanced spectral features obtain up to 24% relative improvement. These comparisons show that speech enhancement is helpful for robust ASR and that enhanced spectral features are more suitable for ASR tasks than enhanced waveform signals.

2019.07.05

IEEE/ACM TASLP 2019

Title:
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling

Abstract:
Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.

2019.02.07

ICASSP 2019

Title:
Improving Speech Recognition Error Prediction For Modern and Off-the-Shelf Speech Recognizers

Abstract:
Modeling the errors of a speech recognizer can help simulate errorful recognized speech data from plain text, which has proven useful for tasks like discriminative language modeling, improving robustness of NLP systems, where limited or even no audio data is available at train time. Previous work typically considered replicating behavior of GMM-HMM based systems, but the behavior of more modern posterior-based neural network acoustic models is not the same and requires adjustments to the error prediction model. In this work, we extend a prior phonetic confusion based model for predicting speech recognition errors in two ways: first, we introduce a sampling-based paradigm that better simulates the behavior of a posterior-based acoustic model. Second, we investigate replacing the confusion matrix with a sequence-to-sequence model in order to introduce context dependency into the prediction. We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. Sampling greatly improves predictive accuracy within a 100-guess paradigm, while the sequence model performs similarly to the confusion matrix.

2018.10.31

ICASSP 2019

Title:
Token-Wise Training for Attention Based End-to-End Speech Recognition

Abstract:
In attention based end-to-end (A-E2E) speech recognition systems, the dependency between output tokens is typically formulated as an input-output mapping in decoder. Due to such dependency, decoding errors can easily propagate along output sequence. In this paper, we propose a token-wise training (TWT) method for A-E2E models. The new method is flexible and can be combined with a variety of loss functions. Applying TWT to multiple hypotheses, we propose a novel TWT in beam (TWTiB) training scheme. Trained on the benchmark Switchboard (SWBD) 300h corpus, TWTiB outperforms the previous best training scheme on the SWBD evaluation subset.

2018.10.31

SLT 2018

Title:
Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions

Abstract:
Acoustic model and language model (LM) have been two major components in conventional speech recognition systems. They are normally trained independently, but recently there has been a trend to optimize both components simultaneously in a unified end-to-end (E2E) framework. However, the performance gap between the E2E systems and the traditional hybrid systems suggests that some knowledge has not yet been fully utilized in the new framework. An observation is that the current attention-based E2E systems could produce better recognition results when decoded with LMs which are independently trained with the same resource.

In this paper, we focus on how to improve attention-based E2E systems without increasing model complexity or resorting to extra data. A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss. The sequence-based minimum Bayes risk (MBR) loss is also investigated. Our experiments on SWB 300hrs showed that both loss functions could significantly improve the baseline model performance. The additional gain from joint-LM decoding remains the same for CTC trained model but is only marginal for MBR trained model. This implies that while CTC loss function is able to capture more acoustic knowledge, MBR loss function exploits more lexicon dependency.

2018.07.30

ICASSP 2018

Title:
Filter-and-Convolve: A CNN Based Multichannel Complex Concatenation Acoustic Model

Abstract:
We propose a convolutional neural network (CNN) based multichannel complex-domain concatenation acoustic model. The proposed model extracts speech-specific information from multichannel noisy speech signals. In addition, we design two CNN templates that have wide applicability and several speaker adaptation methods for the multichannel complex concatenation acoustic model. Even with a simple BeamformIt beamformer and the baseline language model, our method obtains a word error rate (WER) of 5.39% on the CHiME-4 corpus, outperforming the previous best result by 13.06% relatively. Using an MVDR beamformer, our model outperforms the corresponding best system by 9.77% relatively.

2017.11.06

ICASSP 2018

Title:
Utterance-Wise Recurrent Dropout and Iterative Speaker Adaptation for Robust Monaural Speech Recognition

Abstract:
This study addresses monaural (single-microphone) automatic speech recognition (ASR) in adverse acoustic conditions. Our study builds on a state-of-the-art monaural robust ASR method that uses a wide residual network with bidirectional long short-term memory (BLSTM). We propose a novel utterance-wise dropout method for training LSTM networks and an iterative speaker adaptation technique. When evaluated on the monaural speech recognition task of the CHiME-4 corpus, our model yields a word error rate (WER) of 8.28% using the baseline language model, outperforming the previous best monaural ASR by 16.19% relatively.

2017.11.06

Publications

Back to Home Page

arXiv

INTERSPEECH 2024

ICASSP 2024

ICASSP 2024

ASRU 2023

ASRU 2023

ASRU 2023

ASRU 2023

INTERSPEECH 2023

ICASSP 2023

ICASSP 2023

INTERSPEECH 2022

INTERSPEECH 2022

arXiv

ICASSP 2022

INTERSPEECH 2021

IEEE/ACM TASLP 2020

arXiv

IEEE/ACM TASLP 2020

IEEE/ACM TASLP 2020

ASRU 2019

INTERSPEECH 2019

INTERSPEECH 2019

INTERSPEECH 2019

IEEE/ACM TASLP 2019

ICASSP 2019

ICASSP 2019

SLT 2018

ICASSP 2018

ICASSP 2018