Title:
Model Compression for End-to-end Speech Recognition

Abstract:
Model compression is important to on-device automatic speech recognition. In this study, we propose three weight-sharing-based model compression methods to compress a well-trained conformer-based end-to-end speech recognition system without retraining the model. These methods are pruning without retraining, submatrix weight sharing, and full-range sensitivity analysis. On the LibriSpeech corpus, the proposed methods together achieve 9-fold model compression with negligible performance degradation. In addition, the proposed methods can work with 8-bit weight quantization. With model retraining, the proposed techniques achieve 20-fold or 40-fold model compression if some increase in word error rate can be tolerated.
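
As a rough illustration of two of the listed ideas, the numpy sketch below applies magnitude pruning directly to a trained weight matrix (no retraining pass) and a toy form of submatrix weight sharing that ties 2x2 blocks to a small codebook. The function names, the nearest-centroid assignment, the codebook construction, and the block size are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def prune_without_retraining(W, sparsity=0.5):
    """Zero out the smallest-magnitude weights of a trained matrix."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) < thresh, 0.0, W)

def submatrix_weight_sharing(W, block=2, n_shared=4):
    """Tie blocks of W to a small codebook of shared submatrices.
    Here the first n_shared blocks serve as a fixed codebook and each
    block is replaced by its nearest codebook entry (illustrative;
    k-means over blocks would refine the codebook further)."""
    r, c = W.shape
    blocks = W.reshape(r // block, block, c // block, block).transpose(0, 2, 1, 3)
    flat = blocks.reshape(-1, block * block)
    codebook = flat[:n_shared]
    idx = np.argmin(((flat[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    tied = codebook[idx].reshape(r // block, c // block, block, block)
    return tied.transpose(0, 2, 1, 3).reshape(r, c), idx
```

Only the codebook entries and the per-block indices need to be stored, which is where the compression comes from.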

2021.06.03

Title:
Multitask Training with Text Data for End-to-End Speech Recognition

Abstract:
We propose a multitask training method for attention-based end-to-end speech recognition models to better incorporate language level information. We regularize the decoder in a sequence-to-sequence architecture by multitask training it on both the speech recognition task and a next-token prediction language modeling task. Trained on either the 100-hour subset of LibriSpeech or the full 960-hour dataset, the proposed method leads to an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. Analyses of sample output sentences and the word error rate on rare words demonstrate that the proposed method can incorporate language level information effectively.
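
The multitask idea, sharing the decoder between the ASR task and a next-token language modeling task, can be sketched as an interpolated cross-entropy objective; the `lm_weight` value and function names below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy from unnormalized logits."""
    z = logits - logits.max(-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -np.mean(logp[np.arange(len(targets)), targets])

def multitask_loss(asr_logits, asr_targets, lm_logits, lm_targets, lm_weight=0.3):
    """Regularize the shared decoder with a next-token LM objective:
    both logit streams come from the same decoder, one conditioned on
    speech, one on text alone."""
    return cross_entropy(asr_logits, asr_targets) + lm_weight * cross_entropy(lm_logits, lm_targets)
```

Because the LM term trains the same decoder parameters, no extra neural network is needed at decoding time.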

2021.06.03

Title:
Multi-Microphone Complex Spectral Mapping for Utterance-wise and Continuous Speaker Separation

Abstract:
We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for offline utterance-wise and block-online continuous speaker separation in reverberant conditions, aiming at both speaker separation and dereverberation. Assuming a fixed array geometry between training and testing, we train deep neural networks (DNNs) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online continuous speaker separation (CSS). Although our system is trained on simulated room impulse responses (RIRs) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
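
A minimal sketch of the input/target layout used by complex spectral mapping, assuming the STFTs are available as complex arrays: the DNN input stacks the real and imaginary components of every microphone, and the training target is the RI pair of the reference channel. Shapes and function names are illustrative.

```python
import numpy as np

def stack_ri_features(stfts):
    """DNN input for multi-microphone complex spectral mapping: the real
    and imaginary STFT components of all microphones, stacked along the
    channel axis. stfts: (n_mics, frames, freq) complex array."""
    return np.concatenate([stfts.real, stfts.imag], axis=0)

def ri_target(clean_stfts, ref_mic=0):
    """Training target: RI components of the target speech at the
    reference microphone."""
    ref = clean_stfts[ref_mic]
    return np.stack([ref.real, ref.imag], axis=0)
```

Predicting RI components directly (rather than a magnitude mask) is what lets the network recover phase as well as magnitude.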

2021.06.03

Title:
Speaker Separation Using Speaker Inventories and Estimated Speech

Abstract:
We propose speaker separation using speaker inventories and estimated speech (SSUSIES), a framework leveraging speaker profiles and estimated speech for speaker separation. SSUSIES contains two methods, speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES). SSUSI performs speaker separation with the help of a speaker inventory. By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches. SSUES is a widely applicable technique that can substantially improve speaker separation performance using the output of first-pass separation. We evaluate the models on both speaker separation and speech recognition metrics.
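
Permutation invariant training (PIT), which SSUSI builds on, evaluates the loss under every assignment of network outputs to reference speakers and keeps the best one; a brute-force numpy sketch (fine for two or three speakers, where the number of permutations is tiny):

```python
import numpy as np
from itertools import permutations

def pit_mse(estimates, references):
    """Permutation invariant MSE: compute the pairwise loss matrix, then
    take the speaker ordering that minimizes the total loss."""
    n = len(references)
    pair = np.array([[np.mean((e - r) ** 2) for r in references] for e in estimates])
    best = min(permutations(range(n)), key=lambda p: sum(pair[i, p[i]] for i in range(n)))
    return sum(pair[i, best[i]] for i in range(n)) / n, best
```

The returned permutation tells which output belongs to which speaker, which is exactly the label ambiguity PIT resolves.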

2020.05.13

Title:
Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

Abstract:
This study proposes a complex spectral mapping approach for single- and multi-channel speech enhancement, where deep neural networks (DNNs) are used to predict the real and imaginary (RI) components of the direct-path signal from noisy and reverberant ones. The proposed system contains two DNNs. The first one performs single-channel complex spectral mapping. The estimated complex spectra are used to compute a minimum variance distortionless response (MVDR) beamformer. The RI components of beamforming results, which encode spatial information, are then combined with the RI components of the mixture to train the second DNN for multi-channel complex spectral mapping. With estimated complex spectra, we also propose a novel method of time-varying beamforming. State-of-the-art performance is obtained on the speech enhancement and recognition tasks of the CHiME-4 corpus. More specifically, our system obtains 6.82%, 3.19% and 1.99% word error rates (WER) respectively on the single-, two-, and six-microphone tasks of CHiME-4, significantly surpassing the current best results of 9.15%, 3.91% and 2.24% WER.
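
The MVDR step can be sketched with the common covariance-based, reference-channel formulation, in which the per-frequency weights are derived from estimated speech and noise spatial covariance matrices. This is a generic textbook form, not necessarily the exact variant used in the paper.

```python
import numpy as np

def mvdr_weights(phi_s, phi_n, ref=0):
    """MVDR beamformer from speech (phi_s) and noise (phi_n) spatial
    covariance matrices at one frequency bin, using the reference-channel
    formulation w = (Phi_n^{-1} Phi_s e_ref) / trace(Phi_n^{-1} Phi_s)."""
    num = np.linalg.solve(phi_n, phi_s)
    return num[:, ref] / np.trace(num)

def apply_beamformer(w, mix):
    """Apply the weights to a (mics, frames) mixture at one bin."""
    return w.conj() @ mix
```

With the covariances estimated from DNN-predicted complex spectra, this is how the enhanced spectra feed the beamforming stage.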

2020.01.09

Title:
Speech Separation Using Speaker Inventory

Abstract:
Overlapped speech is one of the main challenges in conversational speech applications such as meeting transcription. Blind speech separation and speech extraction are two common approaches to this problem. Both of them, however, suffer from an inability either to leverage additional information or to process multiple speakers simultaneously. In this work, we propose a novel method called speech separation using speaker inventory (SSUSI), which combines the advantages of both approaches and thus overcomes their limitations. SSUSI makes use of a speaker inventory, i.e., a pool of pre-enrolled speaker signals, and jointly separates all participating speakers. This is achieved by a specially designed attention mechanism, eliminating the need for accurate speaker identities. Experimental results show that SSUSI outperforms permutation invariant training based blind speech separation by up to 48% relatively in word error rate (WER). Compared with speech extraction, SSUSI reduces computation time by up to 70% and improves the WER by more than 13% relatively.
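
The attention over the inventory can be pictured as a soft selection of enrolled profiles, which is why exact speaker identities are not required; the dot-product scoring below is an illustrative stand-in for the paper's specially designed mechanism, and all shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def inventory_attention(mix_emb, inventory):
    """Attend over pre-enrolled speaker embeddings: each mixture frame
    pools a profile vector from the inventory by similarity, so no hard
    speaker-identity decision is needed.
    mix_emb: (frames, dim), inventory: (speakers, dim)."""
    scores = mix_emb @ inventory.T          # (frames, speakers) similarity
    weights = softmax(scores, axis=-1)      # soft selection of profiles
    return weights @ inventory, weights     # pooled profile per frame
```

A frame dominated by one enrolled speaker ends up attending almost entirely to that speaker's profile.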

2019.09.16

Title:
Large Margin Training for Attention Based End-to-End Speech Recognition

Abstract:
End-to-end speech recognition systems are typically evaluated using the maximum a posteriori criterion. Since only one hypothesis is involved during evaluation, the ideal number of hypotheses for training should also be one. In this study, we propose a large margin training scheme for attention based end-to-end speech recognition. Using only one training hypothesis, the large margin training strategy achieves the same performance as the minimum word error rate criterion using four hypotheses. The theoretical derivation in this study is widely applicable to other sequence discriminative criteria such as maximum mutual information. In addition, this paper provides a more succinct formulation of the large margin concept, paving the way towards a better combination of support vector machines and deep neural networks.
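
With a single competing hypothesis, the large margin idea reduces to a hinge loss on the sequence score difference, in the same spirit as the SVM objective the abstract alludes to; the sketch below is a toy form with an illustrative margin value, not the paper's exact formulation.

```python
def large_margin_loss(ref_score, hyp_score, margin=1.0):
    """Hinge-style large margin objective on sequence scores: penalize a
    competing hypothesis whenever its score comes within `margin` of the
    reference transcription's score. Only one hypothesis is needed,
    unlike N-best minimum word error rate training."""
    return max(0.0, margin - (ref_score - hyp_score))
```

The loss is zero once the reference beats the competitor by the full margin, so well-separated utterances stop contributing gradient.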

2019.07.05

Title:
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling

Abstract:
Monaural speech enhancement has made dramatic advances in recent years. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced the expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem and propose a distortion-independent acoustic modeling scheme. Experimental results show that the distortion-independent acoustic model is able to overcome the distortion problem. Moreover, it can be used with various speech enhancement models. Both the distortion-independent and the noise-dependent acoustic models perform better than the previous best system on the CHiME-2 corpus. The noise-dependent acoustic model achieves a word error rate of 8.7%, outperforming the previous best result by 6.5% relatively.

2019.07.05

Title:
Enhanced Spectral Features for Distortion-Independent Acoustic Modeling

Abstract:
It has recently been shown that a distortion-independent acoustic modeling method is able to overcome the distortion problem caused by speech enhancement. In this study, we improve the distortion-independent acoustic model by feeding it with enhanced spectral features. Using enhanced magnitude spectra, the automatic speech recognition (ASR) system achieves a word error rate of 7.8% on the CHiME-2 corpus, outperforming the previous best system by more than 10% relatively. Compared with the corresponding enhanced waveform signal based system, systems using enhanced spectral features obtain up to 24% relative improvement. These comparisons show that speech enhancement is helpful for robust ASR and that enhanced spectral features are more suitable for ASR tasks than enhanced waveform signals.

2019.07.05

Title:
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling

Abstract:
Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced the expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.

2019.02.07

Title:
Improving Speech Recognition Error Prediction For Modern and Off-the-Shelf Speech Recognizers

Abstract:
Modeling the errors of a speech recognizer can help simulate errorful recognized speech data from plain text, which has proven useful for tasks like discriminative language modeling and improving the robustness of NLP systems when limited or even no audio data is available at training time. Previous work typically considered replicating the behavior of GMM-HMM based systems, but the behavior of more modern posterior-based neural network acoustic models is not the same and requires adjustments to the error prediction model. In this work, we extend a prior phonetic confusion based model for predicting speech recognition errors in two ways: first, we introduce a sampling-based paradigm that better simulates the behavior of a posterior-based acoustic model; second, we investigate replacing the confusion matrix with a sequence-to-sequence model in order to introduce context dependency into the prediction. We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then by using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. Sampling greatly improves predictive accuracy within a 100-guess paradigm, while the sequence model performs similarly to the confusion matrix.
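
The sampling-based paradigm can be sketched as drawing each output phone from its row of a normalized confusion matrix, instead of always emitting the single most confusable phone; the matrix contents and function names below are illustrative assumptions.

```python
import numpy as np

def sample_errors(phones, confusion, rng=None):
    """Sampling-based error prediction: for each input phone index, draw
    a recognized phone from the corresponding row of the confusion
    matrix, where confusion[i, j] approximates P(recognized j | spoken i).
    Sampling better matches the stochastic behavior of a posterior-based
    acoustic model than a deterministic best-confusion lookup."""
    rng = rng or np.random.default_rng(0)
    probs = confusion / confusion.sum(axis=1, keepdims=True)
    return [int(rng.choice(len(probs), p=probs[p])) for p in phones]
```

Repeated sampling yields many distinct errorful realizations of the same text, which is what the 100-guess evaluation paradigm exploits.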

2018.10.31

Title:
Token-Wise Training for Attention Based End-to-End Speech Recognition

Abstract:
In attention based end-to-end (A-E2E) speech recognition systems, the dependency between output tokens is typically formulated as an input-output mapping in the decoder. Due to this dependency, decoding errors can easily propagate along the output sequence. In this paper, we propose a token-wise training (TWT) method for A-E2E models. The new method is flexible and can be combined with a variety of loss functions. Applying TWT to multiple hypotheses, we propose a novel TWT in beam (TWTiB) training scheme. Trained on the benchmark Switchboard (SWBD) 300h corpus, TWTiB outperforms the previous best training scheme on the SWBD evaluation subset.

2018.10.31

Title:
Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions

Abstract:
The acoustic model and the language model (LM) have been the two major components of conventional speech recognition systems. They are normally trained independently, but recently there has been a trend to optimize both components simultaneously in a unified end-to-end (E2E) framework. However, the performance gap between E2E systems and traditional hybrid systems suggests that some knowledge has not yet been fully utilized in the new framework. One observation is that current attention-based E2E systems can produce better recognition results when decoded with LMs that are independently trained on the same resources.

In this paper, we focus on how to improve attention-based E2E systems without increasing model complexity or resorting to extra data. A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss. The sequence-based minimum Bayes risk (MBR) loss is also investigated. Our experiments on the Switchboard 300-hour corpus show that both loss functions significantly improve baseline model performance. The additional gain from joint-LM decoding remains the same for the CTC-trained model but is only marginal for the MBR-trained model. This implies that while the CTC loss function captures more acoustic knowledge, the MBR loss function exploits more lexicon dependency.
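
The MBR objective over an N-best list is the expected error count under renormalized hypothesis posteriors; a minimal numpy sketch, assuming the hypothesis scores are log-domain model scores and the per-hypothesis word error counts are given:

```python
import numpy as np

def mbr_risk(hyp_scores, hyp_errors):
    """Minimum Bayes risk objective over an N-best list: the expected
    number of word errors under the softmax-renormalized hypothesis
    posteriors. Minimizing this pushes probability mass toward
    low-error hypotheses."""
    p = np.exp(hyp_scores - np.max(hyp_scores))  # stable softmax
    p /= p.sum()
    return float(p @ hyp_errors)
```

Unlike token-level cross-entropy, the risk is sequence-based: it directly couples the training criterion to the word error counts of whole hypotheses.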

2018.07.30

Title:
Filter-and-Convolve: A CNN Based Multichannel Complex Concatenation Acoustic Model

Abstract:
We propose a convolutional neural network (CNN) based multichannel complex-domain concatenation acoustic model. The proposed model extracts speech-specific information from multichannel noisy speech signals. In addition, we design two CNN templates that have wide applicability and several speaker adaptation methods for the multichannel complex concatenation acoustic model. Even with a simple BeamformIt beamformer and the baseline language model, our method obtains a word error rate (WER) of 5.39% on the CHiME-4 corpus, outperforming the previous best result by 13.06% relatively. Using an MVDR beamformer, our model outperforms the corresponding best system by 9.77% relatively.

2017.11.06

Title:
Utterance-Wise Recurrent Dropout and Iterative Speaker Adaptation for Robust Monaural Speech Recognition

Abstract:
This study addresses monaural (single-microphone) automatic speech recognition (ASR) in adverse acoustic conditions. Our study builds on a state-of-the-art monaural robust ASR method that uses a wide residual network with bidirectional long short-term memory (BLSTM). We propose a novel utterance-wise dropout method for training LSTM networks and an iterative speaker adaptation technique. When evaluated on the monaural speech recognition task of the CHiME-4 corpus, our model yields a word error rate (WER) of 8.28% using the baseline language model, outperforming the previous best monaural ASR by 16.19% relatively.
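
Utterance-wise dropout samples a single mask per utterance and reuses it at every time step, instead of resampling per frame as standard dropout does; a toy numpy sketch on a (time, features) matrix, with the inverted-dropout scaling as an illustrative choice:

```python
import numpy as np

def utterance_wise_dropout(x, rate=0.3, rng=None):
    """Sample ONE dropout mask per utterance and apply it at every time
    step of x (shape: (time, features)). Keeping the mask fixed across
    frames preserves the recurrent dynamics within the utterance."""
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(x.shape[1]) >= rate) / (1.0 - rate)  # inverted dropout
    return x * mask  # broadcast the same mask across all frames
```

Because the same units are dropped for the whole sequence, the recurrent state never sees a unit flicker on and off between frames.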

2017.11.06