712 Hits in 3.7 sec

Gated Recurrent Unit Based Acoustic Modeling with Future Context

Jie Li, Xiaorui Wang, Yuanyuan Zhao, Yan Li
2018 Interspeech 2018  
In this paper, we attempt to design an RNN acoustic model capable of utilizing future context effectively and directly, while keeping model latency and computation cost as low as possible.  ...  Two context modules, temporal encoding and temporal convolution, are specifically designed for this architecture to model the future context.  ...  TDNN-LSTM [7] is one of the most powerful acoustic models that can utilize future context effectively while having relatively low model latency.  ... 
doi:10.21437/interspeech.2018-1544 dblp:conf/interspeech/LiWZL18 fatcat:jz7sbbhfanf7pj3ndtlsmdwhfi
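The temporal-convolution context module named in this snippet can be made concrete. Below is a minimal, hypothetical PyTorch sketch (not the authors' implementation; dimensions and the window size are illustrative): a unidirectional GRU whose output is augmented with a few future frames via a right-looking 1-D convolution, so latency grows only by the convolution's right context.

```python
import torch
import torch.nn as nn

class TemporalConvContext(nn.Module):
    """Hypothetical sketch of a 'temporal convolution' context module:
    each output frame sees `future` frames of right context, so model
    latency grows by `future` frames only."""
    def __init__(self, dim: int, future: int = 3):
        super().__init__()
        # kernel covers the current frame plus `future` frames to the right
        self.conv = nn.Conv1d(dim, dim, kernel_size=future + 1)
        self.future = future

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); pad only on the right so the conv
        # looks strictly at current + future frames
        x = x.transpose(1, 2)                       # (batch, dim, time)
        x = nn.functional.pad(x, (0, self.future))  # right-pad `future` frames
        return self.conv(x).transpose(1, 2)         # (batch, time, dim)

gru = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
ctx = TemporalConvContext(dim=256, future=3)
feats = torch.randn(4, 100, 80)   # (batch, frames, fbank)
out, _ = gru(feats)
out = ctx(out)                    # each frame now sees 3 future frames
```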

Lower Frame Rate Neural Network Acoustic Models

Golan Pundak, Tara N. Sainath
2016 Interspeech 2016  
As opposed to conventional models, CTC learns an alignment jointly with the acoustic model, and outputs a blank symbol in addition to the regular acoustic state units.  ...  Recently, neural network acoustic models trained with Connectionist Temporal Classification (CTC) were proposed as an alternative approach to conventional cross-entropy trained neural network acoustic models  ...  Acknowledgements The authors would like to thank Michiel Bacchiani and Johan Schalkwyk for suggesting and supporting the research and Olivier Siohan and Hasim Sak for useful discussions.  ... 
doi:10.21437/interspeech.2016-275 dblp:conf/interspeech/PundakS16 fatcat:3xv7utz2anckdf57wbzq3n7e3y
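The blank symbol and jointly learned alignment that this snippet contrasts with cross-entropy training are exactly what the standard CTC loss provides. A minimal sketch using torch.nn.CTCLoss (illustrative dimensions, not the paper's setup):

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not the paper's configuration.
num_states, blank = 42, 0          # CTC adds a blank unit, here index 0
T, B, U = 120, 8, 30               # input frames, batch size, target length

# Acoustic model emits log-probabilities over states + blank.
log_probs = torch.randn(T, B, num_states + 1).log_softmax(dim=-1)
targets = torch.randint(1, num_states + 1, (B, U))  # no blanks in targets
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# CTC marginalizes over all alignments (blank insertions / label repeats),
# so no frame-level alignment is needed, unlike cross-entropy training.
ctc = nn.CTCLoss(blank=blank)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```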

Decoupled Spatial and Temporal Processing for Resource Efficient Multichannel Speech Enhancement [article]

Ashutosh Pandey, Buye Xu
2024 arXiv   pre-print
We present a novel model designed for resource-efficient multichannel speech enhancement in the time domain, with a focus on low latency, a lightweight footprint, and low computational requirements.  ...  The temporal processing is applied over a single-channel output stream from the spatial processing using a Long Short-Term Memory (LSTM) network.  ...  [5] , who proposed a convolutional recurrent model for lightweight, low-compute, and low-latency multichannel speech enhancement. Meanwhile, Pandey et al.  ... 
arXiv:2401.07879v1 fatcat:53kwv5doqjasrbl2yuxh3nbbla
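The decoupling described in this snippet can be sketched as a learned spatial filter that collapses the microphone channels into one stream, followed by a unidirectional LSTM over that stream. A hypothetical simplification, not the authors' architecture (real systems would typically operate on frames or filterbank features rather than raw samples):

```python
import torch
import torch.nn as nn

class DecoupledEnhancer(nn.Module):
    """Hypothetical sketch: spatial processing collapses M channels to 1,
    then an LSTM models temporal context on the single output stream."""
    def __init__(self, mics: int = 4, dim: int = 128):
        super().__init__()
        # Spatial processing: a 1x1 conv mixes channels per time step.
        self.spatial = nn.Conv1d(mics, 1, kernel_size=1)
        # Temporal processing: unidirectional LSTM keeps latency low.
        self.temporal = nn.LSTM(1, dim, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, mics, samples) multichannel waveform
        mono = self.spatial(x)                  # (batch, 1, samples)
        h, _ = self.temporal(mono.transpose(1, 2))
        return self.out(h).transpose(1, 2)      # (batch, 1, samples)

model = DecoupledEnhancer()
enhanced = model(torch.randn(2, 4, 16000))      # 1 s at 16 kHz, 4 mics
```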

Future Context Attention for Unidirectional LSTM Based Acoustic Model

Jian Tang, Shiliang Zhang, Si Wei, Li-Rong Dai
2016 Interspeech 2016  
of a kind of attention mechanism for unidirectional LSTM based acoustic models.  ...  Recently, feedforward sequential memory networks (FSMN) have shown a strong ability to model past and future long-term dependencies in speech signals without using recurrent feedback, and have achieved better  ...  In contrast to BLSTM, LSTM has no time-latency shortcoming. Therefore, it is desirable to incorporate future context into LSTM so that it performs as well as BLSTM with low time latency.  ... 
doi:10.21437/interspeech.2016-185 dblp:conf/interspeech/TangZWD16 fatcat:2st7ge45xzc6rpxxsrp3nsym3i
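Attending over a short window of future frames on top of a unidirectional LSTM, as this entry proposes, might look like the following. A minimal hypothetical sketch (window size and dimensions are illustrative, and the scoring function is a deliberately simple stand-in):

```python
import torch
import torch.nn as nn

class FutureContextAttention(nn.Module):
    """Hypothetical sketch: each LSTM output frame attends over the next
    `window` frames, adding bounded latency instead of a full BLSTM pass."""
    def __init__(self, dim: int, window: int = 5):
        super().__init__()
        self.window = window
        self.score = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, dim); gather each frame's `window` future frames
        B, T, D = h.shape
        pad = h.new_zeros(B, self.window, D)
        hp = torch.cat([h, pad], dim=1)                         # right-pad
        fut = hp.unfold(1, self.window, 1)[:, 1:T + 1]          # (B, T, D, window)
        fut = fut.permute(0, 1, 3, 2)                           # (B, T, window, D)
        w = torch.softmax(self.score(fut).squeeze(-1), dim=-1)  # (B, T, window)
        ctx = (w.unsqueeze(-1) * fut).sum(dim=2)                # (B, T, D)
        return torch.cat([h, ctx], dim=-1)      # frame + future summary

lstm = nn.LSTM(80, 256, batch_first=True)
att = FutureContextAttention(256, window=5)
h, _ = lstm(torch.randn(2, 100, 80))
y = att(h)                                      # (2, 100, 512)
```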

Improving Gated Recurrent Unit Based Acoustic Modeling with Batch Normalization and Enlarged Context [article]

Jie Li, Yahui Shan, Xiaorui Wang, Yan Li
2018 arXiv   pre-print
The use of future contextual information is typically shown to be helpful for acoustic modeling.  ...  This model, mGRUIP with context module (mGRUIP-Ctx), has been shown to be capable of utilizing future context effectively, while maintaining quite low model latency and computation cost.  ...  [8] proposed the use of temporal convolution, in the form of TDNN layers, for modeling the future temporal context while affording inference with frame-level increments.  ... 
arXiv:1811.10169v1 fatcat:eodop4tmkzhu7mzyvrkiejinru
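The TDNN-layer formulation mentioned in this snippet splices frames at fixed offsets, including future ones, so inference can still proceed in frame-level increments. A hypothetical sketch of one such layer (offsets and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """Hypothetical TDNN layer: splice frames at fixed offsets (including
    future ones, e.g. {-2, 0, +2}) and apply a shared affine transform."""
    def __init__(self, in_dim: int, out_dim: int, offsets=(-2, 0, 2)):
        super().__init__()
        self.offsets = offsets
        self.affine = nn.Linear(in_dim * len(offsets), out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); shift-and-concat implements the splicing
        T = x.size(1)
        spliced = []
        for o in self.offsets:
            idx = torch.arange(T, device=x.device) + o
            idx = idx.clamp(0, T - 1)        # replicate edges
            spliced.append(x[:, idx])
        return self.affine(torch.cat(spliced, dim=-1))

# The stack's right context is the sum of positive offsets per layer:
# two layers with +2 each add 4 frames of model latency.
layer = TDNNLayer(80, 256)
out = layer(torch.randn(4, 100, 80))         # (4, 100, 256)
```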

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition [article]

Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer
2020 arXiv   pre-print
The proposed system equips end-to-end models with streaming capability and reduces the large footprint of the streaming attention-based model using augmented memory.  ...  These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming.  ...  remain low-latency.  ... 
arXiv:2011.07120v1 fatcat:5ywnqzfuind75n4zdc3xkq5er4
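The augmented-memory idea summarized above (each chunk attends over itself plus a bank of summary vectors from past chunks, keeping the attention span bounded) can be sketched roughly. A hypothetical simplification in which each chunk's memory slot is a plain mean, standing in for the learned summarization:

```python
import torch
import torch.nn as nn

class AugmentedMemoryAttention(nn.Module):
    """Hypothetical sketch: each fixed-size chunk attends over itself plus a
    memory bank holding one summary vector per previous chunk, keeping the
    attention span (and thus latency/compute) bounded for streaming."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, chunk: int = 16) -> torch.Tensor:
        B, T, D = x.shape
        memory, outputs = [], []
        for s in range(0, T, chunk):
            seg = x[:, s:s + chunk]
            # keys/values: memory bank from past chunks + the current chunk
            bank = torch.cat(memory + [seg], dim=1)
            out, _ = self.attn(seg, bank, bank)
            outputs.append(out)
            # summarize the chunk into one vector and append to the bank
            memory.append(seg.mean(dim=1, keepdim=True))
        return torch.cat(outputs, dim=1)

m = AugmentedMemoryAttention(dim=64)
y = m(torch.randn(2, 80, 64))   # processed chunk by chunk, bounded span
```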

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention [article]

Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer
2019 arXiv   pre-print
Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and in capturing context.  ...  We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce the frame rate for efficient inference, and 2) using truncated self-attention to enable streaming for Transformer  ...  These issues are critical for self-attention to work in scenarios demanding low latency and low computation, such as on-device speech recognition [6] .  ... 
arXiv:1910.12977v1 fatcat:xfzkd24mczgw5f6fl2taz2znya
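Truncated self-attention, the key streaming ingredient listed here, amounts to masking the attention to a bounded left and right window so lookahead (and hence latency) stays fixed. A minimal hypothetical sketch of such a mask (window sizes are illustrative):

```python
import torch
import torch.nn as nn

def truncated_attn_mask(T: int, left: int, right: int) -> torch.Tensor:
    """Boolean mask for truncated self-attention: frame t may only attend
    to frames in [t - left, t + right]. `right` bounds the lookahead and
    therefore the latency; True entries are masked out."""
    i = torch.arange(T).unsqueeze(1)   # query positions
    j = torch.arange(T).unsqueeze(0)   # key positions
    return (j < i - left) | (j > i + right)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 100, 64)
mask = truncated_attn_mask(T=100, left=20, right=4)  # 4 frames of lookahead
y, _ = attn(x, x, x, attn_mask=mask)
```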

Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications [article]

Yongqiang Wang, Yangyang Shi, Frank Zhang, Chunyang Wu, Julian Chan, Ching-Feng Yeh, Alex Xiao
2020 arXiv   pre-print
Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium-latency tasks and LSTM on low-latency tasks.  ...  We compare the transformer-based acoustic models with their LSTM counterparts on industrial-scale tasks.  ...  LSTM-based acoustic models In practice, unidirectional LSTM-based acoustic models are widely used in low-latency ASR scenarios.  ... 
arXiv:2010.14665v2 fatcat:k5umdj3imrfcnbybfb5cucn2ne
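The latency-controlled BLSTM baseline named here runs the backward LSTM over a fixed chunk plus a short lookahead instead of the whole utterance, so latency is bounded by the chunk size. A hypothetical sketch (chunk and lookahead sizes are illustrative):

```python
import torch
import torch.nn as nn

class LCBLSTM(nn.Module):
    """Hypothetical latency-controlled BLSTM: the forward LSTM streams
    normally; the backward LSTM sees only `chunk + lookahead` frames,
    so latency is bounded by the chunk size, not utterance length."""
    def __init__(self, in_dim: int, dim: int):
        super().__init__()
        self.fwd = nn.LSTM(in_dim, dim, batch_first=True)
        self.bwd = nn.LSTM(in_dim, dim, batch_first=True)

    def forward(self, x, chunk: int = 40, lookahead: int = 10):
        T = x.size(1)
        f, _ = self.fwd(x)                        # full left context
        outs = []
        for s in range(0, T, chunk):
            win = x[:, s:min(s + chunk + lookahead, T)]
            b, _ = self.bwd(win.flip(1))          # backward over bounded window
            b = b.flip(1)[:, :min(chunk, T - s)]  # keep the chunk part only
            outs.append(b)
        return torch.cat([f, torch.cat(outs, dim=1)], dim=-1)

model = LCBLSTM(80, 256)
y = model(torch.randn(2, 200, 80))                # (2, 200, 512)
```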

A Novel Temporal Attentive-Pooling based Convolutional Recurrent Architecture for Acoustic Signal Enhancement [article]

Tassadaq Hussain, Wei-Chien Wang, Mandar Gogate, Kia Dashtipour, Yu Tsao, Xugang Lu, Adeel Ahsan, Amir Hussain
2022 arXiv   pre-print
Specifically, we first utilize a convolutional layer to extract local information from the acoustic signals, and then a recurrent neural network (RNN) architecture is used to characterize temporal contextual  ...  The proposed ASE system is evaluated using a benchmark infant cry dataset and compared with several well-known methods.  ...  Gogate and T. Hussain are supported by the UK Engineering and Physical Sciences Research Council (EPSRC), grant reference EP/T021063/1.  ... 
arXiv:2201.09913v1 fatcat:nromw5clt5dyxg6gbjtfh2ef4e
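The pipeline this snippet describes (convolution for local structure, an RNN for temporal context, attentive pooling on top) can be sketched roughly as below. A hypothetical sketch, not the authors' model; the spectrogram size, dimensions, and mask-style output are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TAPConvRNN(nn.Module):
    """Hypothetical sketch of the conv -> RNN -> temporal attentive-pooling
    pattern: the conv extracts local spectral structure, the LSTM adds
    temporal context, and attention weights pick the frames that dominate
    the utterance-level summary."""
    def __init__(self, freq: int = 257, dim: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(freq, dim, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)
        self.mask = nn.Linear(dim, freq)

    def forward(self, spec):                        # spec: (batch, time, freq)
        h = self.conv(spec.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(h)
        w = torch.softmax(self.score(h), dim=1)     # attention over time
        pooled = (w * h).sum(dim=1, keepdim=True)   # utterance summary
        return torch.sigmoid(self.mask(h + pooled)) # enhancement mask per frame

m = TAPConvRNN()
mask = m(torch.randn(2, 100, 257))   # multiply with the noisy spectrogram
```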

Acoustic modelling with CD-CTC-sMBR LSTM RNNs

Andrew Senior, Hasim Sak, Felix de Chaumont Quitry, Tara Sainath, Kanishka Rao
2015 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)  
Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations and the application to child speech recognition; combination of multiple models, and convolutional  ...  We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system.  ...  Sections 3.1 and 3.2 demonstrate improved inference and decoding speed with our low-frame-rate CTC models, and show how constraints during training can limit latency in decoding.  ... 
doi:10.1109/asru.2015.7404851 dblp:conf/asru/SeniorSQSR15 fatcat:cdx2r37ggzddxi5albmcst2iie

Recent progresses in deep learning based acoustic models

Dong Yu, Jinyu Li
2017 IEEE/CAA Journal of Automatica Sinica  
We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combinations  ...  We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion  ...  It can be considered as a unified way of using LSTM for temporal, spectral, and spatial computation.  ... 
doi:10.1109/jas.2017.7510508 fatcat:zcffvbg75bhllcekqghkmwidsy

Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping [article]

Linhao Dong, Feng Wang, Bo Xu
2019 arXiv   pre-print
In addition, the chunk-hopping mechanism allows the SAA to have only a 2.5% relative CER degradation with a 320 ms latency.  ...  However, it is not clear whether the self-attention network could be a good alternative to RNNs in automatic speech recognition (ASR), which processes longer speech sequences and may have online recognition  ...  The RNN-based baseline uses the Extended-RNA model in [11] , which leverages a 4-layer bidirectional LSTM (BLSTM) [18] as the encoder and a 1-layer LSTM [19] as the decoder.  ... 
arXiv:1902.06450v1 fatcat:xujp65wgivhsbn2p65clvgmtcm
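The chunk-hopping mechanism credited with the 320 ms latency figure slices the input into overlapping chunks so each chunk carries some past and future context for the encoder. A hypothetical sketch of the slicing only (chunk and context sizes are illustrative):

```python
import torch

def chunk_hopping(x: torch.Tensor, chunk: int, left: int, right: int):
    """Hypothetical sketch: yield overlapping windows of x (batch, time, dim).
    Each central `chunk` is flanked by `left` past and `right` future frames;
    the encoder runs per window, so latency is bounded by chunk + right."""
    T = x.size(1)
    for s in range(0, T, chunk):
        lo, hi = max(0, s - left), min(T, s + chunk + right)
        yield x[:, lo:hi], s - lo   # window, plus offset of its center part

x = torch.randn(1, 200, 80)
for window, center in chunk_hopping(x, chunk=32, left=16, right=16):
    # encode `window`, then keep only outputs for frames
    # [center, center + 32) to stitch the full sequence back together
    pass
```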

CAT: CRF-based ASR Toolkit [article]

Keyu An, Hongyu Xiang, Zhijian Ou
2019 arXiv   pre-print
Towards flexibility, we show that i-vector based speaker-adapted recognition and a latency-control mechanism can be explored easily and effectively in CAT.  ...  compared with the hybrid DNN-HMM models.  ...  Wav2letter++ is based solely on convolutional neural networks, which use restricted future context and realize low latency.  ... 
arXiv:1911.08747v1 fatcat:meychp57xjd7bhu3ell4dkd2pa
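The i-vector based speaker adaptation mentioned here is commonly realized by appending a fixed per-utterance speaker embedding to every acoustic frame; whether CAT does exactly this is an assumption. A minimal sketch of that common recipe:

```python
import torch

def append_ivector(frames: torch.Tensor, ivector: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of i-vector speaker adaptation: tile the
    per-utterance i-vector across time and concatenate it to each frame,
    so the acoustic model can condition on speaker characteristics."""
    B, T, _ = frames.shape
    tiled = ivector.unsqueeze(1).expand(B, T, ivector.size(-1))
    return torch.cat([frames, tiled], dim=-1)

feats = torch.randn(4, 100, 40)        # fbank frames
ivec = torch.randn(4, 100)             # 100-dim i-vectors, one per utterance
adapted = append_ivector(feats, ivec)  # (4, 100, 140)
```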

Attention-based End-to-End Models for Small-Footprint Keyword Spotting

Changhao Shan, Junbo Zhang, Yujun Wang, Lei Xie
2018 Interspeech 2018  
Our model consists of an encoder and an attention mechanism. The encoder transforms the input signal into a high-level representation using RNNs.  ...  We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN.  ...  Acknowledgements The authors would like to thank Jingyong Hou for helpful comments and suggestions.  ... 
doi:10.21437/interspeech.2018-1777 dblp:conf/interspeech/ShanZWX18 fatcat:tefhrrsnvndwvirmh2dug6waxy
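The encoder-plus-attention structure described in this snippet reduces to soft-attention pooling over RNN states followed by a small classifier. A hypothetical sketch (feature and layer sizes are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class AttentionKWS(nn.Module):
    """Hypothetical sketch of attention-based small-footprint keyword
    spotting: a GRU encoder, soft attention pooling over its states,
    and a linear layer scoring keyword presence."""
    def __init__(self, feat: int = 40, dim: int = 128, n_keywords: int = 1):
        super().__init__()
        self.encoder = nn.GRU(feat, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)
        self.cls = nn.Linear(dim, n_keywords)

    def forward(self, x):                        # x: (batch, time, feat)
        h, _ = self.encoder(x)                   # high-level representation
        w = torch.softmax(self.score(h), dim=1)  # attention over frames
        c = (w * h).sum(dim=1)                   # fixed-length summary
        return torch.sigmoid(self.cls(c))        # keyword posterior

model = AttentionKWS()
p = model(torch.randn(8, 150, 40))               # (8, 1) detection scores
```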

A review of on-device fully neural end-to-end automatic speech recognition algorithms [article]

Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar, Sungsoo Kim, Abhinav Garg, Changwoo Han
2021 arXiv   pre-print
Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text-normalizer, an inverse-text normalizer,  ...  Examples include speech recognition systems based on Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder models (AED), Monotonic Chunk-wise  ...  Further improvement is achieved by combining a streaming model with a low-latency non-streaming model, by applying shallow-fusion with a Language Model (LM), and by applying spell correction using a list  ... 
arXiv:2012.07974v3 fatcat:uxpxqcgcvvg7dfrkl2rxekkmse
Showing results 1–15 of 712