Gated Recurrent Unit Based Acoustic Modeling with Future Context
2018
Interspeech 2018
In this paper, we attempt to design an RNN acoustic model capable of utilizing future context effectively and directly, with model latency and computation cost as low as possible. ...
Two context modules, temporal encoding and temporal convolution, are specifically designed for this architecture to model the future context. ...
TDNN-LSTM [7] is one of the most powerful acoustic models that can utilize future context effectively while having relatively low model latency. ...
doi:10.21437/interspeech.2018-1544
dblp:conf/interspeech/LiWZL18
fatcat:jz7sbbhfanf7pj3ndtlsmdwhfi
Lower Frame Rate Neural Network Acoustic Models
2016
Interspeech 2016
As opposed to conventional models, CTC learns an alignment jointly with the acoustic model, and outputs a blank symbol in addition to the regular acoustic state units. ...
Recently neural network acoustic models trained with Connectionist Temporal Classification (CTC) were proposed as an alternative approach to conventional cross-entropy trained neural network acoustic models ...
Acknowledgements The authors would like to thank Michiel Bacchiani and Johan Schalkwyk for suggesting and supporting the research and Olivier Siohan and Hasim Sak for useful discussions. ...
doi:10.21437/interspeech.2016-275
dblp:conf/interspeech/PundakS16
fatcat:3xv7utz2anckdf57wbzq3n7e3y
Decoupled Spatial and Temporal Processing for Resource Efficient Multichannel Speech Enhancement
[article]
2024
arXiv
pre-print
We present a novel model designed for resource-efficient multichannel speech enhancement in the time domain, with a focus on low latency, lightweight, and low computational requirements. ...
The temporal processing is applied over a single-channel output stream from the spatial processing using a Long Short-Term Memory (LSTM) network. ...
[5], who proposed a convolutional recurrent model for lightweight, low-compute, and low-latency multichannel speech enhancement. Meanwhile, Pandey et al. ...
arXiv:2401.07879v1
fatcat:53kwv5doqjasrbl2yuxh3nbbla
Future Context Attention for Unidirectional LSTM Based Acoustic Model
2016
Interspeech 2016
of a kind of attention mechanism for unidirectional LSTM based acoustic model. ...
Recently, feedforward sequential memory networks (FSMN) have shown a strong ability to model past and future long-term dependencies in speech signals without using recurrent feedback, and have achieved better ...
In contrast to BLSTM, LSTM has no time-latency shortcoming. Therefore, it is desirable to combine future context with LSTM so that it performs as well as BLSTM with low time latency. ...
doi:10.21437/interspeech.2016-185
dblp:conf/interspeech/TangZWD16
fatcat:2st7ge45xzc6rpxxsrp3nsym3i
Improving Gated Recurrent Unit Based Acoustic Modeling with Batch Normalization and Enlarged Context
[article]
2018
arXiv
pre-print
The use of future contextual information is typically shown to be helpful for acoustic modeling. ...
This model, mGRUIP with context module (mGRUIP-Ctx), has been shown to be capable of utilizing the future context effectively while maintaining quite low model latency and computation cost. ...
[8] proposed the use of temporal convolution, in the form of TDNN layers, for modeling the future temporal context while affording inference with frame-level increments. ...
arXiv:1811.10169v1
fatcat:eodop4tmkzhu7mzyvrkiejinru
Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition
[article]
2020
arXiv
pre-print
The proposed system equips the end-to-end models with the streaming capability and reduces the large footprint from the streaming attention-based model using augmented memory. ...
These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. ...
remain low-latency. ...
arXiv:2011.07120v1
fatcat:5ywnqzfuind75n4zdc3xkq5er4
Transformer-Transducer: End-to-End Speech Recognition with Self-Attention
[article]
2019
arXiv
pre-print
Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing context. ...
We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-attention to enable streaming for Transformer ...
These issues are critical for self-attention to work in scenarios demanding low-latency and low-computation such as on-device speech recognition [6] . ...
arXiv:1910.12977v1
fatcat:xfzkd24mczgw5f6fl2taz2znya
Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications
[article]
2020
arXiv
pre-print
Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium latency tasks and LSTM on low latency tasks. ...
We compare the transformer based acoustic models with their LSTM counterparts on industrial scale tasks. ...
LSTM-based acoustic models In practice, unidirectional LSTM-based acoustic models are widely used in low-latency ASR scenarios. ...
arXiv:2010.14665v2
fatcat:k5umdj3imrfcnbybfb5cucn2ne
A Novel Temporal Attentive-Pooling based Convolutional Recurrent Architecture for Acoustic Signal Enhancement
[article]
2022
arXiv
pre-print
Specifically, we first utilize a convolutional layer to extract local information of the acoustic signals and then a recurrent neural network (RNN) architecture is used to characterize temporal contextual ...
The proposed ASE system is evaluated using a benchmark infant cry dataset and compared with several well-known methods. ...
Gogate and T. Hussain are supported by the UK Engineering and Physical Sciences Research Council (EPSRC), grant reference EP/T021063/1. ...
arXiv:2201.09913v1
fatcat:nromw5clt5dyxg6gbjtfh2ef4e
Acoustic modelling with CD-CTC-SMBR LSTM RNNs
2015
2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations, application to child speech recognition, combination of multiple models, and convolutional ...
We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system. ...
Sections 3.1 and 3.2 demonstrate improved inference and decoding speed with our low-frame-rate CTC models, and show how constraints during training can limit latency in decoding. ...
doi:10.1109/asru.2015.7404851
dblp:conf/asru/SeniorSQSR15
fatcat:cdx2r37ggzddxi5albmcst2iie
Recent progresses in deep learning based acoustic models
2017
IEEE/CAA Journal of Automatica Sinica
We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combination ...
We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion ...
It can be considered as a unified way of using LSTM for temporal, spectral, and spatial computation. ...
doi:10.1109/jas.2017.7510508
fatcat:zcffvbg75bhllcekqghkmwidsy
Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping
[article]
2019
arXiv
pre-print
In addition, the chunk-hopping mechanism allows the SAA to have only a 2.5% relative CER degradation with a 320ms latency. ...
However, it is not clear if the self-attention network could be a good alternative to RNNs in automatic speech recognition (ASR), which processes longer speech sequences and may have online recognition ...
The RNN-based baseline uses the Extended-RNA model in [11], which leverages a 4-layer bidirectional LSTM (BLSTM) [18] as the encoder and a 1-layer LSTM [19] as the decoder. ...
arXiv:1902.06450v1
fatcat:xujp65wgivhsbn2p65clvgmtcm
CAT: CRF-based ASR Toolkit
[article]
2019
arXiv
pre-print
Towards flexibility, we show that i-vector based speaker-adapted recognition and latency control mechanism can be explored easily and effectively in CAT. ...
compared with the hybrid DNN-HMM models. ...
Wav2letter++ is based solely on convolutional neural networks, which use restricted future context and realize low latency. ...
arXiv:1911.08747v1
fatcat:meychp57xjd7bhu3ell4dkd2pa
Attention-based End-to-End Models for Small-Footprint Keyword Spotting
2018
Interspeech 2018
Our model consists of an encoder and an attention mechanism. The encoder transforms the input signal into a high level representation using RNNs. ...
We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN. ...
Acknowledgements The authors would like to thank Jingyong Hou for helpful comments and suggestions. ...
doi:10.21437/interspeech.2018-1777
dblp:conf/interspeech/ShanZWX18
fatcat:tefhrrsnvndwvirmh2dug6waxy
A review of on-device fully neural end-to-end automatic speech recognition algorithms
[article]
2021
arXiv
pre-print
Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text-normalizer, an inverse-text normalizer, ...
Examples include speech recognition systems based on Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder models (AED), Monotonic Chunk-wise ...
Further improvement is achieved by combining a streaming model with a low-latency non-streaming model, by applying shallow-fusion with a Language Model (LM), and by applying spell correction using a list ...
arXiv:2012.07974v3
fatcat:uxpxqcgcvvg7dfrkl2rxekkmse