712 Hits in 3.7 sec

Gated Recurrent Unit Based Acoustic Modeling with Future Context

Jie Li, Xiaorui Wang, Yuanyuan Zhao, Yan Li
2018 Interspeech 2018  
In this paper, we attempt to design an RNN acoustic model capable of utilizing future context effectively and directly, while keeping model latency and computation cost as low as possible.  ...  Two context modules, temporal encoding and temporal convolution, are specifically designed for this architecture to model the future context.  ...  TDNN-LSTM [7] is one of the most powerful acoustic models that can utilize future context effectively while having relatively low model latency.  ... 
doi:10.21437/interspeech.2018-1544 dblp:conf/interspeech/LiWZL18 fatcat:jz7sbbhfanf7pj3ndtlsmdwhfi
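The temporal-convolution context module named in this snippet can be made concrete. Below is a minimal, hypothetical PyTorch sketch (not the authors' implementation; dimensions and the window size are illustrative): a unidirectional GRU whose output is augmented with a few future frames via a right-looking 1-D convolution, so latency grows only by the convolution's right context.

```python
import torch
import torch.nn as nn

class TemporalConvContext(nn.Module):
    """Hypothetical sketch of a 'temporal convolution' context module:
    each output frame sees `future` frames of right context, so model
    latency grows by `future` frames only."""
    def __init__(self, dim: int, future: int = 3):
        super().__init__()
        # kernel covers the current frame plus `future` frames to the right
        self.conv = nn.Conv1d(dim, dim, kernel_size=future + 1)
        self.future = future

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); pad only on the right so the conv
        # looks strictly at current + future frames
        x = x.transpose(1, 2)                       # (batch, dim, time)
        x = nn.functional.pad(x, (0, self.future))  # right-pad `future` frames
        return self.conv(x).transpose(1, 2)         # (batch, time, dim)

gru = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
ctx = TemporalConvContext(dim=256, future=3)
feats = torch.randn(4, 100, 80)   # (batch, frames, fbank)
out, _ = gru(feats)
out = ctx(out)                    # each frame now sees 3 future frames
```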

Lower Frame Rate Neural Network Acoustic Models

Golan Pundak, Tara N. Sainath
2016 Interspeech 2016  
As opposed to conventional models, CTC learns an alignment jointly with the acoustic model, and outputs a blank symbol in addition to the regular acoustic state units.  ...  Recently, neural network acoustic models trained with Connectionist Temporal Classification (CTC) were proposed as an alternative approach to conventional cross-entropy trained neural network acoustic models  ...  Acknowledgements The authors would like to thank Michiel Bacchiani and Johan Schalkwyk for suggesting and supporting the research and Olivier Siohan and Hasim Sak for useful discussions.  ... 
doi:10.21437/interspeech.2016-275 dblp:conf/interspeech/PundakS16 fatcat:3xv7utz2anckdf57wbzq3n7e3y
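The blank symbol and jointly learned alignment that this snippet contrasts with cross-entropy training are exactly what the standard CTC loss provides. A minimal sketch using torch.nn.CTCLoss (illustrative dimensions, not the paper's setup):

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not the paper's configuration.
num_states, blank = 42, 0          # CTC adds a blank unit, here index 0
T, B, U = 120, 8, 30               # input frames, batch size, target length

# Acoustic model emits log-probabilities over states + blank.
log_probs = torch.randn(T, B, num_states + 1).log_softmax(dim=-1)
targets = torch.randint(1, num_states + 1, (B, U))  # no blanks in targets
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# CTC marginalizes over all alignments (blank insertions / label repeats),
# so no frame-level alignment is needed, unlike cross-entropy training.
ctc = nn.CTCLoss(blank=blank)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```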

Decoupled Spatial and Temporal Processing for Resource Efficient Multichannel Speech Enhancement [article]

Ashutosh Pandey, Buye Xu
2024 arXiv   pre-print
We present a novel model designed for resource-efficient multichannel speech enhancement in the time domain, with a focus on low latency, a lightweight footprint, and low computational requirements.  ...  The temporal processing is applied over a single-channel output stream from the spatial processing using a Long Short-Term Memory (LSTM) network.  ...  [5] , who proposed a convolutional recurrent model for lightweight, low-compute, and low-latency multichannel speech enhancement. Meanwhile, Pandey et al.  ... 
arXiv:2401.07879v1 fatcat:53kwv5doqjasrbl2yuxh3nbbla
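The decoupling described in this snippet can be sketched as a learned spatial filter that collapses the microphone channels into one stream, followed by a unidirectional LSTM over that stream. A hypothetical simplification, not the authors' architecture (real systems would typically operate on frames or filterbank features rather than raw samples):

```python
import torch
import torch.nn as nn

class DecoupledEnhancer(nn.Module):
    """Hypothetical sketch: spatial processing collapses M channels to 1,
    then an LSTM models temporal context on the single output stream."""
    def __init__(self, mics: int = 4, dim: int = 128):
        super().__init__()
        # Spatial processing: a 1x1 conv mixes channels per time step.
        self.spatial = nn.Conv1d(mics, 1, kernel_size=1)
        # Temporal processing: unidirectional LSTM keeps latency low.
        self.temporal = nn.LSTM(1, dim, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, mics, samples) multichannel waveform
        mono = self.spatial(x)                  # (batch, 1, samples)
        h, _ = self.temporal(mono.transpose(1, 2))
        return self.out(h).transpose(1, 2)      # (batch, 1, samples)

model = DecoupledEnhancer()
enhanced = model(torch.randn(2, 4, 16000))      # 1 s at 16 kHz, 4 mics
```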

Future Context Attention for Unidirectional LSTM Based Acoustic Model

Jian Tang, Shiliang Zhang, Si Wei, Li-Rong Dai
2016 Interspeech 2016  
of a kind of attention mechanism for unidirectional LSTM based acoustic models.  ...  Recently, feedforward sequential memory networks (FSMN) have shown a strong ability to model past and future long-term dependencies in speech signals without using recurrent feedback, and have achieved better  ...  In contrast to BLSTM, LSTM has no time-latency shortcoming. Therefore, it is desirable to incorporate future context into LSTM so that it performs as well as BLSTM with low time latency.  ... 
doi:10.21437/interspeech.2016-185 dblp:conf/interspeech/TangZWD16 fatcat:2st7ge45xzc6rpxxsrp3nsym3i
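Attending over a short window of future frames on top of a unidirectional LSTM, as this entry proposes, might look like the following. A minimal hypothetical sketch (window size and dimensions are illustrative, and the scoring function is a deliberately simple stand-in):

```python
import torch
import torch.nn as nn

class FutureContextAttention(nn.Module):
    """Hypothetical sketch: each LSTM output frame attends over the next
    `window` frames, adding bounded latency instead of a full BLSTM pass."""
    def __init__(self, dim: int, window: int = 5):
        super().__init__()
        self.window = window
        self.score = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, dim); gather each frame's `window` future frames
        B, T, D = h.shape
        pad = h.new_zeros(B, self.window, D)
        hp = torch.cat([h, pad], dim=1)                         # right-pad
        fut = hp.unfold(1, self.window, 1)[:, 1:T + 1]          # (B, T, D, window)
        fut = fut.permute(0, 1, 3, 2)                           # (B, T, window, D)
        w = torch.softmax(self.score(fut).squeeze(-1), dim=-1)  # (B, T, window)
        ctx = (w.unsqueeze(-1) * fut).sum(dim=2)                # (B, T, D)
        return torch.cat([h, ctx], dim=-1)      # frame + future summary

lstm = nn.LSTM(80, 256, batch_first=True)
att = FutureContextAttention(256, window=5)
h, _ = lstm(torch.randn(2, 100, 80))
y = att(h)                                      # (2, 100, 512)
```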

Improving Gated Recurrent Unit Based Acoustic Modeling with Batch Normalization and Enlarged Context [article]

Jie Li, Yahui Shan, Xiaorui Wang, Yan Li
2018 arXiv   pre-print
The use of future contextual information is typically shown to be helpful for acoustic modeling.  ...  This model, mGRUIP with context module (mGRUIP-Ctx), has been shown to be capable of utilizing future context effectively, while maintaining quite low model latency and computation cost.  ...  [8] proposed the use of temporal convolution, in the form of TDNN layers, for modeling the future temporal context while affording inference with frame-level increments.  ... 
arXiv:1811.10169v1 fatcat:eodop4tmkzhu7mzyvrkiejinru
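The TDNN-layer formulation mentioned in this snippet splices frames at fixed offsets, including future ones, so inference can still proceed in frame-level increments. A hypothetical sketch of one such layer (offsets and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """Hypothetical TDNN layer: splice frames at fixed offsets (including
    future ones, e.g. {-2, 0, +2}) and apply a shared affine transform."""
    def __init__(self, in_dim: int, out_dim: int, offsets=(-2, 0, 2)):
        super().__init__()
        self.offsets = offsets
        self.affine = nn.Linear(in_dim * len(offsets), out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); shift-and-concat implements the splicing
        T = x.size(1)
        spliced = []
        for o in self.offsets:
            idx = torch.arange(T, device=x.device) + o
            idx = idx.clamp(0, T - 1)        # replicate edges
            spliced.append(x[:, idx])
        return self.affine(torch.cat(spliced, dim=-1))

# The stack's right context is the sum of positive offsets per layer:
# two layers with +2 each add 4 frames of model latency.
layer = TDNNLayer(80, 256)
out = layer(torch.randn(4, 100, 80))         # (4, 100, 256)
```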

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition [article]

Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer
2020 arXiv   pre-print
The proposed system equips end-to-end models with streaming capability and reduces the large footprint of the streaming attention-based model using augmented memory.  ...  These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming.  ...  remain low-latency.  ... 
arXiv:2011.07120v1 fatcat:5ywnqzfuind75n4zdc3xkq5er4
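The augmented-memory idea summarized above (each chunk attends over itself plus a bank of summary vectors from past chunks, keeping the attention span bounded) can be sketched roughly. A hypothetical simplification in which each chunk's memory slot is a plain mean, standing in for the learned summarization:

```python
import torch
import torch.nn as nn

class AugmentedMemoryAttention(nn.Module):
    """Hypothetical sketch: each fixed-size chunk attends over itself plus a
    memory bank holding one summary vector per previous chunk, keeping the
    attention span (and thus latency/compute) bounded for streaming."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, chunk: int = 16) -> torch.Tensor:
        B, T, D = x.shape
        memory, outputs = [], []
        for s in range(0, T, chunk):
            seg = x[:, s:s + chunk]
            # keys/values: memory bank from past chunks + the current chunk
            bank = torch.cat(memory + [seg], dim=1)
            out, _ = self.attn(seg, bank, bank)
            outputs.append(out)
            # summarize the chunk into one vector and append to the bank
            memory.append(seg.mean(dim=1, keepdim=True))
        return torch.cat(outputs, dim=1)

m = AugmentedMemoryAttention(dim=64)
y = m(torch.randn(2, 80, 64))   # processed chunk by chunk, bounded span
```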

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention [article]

Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer
2019 arXiv   pre-print
Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and in capturing context.  ...  We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce the frame rate for efficient inference, and 2) using truncated self-attention to enable streaming for Transformer  ...  These issues are critical for self-attention to work in scenarios demanding low latency and low computation, such as on-device speech recognition [6] .  ... 
arXiv:1910.12977v1 fatcat:xfzkd24mczgw5f6fl2taz2znya
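Truncated self-attention, the key streaming ingredient listed here, amounts to masking the attention to a bounded left and right window so lookahead (and hence latency) stays fixed. A minimal hypothetical sketch of such a mask (window sizes are illustrative):

```python
import torch
import torch.nn as nn

def truncated_attn_mask(T: int, left: int, right: int) -> torch.Tensor:
    """Boolean mask for truncated self-attention: frame t may only attend
    to frames in [t - left, t + right]. `right` bounds the lookahead and
    therefore the latency; True entries are masked out."""
    i = torch.arange(T).unsqueeze(1)   # query positions
    j = torch.arange(T).unsqueeze(0)   # key positions
    return (j < i - left) | (j > i + right)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 100, 64)
mask = truncated_attn_mask(T=100, left=20, right=4)  # 4 frames of lookahead
y, _ = attn(x, x, x, attn_mask=mask)
```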

Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications [article]

Yongqiang Wang, Yangyang Shi, Frank Zhang, Chunyang Wu, Julian Chan, Ching-Feng Yeh, Alex Xiao
2020 arXiv   pre-print
Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium-latency tasks and LSTM on low-latency tasks.  ...  We compare the transformer-based acoustic models with their LSTM counterparts on industrial-scale tasks.  ...  LSTM-based acoustic models In practice, unidirectional LSTM-based acoustic models are widely used in low-latency ASR scenarios.  ... 
arXiv:2010.14665v2 fatcat:k5umdj3imrfcnbybfb5cucn2ne
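The latency-controlled BLSTM baseline named here runs the backward LSTM over a fixed chunk plus a short lookahead instead of the whole utterance, so latency is bounded by the chunk size. A hypothetical sketch (chunk and lookahead sizes are illustrative):

```python
import torch
import torch.nn as nn

class LCBLSTM(nn.Module):
    """Hypothetical latency-controlled BLSTM: the forward LSTM streams
    normally; the backward LSTM sees only `chunk + lookahead` frames,
    so latency is bounded by the chunk size, not utterance length."""
    def __init__(self, in_dim: int, dim: int):
        super().__init__()
        self.fwd = nn.LSTM(in_dim, dim, batch_first=True)
        self.bwd = nn.LSTM(in_dim, dim, batch_first=True)

    def forward(self, x, chunk: int = 40, lookahead: int = 10):
        T = x.size(1)
        f, _ = self.fwd(x)                        # full left context
        outs = []
        for s in range(0, T, chunk):
            win = x[:, s:min(s + chunk + lookahead, T)]
            b, _ = self.bwd(win.flip(1))          # backward over bounded window
            b = b.flip(1)[:, :min(chunk, T - s)]  # keep the chunk part only
            outs.append(b)
        return torch.cat([f, torch.cat(outs, dim=1)], dim=-1)

model = LCBLSTM(80, 256)
y = model(torch.randn(2, 200, 80))                # (2, 200, 512)
```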

A Novel Temporal Attentive-Pooling based Convolutional Recurrent Architecture for Acoustic Signal Enhancement [article]

Tassadaq Hussain, Wei-Chien Wang, Mandar Gogate, Kia Dashtipour, Yu Tsao, Xugang Lu, Adeel Ahsan, Amir Hussain
2022 arXiv   pre-print
Specifically, we first utilize a convolutional layer to extract local information from the acoustic signals, and then a recurrent neural network (RNN) architecture is used to characterize temporal contextual  ...  The proposed ASE system is evaluated using a benchmark infant cry dataset and compared with several well-known methods.  ...  Gogate and T. Hussain are supported by the UK Engineering and Physical Sciences Research Council (EPSRC), grant reference EP/T021063/1.  ... 
arXiv:2201.09913v1 fatcat:nromw5clt5dyxg6gbjtfh2ef4e
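The pipeline this snippet describes (convolution for local structure, an RNN for temporal context, attentive pooling on top) can be sketched roughly as below. A hypothetical sketch, not the authors' model; the spectrogram size, dimensions, and mask-style output are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TAPConvRNN(nn.Module):
    """Hypothetical sketch of the conv -> RNN -> temporal attentive-pooling
    pattern: the conv extracts local spectral structure, the LSTM adds
    temporal context, and attention weights pick the frames that dominate
    the utterance-level summary."""
    def __init__(self, freq: int = 257, dim: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(freq, dim, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)
        self.mask = nn.Linear(dim, freq)

    def forward(self, spec):                        # spec: (batch, time, freq)
        h = self.conv(spec.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(h)
        w = torch.softmax(self.score(h), dim=1)     # attention over time
        pooled = (w * h).sum(dim=1, keepdim=True)   # utterance summary
        return torch.sigmoid(self.mask(h + pooled)) # enhancement mask per frame

m = TAPConvRNN()
mask = m(torch.randn(2, 100, 257))   # multiply with the noisy spectrogram
```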

Acoustic modelling with CD-CTC-sMBR LSTM RNNs

Andrew Senior, Hasim Sak, Felix de Chaumont Quitry, Tara Sainath, Kanishka Rao
2015 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)  
Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations and the application to child speech recognition; combination of multiple models, and convolutional  ...  We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system.  ...  Sections 3.1 and 3.2 demonstrate improved inference and decoding speed with our low-frame-rate CTC models, and show how constraints during training can limit latency in decoding.  ... 
doi:10.1109/asru.2015.7404851 dblp:conf/asru/SeniorSQSR15 fatcat:cdx2r37ggzddxi5albmcst2iie

Recent progresses in deep learning based acoustic models

Dong Yu, Jinyu Li
2017 IEEE/CAA Journal of Automatica Sinica  
We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combinations  ...  We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion  ...  It can be considered as a unified way of using LSTM for temporal, spectral, and spatial computation.  ... 
doi:10.1109/jas.2017.7510508 fatcat:zcffvbg75bhllcekqghkmwidsy

Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping [article]

Linhao Dong, Feng Wang, Bo Xu
2019 arXiv   pre-print
In addition, the chunk-hopping mechanism allows the SAA to have only a 2.5% relative CER degradation with a 320 ms latency.  ...  However, it is not clear whether the self-attention network could be a good alternative to RNNs in automatic speech recognition (ASR), which processes longer speech sequences and may have online recognition  ...  The RNN-based baseline uses the Extended-RNA model in [11] , which leverages a 4-layer bidirectional LSTM (BLSTM) [18] as the encoder and a 1-layer LSTM [19] as the decoder.  ... 
arXiv:1902.06450v1 fatcat:xujp65wgivhsbn2p65clvgmtcm
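The chunk-hopping mechanism credited with the 320 ms latency figure slices the input into overlapping chunks so each chunk carries some past and future context for the encoder. A hypothetical sketch of the slicing only (chunk and context sizes are illustrative):

```python
import torch

def chunk_hopping(x: torch.Tensor, chunk: int, left: int, right: int):
    """Hypothetical sketch: yield overlapping windows of x (batch, time, dim).
    Each central `chunk` is flanked by `left` past and `right` future frames;
    the encoder runs per window, so latency is bounded by chunk + right."""
    T = x.size(1)
    for s in range(0, T, chunk):
        lo, hi = max(0, s - left), min(T, s + chunk + right)
        yield x[:, lo:hi], s - lo   # window, plus offset of its center part

x = torch.randn(1, 200, 80)
for window, center in chunk_hopping(x, chunk=32, left=16, right=16):
    # encode `window`, then keep only outputs for frames
    # [center, center + 32) to stitch the full sequence back together
    pass
```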

CAT: CRF-based ASR Toolkit [article]

Keyu An, Hongyu Xiang, Zhijian Ou
2019 arXiv   pre-print
Towards flexibility, we show that i-vector based speaker-adapted recognition and a latency-control mechanism can be explored easily and effectively in CAT.  ...  compared with the hybrid DNN-HMM models.  ...  Wav2letter++ is based solely on convolutional neural networks, which use restricted future context and realize low latency.  ... 
arXiv:1911.08747v1 fatcat:meychp57xjd7bhu3ell4dkd2pa
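The i-vector based speaker adaptation mentioned here is commonly realized by appending a fixed per-utterance speaker embedding to every acoustic frame; whether CAT does exactly this is an assumption. A minimal sketch of that common recipe:

```python
import torch

def append_ivector(frames: torch.Tensor, ivector: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of i-vector speaker adaptation: tile the
    per-utterance i-vector across time and concatenate it to each frame,
    so the acoustic model can condition on speaker characteristics."""
    B, T, _ = frames.shape
    tiled = ivector.unsqueeze(1).expand(B, T, ivector.size(-1))
    return torch.cat([frames, tiled], dim=-1)

feats = torch.randn(4, 100, 40)        # fbank frames
ivec = torch.randn(4, 100)             # 100-dim i-vectors, one per utterance
adapted = append_ivector(feats, ivec)  # (4, 100, 140)
```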

Attention-based End-to-End Models for Small-Footprint Keyword Spotting

Changhao Shan, Junbo Zhang, Yujun Wang, Lei Xie
2018 Interspeech 2018  
Our model consists of an encoder and an attention mechanism. The encoder transforms the input signal into a high-level representation using RNNs.  ...  We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN.  ...  Acknowledgements The authors would like to thank Jingyong Hou for helpful comments and suggestions.  ... 
doi:10.21437/interspeech.2018-1777 dblp:conf/interspeech/ShanZWX18 fatcat:tefhrrsnvndwvirmh2dug6waxy
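The encoder-plus-attention structure described in this snippet reduces to soft-attention pooling over RNN states followed by a small classifier. A hypothetical sketch (feature and layer sizes are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class AttentionKWS(nn.Module):
    """Hypothetical sketch of attention-based small-footprint keyword
    spotting: a GRU encoder, soft attention pooling over its states,
    and a linear layer scoring keyword presence."""
    def __init__(self, feat: int = 40, dim: int = 128, n_keywords: int = 1):
        super().__init__()
        self.encoder = nn.GRU(feat, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)
        self.cls = nn.Linear(dim, n_keywords)

    def forward(self, x):                        # x: (batch, time, feat)
        h, _ = self.encoder(x)                   # high-level representation
        w = torch.softmax(self.score(h), dim=1)  # attention over frames
        c = (w * h).sum(dim=1)                   # fixed-length summary
        return torch.sigmoid(self.cls(c))        # keyword posterior

model = AttentionKWS()
p = model(torch.randn(8, 150, 40))               # (8, 1) detection scores
```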

A review of on-device fully neural end-to-end automatic speech recognition algorithms [article]

Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar, Sungsoo Kim, Abhinav Garg, Changwoo Han
2021 arXiv   pre-print
Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text-normalizer, an inverse-text normalizer,  ...  Examples include speech recognition systems based on Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder models (AED), Monotonic Chunk-wise  ...  Further improvement is achieved by combining a streaming model with a low-latency non-streaming model, by applying shallow-fusion with a Language Model (LM), and by applying spell correction using a list  ... 
arXiv:2012.07974v3 fatcat:uxpxqcgcvvg7dfrkl2rxekkmse
Showing results 1–15 of 712