42 Hits in 3.7 sec

Adversarial Speaker Verification

Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong
2019 ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
The use of deep networks to extract embeddings for speaker recognition has proven successful.  ...  In this work, we propose an adversarial speaker verification (ASV) scheme to learn the condition-invariant deep embedding via adversarial multi-task training.  ...  DEEP EMBEDDING FOR SPEAKER VERIFICATION Deep embedding has been widely used for speaker verification.  ... 
doi:10.1109/icassp.2019.8682488 dblp:conf/icassp/MengZLG19 fatcat:swecovr4u5g3zinxektm23yi7i
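The adversarial multi-task training this abstract describes is commonly realized with a gradient reversal layer between the embedding network and a condition classifier. Below is a minimal NumPy sketch of just the reversal step; the class name and the λ weighting are illustrative, not taken from the paper:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates (and scales) the gradient
    in the backward pass, so the embedding network is pushed to make
    the condition (e.g. noise type) unpredictable."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off weight for the adversarial branch

    def forward(self, x):
        return x  # embeddings pass through unchanged

    def backward(self, grad_from_condition_classifier):
        # The embedding network receives the *negated* gradient.
        return -self.lam * grad_from_condition_classifier

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grl.forward(x), x)
assert np.allclose(grl.backward(np.ones(3)), -0.5 * np.ones(3))
```

In a full system this layer sits between the shared embedding extractor and the condition classifier head, while the speaker classification head receives the unmodified gradient.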

Blind Speech Signal Quality Estimation for Speaker Verification Systems

Galina Lavrentyeva, Marina Volkova, Anastasia Avdeeva, Sergey Novoselov, Artem Gorlanov, Tseren Andzhukaev, Artem Ivanov, Alexander Kozlov
2020 Interspeech 2020  
This paper presents a neural network based approach for blind speech signal quality estimation in terms of signal-to-noise ratio (SNR) and reverberation time (RT60), which is able to classify the type of  ...  The present state-of-the-art deep speaker embedding models are domain-sensitive.  ...  System description The proposed models are trained in a multitask mode: one neural network is simultaneously trained to predict SNR, RT60 and background noise class.  ... 
doi:10.21437/interspeech.2020-1826 dblp:conf/interspeech/LavrentyevaVANG20 fatcat:a6am67f56ncm5djhcpbfwmkk6m
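For context on the regression targets named here: the SNR a blind estimator learns to predict is just the log power ratio of speech to noise, computable exactly when the two tracks are available separately during training. A small NumPy sketch of the oracle target (signal names and lengths are illustrative):

```python
import numpy as np

def snr_db(speech, noise):
    """Oracle signal-to-noise ratio in dB, given separate speech and
    noise tracks (the target a blind estimator is trained to predict)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # 1 s of unit-power "speech" at 16 kHz
noise = 0.1 * rng.standard_normal(16000)   # noise at ~1/100 the power
print(snr_db(speech, noise))               # close to 20 dB
```

RT60 targets are obtained analogously from the known room impulse response at simulation time, so one network can regress both in a multitask fashion.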

Training Multi-Task Adversarial Network for Extracting Noise-Robust Speaker Embedding [article]

Jianfeng Zhou, Tao Jiang, Lin Li, Qingyang Hong, Zhe Wang, Bingyin Xia
2019 arXiv   pre-print
Motivated by the promising performance of multi-task training in a variety of image processing tasks, we explore the potential of multi-task adversarial training for learning a noise-robust speaker embedding  ...  Furthermore, experiments indicate that our method is also able to improve the speaker verification performance in the clean condition.  ...  MULTI-TASK ADVERSARIAL NETWORK CNN Based Embedding Learning CNN-based neural network architecture has proved its superior performance in speaker verification tasks [7, 12] .  ... 
arXiv:1811.09355v2 fatcat:zgguy2as4jbrxlrz7b6kpobhvu

Gradient Regularization for Noise-Robust Speaker Verification

Jianchen Li, Jiqing Han, Hongwei Song
2021 Conference of the International Speech Communication Association  
Noise robustness is a challenge for speaker recognition systems. To solve this problem, one of the most common approaches is to joint-train a model by using both clean and noisy utterances.  ...  However, the gradients calculated on noisy utterances generally contain speaker-irrelevant noisy components, resulting in overfitting for the seen noisy data and poor generalization for the unseen noisy  ...  For example, the multitask adversarial training framework was proposed for training noise-robust speaker models [12] , and the unsupervised adversarial invariance architecture was adopted to disentangle  ... 
doi:10.21437/interspeech.2021-1216 dblp:conf/interspeech/LiHS21 fatcat:3qtofwpmvrhn5c7k3wgbihmzju
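The abstract's observation that noisy-utterance gradients carry speaker-irrelevant components suggests filtering them against clean-utterance gradients. The sketch below is a hypothetical instantiation of that idea, not the paper's exact rule: keep only the part of the noisy gradient that agrees in direction with the clean gradient.

```python
import numpy as np

def filtered_update(g_clean, g_noisy):
    """Combine clean- and noisy-data gradients, discarding the noisy
    component when it opposes the clean-data direction (illustrative
    variant, not the exact regularizer from the paper)."""
    direction = g_clean / (np.linalg.norm(g_clean) + 1e-12)
    aligned = np.dot(g_noisy, direction)
    # Keep the noisy gradient's contribution only along the clean direction,
    # and only when it points the same way.
    g_noisy_kept = max(aligned, 0.0) * direction
    return g_clean + g_noisy_kept

g_c = np.array([1.0, 0.0])
g_n = np.array([0.5, 2.0])   # partly speaker-irrelevant
print(filtered_update(g_c, g_n))   # the [0, 2] noise component is dropped
```

Schemes in this family differ mainly in whether opposing components are zeroed, projected out per-layer, or penalized softly in the loss.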

Multi-Task Discriminative Training of Hybrid DNN-TVM Model for Speaker Verification with Noisy and Far-Field Speech

Arindam Jati, Raghuveer Peri, Monisankha Pal, Tae Jin Park, Naveen Kumar, Ruchir Travadi, Panayiotis Georgiou, Shrikanth Narayanan
2019 Interspeech 2019  
First, we adopt a newly proposed discriminative model that hybridizes Deep Neural Network (DNN) and Total Variability Model (TVM) with the goal of integrating their strengths.  ...  The paper aims to address the task of speaker verification with single-channel, noisy and far-field speech by learning an embedding or feature representation that is invariant to different acoustic environments  ...  [13] introduced x-vectors which employed different types of artificial augmentation to train a robust speaker embedding using a Time Delay Neural Network (TDNN)-based speaker classification model [  ... 
doi:10.21437/interspeech.2019-3010 dblp:conf/interspeech/JatiPPP0TGN19 fatcat:2y7smyvsgrg2fnpogf6vbro6aa

Multi-Task Learning for End-to-End Noise-Robust Bandwidth Extension

Nana Hou, Chenglin Xu, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li
2020 Interspeech 2020  
To alleviate this problem, we propose an end-to-end time-domain framework for noise-robust bandwidth extension that jointly optimizes a mask-based speech enhancement and an ideal bandwidth extension module  ...  Speech bandwidth extension methods, such as deep neural networks (DNN) [8, 9] , fully convolutional networks [10, 11] , generative adversarial networks (GAN) [12] , and wavenet [13] , mostly perform  ...  With the advent of deep learning, recent studies suggest [17] a unified approach that combines speech enhancement and bandwidth extension (UEE) in a jointly trained neural network.  ... 
doi:10.21437/interspeech.2020-2022 dblp:conf/interspeech/HouXZC020 fatcat:g3ishlqknvfjfdzg2ymsjclvxe

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [article]

Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen
2021 arXiv   pre-print
Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for  ...  In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less  ...  pure neural-network-based methods.  ... 
arXiv:2008.09586v2 fatcat:vgdadayysvazfna32f5s43nc6e

Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification [article]

Sung Hwan Mun, Min Hyun Han, Dongjune Lee, Jihwan Kim, Nam Soo Kim
2021 arXiv   pre-print
Also, we demonstrate that the integrated two-stage framework further improves the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.  ...  In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker, which provide not only speaker  ...  “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing  ...  neural networks toward unsupervised learning of speaker characteristics  ... 
arXiv:2112.08929v1 fatcat:cm4plnaw2ngtnk23s5pq3cmjhe

Preserving background sound in noise-robust voice conversion via multi-task learning [article]

Jixun Yao, Yi Lei, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie, Hai Li, Junhui Liu, Danming Xie
2022 arXiv   pre-print
The critical problem for preserving background sound in VC is inevitable speech distortion by the neural separation model and the cascade mismatch between the source separation model and the VC model.  ...  Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving comparable quality and speaker similarity to the VC models trained with clean  ...  verification (ASV) [4] .  ... 
arXiv:2211.03036v1 fatcat:3ym2e4dy4vdthagtsdfjotvqtq

Deep Learning Approach in DOA Estimation: A Systematic Literature Review

Shengguo Ge, Kuo Li, Siti Nurulain Binti Mohd Rum, Sang-Bing Tsai
2021 Mobile Information Systems  
This study provides a systematic review of research on DOA estimation using deep neural network methods.  ...  Then, the DL technology used in DOA estimation is systematically analyzed, including the purpose of using DL in DOA estimation, various DL models (convolutional neural network, deep neural network, and  ...  Liu et al. [58] (ULA, DNN, linear): the network consists of two parts: one is a multitasking autoencoder and the other is a fully connected multilayer neural network.  ... 
doi:10.1155/2021/6392875 fatcat:jtmyuje6zff5bnonpui5qc2vym

MooseNet: A trainable metric for synthesized speech with PLDA backend [article]

Ondřej Plátek, Ondřej Dušek
2023 arXiv   pre-print
The first model is a Neural Network (NN). As a second model, we propose a PLDA generative model on the top layers of the first NN model, which improves the pure NN model.  ...  We report improvements to the challenge baselines using easy-to-use modeling techniques, which also scale to larger self-supervised learning (SSL) models. We present two models.  ...  PLDA  A PLDA is a well-known generative probabilistic classification model in the face recognition [14] and speaker verification [20] literature, valued for its robust likelihood estimates.  ... 
arXiv:2301.07087v1 fatcat:uyaawlr7ajf37gbqjdecm6p24y
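For readers unfamiliar with the PLDA backend mentioned here: its score is a log-likelihood ratio between the "same speaker" and "different speaker" hypotheses for a pair of embeddings. A toy scalar-embedding version is sketched below; the variances b and w are illustrative, and real PLDA operates on embedding vectors with full covariance matrices:

```python
import numpy as np

def plda_llr(x1, x2, b=1.0, w=0.5):
    """Log-likelihood ratio that scalar embeddings x1, x2 share a speaker,
    under a toy PLDA: speaker mean ~ N(0, b), within-speaker noise ~ N(0, w)."""
    tot = b + w
    # Same speaker: the shared latent mean correlates the two observations.
    c_same = np.array([[tot, b], [b, tot]])
    # Different speakers: independent observations.
    c_diff = np.array([[tot, 0.0], [0.0, tot]])
    x = np.array([x1, x2])

    def log_gauss(v, cov):
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (len(v) * np.log(2 * np.pi) + logdet
                       + v @ np.linalg.inv(cov) @ v)

    return log_gauss(x, c_same) - log_gauss(x, c_diff)

assert plda_llr(1.0, 1.1) > plda_llr(1.0, -1.1)  # close pair scores higher
```

Fitting such a model on the top-layer activations of a neural metric, as the paper does, only requires estimating the between-class and within-class covariances from labeled data.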

Deep Spoken Keyword Spotting: An Overview

Ivan Lopez-Espejo, Zheng-Hua Tan, John Hansen, Jesper Jensen
2021 IEEE Access  
Le, “Sequence to sequence learning with neural networks,” in Proceedings of NIPS 2014  ...  robust keyword spotting and speaker verification using CTC-based soft  ...  “Deep convolutional spiking neural networks for keyword spotting,” [34] S.  ... 
doi:10.1109/access.2021.3139508 fatcat:i4pfpfxcpretlkbefp7owtxcti

A Framework for Unified Real-time Personalized and Non-Personalized Speech Enhancement [article]

Zhepei Wang, Ritwik Giri, Devansh Shah, Jean-Marc Valin, Michael M. Goodwin, Paris Smaragdis
2023 arXiv   pre-print
In this study, we present an approach to train a single speech enhancement network that can perform both personalized and non-personalized speech enhancement.  ...  that the proposed unified model obtains promising results on both personalized and non-personalized speech enhancement benchmarks and reaches similar performance to models that are trained specialized for  ...  While advanced deep neural network architectures have achieved state-of-the-art in offline speech enhancement tasks [1, 2] , recent advances in speech enhancement have been focused on efficient model  ... 
arXiv:2302.11768v1 fatcat:tjg7mvrw75ctjej3r2wuvbrpom

Deep Learning for Distant Speech Recognition [article]

Mirco Ravanelli
2017 arXiv   pre-print
of deep neural networks.  ...  We then investigate approaches for better exploiting speech contexts, proposing some original methodologies for both feed-forward and recurrent neural networks.  ... 
arXiv:1712.06086v1 fatcat:2b7ymqmihjan5nkxeqrxq52wki

Selective Listening by Synchronizing Speech with Lips [article]

Zexu Pan, Ruijie Tao, Chenglin Xu, Haizhou Li
2022 arXiv   pre-print
We transfer the knowledge from the pre-trained model to the attractor encoder of the speaker extraction network.  ...  Therefore, we propose a self-supervised pre-training strategy, to exploit the speech-lip synchronization cue for target speaker extraction, which allows us to leverage abundant unlabeled in-domain data  ...  The target-interference SNR is defined as the energy contrast between the target speaker and the interference speaker in terms of SNR.  ... 
arXiv:2106.07150v2 fatcat:ay6bzvmy4je3pexqykiwksmu5e
Showing results 1 — 15 out of 42 results