Disentangling Voice and Content with Self-Supervision for Speaker Recognition
by
Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li
2023
Abstract
For speaker recognition, it is difficult to extract an accurate speaker
representation from speech because of its mixture of speaker traits and
content. This paper proposes a disentanglement framework that simultaneously
models speaker traits and content variability in speech. It is realized with
the use of three Gaussian inference layers, each consisting of a learnable
transition model that extracts distinct speech components. Notably, a
strengthened transition model is specifically designed to model complex speech
dynamics. We also propose a self-supervision method to dynamically disentangle
content without the use of labels other than speaker identities. The efficacy
of the proposed framework is validated via experiments conducted on the
VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and
minDCF, respectively. Since the method requires neither additional model
training nor extra data, it is readily applicable in practice.
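The abstract describes three Gaussian inference layers, each with a learnable transition model that extracts a distinct speech component. The paper's exact architecture is not given here, but the idea of a Gaussian inference layer with a transition model can be illustrated by a Kalman-style linear-Gaussian filter. The sketch below is an assumption-laden stand-in: the class name, shapes, and the identity observation model are illustrative, and the "learnable" transition matrix is mocked with a fixed near-identity matrix rather than trained parameters.

```python
import numpy as np


class GaussianInferenceLayer:
    """Hypothetical sketch of one Gaussian inference layer: a linear-Gaussian
    state-space filter whose transition matrix would be learned in the full
    model. Names and shapes are illustrative, not taken from the paper."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a learnable transition model: near-identity dynamics.
        self.A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
        self.Q = 0.1 * np.eye(dim)  # process-noise covariance
        self.R = 0.1 * np.eye(dim)  # observation-noise covariance

    def filter(self, frames):
        """Run predict/update over a (T, dim) frame sequence and return the
        per-frame posterior mean estimates of this layer's component."""
        dim = frames.shape[1]
        mu = np.zeros(dim)       # state mean
        P = np.eye(dim)          # state covariance
        estimates = []
        for x in frames:
            # Predict: propagate the state through the transition model.
            mu = self.A @ mu
            P = self.A @ P @ self.A.T + self.Q
            # Update: fold in the current frame (identity observation model).
            K = P @ np.linalg.inv(P + self.R)  # Kalman gain
            mu = mu + K @ (x - mu)
            P = (np.eye(dim) - K) @ P
            estimates.append(mu)
        return np.stack(estimates)


# Usage: three layers, as in the paper's framework, each would extract a
# distinct component (e.g. speaker traits vs. content) from the same frames.
frames = np.ones((10, 4))
layers = [GaussianInferenceLayer(dim=4, seed=s) for s in range(3)]
components = [layer.filter(frames) for layer in layers]
```

In the actual framework the transition matrices would be trained jointly with the speaker-recognition objective and the self-supervised disentanglement loss, so each layer specializes to a different component; this sketch only shows the inference mechanics.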
arXiv:2310.01128v2