Disentangling Voice and Content with Self-Supervision for Speaker Recognition
by
Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li
2023
Abstract
For speaker recognition, it is difficult to extract an accurate speaker
representation from speech because of its mixture of speaker traits and
content. This paper proposes a disentanglement framework that simultaneously
models speaker traits and content variability in speech. It is realized with
the use of three Gaussian inference layers, each consisting of a learnable
transition model that extracts distinct speech components. Notably, a
strengthened transition model is specifically designed to model complex speech
dynamics. We also propose a self-supervision method to dynamically disentangle
content without the use of labels other than speaker identities. The efficacy
of the proposed framework is validated via experiments conducted on the
VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and
minDCF, respectively. Since the method requires neither additional model
training nor extra data, it is readily applicable in practice.
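The abstract describes three Gaussian inference layers, each with a learnable transition model that extracts a distinct speech component. The paper's exact architecture is not given here, but the idea of a Gaussian inference layer with a transition model can be illustrated by a Kalman-style linear-Gaussian filter. The sketch below is an assumption-laden stand-in: the class name, shapes, and the identity observation model are illustrative, and the "learnable" transition matrix is mocked with a fixed near-identity matrix rather than trained parameters.

```python
import numpy as np


class GaussianInferenceLayer:
    """Hypothetical sketch of one Gaussian inference layer: a linear-Gaussian
    state-space filter whose transition matrix would be learned in the full
    model. Names and shapes are illustrative, not taken from the paper."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a learnable transition model: near-identity dynamics.
        self.A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
        self.Q = 0.1 * np.eye(dim)  # process-noise covariance
        self.R = 0.1 * np.eye(dim)  # observation-noise covariance

    def filter(self, frames):
        """Run predict/update over a (T, dim) frame sequence and return the
        per-frame posterior mean estimates of this layer's component."""
        dim = frames.shape[1]
        mu = np.zeros(dim)       # state mean
        P = np.eye(dim)          # state covariance
        estimates = []
        for x in frames:
            # Predict: propagate the state through the transition model.
            mu = self.A @ mu
            P = self.A @ P @ self.A.T + self.Q
            # Update: fold in the current frame (identity observation model).
            K = P @ np.linalg.inv(P + self.R)  # Kalman gain
            mu = mu + K @ (x - mu)
            P = (np.eye(dim) - K) @ P
            estimates.append(mu)
        return np.stack(estimates)


# Usage: three layers, as in the paper's framework, each would extract a
# distinct component (e.g. speaker traits vs. content) from the same frames.
frames = np.ones((10, 4))
layers = [GaussianInferenceLayer(dim=4, seed=s) for s in range(3)]
components = [layer.filter(frames) for layer in layers]
```

In the actual framework the transition matrices would be trained jointly with the speaker-recognition objective and the self-supervised disentanglement loss, so each layer specializes to a different component; this sketch only shows the inference mechanics.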
arXiv:2310.01128v2