Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach

yuxi@es.aau.dk | {kuhlmann, rautenberg}@nt.uni-paderborn.de | zt@es.aau.dk | haeb@nt.uni-paderborn.de

Abstract:

Speech signals carry information at multiple levels, including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves an unsupervised disentanglement of a speech signal into speaker and content information by assuming speaker information to be temporally more stable than content-induced variations. Thus, a single utterance-level embedding vector is used to represent speaker information. This utterance-level feature, however, may contain not only speaker information but also other temporally stable non-content information such as environment or emotion, which we call style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker class labels without the need for style labels. Experimental results validate the proposed method's effectiveness in extracting disentangled features, thereby facilitating speaker, style, or combined speaker-style conversion.

Note:

Each conversion below may involve up to three original utterances, taken from the VOiCES devkit or the LibriSpeech test set.
We use the terms content utterance, speaker utterance and style utterance as follows:
Content utterance: The utterance whose content should be preserved after conversion.
Speaker utterance: The utterance that provides the speaker identity for the conversion.
Style utterance: The utterance that influences only the style of the conversion result.
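
The three demo settings on this page differ only in which utterance fills each of these roles. The following minimal Python sketch makes that bookkeeping explicit. The names `ConversionInputs`, `speaker_only`, `style_only` and `speaker_and_style` are hypothetical and are not taken from the authors' code; the sketch only records which utterance plays which role and does not perform any conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionInputs:
    """Hypothetical container for the three utterance roles."""
    content: str  # utterance whose content is preserved
    speaker: str  # utterance that provides the speaker identity
    style: str    # utterance that influences only the style


def speaker_only(source: str, target_speaker: str) -> ConversionInputs:
    # Clean voice conversion: content and style utterances are the same,
    # only the speaker utterance changes.
    return ConversionInputs(content=source, speaker=target_speaker, style=source)


def style_only(source: str, target_style: str) -> ConversionInputs:
    # Style conversion: content and speaker utterances are the same,
    # only the style utterance changes.
    return ConversionInputs(content=source, speaker=source, style=target_style)


def speaker_and_style(source: str, target_speaker: str,
                      target_style: str) -> ConversionInputs:
    # Combined conversion: speaker and style both come from other utterances.
    return ConversionInputs(content=source, speaker=target_speaker,
                            style=target_style)
```

For example, `style_only(x, s)` describes a row in which the content and speaker columns both hold `x` and the style column holds `s`.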

Clean Voice Conversion:

The following demos show voice conversion on clean speech.
Content utterances, speaker utterances and style utterances are all from the LibriSpeech test set.
The content utterance and the style utterance are the same in each example; only the speaker utterance is changed during conversion.

	Content Utterance (LibriSpeech)	Speaker Utterance (LibriSpeech)	Style Utterance (LibriSpeech)	Conversion Result
Speaker ID	SPK2300	SPK1089	SPK2300

Speaker ID	SPK2300	SPK1188	SPK2300

Speaker ID	SPK1221	SPK1320	SPK1221

Speaker ID	SPK1580	SPK1995	SPK1580

Speaker ID	SPK1580	SPK237	SPK1580

Speaker ID	SPK1580	SPK2830	SPK1580

Only Change Style:

The following demos change only the style embedding.
The content utterance and the speaker utterance are the same in each example and come from the VOiCES_devkit test set.
Style utterances are from the LibriSpeech test set.
The 'Clean Content Utterance' column below contains the corresponding clean VOiCES utterances without reverberation.

	Content Utterance (VOiCES)	Speaker Utterance (VOiCES)	Style Utterance (LibriSpeech)	Conversion Result	Clean Content Utterance (VOiCES)
Utterance ID	Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030	Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030	1089-134686-0010		Lab41-SRI-VOiCES-src-sp8855-ch302395-sg0007

Utterance ID	Lab41-SRI-VOiCES-rm4-none-sp8468-ch294887-sg0002-mc01-stu-clo-dg140	Lab41-SRI-VOiCES-rm4-none-sp8468-ch294887-sg0002-mc01-stu-clo-dg140	1089-134686-0010		Lab41-SRI-VOiCES-src-sp8468-ch294887-sg0002

Utterance ID	Lab41-SRI-VOiCES-rm4-none-sp3816-ch290923-sg0003-mc05-stu-far-dg120	Lab41-SRI-VOiCES-rm4-none-sp3816-ch290923-sg0003-mc05-stu-far-dg120	1089-134686-0010		Lab41-SRI-VOiCES-src-sp3816-ch290923-sg0003

Utterance ID	Lab41-SRI-VOiCES-rm3-none-sp0688-ch015446-sg0034-mc01-stu-clo-dg050	Lab41-SRI-VOiCES-rm3-none-sp0688-ch015446-sg0034-mc01-stu-clo-dg050	1089-134686-0010		Lab41-SRI-VOiCES-src-sp0688-ch015446-sg0034

Utterance ID	Lab41-SRI-VOiCES-rm3-none-sp0667-ch105002-sg0020-mc01-stu-clo-dg080	Lab41-SRI-VOiCES-rm3-none-sp0667-ch105002-sg0020-mc01-stu-clo-dg080	1089-134686-0010		Lab41-SRI-VOiCES-src-sp0667-ch105002-sg0020

Utterance ID	Lab41-SRI-VOiCES-rm3-none-sp0373-ch130977-sg0028-mc01-stu-clo-dg080	Lab41-SRI-VOiCES-rm3-none-sp0373-ch130977-sg0028-mc01-stu-clo-dg080	1089-134686-0010		Lab41-SRI-VOiCES-src-sp0373-ch130977-sg0028

Utterance ID	Lab41-SRI-VOiCES-rm2-none-sp6499-ch057667-sg0021-mc01-stu-clo-dg010	Lab41-SRI-VOiCES-rm2-none-sp6499-ch057667-sg0021-mc01-stu-clo-dg010	1089-134686-0010		Lab41-SRI-VOiCES-src-sp6499-ch057667-sg0021

Utterance ID	Lab41-SRI-VOiCES-rm2-none-sp5588-ch068192-sg0028-mc01-stu-clo-dg090	Lab41-SRI-VOiCES-rm2-none-sp5588-ch068192-sg0028-mc01-stu-clo-dg090	1089-134686-0010		Lab41-SRI-VOiCES-src-sp5588-ch068192-sg0028

Utterance ID	Lab41-SRI-VOiCES-rm2-none-sp5139-ch061422-sg0023-mc01-stu-clo-dg060	Lab41-SRI-VOiCES-rm2-none-sp5139-ch061422-sg0023-mc01-stu-clo-dg060	1089-134686-0010		Lab41-SRI-VOiCES-src-sp5139-ch061422-sg0023

Utterance ID	Lab41-SRI-VOiCES-rm1-none-sp5322-ch007680-sg0022-mc01-stu-clo-dg160	Lab41-SRI-VOiCES-rm1-none-sp5322-ch007680-sg0022-mc01-stu-clo-dg160	1089-134686-0010		Lab41-SRI-VOiCES-src-sp5322-ch007680-sg0022

Utterance ID	Lab41-SRI-VOiCES-rm1-none-sp4267-ch287369-sg0016-mc01-stu-clo-dg100	Lab41-SRI-VOiCES-rm1-none-sp4267-ch287369-sg0016-mc01-stu-clo-dg100	1089-134686-0010		Lab41-SRI-VOiCES-src-sp4267-ch287369-sg0016

Utterance ID	Lab41-SRI-VOiCES-rm1-none-sp3816-ch290923-sg0003-mc01-stu-clo-dg120	Lab41-SRI-VOiCES-rm1-none-sp3816-ch290923-sg0003-mc01-stu-clo-dg120	1089-134686-0010		Lab41-SRI-VOiCES-src-sp3816-ch290923-sg0003
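
In the table above, each clean content utterance ID appears to be obtained from the recorded utterance ID by keeping the `sp`, `ch` and `sg` fields and replacing the remaining fields with `src`. The Python sketch below reproduces that mapping for the IDs listed on this page. The function name `clean_source_id` and the regular-expression group names are assumptions made for illustration, and the pattern has only been checked against the IDs shown here, not against the full VOiCES corpus.

```python
import re

# Pattern inferred from the utterance IDs on this page (an assumption,
# not an official VOiCES specification).
_VOICES_ID = re.compile(
    r"^(?P<prefix>Lab41-SRI-VOiCES)-(?P<room>rm\d+)-(?P<condition>[a-z]+)-"
    r"(?P<speaker>sp\d+)-(?P<chapter>ch\d+)-(?P<segment>sg\d+)-(?P<rest>.+)$"
)


def clean_source_id(recorded_id: str) -> str:
    """Map a recorded VOiCES utterance ID to its clean source ID."""
    match = _VOICES_ID.match(recorded_id)
    if match is None:
        raise ValueError(f"unexpected utterance ID format: {recorded_id!r}")
    # Keep only the speaker, chapter and segment fields.
    return "-".join(
        [match["prefix"], "src", match["speaker"], match["chapter"], match["segment"]]
    )


recorded = "Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030"
print(clean_source_id(recorded))
# Lab41-SRI-VOiCES-src-sp8855-ch302395-sg0007
```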

Change Style and Speaker:

The following demos change the style and speaker embeddings simultaneously.
Content utterances are from the VOiCES_devkit test set.
Style utterances and speaker utterances are from the LibriSpeech test set.

	Content Utterance (VOiCES)	Speaker Utterance (LibriSpeech)	Style Utterance (LibriSpeech)	Conversion Result
Utterance ID	Lab41-SRI-VOiCES-rm2-none-sp3070-ch166423-sg0042-mc01-stu-clo-dg080	8455-210777-0044	8463-294828-0038

Utterance ID	Lab41-SRI-VOiCES-rm3-none-sp6080-ch058025-sg0029-mc01-stu-clo-dg080	4446-2275-0025	8224-274384-0002

Utterance ID	Lab41-SRI-VOiCES-rm1-none-sp5029-ch030593-sg0041-mc05-stu-far-dg140	5142-36377-0010	2830-3980-0063

Utterance ID	Lab41-SRI-VOiCES-rm1-none-sp3070-ch166423-sg0042-mc01-stu-clo-dg080	2094-142345-0010	260-123288-0016

Utterance ID	Lab41-SRI-VOiCES-rm2-none-sp1390-ch130494-sg0008-mc05-stu-far-dg180	5142-33396-0059	3570-5695-0007

Utterance ID	Lab41-SRI-VOiCES-rm4-none-sp1898-ch145702-sg0001-mc01-stu-clo-dg100	61-70968-0040	237-134493-0010