Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach

yuxi@es.aau.dk | {kuhlmann, rautenberg}@nt.uni-paderborn.de | zt@es.aau.dk | haeb@nt.uni-paderborn.de

Abstract:

Speech signals carry information at multiple levels, including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves an unsupervised disentanglement of a speech signal into speaker and content information by assuming speaker information to be temporally more stable than content-induced variations. Thus, a single utterance-level embedding vector is used to represent speaker information. This utterance-level feature, however, may contain not only speaker information but also other temporally stable non-content information such as environment or emotion, which we call style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker class labels without the need for style labels. Experimental results validate the effectiveness of the proposed method in extracting disentangled features, thereby facilitating speaker, style, or combined speaker-style conversion.

Note:

Each conversion below may involve three original utterances from the VOiCES devkit or the LibriSpeech test set.
We use the terms content utterance, speaker utterance, and style utterance as follows:
Content utterance: The utterance whose content should be preserved after conversion.
Speaker utterance: The utterance that provides the speaker identity for the conversion.
Style utterance: The utterance that influences only the style of the conversion result.
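
As a reading aid, the three roles can be sketched in a few lines of Python. This is only an illustration of how the demos on this page combine utterances, not the authors' implementation: the names `ConversionInputs`, `plan_conversion`, and `conversion_kind` are hypothetical, and the rule that an unspecified role falls back to the content utterance is an assumption that mirrors the demo setups.

```python
from typing import NamedTuple, Optional


class ConversionInputs(NamedTuple):
    """Hypothetical container for the three utterance roles."""
    content_utt: str  # content is preserved from this utterance
    speaker_utt: str  # speaker identity is taken from this utterance
    style_utt: str    # style is taken from this utterance


def plan_conversion(content_utt: str,
                    speaker_utt: Optional[str] = None,
                    style_utt: Optional[str] = None) -> ConversionInputs:
    # Assumption: a role that is not given falls back to the content utterance.
    return ConversionInputs(content_utt,
                            speaker_utt or content_utt,
                            style_utt or content_utt)


def conversion_kind(inputs: ConversionInputs) -> str:
    # Compare each role against the content utterance to name the setup.
    speaker_changed = inputs.speaker_utt != inputs.content_utt
    style_changed = inputs.style_utt != inputs.content_utt
    if speaker_changed and style_changed:
        return "speaker and style"
    if speaker_changed:
        return "speaker only"
    if style_changed:
        return "style only"
    return "reconstruction"
```

For example, `conversion_kind(plan_conversion("utt_a", style_utt="utt_b"))` returns `"style only"`, which corresponds to a demo in which the content and speaker utterances are identical and only the style utterance differs.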

Clean Voice Conversion:

The following demos show clean voice conversion.
Content utterances, speaker utterances, and style utterances are all from the LibriSpeech test set.
The content utterance and the style utterance are the same; only the speaker utterance changes during conversion.

| | Content Utterance (LibriSpeech) | Speaker Utterance (LibriSpeech) | Style Utterance (LibriSpeech) | Conversion Result |
| Speaker ID | SPK2300 | SPK1089 | SPK2300 | |
| Speaker ID | SPK2300 | SPK1188 | SPK2300 | |
| Speaker ID | SPK1221 | SPK1320 | SPK1221 | |
| Speaker ID | SPK1580 | SPK1995 | SPK1580 | |
| Speaker ID | SPK1580 | SPK237 | SPK1580 | |
| Speaker ID | SPK1580 | SPK2830 | SPK1580 | |

Only Change Style:

The following demos change only the style embedding.
The content utterance and the speaker utterance are the same and come from the VOiCES_devkit test set.
Style utterances are from the LibriSpeech test set.
'Clean Content Utterance' below refers to the clean VOiCES utterance without reverberation.

| | Content Utterance (VOiCES) | Speaker Utterance (VOiCES) | Style Utterance (LibriSpeech) | Conversion Result | Clean Content Utterance (VOiCES) |
| Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030 | Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp8855-ch302395-sg0007 |
| Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp8468-ch294887-sg0002-mc01-stu-clo-dg140 | Lab41-SRI-VOiCES-rm4-none-sp8468-ch294887-sg0002-mc01-stu-clo-dg140 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp8468-ch294887-sg0002 |
| Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp3816-ch290923-sg0003-mc05-stu-far-dg120 | Lab41-SRI-VOiCES-rm4-none-sp3816-ch290923-sg0003-mc05-stu-far-dg120 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp3816-ch290923-sg0003 |
| Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp0688-ch015446-sg0034-mc01-stu-clo-dg050 | Lab41-SRI-VOiCES-rm3-none-sp0688-ch015446-sg0034-mc01-stu-clo-dg050 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp0688-ch015446-sg0034 |
| Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp0667-ch105002-sg0020-mc01-stu-clo-dg080 | Lab41-SRI-VOiCES-rm3-none-sp0667-ch105002-sg0020-mc01-stu-clo-dg080 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp0667-ch105002-sg0020 |
| Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp0373-ch130977-sg0028-mc01-stu-clo-dg080 | Lab41-SRI-VOiCES-rm3-none-sp0373-ch130977-sg0028-mc01-stu-clo-dg080 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp0373-ch130977-sg0028 |
| Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp6499-ch057667-sg0021-mc01-stu-clo-dg010 | Lab41-SRI-VOiCES-rm2-none-sp6499-ch057667-sg0021-mc01-stu-clo-dg010 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp6499-ch057667-sg0021 |
| Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp5588-ch068192-sg0028-mc01-stu-clo-dg090 | Lab41-SRI-VOiCES-rm2-none-sp5588-ch068192-sg0028-mc01-stu-clo-dg090 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp5588-ch068192-sg0028 |
| Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp5139-ch061422-sg0023-mc01-stu-clo-dg060 | Lab41-SRI-VOiCES-rm2-none-sp5139-ch061422-sg0023-mc01-stu-clo-dg060 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp5139-ch061422-sg0023 |
| Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp5322-ch007680-sg0022-mc01-stu-clo-dg160 | Lab41-SRI-VOiCES-rm1-none-sp5322-ch007680-sg0022-mc01-stu-clo-dg160 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp5322-ch007680-sg0022 |
| Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp4267-ch287369-sg0016-mc01-stu-clo-dg100 | Lab41-SRI-VOiCES-rm1-none-sp4267-ch287369-sg0016-mc01-stu-clo-dg100 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp4267-ch287369-sg0016 |
| Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp3816-ch290923-sg0003-mc01-stu-clo-dg120 | Lab41-SRI-VOiCES-rm1-none-sp3816-ch290923-sg0003-mc01-stu-clo-dg120 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp3816-ch290923-sg0003 |
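
In every row of the table above, the clean utterance ID appears to be the recorded utterance ID with only its `sp…`, `ch…`, and `sg…` fields kept and `src` in place of the recording-specific fields. The following Python sketch derives the clean ID from a recorded ID under that assumption. The pattern is inferred solely from the IDs listed on this page, and the function name `clean_source_id` is hypothetical.

```python
import re

# Pattern inferred from the recorded-utterance IDs listed on this page.
# Assumption: the rm/mc/dg fields are numeric and the remaining
# recording-specific fields are lowercase letter codes.
_RECORDED_ID = re.compile(
    r"^Lab41-SRI-VOiCES-rm\d+-[a-z]+-(sp\d+)-(ch\d+)-(sg\d+)"
    r"-mc\d+-[a-z]+-[a-z]+-dg\d+$"
)


def clean_source_id(recorded_id: str) -> str:
    """Map a recorded VOiCES utterance ID to its clean source ID."""
    match = _RECORDED_ID.match(recorded_id)
    if match is None:
        raise ValueError(f"unrecognized VOiCES utterance ID: {recorded_id!r}")
    speaker, chapter, segment = match.groups()
    # Keep only the fields that identify the underlying source utterance.
    return f"Lab41-SRI-VOiCES-src-{speaker}-{chapter}-{segment}"


print(clean_source_id(
    "Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030"
))
```

For the first row of the table, this prints `Lab41-SRI-VOiCES-src-sp8855-ch302395-sg0007`, which matches the clean utterance ID listed there.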

Change Style and Speaker:

The following demos change the style and speaker embeddings simultaneously.
Content utterances are from the VOiCES_devkit test set.
Style utterances and speaker utterances are from the LibriSpeech test set.

| | Content Utterance (VOiCES) | Speaker Utterance (LibriSpeech) | Style Utterance (LibriSpeech) | Conversion Result |
| Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp3070-ch166423-sg0042-mc01-stu-clo-dg080 | 8455-210777-0044 | 8463-294828-0038 | |
| Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp6080-ch058025-sg0029-mc01-stu-clo-dg080 | 4446-2275-0025 | 8224-274384-0002 | |
| Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp5029-ch030593-sg0041-mc05-stu-far-dg140 | 5142-36377-0010 | 2830-3980-0063 | |
| Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp3070-ch166423-sg0042-mc01-stu-clo-dg080 | 2094-142345-0010 | 260-123288-0016 | |
| Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp1390-ch130494-sg0008-mc05-stu-far-dg180 | 5142-33396-0059 | 3570-5695-0007 | |
| Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp1898-ch145702-sg0001-mc01-stu-clo-dg100 | 61-70968-0040 | 237-134493-0010 | |