Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder
Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach
yuxi@es.aau.dk | {kuhlmann, rautenberg}@nt.uni-paderborn.de | zt@es.aau.dk | haeb@nt.uni-paderborn.de
Abstract:
Speech signals carry information at multiple levels, including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves an unsupervised disentanglement of a speech signal into speaker and content information by assuming that speaker information is temporally more stable than content-induced variations. Thus, a single utterance-level embedding vector is used to represent speaker information. This utterance-level feature, however, may contain not only speaker information but also other temporally stable non-content information such as environment or emotion, which we call style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker class labels without the need for style labels. Experimental results validate the effectiveness of the proposed method in extracting disentangled features, thereby facilitating speaker, style, or combined speaker-style conversion.
Note:
Each conversion below may involve three original utterances from the VOiCES devkit or the LibriSpeech test set.
We use the terms content utterance, speaker utterance, and style utterance as follows:
Content utterance: The utterance from which the content should be preserved after conversion.
Speaker utterance: The utterance that provides the speaker identity for the conversion.
Style utterance: The utterance that influences only the style of the conversion result.
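The three roles above can be illustrated with a minimal sketch of how a conversion recombines embeddings. The function `convert` and all `toy_*` stand-ins are hypothetical names introduced here for illustration; the temporal mean and standard deviation are arbitrary placeholders and are not the networks used in the paper.

```python
import numpy as np

def convert(content_utt, speaker_utt, style_utt,
            content_encoder, speaker_encoder, style_encoder, decoder):
    """Recombine content, speaker, and style taken from three utterances."""
    content = content_encoder(content_utt)  # frame-level, shape (T, Dc)
    speaker = speaker_encoder(speaker_utt)  # utterance-level, shape (Ds,)
    style = style_encoder(style_utt)        # utterance-level, shape (Dy,)
    return decoder(content, speaker, style)

# Toy stand-ins (assumptions for illustration only, not the paper's models).
def toy_content_encoder(x):
    return x - x.mean(axis=0, keepdims=True)  # keep the time-varying part

def toy_speaker_encoder(x):
    return x.mean(axis=0)  # one temporally stable vector per utterance

def toy_style_encoder(x):
    return x.std(axis=0)   # another utterance-level vector

def toy_decoder(content, speaker, style):
    # Attach both utterance-level vectors to every content frame.
    num_frames = content.shape[0]
    return np.concatenate(
        [content,
         np.tile(speaker, (num_frames, 1)),
         np.tile(style, (num_frames, 1))],
        axis=1,
    )

models = dict(
    content_encoder=toy_content_encoder,
    speaker_encoder=toy_speaker_encoder,
    style_encoder=toy_style_encoder,
    decoder=toy_decoder,
)

rng = np.random.default_rng(0)
utt_a = rng.normal(size=(50, 4))  # content utterance (random features)
utt_b = rng.normal(size=(70, 4))  # speaker utterance
utt_c = rng.normal(size=(60, 4))  # style utterance

speaker_only = convert(utt_a, utt_b, utt_a, **models)  # change speaker only
style_only = convert(utt_a, utt_a, utt_c, **models)    # change style only
both = convert(utt_a, utt_b, utt_c, **models)          # change speaker and style
```

The three calls at the end mirror the three demo sections on this page: the same content utterance is kept while the speaker utterance, the style utterance, or both are swapped.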
Clean Voice Conversion:
The following demos show clean voice conversion.
Content, speaker, and style utterances are all from the LibriSpeech test set.
In each demo, the content utterance and the style utterance are the same, so only the speaker is changed by the conversion.
Field | Content Utterance (LibriSpeech) | Speaker Utterance (LibriSpeech) | Style Utterance (LibriSpeech) | Conversion Result |
Speaker ID | SPK2300 | SPK1089 | SPK2300 | |
Speaker ID | SPK2300 | SPK1188 | SPK2300 | |
Speaker ID | SPK1221 | SPK1320 | SPK1221 | |
Speaker ID | SPK1580 | SPK1995 | SPK1580 | |
Speaker ID | SPK1580 | SPK237 | SPK1580 | |
Speaker ID | SPK1580 | SPK2830 | SPK1580 | |
Only Change Style
The following demos change only the style embedding.
The content utterance and the speaker utterance are the same and come from the VOiCES_devkit test set.
Style utterances are from the LibriSpeech test set.
'Clean content utterance' below refers to the clean VOiCES utterance without reverberation.
Field | Content Utterance (VOiCES) | Speaker Utterance (VOiCES) | Style Utterance (LibriSpeech) | Conversion Result | Clean Content Utterance (VOiCES) |
Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030 | Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp8855-ch302395-sg0007 |
Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp8468-ch294887-sg0002-mc01-stu-clo-dg140 | Lab41-SRI-VOiCES-rm4-none-sp8468-ch294887-sg0002-mc01-stu-clo-dg140 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp8468-ch294887-sg0002 |
Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp3816-ch290923-sg0003-mc05-stu-far-dg120 | Lab41-SRI-VOiCES-rm4-none-sp3816-ch290923-sg0003-mc05-stu-far-dg120 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp3816-ch290923-sg0003 |
Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp0688-ch015446-sg0034-mc01-stu-clo-dg050 | Lab41-SRI-VOiCES-rm3-none-sp0688-ch015446-sg0034-mc01-stu-clo-dg050 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp0688-ch015446-sg0034 |
Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp0667-ch105002-sg0020-mc01-stu-clo-dg080 | Lab41-SRI-VOiCES-rm3-none-sp0667-ch105002-sg0020-mc01-stu-clo-dg080 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp0667-ch105002-sg0020 |
Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp0373-ch130977-sg0028-mc01-stu-clo-dg080 | Lab41-SRI-VOiCES-rm3-none-sp0373-ch130977-sg0028-mc01-stu-clo-dg080 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp0373-ch130977-sg0028 |
Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp6499-ch057667-sg0021-mc01-stu-clo-dg010 | Lab41-SRI-VOiCES-rm2-none-sp6499-ch057667-sg0021-mc01-stu-clo-dg010 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp6499-ch057667-sg0021 |
Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp5588-ch068192-sg0028-mc01-stu-clo-dg090 | Lab41-SRI-VOiCES-rm2-none-sp5588-ch068192-sg0028-mc01-stu-clo-dg090 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp5588-ch068192-sg0028 |
Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp5139-ch061422-sg0023-mc01-stu-clo-dg060 | Lab41-SRI-VOiCES-rm2-none-sp5139-ch061422-sg0023-mc01-stu-clo-dg060 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp5139-ch061422-sg0023 |
Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp5322-ch007680-sg0022-mc01-stu-clo-dg160 | Lab41-SRI-VOiCES-rm1-none-sp5322-ch007680-sg0022-mc01-stu-clo-dg160 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp5322-ch007680-sg0022 |
Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp4267-ch287369-sg0016-mc01-stu-clo-dg100 | Lab41-SRI-VOiCES-rm1-none-sp4267-ch287369-sg0016-mc01-stu-clo-dg100 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp4267-ch287369-sg0016 |
Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp3816-ch290923-sg0003-mc01-stu-clo-dg120 | Lab41-SRI-VOiCES-rm1-none-sp3816-ch290923-sg0003-mc01-stu-clo-dg120 | 1089-134686-0010 | | Lab41-SRI-VOiCES-src-sp3816-ch290923-sg0003 |
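The utterance IDs listed on this page follow two visible patterns, which can be split into fields with a short parser. This is a sketch under stated assumptions: the function name `parse_utterance_id` is hypothetical, and the VOiCES field names (`room`, `distractor`, `mic`, `mic_type`, `position`, `degrees`) are inferred from the tokens `rm`, `none`, `mc`, `stu`, `clo`/`far`, and `dg` rather than taken from this page. The LibriSpeech IDs are read as speaker-chapter-utterance.

```python
import re

# VOiCES IDs: room recordings ("rm...") or clean source files ("src").
# Field names are assumptions inferred from the ID tokens.
VOICES_PATTERN = re.compile(
    r"^Lab41-SRI-VOiCES-(?:rm(?P<room>\d+)-(?P<distractor>[a-z]+)|src)"
    r"-sp(?P<speaker>\d+)-ch(?P<chapter>\d+)-sg(?P<segment>\d+)"
    r"(?:-mc(?P<mic>\d+)-(?P<mic_type>[a-z]+)-(?P<position>[a-z]+)"
    r"-dg(?P<degrees>\d+))?$"
)

# LibriSpeech IDs: speaker-chapter-utterance, e.g. 1089-134686-0010.
LIBRISPEECH_PATTERN = re.compile(
    r"^(?P<speaker>\d+)-(?P<chapter>\d+)-(?P<utterance>\d+)$"
)

def parse_utterance_id(utt_id):
    """Split an utterance ID from this page into named string fields."""
    match = VOICES_PATTERN.match(utt_id)
    if match:
        # Drop fields that are absent, e.g. room and mic for "src" files.
        fields = {k: v for k, v in match.groupdict().items() if v is not None}
        fields["corpus"] = "VOiCES"
        return fields
    match = LIBRISPEECH_PATTERN.match(utt_id)
    if match:
        return {"corpus": "LibriSpeech", **match.groupdict()}
    raise ValueError(f"Unrecognized utterance ID: {utt_id}")

reverberant = parse_utterance_id(
    "Lab41-SRI-VOiCES-rm4-none-sp8855-ch302395-sg0007-mc05-stu-far-dg030"
)
clean = parse_utterance_id("Lab41-SRI-VOiCES-src-sp8855-ch302395-sg0007")
style = parse_utterance_id("1089-134686-0010")

# The reverberant and clean IDs refer to the same speaker, chapter, and segment.
same_source = all(
    reverberant[key] == clean[key] for key in ("speaker", "chapter", "segment")
)
```

Comparing the `speaker`, `chapter`, and `segment` fields is one way to check that a reverberant content utterance and its clean counterpart in the table refer to the same source recording.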
Change Style and Speaker
The following demos change style and speaker embeddings simultaneously.
Content utterances are from the VOiCES_devkit test set.
Style utterances and speaker utterances are from the LibriSpeech test set.
Field | Content Utterance (VOiCES) | Speaker Utterance (LibriSpeech) | Style Utterance (LibriSpeech) | Conversion Result |
Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp3070-ch166423-sg0042-mc01-stu-clo-dg080 | 8455-210777-0044 | 8463-294828-0038 | |
Utterance ID | Lab41-SRI-VOiCES-rm3-none-sp6080-ch058025-sg0029-mc01-stu-clo-dg080 | 4446-2275-0025 | 8224-274384-0002 | |
Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp5029-ch030593-sg0041-mc05-stu-far-dg140 | 5142-36377-0010 | 2830-3980-0063 | |
Utterance ID | Lab41-SRI-VOiCES-rm1-none-sp3070-ch166423-sg0042-mc01-stu-clo-dg080 | 2094-142345-0010 | 260-123288-0016 | |
Utterance ID | Lab41-SRI-VOiCES-rm2-none-sp1390-ch130494-sg0008-mc05-stu-far-dg180 | 5142-33396-0059 | 3570-5695-0007 | |
Utterance ID | Lab41-SRI-VOiCES-rm4-none-sp1898-ch145702-sg0001-mc01-stu-clo-dg100 | 61-70968-0040 | 237-134493-0010 | |