Accent-VITS: accent transfer for end-to-end TTS



https://ma-linhan.github.io/AccentVITS/


Linhan Ma1, Yongmao Zhang1, Xinfa Zhu1, Yi Lei1, Ziqian Ning1, Pengcheng Zhu2, and Lei Xie1*
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Fuxi AI Lab, NetEase Inc., Hangzhou, China



1. Abstract

Accent transfer aims to transfer the accent of a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [1] end-to-end accent transfer model named Accent-VITS. Building on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure [2] to model accent pronunciation information and acoustic features, using bottleneck (BN) features and mel spectrograms as the respective constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness compared with a strong baseline, Text2BN2Mel+Vocoder [3][4].
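For reference, a hierarchical CVAE of this kind is typically trained with a two-level evidence lower bound. The generic form below is our sketch, not the paper's exact objective: z_a denotes the accent latent constrained by BN features, z_w the acoustic latent constrained by the mel spectrogram, c the text, and s the speaker embedding (notation ours).

\mathcal{L} \;=\; \mathbb{E}_{q(z_w \mid x)}\!\left[\log p(x \mid z_w, s)\right]
\;-\; D_{\mathrm{KL}}\!\left(q(z_w \mid x)\,\Vert\,p(z_w \mid z_a, s)\right)
\;-\; D_{\mathrm{KL}}\!\left(q(z_a \mid \mathrm{BN})\,\Vert\,p(z_a \mid c)\right)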


Fig. 1: The architecture of Accent-VITS.
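To make the two-stage decomposition concrete, below is a minimal, hypothetical PyTorch sketch of the idea: a text-to-accent stage constrained by BN features, and an accent-to-wave stage conditioned on speaker identity. All module names, dimensions, and the toy MLP layers are our assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of the text-to-accent / accent-to-wave split
# described above; shapes and layers are illustrative assumptions.
import torch
import torch.nn as nn

class TextToAccent(nn.Module):
    """Stage 1: a latent over accent pronunciation, constrained during
    training by bottleneck (BN) features from accented source speech."""
    def __init__(self, n_phones=100, d_model=192, d_bn=256):
        super().__init__()
        self.text_enc = nn.Sequential(
            nn.Embedding(n_phones, d_model),
            nn.Linear(d_model, d_model), nn.ReLU(),
        )
        # Predicts mean/log-variance of the accent latent from text.
        self.prior = nn.Linear(d_model, 2 * d_model)
        # Decodes the latent back to BN features (training constraint).
        self.bn_dec = nn.Linear(d_model, d_bn)

    def forward(self, phone_ids):
        h = self.text_enc(phone_ids)
        mu, logvar = self.prior(h).chunk(2, dim=-1)
        z_accent = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z_accent, self.bn_dec(z_accent), (mu, logvar)

class AccentToWave(nn.Module):
    """Stage 2: maps the accent latent plus a speaker embedding to a
    waveform; the mel spectrogram constrains the acoustic latent."""
    def __init__(self, d_model=192, n_speakers=10, hop=256):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_model)
        self.acoustic = nn.Linear(2 * d_model, d_model)
        # Toy upsampler standing in for the real waveform decoder.
        self.dec = nn.Sequential(nn.Linear(d_model, hop), nn.Tanh())

    def forward(self, z_accent, speaker_id):
        spk = self.spk_emb(speaker_id)[:, None, :].expand_as(z_accent)
        h = self.acoustic(torch.cat([z_accent, spk], dim=-1))
        return self.dec(h).flatten(1)  # (batch, frames * hop) samples

# Usage: timbre enters only in stage 2, so swapping speaker_id at
# inference transfers the stage-1 accent to the target speaker's voice.
phones = torch.randint(0, 100, (1, 20))
z, bn_pred, _ = TextToAccent()(phones)
wav = AccentToWave()(z, torch.tensor([3]))
print(wav.shape)  # torch.Size([1, 5120])

The point the sketch encodes is that speaker identity never touches the first stage, which is what makes the accent/timbre disentanglement stable.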


2. Dataset

Standard Mandarin Dataset -- DB1

Text: 他微微腆着肚子,双手叉腰,目视前方。 ("He sticks his belly out slightly, hands on his hips, eyes front.") 他对这诗越看越喜欢。 ("The more he reads this poem, the more he likes it.") 央企的身板儿一下子就硬了不少。 ("The central state-owned enterprises suddenly stood a lot taller.")
Audio: [samples embedded on the demo page]

Accented Mandarin Dataset with Speakers of Multiple Accents

Dongbei (Northeast China):

Text: 跟我俩赛脸呢啊?等会儿回去看我咋收拾你。 ("Acting up with me, huh? Wait till we get home and see how I deal with you.") 我压根不同意你去。 ("I never agreed to your going at all.") 他考试不及格,被刷掉了。 ("He failed the exam and got weeded out.")
Audio: [samples embedded on the demo page]

Henan:

Text: 我锤了他几下。 ("I thumped him a few times.") 那小子一会儿可串没影儿啦。 ("That kid ran off out of sight in no time.") 那次中过一回点,现在看见电器都怵。 (roughly: "Ever since getting zapped that one time, he gets scared at the mere sight of electrical appliances.")
Audio: [samples embedded on the demo page]

Shanghai:

Text: 记得装点吃呃辣辣身高头啊。 (roughly: "Remember to pack some food to carry with you.") 吾可以改,但是需要一段辰光来适应。 ("I can change, but I need some time to adapt.") 跟老于帮伊呃男朋友辣来搿的坐来海嘎塞无。 (roughly: "Sitting here chatting with Lao Yu and her boyfriend.")
Audio: [samples embedded on the demo page]

Sichuan:

Text: 对于这件事,你又没得话说。 ("About this matter, you've got nothing to say.") 今天的肉卖的好相因哦。 ("The meat is selling really cheap today.") 对你笑嚯嚯,因为我讲礼貌。 ("I smile at you because I mind my manners.")
Audio: [samples embedded on the demo page]


3. Demos -- Accent Transfer

Accent: Dongbei (Northeast China)

Audio columns (on the demo page): Ground truth | Target speaker | Text2BN2Mel+Vocoder | Accent-VITS
Text: 我们都是木头人儿,一不许说话,二不许笑,三不许露出大门牙。 ("We are all wooden figures: first, no talking; second, no laughing; third, no showing your big front teeth.")
Text: 他家来妾了,我们去帮帮忙。 (roughly: "They have guests over; let's go help out.")
Text: 他可贼能胡咧咧,小嘴儿叭儿叭儿的,东北小伙儿不愿听就会鸟悄儿地走开。 ("He sure can ramble on, his little mouth yakking away; a Northeastern guy who doesn't want to listen will just quietly slip away.")

Accent: Henan

Audio columns (on the demo page): Ground truth | Target speaker | Text2BN2Mel+Vocoder | Accent-VITS
Text: 俺说这个屋里俺的兄弟姐妹们啊。 ("I'm talking about my brothers and sisters here in this house.")
Text: 不必读堵着耳朵去搞实验了。 (roughly: "No need to plug your ears to run the experiment any more.")
Text: 我好奇地拿起一个大蒜,左瞅瞅,右瞅瞅。 ("Curious, I picked up a head of garlic and looked it over, this way and that.")

Accent: Shanghai

Audio columns (on the demo page): Ground truth | Target speaker | Text2BN2Mel+Vocoder | Accent-VITS
Text: 下趟侬忙呃辰光吾尽量伐干扰。 ("Next time when you're busy, I'll try not to disturb you.")
Text: 有只狗早哴向七点钟起来。 ("There's a dog that gets up at seven in the morning.")
Text: 侬辣做撒啊?老公。 ("What are you doing, honey?")

Accent: Sichuan

Audio columns (on the demo page): Ground truth | Target speaker | Text2BN2Mel+Vocoder | Accent-VITS
Text: 我们这虽然伙食不错,但是数量是有限嘞。 ("Although the food here is pretty good, the portions are limited.")
Text: 真相其实很简单,简单到想不到或不敢想。 ("The truth is actually very simple; so simple you can't imagine it, or don't dare to.")
Text: 肯定是你小子报嘞警。 ("It must have been you, kid, who called the police.")

[1] Jaehyeon Kim, Jungil Kong, and Juhee Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. ICML, 2021, vol. 139, pp. 5530–5540, PMLR.
[2] Sang-Hoon Lee, Seung-Bin Kim, Ji-Hyun Lee, Eunwoo Song, Min-Jae Hwang, and Seong-Whan Lee, "HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis," in Advances in Neural Information Processing Systems, vol. 35, pp. 16624–16636, 2022.
[3] Yongmao Zhang, Zhichao Wang, Peiji Yang, Hongshen Sun, Zhisheng Wang, and Lei Xie, "AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents," in Proc. ISCSLP, 2022, pp. 76–80, IEEE.
[4] Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, and Yuxuan Wang, "Cloning one's voice using very limited data in the wild," in Proc. ICASSP, 2022, pp. 8322–8326, IEEE.