ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion

Casanova, Edresson; Shulby, Christopher; Korolev, Alexander; Junior, Arnaldo Candido [UNESP]; da Silva Soares, Anderson; Aluísio, Sandra; Ponti, Moacir Antonelli

doi:10.21437/Interspeech.2023-496

dc.contributor.author	Casanova, Edresson
dc.contributor.author	Shulby, Christopher
dc.contributor.author	Korolev, Alexander
dc.contributor.author	Junior, Arnaldo Candido [UNESP]
dc.contributor.author	da Silva Soares, Anderson
dc.contributor.author	Aluísio, Sandra
dc.contributor.author	Ponti, Moacir Antonelli
dc.contributor.institution	Coqui
dc.contributor.institution	Universidade de São Paulo (USP)
dc.contributor.institution	QuintoAndar
dc.contributor.institution	Darmstadt University of Applied Sciences
dc.contributor.institution	Universidade Estadual Paulista (UNESP)
dc.contributor.institution	Universidade Federal de Goiás (UFG)
dc.contributor.institution	Mercado Livre
dc.date.accessioned	2025-04-29T20:13:55Z
dc.date.issued	2023-01-01
dc.description.abstract	We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied to data augmentation for automatic speech recognition (ASR) systems in low/medium-resource scenarios. Through extensive experiments, we show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training. We also managed to close the gap between ASR models trained with synthesized versus human speech compared to other works that use many speakers. Finally, we show that it is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.	en
dc.description.affiliation	Coqui
dc.description.affiliation	Instituto de Ciências Matemáticas e de Computação Universidade de São Paulo
dc.description.affiliation	QuintoAndar
dc.description.affiliation	Darmstadt University of Applied Sciences
dc.description.affiliation	São Paulo State University
dc.description.affiliation	Federal University of Goiás
dc.description.affiliation	Mercado Livre
dc.description.affiliationUnesp	São Paulo State University
dc.description.sponsorship	Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
dc.description.sponsorship	Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
dc.description.sponsorshipId	FAPESP: #2019/07665-4
dc.description.sponsorshipId	CNPq: 304266/2020-5
dc.format.extent	1244-1248
dc.identifier	http://dx.doi.org/10.21437/Interspeech.2023-496
dc.identifier.citation	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, v. 2023-August, p. 1244-1248.
dc.identifier.doi	10.21437/Interspeech.2023-496
dc.identifier.issn	1990-9772
dc.identifier.issn	2308-457X
dc.identifier.scopus	2-s2.0-85171552571
dc.identifier.uri	https://hdl.handle.net/11449/308910
dc.language.iso	eng
dc.relation.ispartof	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
dc.source	Scopus
dc.subject	ASR Data Augmentation
dc.subject	Cross-lingual Zero-shot Multi-speaker TTS
dc.subject	Cross-lingual Zero-shot Voice Conversion
dc.subject	Low-resource
dc.subject	Speech Recognition
dc.subject	Speech Synthesis
dc.title	ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion	en
dc.type	Conference paper	en
dspace.entity.type	Publication
