
ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion

dc.contributor.author: Casanova, Edresson
dc.contributor.author: Shulby, Christopher
dc.contributor.author: Korolev, Alexander
dc.contributor.author: Junior, Arnaldo Candido [UNESP]
dc.contributor.author: da Silva Soares, Anderson
dc.contributor.author: Aluísio, Sandra
dc.contributor.author: Ponti, Moacir Antonelli
dc.contributor.institution: Coqui
dc.contributor.institution: Universidade de São Paulo (USP)
dc.contributor.institution: QuintoAndar
dc.contributor.institution: Darmstadt University of Applied Sciences
dc.contributor.institution: Universidade Estadual Paulista (UNESP)
dc.contributor.institution: Universidade Federal de Goiás (UFG)
dc.contributor.institution: Mercado Livre
dc.date.accessioned: 2025-04-29T20:13:55Z
dc.date.issued: 2023-01-01
dc.description.abstract [en]: We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied to data augmentation for automatic speech recognition (ASR) systems in low/medium-resource scenarios. Through extensive experiments, we show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training. We also managed to close the gap between ASR models trained with synthesized versus human speech compared to other works that use many speakers. Finally, we show that it is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
dc.description.affiliation: Coqui
dc.description.affiliation: Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo
dc.description.affiliation: QuintoAndar
dc.description.affiliation: Darmstadt University of Applied Sciences
dc.description.affiliation: São Paulo State University
dc.description.affiliation: Federal University of Goiás
dc.description.affiliation: Mercado Livre
dc.description.affiliation: São Paulo State University (Unesp)
dc.description.sponsorship: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
dc.description.sponsorship: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
dc.description.sponsorshipId: FAPESP: #2019/07665-4
dc.description.sponsorshipId: CNPq: 304266/2020-5
dc.format.extent: 1244-1248
dc.identifier: http://dx.doi.org/10.21437/Interspeech.2023-496
dc.identifier.citation: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, v. 2023-August, p. 1244-1248.
dc.identifier.doi: 10.21437/Interspeech.2023-496
dc.identifier.issn: 1990-9772
dc.identifier.issn: 2308-457X
dc.identifier.scopus: 2-s2.0-85171552571
dc.identifier.uri: https://hdl.handle.net/11449/308910
dc.language.iso: eng
dc.relation.ispartof: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
dc.source: Scopus
dc.subject: ASR Data Augmentation
dc.subject: Cross-lingual Zero-shot Multi-speaker TTS
dc.subject: Cross-lingual Zero-shot Voice Conversion
dc.subject: Low-resource
dc.subject: Speech Recognition
dc.subject: Speech Synthesis
dc.title [en]: ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion
dc.type [pt]: Conference paper (Trabalho apresentado em evento)
dspace.entity.type: Publication
