A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation

dc.contributor.author: Lima, Rodrigo
dc.contributor.author: Leal, Sidney E.
dc.contributor.author: Junior, Arnaldo Candido [UNESP]
dc.contributor.author: Aluísio, Sandra M.
dc.contributor.institution: Universidade de São Paulo (USP)
dc.contributor.institution: Universidade Estadual Paulista (UNESP)
dc.date.accessioned: 2025-04-29T20:05:32Z
dc.date.issued: 2025-01-01
dc.description.abstract: We present a freely available spontaneous speech corpus for the Brazilian Portuguese language and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 females, 197 males) with a total of 239.30 h of transcribed audio recordings. To the best of our knowledge, this is the first large Paulistano-accented spontaneous speech corpus dedicated to the ASR task in Portuguese. We first present the design and development procedures of the NURC-SP Audio Corpus, and then describe four ASR experiments in detail. The experiments demonstrated promising results for the applicability of the corpus to ASR. Specifically, we fine-tuned two versions of the Wav2Vec2-XLSR-53 model, trained a Distil-Whisper model using our dataset with labels determined by the Whisper Large-V3 model, and fine-tuned this Distil-Whisper model with our corpus. Our best result was obtained by Distil-Whisper fine-tuned on the NURC-SP Audio Corpus, with a WER of 24.22%, followed by a fine-tuned version of the Wav2Vec2-XLSR-53 model with a WER of 33.73%, which is almost 10 percentage points worse than Distil-Whisper’s. To enable experiment reproducibility, we share the NURC-SP Audio Corpus dataset, pre-trained models, and training recipes in Hugging Face and GitHub repositories. (en)
dc.description.affiliation: University of São Paulo, SP
dc.description.affiliation: Universidade Estadual Paulista, SP
dc.description.affiliationUnesp: Universidade Estadual Paulista, SP
dc.description.sponsorship: Ministério da Ciência, Tecnologia e Inovação
dc.format.extent: 33-47
dc.identifier: http://dx.doi.org/10.1007/978-3-031-79029-4_3
dc.identifier.citation: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), v. 15412 LNAI, p. 33-47.
dc.identifier.doi: 10.1007/978-3-031-79029-4_3
dc.identifier.issn: 1611-3349
dc.identifier.issn: 0302-9743
dc.identifier.scopus: 2-s2.0-85219182728
dc.identifier.uri: https://hdl.handle.net/11449/306172
dc.language.iso: eng
dc.relation.ispartof: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.source: Scopus
dc.subject: Automatic speech recognition evaluation
dc.subject: Brazilian Portuguese
dc.subject: Public speech corpora
dc.subject: Spontaneous speech
dc.title: A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation (en)
dc.type: Conference paper (Trabalho apresentado em evento, pt)
dspace.entity.type: Publication
unesp.author.orcid: 0009-0009-4344-1109 [1]
unesp.author.orcid: 0000-0002-8817-2063 [2]
unesp.author.orcid: 0000-0002-5647-0891 [3]
unesp.author.orcid: 0000-0001-5108-2630 [4]
