
MFCC Parameters of the Speech Signal: An Alternative to Formant-Based Instantaneous Vocal Tract Length Estimation

dc.contributor.author: Vasquez-Serrano, P.
dc.contributor.author: Reyes-Moreno, J.
dc.contributor.author: Guido, Rodrigo Capobianco [UNESP]
dc.contributor.author: Sepúlveda-Sepúlveda, Alexander
dc.contributor.institution: Universidad Industrial de Santander
dc.contributor.institution: Universidade Estadual Paulista (UNESP)
dc.date.accessioned: 2025-04-29T18:40:49Z
dc.date.issued: 2023-01-01
dc.description.abstract: On the one hand, the relationship between formant frequencies and vocal tract length (VTL) has been intensively studied over the years. On the other hand, the connection between VTL and mel-frequency cepstral coefficients (MFCCs), which concisely encode the overall shape of a speaker's spectral envelope with just a few cepstral coefficients, has only been modestly analyzed and is worthy of further investigation. Thus, based on different statistical models, this article explores the advantages and disadvantages of the latter approach, which is relatively novel, in contrast to the former, which arises from more traditional studies. Additionally, VTL is usually assumed to be a static and inherent characteristic of speakers; that is, a single length parameter is frequently estimated per speaker. By contrast, in this paper we consider VTL estimation from a dynamic perspective, using modern real-time magnetic resonance imaging (rtMRI) to measure VTL in parallel with audio signals. To support the experiments, data obtained from USC-TIMIT magnetic resonance videos were used, allowing for the 2D real-time analysis of articulators in motion. As a result, we observed that MFCCs perform better in speaker-dependent modeling; however, in cross-speaker modeling, which uses different speakers' data for training and evaluation, their performance is not significantly different from that obtained with formants. In addition, we note that the estimation based on MFCCs is robust, with an acceptable computational time complexity, consistent with the traditional approach. (en)
dc.description.affiliation: Escuela de Ing. Eléctrica Electrónica y de Telecomunicaciones (E3T) Universidad Industrial de Santander
dc.description.affiliation: Instituto de Biociências Letras e Ciências Exatas Unesp – Univ Estadual Paulista (São Paulo State University), Rua Cristóvão Colombo 2265, Jd Nazareth, SP
dc.description.affiliationUnesp: Instituto de Biociências Letras e Ciências Exatas Unesp – Univ Estadual Paulista (São Paulo State University), Rua Cristóvão Colombo 2265, Jd Nazareth, SP
dc.description.sponsorship: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
dc.identifier: http://dx.doi.org/10.1016/j.jvoice.2023.05.012
dc.identifier.citation: Journal of Voice.
dc.identifier.doi: 10.1016/j.jvoice.2023.05.012
dc.identifier.issn: 1873-4588
dc.identifier.issn: 0892-1997
dc.identifier.scopus: 2-s2.0-85162114208
dc.identifier.uri: https://hdl.handle.net/11449/298915
dc.language.iso: eng
dc.relation.ispartof: Journal of Voice
dc.source: Scopus
dc.subject: Acoustic-to-articulatory inversion
dc.subject: Formants
dc.subject: MFCCs
dc.subject: Vocal tract length
dc.title: MFCC Parameters of the Speech Signal: An Alternative to Formant-Based Instantaneous Vocal Tract Length Estimation (en)
dc.type: Article (Artigo, pt)
dspace.entity.type: Publication
unesp.campus: Universidade Estadual Paulista (UNESP), Instituto de Biociências, Letras e Ciências Exatas, São José do Rio Preto (pt)
