Publication:
Multimodal audio-visual information fusion using canonical-correlated Graph Neural Network for energy-efficient speech enhancement

dc.contributor.author: Passos, Leandro A.
dc.contributor.author: Papa, João Paulo [UNESP]
dc.contributor.author: Del Ser, Javier
dc.contributor.author: Hussain, Amir
dc.contributor.author: Adeel, Ahsan
dc.contributor.institution: University of Wolverhampton
dc.contributor.institution: Universidade Estadual Paulista (UNESP)
dc.contributor.institution: Basque Research & Technology Alliance (BRTA)
dc.contributor.institution: University of the Basque Country (UPV/EHU)
dc.contributor.institution: Edinburgh Napier University
dc.contributor.institution: DeepCI
dc.date.accessioned: 2023-07-29T13:21:14Z
dc.date.available: 2023-07-29T13:21:14Z
dc.date.issued: 2023-02-01
dc.description.abstract: This paper proposes a novel multimodal self-supervised architecture for energy-efficient audio-visual (AV) speech enhancement that integrates Graph Neural Networks with canonical correlation analysis (CCA-GNN). The proposed approach builds on a state-of-the-art CCA-GNN that learns representative embeddings by maximizing the correlation between pairs of augmented views of the same input while decorrelating disconnected features. The key idea of the conventional CCA-GNN is to discard augmentation-variant information and preserve augmentation-invariant information while preventing the capture of redundant information. Our proposed AV CCA-GNN model addresses a multimodal representation-learning context. Specifically, it improves contextual AV speech processing by maximizing the canonical correlation both between augmented views of the same channel and between the audio and visual embeddings. In addition, it proposes a positional node encoding that considers a prior-frame sequence distance instead of a feature-space representation when computing a node's nearest neighbors, introducing temporal information into the embeddings through the neighborhood's connectivity. Experiments conducted on the benchmark ChiME3 dataset show that the proposed prior-frame-based AV CCA-GNN ensures better feature learning in the temporal context, leading to more energy-efficient speech reconstruction than the state-of-the-art CCA-GNN and a multilayer perceptron.
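The abstract describes two core ingredients: a temporal (prior-frame) neighborhood used to build the graph instead of feature-space nearest neighbors, and a correlation-maximizing objective that aligns audio and visual embeddings. A rough, illustrative sketch of both ideas follows; the function names, shapes, and the simplified per-dimension correlation objective are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def prior_frame_adjacency(num_frames: int, k: int) -> np.ndarray:
    """Build a symmetric adjacency matrix that links each frame to its k
    preceding frames in the sequence (temporal neighborhood), rather than
    selecting neighbors by feature-space distance. Illustrative sketch of
    the prior-frame neighborhood described in the abstract."""
    A = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        for d in range(1, k + 1):
            if t - d >= 0:
                A[t, t - d] = 1.0  # edge to a prior frame
                A[t - d, t] = 1.0  # keep the graph undirected
    return A

def cross_modal_correlation(Za: np.ndarray, Zv: np.ndarray,
                            eps: float = 1e-8) -> float:
    """Mean per-dimension correlation between standardized audio (Za) and
    visual (Zv) embeddings, each of shape (n_frames, n_dims). Maximizing
    this quantity aligns the two modalities, in the spirit of the
    CCA-style objective the paper uses (simplified here)."""
    Za = (Za - Za.mean(axis=0)) / (Za.std(axis=0) + eps)
    Zv = (Zv - Zv.mean(axis=0)) / (Zv.std(axis=0) + eps)
    # sum over frames gives n * corr per dimension; average over dimensions
    return float((Za * Zv).sum(axis=0).mean() / Za.shape[0])
```

With identical inputs the correlation is (up to the eps regularizer) 1, and the adjacency connects frame t only to frames t-1 ... t-k, so temporal order, not feature similarity, defines the neighborhood.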
dc.description.affiliation: CMI Lab School of Engineering and Informatics University of Wolverhampton, England
dc.description.affiliation: Department of Computing São Paulo State University, Bauru
dc.description.affiliation: TECNALIA Basque Research & Technology Alliance (BRTA), Bizkaia
dc.description.affiliation: University of the Basque Country (UPV/EHU), Bizkaia
dc.description.affiliation: School of Computing Edinburgh Napier University, Scotland
dc.description.affiliation: DeepCI, Scotland
dc.description.affiliationUnesp: Department of Computing São Paulo State University, Bauru
dc.description.sponsorship: Ministerio de Ciencia e Innovación
dc.description.sponsorship: Eusko Jaurlaritza
dc.description.sponsorship: Engineering and Physical Sciences Research Council
dc.description.sponsorshipId: Engineering and Physical Sciences Research Council: EP/T021063/1
dc.format.extent: 1-11
dc.identifier: http://dx.doi.org/10.1016/j.inffus.2022.09.006
dc.identifier.citation: Information Fusion, v. 90, p. 1-11.
dc.identifier.doi: 10.1016/j.inffus.2022.09.006
dc.identifier.issn: 1566-2535
dc.identifier.scopus: 2-s2.0-85138109331
dc.identifier.uri: http://hdl.handle.net/11449/247622
dc.language.iso: eng
dc.relation.ispartof: Information Fusion
dc.source: Scopus
dc.subject: Canonical correlation analysis
dc.subject: Graph Neural Networks
dc.subject: Multimodal learning
dc.subject: Positional encoding
dc.subject: Prior frames neighborhood
dc.title: Multimodal audio-visual information fusion using canonical-correlated Graph Neural Network for energy-efficient speech enhancement
dc.type: Article
dspace.entity.type: Publication
unesp.author.orcid: 0000-0003-3529-3109[1]
unesp.campus: Universidade Estadual Paulista (UNESP), Faculdade de Ciências, Bauru
unesp.department: Computação - FC

Files