

Title: AUTOMATIC SPEECH RECOGNITION IN PORTUGUESE: ADVANCING THE WHISPER ARCHITECTURE THROUGH HYBRID ENCODER DESIGNS
Institution: PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO - PUC-RIO
Author(s): ABDIGAL GABRIEL CAMARGO BARRA
Collaborator(s): MARCO AURELIO CAVALCANTI PACHECO - Advisor
MANOELA RABELLO KOHLER - Co-advisor
Cataloged: 11/09/2025 11:49:16
Type: THESIS Language(s): ENGLISH - UNITED STATES
Reference [pt]: https://www.maxwell.vrac.puc-rio.br/eletricaonline/serieConsulta.php?strSecao=resultado&nrSeq=72900@1
Reference [en]: https://www.maxwell.vrac.puc-rio.br/eletricaonline/serieConsulta.php?strSecao=resultado&nrSeq=72900@2
Abstract:
Automatic Speech Recognition (ASR) remains one of the most prominent challenges addressed by Deep Learning, with continuous progress leading to increasingly robust models. Despite this advancement, most state-of-the-art ASR systems are either trained from scratch on high-resource languages, such as English or Chinese, or employ multilingual strategies that often underrepresent languages like Portuguese. Moreover, leading architectures like Whisper have demonstrated impressive performance across numerous languages but rely on large-scale, proprietary training pipelines that are computationally intensive and not fully open-source. This study addresses these limitations by focusing on Portuguese ASR using a more accessible and adaptable approach. First, a full reproduction of the Whisper training methodology was implemented, targeting a smaller architecture and training from scratch on four curated Portuguese datasets. This enables the evaluation of Whisper's training paradigm in a resource-constrained, language-specific context. Furthermore, the study explores architectural modifications of the encoder block by integrating two variants: (i) the Conformer block, which combines multi-head self-attention with convolutional layers to capture both global and local features, and (ii) the E-Branchformer block, which introduces a parallel cgMLP branch fused through convolution, designed to enhance representational capacity. All models are trained under the same experimental setup, tracking key metrics such as accuracy, Connectionist Temporal Classification loss, Kullback-Leibler divergence loss, and Word Error Rate. The results highlight not only the feasibility of replicating Whisper-like performance with significantly fewer resources but also show that the proposed architectural enhancements, particularly the E-Branchformer, yield superior performance across validation and test sets, including standardized benchmarks such as Common Voice.
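One of the tracked metrics, the Connectionist Temporal Classification (CTC) loss, sums the probability of every frame-level alignment that collapses to the target transcript. The sketch below is an illustrative stdlib implementation of the standard CTC forward algorithm, not code from the thesis; the function name and the plain-probability interface are assumptions for clarity (production systems work in log space).

```python
import math

def ctc_forward_loss(probs, labels, blank=0):
    """Negative log-likelihood of `labels` under the CTC forward algorithm.

    probs: list of per-frame probability distributions over the vocabulary
           (each row sums to 1); labels: target token ids, without blanks.
    """
    # Extended label sequence with blanks interleaved: [b, l1, b, l2, ..., lL, b]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: total probability of all alignments of the first t+1 frames
    # that collapse to the prefix ext[:s+1]
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        prev = alpha[:]
        for s in range(S):
            a = prev[s]
            if s >= 1:
                a += prev[s - 1]
            # The skip transition is allowed unless the current symbol is a
            # blank or repeats the symbol two positions back
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += prev[s - 2]
            alpha[s] = a * probs[t][ext[s]]

    # Valid alignments end on the last label or the trailing blank
    total = alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
    return -math.log(total)

# Example: 3 frames, uniform over {blank=0, token 1}; target [1].
# Six of the eight length-3 paths collapse to "1", so P = 6 * 0.125 = 0.75.
loss = ctc_forward_loss([[0.5, 0.5]] * 3, [1])  # -ln(0.75)
```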
This work contributes to an integrated and practical approach to improving ASR for underrepresented languages, demonstrating that lightweight models trained from scratch can offer competitive performance, making advanced speech technologies more accessible for real-world applications in Portuguese.
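The headline evaluation metric above, Word Error Rate (WER), is the word-level Levenshtein distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal stdlib sketch (the function name and normalization are illustrative, not taken from the thesis):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution against a four-word reference: WER = 1/4
score = wer("o gato preto dormiu", "o gato branco dormiu")
```

Note that insertions can push WER above 1.0, which is why it is reported as a rate rather than an accuracy.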
Description: File:
COMPLETE PDF
