Automatic Speech Recognition (ASR) remains one of the most prominent challenges addressed by Deep Learning, with continuous progress leading
to increasingly robust models. Despite this advancement, most state-of-the-art
ASR systems are either trained from scratch on high-resource languages, such as English or Chinese, or employ multilingual strategies that often underrepresent languages like Portuguese. Moreover, leading architectures such as Whisper have demonstrated impressive performance across numerous languages but rely on large-scale, proprietary training pipelines that are computationally intensive and not fully open-source.
This study addresses these limitations by focusing on Portuguese ASR
using a more accessible and adaptable approach. First, a full reproduction
of the Whisper training methodology was implemented, targeting a smaller
architecture and training from scratch on four curated Portuguese datasets.
This enables the evaluation of Whisper's training paradigm in a resource-constrained, language-specific context. Furthermore, the study explores architectural modifications of the encoder block by integrating two variants: (i) the
Conformer block, which combines multi-head self-attention with convolutional
layers to capture both global and local features, and (ii) the E-Branchformer
block, which introduces a parallel cgMLP branch fused through convolution,
designed to enhance representational capacity.
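The structural difference between the two variants can be sketched in a toy numpy example. This is an illustrative simplification, not the actual implementation: it uses a single attention head, omits layer normalization, the macaron feed-forward modules, and activation functions, and stands in a gated depthwise convolution for the full cgMLP; all function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (global context)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def depthwise_conv(x, kernel):
    """Per-channel 1-D convolution along time (local context).
    x: (T, d), kernel: (K, d) with K odd; 'same' zero padding."""
    K = kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + K] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])

def conformer_style_block(x, Wq, Wk, Wv, kernel):
    """Conformer idea: attention and convolution applied sequentially,
    each wrapped in a residual connection."""
    x = x + self_attention(x, Wq, Wk, Wv)
    x = x + depthwise_conv(x, kernel)
    return x

def ebranchformer_style_block(x, Wq, Wk, Wv, kernel, Wmerge):
    """E-Branchformer idea: a global (attention) branch and a local
    branch (a gated convolution standing in for cgMLP) run in parallel;
    their concatenation is fused by a depthwise conv and a linear map."""
    g = self_attention(x, Wq, Wk, Wv)          # global branch
    l = depthwise_conv(x, kernel) * x          # gated local branch (toy cgMLP)
    fused = np.concatenate([g, l], axis=-1)    # (T, 2d)
    fused = depthwise_conv(fused, np.ones((3, fused.shape[1])) / 3.0)
    return x + fused @ Wmerge                  # project back to d, residual
```

The key contrast survives the simplification: the Conformer stacks the global and local operators in sequence, while the E-Branchformer computes them in parallel and merges the branches through a convolution, which is what the study credits for its stronger representational capacity.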
All models are trained under the same experimental setup, tracking
key metrics such as accuracy, Connectionist Temporal Classification (CTC) loss, Kullback-Leibler (KL) divergence loss, and Word Error Rate (WER). The results highlight
not only the feasibility of replicating Whisper-like performance with significantly fewer resources but also show that the proposed architectural enhancements, particularly the E-Branchformer, yield superior performance across
validation and test sets, including standardized benchmarks such as Common
Voice.
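The headline metric above, Word Error Rate, is the word-level Levenshtein (edit) distance between reference and hypothesis transcripts, normalized by the number of reference words. A minimal sketch (assumes a non-empty reference; whitespace tokenization only, no text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("o gato preto", "o gato branco")` is 1/3: one substitution over three reference words.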
This work contributes to an integrated and practical approach to improving ASR for underrepresented languages, demonstrating that lightweight models trained from scratch can offer competitive performance, making advanced
speech technologies more accessible for real-world applications in Portuguese.