Automatic Speech Recognition (ASR) remains one of the most prominent challenges addressed by Deep Learning, with continuous progress leading
to increasingly robust models. Despite this advancement, most state-of-the-art
ASR systems are either trained from scratch on high-resource languages, such as English or Chinese, or employ multilingual strategies that often underrepresent languages like Portuguese. Moreover, leading architectures such as Whisper have demonstrated impressive performance across numerous languages but rely on large-scale, proprietary training pipelines that are computationally intensive and not fully open-source.
This study addresses these limitations by focusing on Portuguese ASR
using a more accessible and adaptable approach. First, a full reproduction
of the Whisper training methodology was implemented, targeting a smaller
architecture and training from scratch on four curated Portuguese datasets.
This enables the evaluation of Whisper's training paradigm in a resource-constrained, language-specific context. Furthermore, the study explores architectural modifications of the encoder block by integrating two variants: (i) the
Conformer block, which combines multi-head self-attention with convolutional
layers to capture both global and local features, and (ii) the E-Branchformer
block, which introduces a parallel cgMLP branch fused through convolution,
designed to enhance representational capacity.
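The structural difference between the two variants can be sketched in a toy numpy example. This is an illustrative simplification, not the actual implementation: it uses a single attention head, omits layer normalization, the macaron feed-forward modules, and activation functions, and stands in a gated depthwise convolution for the full cgMLP; all function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (global context)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def depthwise_conv(x, kernel):
    """Per-channel 1-D convolution along time (local context).
    x: (T, d), kernel: (K, d) with K odd; 'same' zero padding."""
    K = kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + K] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])

def conformer_style_block(x, Wq, Wk, Wv, kernel):
    """Conformer idea: attention and convolution applied sequentially,
    each wrapped in a residual connection."""
    x = x + self_attention(x, Wq, Wk, Wv)
    x = x + depthwise_conv(x, kernel)
    return x

def ebranchformer_style_block(x, Wq, Wk, Wv, kernel, Wmerge):
    """E-Branchformer idea: a global (attention) branch and a local
    branch (a gated convolution standing in for cgMLP) run in parallel;
    their concatenation is fused by a depthwise conv and a linear map."""
    g = self_attention(x, Wq, Wk, Wv)          # global branch
    l = depthwise_conv(x, kernel) * x          # gated local branch (toy cgMLP)
    fused = np.concatenate([g, l], axis=-1)    # (T, 2d)
    fused = depthwise_conv(fused, np.ones((3, fused.shape[1])) / 3.0)
    return x + fused @ Wmerge                  # project back to d, residual
```

The key contrast survives the simplification: the Conformer stacks the global and local operators in sequence, while the E-Branchformer computes them in parallel and merges the branches through a convolution, which is what the study credits for its stronger representational capacity.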
All models are trained under the same experimental setup, tracking
key metrics such as accuracy, Connectionist Temporal Classification (CTC) loss, Kullback-Leibler (KL) divergence loss, and Word Error Rate (WER). The results highlight
not only the feasibility of replicating Whisper-like performance with significantly fewer resources but also show that the proposed architectural enhancements, particularly the E-Branchformer, yield superior performance across
validation and test sets, including standardized benchmarks such as Common
Voice.
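The headline metric above, Word Error Rate, is the word-level Levenshtein (edit) distance between reference and hypothesis transcripts, normalized by the number of reference words. A minimal sketch (assumes a non-empty reference; whitespace tokenization only, no text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("o gato preto", "o gato branco")` is 1/3: one substitution over three reference words.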
This work contributes to an integrated and practical approach to improving ASR for underrepresented languages, demonstrating that lightweight models trained from scratch can offer competitive performance, making advanced
speech technologies more accessible for real-world applications in Portuguese.