An investigation of accent inclusion in Brazilian Portuguese Speech
Speech biometrics, accent inclusion, Brazilian Portuguese, speech corpus, dataset
The use of artificial intelligence is becoming increasingly present in people’s lives, even
if not always noticeable. While the majority of speech technologies have achieved high
accuracy, they fail when tested on accents that deviate from the “standard” of a language.
This is even more critical for Brazilian Portuguese, given its lack of resources for properly
developing such systems. The exclusionary behaviour of speech systems and the lack of
resources have inspired the objectives of this work. First, to explore new approaches to Accent
Conversion for this language using a light-weight model called Sparse Anchor-Based
Representation of Speech with Residual Information (SABr+Res), which should convert
from the paulista accent to the nordestino accent. Second, to collect and release the largest speech dataset for
Brazilian Portuguese to date. The dataset leverages the public availability of audio
from individuals on video platforms. TEDx Talks provide a reliable source of clean
speech from such speakers, and therefore this work collects the data automatically, while
manually annotating the demographic information required for the first objective of this
work and for other possible speech-related tasks. At its current validation level of 18.7%,
the dataset contains 110 hours of speech from 520 audio recordings and approximately 515 unique speakers. The
dataset already covers 21 of the 27 Brazilian states, making TEDx Talks Brazilian
Accents the most inclusive and representative dataset for the language.