COMPUTATIONAL LANGUAGE MODELS FOR MARMOSET MONKEY VOCALIZATIONS
Computational bioacoustics; Callithrix; Language models; Deep learning; Transformer; Acoustic embeddings.
Marmoset (Callithrix) vocal communication stands out for its acoustic sophistication and ontogenetic plasticity, and its structural properties suggest the existence of a complex syntax. While avian bioacoustics already employs deep-learning language models, marmoset research still lacks tools capable of modeling the sequential and acoustic complexity of their repertoires. This dissertation investigated the structure of marmoset vocal sequences by developing and comparing computational language models. The study used a dataset of 91,086 vocalizations from 9 marmosets during their first two months of life. The methodology comprised three phases: (I) establishing a baseline with Markov models of orders 0 to 19; (II) applying deep learning architectures (RNN, LSTM, and Transformer) to categorical syllable labels; and (III) implementing generative models based on acoustic embeddings extracted from spectrograms with a Swin Transformer. Evaluation used Kullback-Leibler divergence (D_KL), BLEU score, and syllable-proportion metrics. For discrete symbolic data, the 13th-order Markov model performed best, outperforming the neural networks, which in this setting suffered from mode collapse and excessive repetition. Introducing acoustic embeddings reversed this outcome: the Transformer architecture fed with rich spectral features achieved the best overall performance, surpassing the stochastic (Markov) baseline by significantly reducing D_KL and maintaining structural coherence in long sequences (up to 40 syllables). We conclude that rich acoustic information is indispensable for modeling primate communication and that the proposed hybrid architecture (Swin Transformer + Transformer) is a methodological advance capable of capturing temporal dependencies and bioacoustic nuances that escape traditional approaches.
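As a rough illustration of the phase (I) baseline and the D_KL evaluation, the sketch below fits an order-k Markov model to labeled syllable sequences, samples from it, and compares syllable proportions with Kullback-Leibler divergence. It is only a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the `<s>` start padding, the early stop on unseen contexts, and the `eps` floor for zero probabilities are all hypothetical choices made here for illustration.

```python
import math
import random
from collections import Counter, defaultdict

START = "<s>"  # hypothetical padding symbol marking the start of a sequence


def fit_markov(sequences, order):
    """Count next-syllable frequencies for every context of `order` syllables."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [START] * order + list(seq)
        for i in range(order, len(padded)):
            context = tuple(padded[i - order:i])
            counts[context][padded[i]] += 1
    return counts


def sample_sequence(counts, order, max_len, rng):
    """Sample up to `max_len` syllables; stop early at an unseen context
    (an assumption of this sketch, not a documented design choice)."""
    context = tuple([START] * order)
    generated = []
    while len(generated) < max_len:
        options = counts.get(context)
        if not options:
            break
        syllables = list(options.keys())
        weights = list(options.values())
        syllable = rng.choices(syllables, weights=weights)[0]
        generated.append(syllable)
        # Slide the context window; it always keeps exactly `order` items.
        context = (context + (syllable,))[1:]
    return generated


def syllable_proportions(sequences):
    """Relative frequency of each syllable label across all sequences."""
    totals = Counter(s for seq in sequences for s in seq)
    n = sum(totals.values())
    return {label: c / n for label, c in totals.items()}


def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q) in nats; `eps` floors probabilities missing from Q."""
    return sum(
        pv * math.log(pv / max(q.get(label, 0.0), eps))
        for label, pv in p.items()
        if pv > 0
    )
```

A possible use, with made-up toy labels: call `fit_markov(real_sequences, order=13)`, draw sequences with `sample_sequence(model, 13, 40, random.Random(0))`, and pass `syllable_proportions(real_sequences)` and `syllable_proportions(generated_sequences)` to `kl_divergence`. Lower values indicate that the generated syllable distribution is closer to the real one.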