EDUCATIONAL DATA MINING AND MACHINE LEARNING FOR ANALYSIS AND PREVENTION OF STUDENT DROPOUT IN AN UNDERGRADUATE COURSE
Dropout, Predictive Analysis, Random Forest, Self-Organizing Maps, SHapley Additive exPlanations
Universities face the challenge of transforming a large amount of student data into actionable
insights to enhance academic management and reduce dropout rates in higher education. A
promising approach to identifying the factors that influence academic performance combines
Educational Data Mining (EDM) and Machine Learning (ML). This research aims to develop a method to
uncover key characteristics related to dropout in the Interdisciplinary Bachelor's in Science
and Technology (C&T) program at the Federal University of Rio Grande do Norte (UFRN),
focusing on students enrolled between 2014 and 2023. Through a literature review, suitable
ML algorithms were identified for a hybrid approach, combining Random Forest
(classification) and Self-Organizing Maps (clustering), with SHapley Additive exPlanations
(SHAP) for explainability analysis. The process followed an adapted Knowledge Discovery in
Databases (KDD) workflow with five stages: data collection, preprocessing, feature mapping,
training and testing, and explainability analysis. As a result, a predictive model using
Random Forest was developed, achieving an initial accuracy of 93% in identifying at-risk
students and, subsequently, 91% and 89% on previously unseen data, which indicates consistency
and generalization capability. The
research revealed that dropout is influenced by various factors, including curricular,
socioeconomic, and demographic aspects. Analysis with Self-Organizing Maps produced a
feature map illustrating the relationship between attributes and students' educational status.
Combining this analysis with SHAP provided comprehensive insights into attribute influences on model
predictions, highlighting the importance of variables such as academic performance, age at
enrollment, hometown, and socioeconomic status. Finally, a Minimum Viable Product (MVP)
was developed as a proof of concept to showcase prediction results and the explainability of
findings, with descriptive and predictive analyses of patterns affecting student retention.
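
To make the explainability step concrete, the following is a minimal sketch of the Shapley value computation that SHAP approximates. It is not the implementation used in this research: it computes exact Shapley values by brute force over feature coalitions instead of calling the SHAP library, and the scoring function `toy_risk`, the feature names, and the `baseline` and `student` values are all hypothetical, chosen only to echo the variables named above.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    Brute force over all coalitions, so only practical for a handful of features.
    """
    names = list(instance)
    n = len(names)
    values = {}
    for target in names:
        others = [f for f in names if f != target]
        total = 0.0
        for size in range(n):
            # Shapley weight for a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Coalition features take the instance's values; the rest stay at baseline
                without = {f: (instance[f] if f in subset else baseline[f]) for f in names}
                with_target = dict(without, **{target: instance[target]})
                total += weight * (predict(with_target) - predict(without))
        values[target] = total
    return values


def toy_risk(x):
    """Hypothetical dropout-risk score standing in for a trained classifier."""
    score = 0.5 - 0.04 * x["grade_avg"] + 0.01 * (x["age_at_enrollment"] - 18)
    # Interaction term: low income only adds risk when grades are also low
    if x["low_income"] and x["grade_avg"] < 6:
        score += 0.2
    return score


# Hypothetical reference student and student being explained
baseline = {"grade_avg": 7.0, "age_at_enrollment": 18, "low_income": 0}
student = {"grade_avg": 4.0, "age_at_enrollment": 25, "low_income": 1}

phi = shapley_values(toy_risk, student, baseline)
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the student's score and the baseline score, which is the property that lets each prediction be decomposed into per-attribute contributions. For tree ensembles such as Random Forest, the SHAP library offers faster tree-specific algorithms than this enumeration.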