Uncovering the Relationship Between Continuous Integration and Machine Learning Projects
Continuous Integration; Machine Learning; Build Duration; Test Coverage.
Continuous Integration (CI) is a cornerstone of modern software development. Although CI is widely adopted in traditional software projects, applying it to Machine Learning (ML) projects presents distinctive challenges, because it involves not only code testing but also data validation and model evaluation. This thesis therefore investigates the differences, challenges, and strategies of CI adoption in ML through four complementary studies, combining quantitative analyses of large-scale open-source repositories with qualitative insights from practitioner surveys. Study 1, analyzing 93 ML and 92 non-ML GitHub projects, shows that ML projects have longer build durations and lower test coverage. Study 2, surveying 155 practitioners from 47 ML projects, identifies eight main differences in CI adoption, with challenges such as test complexity, infrastructure demands, data handling, and dependency management. Study 3, based on responses from 450 practitioners across a diverse set of open-source projects, establishes a baseline for how CI affects pull request (PR) delivery time, finding that CI streamlines review and quality control but does not necessarily accelerate PR delivery. Study 4, analyzing 27 ML and 31 non-ML projects, reveals that ML projects have significantly longer delivery times and PR lifetimes, receive fewer PRs per release, reject a smaller proportion of them, have higher merge-to-reject ratios, and follow slower release cadences: about one release every eight months, versus one every four to five months in non-ML projects. Overall, while core CI principles remain relevant, ML projects require tailored practices, such as tracking model performance metrics, prioritizing test execution, and improving dependency management. The findings highlight the need for standardized guidelines to address these challenges and strengthen CI workflows in ML.
By integrating quantitative data and practitioner insights, this thesis advances the understanding of CI in ML, paving the way for more effective and robust CI strategies in the ML domain.