Identification of problems and hot topics for developers of Big Data applications on the Apache Spark framework
Big Data, Apache Spark, Probabilistic Topic Models, Latent Dirichlet Allocation (LDA), Stack Overflow, Taxonomy.
This research aims to identify and classify the main difficulties and topics of interest of application developers who process Big Data with the Apache Spark framework. To this end, we use the Latent Dirichlet Allocation (LDA) algorithm to perform probabilistic topic modeling on information extracted from Stack Overflow, since manually inspecting the entire data set is not feasible. Starting from a comprehensive study of related work, we established and applied a methodology and built a Spark application to execute the tasks, using the Spark SQL and MLlib (machine learning) libraries. The results were analyzed by a group of five researchers: two professors holding doctorates, one doctoral student, and two master's students. From a semantic analysis of the labels assigned to each identified topic, we constructed a taxonomy of interests and difficulties.