Mutation Test for Big Data programs
Big Data; Mutation Test; Apache Spark; Taxonomy; Mutation Operators.
The growth in the volume of generated data, its continuous, large-scale production, and its heterogeneity led to the development of the concept of Big Data. Collecting, storing and, especially, processing this volume of data requires substantial computational resources and suitable execution environments. Different parallel and distributed processing systems are used for Big Data processing. Some systems adopt a control flow model, such as Hadoop, which applies the MapReduce model, and others adopt a data flow model, such as Apache Spark. The reliability of large-scale data processing programs is important because their execution requires a large amount of computational resources. It is therefore important to test these programs before running them in production on an expensive distributed computing infrastructure. The testing of Big Data processing programs has gained interest in recent years, but few works address the functional testing of this type of program, and most of them only address the testing of MapReduce programs. This thesis aims to reduce this gap by proposing a mutation testing approach for programs that follow a data flow model. Mutation testing is a testing technique that simulates faults by modifying a program to create faulty versions called mutants. Mutants are generated by mutation operators, which simulate specific faults in the program. Mutants are used in test design and evaluation to obtain a test set capable of identifying the faults they simulate. To apply the mutation testing process to Big Data processing programs, one must know the types of faults that occur in this context so that mutation operators can be designed to simulate them. To this end, we conducted a study to characterize faults and problems that can appear in Spark programs.
This study resulted in two taxonomies. The first taxonomy groups and characterizes non-functional problems that affect the execution performance of Spark programs. The second taxonomy focuses on functional faults that affect the behavior of Spark programs. Based on the functional faults taxonomy, we designed a set of mutation operators for programs that follow a data flow model. These operators simulate faults in the program through changes in its data flow and operations. The mutation operators were formalized using a model that we propose to represent data processing programs based on data flow. To support the application of our mutation operators, we developed the tool TRANSMUT-Spark, which automates the main steps of the mutation testing process for Spark programs. We conducted experiments to evaluate the mutation operators and the tool in terms of cost and effectiveness. The results showed that applying the mutation testing process to Spark programs is feasible and that the operators and the tool contribute to the testing process, supporting the development of more reliable programs.
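As an illustration of the mutation testing process described above, the following is a minimal sketch in plain Python, not Spark code and not the operators or implementation of TRANSMUT-Spark. The pipeline, the two mutants (a negated filter predicate and a removed distinct step), and all function names are hypothetical and chosen only to show how a change in a program's data flow or operations produces a mutant, and how a test set is evaluated by the fraction of mutants it kills.

```python
from typing import Callable, List, Sequence, Tuple

Program = Callable[[Sequence[int]], List[int]]
TestCase = Tuple[Sequence[int], List[int]]  # (input dataset, expected output)


def original(data: Sequence[int]) -> List[int]:
    """Hypothetical data flow: filter -> map -> distinct -> sort."""
    positives = [x for x in data if x > 0]
    doubled = [x * 2 for x in positives]
    return sorted(set(doubled))


def mutant_negated_filter(data: Sequence[int]) -> List[int]:
    """Mutant that changes an operation: the filter predicate is negated."""
    kept = [x for x in data if not (x > 0)]
    doubled = [x * 2 for x in kept]
    return sorted(set(doubled))


def mutant_without_distinct(data: Sequence[int]) -> List[int]:
    """Mutant that changes the data flow: the distinct step is removed."""
    positives = [x for x in data if x > 0]
    doubled = [x * 2 for x in positives]
    return sorted(doubled)


def is_killed(mutant: Program, tests: Sequence[TestCase]) -> bool:
    """A mutant is killed when at least one test observes a wrong output."""
    return any(mutant(inp) != expected for inp, expected in tests)


def mutation_score(mutants: Sequence[Program], tests: Sequence[TestCase]) -> float:
    """Fraction of mutants killed by the test set."""
    killed = sum(1 for m in mutants if is_killed(m, tests))
    return killed / len(mutants)


MUTANTS = [mutant_negated_filter, mutant_without_distinct]

# A test set without duplicate values cannot detect the missing distinct step.
weak_tests = [([1, 2, 3], [2, 4, 6])]

# Adding an input with duplicates (and a negative value) exposes it.
strong_tests = weak_tests + [([1, 1, -2], [2])]
```

With these definitions, `mutation_score(MUTANTS, weak_tests)` is 0.5, because the mutant without the distinct step survives, while `mutation_score(MUTANTS, strong_tests)` is 1.0. The surviving mutant is what points the tester to the missing test case, which is the role mutants play in test design and evaluation.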