Aware Compression of Deep Neural Network Models Based on Pruning Followed by Quantization
Deep Learning, Aware Compression, Quantization, Pruning, Microservices.
Deep learning techniques, particularly deep neural networks (DNNs), have been applied successfully to many problems. However, these algorithms demand significant computational effort due to the large number of parameters and mathematical operations involved, which can be problematic for applications with limited computational resources, low-latency requirements, or low power budgets. This work therefore proposes a novel training strategy for aware compression of DNN models based on pruning, quantization, and pruning followed by quantization, capable of reducing both processing time and memory footprint. The aware compression strategy was applied in two domains. In the first, automatic modulation classification, the model size was reduced by a factor of 13 while keeping accuracy only 1.8% below that of the uncompressed model. In the second, the same technique was applied to an image classification model, validated in microservices environments, removing approximately 80% of the network parameters, reducing memory size by about a factor of 7, and maintaining accuracy very close to that of the uncompressed model.
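To make the pipeline concrete, the sketch below shows pruning followed by quantization in PyTorch. It is a minimal post-training illustration, not the paper's aware (training-time) compression strategy: the toy model, the 80% per-layer pruning ratio (the removal fraction reported for the image classifier), and the use of dynamic INT8 quantization are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical small classifier standing in for the paper's models.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Step 1: magnitude pruning -- zero out ~80% of the smallest-magnitude
# weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# Step 2: quantization -- convert the remaining weights of the Linear
# layers to 8-bit integers, shrinking the serialized model.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])
```

Applying pruning first means quantization operates on the already-sparsified weights, which is the ordering the title refers to.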