Aware Compression of Deep Neural Network Models Based on Pruning Followed by Quantization
Keywords: Deep Learning, Quantization-Aware Training, Pruning-Aware Training, Microservices, Automatic Modulation Classification
Deep learning techniques, particularly deep neural networks (DNNs), have been successfully employed in numerous problems. However, such algorithms demand substantial computational effort due to the large number of parameters and mathematical operations involved, which poses challenges for applications with limited computational resources, low-latency requirements, or tight energy budgets. This work therefore proposes a novel training strategy for the aware compression of DNN models based on pruning, quantization, and pruning followed by quantization, capable of reducing both processing time and memory footprint. The compression strategy was applied in two domains. In the first, automatic modulation classification, model size was reduced 13-fold while accuracy remained only 1.8% below that of the uncompressed model. In the second, image classification in microservices environments, the same strategy yielded a 7.6-fold reduction in model size with accuracy close to that of the uncompressed model. Furthermore, these techniques reduced prediction latency by a factor of 1.7 and significantly shortened the deployment time of microservices containing these models. These results underscore the effectiveness of the proposed approach and indicate its potential in scenarios that demand computational efficiency and resource conservation.
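As a rough illustration of the pruning-followed-by-quantization pipeline described above, the sketch below uses PyTorch's built-in pruning utilities together with eager-mode quantization-aware training. The tiny model, layer sizes, and 50% sparsity target are placeholders for illustration only, not the paper's actual architectures or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
import torch.ao.quantization as tq


class TinyNet(nn.Module):
    """Placeholder classifier; stands in for the paper's DNN models."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # marks where tensors enter int8
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = tq.DeQuantStub()  # marks where tensors return to float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)


model = TinyNet()

# Step 1: magnitude-based (L1) pruning of each linear layer, then make the
# pruning permanent so the zeroed weights carry over into quantization.
# The 50% sparsity level is an assumed example value.
for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.5)
        prune.remove(m, "weight")

# Step 2: quantization-aware fine-tuning. prepare_qat inserts fake-quantize
# ops so the pruned weights can adapt to int8 precision during training.
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)
# ... fine-tune for a few epochs on the target dataset here ...

# Step 3: convert to a real int8 model for deployment.
model.eval()
quantized = tq.convert(model)
```

Making the pruning permanent with `prune.remove` before preparing for QAT keeps the pruning masks out of the quantized graph; the fake-quantization ops inserted by `prepare_qat` then let fine-tuning adapt the surviving weights to int8 precision before `convert` produces the deployable model.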