Autotuning the Performance of Matrix Multiplication and Convolution for Deep Learning on CPU
*Changbo Chen (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences)
Haoyu Chi (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences)
Deep learning (DL) compilers have emerged to close the gap between abundant, fast-growing DL models and the lagging high-performance implementations of these models on diverse hardware devices. In this work, we introduce several strategies and integrate them into a unified autotuning framework, called AutoMCL, that improves the performance of DL compilers by combining human expertise with machine-learned intelligence. Preliminary experiments on different CPU platforms show that, for fully connected neural networks on an Intel CPU, the proposed framework achieves an average $29.07\times$ speedup over TensorFlow and an average $1.55\times$ speedup over the state-of-the-art DL compiler AutoTVM while consuming only $0.47\times$ of its optimization time,
and, for several well-known convolutional neural networks on multiple CPUs, an average $1.36\times$ speedup over TensorFlow and an average $1.09\times$ speedup over AutoTVM with similar compilation time.
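To illustrate the kind of search-based autotuning such frameworks perform, the following is a minimal, self-contained sketch: it times a blocked matrix multiplication under several candidate tile sizes and keeps the fastest one for the current machine. This is only an illustrative toy (the function names, candidate set, and problem size are our own assumptions), not the AutoMCL implementation, which searches a far richer schedule space with learned cost models.

```python
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """Square matrix multiplication computed tile by tile.

    `tile` is the blocking factor; choosing it well improves cache reuse,
    which is exactly the kind of knob an autotuner searches over.
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C

def autotune_tile(n=128, candidates=(16, 32, 64)):
    """Pick the fastest tile size by measuring each candidate on real input.

    A toy stand-in for autotuning: enumerate configurations, benchmark each,
    verify correctness against a reference, and return the best one found.
    """
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    reference = A @ B
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        C = blocked_matmul(A, B, tile)
        elapsed = time.perf_counter() - start
        assert np.allclose(C, reference)  # reject incorrect configurations
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile, best_time
```

Real DL compilers such as AutoTVM replace this exhaustive timing loop with learned cost models and guided search, since the schedule space (tiling, loop ordering, vectorization, unrolling) is far too large to benchmark exhaustively.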