Add a SideNet to your MainNet


  • Adrien Morisot (Cohere)



Deep neural network, Conditional computation, Computer vision, Natural language processing


As the performance and popularity of deep neural networks have increased, so too has their computational cost. There are many effective techniques for reducing a network’s computational footprint (quantisation, pruning, knowledge distillation), but these lead to models whose computational cost is the same regardless of their input. Human reaction times, by contrast, vary with the complexity of the task: easier tasks, e.g. telling dogs apart from boats, are executed much faster than harder ones, e.g. telling apart two similar-looking breeds of dog. Driven by this observation, we develop a method for adaptive network complexity by attaching a small classification layer, which we call SideNet, to a large pretrained network, which we call MainNet. Given an input, the SideNet returns a classification if its confidence level, obtained via softmax, surpasses a user-determined threshold; it passes the input along to the large MainNet for further processing only if its confidence is too low. This allows us to flexibly trade off the network’s performance against its computational cost. Experimental results show that simple single-hidden-layer perceptron SideNets added onto pretrained ResNet and BERT MainNets allow for substantial decreases in compute with minimal drops in performance on image and text classification tasks.
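The confidence-threshold rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function names (`predict_with_sidenet`, `toy_side_net`, `toy_main_net`), the default threshold of 0.9, and the toy models themselves are hypothetical stand-ins. In practice the SideNet would be a small trained classifier and the MainNet a pretrained network such as ResNet or BERT.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def predict_with_sidenet(x, side_net, main_net, threshold=0.9):
    """Classify a single input, calling the large model only when needed.

    side_net and main_net are callables mapping an input to a vector of
    class logits. Returns (predicted_label, used_main_net).
    """
    side_probs = softmax(np.asarray(side_net(x), dtype=float))
    # Confident enough: answer with the cheap model and stop here.
    if side_probs.max() > threshold:
        return int(side_probs.argmax()), False
    # Otherwise fall back to the expensive model.
    main_probs = softmax(np.asarray(main_net(x), dtype=float))
    return int(main_probs.argmax()), True


# Hypothetical toy models, for illustration only: the "input" is already a
# 3-class score vector, and the large model simply sharpens it.
def toy_side_net(x):
    return np.asarray(x, dtype=float)


def toy_main_net(x):
    return 10.0 * np.asarray(x, dtype=float)


if __name__ == "__main__":
    easy = np.array([5.0, 0.0, 0.0])   # SideNet is confident
    hard = np.array([0.1, 0.2, 0.0])   # SideNet is unsure
    print(predict_with_sidenet(easy, toy_side_net, toy_main_net))
    print(predict_with_sidenet(hard, toy_side_net, toy_main_net))
```

Raising `threshold` sends more inputs to the MainNet (more compute, accuracy closer to the MainNet's), while lowering it lets the SideNet answer more often. This is the performance/cost trade-off the abstract refers to.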

