Questionable Practices in Methodological Deep Learning Research


  • Daniel J. Trosten, UiT The Arctic University of Norway



Keywords: Deep Learning, Evaluation, Scientific Method


Evaluation of new methodology in deep learning (DL) research is typically done by reporting point estimates of a few performance metrics, calculated from a single training run. This paper argues that this frequently used evaluation protocol is fundamentally flawed, presenting eight questionable practices that are widely adopted in the evaluation of new DL methods. The questionable practices are derived from violations of statistical principles underlying the scientific method, and from Hansson's definition of pseudoscience. A survey of recent publications from a top-tier DL conference indicates that these practices are widespread in state-of-the-art DL research. Lastly, arguments in favor of the questionable practices, possible reasons for their adoption, and measures that have been taken to remove them, are discussed.


G. Casella and R. L. Berger. Statistical Inference. Thomson Learning, Australia; Pacific Grove, CA, second edition, 2002. ISBN 978-0-534-24312-8.

R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense Human Pose Estimation in the Wild. In CVPR, 2018. doi: 10.1109/CVPR.2018.00762.

S. O. Hansson. Science and Pseudo-Science. In The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2021 edition, 2021.

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016. doi: 10.1109/CVPR.2016.90.

B. Hepburn and H. Andersen. Scientific Method. In The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, summer 2021 edition, 2021.

J. Hooker. Testing Heuristics: We Have It All Wrong. Journal of Heuristics, 1:33--42, 1995. doi: 10.1007/BF02430364.

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely Connected Convolutional Networks. In CVPR, 2017. doi: 10.1109/CVPR.2017.243.

M. Hutson. Artificial intelligence faces reproducibility crisis. Science, 359(6377):725--726, 2018. doi: 10.1126/science.359.6377.725.

P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-To-Image Translation With Conditional Adversarial Networks. In CVPR, 2017. doi: 10.1109/CVPR.2017.632.

A. Kleppe, O.-J. Skrede, S. De Raedt, K. Liestøl, D. J. Kerr, and H. E. Danielsen. Designing deep learning studies in cancer diagnostics. Nature Reviews. Cancer, 21(3):199--211, 2021. doi: 10.1038/s41568-020-00327-9.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436--444, 2015. doi: 10.1038/nature14539.

T. Liao, R. Taori, I. D. Raji, and L. Schmidt. Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning. In Advances in Neural Information Processing Systems, 2021.

S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In CVPR, 2016. doi: 10.1109/CVPR.2016.282.

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In CVPR, 2016. doi: 10.1109/CVPR.2016.91.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In CVPR, 2016. doi: 10.1109/CVPR.2016.308.

D. Wolpert and W. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67--82, 1997. doi: 10.1109/4235.585893.

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated Residual Transformations for Deep Neural Networks. In CVPR, 2017. doi: 10.1109/CVPR.2017.634.

Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual Dense Network for Image Super-Resolution. In CVPR, 2018. doi: 10.1109/CVPR.2018.00262.

Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-Ranking Person Re-Identification With k-Reciprocal Encoding. In CVPR, 2017. doi: 10.1109/CVPR.2017.389.