The effect of dataset confounding on predictions of deep neural networks for medical imaging


  • Beatriz Garcia Santa Cruz, University of Luxembourg
  • Andreas Husch
  • Frank Hertel



bias, confounder, medical imaging, deep learning


The use of Convolutional Neural Networks (CNNs) in medical imaging has often outperformed previous solutions and even specialists, making them a promising technology for Computer-aided Diagnosis (CAD) systems. However, recent works suggest that CNNs may generalise poorly to new data, for instance, data generated in different hospitals. Uncontrolled confounders have been proposed as a common cause. In this paper, we experimentally demonstrate the impact of confounded data in unknown scenarios. We assessed the effect of four confounding configurations: total, strong, light and balanced. We found that the confounding effect is especially prominent in totally confounded scenarios, while its magnitude under light and strong confounding may depend on the robustness of the dataset. Our findings indicate that the confounding effect is independent of the architecture employed. These findings may explain why models report good metrics during the development stage yet fail to translate to real-world settings. We highlight the need for thorough consideration of these commonly overlooked aspects in order to develop safer CNN-based CAD systems.
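The failure mode described above can be illustrated with a toy simulation (a hypothetical sketch for intuition only, not the paper's actual experimental pipeline). Each sample carries a genuine "signal" feature and a "confounder" feature (e.g. the source hospital), and a shortcut learner simply predicts with whichever single feature best matches the training labels, mimicking a CNN that latches onto a spurious cue. When the training set is totally or strongly confounded, the learner picks the confounder and collapses to chance on a balanced test set; under light or balanced confounding it recovers the true signal. All names (`make_dataset`, `train_shortcut_learner`, the 0.8 signal strength) are assumptions of this sketch.

```python
import random

random.seed(0)


def make_dataset(n, confound_p, signal_p=0.8):
    """Each sample: (signal_feature, confounder_feature, label).

    signal_p   - probability the true image feature agrees with the label
    confound_p - probability the confounder (e.g. source hospital) agrees
    """
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        sig = y if random.random() < signal_p else 1 - y
        con = y if random.random() < confound_p else 1 - y
        data.append((sig, con, y))
    return data


def train_shortcut_learner(data):
    """Predict with whichever single feature best fits the training labels
    (index 0 = signal, index 1 = confounder)."""
    accs = [sum(row[f] == row[2] for row in data) / len(data) for f in (0, 1)]
    return 0 if accs[0] >= accs[1] else 1


def evaluate(feature, data):
    """Accuracy of predicting the label directly from one feature."""
    return sum(row[feature] == row[2] for row in data) / len(data)


# Train under the four confounding configurations; test is always balanced.
test = make_dataset(2000, confound_p=0.5)
for name, p in [("total", 1.0), ("strong", 0.95), ("light", 0.6), ("balanced", 0.5)]:
    train = make_dataset(2000, confound_p=p)
    chosen = train_shortcut_learner(train)
    print(f"{name:8s} picked={'confounder' if chosen == 1 else 'signal':10s} "
          f"balanced-test acc={evaluate(chosen, test):.2f}")
```

In this sketch the totally confounded training set makes the confounder a perfect predictor (training accuracy 1.0 versus roughly 0.8 for the true signal), so the shortcut learner selects it and scores near 0.5 on the balanced test set, mirroring the development-versus-deployment gap discussed above.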

