The effect of dataset confounding on predictions of deep neural networks for medical imaging

The use of Convolutional Neural Networks (CNN) in medical imaging has often outperformed previous solutions and even specialists, becoming a promising technology for Computer-aided Diagnosis (CAD) systems. However, recent works suggested that CNN may generalise poorly to new data, for instance, data generated in different hospitals. Uncontrolled confounders have been proposed as a common reason. In this paper, we experimentally demonstrate the impact of confounding data in unknown scenarios. We assessed the effect of four confounding configurations: total, strong, light and balanced. We found the confounding effect is especially prominent in total confounding scenarios, while the effect in light and strong confounding scenarios may depend on the dataset robustness. Our findings indicate that the confounding effect is independent of the architecture employed. These findings might explain why models can report good metrics during the development stage but fail to translate to real-world settings. We highlight the need for thorough consideration of these commonly unattended aspects, to develop safer CNN-based CAD systems.


Introduction
The use of Machine Learning (ML) and Deep Learning (DL) in medicine is very promising for improving patient care. Such solutions are applied to many medical areas like oncology and neurology. One of the most promising use cases is assisting radiologists in the diagnosis process. DL is expected to provide more accurate, faster and more objective (in that it reports quantitative analysis) diagnoses [12]. However, these systems might fail to translate into real-world scenarios, presenting multiple challenges for safe application [19]. It has been reported that ML-based health systems produce systematic errors in patient subgroup classification, consequently generating wrong predictions and flawed risk estimations [21]. Such systematic errors can originate at any stage of the development pipeline, from dataset generation through model development and evaluation to final deployment [21]. In a recent example, a systematic review that analysed prediction models for the diagnosis and prognosis of COVID-19 pneumonia reported that almost all of the published models were poorly documented. Consequently, such models have a high risk of associated bias, their performance was overrated, and they generalised poorly [23]. In the same vein, ref. [4] systematically reviewed the publicly available X-ray imaging datasets employed to build such models. This work suggested that, without well-documented datasets and/or complementary metadata, models may learn induced bias or uncontrolled confounders as strong features during training, which hampers their safe translation into clinical practice. Previous works demonstrate that, in potential confounding scenarios, shortcuts can have a variable effect on model generalisation [2]. Considering the potential harm, further analysis of confounding variables with respect to the model's generalisation capacity is needed.

* Corresponding author: garciasantacruz.beatriz@gmail.com

Bias and confounders in DL
The problems of bias and confounders in DL are becoming more prominent due to the harmful and long-lasting effects they may have in high-stakes disciplines, such as healthcare, education or justice [16], especially for underrepresented groups [13]. Bias can be defined as a systematic error present in the data that may result in wrong predictive estimations. In this context, selection bias and collider bias are particularly relevant: a population subgroup with certain characteristics (e.g. age, gender) is more likely to be selected, having an increased presence in the dataset compared to its presence in the general population. Induced associations between variables may thus affect the sampling likelihood of an individual [5]. Additionally, confounding factors are variables that influence both the predictor and the outcome [20]. The presence of uncontrolled confounders leads to spurious associations, hampering generalisability and transportability [4].
In medical imaging, such difficulties are often augmented by data scarcity, population shifts and prevalence shifts. Common practices to circumvent such issues include mixing datasets from different populations and/or training models on populations that differ from the target population. Nonetheless, these practices can lead models to learn spurious correlations from confounding factors that originate from differences in the dataset generative process (e.g. data acquisition devices, population characteristics) [1]. Considering the impact of such errors, induced systematic bias needs further analysis for applications to medical imaging. In this paper, we study the impact of potential unknown confounders caused by dataset composition. To this end, we focus on pneumonia datasets as a case study.

Pneumonia X-ray manifestations
Pneumonia is an infection mainly caused by bacteria or viruses that manifests as inflammation of the lung alveoli, which fill with fluid or pus, causing painful breathing. It is especially dangerous in children, older adults and immunosuppressed patients, causing over 15% of deaths in children under 5 years old worldwide [15]. The differential diagnosis of pneumonia includes examination of a chest radiograph (CXR) by trained specialists, often accompanied by the corresponding clinical history and laboratory tests. Pneumonia generally manifests in CXR as one or more areas of increased opacity [3]. An example of CXR is presented in

Dataset 2: RSNA Pneumonia Detection Challenge
The RSNA Pneumonia Detection Challenge dataset is a subset of the NIH CXR14 dataset, which comes from the NIH Clinical Center, United States of America [22]. The labelling involved six board-certified human specialists. Labels consisted of a binary classification into positive and negative based on previous findings [14]. The additional annotations for the positive class are not employed in this work.

Experimental design
To study the effect of potential unknown confounders in a CNN classification network for medical imaging, we simulated scenarios with different degrees of confounding. For the sake of simplicity, we considered only two target classes, control and disease (pneumonia), and one confounding factor: age. Age is divided into two groups: children (from dataset 1) and adults (from dataset 2). Note that, although we focus on age as the crucial differential factor between the samples coming from these two datasets, other sources of variation such as scanner, acquisition protocol and image preprocessing may also affect the results but are not addressed in this conceptual study. Table 1 summarises the composition of each confounding dataset combination.

Confounding datasets
• Total confounding: All the disease samples derive from one age group class while all the control samples derive from the other age group.
• Strong confounding: Most samples (95%) from one class (control or disease) derive from one age group (children or adults) and vice versa.
• Light confounding: As in the strong confounding scenario, but with 85%.
• Balanced: The categories are class-balanced.
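The sampling scheme above can be sketched as follows. This is a minimal illustration with a hypothetical `compose_dataset` helper; the sample counts and source structures are placeholders, not the actual experimental splits.

```python
import random

def compose_dataset(children, adults, disease_child_frac, n_per_class, seed=0):
    """Compose a two-class dataset with a controlled degree of confounding.

    `children` and `adults` are dicts with 'disease' and 'control' sample
    lists; `disease_child_frac` is the fraction of disease samples drawn
    from the children dataset. The control class uses the complementary
    fraction, so age group and target label become correlated.
    """
    rng = random.Random(seed)
    n_dis_child = round(n_per_class * disease_child_frac)
    n_ctl_child = n_per_class - n_dis_child  # complementary fraction
    disease = (rng.sample(children["disease"], n_dis_child)
               + rng.sample(adults["disease"], n_per_class - n_dis_child))
    control = (rng.sample(children["control"], n_ctl_child)
               + rng.sample(adults["control"], n_per_class - n_ctl_child))
    return disease, control

# Degrees studied in this paper, expressed as disease_child_frac:
# total = 1.0, strong = 0.95, light = 0.85, balanced = 0.5
```

With `disease_child_frac = 1.0` every disease sample is a child and every control an adult, reproducing the total confounding scenario in which the age group perfectly predicts the label.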

Training and evaluation
Each experiment proceeds as follows: a model is trained on one combination, then evaluated against the validation set (internal test) and the test set (external test). Each model employs 80% of the dataset for training and 20% for validation. Networks were trained starting from the pretrained models available in torchvision for 15 epochs using Leslie Smith's 1cycle policy [18]. Each experiment was conducted 5 times (each time, the validation set was a different random combination).
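The learning-rate schedule can be illustrated with a self-contained sketch of the 1cycle idea using cosine annealing: warm up from a low rate to the peak, then anneal well below the starting rate. Hyperparameter names such as `pct_start`, `div` and `final_div` follow common implementations and are assumptions here, not the exact settings used in the experiments.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.25, div=25.0, final_div=1e4):
    """Learning rate at `step` under a 1cycle schedule with cosine phases:
    warm up from max_lr/div to max_lr over the first pct_start of training,
    then anneal down to max_lr/final_div over the remainder."""
    warm = pct_start * total_steps
    if step < warm:
        t = step / warm
        lo, hi = max_lr / div, max_lr
    else:
        t = (step - warm) / (total_steps - warm)
        lo, hi = max_lr, max_lr / final_div
    # cosine interpolation from lo to hi as t goes 0 -> 1
    return hi + (lo - hi) * (1 + math.cos(math.pi * t)) / 2
```

The single low-high-low cycle allows a brief period of large learning rates, which Smith [18] argues acts as a regulariser while still converging quickly.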
To study the effect of the architecture, the whole study was conducted with three different standard architectures: Resnet50 [6], Densenet121 [7] and Squeezenet1.1 [8]. After the internal and external evaluation, each metric was analysed per age group (children and adults) and per class (disease and control). In the remainder of the manuscript, we use AUC (area under the ROC curve) and disease and control recall at a 50% threshold as representative evaluation metrics.
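These metrics can be computed directly from model outputs. The sketch below is illustrative only, not the evaluation code used in the study; it shows per-class recall and AUC via the rank (Mann-Whitney) formulation.

```python
def recall(y_true, y_pred, positive):
    """Recall for one class: the fraction of its samples correctly retrieved."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return tp / actual if actual else 0.0

def auc(y_true, scores):
    """AUC as the probability that a random positive sample is scored
    above a random negative one (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Reporting recall per class (disease and control) rather than a single pooled accuracy is what exposes the failure mode described later: a totally confounded model can score near-perfect recall on one class and zero on the other.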

Experimental results
This section is structured as follows. First, we present the results of the seven combinations evaluated with the internal dataset, to assess the network accuracy with respect to a dataset with the same confounding scenario; additionally, to explore the generalisation capacity of the network, an external dataset representing a balanced scenario was employed. We then present the results with more granularity to understand the different behaviour based on the dataset (children vs adults) and the class (disease vs control). Finally, we present the variation of AUC and disease and control recall with respect to the balanced dataset. Since no significant differences were found across architectures, we present the detailed results from Resnet50.

In Figure 2, general AUC (on the left), we can observe an overall good performance across confounding combinations, with a general tendency to score higher at strong and total confounding levels on the internal dataset. Nonetheless, the values drop, especially for these combinations, when the model is evaluated using an external dataset. Slight differences in performance can be observed in both combinations of light and strong confounding. These stand out further in the next two charts (middle and right), showcasing recall for the control class and the disease class, respectively.
Such variations are explained by analysing the children and adult classes separately. In Table 2, results from the children and the adult classes are presented. The dark colour indicates the majority class for each imbalanced combination and light for the minority. Here, we can observe that samples from the children datasets generally score higher than the adults', not only when they represent the majority class but also in minority conditions. Hence, it seems that the adults' class is more affected by the confounding degree in light and strong confounding situations.
In the two cases of total confounding, the disease and control cases came from distinct age groups (i.e. datasets). In one combination, the disease cases originate from the children dataset and the control cases from the RSNA dataset. Conversely, the second total confounding combination employed disease cases from the RSNA dataset and control cases from the children dataset. In both cases, the external test evaluation shows that the network fails to predict examples that were not present during training. Therefore, the model was not able to generalise beyond the provided examples. This behaviour is further discussed in Section 4.
Next, to assess the variation between disease and control, and between each level of confounding, the variation with respect to the balanced network was analysed for each class as depicted in Figure 3. Negative values represent a drop in performance with respect to balanced. Such drops are more common for the adults class in light and strong scenarios, but similar in total confounding scenarios for both classes. Positive values represent an increase in performance, but at the cost of a reduced performance with respect to the other category.
Since many ML research papers approach performance optimisation by exploring different architectures rather than datasets, we aim to understand whether the confounding behaviour is affected by the architecture. Figure 4 depicts the AUC for validation (internal test) and test (external test), as well as the difference between the two. We can observe almost identical performance independently of the architecture employed to build the model.

Discussion
This section discusses the confounding effects produced by the seven different combinations of datasets in our experiments.

Confounding effect in internal evaluation.
Models reported acceptable scores when evaluated against the validation set (Fig. 2, dark blue), or another dataset with the same confounding characteristics. This is expected, but does not guarantee similar performance on external datasets.

Confounding effect in external evaluation.
When evaluated against an external balanced dataset, the models' scores drop (Fig. 2, light blue), with accuracy showing a general tendency to decrease as the confounding degree increases. This may explain why a model can perform well during the development phase but fail during the deployment phase.
Dataset behaviour with respect to the confounding degree.
Each dataset has specific characteristics that confer different robustness against confounding conditions. In our preliminary results, the children class adapted better to such changes than the adults class. In cases of total confounding (see Table 2), both classes presented similar effects, but under light and strong confounding conditions the more robust dataset (children) was better generalised by the model, even with fewer samples. More studies are recommended to better understand this phenomenon. We conjecture that the more homogeneous and coherent the dataset, the less likely the confounding effect is to be learned as a strong feature, leading to better generalisation characteristics.
Effect of the confounding degree.
In the case of total confounding, the model fails with a class recall of zero for the unseen class and almost 1 for the seen class, suggesting that the models use age or other dataset-specific characteristics as a learning shortcut (Table 2). This suggests that, with totally confounded datasets (for instance, when each class stems from a different dataset), the model has a strong probability of failing to predict the unseen class. This lack of generalisation can result in dangerous translation issues when deployed in real settings. A prominent case of this was the combination of controls from the Guangzhou Women and Children's dataset and adult disease samples (COVID-19 pneumonia), which has been reported to be the most common combination in peer-reviewed publications [4].
For the strong confounding scenario, all metrics report lower scores with respect to the balanced models, but the negative effect can differ based on the dataset robustness, as discussed before. This affects the disease class more than the control class, which might be explained by the higher variability in this class due to the diversity induced by disease manifestations. The effect on the adults class is close to the total confounding scenario, suggesting that unknown confounders due to uncontrolled class imbalance can lead to dangerous settings for training clinical models.
Under light confounding conditions, the metrics report lower scores with respect to the balanced models but higher than under strong confounding (when comparing against the same age group). In the adults group, the effect on disease recall when the training dataset contains just 15% of that group is similar to the total confounding scenario, suggesting similar conclusions to those for strong confounding.
Architecture impact.
The employed architectures (Figure 4) were not found to have an impact on the confounding effect. These results are in line with other works which emphasise the need for more data work [16] to ensure safe and generalisable models.
Model robustness and transferability are key requirements for safe clinical models. This work presents an experimental assessment of the effect of confounders in CNN-based solutions for medical image analysis. Both the confounding degree and the dataset characteristics seem to influence the effect of potential uncontrolled confounders on models. These results demonstrate the hazardous effect of confounders in high-stakes domains such as healthcare. While many papers focus on a model-centric approach employing benchmark datasets, it is also crucial to consider data-centric aspects in order to develop safe solutions.
Additionally, these results could explain why some models perform well even in confounding scenarios when the test set employed contains comparable confounding characteristics, but fail to translate to different settings, such as a new hospital or sampling strategy, where the model is deployed. Importantly, the external evaluation may itself present confounding characteristics. A better understanding of the dataset generation process (e.g. through better documentation) could help mitigate these issues.
Overall, these problems highlight the necessity of designing an appropriate strategy for model testing and auditing for future clinical use. Metadata seem to have a relevant role in controlling for potential confounders. Metadata can be employed during the design and evaluation phases [17]. For the former, they can be used to ensure balanced class presence in the datasets. For the latter, they can help conduct a disaggregated evaluation to ensure model fairness and performance for each subgroup of interest.
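As an illustration, a disaggregated evaluation with metadata can be as simple as grouping recall by subgroup label. This is a hypothetical sketch, not the evaluation code used in this study; the subgroup labels are assumed to come from the dataset's metadata.

```python
from collections import defaultdict

def disaggregated_recall(y_true, y_pred, groups, positive=1):
    """Per-subgroup recall for the positive class, using metadata labels
    (e.g. 'children' / 'adults') to reveal gaps a pooled metric can hide."""
    by_group = defaultdict(lambda: [0, 0])  # group -> [true positives, positives]
    for t, p, g in zip(y_true, y_pred, groups):
        if t == positive:
            by_group[g][1] += 1
            by_group[g][0] += (p == positive)
    return {g: tp / n for g, (tp, n) in by_group.items()}
```

A model with a strong pooled recall can still score near zero for one subgroup; reporting the metric per group, as in Table 2, makes such shortcut behaviour visible before deployment.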
To confirm and expand these preliminary results, we plan to extend our study to include more confounding factors, imaging modalities and medical disciplines. Further, we aim to gain a better understanding of the phenomenon using some of the existing tools within the frameworks of explainable AI and model robustness.