Multi-input segmentation of damaged brain in acute ischemic stroke patients using slow fusion with skip connection

Time is a fundamental factor in stroke treatment. A fast, automatic approach that segments the ischemic regions helps treatment decisions. In clinical use today, a set of color-coded parametric maps generated from computed tomography perfusion (CTP) images is investigated manually to decide a treatment plan. We propose an automatic method based on a neural network that uses a set of parametric maps to segment the two ischemic regions (core and penumbra) in patients affected by acute ischemic stroke. Our model is based on a convolution-deconvolution bottleneck structure with multi-input and slow fusion. A loss function based on the focal Tversky index addresses the data imbalance issue. The proposed architecture demonstrates effective performance, with results comparable to the ground truth annotated by neuroradiologists. A Dice coefficient of 0.81 for penumbra and 0.52 for core is achieved over the large vessel occlusion test set. The full implementation is available at: https://git.io/JtFGb.


Introduction
A cerebral stroke is the second most common cause of death among adults worldwide [21]. Cerebral strokes fall into two general categories: ischemic and hemorrhagic. Ischemic stroke represents approximately 80% of all strokes [16]. During an ischemic stroke, the ischemic brain tissue is divided into two distinct regions: the ischemic core (infarcted tissue) and the ischemic penumbra, a hypoperfused but still viable tissue region. Fast and correct visualization of the salvageable penumbra and the irreversibly damaged core tissue can help medical doctors plan treatment for acute ischemic stroke (AIS) patients.

* Corresponding Author: luca.tomasetti@uis.no
Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are the two main modalities used to diagnose acute stroke patients [6]. CT is preferred in many centers due to its high sensitivity for detecting hemorrhage, rapid scan times, and widespread availability. Clinical severity is assessed at hospital admission using the National Institutes of Health Stroke Scale (NIHSS), and color-coded parametric maps (2D PMs) are generated from the 4D CT Perfusion (CTP) imaging usually performed immediately after admission. PMs estimate the changes in tissue density during the passage of an injected contrast agent over time. Time-to-peak (TTP), time-to-maximum (TMax), cerebral blood flow (CBF), and cerebral blood volume (CBV) are all examples of PMs, derived from the pixel information of a time-density curve generated from a CTP study [11]. In addition, the maximum intensity projection (MIP) is computed as the maximum Hounsfield unit value over the time sequence of the CTP, providing a 3D volume. Besides diagnosing acute stroke, CT is also necessary for treatment decisions, with CTP being an essential modality for assessing the penumbra and core.
Several research groups have focused their efforts on the study of ischemic strokes, and some methods have been developed for classifying and segmenting the infarct core [2,5,9,13]. These methods rely on a set of PMs as input and ground truth images generated from follow-up images acquired hours after stroke onset. However, the methods mentioned above segment only core regions, whereas knowledge about the penumbra regions would also be beneficial, since it is crucial for treatment decisions during the first stages of an ischemic stroke [15,20]. To the best of our knowledge, Tomasetti et al. was the first research group to segment both core and penumbra using machine learning and deep learning approaches [19,20].
It is highly time-consuming to collect and label medical data, and transfer learning is a popular approach to solve problems related to medical images [3,22]. Over the past years, there have been numerous examples of transfer learning architectures pretrained on the ImageNet dataset and used for various tasks in disparate domains [4,22]. Additionally, early and slow fusion approaches, with or without inflation, have been proven to improve accuracy in video classification [4,8] and medical diagnosis [12,14]. An early fusion approach combines input information at the beginning of the process, allowing a network to improve performance by exploiting cross-correlations between inputs [7]. A slow fusion approach merges input information gradually throughout the model, permitting higher layers to access more global information [8]. An inflation technique has been proposed to reuse pretrained weights from image classification networks in video classification models by expanding the filters from 2D (image-based) to 3D (video-based) [4].
A proper understanding of both ischemic regions is a major requisite for initial treatment decisions; however, it has not been fully explored in previous research. Thus, in this work, we propose a DNN architecture to simultaneously segment both core and penumbra regions in AIS patients. Our structure was inspired by the multiscale model proposed by Wetteland et al. [22]; however, while Wetteland's model used the same image at different magnifications as input, we propose a different approach. Our model uses a multi-input CNN with slow fusion, based on transfer learning from VGG-16 models pre-trained on ImageNet. We want to investigate whether using the PMs in combination with the MIP volume and the NIHSS score as input can produce meaningful segmentations of the ischemic regions, as these are already calculated and in use in clinical settings. The paper contributes with:
• A fully-automatic DNN method (Fig. 1) to segment both core and penumbra, using color-coded PMs acquired shortly after hospital admission when an AIS is suspected.
• A slow fusion multi-input approach, tested by combining the PMs with other images and/or patient information.
• A model trained and tested on a dataset of patients affected by different levels of vessel occlusion, for generalization over the input.
• Manual annotations based on experts' assessment of the CTP with PMs and MIP, used as ground truth.

Dataset
The study was approved by the Regional Ethics Committee (project 2012/1499). The dataset included CTP scans from 152 patients collected at Stavanger University Hospital (SUS) between January 2014 and August 2020. The NIHSS score was available for all patients. Based on the level of vessel occlusion assessed with CT angiography, large vessel occlusion (LVO) was defined as occlusion of a large, proximal artery. Non-LVO was defined as patients with perfusion deficits with occlusion of a smaller, more distal artery or with perfusion deficits without visible artery occlusion. Patients were divided into three groups based on vessel occlusion severity: 77 patients with LVO, 60 Non-LVO patients, and 15 without ischemic stroke (WIS). Two neuroradiologists with 16 and 3.5 years of clinical experience delineated ground truth core and penumbra regions using an in-house software tool. These delineations were based on visual information in the CBF, CBV, TTP, TMax, and MIP images acquired directly after the CTP at hospital admission. PMs and MIP have a 512 × 512 pixel resolution, displaying only the brain region. Each patient study contains between 13 and 27 brain slices per scan. Furthermore, the MRI examination performed within one to three days after the CT examination was studied and used to assist in generating the ground truth.

Method
The baseline model takes as input four PMs (CBF, CBV, TTP, TMax) of each 2D brain slice and the MIP images; for some experiments, we also use the patient's NIHSS score as input. The slow fusion is achieved both by concatenating the feature vectors from the different inputs and by introducing skip connections from the different levels of the convolutional part.
The architecture resembles the U-Net model [17]. Fig. 1 displays a representation of the architecture. The convolutional part of the network is based on four to five distinct VGG-16 networks and is used to extract low-resolution features. A detailed overview of a single VGG-16 architecture is displayed on the right of Fig. 1. At the end of each VGG-16 network, two convolutional layers followed by a dropout layer are inserted to regularize the model. The VGG-16 models were pre-trained on the ImageNet dataset. The deconvolutional section is a series of transpose layers followed by convolutional layers. Each transpose layer receives as input a concatenation of the previous layer and the skip connection from the last convolutional layer of each block in the VGG-16 architectures. This step is performed through a concatenation layer, where the various inputs are concatenated along the channel axis, providing a slow fusion of the low-level features and the features of the VGG-16 models. All convolutional layers use a ReLU activation function, except the last layer, which uses a softmax activation function that generates a probability vector for the three classes involved (i.e., core, penumbra, and healthy brain).
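The concatenation step can be illustrated at the shape level. The sketch below only demonstrates how decoder features and the skip connections from the four VGG-16 encoders are fused along the channel axis; all sizes are hypothetical and do not reflect the exact layer dimensions of the model.

```python
import numpy as np

# Hypothetical feature shapes at one decoder level (H, W, channels);
# the real sizes depend on the VGG-16 stage and the input resolution.
skip_feats = [np.zeros((64, 64, 128)) for _ in range(4)]  # one skip per VGG-16 encoder
decoder_feat = np.zeros((64, 64, 256))                    # output of the transpose layer

# Slow fusion: concatenate the decoder output and all skip connections along the
# channel axis, giving the next convolutional layer access to both low-level
# (decoder) and encoder features at once.
fused = np.concatenate([decoder_feat] + skip_feats, axis=-1)
print(fused.shape)  # (64, 64, 768)
```

In the actual network, a convolutional layer would follow this concatenation to reduce the channel dimension again before the next upsampling step.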
The model's output is a single 2D brain slice image with the same height and width as the input PMs. The 3D volume is generated by concatenating all the 2D brain slice images sequentially. To evaluate the accuracy of the predicted outputs, we use three distinct metrics: the Dice coefficient, the Hausdorff distance, and the difference in volume between the predictions (Vp) and the ground truth images (Vg): ΔV = |Vg − Vp|.
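Two of these metrics can be sketched directly from binary masks (the Hausdorff distance is omitted for brevity; the function names here are ours, not from the paper's implementation):

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-7):
    """Dice similarity between two binary masks (1 = perfect overlap)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

def volume_difference(pred, gt):
    """Absolute volume difference |Vg - Vp|, in voxels."""
    return abs(int(gt.sum()) - int(pred.sum()))
```

Both metrics would be computed per class (core and penumbra) on the stacked 3D volumes.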

Class imbalance & Loss function
The dataset has imbalanced classes: 93.1% of the pixels belong to the healthy brain class, 6.2% to penumbra, and the remaining 0.7% to core. This imbalance is even more pronounced in the Non-LVO group, where 0.2% of the pixels are core, 2.2% penumbra, and the remaining 97.6% healthy brain tissue. To overcome this issue, we build our model to focus on two aspects: the loss function and the Non-LVO group.
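One simple way to quantify this imbalance is inverse-frequency weighting over the reported pixel fractions. The sketch below is only an illustration of the magnitude of the imbalance, not the weighting actually used in the model (which relies on the loss function described next):

```python
# Pixel fractions per class, as reported for the full dataset.
fractions = {"healthy": 0.931, "penumbra": 0.062, "core": 0.007}

# Inverse-frequency weights, normalized so the majority class has weight 1.
weights = {c: fractions["healthy"] / f for c, f in fractions.items()}
# The core class ends up weighted over two orders of magnitude above healthy tissue.
```

The disparity of these weights makes clear why a plain pixel-wise loss would be dominated by the healthy brain class.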
A generalized focal loss based on the Tversky index (TI) [1,18] was adopted. The TI is a generalization of the Dice similarity coefficient. The selected loss function was developed to address the data imbalance problem in medical image segmentation, improving the trade-off between precision and recall when training on small structures; γ, α, and β are the hyper-parameters of the Focal Tversky loss (FTL) [1]. Furthermore, during training, we penalized misclassification of the penumbra and core classes more heavily for patients in the Non-LVO group because of the evident class imbalance in this sub-group. A post-processing step is performed before generating the predicted outcome: a binary mask of the entire brain slice is created from the MIP image to constrain the segmentation to a valid area (the brain tissue). Subsequently, the class with the highest softmax probability is selected for each pixel.
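A minimal per-class sketch of the loss and the post-processing step is given below, assuming the Focal Tversky formulation of Abraham and Khan [1], FTL = (1 − TI)^(1/γ), with α weighting false negatives and β false positives (conventions vary between implementations); the extra per-group penalty for Non-LVO patients is omitted:

```python
import numpy as np

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3, gamma=4/3, eps=1e-7):
    """Focal Tversky loss for one class: (1 - TI)^(1/gamma).

    y_true is a binary mask, y_pred holds softmax probabilities in [0, 1].
    Defaults follow the values reported as satisfactory in this paper
    (alpha = 0.7, beta = 1 - alpha, gamma = 4/3).
    """
    tp = np.sum(y_true * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    fp = np.sum((1.0 - y_true) * y_pred)
    ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - ti) ** (1.0 / gamma)

def postprocess(softmax_probs, brain_mask):
    """Pick the most probable class per pixel, constrained to the brain region.

    Pixels outside the MIP-derived brain mask are forced to class 0 (healthy).
    """
    labels = np.argmax(softmax_probs, axis=-1)
    return np.where(brain_mask, labels, 0)
```

A perfect prediction drives the Tversky index to 1 and the loss to 0; small structures such as the core benefit because α > β penalizes missed lesion pixels more than spurious ones.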

Experiments & Results
In the remainder of the paper, we denote the models as SF_w(input), where w ∈ {Frozen, Unfrozen, Gradual fine-tuning} and the input is a combination of PMs, MIP images (M), and the NIHSS score (N). We use the FTL as the loss function for our network. We perform three different experiments: Exp-1) a hyper-parameter search for the proposed method (Fig. 1); Exp-2) combinations of inputs and freeze/unfreeze variations of VGG-16; Exp-3) a comparison of different input-fusion methods (Fig. 2). The same settings were used for all experiments: Adam [10] was the optimizer, the batch size was set to 2, and each model was trained for up to 1000 epochs. The validation FTL was monitored, and early stopping was invoked if there was no improvement after 25 consecutive epochs.
In Exp-1, a hyper-parameter search is performed by running a large number of hyper-parameter combinations over the same model to find the optimal values for the given task. We ran experiments with distinct values for the FTL hyper-parameters γ ∈ [1,3], α ∈ [0,1], and β ∈ [0,1], as shown in Table 1, where each value represents the average over the patients in the validation set for the different severity levels. In Exp-2, we tested different combinations of inputs; Table 2 lists them with the corresponding results. Starting from the baseline input (PMs), we added MIP images, the NIHSS score, or both. The models for all these experiments were based on multi-input with slow fusion. For each input combination, the VGG-16 models were trained with frozen weights, unfrozen weights, and a gradual fine-tuning approach. The latter setting was developed in three steps:
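The Exp-1 search amounts to a plain grid over the three FTL hyper-parameters; the candidate values below are hypothetical, chosen only to illustrate the stated ranges:

```python
import itertools

# Hypothetical candidate values inside the ranges reported for Exp-1.
gammas = [1.0, 4 / 3, 2.0, 3.0]  # gamma in [1, 3]
alphas = [0.3, 0.5, 0.7]         # alpha in [0, 1]
betas = [0.3, 0.5, 0.7, 1.0]     # beta in [0, 1]

# Every (gamma, alpha, beta) combination is trained on the same model and
# scored as the average Dice over the validation patients per severity level.
grid = list(itertools.product(gammas, alphas, betas))
print(len(grid))  # 48 combinations
```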

Figure 2: Overview of the three models used for comparison in Exp-3: the early fusion model (EF_w(input)); the early fusion with inflation (EFI_w(input)); SF_w(input) is described in Fig. 1.

First, the model was trained with all the VGG-16 weights frozen; second, after the validation loss showed no improvement for 25 consecutive epochs, the bottom half of the weights was unfrozen and training continued; finally, when no improvement in the validation loss was detected again, all weights were unfrozen and training continued under the same monitoring. We selected SF_G(PMs,N) as the proposed model. To assess whether multi-input with slow fusion is suitable for this task, in Exp-3 we compared it with two models adopting the same multi-input idea but different fusion approaches: an early fusion model (EF_G(PMs,N)) and an early fusion model with inflation (EFI_G(PMs,N)). The inflation approach converts 2D layers into 3D layers, adding a temporal dimension; it follows the idea of the I3D network by Carreira et al. [4], who introduced video classification models built by inflating an ImageNet pre-trained image classification architecture. The settings and hyper-parameters of these two new architectures are the same as those of the selected model to maintain a fair comparison. Fig. 2 presents an overview of the model architectures. For EF_G(PMs,N), the four input parametric maps are concatenated along the channel axis to generate a single input volume, which passes through a 2D convolutional layer to reduce the channel dimension before being fed to a single VGG-16 backbone. In contrast, the inputs of EFI_G(PMs,N) are concatenated along the time dimension; the filters and pooling kernels of the VGG-16 architecture are inflated, remodeling square filters into cubic filters. The bottom of Table 2 shows the results of these two models in comparison with the other experiments performed.
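The difference between the two early fusion variants can be shown at the input level; the sketch below assumes hypothetical 3-channel color-coded maps at the 512 × 512 resolution used in this dataset:

```python
import numpy as np

# Four color-coded parametric maps for one brain slice (H, W, channels).
pms = [np.zeros((512, 512, 3)) for _ in range(4)]

# Early fusion (EF): concatenate along the channel axis; a 2D convolutional layer
# then reduces the 12 channels before the single VGG-16 backbone.
early = np.concatenate(pms, axis=-1)

# Early fusion with inflation (EFI): stack along a new time-like axis, feeding a
# backbone whose 2D filters and pooling kernels are inflated to 3D, as in I3D [4].
inflated = np.stack(pms, axis=0)

print(early.shape, inflated.shape)  # (512, 512, 12) (4, 512, 512, 3)
```

In slow fusion, by contrast, each of the four maps keeps its own VGG-16 encoder and the features are merged gradually in the decoder.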
SF_G(PMs,N) shows good overall performance and is chosen as the proposed model. The model is tested on a previously unseen test set, and its performance is compared with manual annotations from two different experts. Table 3 presents the test results and the inter-observer variability between the two expert neuroradiologists. Fig. 3 shows example predictions for six patients from the test set using the proposed model.

Discussion
In this work, we developed a multi-input CNN with early and slow fusion using transfer learning. The proposed network aims to simultaneously segment dead (core) and salvageable (penumbra) tissue in AIS patients.
The proposed method is trained on patients with and without ischemic stroke and with different vessel occlusion severities. This generalization helps the model correctly segment most ischemic regions, regardless of the patient group. After a series of experiments, the multi-input model including the NIHSS score with a slow fusion approach, SF_G(PMs,N), was chosen as the proposed method based on the overall results.
A hyper-parameter search was performed in the first experiment. Table 1 shows the average Dice coefficients used for selecting the optimal hyper-parameters of the FTL function. We observed that γ = 4/3, α = 0.7, and β = 1 − α give satisfactory results for all the classes in the two groups; thus, we applied these parameters in the following experiments. The validation results in Table 2 show that unfreezing the weights of the VGG-16 feature extractors improves the models but overestimates the volume for both classes, and that the gradual fine-tuning approach gives a slight improvement compared to unfreezing all weights from the start. Freezing the parameters reduces training time, since a smaller set of parameters needs to be learned; however, the statistical results are less than satisfactory. This could be because PMs are unlike the ImageNet images, so the pretrained weights are not optimized for this kind of input. Therefore, at the cost of a longer training time, fully unfreezing the weights or using a gradual fine-tuning technique allows the model to adapt to this particular dataset more accurately. From the validation results of the different gradually fine-tuned models, it is clear that multi-input fusion including the NIHSS score or MIP images is better than using only PMs. However, it is not entirely clear whether including both NIHSS and MIP images improves the models compared to including only NIHSS in addition to PMs. Based on the results presented in Table 2, we select SF_G(PMs,N) as the proposed model. One can argue that SF_U(PMs,N) yields similar results, but SF_G(PMs,N) is chosen because of its lower ΔV for the LVO group, its high Dice coefficient over the entire dataset, and its satisfactory results for all the other metrics. Furthermore, Exp-3 favored the proposed slow fusion approach over the two early fusion approaches, EF_G(PMs,N) and EFI_G(PMs,N).
Results for the 33 randomly selected patients constituting the test set, using the selected proposed model (Table 3), show an average Dice coefficient of 0.34 for core and 0.63 for penumbra over the entire test set and, on the LVO set, 0.52 and 0.81, respectively. These values are higher than or comparable to the inter-observer variability in most of the metrics, regardless of the stroke's severity. A separate comparison of the predicted outputs with each neuroradiologist's annotations demonstrates that the proposed architecture achieves high agreement regardless of the annotator. This can be considered valuable throughout the first stages of an AIS.
Predictions for six brain slices from six patients in the test set are shown in Fig. 3. The results display regions comparable to the ground truth images, both for the LVO and Non-LVO groups, showing high Dice coefficients and promising results with the model. However, the third example in Fig. 3 shows false-positive regions in the brain: this is possibly due to artifacts present in the generated color-coded PMs. We noticed similar false positives with all the other architectures and hyper-parameter settings as well. These false positives might be avoided by using the 4D raw CTP datasets instead of the pre-generated PMs.
Several researchers have proposed methods with promising results for segmenting the ischemic core [2,5,9]. They all use PMs derived from CTP studies in their architectures. Nevertheless, their only focus was to segment the ischemic core without considering the penumbra, thus excluding a critical aspect for medical treatment decisions. Furthermore, there was no differentiation in AIS severity level for the patients involved in their research.

Conclusion
The ability to achieve results comparable to two expert neuroradiologists for all the segmented classes (core and penumbra) is valuable during the first stages of an ischemic stroke. The proposed model might be a supportive tool that helps doctors make treatment decisions.