Efficient Self-Supervision using Patch-based Contrastive Learning for Histopathology Image Segmentation

Learning discriminative representations of unlabelled data is a challenging task. Contrastive self-supervised learning provides a framework to learn meaningful representations using learned notions of similarity from simple pretext tasks. In this work, we propose a simple and efficient framework for self-supervised image segmentation using contrastive learning on image patches, without explicit pretext tasks or any further labelled fine-tuning. A fully convolutional neural network (FCNN) is trained in a self-supervised manner to discern features in the input images and obtain confidence maps that capture the network's belief about the objects belonging to the same class. Positive and negative patches are sampled based on the average entropy in the confidence maps for contrastive learning. Convergence is assumed when the information separation between positive patches is small, and between positive and negative pairs is large. We evaluate this method on the task of segmenting nuclei from multiple histopathology datasets, and show performance comparable to relevant self-supervised and supervised methods. The proposed model consists only of a simple FCNN with 10.8k parameters and requires about 5 minutes to converge on the high-resolution microscopy datasets, which is orders of magnitude less than what relevant self-supervised methods need to attain similar performance.


Introduction
Learning task-specific representations without any, or with limited, labelled data continues to be an elusive goal of machine learning. Recent advancements in contrastive learning and self-supervised learning have shown promising results in obtaining discriminative representations of the data that are useful for downstream applications such as image classification [2], object detection [9] and speech recognition [1]. Contrastive self-supervised learning (CSL) has been successfully used as a form of pre-training to reduce the dependence on labelled data for more complex tasks such as image segmentation [10]. Most self-supervised methods rely on pretext tasks for training [6,8]. Designing relevant pretext tasks can be challenging, and even when a useful pretext task is found, it may not easily generalise across datasets [5].
In this work, we present a self-supervised learning framework that contrastively learns an object detection model for segmenting nuclei in histopathology images. The model comprises a fully convolutional neural network (FCNN) that predicts one confidence map per output channel, capturing the confidence of each pixel belonging to a particular object class. The FCNN is contrastively trained using smaller positive and negative patches stochastically sampled from the images. Patches within a training batch are sampled from an entropy-based distribution, where the entropy is computed from the patch-level confidence scores. The intuition behind the entropy-based sampling is to obtain positive patches that contain similar information and negative patches that contain contrasting information, with respect to features that can discriminate between objects of different classes. Through iterative training we improve the information separation between the positive and negative patches, resulting in an object detection model that can be used for segmentation.
We use a simple FCNN with 10.8k tunable parameters which converges in about 5 minutes on a stand-alone GPU workstation. Experimental evaluation on two histopathology datasets [4,3] shows that our efficient, contrastive self-supervised learning method obtains performance comparable to relevant supervised and self-supervised baseline methods, using only a fraction of the compute time and resources. The evolution of the confidence map during training is illustrated in Figure 1.

Method
Segmentation is fundamentally the task of partitioning an image into areas of interest and a background class. Assuming that the areas of interest within the same class are similar in some feature space according to a similarity measure, we present a contrastive learning framework that results in an unsupervised segmentation model. A high-level overview of the proposed framework is illustrated in Figure 2.

Notation
Consider a batch of $M$ images, with the $i$'th image represented as the pair $(U_i, I_i)$, where $U_i \subset \mathbb{N}^d$ is a finite set representing the $d$-dimensional pixel coordinates and $I_i$ is a function $I_i: U_i \to [0,1]^\lambda$, with $\lambda \in \mathbb{N}$ denoting the number of colour channels, mapping pixel coordinates to their respective values.
Next, we introduce notation for the patch locations sampled from image $i$ as the tuple $(R, i)$, where $R \subset U_i$ is the set of pixel coordinates of the image patch. Note that the power set $\mathcal{P}(U_i)$ denotes the set of all possible smaller patches that can be sampled from image $i$. Denoting the set of all such patch tuples across images by $\mathcal{X}$, we consider in this work the subset $S \subset \mathcal{X}$ of regular (square) and spatially connected patches.
The proposed framework uses a confidence network (Section 2.2), $f_\theta$, parameterised by $\theta$, which for a given input image $(U_i, I_i)$ computes the confidence map $C_i$, i.e., $f_\theta: (U_i, I_i) \mapsto C_i$ with $C_i: U_i \to [0,1]^K$. The value $C_i^k(u)$ indicates the belief, or confidence, that the pixel $u \in U_i$ in image $i$ belongs to class $k$, for $k = 1, \dots, K$, with $K \geq 1$.
The contrasting of patches depends on a similarity measure (Section 2.4), $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which compares the similarity of two patches. This similarity computation relies on $(I_i, I_j)$ and $(C_i, C_j)$, where $i$ and $j$ are the images from which the two compared patches originate.

Confidence Network
The aim of the confidence network, $f_\theta$, is to learn to detect objects in images without any pixel-level supervision, such that areas with high confidence of the $k$'th class actually belong to a specific type of object in the images. Conversely, objects detected with high confidence of the $j$'th class, with $j \neq k$, should belong to a different class. The confidence network can be any trainable model which, given an input image, produces a set of confidence maps describing the confidence that each pixel of the image belongs to a corresponding class. This can, for instance, be implemented as a fully convolutional neural network (FCNN) with multi-class support, i.e., with multiple output channels such that each channel corresponds to a different output class. The confidence network is trained to discriminate between objects that could belong to different classes by contrasting patches of images against each other. Obtaining meaningful positive and negative patches for the contrastive learning is the foundational problem in this framework. We next describe a novel approach to mine such contrastive patches for self-supervision of the confidence network.
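As a deliberately tiny illustration of what a confidence network computes, the following sketch implements a one-layer "FCNN" in NumPy: one 3×3 convolution followed by a sigmoid per output channel, mapping an image to $K$ per-pixel confidence maps in $[0,1]$. The class name, layer sizes, and random weights are illustrative assumptions, not the architecture used in this work.

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyConfidenceNet:
    """Minimal FCNN sketch: each of K output channels is a single 3x3
    convolution followed by a sigmoid, yielding per-pixel confidences
    in (0, 1). Purely illustrative of the input/output contract."""
    def __init__(self, num_classes=4, seed=0):
        rng = np.random.default_rng(seed)
        self.kernels = rng.standard_normal((num_classes, 3, 3)) * 0.1

    def __call__(self, image):
        # image: (H, W) grey-scale array; returns (K, H, W) confidence maps
        return np.stack([sigmoid(convolve2d(image, k, mode="same"))
                         for k in self.kernels])

net = TinyConfidenceNet(num_classes=4)
conf = net(np.random.default_rng(1).random((32, 32)))
```

Being fully convolutional, the same weights apply to images of any resolution, which is what allows training on 300×300 crops and inference on 1000×1000 images.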

Entropy-based Patch Sampler
For each output class $k = 1, \dots, K$, patches believed to contain (a part of) an object belonging to a specific class are treated as positive patches, whereas patches believed not to contain (a part of) an object of that class are treated as negative patches. Appropriate sampling of such positive and negative patches is of vital importance, as the optimisation of the confidence network, $f_\theta$, is driven by the contrastive loss computed on those patches. Improved object detection should ideally correlate with the sampled positive and negative patches being more distinct. Effectively, the appropriate sampling of patches becomes a form of pretext/auxiliary task for training the confidence network.
Assume a set of candidate patches, $S \subset \mathcal{X}$. For each patch $(R, i) \in S$ with $|R|$ pixels and for each class $k$, the average patch confidence is computed as
$$\bar{C}^k(R,i) = \frac{1}{|R|} \sum_{u \in R} C_i^k(u).$$
Notice that the higher the average patch confidence, the stronger the belief that the patch contains the type of object belonging to class $k$.
Recall that $C_i$ has range $[0,1]^K$; that is, for each class $k = 1, \dots, K$, a $C_i^k(u)$ value closer to 1 indicates a higher probability of $u$ belonging to the $k$'th class. Likewise, a value closer to 0 indicates that $u$ likely does not belong to the $k$'th class, whereas values in between indicate varying degrees of (un)certainty in either direction. Using this intuition about the confidence maps, we model a Bernoulli distributed random variable, $X(u) \sim \mathrm{Bern}(C_i^k(u))$, with the confidence value as its parameter. We can then define the average patch entropy as
$$B^k(R,i) = \frac{1}{|R|} \sum_{u \in R} H(X(u)),$$
where
$$H(X(u)) = -C_i^k(u)\log C_i^k(u) - \bigl(1 - C_i^k(u)\bigr)\log\bigl(1 - C_i^k(u)\bigr)$$
is the entropy of the Bernoulli distributed random variable $X(u)$. The motivation for this choice is that good choices of positive and negative patches, according to the contrastive loss, should correlate with higher certainty from the confidence network. Therefore, sampling a set $W_k \subset S$ of $n$ patches according to the unnormalised distribution $1 - B^k(R,i)$ for each class $k$ ensures a stochastic correlation between the confidence map and the appropriateness of the patch sampling. Finally, these $W_k$ are partitioned into sets of positive samples, $P_k$, and negative samples, $N_k$, based on the average patch confidence of each patch $(R,i) \in W_k$, such that patches with high confidence of belonging to class $k$ are likely to be selected as positive for class $k$.
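The sampling procedure above can be sketched as follows, assuming a single-class confidence map and non-overlapping candidate patches on a regular grid. The function names are hypothetical; entropy is computed in bits so the sampling weights $1 - B$ lie in $[0,1]$, and a deterministic 0.5 confidence threshold stands in for the confidence-based partitioning into positive and negative sets.

```python
import numpy as np

def bernoulli_entropy(p, eps=1e-12):
    """Entropy (in bits) of Bern(p), element-wise, so values lie in [0, 1]."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def sample_patches(conf_map, patch=8, n=10, rng=None):
    """conf_map: (H, W) confidence map for one class.
    Draws n non-overlapping patch locations with probability proportional
    to 1 - (average patch entropy), then partitions them into positive and
    negative sets by thresholding the average patch confidence at 0.5."""
    rng = rng or np.random.default_rng(0)
    H, W = conf_map.shape
    tops = [(r, c) for r in range(0, H - patch + 1, patch)
                   for c in range(0, W - patch + 1, patch)]
    avg_conf = np.array([conf_map[r:r+patch, c:c+patch].mean()
                         for r, c in tops])
    avg_ent = np.array([bernoulli_entropy(conf_map[r:r+patch, c:c+patch]).mean()
                        for r, c in tops])
    weights = np.clip(1.0 - avg_ent, 1e-9, None)   # low entropy -> high weight
    idx = rng.choice(len(tops), size=n, replace=False, p=weights / weights.sum())
    positives = [tops[i] for i in idx if avg_conf[i] >= 0.5]
    negatives = [tops[i] for i in idx if avg_conf[i] < 0.5]
    return positives, negatives
```

Confident patches (average confidence near 0 or 1) have low entropy and are therefore sampled preferentially, which is the stochastic correlation described above.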

Similarity Measures
Ideally, the similarity measure, $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, should yield higher values when the two input patches are more similar. In this work, for any two patches $(R_1, i), (R_2, j) \in S$, the similarity measure performs the comparison on the images scaled by the corresponding confidence maps, given by the pixel-wise products $C_i^k \cdot I_i$ and $C_j^k \cdot I_j$, restricted to $R_1$ and $R_2$, respectively. We employ two pixel-level similarity measures, mean squared error (MSE) and mean cross-entropy (CE), treat the choice as a model hyperparameter, and report the best performing measure for each dataset. We envision density-based similarity measures to be desirable, but in our limited experiments they performed poorly.
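A minimal sketch of the two similarity measures, computed on patches of the confidence-scaled (attended) image; both are negated so that higher values mean more similar. The helper names and the exact cross-entropy form are illustrative assumptions.

```python
import numpy as np

def attended(image_patch, conf_patch):
    """Pixel-wise product of confidence map and image: the attended patch."""
    return image_patch * conf_patch

def mse_similarity(a, b):
    """Negated mean squared difference: 0 for identical patches,
    increasingly negative for dissimilar ones."""
    return -np.mean((a - b) ** 2)

def ce_similarity(a, b, eps=1e-12):
    """Negated mean cross-entropy between attended patches in [0, 1]."""
    a = np.clip(a, eps, 1.0 - eps)
    b = np.clip(b, eps, 1.0 - eps)
    ce = -(b * np.log(a) + (1.0 - b) * np.log(1.0 - a))
    return -np.mean(ce)
```

Because the comparison runs on $C \cdot I$ rather than $I$ alone, the confidence map participates in the similarity values, which is what lets the contrastive loss reach the network's parameters.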

Backpropagation:
The scaling of the input image with the confidence map also serves to connect the gradients between the sampling step and the confidence network, so that the contrastive loss can be backpropagated to optimise the network.

Contrastive Loss
Within a channel $k$ of the confidence map, we seek to maximise the similarity between positive patches while minimising the similarity between positive and negative patches. This is facilitated by an intra-channel contrastive loss inspired by SimCLR [2], which rewards $f_\theta$ for learning to detect a single feature in the images for each class in the confidence map, when patches are sampled appropriately. However, this lacks any mechanism for enforcing the individual classes to learn distinct features. In a multi-class scenario, each class should ideally detect different features. This is achieved by introducing an inter-channel contrastive loss that penalises similarity between patches taken from different channels. Finally, the combined loss is the sum of the intra- and inter-channel losses.


Data & Experiments

Data
As the proposed CSL framework is trained exclusively on unlabelled images, without exposure to the labels during training, we can use the labels from the training set to validate segmentation performance. That is, we train the weights of the FCNN without using any labels, but employ the labels for model selection.
MoNuSeg: This dataset [4] from MICCAI 2018 contains 37 training images at 1000 × 1000 pixels resolution, obtained at 40× magnification, as well as 14 testing images at the same resolution. Dense annotations of all nuclei in all images are provided.

Experimental Set-up
Baseline Methods: The simplest supervised baseline is to obtain the optimal intensity threshold using the training images: the images are converted to grey-scale and an optimal threshold for binary segmentation is computed. Additionally, a CNN with the same architecture as the confidence network, $f_\theta$, is used as a supervised baseline. Finally, these results are compared to the self-supervised method that uses scale prediction as a pretext task [8], referred to as the Scale Pretext method. All baselines are evaluated on both datasets.
Model Hyperparameters: The default configuration uses a confidence network, $f_\theta$, closely inspired by [8]. Entropy-based stochastic sampling, described in Section 2.3, and $K = 4$ classes are used. A patch size of 50×50 pixels and 10 patches from each image in a batch are chosen based on experiments on the MoNuSeg dataset, as reported in Table 1.
The proposed framework was trained for a maximum of 300 epochs. To limit RAM usage, the input images are cropped to 300×300 pixels, with the crop location selected uniformly at random to add some data variance. No other pre-processing or data augmentation is applied. A batch size of 10 is used for all experiments.
Implementation: The proposed framework is implemented in PyTorch [7], with support for training on GPUs; all training was performed on a system with an NVIDIA GeForce RTX 3060 and an Intel i7 processor with 32 GB memory.
Experiments: To compare the segmentation performance of the proposed CSL framework with the baselines, we perform experiments on each of the datasets described in Section 3.1.
Our CSL framework is initialised with the hyperparameters described in Section 3.2 on each of the datasets, and the weights of the confidence network are randomly initialised. A single training epoch consists of contrasting 10 patches of size 50×50 (see Table 1) sampled from one random crop of each training image. At the end of each epoch, the confidence maps obtained from the confidence network are thresholded at p = 0.5 and the segmentation performance is evaluated on the training labels by computing the Dice score. We limit training to a maximum of 300 epochs, as for both datasets the learning plateaued around 200 epochs.
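The per-epoch validation step reduces to thresholding a confidence map and computing the Dice score against the labels. A minimal sketch, with an assumed small smoothing constant to avoid division by zero on empty masks:

```python
import numpy as np

def dice_score(pred_conf, target, threshold=0.5, eps=1e-8):
    """Dice coefficient between a thresholded confidence map and a
    binary ground-truth mask: 2|A ∩ B| / (|A| + |B|)."""
    pred = pred_conf >= threshold
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```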
After training for 300 epochs, the model from the epoch with the highest validation Dice score is selected for evaluation on the test set. These Dice scores are reported and discussed further in Section 4. Simple post-processing, comprising two iterations of morphological opening and closing operations with radius 3 and 1, respectively, is performed on the thresholded segmentation masks; this primarily helps remove high-frequency noise in the predicted masks.
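The described post-processing might be sketched with SciPy's binary morphology as follows; assigning radius 3 to the opening and radius 1 to the closing, with two iterations of each, is our reading of the text above.

```python
import numpy as np
from scipy import ndimage

def disk(radius):
    """Boolean disk-shaped structuring element of the given radius."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x**2 + y**2 <= radius**2

def postprocess(mask, open_radius=3, close_radius=1, iterations=2):
    """Morphological opening then closing on a thresholded binary mask;
    opening removes small high-frequency noise, closing fills small gaps."""
    mask = ndimage.binary_opening(mask, structure=disk(open_radius),
                                  iterations=iterations)
    mask = ndimage.binary_closing(mask, structure=disk(close_radius),
                                  iterations=iterations)
    return mask
```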
The proposed CSL framework is trained 10 times with different initialisations on each dataset, in order to illustrate and quantify its inherent variance in segmentation performance.

Results
Segmentation performance of our method and the baselines on both datasets is reported in Table 2. The first two rows show the performance of the two supervised methods, and the third row shows the self-supervised Scale Pretext baseline [8]. We ran ten random repeats of our model for each dataset configuration and report the mean, maximum, and 80th percentile Dice scores over these runs. The Dice score for the Scale Pretext method is taken from the paper [8]. In all cases, the supervised CNN performs better than the self-supervised class of methods, which is to be expected, as no further fine-tuning of these methods with labels is performed.
MoNuSeg: Our method outperforms the supervised optimal-threshold method when trained on the MoNuSeg dataset using the MSE similarity. Qualitative results on four randomly selected test set images from the MoNuSeg dataset, for the model trained on WSI data, are shown in Figure 3.
CoNSeP: A similar trend is observed on the CoNSeP dataset, where the mean Dice score of our method with the CE similarity measure is higher than that of the Scale Pretext method. However, the intensity thresholding method performs better than all self-supervised methods.
The number of parameters in our method (10.8k) is several orders of magnitude smaller than in the Scale Pretext method (21M), and our method takes only a fraction of the time to converge (5 minutes versus over 12 hours), as shown in Table 2. This fast convergence helps overcome the strong influence of the confidence maps obtained at initialisation (epoch 0), which can cause convergence to poor solutions. While some runs result in poor performance, dragging the mean down, most runs yield good performance. We capture this with the 80th percentile performance over multiple runs, which is better than the mean performance. Thus, running the model several times (which is still fast) will likely yield well-performing solutions. As standard practice, we therefore recommend training multiple repeats of our model on any dataset and selecting the best performing one for inference.

Discussions & Conclusions
Confidence maps to segmentation masks: The efficient CSL framework presented in this work outputs multiple confidence maps, each corresponding to objects believed to belong to the same class. Going from these confidence maps to segmentation masks by definition requires expert input. In this work, we alleviate this by using the labels for model selection/validation. For other unlabelled datasets, some other form of validation would be required to align the concept of a biomedical image foreground with that of a confidence map from a self-supervised framework.

Self-supervised segmentation performance:
The main baseline we compared our method to is the self-supervised Scale Pretext method [8]. The results of this method are not significantly different from ours, even though [8] uses specific preprocessing (stain normalisation) and elaborate post-processing, both of which are left out in our work. Further, compared to the Scale Pretext method, ours is significantly simpler in terms of model complexity, the regularisation of the objective function, and training time (5 minutes versus 12 hours).

Mechanism:
The aim of the proposed CSL framework is to increase the information separation between positive and negative patches. Figure 4 illustrates the learning mechanism, showing the distributions of average patch confidence and entropy at each of the first 25 epochs of a training process. Conceptually, and in an idealised setting, we would expect the confidence distribution to evolve from a unimodal distribution centred around 0.5 to a bimodal distribution with peaks close to 0 and 1. While the average patch confidence in Figure 4 does not quite reach a bimodal distribution, it arguably evolves towards one. We conjecture that this training evolution illustrates the fundamental mechanism underlying the proposed CSL framework.
Limitations: The main limitation of the proposed framework coincides with its simplicity: there are no constraints encouraging the model to actually segment the desired objects of interest, potentially resulting in non-useful segmentation maps. Increasing $K$, the number of classes, may increase the chance of one of them being useful, though this comes at a considerable computational cost; hence the pragmatic choice of $K = 4$ throughout the paper. To further increase the chance of obtaining a useful model, we suggest, as standard practice, training the model multiple times and subsequently selecting the best performing one; the framework is quite sensitive to initialisation, resulting in considerable variance in performance. Fortunately, the 80th percentile scores in Table 2 indicate that the framework often yields useful models, at least on these datasets, and training multiple times is tenable due to the efficiency of the framework.
Finally, we assume that the similarity measure plays an important role in model performance. We experimented with a few such measures; however, thorough investigation of more refined similarity measures and their influence on multi-class segmentation remains future work.

Conclusions:
We presented a self-supervised framework for segmenting nuclei in histopathology data, which uses patch-based contrastive learning. We introduced a novel technique to mine positive and negative patches for contrasting, based on the average entropy of the confidence maps. This approach encourages the trainable confidence network to discern objects of different classes, increasing information separation. The resulting method, with only 10.8k trainable parameters, takes under 5 minutes to converge while yielding useful segmentation masks. We foresee interesting research directions that can make this work better suited for diverse image data.

Figure 1 :
Figure 1: Overview of the training evolution of the proposed contrastive self-supervised learning model. At each of the five illustrated epochs, the predicted confidence map (from which segmentation masks are derived by thresholding) for a single validation set image is shown, along with the patches sampled as positive (white squares) and negative (red squares).

Figure 2 :
Figure 2: High-level overview of the proposed patch-based contrastive self-supervision method. The input colour image, I, is fed to the fully convolutional neural network, f θ (•), to obtain a pixel-level foreground confidence map, C. The confidence map is used by the patch sampler, which returns locations of positive and negative patches. The patches are then extracted from the attended image, C·I, to compute the inter- and intra-class similarity measures, s(•). A contrastive loss, L(•), is computed based on these similarities and back-propagated through the entire pipeline. The trainable weights, θ, are tuned until the confidence map, C, corresponds to segmentation masks of interest.

Figure 3 :
Figure 3: Four test set images from the MoNuSeg dataset. The rightmost column is colour-coded: blue indicates correct segmentation (true positives), green indicates false negatives, red indicates false positives, and the white background corresponds to true negatives. The predictions are obtained from the confidence map by a 0.5 threshold.

Figure 4 :
Figure 4: Evolution of confidence network inferences during the training process. The two plots illustrate the distributions of average patch confidence and average patch entropy, respectively, stratified over epochs. This visualises how the distributions change during training; the broadening of the average patch confidence distribution corresponds to an increased information separation. Correspondingly, the average patch entropy decreases. We believe this to be an insight into the mechanism behind the proposed framework.

Table 1 :
The median validation Dice score over three runs on MoNuSeg is used to select the number of patches and the patch size (selected values shown in boldface).

Table 2 :
Dice scores on the test sets of the MoNuSeg and CoNSeP datasets for the different methods, with maximum, mean ± sd, and 80th percentile scores for our method: optimal intensity threshold (Int. Thresh.), supervised CNN baseline, self-supervised method using the scale prediction pretext task (Scale Pretext), and the proposed CSL framework. The number of trainable parameters and convergence time per method are also reported.