Robust Deep Interpretable Features for Binary Image Classification

The problem of interpretability for binary image classification is considered through the lens of kernel two-sample tests and generative modeling. A feature extraction framework coined Deep Interpretable Features is developed and used in combination with IntroVAE, a generative model capable of high-resolution image synthesis. Experimental results on a variety of datasets, including COVID-19 chest X-rays, demonstrate the benefits of combining deep generative models with ideas from kernel-based hypothesis testing in moving towards more robust, interpretable deep generative models.


Introduction
While Machine Learning (ML) has enjoyed tremendous growth in both academia and industry, it has also raised concerns regarding its transparency and interpretability [5,23] in high-stakes decision making in areas such as medicine and engineering. In this paper, we consider interpretability for binary image classification and focus on extracting interpretable features that affect the classification boundary. In this context, convolutional neural networks (CNNs) [2] are often used as they exhibit superior performance.
However, due to their complexity, they are often considered black-box models [12,8]. In contrast, a simple model such as logistic classification applied directly to the pixel representation would give an interpretation in terms of pixel-wise effects, but these are not meaningful for the overall classification task. A compromise between CNNs and linear models (see Figure 1) is to consider a lower-dimensional latent representation for each image, where each element of the latent representation corresponds to an extracted feature that is interpretable. Such latent representations can be recovered using deep generative models, in particular when the latent representations are linearly separable by class. To impose such a structure on the latent space, one can add a regularization objective to the loss function of the generative model. In that scenario, we can extract these features using a linear model and visualize them using the generative component of our model. For the remainder of this paper, we consider features to be computable quantities given any input, and our goal is to discover such quantities that are class-discerning.

Figure 1: Logistic classifier applied to latent representations.

Our motivation for building such an interpretable framework is to extend existing post-hoc methods [22,18] in the deep learning setting by providing a more thorough analysis of latent features and by exploring deep generative models as fundamentally interpretable models [3,9]. We consider a supervised approach to finding latent features, as unsupervised disentanglement methods have been shown to be incapable of recovering them [16]. A supervised method has the added benefit of imposing structure on the latent space, potentially improving the robustness of the generative model to adversarial attacks [6].
We pose the following questions to guide us:
1. How can we find linearly separated latent representations?
2. What implications do such latent representations have on performance?
3. How can we interpret a model using linearly separated latent representations?
The rest of the paper is organized as follows. We start by providing desiderata for a generative model suited to our purposes, leading to the choice of Variational AutoEncoders (VAEs) and their variants. We then present our main contribution: the framework termed Deep Interpretable Features (DIF) and its key components. The following sections describe our experimental setup and evaluation criteria and present empirical results and ablation studies, followed by concluding remarks.

Background: Generative Modeling
We present desiderata based on our application needs and provide examples and motivation for feasible and infeasible generative model choices.
Desiderata. Consider a generative model G : z → x and the latent representations z of images x in the context of interpretability:
1. z is a low-dimensional representation of x.
2. Semantically dissimilar images are encoded into semantically dissimilar latent representations z.
3. Semantically dissimilar latent representations are decoded back into semantically dissimilar images through G.
4. All latent representations have a visually meaningful map G(z).
If desideratum 1 does not hold, G(z) fails to generalize pixels into features. If desideratum 2 does not hold, distinguishing images on a feature-wise basis will fail. If desideratum 3 or 4 fails, features found in latent space bear no visual meaning when decoded back. We briefly review the generative modeling literature in the context of our desiderata below.
VAE [13] based models are generally the most feasible, as they have an encoder component that can be trained to directly map images to latent representations.
In particular, they encourage desideratum 3 by training on the reconstruction objective $L_{\text{recon}}(x, G(z)) = \|G(z) - x\|^2$. The main shortcoming of VAEs is the quality of the synthesized images, which many different variants address [10,19].
Generative Adversarial Network (GAN) [7] based models encourage desideratum 4 through their adversarial training procedure. However, they are computationally infeasible for our purposes: since GANs lack an encoder, one must reverse-optimize [15] the generator to obtain the latent representation of each image.
Normalizing flows [21] are not feasible since their bijectivity prevents low-dimensional latent representations.
We conclude that VAE-based models are the most suitable choice; the remaining problem lies in finding a variant capable of achieving our aims.

Main Result: Deep Interpretable Features (DIF)
We propose our framework Deep Interpretable Features (DIF), which at its core gives any generative model the two following properties: (1) it encourages the generative model desiderata through a training objective, and (2) it finds interpretable features. As the majority of the technical contribution lies in property 1, we begin by reviewing the necessary literature on kernel two-sample testing.

The Mean Embedding test
Non-parametric two-sample testing considers whether two samples $X = \{x_i\}_{i=1}^{n_1} \subset \mathbb{R}^d$ and $Y = \{y_j\}_{j=1}^{n_2} \subset \mathbb{R}^d$, drawn independently and identically from distributions $P$ and $Q$ respectively, come from the same distribution. The null hypothesis $H_0 : P = Q$ is tested vs. the general alternative $H_1 : P \neq Q$. We focus on the Mean Embedding test (ME test) of [11], which introduces a Hotelling's $T^2$-type statistic computed using a kernel mean embedding. The statistic depends on a positive definite kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and a set of test locations (prototypes) $V = \{v_l\}_{l=1}^{J} \subset \mathbb{R}^d$: mapping each sample pair to the feature vector $h_i = \big(k(x_i, v_1) - k(y_i, v_1), \ldots, k(x_i, v_J) - k(y_i, v_J)\big)^\top \in \mathbb{R}^J$, with sample mean $\bar{h}_n$ and sample covariance $S_n$, the statistic reads $\hat{\lambda}_n = n \, \bar{h}_n^\top S_n^{-1} \bar{h}_n$. The ME test is executed by calculating $\hat{\lambda}_n$ and rejects $H_0$ if $\hat{\lambda}_n > T_\alpha$, where the test threshold $T_\alpha$ is given by the $(1-\alpha)$-quantile of the asymptotic null distribution $\chi^2(J)$, with $\alpha$ denoting the significance level of the test. When computing $\hat{\lambda}_n$ we modify the statistic with a regularization parameter $\gamma_n > 0$ for numerical stability, giving $\hat{\lambda}_n = n \, \bar{h}_n^\top (S_n + \gamma_n I)^{-1} \bar{h}_n$. We use a slightly modified version of the ME-test statistic proposed in [11], since our application has $n_1 \neq n_2$. For different sample sizes we use a pooled-covariance version of the ME-test statistic,
$$\hat{\lambda}_{n_1,n_2} = \frac{n_1 n_2}{n_1 + n_2} \, \big(\bar{h}^X - \bar{h}^Y\big)^\top \big(S_{\text{pool}} + \gamma I\big)^{-1} \big(\bar{h}^X - \bar{h}^Y\big),$$
where $\bar{h}^X = \frac{1}{n_1}\sum_{i=1}^{n_1} \big(k(x_i, v_l)\big)_{l=1}^{J}$ and $\bar{h}^Y$ is defined analogously, and $S_{\text{pool}} = \frac{(n_1-1)S_X + (n_2-1)S_Y}{n_1+n_2-2}$ is the pooled sample covariance of the two kernel feature samples. We are now ready to present our first technique for linearly separating the latent space.
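To make the statistic concrete, the following is a minimal PyTorch sketch of the regularized, pooled-covariance ME statistic with the linear kernel $k(x, y) = \nu x^\top y$ used later in the paper. The function name and default hyperparameters are our own choices, not taken from the original implementation.

```python
import torch

def me_statistic(zx, zy, V, nu=1.0, gamma=1e-4):
    """Pooled, regularized ME-test statistic with a linear kernel.

    zx: (n1, d) latent representations of class X
    zy: (n2, d) latent representations of class Y
    V:  (J, d)  test locations (prototypes), e.g. subsampled from the batches
    """
    n1, n2 = zx.shape[0], zy.shape[0]
    # Kernel features: k(z, v) = nu * <z, v> for each prototype v.
    hx = nu * zx @ V.T                         # (n1, J)
    hy = nu * zy @ V.T                         # (n2, J)
    diff = hx.mean(0) - hy.mean(0)             # difference of mean embeddings, (J,)
    # Pooled covariance of the kernel features, regularized for stability.
    S = ((n1 - 1) * torch.cov(hx.T) + (n2 - 1) * torch.cov(hy.T)) / (n1 + n2 - 2)
    S = S + gamma * torch.eye(S.shape[0])
    return (n1 * n2) / (n1 + n2) * diff @ torch.linalg.solve(S, diff)
```

Since every operation is differentiable, this statistic can be maximized by gradient ascent with respect to whatever produced zx and zy, which is exactly how it is used as a training objective below.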

Generative modeling objective: Mean Embedding objective

In principle, the VAE encourages all desiderata except desideratum 2, which establishes the need for a correction. To encourage desideratum 2, we introduce the ME-objective for generative models. We denote a mapping parametrized by $\theta$ as $T_\theta(\cdot)$, the labeled objects we wish to separate as $\zeta^X_i, \zeta^Y_j$, and batches of their latent representations as $b^X = \{z^X_i = T_\theta(\zeta^X_i)\}$ and $b^Y = \{z^Y_j = T_\theta(\zeta^Y_j)\}$. The overall idea is to use $T_\theta$ to map $\zeta^X$ and $\zeta^Y$ away from each other such that $z^X$ and $z^Y$ are separated in latent space, as shown in Figure 2.
Intuitively, we calculate the ME-test statistic on batches of latent representations with randomly subsampled prototypes for each label, and train $T_\theta$ to "pull" the latent representations apart by maximizing the ME-test statistic as an objective. By maximizing $\Lambda(b^X, b^Y, V)$ with respect to $b^X, b^Y$ we maximize the statistical difference between $z^X$ and $z^Y$, resulting in $T_\theta$ mapping $\zeta^X$ and $\zeta^Y$ apart. $T_\theta$ is updated by first calculating $\Lambda(b^X, b^Y, V)$ using prototypes $v^X, v^Y$ subsampled from $b^X, b^Y$, and then taking the gradient of $\Lambda(b^X, b^Y, V)$ with respect to $\theta$, as detailed in Figure 2. We train $T_\theta$ on all objectives related to the generative model, denoted $L_G$, to ensure that $z^X$ and $z^Y$ are separated under the condition that they still retain a representation meaningful to the generator $G$. The loss for $T_\theta$ is then
$$L_{T_\theta}(X) = L_G(X) + \eta_\Lambda \, \Lambda(b^X, b^Y, V), \qquad (3)$$
where $X$ is any input required by the generative model objective and $\eta_\Lambda < 0$ is a hyperparameter controlling the effect of the ME-objective. In a VAE context, $T_\theta$ can be the encoder or a mapping between the encoder and decoder.
To prevent $\Lambda(b^X, b^Y, V)$ from growing unboundedly, the domain of $z$ is bounded to $[-C, C]^d$, $C \in \mathbb{R}$, by applying a $C \tanh(\cdot / C)$ transform to $T_\theta(\zeta)$.
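As an illustration, here is a minimal sketch of the bounding transform and the combined loss of eq. (3), reusing me_statistic from the sketch above; BoundedEncoder and dif_loss are our own hypothetical names, not from the paper's code.

```python
import torch
import torch.nn as nn

class BoundedEncoder(nn.Module):
    """Wraps a mapping T_theta so that z lies in [-C, C]^d via C * tanh(. / C)."""
    def __init__(self, t_theta: nn.Module, C: float = 2.0):
        super().__init__()
        self.t_theta, self.C = t_theta, C

    def forward(self, x):
        return self.C * torch.tanh(self.t_theta(x) / self.C)

def dif_loss(l_g, z_x, z_y, V, eta=-1.0):
    """Eq. (3): generative-model loss plus the (weighted) ME-objective.

    eta < 0, so minimizing the total loss maximizes the ME statistic,
    pulling the latent representations of the two classes apart.
    """
    return l_g + eta * me_statistic(z_x, z_y, V)
```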

Choice of generative model: IntroVAE
We apply the ME-objective to the VAE variant IntroVAE [10], a VAE with an additional modified GAN objective. IntroVAE has been demonstrated to offer high-quality image synthesis on larger images coupled with a minimalistic VAE architecture. It was favored over [19,20], which either require more complicated setups or yield convoluted latent representations. IntroVAE extends the training procedure of VAEs by training the encoder and generator adversarially on the KL-divergence between the approximate posterior and the prior, $D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$: hinge terms of the form $[m - u]_+ = \max(0, m - u)$, with margin $m$, turn this KL term into a saddle-point objective for the encoder $q_\phi$ and decoder $p_\theta$. DIF is then applied by adding the ME-objective to the encoder loss:
$$L_E^{\text{DIF}} = L_E + \eta_\Lambda \, \Lambda(b^X, b^Y, V). \qquad (5)$$
Note that we take $T_\theta$ to be the encoder, and consequently the $\zeta$'s are images. When training DIF, we use both losses specified in eq. (3) and eq. (5).
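The exact IntroVAE loss terms are given in [10]; the snippet below is only a schematic sketch of how the saddle-point hinge terms and the DIF correction of eq. (5) might combine, assuming the KL and reconstruction terms are computed elsewhere and reusing me_statistic from above. Function and argument names are ours.

```python
import torch

def introvae_dif_losses(kl_real, kl_rec, kl_samp, recon,
                        z_x, z_y, V, alpha=0.25, beta=1.0, m=10.0, eta=-1.0):
    """Schematic IntroVAE losses [10] with the DIF ME-objective of eq. (5).

    kl_real:        KL(q(z|x) || p(z)) for real images
    kl_rec/kl_samp: KL terms for re-encoded reconstructions / prior samples
    recon:          reconstruction loss L_AE
    """
    hinge = lambda v: torch.relu(m - v)   # [m - v]_+ = max(0, m - v)
    # Encoder: low KL on real images, KL pushed above margin m on fakes,
    # plus the ME-objective (eta < 0) separating the two classes.
    loss_enc = (kl_real
                + alpha * (hinge(kl_rec) + hinge(kl_samp))
                + beta * recon
                + eta * me_statistic(z_x, z_y, V))
    # Generator: drive the fake KL terms down (fooling the encoder) + recon.
    loss_gen = alpha * (kl_rec + kl_samp) + beta * recon
    return loss_enc, loss_gen
```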

Isolating Features
To isolate features, we take our binary classifier f(z) to be an $L_1$-regularized logistic regression classifier. We can then exploit the shrinkage properties of $L_1$ regularization to isolate the most important features. We select features by their magnitude $|w_i|$, where $w_i$ is the $i$-th element of the linear weights $w = [w_1, \ldots, w_n]$.
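A minimal scikit-learn sketch of this feature-isolation step follows; the latent codes and labels here are synthetic stand-ins, and C is the inverse of the penalty strength $\lambda_{L_1}$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
z_train = rng.normal(size=(200, 16))        # stand-in latent codes T_theta(x)
y_train = (z_train[:, 0] > 0).astype(int)   # stand-in binary labels

# L1 penalty shrinks unimportant weights to zero; C = 1 / lambda_{L1}.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / 0.01)
clf.fit(z_train, y_train)

w = clf.coef_.ravel()
top3 = np.argsort(np.abs(w))[::-1][:3]      # the 3 largest features by |w_i|
print("isolated features:", top3, "weights:", w[top3])
```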
We linearly separate the latent representations by choosing a linear kernel $k(x, y) = \nu x^\top y$ with hyperparameter $\nu > 0$ for our ME-objective. With $z$ bounded by the $\tanh(\cdot)$ transform, it follows that $\Lambda(b^X, b^Y, V)$ with a linear kernel is also bounded [11].

Experiments

All datasets are divided into training (90%) and testing (10%) sets. We use the same architecture for all generative models and their respective components.

Evaluation
We evaluate our experiments on test data in the following ways:
1. Generative model performance: We estimate the log-likelihood and calculate the ELBO on test data, together with FID scores [1] for generated samples.
2. Latent traversals: We traverse from an image in P to an image in Q and judge whether the transition occurs smoothly, without artifacts. This evaluation ensures the generative model does not overfit.
3. Predictive performance: We compare our $L_1$-regularized classifiers to a CNN classifier (using the identical architecture as the encoder) fitted directly on images, and measure performance in Area Under the Curve (AUC), as sketched below.
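Continuing the classifier sketch above, the predictive comparison reduces to two AUC scores on the held-out split; z_test, y_test, and cnn_scores are hypothetical names for the held-out latent codes, labels, and the pixel-CNN's predicted probabilities.

```python
from sklearn.metrics import roc_auc_score

# clf: the L1-regularized latent classifier from the earlier sketch.
auc_latent = roc_auc_score(y_test, clf.predict_proba(z_test)[:, 1])
auc_cnn = roc_auc_score(y_test, cnn_scores)
```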

Isolating features

We judge the extracted features qualitatively on how distinct they are and how the latent transitions look. We analyze the 3 largest features $w_i$ by magnitude $|w_i|$ for a fixed $\lambda_{L_1}$. Strong classification performance for a large $\lambda_{L_1}$ indicates validity of the isolated features. We also report the percentage of non-sparse features, calculated as
$$\frac{\#\{i : |w_i| \geq 0.1 \max_i |w_i|\}}{\text{total number of features}}.$$
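For concreteness, this sparsity measure can be computed with a small numpy helper (the function name is ours):

```python
import numpy as np

def pct_non_sparse(w, tau=0.1):
    """Fraction of weights with |w_i| >= tau * max_i |w_i|."""
    w = np.abs(np.asarray(w, dtype=float))
    return float(np.mean(w >= tau * w.max()))
```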

Benchmarks
Our benchmarks include vanilla IntroVAE, as well as a method inspired by [17], detailed below.

A logistic separation objective
As an ablation study to demonstrate the importance of using an ME-based objective, we also consider an objective akin to logistic regression applied to the latent space. Namely, instead of using the ME statistic, we take $\Lambda$ to be a logistic regression objective (i.e., a binary cross-entropy loss)
$$\Lambda(b^X, b^Y) = -\sum_{(z, y^{P/Q})} \Big[ y^{P/Q} \log \sigma(w^\top z) + \big(1 - y^{P/Q}\big) \log\big(1 - \sigma(w^\top z)\big) \Big],$$
where we keep the weight vector fixed at $w = 1_{1 \times d}$ but optimize the encoder parameters $\theta$. Here, $y^{P/Q} \in \{0, 1\}$ is the class indicator for $\zeta$. We want $T_\theta$ to map $\{\zeta^X_i\}_{i=1}^{n_1}$ and $\{\zeta^Y_j\}_{j=1}^{n_2}$ to their respective sides of the constant hyperplane $w$ and thus minimize this objective. In contrast to the ME-objective, this method imposes the desired structure directly through a fixed hyperplane rather than "pulling" latent representations apart through a kernel.
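A minimal PyTorch sketch of this benchmark objective, assuming a batch of latent codes z and labels y (the function name is ours):

```python
import torch
import torch.nn.functional as F

def logistic_separation(z, y):
    """Ablation objective: BCE against a fixed hyperplane w = 1_{1 x d}.

    z: (n, d) latent codes from T_theta; y: (n,) class indicators in {0, 1}.
    Only the encoder parameters are optimized; w stays constant.
    """
    logits = z.sum(dim=1)  # w = 1  =>  w^T z = sum_i z_i
    return F.binary_cross_entropy_with_logits(logits, y.float())
```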

Results
We present the quantitative comparisons in Table 1. As classification performance tends to drop at $\lambda_{L_1} = 0.1$ for vanilla IntroVAE, we investigate sparsity levels at this value. For DIF, the isolated features correspond to facial hair, the width of digits, and non-organ matter (non-heart grey matter) in the lungs, respectively. These changes occur in an isolated fashion, in contrast to the other methods, where one latent element seemingly affects multiple visual features. In particular, we find that the non-organ matter feature is consistent with medical results reported in [4,14]. We present the test accuracy of an $L_1$-regularized classifier trained on the latent representations for different regularization values $\lambda_{L_1} \in \{0.0, 0.001, 0.01, 0.1\}$ in Table 1. We find that the $L_1$-regularized classifier surprisingly outperforms a CNN on CelebHQ and the COVID-19 dataset for both DIF and the logistic benchmark, with DIF yielding the largest gains in classification performance. Models trained with DIF also retain performance for sparser classifiers. As the percentage of non-sparse features is higher for DIF, this suggests that DIF finds more meaningful and robust features that distinguish the two classes.

Figure 5: UMAP visualization of z for test data. Red points belong to P, blue points to Q, and black points denote prototypes.
Is DIF more robust to adversarial attacks? As DIF imposes a linearly separable structure on the latent space, we hypothesize that this improves resilience towards adversarial attacks, and we report the results in Figure 6. We use the adversarial attacks proposed in [6] to test robustness.

Conclusion
We combined deep generative modeling with a supervised objective based on kernel two-sample testing (the ME-objective) to encourage separation between the classes in the latent space. As a result, individual latent dimensions are more interpretable and capture class-discerning image features. Latent traversals for DIF between images of different classes exhibit smooth transitions, indicating a smooth latent space. A linear classifier can then be applied to the resulting latent representations, in some cases resulting in performance superior to CNNs on test data. Further, imposing a linear structure on the latent space improves robustness against adversarial attacks.