Evaluating the Robustness of Defense Mechanisms based on AutoEncoder Reconstructions against Carlini-Wagner Adversarial Attacks

Adversarial examples represent a serious problem affecting the security of machine learning systems. In this paper we focus on a defense mechanism that reconstructs images with an autoencoder before classification. We experiment with several types of autoencoders and evaluate the impact of strategies such as injecting noise into the input during training and into the latent space at inference time. We test the models against Carlini-Wagner adversarial examples computed for the stacked system, composed of the autoencoder and the classifier, in the white-box scenario. Denoising autoencoders, as well as injecting noise into the dataset before training and into the latent space at test time, are effective strategies for improving the robustness of classifiers.


Introduction
Adversarial examples are a serious threat for machine learning systems. They can be divided into two main categories: white-box attacks [5,12,4,15], in which the attacker has complete access to the model (the topology of the network and its weights), and black-box attacks [16,7,20], in which the attacker only has access to the predictions of the network. Middle-ground settings, in which the model is partially hidden, also appear in the literature. Defense mechanisms take the form either of adversarial training [5,13] or of an input transformation at inference time, such as compression, random cropping, and/or reconstruction [6,8,19,14]. Gu and Rigazio [6] proposed the use of an autoencoder to preprocess the images fed to a classifier, with the aim of cleaning input images of possible adversarial perturbations. More recently, Huang et al. studied this defense using Variational AutoEncoders [9]. Reconstructing adversarial examples generated for a classifier with an autoencoder yields performance close to the original one on clean examples [6,9]. However, while the system is then robust to white-box adversarial examples for the classifier alone, the defense fails against white-box adversarial examples computed for the composite system formed by the stacked autoencoder+classifier [6]. Athalye et al. [1] showed more generally that obfuscating gradients through transformations of the input (of which reconstruction through an autoencoder is an example) does not constitute an effective defense strategy. In order to make claims about the robustness of a preprocessing strategy based on autoencoders, it is necessary to study the worst-case scenario, in which the attacker has full access to both networks. For this reason, in our study we use the Carlini-Wagner (CW) L2 attack [4], considered one of the strongest in the literature [2,3].
In this paper we study the desirable properties of a defense mechanism based on autoencoders. We present a detailed analysis of the robustness of the stacked network when using different types of autoencoders, such as vanilla AutoEncoders and Variational AutoEncoders. We evaluate the impact of denoising and of contractive regularization on the latent space of the autoencoder and show how they affect the robustness of the full system.

Autoencoders
In this section we present the different types of AutoEncoders considered in our study. Variational Autoencoders [10,17] are a particular kind of autoencoder with a loss function derived from variational inference principles. The ELBO, which provides a lower bound on the log-likelihood, is composed of two terms: a reconstruction loss term and a Kullback-Leibler divergence penalty term. The reconstruction term encourages mapping similar images close to each other in the latent space, while the KL penalty term concentrates all hidden representations in regions where the prior has high probability density. The trade-off between these two terms provides a regularization of the space during training.
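As an illustration, the two terms of the (negative) ELBO can be sketched as follows, assuming a Gaussian decoder and a diagonal-Gaussian approximate posterior (function names are our own):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.
    This is the penalty that pulls latent codes toward the prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def negative_elbo(x, x_recon, mu, log_var):
    """Reconstruction term plus KL penalty (up to additive constants)."""
    recon = np.sum((x - x_recon) ** 2)  # Gaussian reconstruction loss
    return recon + kl_to_standard_normal(mu, log_var)
```

When the posterior matches the prior exactly (zero mean, unit variance) the KL term vanishes, which is the equilibrium the regularizer pushes toward.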
Denoising autoencoders [26] are trained to map corrupted images to the corresponding clean ones, unlike regular autoencoders, which are trained to compute the identity map. Denoising is thus a more difficult task, but the reconstructions generated by these models are less blurry and better resemble natural images, which can make them easier to classify.
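A minimal sketch of the corruption step used to build denoising training pairs, assuming additive Gaussian noise with standard deviation `stp` (the stochastic parameter used later in the experiments) and pixel values in [0, 1] (the clipping is our assumption):

```python
import numpy as np

def corrupt(x, stp, rng):
    """Return a noisy copy of x; the denoising autoencoder is trained
    to map corrupt(x) back to the clean x."""
    noisy = x + rng.normal(0.0, stp, size=x.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep a valid pixel range
```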
Contractive Autoencoders [18] penalize sharp changes in the latent representation caused by small changes in the input. The regularization term is the squared L2-norm of the Jacobian of the map x → h, where x is the input and h is the hidden representation. We propose to adapt the contractive penalty to Variational Autoencoders, so that the loss function becomes

L = L_VAE + α ‖∂μ/∂x‖² + β ‖∂σ/∂x‖²,

where α and β are parameters and μ and σ are the mean and standard deviation output by the encoder. For the purposes of this paper, we set β = 0, since sharp changes in the mean are, in our view, more likely to lead to differences in the reconstructions of two similar images. We experiment with vanilla AutoEncoders (AE), Variational AutoEncoders (VAE), as well as with their denoising (DAE, DVAE) and contractive (CAE, CVAE) versions.
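The contractive penalty on the encoder mean can be sketched with a finite-difference Jacobian (in practice one would use automatic differentiation; `encode_mu` is a hypothetical function returning the encoder mean):

```python
import numpy as np

def contractive_penalty(encode_mu, x, eps=1e-4):
    """Squared Frobenius norm of the Jacobian d mu / d x,
    approximated column by column with finite differences."""
    x = np.asarray(x, dtype=float).ravel()
    base = encode_mu(x)
    jac = np.empty((base.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        jac[:, i] = (encode_mu(xp) - base) / eps
    return np.sum(jac ** 2)
```

For a linear encoder μ = Wx the penalty reduces to the sum of the squared entries of W, which is a convenient sanity check.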

The Carlini-Wagner Attack
We generate adversarial examples using the Carlini-Wagner L2 attack (CW) [4], which is considered a strong attack in the literature, having broken many defenses [2,3]. CW is formulated as an optimization problem with a loss function composed of two terms: the first is the L2 norm of the perturbation, encouraging subtle adversarial examples, and the second is an adversarial loss that encourages the resulting image to be as harmful as possible. In the literature on adversarial examples [5,13], the threat model usually considered allows the attacker to modify the input with a perturbation of maximal L∞ norm ε. This is equivalent to being able to modify each pixel of an image by at most ε. We created CW adversarial examples that obey this constraint: at each iteration of gradient descent, we project the current perturbation onto the L∞ box of radius ε centered at zero. For VAEs, we consider the worst-case attack, in which the attacker disables sampling in the latent layer when generating adversarial examples. Enabling sampling would yield noisy gradients, resulting in a weaker attack.
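A sketch of one constrained update under the stated L∞ threat model (variable names are ours; the real attack additionally computes the gradient of the CW loss through the stacked network):

```python
import numpy as np

def projected_step(x, delta, grad, lr, eps):
    """One constrained CW iteration: gradient descent on the loss,
    projection of the perturbation onto the L-infinity box of
    radius eps, and clipping so that x + delta stays a valid image."""
    delta = delta - lr * grad                  # descend the CW loss
    delta = np.clip(delta, -eps, eps)          # L-infinity projection
    delta = np.clip(x + delta, 0.0, 1.0) - x   # keep pixels in [0, 1]
    return delta
```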

Defense Mechanisms
The defense strategy studied in this paper is based on preprocessing the input to a classifier with an autoencoder [6,9]. The idea is that a well-trained autoencoder should learn the relevant features of an image and reconstruct it properly, removing any adversarial perturbation. Moreover, the autoencoder should also be robust enough not to become the target of an adversarial attack itself [21,11].
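The defense itself is simply function composition; a minimal sketch, where the three callables stand in for trained networks:

```python
def defended_predict(encoder, decoder, classifier, x):
    """Stacked defense: reconstruct the input with the autoencoder,
    then classify the reconstruction instead of the raw input."""
    return classifier(decoder(encoder(x)))
```

In the white-box setting studied here, the attacker differentiates through this entire composition, which is precisely what makes the stacked system harder to defend.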

Injection of Noise in the Input
When training VAEs, adding noise to the input is not just a form of data augmentation, but also a recommended practice to guarantee numerical stability and to avoid overfitting [24,25]. Due to the KL regularization in the latent space of a VAE, the approximate posterior is encouraged to be close to the Gaussian prior N(0, 1). If we do not train with noise, most clean samples will be mapped to latent representations having high probability density with respect to the Gaussian prior, while perturbed images will be mapped outside this region. Training with noise is thus necessary to learn an encoder that also maps slightly perturbed images to these high-density regions.

Increasing the Variance in the Latent Layer
The idea of including stochastic layers at inference time in order to protect against adversarial examples has been explored in the literature [28,22]. The encoder of a VAE has a stochastic layer that learns the parameters of a multivariate Gaussian distribution with diagonal covariance matrix, from which the latent representation is sampled. This diagonal covariance matrix can be scaled by a factor that is a hyperparameter of the defense. Increasing the entries of the covariance matrix corresponds to adding Gaussian noise to the latent representation after sampling, perturbing the internal representation of the VAE. By doing this, we hope to find an advantageous trade-off between the quality of the reconstruction and the robustness of the classifier, perturbing the representation enough to escape the adversarial region.
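Scaling the posterior covariance at inference time can be sketched via the reparameterization trick, assuming the encoder outputs a mean and a log-variance (`scale` is the defense hyperparameter):

```python
import numpy as np

def sample_latent(mu, log_var, scale, rng):
    """Sample z ~ N(mu, scale * diag(exp(log_var))).
    scale = 1 recovers the standard VAE; scale > 1 is equivalent to
    adding extra Gaussian noise to the latent representation."""
    std = np.sqrt(scale * np.exp(log_var))
    return mu + std * rng.normal(size=np.shape(mu))
```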

Results
We ran experiments on the MNIST, FashionMNIST [27], and Belgian Traffic Signs for Classification (BTSC) [23] datasets. For each dataset, we trained classifiers with 4 convolutional layers, followed by a fully connected layer and a softmax layer. We trained AEs and VAEs whose encoders and decoders have 2 convolutional layers with 32 channels each, 3x3 filters, latent size 128, and ReLU activations. When training AEs and VAEs, we injected Gaussian noise with standard deviation in the range [0.1, 0.4], also referred to as the stochastic parameter (stp). For each combination of stacked autoencoder+classifier, we created sets of adversarial examples, one for each value of ε in the range [0.05, 0.3] for the MNIST and FashionMNIST datasets and in [0.005, 0.03] for the BTSC dataset. We ran the CW attack over the first 1,000 images in the test sets of MNIST and FashionMNIST and over 1,000 randomly selected images from the test set of BTSC, for 100 iterations, with learning rate 0.1 and parameters γ = 1, k = 0. In Figs. 1-4 we present the accuracy as a function of the maximum L∞ norm ε allowed for the attack. Training with denoising significantly improves the accuracy in some cases (DAE on MNIST and FashionMNIST, and DVAE on MNIST), while in the others it does not seem to negatively influence the results. Autoencoders trained with a high stochastic parameter (stp) make the stacked networks more robust, both for AEs and for VAEs. The contribution of the stochastic parameter seems more pronounced for VAE, DVAE, and DAE, especially on MNIST and FashionMNIST (Figs. 1, 2). Rescaling the covariance of the approximate posterior (Fig. 3) improves the robustness of the stacked network for large perturbations. A trade-off exists, though, since large scaling factors degrade performance on clean examples. A contractive penalty can be slightly beneficial, for example in the case of CVAE on MNIST and FashionMNIST (Fig. 4).
However, in the other cases, it did not bring a significant improvement or it decreased the accuracy of the system. In Fig. 5 we show the impact of adversarial perturbations on the latent space by analyzing two quantities of interest: the cosine between µ (the mean for the clean image) and µ* (the mean for the adversarial image), and the ratio of the norms ‖µ*‖2 / ‖µ‖2. For each dataset, we choose as the value of interest the ε at which performance starts to degrade. The diameter of the set of encoded µ (i.e., the maximum distance between points) is about 3 times larger for AE than for VAE. This diameter grows for DAE while staying roughly constant for DVAE, since the latent representation of a VAE is regularized by the prior. The perturbations in the latent space have similar magnitudes, however, hence the cosine of the angle and the ratio of the norms are more stable for AEs than for VAEs (Fig. 5). DVAE attenuates this by reducing the displacements in the latent space and seems to provide a more stable representation.
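The two diagnostics of Fig. 5 can be computed directly from the encoder means of a clean image and its adversarial counterpart; values near (1, 1) indicate a latent representation that is stable under attack:

```python
import numpy as np

def latent_shift_stats(mu, mu_adv):
    """Cosine between clean and adversarial latent means, and the
    ratio of their L2 norms."""
    cos = mu @ mu_adv / (np.linalg.norm(mu) * np.linalg.norm(mu_adv))
    ratio = np.linalg.norm(mu_adv) / np.linalg.norm(mu)
    return cos, ratio
```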

Conclusions
We studied the efficacy of a defense based on reconstructing images with different types of autoencoders and explored the role of some hyperparameters. Denoising models are more robust than their corresponding non-denoising versions. Increasing the magnitude of the noise used to corrupt images during training leads to more robust stacked networks. The scaling factor in the latent space introduces a trade-off between the accuracy on clean images and the robustness on adversarial examples. The contractive penalty can improve performance in some cases, while in others it is not beneficial. The diameter of the set of the encodings is much larger in AEs than in VAEs (due to the KL regularization term). The relative adversarial perturbations in the latent space are smaller in AEs than in VAEs. While we could not clearly correlate this fact with a neat increase in the accuracies for AEs, we plan to investigate this aspect further, to leverage the stability of the latent space to our advantage. Possible limitations of this study include having used relatively easy-to-learn datasets, instead of more complex ones such as CIFAR-10 or ImageNet, not having tested residual networks as classifiers, and not having used other well-known strong attacks, such as BIM [12] or DeepFool [15]. These are all topics for future study.