Evaluating the Robustness of Defense Mechanisms based on AutoEncoder Reconstructions against Carlini-Wagner Adversarial Attacks


  • Petru Hlihor, Romanian Institute of Science and Technology
  • Riccardo Volpi, Romanian Institute of Science and Technology
  • Luigi Malagò, Romanian Institute of Science and Technology




adversarial examples, variational autoencoders, denoising autoencoders


Adversarial examples represent a serious problem affecting the security of machine learning systems. In this paper we focus on a defense mechanism that reconstructs images with an autoencoder before classification. We experiment with several types of autoencoders and evaluate the impact of strategies such as injecting noise into the input during training and into the latent space at inference time.
We test the models on adversarial examples generated with the Carlini-Wagner attack, in a white-box scenario and on the stacked system composed of the autoencoder and the classifier.
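As a toy illustration (not the authors' code), the reconstruction-based defense can be sketched with a minimal linear denoising autoencoder in NumPy. All names, shapes, and hyperparameters below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy stand-ins: 200 flattened 8-pixel "images" (illustrative data, not MNIST).
rng = np.random.default_rng(0)
X = rng.random((200, 8))

# A minimal linear autoencoder: encoder weights W, decoder weights V.
W = rng.normal(0.0, 0.1, (8, 3))
V = rng.normal(0.0, 0.1, (3, 8))
lr = 0.1

for _ in range(500):
    # Noise injection in the input during training (denoising objective):
    # the model must map a corrupted input back to the clean one.
    noisy = X + rng.normal(0.0, 0.1, X.shape)
    Z = noisy @ W          # encode
    Xhat = Z @ V           # decode
    err = Xhat - X         # error against the *clean* input
    grad_V = Z.T @ err / len(X)
    grad_W = noisy.T @ (err @ V.T) / len(X)
    V -= lr * grad_V
    W -= lr * grad_W

def reconstruct(x, latent_noise=0.0):
    """Purify an input by passing it through the autoencoder, with
    optional noise injection in the latent space at inference time."""
    z = x @ W
    if latent_noise > 0.0:
        z = z + rng.normal(0.0, latent_noise, z.shape)
    return z @ V

def defend_and_classify(x, classifier, latent_noise=0.05):
    # Stacked system: the classifier only ever sees the reconstruction,
    # not the raw (possibly adversarial) input.
    return classifier(reconstruct(x, latent_noise))
```

The key design point is that an attack on the stacked system must push its perturbation through the reconstruction step, which tends to project inputs back toward the data manifold before the classifier sees them.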

