Mitigating Adversarial Examples by Re-Constructing the Input Images


(T) So far, the best-known technique to mitigate adversarial examples is adversarial training. Adversarial examples are images that have been slightly modified to fool a neural network classifier, such as a convolutional neural network (CNN). And in order to have a classifier that will not be fooled by a modified image, the best technique so far is to train it on adversarial images alongside clean ones.
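To make the idea of adversarial training concrete, here is a minimal numpy sketch on a toy logistic-regression "classifier". Everything here is illustrative and hypothetical (the function names, the toy data, the single-feature labeling rule); it is not the setup used in the paper. Each training step perturbs the inputs with a fast-gradient-sign step and then descends on both the clean and the perturbed batch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    # Gradient of the cross-entropy loss w.r.t. the input of a logistic model.
    grad_x = (sigmoid(x @ w + b) - y) * w
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

def adversarial_train(X, y, eps=0.1, lr=0.5, steps=200):
    """Gradient descent on clean AND adversarially perturbed copies of the data."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        X_adv = np.array([fgsm_perturb(xi, yi, w, b, eps)
                          for xi, yi in zip(X, y)])
        for X_batch in (X, X_adv):       # one clean pass, one adversarial pass
            p = sigmoid(X_batch @ w + b)
            w -= lr * (X_batch.T @ (p - y)) / len(y)
            b -= lr * np.mean(p - y)
    return w, b

# Toy data: the label depends only on whether the first pixel exceeds 0.5.
X = rng.uniform(size=(40, 4))
y = (X[:, 0] > 0.5).astype(float)
w, b = adversarial_train(X, y)
```

Because the model sees perturbed copies of every example at each step, it is pushed toward decision boundaries with a wider margin around the training data.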

Professor Geoffrey Hinton, with a team of researchers from Google Brain and UC San Diego, is proposing a new approach to mitigating adversarial examples: reconstruct the input image, and use that reconstruction to decide whether the image is an adversarial example. His approach makes use of Capsule Networks, which he pioneered, and whose routing mechanism replaces the pooling layers of a convolutional neural network (CNN).

Following is the abstract of the paper, published recently as arXiv:1907.02957:

“Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. Most of the proposed methods for mitigating adversarial examples have subsequently been defeated by stronger attacks. Motivated by these issues, we take a different approach and propose to instead detect adversarial examples based on class-conditional reconstructions of the input. Our method uses the reconstruction network proposed as part of Capsule Networks (CapsNets), but is general enough to be applied to standard convolutional networks. We find that adversarial or otherwise corrupted images result in much larger reconstruction errors than normal inputs, prompting a simple detection method by thresholding the reconstruction error. Based on these findings, we propose the Reconstructive Attack which seeks both to cause a misclassification and a low reconstruction error. While this attack produces undetected adversarial examples, we find that for CapsNets the resulting perturbations can cause the images to appear visually more like the target class. This suggests that CapsNets utilize features that are more aligned with human perception and address the central issue raised by adversarial examples.”
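The abstract's "simple detection method by thresholding the reconstruction error" can be sketched in a few lines of numpy. This is only a toy illustration of the thresholding step: in the paper the reconstruction comes from a trained class-conditional network, whereas here it is simply passed in, and the 3-pixel inputs and the threshold value are hypothetical:

```python
import numpy as np

def flag_adversarial(x, x_recon, threshold):
    """Return True when the reconstruction error exceeds a threshold.
    In the paper, x_recon is the class-conditional reconstruction produced
    by a trained network; here it is just an argument."""
    return float(np.mean((x - x_recon) ** 2)) > threshold

# Hypothetical 3-pixel example with a hand-picked threshold.
x = np.array([0.2, 0.8, 0.5])
close = np.array([0.21, 0.79, 0.52])   # a faithful reconstruction
far = np.array([0.6, 0.3, 0.1])        # a poor reconstruction

flag_adversarial(x, close, threshold=0.01)   # → False: input looks clean
flag_adversarial(x, far, threshold=0.01)     # → True: input is flagged
```

The intuition matches the abstract: adversarial or corrupted inputs reconstruct poorly, so a single scalar threshold separates them from normal inputs.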

One of the paper's authors, Nicholas Frosst, gave a good summary in a few tweets:




The paper gives a detailed overview of the state of the art in adversarial examples:

“Adversarial examples were first introduced in [Biggio et al., 2013, Szegedy et al., 2014], where a given image was modified by following the gradient of a classifier’s output with respect to the image’s pixels. Importantly, only an extremely small (and thus imperceptible) perturbation was required to cause a misclassification. Goodfellow et al. [2015] then developed the more efficient Fast Gradient Sign Method (FGSM), which can change the label of the input image X with a similarly imperceptible perturbation constructed by taking an ε-sized step in the direction of the gradient. Later, the Basic Iterative Method (BIM) [Kurakin et al., 2017] and Projected Gradient Descent (PGD) [Madry et al., 2018] improved on FGSM to generate stronger attacks by taking multiple steps in the direction of the gradient and clipping the overall change to ε after each step. In addition, Carlini and Wagner [2017b] proposed another iterative optimization-based method to construct strong adversarial examples with small perturbations. An early approach to reducing vulnerability to adversarial examples was proposed by [Goodfellow et al., 2015], where a network was trained on both clean images and adversarially perturbed ones. Since then, there has been a constant “arms race” between better attacks and better defenses; Kurakin et al. [2018] provide an overview of this field.
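A single FGSM step is simple enough to show end to end. The sketch below attacks a toy logistic-regression "classifier" whose input gradient can be written in closed form; the model, weights, and 4-pixel "image" are all hypothetical, chosen only so the effect is visible:

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One FGSM step against a logistic-regression 'classifier'.
    The gradient of the cross-entropy loss w.r.t. the input is
    (p - y) * w, where p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad_x = (p - y) * w
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

# A 4-pixel "image" the model confidently assigns to class 1.
w = np.array([2.0, -1.0, 0.5, 1.5])
b = -0.5
x = np.array([0.9, 0.1, 0.6, 0.8])
x_adv = fgsm(x, y=1, w=w, b=b, eps=0.3)
# Every pixel moves by at most eps, yet the model's confidence in the
# true class drops (here from ~0.94 to ~0.77).
```

BIM and PGD simply iterate this step, re-clipping the accumulated perturbation to the ε-ball after each iteration, which is why they produce stronger attacks than a single FGSM step.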

A recent thread of research focuses on the generation of (and defense against) adversarial examples which are not simply slightly-perturbed versions of clean images. For example, several approaches were proposed which use generative models to create novel images which appear realistic but which result in a misclassification [Samangouei et al., 2018, Ilyas et al., 2017, Meng and Chen, 2017]. These adversarial images are not imperceptibly close to some existing image, but nevertheless resemble members of the data distribution to humans and are strongly misclassified by neural networks. [Sabour et al., 2016] also consider adversarial examples which are not the result of pixel-space perturbations, instead manipulating the hidden representation of a neural network in order to generate an adversarial example.

Another line of work, surveyed by [Carlini and Wagner, 2017a], attempts to circumvent adversarial examples by detecting them with a separately-trained classifier [Gong et al., 2017, Grosse et al., 2017, Metzen et al., 2017] or using statistical properties [Hendrycks and Gimpel, 2017, Li and Li, 2017, Feinman et al., 2017, Grosse et al., 2017]. However, many of these approaches were subsequently shown to be flawed [Carlini and Wagner, 2017a, Athalye et al., 2018]. Most recently, [Schott et al., 2018] investigated the effectiveness of a class-conditional generative model as a defense mechanism for MNIST digits. In comparison, our method does not increase the computational overhead of the classification and tries to detect adversarial examples by attempting to reconstruct them.”

And it leverages the CapsNet architecture detailed by Geoffrey Hinton and his team in arXiv:1710.09829 to train a reconstruction network:

“Capsule Networks (CapsNets) are an alternative architecture for neural networks [Sabour et al., 2017, Hinton et al., 2018]. In this work, we make use of the CapsNet architecture detailed by [Sabour et al., 2017]. Unlike a standard neural network which is made up of layers of scalar-valued units, CapsNets are made up of layers of capsules, which output a vector or matrix. Intuitively, just as one can think of the activation of a unit in a normal neural network as the presence of a feature in the input, the activation of a capsule can be thought of as both the presence of a feature and the pose parameters that represent attributes of that feature. A top-level capsule in a classification network, therefore, outputs both a classification and pose parameters that represent the instance of that class in the input. This high-level representation allows us to train a reconstruction network.”
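The "presence plus pose" reading of a capsule's output comes from the squashing nonlinearity of Sabour et al. (2017): the vector's length is interpreted as the probability that an entity is present, and its direction encodes the pose parameters. A minimal numpy sketch (the raw 2-D capsule output below is hypothetical, chosen for round numbers):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing nonlinearity from Sabour et al. (2017): shrinks a capsule's
    raw output vector so its length lies in (0, 1) while preserving its
    direction. Length = presence probability; direction = pose parameters."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

s = np.array([3.0, 4.0])        # raw capsule output, length 5
v = squash(s)
np.linalg.norm(v)               # ≈ 0.96: the entity is very likely present
v / np.linalg.norm(v)           # direction (the pose) is unchanged: [0.6, 0.8]
```

It is the pose part of the top-level capsule's output that gets handed to the reconstruction network described next.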

And with this reconstruction network, adversarial images can be detected:

“To detect adversarial images, we make use of the reconstruction network proposed in [Sabour et al., 2017], which takes pose parameters v as input and outputs the reconstructed image g(v). The reconstruction network is simply a fully connected neural network with two ReLU hidden layers with 512 and 1024 units respectively, with a sigmoid output with the same dimensionality as the dataset. The reconstruction network is trained to minimize the l2 distance between the input image and the reconstructed image. This same network architecture is used for all the models and datasets we explore. The only difference is what is given to the reconstruction network as input.”
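The quoted architecture is small enough to sketch directly. Below is a numpy forward pass with the two ReLU hidden layers (512 and 1024 units) and sigmoid output described above; the weights are random and untrained, and the 16-dimensional pose vector and 784-pixel (MNIST-like) image size are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(v, params):
    """Forward pass matching the quoted architecture:
    pose vector v -> ReLU(512) -> ReLU(1024) -> sigmoid(image size)."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(v @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)

# Hypothetical sizes: a 16-dim pose vector, a 784-pixel (MNIST-like) image.
pose_dim, img_dim = 16, 784
params = (
    rng.normal(0, 0.05, (pose_dim, 512)), np.zeros(512),
    rng.normal(0, 0.05, (512, 1024)),     np.zeros(1024),
    rng.normal(0, 0.05, (1024, img_dim)), np.zeros(img_dim),
)
v = rng.normal(size=pose_dim)
g_v = reconstruct(v, params)            # reconstructed image, values in (0, 1)

# Training would minimize the l2 distance between input and reconstruction:
x = rng.uniform(size=img_dim)
loss = np.sum((x - g_v) ** 2)
```

At detection time, this same `loss` value for a test input is compared against a threshold: clean inputs reconstruct with low error, adversarial ones with high error.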


Note: The picture above is one of the Japanese Bridge paintings by Claude Monet.

Copyright © 2005-2019 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.