
What Doesn’t Kill You Makes You Robust(er):
How to Adversarially Train against
Data Poisoning

Jonas Geiping (University of Siegen)
Liam Fowl (University of Maryland)
Gowthami Somepalli (University of Maryland)
Micah Goldblum (University of Maryland)
Michael Moeller (University of Siegen)
Tom Goldstein (University of Maryland)

Abstract

Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: they are easily overcome by adaptive attacks, they severely reduce testing performance, or they cannot generalize to diverse data poisoning threat models. Adversarial training, and its variants, are currently considered the only empirically strong defense against (inference-time) adversarial attacks. In this work, we extend the adversarial training framework to defend against (training-time) data poisoning, including targeted and backdoor attacks. Our method desensitizes networks to the effects of such attacks by creating poisons during training and injecting them into training batches. We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses such as DP-SGD or (evasion) adversarial training.

1 Introduction

As machine learning systems consume more and more data, the data curation process is increasingly automated and reliant on data from untrusted sources. Breakthroughs in image classification (Russakovsky et al., 2015) as well as language processing (Brown et al., 2020) are built on large corpora of data scraped from the internet. Automated scraping, in which data is collected directly from online sources, leaves practitioners vulnerable to data poisoning in which bad actors tamper with the data so that models trained on this data perform poorly or contain backdoors embedded in them (Gu et al., 2019; Shafahi et al., 2018). These attacks present security vulnerabilities that persist even if the data is labeled and checked by crowd-sourced human supervision. In essence, entire machine learning pipelines can be compromised if the input data is modified maliciously - even if the modification appears minor and inconspicuous to a human observer. This mounting threat has instilled fear especially in industry practitioners whose business models rely on powerful neural networks trained on massive volumes of scraped data (Kumar et al., 2020).

Figure 1: Data poisoning attacks require a new approach to adversarial training to robustify machine learning models against this threat model.

In response to this growing threat, recent works have proposed a number of defenses against data poisoning attacks (Li et al., 2021b; Goldblum et al., 2020). Existing defense strategies suffer from up to three primary shortcomings:

  1. In exchange for robustness, they trade off test accuracy to a degree that is intolerable to real-world practitioners (Geiping et al., 2021).

  2. They are only robust to specific threat models but not to adaptive attacks specially designed to circumvent the defense (Koh et al., 2018; Tan & Shokri, 2020).

  3. They apply only to a specific threat model and do not offer practitioners a generally applicable framework that could extend to novel attacks (Wang et al., 2019).

We instead propose a variant of adversarial training that harnesses adversarially poisoned data in the place of (test-time) adversarial examples. We show that this strategy exhibits both an improved robustness-accuracy trade-off as well as greater flexibility for defending against a wide range of threats including adaptive attacks.

Adversarial training desensitizes neural networks to test-time adversarial perturbations by augmenting the training data with on-the-fly crafted adversarial examples (Madry et al., 2018). Similarly, we modify training data in order to desensitize neural networks to the types of perturbations caused by data poisoning. Yet adapting this robust training framework to data poisoning requires special consideration of this new threat model. In contrast to contemporary work on training against data poisoning such as Li et al. (2021a), our defense dynamically trains a robust model without the need to identify poisoned data points. It also provides benefits by adapting specifically to targeted data poisoning, instead of using default adversarial training as in Tao et al. (2021) or poisoned data generated offline as in Radiya-Dixit & Tramer (2021). We demonstrate the effectiveness of this framework at defending against a range of data tampering threat models across several training regimes and visualize the impact of the defense in feature space. We further compare the proposed defense to a range of related defense strategies.

2 Related Work

Data poisoning is a class of threat scenarios focused on malicious modifications to the training data of a machine learning model. See Goldblum et al. (2020) for an overview of dataset security. Data poisoning attacks can either focus on denial-of-service attacks on model availability that reduce the overall model performance or on backdoor attacks that introduce malicious behavior into an otherwise inconspicuous model which is triggered by a specific visual pattern or target image, thus breaking model integrity (Barreno et al., 2010).

In this work, we focus on attacks against model integrity. In comparison to denial-of-service attacks, which can be noticed before deployment, integrity attacks can insert undetectable backdoors even into models that later pass into production and are used and relied upon in real-world scenarios. These attacks can be further distinguished by the nature of their trigger mechanism. In backdoor trigger attacks (Gu et al., 2019; Turner et al., 2018), the attack is triggered by a specific backdoor pattern or patch that can be added to target images at test time, whereas targeted data poisoning (Shafahi et al., 2018; Zhu et al., 2019) is triggered by a predefined target image. In contrast to targeted poisoning, backdoor trigger attacks can be applied to multiple target images but require target modifications to be active during inference, while targeted attacks are activated by specific, but unmodified targets.

2.1 Data Poisoning Attacks:

Attacks can be further categorized by the precise training setup they anticipate their victims to employ. Some attacks assume that the victim will only fine-tune their model on the poisoned data or will train a linear classifier on top of a pre-trained feature extractor (Saha et al., 2020; Zhu et al., 2019; Shafahi et al., 2018). These methods are effective against practitioners engaging in transfer learning with pre-trained models like those found in popular repositories such as Paszke et al. (2017). Other attacks work even when the victim trains their model from scratch on the poisoned data (Huang et al., 2020; Geiping et al., 2021). Generally, the simpler the victim’s training procedure, the more readily the attacker can anticipate the effect of their perturbations and thus, the easier the attacker’s job. Here, we briefly detail prominent attacks against which we test our defense:

Feature Collision: Shafahi et al. (2018) present a targeted poisoning attack that perturbs training data so that its deep features collide with those corresponding to a target image. At the same time, the attack penalizes the size of perturbations applied to training data in order to maintain visual similarity between the original and perturbed images.
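The two competing terms of this attack can be written as a single scalar objective. The following is a minimal sketch under stated assumptions: `features` is a hypothetical stand-in for a network's deep feature extractor, and `beta` is an assumed name for the weight that trades off feature collision against visual similarity.

```python
def features(x):
    """Hypothetical stand-in for a deep feature extractor (not a real network)."""
    return [x[0] + x[1], x[0] * x[1]]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def feature_collision_objective(poison, base, target, beta=0.1):
    """Feature-space distance to the target plus a penalty on the
    input-space distance to the original (base) image."""
    collision = squared_distance(features(poison), features(target))
    visual_penalty = squared_distance(poison, base)
    return collision + beta * visual_penalty


base = [0.2, 0.4]
target = [0.9, 0.1]
# A poison identical to the base image pays no visual penalty but does not collide.
unperturbed = feature_collision_objective(base, base, target)
# A poison identical to the target collides perfectly but pays the visual penalty.
copied = feature_collision_objective(target, base, target)
```

An attacker would minimize this objective over `poison`, so the result sits between the two extremes evaluated here.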

Bullseye Polytope: Rather than colliding images in feature space, Aghakhani et al. (2020) surround target images in feature space to increase the success of feature collision. This method improves on the work of Zhu et al. (2019), which constructs a convex polytope around the target based on an ensemble of models.

Bilevel Optimization: MetaPoison (Huang et al., 2020) generates poisoned data based on unrolling the bilevel objective encountered in targeted data poisoning for several steps and optimizing the unrolled objective over an ensemble of models at different stages in training, leading to an attack that is robust to new initializations in from-scratch training as well as to changes in model architectures.

Gradient Matching: Witches’ Brew (Geiping et al., 2021) instead approximates the bilevel objective by gradient matching, leading to a computationally efficient attack (in comparison to MetaPoison (Huang et al., 2020)), combining the efficiency of feature collision attacks with the success of bilevel approximations. This attack was already adapted to be effective against data augmentation and differential privacy, showing the need for strong defenses against adaptive attacks.

Hidden Trigger Backdoor Attacks: Saha et al. (2020) present a backdoor trigger attack wherein the attacker modifies training data within an $\ell^{\infty}$ bound to cause a vulnerability for a pre-selected $\ell_{0}$ patch added to target validation images. This way, the attack generalizes from a single chosen target to a chosen patch, which can be added to any data.

2.2 Defenses against data poisoning:

Defenses can be broadly classified into filter defenses which attempt to detect and remove or sanitize malicious training data, robust training algorithms which use a training routine that yields robust models even on malicious training data, and model repair methods that train models on poisoned data and attempt to repair the poisoned models after training. Filter defenses are easy to deploy as they simply add a pre-processing step, but they require extensive hyperparameter tuning and rely on the assumption that only a small fraction of the dataset is poisoned. Furthermore, filter defenses can often be overcome by adaptive attacks (Koh et al., 2018; Tan & Shokri, 2020). Any defense reduces model performance (in terms of validation accuracy); filter defenses reduce performance as a result of training the new model on fewer samples, robust training methods do so by deviating from standard training practices which are tuned for accuracy in order to increase robustness, and model repair methods harm accuracy by pruning away potentially important neurons.

Numerous options have been proposed for detecting poisoned data. Tran et al. (2018a) detect spectral signatures associated with backdoor triggers based on their correlation with the top right singular vector of the covariance matrix formed by feature representations, and additional detection scores can be found in Paudice et al. (2018). Peri et al. (2020) detect poisoned data by clustering based on deep KNN, re-labeling data based on the nearest neighbors in feature space. Chen et al. (2019) cluster training data based on activation patterns in feature space. Another method detects images containing backdoor triggers by flagging images whose corresponding predictions do not change when they are combined with clean samples (Gao et al., 2019). Yet, any measure of “anomaly” that is used to filter images can also be used to generate poisoned data which minimizes the anomalous property, thus adaptively defeating the defense (Koh et al., 2018).

Robust training algorithms may incorporate strong data augmentations (Borgnia et al., 2020), randomized smoothing (Weber et al., 2020), or may partition data into disjoint pieces and train individual models on the partitions, performing classification via majority voting at test-time (Levine & Feizi, 2021). Another popular robust training strategy harnesses differentially private SGD (Abadi et al., 2016; Ma et al., 2019), as differentially private models are inherently insensitive to small changes on a subset of the training set. Differentially private SGD is applied by clipping and adding noise to the model parameter gradients during training. Hong et al. (2020) note that the addition of noise in differentially private SGD is the primary factor controlling robustness to poisoning. However, differential privacy is an extreme and general definition of robustness to data manipulations, compared to robustness specific to data poisoning. This strategy consequently incurs a significant performance penalty (Jayaraman & Evans, 2019), and these algorithms can even be adaptively attacked by modifying gradient signals during poison generation in the same manner as in the defense (Veldanda & Garg, 2020). Other robust training schemes are proposed in Li et al. (2021a); Tao et al. (2021); Radiya-Dixit & Tramer (2021).
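The clipping-and-noise step of differentially private SGD mentioned above can be sketched as follows. This is an illustrative version operating on plain lists of per-example gradients; the names `max_norm` and `noise_multiplier` are assumptions chosen for this example, and no privacy accounting is included.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Rescale a per-example gradient so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    # Noise standard deviation scales with the clipping threshold.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
noiseless = private_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = private_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping bounds the influence of any single (possibly poisoned) example on the update, while the noise term is the component that Hong et al. (2020) identify as the primary factor for robustness.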

Model repair strategies, primarily designed to defend against backdoor attacks, may reconstruct the backdoor trigger and nullify its effects (Wang et al., 2019) or prune away neurons which are inactive on clean samples (Liu et al., 2018). Yet, adaptive attacks can bypass these defenses by creating poisoned data whose activation patterns mimic those of clean data. In order to counteract the loss of performance induced by pruning, some methods fine-tune the pruned model on clean data (Liu et al., 2018; Chen et al., 2020). But this process is only effective when the defender possesses large quantities of trustworthy clean data.

While defenders against attacks on model availability are at a natural advantage, as they can measure the loss in model availability and optimize their training setup to mitigate the attack (Radiya-Dixit & Tramer, 2021), defenses against attacks on model integrity have no real way of noticing attacks and have to function without case-specific adaptation.

3 Understanding Adversarial Training for Data Poisoning

Adversarial training (Madry et al., 2018; Sinha et al., 2018) reduces the impact of test-time adversarial attacks and is generally considered the only strong defense against adversarial examples. Adversarial training solves the saddle-point problem,

$$\min_{\theta}\mathbb{E}_{(x,y)\sim\mathbb{D}}\left[\max_{\Delta\in S}\mathcal{L}_{\theta}(x+\Delta,y)\right], \tag{1}$$

where $\mathcal{L}_{\theta}$ denotes the loss function of a model with parameters $\theta$, and the adversary perturbs inputs $x$ from a data distribution $\mathbb{D}$, subject to the constraint that the perturbation $\Delta$ is in $S$. Peri et al. (2020) note that adversarial training against test-time evasion attacks already confers a degree of robustness against data poisoning at a performance cost. Our proposed strategy is an adaptation of adversarial training to poisoning, resulting in a stronger defense that degrades performance less than differentially private SGD or adversarial training against evasion attacks. In our adversarial training paradigm, two parties engage in a mini-max game; the attacker maliciously poisons the training data to cause the model to mis-classify targets, while the defender trains the model to correctly classify both poisons and targets. The capabilities of an attacker depend on its knowledge of the defender’s training setup, so we now enumerate a series of assumptions concerning the knowledge of the attacker and defender before presenting our framework in precise detail.
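For reference, the inner maximization of the saddle-point problem in Equation 1 is commonly approximated with a few steps of projected signed gradient ascent. The sketch below illustrates this for a toy linear model with logistic loss and an $\ell^{\infty}$ constraint set; the model, step size, and step count are assumptions made for the example and are not the setup used in this work.

```python
import math


def logistic_loss(w, x, y):
    """Loss of a linear model with weights w on input x with label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-margin))


def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def inner_maximization(w, x, y, eps, steps=5):
    """Approximate the max over ||delta||_inf <= eps by projected signed gradient ascent."""
    step_size = eps / steps
    delta = [0.0] * len(x)
    for _ in range(steps):
        perturbed = [xi + di for xi, di in zip(x, delta)]
        grad = input_gradient(w, perturbed, y)
        delta = [di + step_size * sign(gi) for di, gi in zip(delta, grad)]
        # Project back onto the constraint set S (an l-infinity ball of radius eps).
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


w, x, y = [1.0, -2.0], [0.5, -0.5], 1
delta = inner_maximization(w, x, y, eps=0.1)
clean_loss = logistic_loss(w, x, y)
adversarial_loss = logistic_loss(w, [xi + di for xi, di in zip(x, delta)], y)
```

The outer minimization would then take a training step on the perturbed input `x + delta` rather than on `x`.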

Preparing for a strong threat model.

In order to harden the model against a wide range of poisons, we train against a strong surrogate attacker. The differences between the surrogate threat model and that of a real-world attacker concern the attacker’s access to the defender’s training routine. The surrogate attacker in our training algorithm is aware of the defender’s training protocol (e.g. learning rate, optimization algorithm), architecture, and defense strategy but can neither influence training nor intercept random factors such as initialization and mini-batch sampling. In cases where the defender only re-trains a component of a model or fine-tunes the model, the exact baseline pre-trained model, including its parameters, is known to both parties. The attacker’s trigger (a target image for targeted poisoning or a specific patch for backdoor attacks) is unknown to the defender, and we do not assume that the defender possesses additional, vetted, clean data. In order to constrain the attacker, the defender chooses a $\|\cdot\|$-norm perturbation budget $\varepsilon$ against which they seek robustness.

Since the attacker possesses such strong knowledge concerning the defender’s training routine, the threat it poses constitutes a near worst-case analysis. If more factors, such as model definition or parts of the training protocol, are hidden from the attacker, then the quality of the defense can only improve. On the other hand, the defender needs to set an $\varepsilon$ bound within which to be robust against attacks - if there were no such limit, then the attacker could arbitrarily modify data.

Other defenses such as Xu et al. (2020) have additionally considered the existence of clean data or a given clean model, but this presents the obvious problem that if the security of the data pipeline is breached, then there may be no way of knowing a-priori what data is clean and what is not. As such, here we are explicitly looking for defenses that require no clean data.

3.1 Adversarially training Poison Immunity

In contrast to adversarial attacks at test-time, the objective for targeted data poisoning is itself already a bilevel objective. For a given target $x_{t}$ with an intended adversarial label $y_{t}$, targeted data poisoning optimizes poisoned data points $x_{p}$, so that models trained with them exhibit low loss on the adversarial label $y_{t}$ instead of its original label $y_{o}$:

$$\min_{x_{p}}\mathcal{L}(x_{t},y_{t},\theta(x_{p}))\quad\text{s.t.}\quad\theta(x)=\operatorname*{arg\,min}_{\theta}\sum_{i=1}^{N}\mathcal{L}(x_{i},y_{i},\theta). \tag{2}$$

The optimal defense against this attack then develops into a two-player game between attacker and defender. While the attacker minimizes the loss on the target, the defender maximizes it (with $\theta(x)$ as defined above),

$$\min_{x_{p}}\max_{\Delta\in S}\mathcal{L}(x_{t},y_{t},\theta(x_{p}+\Delta)). \tag{3}$$

However, this formulation reveals the central trick of adversarially training against data poisoning: Both attacker and defender modify the same variable $x=x_{p}+\Delta$. This means that any known algorithm used to approximate the bilevel poisoning objective for an attack (e.g. Huang et al. (2020) or Geiping et al. (2021)) is valid as an approximation for a defense.

In the formulation above, the solution for the defender is simple: the defender can optimize the poisoning objective Equation 2 for $x_{p}$ and then set $\Delta=-x_{p}$. Yet, in practice, the defender neither knows the specific data point $x_{t}$ targeted by the attack, nor can they change their strategy in response to the attacker. The optimal choice is hence to sample surrogate targets $(x_{t},y_{t})$ from the data distribution, optimize $x_{p}$, and take steps to maximize $\mathcal{L}(x_{t},y_{t},\theta)$:

$$\max_{\Delta\in S}\mathbb{E}_{(x_{t},y_{t})\in\mathcal{D}}\left[\min_{x_{p}}\mathcal{L}(x_{t},y_{t},\theta(x_{p}+\Delta))\right]. \tag{4}$$

Instead of maximizing the adversarial loss, the defender can equivalently minimize the loss of the true label for the sampled target, i.e. $\mathcal{L}(x_{t},y_{o},\theta)$.

To implement the objective Equation 4, we sample mini-batches of data and first split each batch into two subsets, $(x_{p},y_{p})$ and $(x_{t},y_{t})$, where each data point is placed in the first ("poison") split with probability $s$. For the sampled targets $x_{t}$ we then approximately minimize $\mathcal{L}(x_{t},y_{t},\theta(x_{p}+\Delta))$ through a known data poisoning attack, which yields $x_{p}$. We can then update the model parameters to minimize $\mathcal{L}(x_{t},y_{o},\theta)$ and $\mathcal{L}(x_{p},y_{p},\theta)$, i.e. we train the model on the concatenated output of surrogate poisons and targets, as seen in Figure 1. This way, we effectively alternate between both steps in Equation 4.

Interestingly, contemporary work in Li et al. (2021a) instead maximizes the adversarial loss in the outer objective and, instead of optimizing in expectation, detects poisoned data points $x_{p}$ and maximizes over them, leading to a different approximation to Equation 3. Work in Tao et al. (2021) focuses on adversarial training against attacks on model availability, which can be understood by replacing Equation 2 by a maximization of loss over all data points, leading to the min-max structure of Equation 1 as a defense. Work in Radiya-Dixit & Tramer (2021) instead proposes to optimize Equation 4 by first generating additional poisoned data $x_{p}$ for a fixed pretrained model and then training with all data, which corresponds to approximating the objective Equation 4 in an offline fashion.

These considerations hold not only for targeted data poisoning, but likewise apply to poisoning with other modalities, such as backdoor triggers. In those cases, both $x_{p}$ and $x_{t}$ are optimized (or sampled) instead of only $x_{p}$.

Example: Defending against Gradient Matching.

While our methodology can be applied to any data poisoning attack, there are several considerations to make when adapting the attack into a format that is applicable and practical to run in each mini-batch of training. We detail these considerations for a recent attack (Geiping et al., 2021).

The cosine similarity objective of the attack is originally evaluated on a clean surrogate model trained by the attacker and the attack is optimized for a significant number of iterations ($n=250$ in the original work). To apply it during training, we first replace the clean model used in the attack with the current model in the current state of training - this is actually an advantage for the defender. While the attacker needs to create poisons on a surrogate model, the defender can use the exact model, making it easier to create effective poisons. Secondly, similar to adversarial training, the number of attack iterations can be reduced. In practice, we choose $n=5$ during the defense, as a compromise between creating a strong attack and spending a limited time budget, as the attack is naturally stronger due to its basis in the current state of training. Third, we need to choose malicious labels $y_{t}$. These labels could be chosen entirely at random, but then the average gradient over all targets would likely be small. However, poisoned data points in Geiping et al. (2021) are in practice chosen from the same class as the target adversarial label, and this choice can be replicated for the randomly chosen subset of poisoned data points $x_{p}$ with labels $y_{p}$ by choosing $y_{t}$ as the label that appears most often in $y_{p}$.
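Two ingredients of this adaptation are easy to illustrate in isolation: the choice of the malicious label and the cosine similarity objective. The sketch below assumes gradients are already available as flat lists of floats (how they are computed is omitted), and the function names are hypothetical.

```python
import math
from collections import Counter


def choose_malicious_label(poison_labels):
    """Pick the label that appears most often among the poison split."""
    return Counter(poison_labels).most_common(1)[0][0]


def gradient_matching_loss(target_grad, poison_grad):
    """One minus the cosine similarity between two flattened gradients."""
    dot = sum(a * b for a, b in zip(target_grad, poison_grad))
    norm_t = math.sqrt(sum(a * a for a in target_grad))
    norm_p = math.sqrt(sum(b * b for b in poison_grad))
    return 1.0 - dot / (norm_t * norm_p)


y_p = [2, 0, 2, 1, 2, 0]  # toy labels of the poison split
y_t = choose_malicious_label(y_p)

# Perfectly aligned gradients give a loss near 0, opposed gradients a loss near 2.
aligned = gradient_matching_loss([1.0, 2.0], [2.0, 4.0])
opposed = gradient_matching_loss([1.0, 2.0], [-1.0, -2.0])
```

During the defense, the poison perturbations would be updated for a small number of steps to reduce this loss, evaluated on the current model.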

Other defenses can be viewed as special cases of poison immunity.

This methodology generalizes and explains previous work on defenses against poisoning. In Borgnia et al. (2020), strong data augmentations such as mixup (Zhang et al., 2018) and cutout (DeVries & Taylor, 2017) are proposed as defenses against data poisoning. These defenses are special cases of the proposed poison immunity; when Algorithm 1 is used to defend against a watermarking attack (which superimposes the target data onto poison data with low opacity), then the attack is equivalent to mixup data augmentation with mixing factor $\alpha=1-\varepsilon$. Likewise, implementing Algorithm 1 against a patch attack reduces to patching randomly selected pairs of data points with a random patch. If this patch is chosen to be uniformly gray, then this defense is exactly equivalent to the cutout data augmentation.
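The correspondence between watermarking and mixup can be checked numerically on toy inputs. In this sketch, images are flattened lists of pixel values, the opacity `eps` plays the role of $\varepsilon$, and label handling is omitted; all function names are illustrative.

```python
def watermark(poison, target, eps):
    """Superimpose the target onto the poison with opacity eps."""
    return [(1.0 - eps) * p + eps * t for p, t in zip(poison, target)]


def mixup(a, b, alpha):
    """Convex combination of two inputs with mixing factor alpha."""
    return [alpha * ai + (1.0 - alpha) * bi for ai, bi in zip(a, b)]


def gray_patch(image, start, size, gray=0.5):
    """Overwrite a contiguous region of a flattened image with uniform gray."""
    patched = list(image)
    patched[start:start + size] = [gray] * size
    return patched


poison = [0.0, 0.2, 0.4, 0.6]
target = [1.0, 0.8, 0.6, 0.4]
eps = 0.25

blended = watermark(poison, target, eps)
mixed = mixup(poison, target, alpha=1.0 - eps)
patched = gray_patch(poison, start=1, size=2)
```

Here `blended` and `mixed` coincide, and `patched` shows the cutout-like effect of a uniformly gray patch.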

Algorithm 1 Modified iterative training routine for poison immunity.
  Input: Split probability $s\in(0,1)$.
  repeat
     Sample mini-batch of data $\{x_{i},y_{i}\}_{i=1}^{n}$
     Split data randomly into two subsets $x_{p}$, $x_{t}$ with probability $s$
     Draw malicious labels $y_{t}$ for $x_{t}$
     Apply a data poisoning attack to minimize $\mathcal{L}(x_{t},y_{t},\theta(x_{p}))$ via $x_{p}$
     Concatenate $x_{p},x_{t}$ into a new batch $x_{m}$ with unchanged labels $\{y_{i}\}_{i=1}^{n}$
     Update model based on new data $x_{m}$
  until training finished
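One iteration of Algorithm 1 could be organized as in the following sketch. It is a schematic illustration only: `attack` and `update` are hypothetical callables supplied by the user (a poisoning attack and a model update step), the toy attack shown is a placeholder and not a real poisoning method, and the malicious label is drawn as the most frequent poison label, which is one possible choice.

```python
import random
from collections import Counter


def poison_immunity_step(batch, s, attack, update, rng):
    """One iteration: split, draw malicious labels, poison, recombine, update."""
    x_p, y_p, x_t, y_o = [], [], [], []
    for x, y in batch:
        if rng.random() < s:  # poison split with probability s
            x_p.append(x)
            y_p.append(y)
        else:  # target split otherwise
            x_t.append(x)
            y_o.append(y)
    if x_p and x_t:
        # Malicious labels for the targets (most frequent poison label).
        y_t = [Counter(y_p).most_common(1)[0][0]] * len(x_t)
        x_p = attack(x_p, y_p, x_t, y_t)
    # Train on poisons and targets together, keeping the original labels.
    update(x_p + x_t, y_p + y_o)


def toy_attack(x_p, y_p, x_t, y_t):
    """Placeholder perturbation; a real attack would optimize x_p."""
    return [[v + 0.01 for v in x] for x in x_p]


seen = []  # records the batches passed to the (stub) model update
rng = random.Random(0)
batch = [([0.1 * i], i % 3) for i in range(8)]
poison_immunity_step(batch, 0.75, toy_attack, lambda xs, ys: seen.append((xs, ys)), rng)
```

Passing the attack as a callable reflects that the routine is agnostic to which poisoning attack is used in the inner step.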

3.2 Adaptive Attack Scenarios

Crucial for the design of new defense algorithms is their ability to withstand adaptive attacks, i.e. attacks that can be modified to respond to a defense algorithm when the attacker is aware of the defense. While this principle is well established in the literature on adversarial attacks at test-time (Carlini et al., 2019; Tramer et al., 2020), it has not been applied as rigorously to data poisoning.

The defense proposed in this work is exceedingly effective against non-adaptive attacks (evaluating the exemplary case of gradient matching), as the difference in training regimes leads to incorrect perturbations computed by an attacker that relies on a pre-trained surrogate model. However, this would also be the case for most modifications to the training procedure, such as adding data augmentations or changing learning rates or optimizer settings. As such, we find that the optimal way to attack this defense is for the attacker to re-train their pre-trained model with exactly the same defense and the same hyperparameters. The attacker can then more accurately estimate the target gradient (for gradient matching) or target features (for feature collision). We also investigated the possibility of applying Algorithm 1 during the optimization of poisoned data itself as an additional stochastic input modification. However, this modification weakens the attack by gradient masking, making it too difficult for the attacker to optimize the poisoned data. This behavior mirrors (test-time) adversarial attacks, where it is non-optimal to add additional perturbations during the creation of an adversarial perturbation. As we will find in the next section, the defense has a major impact on the feature space of a model, which may make it difficult to bypass it with better adaptive attacks.

4 Analysis

To understand the effect of the proposed poison immunity scheme qualitatively, we conduct an analysis of feature space visualizations for several attacks. Shafahi et al. (2018), who introduced feature collision attacks, illustrate their poisoning method by visualizing the feature space collisions between poisoned data and the target. These experiments are carried out in the transfer setting, where the feature representation of the model is fixed and known to both parties. We run Bullseye Polytope, a recent and improved feature collision attack (Aghakhani et al., 2020), in a strong attack setting of 500 poisoned examples for a ResNet-18 pre-trained on CIFAR-10 and re-trained on the poisoned dataset. We visualize the feature space by plotting the projections of feature vectors of data onto the vector connecting the centroids of the poison (base) class and the target class and its orthogonal (generated using PCA) in the $x$-$y$ plane, and the softmax output for the poison (base) class for each of the points on the $z$ axis. This way we expect both classes to form separate clusters in feature space ($x$-$y$ plane) and to be further separated by the poison class probability, which is low for images from the target class and high for images from the poison class.
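The first axis of this visualization, the projection onto the vector connecting the two class centroids, can be sketched as follows. The example uses toy two-dimensional features, centers the projection at the poison class centroid (an assumption made here for readability), and omits the second, PCA-based axis.

```python
def centroid(vectors):
    """Mean of a list of equally sized feature vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def project_on_class_axis(features, poison_class_features, target_class_features):
    """Project feature vectors onto the unit vector joining the two class centroids."""
    c_p = centroid(poison_class_features)
    c_t = centroid(target_class_features)
    axis = [t - p for p, t in zip(c_p, c_t)]
    length = sum(a * a for a in axis) ** 0.5
    axis = [a / length for a in axis]
    # Coordinates are measured from the poison class centroid.
    return [
        sum((f_j - c_j) * a_j for f_j, c_j, a_j in zip(f, c_p, axis))
        for f in features
    ]


poison_class = [[0.0, 0.0], [0.0, 2.0]]  # toy features, centroid (0, 1)
target_class = [[4.0, 0.0], [4.0, 2.0]]  # toy features, centroid (4, 1)
points = [[0.0, 1.0], [4.0, 1.0], [2.0, 5.0]]
coords = project_on_class_axis(points, poison_class, target_class)
```

In this toy case, the poison centroid maps to 0, the target centroid maps to 4, and a point halfway between the classes maps to 2 regardless of its position along the orthogonal direction.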

(a) Undefended model, clean (left) and fine-tuned on poisoned data containing feature collisions (right).
(b) Defended model, clean (left) and fine-tuned on poisoned data containing feature collisions (right).
(c) Undefended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
(d) Defended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
Figure 2: Visualization of the effects of poisoning attacks against an undefended and a defended model. Top: Feature collisions via Aghakhani et al. (2020). Bottom: Gradient matching as in Geiping et al. (2021). The target image is marked by a black triangle and is originally part of the class colored blue. The poisoned images are colored red and are part of the class colored green. The $x$-$y$ axis in each diagram corresponds to a projection of the principal direction separating both classes, while the confidence in the original target class is marked on the $z$-axis. The defended models generate a feature space in which poisons (red) behave consistently with the robust model, preventing collisions in Figure 2(b) and alignment in Figure 2(d), so that the target (black) remains correct.

Figure 2(a) shows the effects of an attack on a baseline model. The poisoned images (red) move to collide with the target (black triangle) in feature space, as seen by their overlap in the $x$-$y$ plane, while maintaining their original label ($z$-axis), subsequently leading to a misclassification of the target image as an image from the poison class - this is a feature collision attack. Figure 2(b), however, contrasts this collision with the effects of poisons on a defended model. Two effects stand out: First, the poisons, which were optimized to collide with the target, no longer cluster around the target (refer also to the 2D visualization in Figure 12), indicating that straightforward collisions are difficult to achieve against the robust model. Second, poisoned images close to the target are now predicted as the target base class, shown by their descent on the $z$-axis. While we would expect the first effect for any defense, the second effect really breaks the attack. The defended model is robust enough to assign poisoned images a label that agrees with their feature representation, even though this assignment contradicts the given labels of these images. In essence, the poisoned images are treated like images from the target class but with a noisy label. This property reverses the attack. Instead of moving the target into the poison class, the poisoned images are drawn into the target class as it matches their feature representation. The model stays consistent and is able to defend against the strong attack analyzed here.

In addition to feature collision attacks, in Figure 2, we analyze the defense against the gradient matching attack of Geiping et al. (2021) in the from-scratch setting, where the model is fully re-trained. The attack can be seen to be effective in Figure 2(c), changing the decision boundary of the model to fit the target without collisions by clustering poisons opposite to the target in feature space, significantly moving the target. However, this is prevented by the defense as seen in Figure 2(d). The robust model is not modified by the clustering of poisoned images, and outliers seen in the undefended model are again reclassified as the target class leading to a consistent decision. An interesting side effect of the defense for both attacked and clean models is that the model itself is generally less over-confident in its clean predictions. We compute similar outcomes also for other attacks, such as patch attacks, which we show in the appendix in Figure 6 to Figure 15.

Table 1: Quantitative results for several attacks and their defense by poison immunity with $s=0.75$, showing avg. poison success with standard error (where all trials have equal outcomes, we report the worst-case error estimate of $5.59\%$). Additional details about each attack threat model can be found in Section D.2. The proposed defense significantly decreases success rates over a wide range of attacks and scenarios without any hyperparameter changes. This table evaluates gradient matching (GM) with both squared error (SE) and cosine similarity (CS). The evaluation column with the strongest attack uses gradient matching with cosine similarity. An extended table including natural accuracies and runtimes can be found in the appendix in Table 4.
| Attack | Scenario | Undefended | Defended (Same Attack) | Defended (Strongest Attack) |
|---|---|---|---|---|
| MetaPoison | From-scratch | 69.50% (±9.34) | 10.00% (±9.49) | 10.00% (±9.49) |
| Gradient Matching (CS) | From-scratch | 90.00% (±6.71) | 0.00% (±5.59) | 0.00% (±5.59) |
| Bullseye Polytope | Fine-tuning | 80.00% (±8.94) | 0.00% (±5.59) | 8.33% (±7.98) |
| Bullseye Polytope | Transfer | 100.00% (±5.59) | 10.00% (±6.71) | 0.00% (±5.59) |
| Poison Frogs | Transfer | 100.00% (±5.59) | 15.00% (±7.98) | 0.00% (±5.59) |
| Gradient Matching (SE) | Transfer | 95.00% (±4.87) | 0.00% (±5.59) | 5.00% (±4.87) |
| Hidden Trigger Backdoor | Transfer | 55.59% (±5.65) | 24.78% (±6.82) | 3.32% (±0.79) |

5 Experiments

This section details a quantitative analysis of the proposed defense for the application of image classification with deep neural networks. To fairly evaluate all attacks and defenses, especially in light of Schwarzschild et al. (2020) discussing the difficulty in comparing attacks across different evaluation settings, we implement all attacks and defenses in a single unified framework, which we will make publicly available. For all experiments, we measure avg. poison success over 20 trials, where each trial represents a randomly-chosen attack trigger from a random class and a separately attacked and trained model. The sampling of randomized attack triggers is crucial to estimate the average performance of poisoning attacks, which are generally more effective for related class labels. We discuss additional experimental details in Appendix D.
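
The reported statistics can be illustrated with a short calculation. The following is a minimal sketch, assuming that each trial of a targeted attack yields a binary outcome (the target is misclassified as intended or not) and that the standard error is the binomial estimate $\sqrt{p(1-p)/n}$. This formula is an assumption; it is consistent with table entries such as 90.00% (±6.71) for 20 trials, but it is not stated explicitly in the text. The function name `poison_success_stats` is hypothetical.

```python
import math

def poison_success_stats(outcomes):
    """Return (mean success in %, standard error in %) for binary trial outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one trial")
    p = sum(outcomes) / n
    # Binomial standard error of the mean (assumed formula, not confirmed by the text).
    se = math.sqrt(p * (1.0 - p) / n)
    return round(100.0 * p, 2), round(100.0 * se, 2)

# Example: 18 of 20 independently attacked models misclassify their target.
outcomes = [1] * 18 + [0] * 2
print(poison_success_stats(outcomes))  # (90.0, 6.71)
```

When all trials agree, this estimate collapses to zero, which is the case in which the tables report the worst-case error estimate instead.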

5.1 Defending against $\ell^\infty$ threat models

We focus on defending against threat models in which the attacker may modify some percentage of the training data within an $\ell^\infty$ bound, i.e. change every pixel slightly. This covers all mentioned targeted data poisoning attacks as well as hidden trigger backdoor attacks (Saha et al., 2020).
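
As a concrete reading of this constraint, the sketch below projects a candidate perturbation onto the $\ell^\infty$ ball of radius $\varepsilon$ and keeps the poisoned image inside the valid pixel range. It assumes pixel values on a 0 to 255 scale and uses $\varepsilon=16$, the bound used in the experiments of this section. The helper name `project_linf` and the flat-list image representation are illustrative choices and are not taken from the evaluated framework.

```python
def project_linf(image, perturbation, eps=16.0, lo=0.0, hi=255.0):
    """Clip a perturbation to [-eps, eps] per pixel and keep image + perturbation in [lo, hi]."""
    projected = []
    for x, d in zip(image, perturbation):
        d = max(-eps, min(eps, d))        # l-infinity bound on each pixel
        d = max(lo - x, min(hi - x, d))   # keep the poisoned pixel a valid value
        projected.append(d)
    return projected

image = [0.0, 100.0, 250.0]
delta = [-30.0, 20.0, 12.0]
print(project_linf(image, delta))  # [0.0, 16.0, 5.0]
```

Every pixel may change, but none by more than $\varepsilon$, which is why such perturbations can remain inconspicuous to a human observer.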

Defending in Diverse Scenarios

To evaluate the proposed defense mechanism thoroughly, we consider a variety of attacks in different scenarios. We distinguish three scenarios with increasing difficulty for the attacker: transfer, where the defender only re-trains the last linear layer of a model; fine-tuning, where the defender re-trains all layers; and from-scratch, where the defender trains a completely new model.
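
The difference between the three scenarios can be summarized by which layers the defender updates and whether training starts from a pre-trained model. The sketch below is only a schematic of that distinction; the function `retraining_plan` and the layer names are hypothetical and do not correspond to a specific implementation.

```python
def retraining_plan(scenario, layers):
    """Return (layers updated by the defender, whether training starts from pre-trained weights)."""
    if scenario == "transfer":
        return [layers[-1]], True     # only the last linear layer is re-trained
    if scenario == "fine-tuning":
        return list(layers), True     # all layers are re-trained from pre-trained weights
    if scenario == "from-scratch":
        return list(layers), False    # a completely new model is trained
    raise ValueError(f"unknown scenario: {scenario}")

layers = ["conv1", "block1", "block2", "fc"]  # placeholder layer names
for scenario in ("transfer", "fine-tuning", "from-scratch"):
    print(scenario, retraining_plan(scenario, layers))
```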

We first apply the proposed defense against a range of attacks and settings in Table 1, choosing $s=0.75$ for all targeted data poisoning attacks and making no additional modifications. All attacks shown are adaptive where possible. In the fine-tuning and transfer scenarios, the pre-trained model is defended but known to the attacker exactly. In all cases, we observe that while the attacks are highly effective against an undefended model, our defense steeply reduces their effectiveness. These encouraging results suggest that the proposed methodology is a strong strategy that can be robustly applied across a range of attacks and may also generalize to future attacks.

A natural question to ask is whether this defense, which trains against one specific surrogate attack, can be circumvented when the real attacker utilizes a different attack. Surprisingly, we find in Table 1 that using gradient matching, a strong attack, as surrogate during training successfully defends against a range of other attacks. This corroborates findings in Madry et al. (2017), which show that training with the strongest test-time adversarial attack also defends against weaker test-time attacks, a mechanism which appears to also hold when defending against data poisoning.

Defending against Large Poison Budgets

We investigate the robustness of the defense to large poisoning budgets in Figure 3 for the case of a defense against gradient matching in the from-scratch setting on a ResNet-18. In this scenario, the attack already succeeds when $1\%$ of the data is poisoned, but the attacker can increase their budget further and poison more of the data. The defense still effectively reduces poison success even with large levels of poisoned data, with a roughly logarithmic scaling. Note that $10\%$ is the limit for this attack on CIFAR-10, at which all data points from the poisoned class are modified. At this limit, 5000 poisoned data points are inserted to change the behavior of a single fixed target data point. This scenario is unlikely to occur in practice, where the attacker can usually poison only a fraction of the training data, yet poisoning accuracy is still meaningfully reduced. Also interesting is the comparison to filter defenses in this setting: filter defenses would not be able to distinguish poisoned and real data when too much of the dataset is poisoned, leading to a catastrophic failure of these defenses.
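
The budget arithmetic behind this limit is easy to verify. The sketch below assumes the standard CIFAR-10 training split of 50,000 images balanced over 10 classes and assumes that all poisons are drawn from a single class; the function name `poison_budget` is hypothetical.

```python
def poison_budget(fraction, n_train=50_000, n_classes=10):
    """Return (number of poisoned images, share of one class that this covers)."""
    n_poisons = round(fraction * n_train)
    per_class = n_train // n_classes  # 5000 images per class for CIFAR-10
    return n_poisons, n_poisons / per_class

print(poison_budget(0.01))  # (500, 0.1)
print(poison_budget(0.10))  # (5000, 1.0)
```

A budget of $1\%$ therefore corresponds to 500 modified images, while $10\%$ exhausts the entire poison class.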

Figure 3: Effectiveness of the Defense against large attack budgets for the case of a defense against gradient matching in the from-scratch setting on a ResNet-18.

Comparison to Other Defenses

In this subsection, we compare the proposed defense to other existing defense strategies against data poisoning including differentially private SGD, adversarial training, various data augmentations, and filter defenses. For differentially private SGD and adversarial training, we test several noise levels and perturbation budgets, respectively. When comparing to filtering defenses, we allow an optimal hyperparameter choice by supplying the exact number of poisons in the training set, although this information would be unknown in practice. We analyze poison immunity training with varying levels of $s$ to show the trade-off of performance and security.

Figure 4: Avg. poison success versus validation accuracy for various defenses against the gradient matching attack of Geiping et al. (2021) in the from-scratch setting. The baseline undefended model is shown in blue, the proposed defense in red. Differentially private SGD is shown for noise values from $0.0001$ to $0.01$. The proposed defense offers a strong trade-off between robustness and accuracy.
Table 2: Defenses against feature collision via Aghakhani et al. (2020) in the transfer setting for a budget of $1\%$ and a bound of $\varepsilon=16$.

| Defense | Poison Success | Val. Acc. | Time (h:mm:ss) |
|---|---|---|---|
| None | 100.00% (±5.59) | 91.97% | 0:08:04 |
| Random Noise | 90.00% (±6.71) | 90.45% | 0:08:33 |
| Deep K-NN | 75.00% (±9.68) | 91.94% | 3:29:13 |
| Activation Clustering | 0.00% (±5.59) | 91.34% | 0:07:08 |
| Spectral Signatures | 0.00% (±5.59) | 92.09% | 0:09:10 |
| Diff. Priv. SGD ($n=0.0001$) | 100.00% (±5.59) | 92.72% | 0:08:08 |
| Diff. Priv. SGD ($n=0.001$) | 100.00% (±5.59) | 91.26% | 0:08:07 |
| Diff. Priv. SGD ($n=0.01$) | 85.00% (±7.98) | 69.78% | 0:08:12 |
| Adv. Training ($\varepsilon=8$) | 5.00% (±4.87) | 77.80% | 0:42:20 |
| Adv. Training ($\varepsilon=16$) | 5.00% (±4.87) | 58.98% | 0:42:37 |
| Adv. Poisoning ($s=0.25$) | 90.00% (±6.71) | 91.24% | 0:30:44 |
| Adv. Poisoning ($s=0.5$) | 70.00% (±10.25) | 89.83% | 0:35:06 |
| Adv. Poisoning ($s=0.75$) | 10.00% (±6.71) | 88.64% | 0:39:53 |

We test the gradient matching attack proposed in Geiping et al. (2021) for a ResNet-18 trained on CIFAR-10 with a budget of $1\%$ and $\varepsilon=16$ (the same setting as proposed in that work). While previous defenses were shown to be ineffective in Geiping et al. (2021), we now show in Figure 4 that the proposed poison immunity defense is an extremely effective defense in the from-scratch setting, yielding a much stronger protection than filter defenses, with only a mild trade-off in validation accuracy compared to differential privacy and classical adversarial training.

Comparison to Other Defenses for Transfer Learning

We additionally compare poison immunity to other existing defenses in the transfer setting, where only the last layer is re-trained on poisoned data. We test feature collisions via Bullseye Polytope (Aghakhani et al., 2020), also for a ResNet-18 trained on CIFAR-10 with a budget of $1\%$ poisoned data within an $\ell^\infty$ bound of $\varepsilon=16$. This setting is ideal for commonly used filtering defenses, as a large number of poisoned data points collide with the target image, which can be detected and filtered. The setting is difficult for robust training methods, both due to the large perturbations and due to the limitation that only the last layer is re-trained, which leaves less control over the model. We record results in Table 2, finding that poison immunity can be effective even in a scenario that favors filter defenses, matching filter defenses, while beating adversarial training significantly in the trade-off against validation performance.

Releasing Robust Models

So far, we have considered scenarios in which the defense described in Algorithm 1 is always active, even during the fine-tuning procedure in the transfer setting. However, especially in the transfer setting, we are interested in the inherent robustness of models and its transferability. We thus analyze a setting in which the base model is trained robustly via poison immunity, but the last layer is re-trained non-robustly on poisoned data. For the Bullseye Polytope attack in the transfer setting, this approach leads to an avg. poison success of only $20.00\%$ ($\pm 8.94$) and a validation accuracy of $88.66\%$ when using a base model trained robustly with poison immunity via $s=0.75$, compared to the best defense with $10.00\%$ ($\pm 6.71$). A significant part of the overall robustness is thus already encoded into the base model. Figure 5 visualizes that the target confidence remains consistent, even if the fine-tuning is non-robust.

Figure 5: Feature Defense. A robust base model can withstand feature collision attacks even when fine-tuning non-robustly on poisoned data. Note how no collision forms around the target and the model remains robust.

5.2 Defending against $\ell^0$ threat models

In this section we further investigate patch attacks as in Gu et al. (2019), which trigger backdoors by adding an $\ell^0$-bounded patch to training images in a given class, causing a network trained on these images to associate the patch with the given class. Then, the attacker patches test-time images (of a different class) with the same patch in the hope that the network misclassifies the images. Most backdoor attacks fall under an $\ell^0$ threat model, or more specifically a limited $\ell^0$ norm for a connected rectangular region, a patch. Defending against such attacks through adversarial training techniques is not as well understood as for $\ell^\infty$ threat models, even for test-time adversarial attacks (Rao et al., 2020). From a theoretical standpoint, Equation 4 remains valid in this threat model for a different constraint set $S$. However, finding the $\operatorname*{arg\,min}$, i.e. the worst-case patch to apply, appears difficult from an optimization perspective.

We evaluate this threat model for the example of a challenging backdoor trigger attack in Table 3, where we insert a 4x4 trigger patch into $5\%$ of the CIFAR-10 training data. This patch is either a noisy checkerboard, and as such out-of-distribution for CIFAR-10, or a Firefox logo, i.e. a potentially in-distribution semantic feature. The evaluated filtering defenses do not defend well against either patch, whereas differentially private SGD provides some protection against the noise patch. Among instances of the proposed poison immunity strategy, we compare several strategies to approximate Equation 4: We can sample randomized patches of varying sizes ([large] noise patch). We can optimize for a worst-case patch content at a random position (optimized patch). We can sample a randomized, but in-distribution, patch (image patch) by sampling a patch from another image in the batch, which is equivalent to CutMix (Yun et al., 2019). The results show that optimizing the worst-case patch fails to significantly improve upon a well-chosen sampled patch. For the noisy checkerboard, we reach the best results when likewise sampling random noise patches during training, although sampling image patches is also competitive. With the semantically meaningful Firefox logo, however, we find that training with noisy patches only confers robustness to such noisy patterns, and only limited robustness to other patterns. In contrast, the image-based patch can succeed in both settings.
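
To make the image patch strategy more concrete, the following is a minimal sketch of sampling an in-distribution patch from another image in the batch and pasting it at a random position. It is an illustration under simplifying assumptions, not the evaluated implementation: images are single-channel nested lists, the patch is copied from the same coordinates of the donor image, label handling is omitted, and the batch must contain at least two images. The function name `paste_image_patch` is hypothetical.

```python
import random

def paste_image_patch(batch, index, patch_size=4, rng=random):
    """Copy a patch from another image in the batch into batch[index] at a random position."""
    target = [row[:] for row in batch[index]]  # copy, so the batch is not modified in place
    donors = [i for i in range(len(batch)) if i != index]
    donor = batch[rng.choice(donors)]
    height, width = len(target), len(target[0])
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    for r in range(patch_size):
        for c in range(patch_size):
            target[top + r][left + c] = donor[top + r][left + c]
    return target, (top, left)

# Two toy 8x8 images: all zeros and all ones.
batch = [[[0] * 8 for _ in range(8)], [[1] * 8 for _ in range(8)]]
patched, position = paste_image_patch(batch, 0, rng=random.Random(0))
print(sum(map(sum, patched)))  # 16 pixels come from the donor image
```

Replacing the donor pixels with random values would give the noise patch variant, whereas the optimized patch would require an inner optimization over the patch content.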

Table 3: Avg. poison success for various defenses against backdoor trigger attacks, for an attack via a 4x4 patch on $5\%$ of the training data. First group of rows (Undefended to DP-SGD): Baseline and defenses via filtering and differential privacy. Second group (Noise Patch to Image Patch): Variations of adversarial training against poisons. We evaluate each on a noisy checkerboard patch (Noise) and a patch with semantic meaning, a Firefox logo (Semantic).

| Defense | Noise: Poison Acc. | Noise: Nat. Acc. | Noise: Time | Semantic: Poison Acc. | Semantic: Nat. Acc. | Semantic: Time |
|---|---|---|---|---|---|---|
| Undefended | 69.89% (±4.47) | 91.77% | 0:16:47 | 79.68% (±4.50) | 91.67% | 0:16:49 |
| Spectral. Sign. | 68.62% (±7.97) | 77.31% | 0:23:09 | 83.93% (±4.55) | 77.42% | 0:23:06 |
| Deep-KNN | 73.34% (±5.25) | 91.27% | 2:46:10 | 76.96% (±5.07) | 90.97% | 2:45:59 |
| Act. Clustering | 79.72% (±6.82) | 81.06% | 0:28:34 | 86.84% (±3.86) | 80.43% | 0:28:36 |
| DP-SGD | 50.24% (±9.41) | 69.33% | 0:17:47 | 94.40% (±3.53) | 69.24% | 0:17:43 |
| Noise Patch | 29.77% (±6.75) | 92.00% | 0:20:20 | 66.06% (±5.17) | 91.88% | 0:20:21 |
| Large Noise Patch | 33.87% (±5.64) | 88.91% | 0:19:43 | 54.45% (±6.11) | 89.02% | 0:19:28 |
| Optimized Patch | 62.70% (±12.28) | 91.77% | 6:08:03 | 79.86% (±6.24) | 92.01% | 6:00:42 |
| Image Patch | 25.80% (±3.93) | 91.41% | 0:16:45 | 25.62% (±3.85) | 91.15% | 0:16:40 |

6 Conclusions

In this work, we adapt adversarial training to defend against data poisoning attacks. In addition to demonstrating the strong defensive capabilities of our method, poison immunity, we analyze the feature space of defended models and observe mechanisms of defense. We further evaluate the proposed defense against a variety of attacks on deep neural networks for image classification, successfully adapting to and defending against various poisoning attacks. We stress that we believe this strategy to be a general paradigm for defending against data tampering attacks that can extend to novel future attacks.

Acknowledgements

This work was supported by the DARPA GARD and DARPA YFA programs. Additional support was provided by DARPA QED and the National Science Foundation DMS program.

References

  • Abadi et al. (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pp.  308–318, Vienna, Austria, October 2016. Association for Computing Machinery. ISBN 978-1-4503-4139-4. doi: 10.1145/2976749.2978318.
  • Aghakhani et al. (2020) Aghakhani, H., Meng, D., Wang, Y.-X., Kruegel, C., and Vigna, G. Bullseye Polytope: A Scalable Clean-Label Poisoning Attack with Improved Transferability. arXiv:2005.00191 [cs, stat], April 2020. URL http://arxiv.org/abs/2005.00191.
  • Barreno et al. (2010) Barreno, M., Nelson, B., Joseph, A. D., and Tygar, J. D. The security of machine learning. Machine Learning, 81(2):121–148, November 2010. ISSN 0885-6125. doi: 10.1007/s10994-010-5188-5.
  • Borgnia et al. (2020) Borgnia, E., Cherepanova, V., Fowl, L., Ghiasi, A., Geiping, J., Goldblum, M., Goldstein, T., and Gupta, A. Strong Data Augmentation Sanitizes Poisoning and Backdoor Attacks Without an Accuracy Tradeoff. arXiv:2011.09527 [cs], November 2020. URL http://arxiv.org/abs/2011.09527.
  • Borgnia et al. (2021) Borgnia, E., Geiping, J., Cherepanova, V., Fowl, L., Gupta, A., Ghiasi, A., Huang, F., Goldblum, M., and Goldstein, T. DP-InstaHide: Provably Defusing Poisoning and Backdoor Attacks with Differentially Private Data Augmentations. arXiv:2103.02079 [cs], March 2021. URL http://arxiv.org/abs/2103.02079.
  • Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language Models are Few-Shot Learners. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020), December 2020. URL http://arxiv.org/abs/2005.14165.
  • Carlini et al. (2019) Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I., Madry, A., and Kurakin, A. On Evaluating Adversarial Robustness. arXiv:1902.06705 [cs, stat], February 2019. URL http://arxiv.org/abs/1902.06705.
  • Chen et al. (2019) Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., and Srivastava, B. Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering. In SafeAI@AAAI, January 2019. URL https://openreview.net/forum?id=BkVNRJb_ZB.
  • Chen et al. (2020) Chen, X., Wang, W., Bender, C., Ding, Y., Jia, R., Li, B., and Song, D. REFIT: A Unified Watermark Removal Framework for Deep Learning Systems with Limited Data. arXiv:1911.07205 [cs], January 2020. URL http://arxiv.org/abs/1911.07205.
  • DeVries & Taylor (2017) DeVries, T. and Taylor, G. W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv:1708.04552 [cs], August 2017. URL http://arxiv.org/abs/1708.04552.
  • Gao et al. (2019) Gao, Y., Xu, C., Wang, D., Chen, S., Ranasinghe, D. C., and Nepal, S. STRIP: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference, ACSAC ’19, pp.  113–125, New York, NY, USA, December 2019. Association for Computing Machinery. ISBN 978-1-4503-7628-0. doi: 10.1145/3359789.3359790.
  • Geiping et al. (2021) Geiping, J., Fowl, L. H., Huang, W. R., Czaja, W., Taylor, G., Moeller, M., and Goldstein, T. Witches’ Brew: Industrial Scale Data Poisoning via Gradient Matching. In International Conference on Learning Representations, April 2021. URL https://openreview.net/forum?id=01olnfLIbD.
  • Goldblum et al. (2020) Goldblum, M., Tsipras, D., Xie, C., Chen, X., Schwarzschild, A., Song, D., Madry, A., Li, B., and Goldstein, T. Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses. arXiv:2012.10544 [cs], December 2020. URL http://arxiv.org/abs/2012.10544.
  • Gong et al. (2020) Gong, C., Ren, T., Ye, M., and Liu, Q. MaxUp: A Simple Way to Improve Generalization of Neural Network Training. arXiv:2002.09024 [cs, stat], February 2020. URL http://arxiv.org/abs/2002.09024.
  • Gu et al. (2019) Gu, T., Liu, K., Dolan-Gavitt, B., and Garg, S. BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access, 7:47230–47244, 2019. ISSN 2169-3536. doi: 10.1109/ACCESS.2019.2909068.
  • He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], December 2015. URL http://arxiv.org/abs/1512.03385.
  • Hong et al. (2020) Hong, S., Chandrasekaran, V., Kaya, Y., Dumitraş, T., and Papernot, N. On the Effectiveness of Mitigating Data Poisoning Attacks with Gradient Shaping. arXiv:2002.11497 [cs], February 2020. URL http://arxiv.org/abs/2002.11497.
  • Huang et al. (2020) Huang, W. R., Geiping, J., Fowl, L., Taylor, G., and Goldstein, T. MetaPoison: Practical General-purpose Clean-label Data Poisoning. In Advances in Neural Information Processing Systems, volume 33, Vancouver, Canada, December 2020. URL https://proceedings.neurips.cc//paper_files/paper/2020/hash/8ce6fc704072e351679ac97d4a985574-Abstract.html.
  • Jaderberg et al. (2015) Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. Spatial Transformer Networks. In Advances in Neural Information Processing Systems 28, pp. 2017–2025. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5854-spatial-transformer-networks.pdf.
  • Jayaraman & Evans (2019) Jayaraman, B. and Evans, D. Evaluating Differentially Private Machine Learning in Practice. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1895–1912, 2019. ISBN 978-1-939133-06-9. URL https://www.usenix.org/conference/usenixsecurity19/presentation/jayaraman.
  • Koh et al. (2018) Koh, P. W., Steinhardt, J., and Liang, P. Stronger Data Poisoning Attacks Break Data Sanitization Defenses. arXiv:1811.00741 [cs, stat], November 2018. URL http://arxiv.org/abs/1811.00741.
  • Krizhevsky (2009) Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
  • Kumar et al. (2020) Kumar, R. S. S., Nyström, M., Lambert, J., Marshall, A., Goertzel, M., Comissoneru, A., Swann, M., and Xia, S. Adversarial Machine Learning-Industry Perspectives. In 2020 IEEE Security and Privacy Workshops (SPW), pp.  69–75, May 2020. doi: 10.1109/SPW50608.2020.00028.
  • Levine & Feizi (2021) Levine, A. and Feizi, S. Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YUGG2tFuPM.
  • Li et al. (2021a) Li, Y., Lyu, X., Koren, N., Lyu, L., Li, B., and Ma, X. Anti-Backdoor Learning: Training Clean Models on Poisoned Data. In Thirty-Fifth Conference on Neural Information Processing Systems, May 2021a. URL https://openreview.net/forum?id=cAw860ncLRW.
  • Li et al. (2021b) Li, Y., Wu, B., Jiang, Y., Li, Z., and Xia, S.-T. Backdoor Learning: A Survey. arXiv:2007.08745 [cs], February 2021b. URL http://arxiv.org/abs/2007.08745.
  • Liu et al. (2018) Liu, K., Dolan-Gavitt, B., and Garg, S. Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks. In Research in Attacks, Intrusions, and Defenses, Lecture Notes in Computer Science, pp.  273–294, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00470-5. doi: 10.1007/978-3-030-00470-5_13.
  • Ma et al. (2019) Ma, Y., Zhu, X., and Hsu, J. Data Poisoning against Differentially-Private Learners: Attacks and Defenses. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp.  4732–4738, Macao, China, August 2019. International Joint Conferences on Artificial Intelligence Organization. ISBN 978-0-9992411-4-1. doi: 10.24963/ijcai.2019/657.
  • Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083 [cs, stat], June 2017. URL http://arxiv.org/abs/1706.06083.
  • Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations, February 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop, Long Beach, CA, 2017. URL https://openreview.net/forum?id=BJJsrmfCZ.
  • Paudice et al. (2018) Paudice, A., Muñoz-González, L., Gyorgy, A., and Lupu, E. C. Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection. arXiv:1802.03041 [cs, stat], February 2018. URL http://arxiv.org/abs/1802.03041.
  • Peri et al. (2020) Peri, N., Gupta, N., Huang, W. R., Fowl, L., Zhu, C., Feizi, S., Goldstein, T., and Dickerson, J. P. Deep k-NN Defense Against Clean-Label Data Poisoning Attacks. In Computer Vision – ECCV 2020 Workshops, Lecture Notes in Computer Science, pp.  55–70, Cham, 2020. Springer International Publishing. ISBN 978-3-030-66415-2. doi: 10.1007/978-3-030-66415-2_4.
  • Radiya-Dixit & Tramer (2021) Radiya-Dixit, E. and Tramer, F. Data Poisoning Won’t Save You From Facial Recognition. In ICML 2021 Workshop on Adversarial Machine Learning, June 2021. URL https://openreview.net/forum?id=__sp5PEix2H.
  • Rao et al. (2020) Rao, S., Stutz, D., and Schiele, B. Adversarial Training against Location-Optimized Adversarial Patches. arXiv:2005.02313 [cs, stat], December 2020. URL http://arxiv.org/abs/2005.02313.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, December 2015. ISSN 1573-1405. doi: 10.1007/s11263-015-0816-y.
  • Saha et al. (2020) Saha, A., Subramanya, A., and Pirsiavash, H. Hidden Trigger Backdoor Attacks. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):11957–11965, April 2020. ISSN 2374-3468. doi: 10.1609/aaai.v34i07.6871.
  • Schwarzschild et al. (2020) Schwarzschild, A., Goldblum, M., Gupta, A., Dickerson, J. P., and Goldstein, T. Just How Toxic is Data Poisoning? A Unified Benchmark for Backdoor and Data Poisoning Attacks. arXiv:2006.12557 [cs, stat], June 2020. URL http://arxiv.org/abs/2006.12557.
  • Shafahi et al. (2018) Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., and Goldstein, T. Poison frogs! targeted clean-label poisoning attacks on neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp.  6106–6116, Red Hook, NY, USA, December 2018. Curran Associates Inc.
  • Sinha et al. (2018) Sinha, A., Namkoong, H., and Duchi, J. Certifying Some Distributional Robustness with Principled Adversarial Training. In International Conference on Learning Representations, February 2018. URL https://openreview.net/forum?id=Hk6kPgZA-.
  • Tan & Shokri (2020) Tan, T. J. L. and Shokri, R. Bypassing Backdoor Detection Algorithms in Deep Learning. In 2020 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 175–183, September 2020. doi: 10.1109/EuroSP48549.2020.00019.
  • Tao et al. (2021) Tao, L., Feng, L., Yi, J., Huang, S.-J., and Chen, S. Better Safe Than Sorry: Preventing Delusive Adversaries with Adversarial Training. In Thirty-Fifth Conference on Neural Information Processing Systems, May 2021. URL https://openreview.net/forum?id=I39u89067j.
  • Tramer et al. (2020) Tramer, F., Carlini, N., Brendel, W., and Madry, A. On Adaptive Attacks to Adversarial Example Defenses. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, December 2020. URL http://arxiv.org/abs/2002.08347.
  • Tran et al. (2018a) Tran, B., Li, J., and Madry, A. Spectral Signatures in Backdoor Attacks. In Advances in Neural Information Processing Systems 31, pp.  8000–8010. Curran Associates, Inc., 2018a. URL http://papers.nips.cc/paper/8024-spectral-signatures-in-backdoor-attacks.pdf.
  • Tran et al. (2018b) Tran, B., Li, J., and Madry, A. Spectral Signatures in Backdoor Attacks. arXiv:1811.00636 [cs, stat], November 2018b. URL http://arxiv.org/abs/1811.00636.
  • Turner et al. (2018) Turner, A., Tsipras, D., and Madry, A. Clean-Label Backdoor Attacks. openreview, September 2018. URL https://openreview.net/forum?id=HJg6e2CcK7.
  • Veldanda & Garg (2020) Veldanda, A. and Garg, S. On Evaluating Neural Network Backdoor Defenses. arXiv:2010.12186 [cs], October 2020. URL http://arxiv.org/abs/2010.12186.
  • Wang et al. (2019) Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., and Zhao, B. Y. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp.  707–723, San Francisco, CA, USA, May 2019. IEEE. ISBN 978-1-5386-6660-9. doi: 10.1109/SP.2019.00031.
  • Weber et al. (2020) Weber, M., Xu, X., Karlas, B., Zhang, C., and Li, B. RAB: Provable Robustness Against Backdoor Attacks. arXiv:2003.08904 [cs, stat], June 2020. URL http://arxiv.org/abs/2003.08904.
  • Xu et al. (2020) Xu, X., Wang, Q., Li, H., Borisov, N., Gunter, C. A., and Li, B. Detecting AI Trojans Using Meta Neural Analysis. arXiv:1910.03137 [cs], October 2020. URL http://arxiv.org/abs/1910.03137.
  • Yun et al. (2019) Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  6023–6032, 2019. URL https://openaccess.thecvf.com/content_ICCV_2019/html/Yun_CutMix_Regularization_Strategy_to_Train_Strong_Classifiers_With_Localizable_Features_ICCV_2019_paper.html.
  • Zhang et al. (2018) Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations, February 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.
  • Zhu et al. (2019) Zhu, C., Huang, W. R., Li, H., Taylor, G., Studer, C., and Goldstein, T. Transferable Clean-Label Poisoning Attacks on Deep Neural Nets. In International Conference on Machine Learning, pp. 7614–7623. PMLR, May 2019. URL http://proceedings.mlr.press/v97/zhu19a.html.

Appendix A Appendix

This appendix contains additional ablation studies, evaluation of the defense against $\ell^0$ threat models, technical details for each implemented attack and defense, and additional visualizations in 1D and 2D.

Appendix B Ablation Studies

This section contains ablation studies and additional information. Table 4 contains information on clean validation accuracy and runtimes for all experiments in Table 1.

Table 4: This is the same table as Table 1 in the main body, with additional information of natural validation accuracy and timing for each run. Best viewed on screen. All shown timings are wall-clock time for an NVIDIA GTX-2080ti with 4 assigned CPUs. Some timings are missing (due to machine heterogeneity), but can be inferred from other rows in the same block (i.e. all transfer experiments take roughly the same amount of time for the "Strongest Attack" defense).
| Attack | Scenario | Undefended: Poison Acc. | Undefended: Nat. Acc. | Undefended: Time | Defended (Same Attack): Poison Acc. | Defended (Same Attack): Nat. Acc. | Defended (Same Attack): Time | Defended (Strongest Attack): Poison Acc. | Defended (Strongest Attack): Nat. Acc. | Defended (Strongest Attack): Time |
|---|---|---|---|---|---|---|---|---|---|---|
| MetaPoison | From-scratch | 69.50% (±9.34) | 86.11% | 0:28:18 | 10.00% (±9.49) | 81.34% | 2:06:48 | 10.00% (±9.49) | 78.40% | 1:09:47 |
| Gradient Matching (CS) | From-scratch | 90.00% (±6.71) | 92.01% | 0:15:48 | 0.00% (±5.59) | 88.32% | 4:13:51 | 0.00% (±5.59) | 88.32% | 4:13:51 |
| Bullseye Polytope | Fine-tuning | 80.00% (±8.94) | 91.93% | 0:16:31 | 0.00% (±5.59) | 88.49% | 0:59:58 | 8.33% (±7.98) | 88.14% | 4:20:03 |
| Bullseye Polytope | Transfer | 100.00% (±5.59) | 91.97% | 0:08:04 | 10.00% (±6.71) | 88.64% | 0:39:53 | 0.00% (±5.59) | 90.34% | - |
| Poison Frogs | Transfer | 100.00% (±5.59) | 91.93% | 0:08:46 | 15.00% (±7.98) | 88.50% | 0:39:53 | 0.00% (±5.59) | 90.54% | - |
| Gradient Matching (SE) | Transfer | 95.00% (±4.87) | 92.02% | 0:08:08 | 0.00% (±5.59) | 87.68% | 0:41:08 | 5.00% (±4.87) | 90.62% | - |
| Hidden Trigger Backdoor | Transfer | 55.59% (±5.65) | 86.07% | - | 24.78% (±6.82) | 86.44% | - | 3.32% (±0.79) | 87.94% | 0:40:26 |
Table 5: Additional quantitative results for several attacks and their defense by poison immunity with $s=0.75$, showing avg. poison success with standard error (where all trials have equal outcomes, we report the worst-case error estimate $5.59\%$). Additional details about each attack threat model can be found in Section D.2.

| Attack | Scenario | Dataset | Model | Undefended | Defended (Same Attack) | Defended (Strongest Attack) |
|---|---|---|---|---|---|---|
| Gradient Matching | From-Scratch | GTSRB | ResNet18 | 40.00% (±15.49) | 10.00% (±9.49) | 10.00% (±9.49) |
| Bullseye Polytope | Transfer | GTSRB | ResNet18 | 10.00% (±9.49) | 0.00% (±5.59) | 0.00% (±5.59) |

Appendix C Visualizations

We repeat the three-dimensional visualizations shown in Figure 2 in the main work for additional attacks. For easy comparison, we also repeat the figures appearing in the main work. We show feature collisions via Poison Frogs in Figure 6 and repeat Bullseye Polytope in Figure 7. We then repeat gradient matching in Figure 8, in comparison to backdoor triggers in Figure 9 and gradient matching (SE) in Figure 10.

All three-dimensional visualizations are also shown in two dimensions, showing Poison Frogs in Figure 11, Bullseye Polytope in Figure 12, gradient matching in Figure 13, backdoor triggers in Figure 14 and gradient matching (SE) in Figure 15.

(a) Undefended model, clean (left) and fine-tuned on poisoned data containing feature collisions (right).
(b) Defended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
Figure 6: 3D Visualization of a feature collision attack (via Poison-Frogs) against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green.
(a) Undefended model, clean (left) and fine-tuned on poisoned data containing feature collisions (right).
(b) Defended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
Figure 7: 3D Visualization of the effects of a feature collision (Bullseye Polytope) attack against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green. Notably the strong collision seen in the baseline is inhibited by the defense.
(a) Undefended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
(b) Defended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
Figure 8: 3D Visualization of the effects of a gradient matching attack (Witches’ Brew) against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green.
(a) Undefended model, clean (left) and retrained on poisoned data containing backdoor patches (right).
(b) Defended model, clean (left) and retrained on poisoned data containing backdoor patches (right).
Figure 9: 3D Visualization of the effects of a backdoor trigger patch attack against an undefended and a defended model. The target trigger is applied to a number of target images shown in black and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green. Note how the black datapoints are associated with the poison class in the undefended case, but correctly associate with the target class in the defended case.
(a) Undefended model, clean (left) and fine-tuned on poisoned data containing gradient matching attacks (right).
(b) Defended model, clean (left) and fine-tuned on poisoned data containing gradient matching attacks (right).
Figure 10: 3D Visualization of the effects of a gradient matching attack (Witches’ Brew with squared loss) against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green.
We see that the attack effectively moves the decision boundary opposite of the target in feature space. However this is completely prevented in the defended model.
(a) Undefended model, clean (left) and fine-tuned on poisoned data containing feature collisions (right).
(b) Defended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
Figure 11: 2D Visualization of a feature collision attack (via Poison-Frogs) against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green.
(a) Undefended model, clean (left) and fine-tuned on poisoned data containing feature collisions (right).
(b) Defended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
Figure 12: 2D Visualization of the effects of a feature collision (Bullseye Polytope) attack against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green. Notably the strong collision seen in the baseline is inhibited by the defense.
(a) Undefended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
(b) Defended model, clean (left) and retrained on poisoned data containing gradient matching attacks (right).
Figure 13: 2D Visualization of the effects of a gradient matching attack (Witches’ Brew) against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green.
(a) Undefended model, clean (left) and retrained on poisoned data containing backdoor patches (right).
(b) Defended model, clean (left) and retrained on poisoned data containing backdoor patches (right).
Figure 14: 2D Visualization of the effects of a backdoor trigger patch attack against an undefended and a defended model. The target trigger is applied to a number of target images shown in black and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green. Note how the black datapoints are associated with the poison class in the undefended case, but correctly associate with the target class in the defended case.
(a) Undefended model, clean (left) and fine-tuned on poisoned data containing gradient matching attacks (right).
(b) Defended model, clean (left) and fine-tuned on poisoned data containing gradient matching attacks (right).
Figure 15: 2D Visualization of the effects of a gradient matching attack (Witches’ Brew with squared error) against an undefended and a defended model. The defended model significantly hinders feature collisions. The target image is marked by a black triangle and is originally part of the class marked in blue. The poisoned images are marked in red and are part of the class marked in green.
We see that the attack effectively moves the decision boundary by placing poisoned data opposite of the target in feature space. However this is completely prevented in the defended model.

Appendix D Experimental Setup

In general terms, the goal of our experimental setup is to standardize the experimental conditions encountered in various works in data poisoning to a degree that allows for convenient comparisons across attacks. Furthermore previous works have focused on showcasing the smallest possible adversarial modifications that still achieve a substantially malicious effect. However, such setups "on the edge" are broken too easily by any defenses, so that we generally consider stronger attacks in this work than in their original implementations. On the other hand, there is an upper limit to this design because attacks have to be sufficiently realistic, modifying only parts of the dataset within limits - unlimited adversarial modifications would allow for unlimited attack strength.

For all experiments shown in this work, we standardize the machine learning model that is attacked to a deep neural network for image classification, namely always a ResNet-18 model trained on the CIFAR-10 dataset. The ResNet-18 model follows He et al. (2015), with the customary CIFAR-10 modification of replacing the initial 7x7 convolution and max-pooling with a 3x3 convolution. We train this model using SGD with Nesterov momentum ($m=0.9$) with a batch size of $128$ for 40 epochs with an initial learning rate of $0.1$, which is reduced by a factor of 10 after $\frac{3}{8}$, $\frac{5}{8}$ and $\frac{7}{8}$ of all epochs. The model is additionally regularized by weight decay with weight $5\times 10^{-4}$. The CIFAR-10 dataset is augmented with horizontal flips and continuous random crops from images with a zero padding of 4.
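The step schedule described above can be made concrete with a short sketch. We assume here that each reduction happens at the epoch index obtained by rounding the fraction of total epochs down to an integer (the text does not specify the rounding), which gives epochs 15, 25 and 35 for a 40-epoch run. The function names are illustrative and not taken from the code submission.

```python
def lr_milestones(total_epochs, fractions=(3 / 8, 5 / 8, 7 / 8)):
    """Epoch indices at which the learning rate is reduced (assumed: rounded down)."""
    return [int(total_epochs * f) for f in fractions]


def learning_rate(epoch, total_epochs=40, base_lr=0.1, factor=0.1):
    """Learning rate in a given (0-indexed) epoch under the step schedule."""
    drops = sum(epoch >= m for m in lr_milestones(total_epochs))
    return base_lr * factor ** drops


print(lr_milestones(40))
print([learning_rate(e) for e in (0, 15, 25, 35)])
```

The same fractions are reused for the attack step size in Section D.2, with optimization steps in place of epochs.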

We directly train with these hyperparameters in the from-scratch setting. For the transfer experiments we first train a ResNet-18 with this setup, which we call the base model, and then freeze its feature representation. We then retrain the linear layer from a random initialization with the same hyperparameters. For the fine-tuning experiments we drop the learning rate by $0.001$ before fine-tuning from the base model, likewise reinitializing the linear layer, but not freezing the feature representation. This setup for fine-tuning and transfer is arguably easier to attack than a transfer to an unknown dataset (as investigated, for example, in Shafahi et al. (2018)), as all features are already optimized to be relevant to the given task, and as such we do not recommend it as the only evaluation of an attack, but believe it is an appropriate worst-case setting for the defense experiments considered in this work. Note that we apply a slightly different setting for the convex polytope attack of Zhu et al. (2019) which we will detail together with the attack in the next section and mark by transfer* in the main table.

D.1 Measuring poison effectiveness

We run data poisoning attacks with the goal of maliciously classifying a single target image for all targeted data poisoning attacks, and with the goal of maliciously classifying 1000 images patched with a single target trigger for backdoor attacks. In both cases these images are drawn from the validation set without replacement. We report success for an attack if the target image, or patched image, is classified with the adversarial label; we do not count misclassifications into a third label. In all experiments, we then report avg. poison success. This metric represents the average success over $N$ random trials, where each trial consists of a randomly drawn target trigger or image, a randomly chosen adversarial class, a randomly chosen subset of images to be poisoned (from the adversarial class) and a random model initialization. We control these trials by specifying their random seed. All experiments in this work are based on the same $20$ fixed trials, which we list by their seed within the supplemented code submission, and are as such comparable.
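As a small worked example of this metric, the sketch below computes avg. poison success together with a standard error. We assume the usual binomial estimate $\sqrt{p(1-p)/N}$ for the standard error; this is our reading rather than a stated definition, but it is consistent with table entries such as 90.00% (±6.71) for $N=20$. The trial outcomes are hypothetical.

```python
import math


def avg_poison_success(outcomes):
    """Fraction of trials in which the target received the adversarial label."""
    return sum(outcomes) / len(outcomes)


def standard_error(p, n):
    """Binomial standard error of a success rate p estimated from n trials (assumed form)."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical outcomes of 20 trials: 18 successes, 2 failures.
outcomes = [True] * 18 + [False] * 2
p = avg_poison_success(outcomes)
print(f"{100 * p:.2f}% (+/- {100 * standard_error(p, len(outcomes)):.2f})")
```

Note that this estimate is zero whenever all trials agree, which is why a worst-case estimate is reported in the tables for those cases.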

For all attacks we consider an $\ell^{\infty}$ bound of $\varepsilon=16$ for targeted data poisoning attacks.

D.2 Attack settings for $\ell^{\infty}$ threat models

For all attacks we optimize the adversarial perturbation through projected gradient descent (PGD). We found signed Adam with a step size of $0.1$ to be a robust first-order optimization tool to this end, which we run for 240 steps, reducing the step size by a factor of 10 after $\frac{3}{8}$, $\frac{5}{8}$ and $\frac{7}{8}$ of total steps. Basic data poisoning attacks are often brittle when encountering simple data augmentations (Schwarzschild et al., 2020). As we include data augmentations in our experimental setup, we also include them during the attack algorithm for all iterative attacks, sampling a random augmentation in every update step and differentiating through the resulting transformation via grid sampling (Jaderberg et al., 2015). Some works such as Geiping et al. (2021) consider multiple restarts of the attack algorithm; however, we consistently run all attacks with a single restart, mostly due to computational constraints, and given that restarts appear to confer only a minor benefit.
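The optimization loop described above can be sketched on a toy problem. The sketch assumes that "signed Adam" means Adam applied to the sign of the gradient, followed by a projection onto the $\ell^{\infty}$ ball of radius $\varepsilon$; this is one plausible reading and not necessarily the exact variant used in the experiments. The quadratic objective, the helper names and the small $\varepsilon$ in the usage lines are hypothetical, and data augmentation is omitted.

```python
import math


def step_size_at(step, total_steps, base=0.1):
    """Base step size, divided by 10 after 3/8, 5/8 and 7/8 of all steps."""
    drops = sum(step >= total_steps * f for f in (3 / 8, 5 / 8, 7 / 8))
    return base * 0.1 ** drops


def sign(value):
    return (value > 0) - (value < 0)


def signed_adam_pgd(grad_fn, dim, eps, steps=240, base_step=0.1,
                    beta1=0.9, beta2=0.999):
    """Minimize an objective over a perturbation delta with |delta_i| <= eps."""
    delta, m, v = [0.0] * dim, [0.0] * dim, [0.0] * dim
    for t in range(1, steps + 1):
        lr = step_size_at(t - 1, steps, base_step)
        # Assumption: Adam is fed the sign of the gradient.
        g = [float(sign(gi)) for gi in grad_fn(delta)]
        for i in range(dim):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            delta[i] -= lr * m_hat / (math.sqrt(v_hat) + 1e-8)
            # Projection onto the l-infinity ball.
            delta[i] = max(-eps, min(eps, delta[i]))
    return delta


# Toy objective: squared distance of delta to a goal that lies partly outside the ball.
goal = [5.0, -0.3, 0.02]
grad = lambda d: [2 * (di - gi) for di, gi in zip(d, goal)]
loss = lambda d: sum((di - gi) ** 2 for di, gi in zip(d, goal))
delta = signed_adam_pgd(grad, dim=3, eps=0.5)
print(delta, loss(delta))
```

In the toy run the first coordinate is pushed to the boundary of the ball and held there by the projection, which is the behavior the perturbation bound is meant to enforce.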

Poison Frogs.

We implement the feature collision objective as proposed in Shafahi et al. (2018). However, while perturbation bounds were weakly enforced in the original version by an additional penalty, we instead optimize the objective directly by projected (signed) gradient descent in line with other attacks. We find this to be at least equally effective.

Convex Polytope.

Poisoned data created by Convex Polytope can be brittle in terms of the amount of training data and optimizer settings (Schwarzschild et al., 2020). Therefore, to get an idea of how well our defense works against this attack, we implement a modified setting wherein the attack succeeds. This includes using a feature extractor trained on CIFAR-100, and training the last linear layer for only $10$ epochs using the Adam optimizer with a learning rate of $0.1$. We otherwise applied the attack as proposed in Zhu et al. (2019). This experiment is found in Table 6.

Table 6: Transfer* refers to the explicit setting of Zhu et al. (2019). The proposed defense significantly decreases success rates, even in this setting.

| Attack | Scenario | Undefended | Defended |
|---|---|---|---|
| Convex Polytope | Transfer* | 90.00% (±10.00) | 40.00% (±16.32) |

Bullseye Polytope.

We directly re-implement the attack based on eq.(2) in Aghakhani et al. (2020).

Witches’ Brew.

We implement gradient matching as in Geiping et al. (2021). However, we modify the original attack in the transfer setting. The original attack is posed for from-scratch attacks on large models and the objective of cosine similarity of parameter gradients does not scale well to small models. For small models (such as the transfer case, where only the last linear layer is retrained), we instead measure similarity in the squared Euclidean norm. We refer to this variant as gradient matching with squared error, listed as Gradient Matching (SE) in Table 1.
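The difference between the two similarity measures can be illustrated on flattened gradient vectors. The sketch below assumes the cosine objective takes the form $1-\cos(g_{\text{target}}, g_{\text{poison}})$ and the squared-error objective the form $\|g_{\text{target}}-g_{\text{poison}}\|^{2}$; the function names and example vectors are hypothetical.

```python
import math


def cosine_objective(g_target, g_poison):
    """1 - cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_target, g_poison))
    norm_t = math.sqrt(sum(a * a for a in g_target))
    norm_p = math.sqrt(sum(b * b for b in g_poison))
    return 1.0 - dot / (norm_t * norm_p)


def squared_error_objective(g_target, g_poison):
    """Squared Euclidean distance between two flattened gradient vectors."""
    return sum((a - b) ** 2 for a, b in zip(g_target, g_poison))


# Two gradients that point in the same direction but differ in magnitude.
g_target, g_poison = [1.0, 2.0], [2.0, 4.0]
print(cosine_objective(g_target, g_poison))
print(squared_error_objective(g_target, g_poison))
```

The cosine objective ignores gradient magnitude, whereas the squared error also penalizes a mismatch in scale.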

MetaPoison

We download premade poisoned datasets for MetaPoison from https://github.com/wronnyhuang/metapoison, validate their effectiveness and then deploy our proposed defense using our own surrogate attack at the batch level, which we unroll for $2$ steps as also proposed for the attacks in Huang et al. (2020). We defend using only the current estimate of the model as discussed in previous sections (instead of replicating the ensemble of 24 models used to create the poisoned dataset in some fashion). We train the usual ResNet-18 model on the downloaded poisoned CIFAR-10 for 40 epochs without data augmentations, conforming to the training setup without augmentations based on which the poisons were created. We download poisoned datasets for a budget of $1\%$ for the bird-dog setting with bird target ids $0$ to $9$ with perturbations bounded by $\varepsilon=8$ in $\ell^{\infty}$-norm and perturbed by $4\%$ in color space.

Hidden Trigger Backdoor

For the hidden trigger backdoor attack of Saha et al. (2020) we use triggers identical to the ones used in the original work (see https://github.com/UMBCvision/Hidden-Trigger-Backdoor-Attacks). The number of poisons we use is indicated in the main body experiments, and is on the same order as the number used in the original work. Specifically, we evaluate the attack on 1000 patched target images and choose a budget of $5\%$. The adversarial perturbations to the subset of poisoned data are optimized as described in the general attack settings, minimizing the hidden trigger objective of matching poison features to features of patched images.

D.3 Defense settings

Input Noise

As a sanity check we include a comparison to input noise. We draw random noise from the boundary of the set of allowed perturbations by independently sampling from a Bernoulli distribution for each value, and assigning either $-\varepsilon$ or $\varepsilon$ to each value.
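A minimal sketch of this sampling procedure follows. We assume a Bernoulli parameter of $1/2$, which the text does not state, and represent the perturbation as a flat list with one entry per pixel and channel of a $32\times 32$ RGB image.

```python
import random


def sample_input_noise(num_values, eps, rng=random):
    """Draw each entry independently as -eps or +eps (assumed: with probability 1/2 each)."""
    return [eps if rng.random() < 0.5 else -eps for _ in range(num_values)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
noise = sample_input_noise(32 * 32 * 3, eps=16, rng=rng)
print(noise[:8])
```

Every entry lies on the boundary of the $\ell^{\infty}$ ball, so the noise uses the full perturbation budget without being optimized against the model.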

CutMix and Maxup

We apply CutMix (Yun et al., 2019) as a defense against data poisoning as proposed in Borgnia et al. (2020). We attack this defense adaptively by creating poisoned data based on a clean model trained with CutMix as well. The same considerations apply for Maxup (Gong et al., 2020). We use Cutout (DeVries & Taylor, 2017) as a base augmentation for Maxup, and select the worst-case augmentation from four examples.

Spectral Signatures

We implement the defense as proposed in Tran et al. (2018b), using the provided overestimation factor of 1.5. We supply the attack budget as additional info for this defense.

Deep K-NN

We implement the defense as proposed in Peri et al. (2020), using the provided overestimation factor of 2. We supply the attack budget as additional info for this defense.

Activation Clustering

We run the defense of Chen et al. (2019), clustering the available training data into two clusters.

Differentially private SGD

We implement a variant of differentially private SGD with gradient clipping to a value of 1 on a mini-batch level (as suggested in Hong et al. (2020)), and varying levels of Gaussian noise applied to the mini-batch gradient. Attacks can adapt to this defense by adding gradient noise to their surrogate estimation of gradients (this is mostly relevant for gradient matching where surrogate gradients appear explicitly).
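The mini-batch level variant described above can be sketched as follows. We assume that clipping refers to rescaling the mini-batch gradient to an $\ell^{2}$ norm of at most the clipping value, and that the Gaussian noise is added independently per coordinate; the noise level in the usage lines is a placeholder and the function name is hypothetical.

```python
import math
import random


def privatize_gradient(grad, clip=1.0, noise_std=0.01, rng=random):
    """Clip a mini-batch gradient to L2 norm `clip`, then add Gaussian noise."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    return [g * scale + rng.gauss(0.0, noise_std) for g in grad]


rng = random.Random(0)
clipped_only = privatize_gradient([3.0, 4.0], clip=1.0, noise_std=0.0, rng=rng)
noisy = privatize_gradient([3.0, 4.0], clip=1.0, noise_std=0.01, rng=rng)
print(clipped_only, noisy)
```

Because clipping and noise act on the aggregated mini-batch gradient rather than on per-example gradients, this is a simplification of standard DP-SGD, in line with the "variant" described in the text.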

Adversarial Training

We implement straightforward adversarial training, starting from a randomly initialized perturbation and maximizing cross entropy for 5 steps via signed descent. Interestingly, for small $\varepsilon$ values, this defense can be overcome by the poisoner by creating poisoned data while adversarial noise is sampled and added during the poison optimization. This, however, is only helpful when the adversarial training $\varepsilon$ is smaller than the attack $\varepsilon$.

Poison Immunity

For all implementations of adversarial poisoning we replicate the original objective of the attack in the mini-batch setting, but optimize for only 5 steps, based on features or gradients from the current model. The surrogate attacks are optimized via signed Adam descent with the same parameters as described in the attack section. For the hidden trigger backdoor attack we draw random patches as surrogate targets, and then optimize the $\ell^{\infty}$ perturbations for 5 steps as usual.

D.4 Attack settings for $\ell^{0}$ threat models

For backdoor triggers, we allow triggers with a size of 4 by 4 pixels, i.e. a rectangular arrangement of the $\ell^{0}$ bound of $\varepsilon=16$. We allow a budget of $5\%$ of the dataset to be modified. We then imprint this patch in the lower right corner of all images in the poison set. During evaluation we imprint the same patch in the same location for 1000 target images. We first evaluate a "noisy checkerboard" patch, which is computed by sampling a Bernoulli variable for each patch pixel and RGB channel independently and assigning either $0$ or $255$. This patch is arguably not part of the distribution of CIFAR-10 images. As the second patch, we select a resized Firefox logo, which leads to a patch that appears semantically similar to CIFAR-10 content.
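The noisy checkerboard trigger and its placement can be sketched with plain nested lists. We assume a Bernoulli parameter of $1/2$ and represent an image as a height by width by channel list of integers in $[0, 255]$; the function names and the gray placeholder image are hypothetical.

```python
import random


def sample_checkerboard_patch(height=4, width=4, channels=3, rng=random):
    """Each pixel/channel value is 0 or 255, drawn independently (assumed: p = 1/2)."""
    return [[[255 if rng.random() < 0.5 else 0 for _ in range(channels)]
             for _ in range(width)] for _ in range(height)]


def imprint_lower_right(image, patch):
    """Return a copy of `image` (H x W x C lists) with `patch` in the lower right corner."""
    out = [[list(px) for px in row] for row in image]
    h, w = len(patch), len(patch[0])
    H, W = len(out), len(out[0])
    for i in range(h):
        for j in range(w):
            out[H - h + i][W - w + j] = list(patch[i][j])
    return out


rng = random.Random(0)
image = [[[128, 128, 128] for _ in range(32)] for _ in range(32)]  # placeholder image
patch = sample_checkerboard_patch(rng=rng)
poisoned = imprint_lower_right(image, patch)
print(poisoned[31][31], image[31][31])
```

The same patch is reused for every poisoned training image and for the patched target images at evaluation time, so it is sampled once and then applied repeatedly.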

We evaluate several possible defenses:

Poison Immunity - Noise Patch

We sample a random noisy checkerboard pattern (Bernoulli samples in each pixel and channel as above) with a random rectangular shape with lengths within $[3,12]$ for an approximate $\ell^{0}<45$ (overestimating the actual $\ell^{0}$ bound for a gray-box setting) as well as a random location. We sample such a patch for every class in the dataset and then apply them to randomly chosen pairs of classes, replicating the attack without knowing the targeted class.
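The random shape and location can be sampled as in the sketch below. We assume integer side lengths drawn uniformly and inclusively from $[3,12]$ and a location drawn uniformly such that the patch lies fully inside a $32\times 32$ image; neither detail is stated in the text, and the function name is hypothetical.

```python
import random


def sample_patch_geometry(image_size=32, min_len=3, max_len=12, rng=random):
    """Sample (top, left, height, width) of a rectangle that fits inside the image."""
    height = rng.randint(min_len, max_len)
    width = rng.randint(min_len, max_len)
    top = rng.randint(0, image_size - height)
    left = rng.randint(0, image_size - width)
    return top, left, height, width


rng = random.Random(0)
# One hypothetical patch geometry per CIFAR-10 class.
geometries = [sample_patch_geometry(rng=rng) for _ in range(10)]
print(geometries[0])
```

The large-patch variant described next only changes the bounds on the side lengths.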

Poison Immunity - Large Noise Patch

We repeat the previous setup with random noisy checkerboards sampled with lengths within $[8,28]$, but otherwise the same setup as above. Note that even these large patches improve upon the patches with lengths in $[3,12]$ only in minor ways for the semantic patch.

Poison Immunity - Optimized Patch

Previously we only sampled random patches during the defense. This mirrors the methodology of previous sections on targeted attacks, in the sense that a data poisoning attack is used to attack each mini-batch in the same way that it would be used by an attacker - and the backdoor trigger attack also samples these patches randomly as described above. However, this is arguably a non-optimal approximation of the objective described in Equation (2) in the main body and further we can conjecture a more optimal backdoor trigger attack that would optimize patch contents. As such we optimize the contents of a patch with random rectangular shape as described for the noise patch, while still drawing the location at random. However, we need to choose a surrogate objective on which to optimize the patch. For this task we choose gradient matching and as such optimize the patch so that the gradients of patched target data and patched poisoned data are optimally aligned. We apply the same procedure as otherwise in this work and optimize with signed Adam for 5 steps with step size $\tau=0.01$.

As we find in Table 2 though, this approach does not improve upon the noise patch sampling (and even performs worse). Possible reasons for this behavior are the choice of surrogate objective and optimization setting or the question of whether the patch location and shape should also be optimized, which would however further complicate the optimization.

Poison Immunity - Image Patch

In contrast to sampling a noise patch, we can also sample this patch from in-distribution data, i.e. CIFAR-10 content. To do so we sample these patches from other images in the mini-batch. We choose a larger, but fixed size of $16\times 16$ patches with a random location. To correct for the semantic impact of these patches on unrelated images we have to adjust the label information to match, so that finally, this option reduces exactly to CutMix (Yun et al., 2019), also proposed as a defense in Borgnia et al. (2021).
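The label adjustment can be illustrated with the standard CutMix rule, in which each class receives weight proportional to the image area it occupies. For a $16\times 16$ patch in a $32\times 32$ image this gives weights of $0.75$ and $0.25$. The function name is hypothetical, and the soft-label representation is an assumption about how the mixed label is encoded.

```python
def cutmix_label(label_a, label_b, patch_size=16, image_size=32, num_classes=10):
    """Soft label for an image of class `label_a` carrying a square patch from class `label_b`."""
    patch_fraction = (patch_size * patch_size) / (image_size * image_size)
    soft = [0.0] * num_classes
    soft[label_a] += 1.0 - patch_fraction
    soft[label_b] += patch_fraction
    return soft


print(cutmix_label(label_a=3, label_b=5))
```

If both images share a class, the two weights add up on the same entry and the label stays one-hot.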

Filtering Defenses

We apply activation clustering, spectral signatures and deep KNN as described for targeted attacks above.

Differentially private SGD

We apply differentially private SGD as described for targeted attacks above with a noise level of $0.01$ and a gradient clipping to $0.1$.

D.5 Computational Requirements

We run all reported experiments using Nvidia GeForce RTX 2080 Ti GPUs, using one GPU per experimental run, with runs scheduled from an internal SLURM system. We also ran select ablation pre-trials using an Nvidia V100 setup. Training a robust model in the (most expensive) from-scratch setting requires approx. $4$ hours and $10$ minutes for the evaluated ResNet-18 on CIFAR-10 for the RTX 2080 Ti. After training a clean model (for the adaptive attack), the attack takes approx. $5$ minutes. Finally the model is trained again on the poisoned data and its resilience against the poisoning is measured. In total this requires approx. $8$h of compute per experiment, of which we run 20 experiments for every data point with random poison-target class pairs and samples as described above. Training time reduces to approx. $2\times 1$h for the fine-tuning setting and $2\times 35$ minutes for the transfer setting.
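The resulting compute per reported data point follows from simple arithmetic, sketched below. The per-experiment figures are the approximate values quoted above; for the fine-tuning and transfer settings only the two training runs are counted, since the text gives no separate total for those settings.

```python
def total_gpu_minutes(minutes_per_experiment, num_experiments=20):
    """GPU time needed for one reported data point."""
    return minutes_per_experiment * num_experiments


def as_hours_minutes(minutes):
    """Split a duration in minutes into (hours, remaining minutes)."""
    return divmod(minutes, 60)


per_experiment = {
    "from-scratch": 8 * 60,  # approx. 8 h in total
    "fine-tuning": 2 * 60,   # approx. 2 x 1 h of training only
    "transfer": 2 * 35,      # approx. 2 x 35 min of training only
}
for setting, minutes in per_experiment.items():
    hours, rest = as_hours_minutes(total_gpu_minutes(minutes))
    print(f"{setting}: {hours} h {rest} min")
```

Under these figures a from-scratch data point costs roughly 160 GPU hours, which is the main reason restarts of the attack are avoided.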

D.6 Asset Licences

We use CIFAR-10 data (Krizhevsky, 2009) as found at https://www.cs.toronto.edu/~kriz/cifar.html. The code submission is based on forked repositories of the projects of Geiping et al. (2021) and Schwarzschild et al. (2020) as described in detail in the attached code submission folder.

D.7 Ethics Statement

Data poisoning attacks have the potential to disrupt machine learning pipelines and hinder data collection and curation. Especially when collected data is of unknown quality and possibly contaminated with poisoned samples, we hope that defense strategies like the one we propose can be useful in mitigating harmful effects.