diff --git a/.gitignore b/.gitignore index d727055c..e6e1bd60 100644 --- a/.gitignore +++ b/.gitignore @@ -45,4 +45,5 @@ checkpoints/ # vim *.swp -.vscode \ No newline at end of file +.vscode +*.csv diff --git a/README.md b/README.md index 0ae9e2d1..c9530f8b 100644 --- a/README.md +++ b/README.md @@ -71,7 +71,11 @@ or a specific command using, for example, textattack attack --help ``` -The [`examples/`](examples/) folder includes scripts showing common TextAttack usage for training models, running attacks, and augmenting a CSV file. The [documentation website](https://textattack.readthedocs.io/en/latest) contains walkthroughs explaining basic usage of TextAttack, including building a custom transformation and a custom constraint.. +The [`examples/`](examples/) folder includes scripts showing common TextAttack usage for training models, running attacks, and augmenting a CSV file. + + +The [documentation website](https://textattack.readthedocs.io/en/latest) contains walkthroughs explaining basic usage of TextAttack, including building a custom transformation and a custom constraint.
+ ### Running Attacks: `textattack attack --help` @@ -88,7 +92,7 @@ textattack attack --recipe textfooler --model bert-base-uncased-mr --num-example *DeepWordBug on DistilBERT trained on the Quora Question Pairs paraphrase identification dataset*: ```bash -textattack attack --model distilbert-base-uncased-qqp --recipe deepwordbug --num-examples 100 +textattack attack --model distilbert-base-uncased-cola --recipe deepwordbug --num-examples 100 ``` *Beam search with beam width 4 and word embedding transformation and untargeted goal function on an LSTM*: @@ -323,7 +327,10 @@ For example, given the following as `examples.csv`: "it's a mystery how the movie could be released in this condition .", 0 ``` -The command `textattack augment --csv examples.csv --input-column text --recipe embedding --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original` +The command +``` +textattack augment --input-csv examples.csv --output-csv output.csv --input-column text --recipe embedding --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original +``` will augment the `text` column by altering 10% of each example's words, generating twice as many augmentations as original inputs, and exclude the original inputs from the output CSV. (All of this will be saved to `augment.csv` by default.) @@ -453,7 +460,7 @@ create a short file that loads them as variables `model` and `tokenizer`. The ` be able to transform string inputs to lists or tensors of IDs using a method called `encode()`. The model must take inputs via the `__call__` method. -##### Model from a file +##### Custom Model from a file To experiment with a model you've trained, you could create the following file and name it `my_model.py`: @@ -488,14 +495,12 @@ which maintains both a list of tokens and the original text, with punctuation. 
W -#### Dataset via Data Frames (*coming soon*) +#### Dataset loading via other mechanisms: see [here](https://textattack.readthedocs.io/en/latest/api/datasets.html) ### Attacks and how to design a new attack -The `attack_one` method in an `Attack` takes as input an `AttackedText`, and outputs either a `SuccessfulAttackResult` if it succeeds or a `FailedAttackResult` if it fails. - We formulate an attack as consisting of four components: a **goal function** which determines if the attack has succeeded, **constraints** defining which perturbations are valid, a **transformation** that generates potential modifications given an input, and a **search method** which traverses through the search space of possible perturbations. The attack attempts to perturb an input text such that the model output fulfills the goal function (i.e., indicating whether the attack is successful) and the perturbation adheres to the set of constraints (e.g., grammar constraint, semantic similarity constraint). A search method is used to find a sequence of transformations that produce a successful adversarial example. 
diff --git a/README_ZH.md b/README_ZH.md index ead97523..95fd7b79 100644 --- a/README_ZH.md +++ b/README_ZH.md @@ -82,7 +82,7 @@ textattack attack --recipe textfooler --model bert-base-uncased-mr --num-example *对 Quora 问句对数据集上训练的 DistilBERT 模型进行 DeepWordBug 攻击*: ```bash -textattack attack --model distilbert-base-uncased-qqp --recipe deepwordbug --num-examples 100 +textattack attack --model distilbert-base-uncased-cola --recipe deepwordbug --num-examples 100 ``` *对 MR 数据集上训练的 LSTM 模型:设置束搜索宽度为 4,使用词嵌入转换进行无目标攻击*: @@ -315,7 +315,7 @@ TextAttack 的组件中,有很多易用的数据增强工具。`textattack.Aug "it's a mystery how the movie could be released in this condition .", 0 ``` -使用命令 `textattack augment --csv examples.csv --input-column text --recipe embedding --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original` +使用命令 `textattack augment --input-csv examples.csv --output-csv output.csv --input-column text --recipe embedding --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original` 会增强 `text` 列,约束对样本中 10% 的词进行修改,生成输入数据两倍的样本,同时结果文件中不保存 csv 文件的原始输入。(默认所有结果将会保存在 `augment.csv` 文件中) 数据增强后,下面是 `augment.csv` 文件的内容: @@ -454,8 +454,6 @@ dataset = [('Today was....', 1), ('This movie is...', 0), ...] 
### 何为攻击 & 如何设计新的攻击 -`Attack` 中的 `attack_one` 方法以 `AttackedText` 对象作为输入,若攻击成功,返回 `SuccessfulAttackResult`,若攻击失败,返回 `FailedAttackResult`。 - 我们将攻击划分并定义为四个组成部分:**目标函数** 定义怎样的攻击是一次成功的攻击,**约束条件** 定义怎样的扰动是可行的,**变换规则** 对输入文本生成一系列可行的扰动结果,**搜索方法** 在搜索空间中遍历所有可行的扰动结果。每一次攻击都尝试对输入的文本添加扰动,使其通过目标函数(即判断攻击是否成功),并且扰动要符合约束(如语法约束,语义相似性约束)。最后用搜索方法在所有可行的变换结果中,挑选出优质的对抗样本。 diff --git a/docs/0_get_started/command_line_usage.md b/docs/0_get_started/command_line_usage.md index 21fadd3c..e35af9e8 100644 --- a/docs/0_get_started/command_line_usage.md +++ b/docs/0_get_started/command_line_usage.md @@ -40,7 +40,7 @@ For example, given the following as `examples.csv`: The command: ``` -textattack augment --csv examples.csv --input-column text --recipe eda --pct-words-to-swap .1 \ +textattack augment --input-csv examples.csv --output-csv output.csv --input-column text --recipe eda --pct-words-to-swap .1 \ --transformations-per-example 2 --exclude-original ``` will augment the `text` column with 10% of words edited per augmentation, twice as many augmentations as original inputs, and exclude the original inputs from the diff --git a/docs/1start/attacks4Components.md b/docs/1start/attacks4Components.md index 0848d101..54030650 100644 --- a/docs/1start/attacks4Components.md +++ b/docs/1start/attacks4Components.md @@ -12,8 +12,16 @@ This modular design enables us to easily assemble attacks from the literature wh ![two-categorized-attacks](/_static/imgs/intro/01-categorized-attacks.png) - - +- You can create one new attack (in one line of code!!!) from composing members of four components we proposed, for instance: + +```bash +# Shows how to build an attack from components and use it on a pre-trained model on the Yelp dataset. 
+textattack attack --attack-n --model bert-base-uncased-yelp --num-examples 8 \ --goal-function untargeted-classification \ --transformation word-swap-wordnet \ --constraints edit-distance^12 max-words-perturbed^max_percent=0.75 repeat stopword \ --search greedy +``` ### Goal Functions diff --git a/docs/1start/multilingual-visualization.md b/docs/1start/multilingual-visualization.md index 9d04e235..e9400371 100644 --- a/docs/1start/multilingual-visualization.md +++ b/docs/1start/multilingual-visualization.md @@ -3,14 +3,66 @@ TextAttack Extended Functions (Multilingual) +## TextAttack Supports Multiple Model Types besides huggingface models and our TextAttack models: + +- Example attacking TensorFlow models @ [https://textattack.readthedocs.io/en/latest/2notebook/Example_0_tensorflow.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_0_tensorflow.html) +- Example attacking scikit-learn models @ [https://textattack.readthedocs.io/en/latest/2notebook/Example_1_sklearn.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_1_sklearn.html) +- Example attacking AllenNLP models @ [https://textattack.readthedocs.io/en/latest/2notebook/Example_2_allennlp.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_2_allennlp.html) +- Example attacking Keras models @ [https://textattack.readthedocs.io/en/latest/2notebook/Example_3_Keras.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_3_Keras.html) + + ## Multilingual Supports -- see example code: [https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py](https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py) for using our framework to attack French-BERT. -- see tutorial notebook: [https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html) for using our framework to attack French-BERT. 
+- see tutorial notebook for using our framework to attack French-BERT: [https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html](https://textattack.readthedocs.io/en/latest/2notebook/Example_4_CamemBERT.html) + +- see example code for using our framework to attack French-BERT: [https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py](https://github.com/QData/TextAttack/blob/master/examples/attack/attack_camembert.py). + + + +## User defined custom inputs and models + + +### Custom Datasets: Dataset from a file + +Loading a dataset from a file is very similar to loading a model from a file. A 'dataset' is any iterable of `(input, output)` pairs. +The following example would load a sentiment classification dataset from file `my_dataset.py`: + +```python +dataset = [('Today was....', 1), ('This movie is...', 0), ...] +``` + +You can then run attacks on samples from this dataset by adding the argument `--dataset-from-file my_dataset.py`. + + +### Custom Model: from a file +To experiment with a model you've trained, you could create the following file +and name it `my_model.py`: + +```python +model = load_your_model_with_custom_code() # replace this line with your model loading code +tokenizer = load_your_tokenizer_with_custom_code() # replace this line with your tokenizer loading code +``` + +Then, run an attack with the argument `--model-from-file my_model.py`. The model and tokenizer will be loaded automatically. + + + +## User defined Custom attack components + +The [documentation website](https://textattack.readthedocs.io/en/latest) contains walkthroughs explaining basic usage of TextAttack, including building a custom transformation and a custom constraint.
+ - custom transformation example @ [https://textattack.readthedocs.io/en/latest/2notebook/1_Introduction_and_Transformations.html](https://textattack.readthedocs.io/en/latest/2notebook/1_Introduction_and_Transformations.html) + +- custom constraint example @ [https://textattack.readthedocs.io/en/latest/2notebook/2_Constraints.html#A-custom-constraint](https://textattack.readthedocs.io/en/latest/2notebook/2_Constraints.html#A-custom-constraint) + + + + +## Visualizing TextAttack generated Examples +- You can visualize the generated adversarial examples by following the visualization approaches we provide here: [https://textattack.readthedocs.io/en/latest/2notebook/2_Constraints.html](https://textattack.readthedocs.io/en/latest/2notebook/2_Constraints.html) -## We have built a new WebDemo For Visulizing TextAttack generated Examples; +- If you have a webapp, we have also built a new WebDemo [TextAttack-WebDemo Github](https://github.com/QData/TextAttack-WebDemo) for visualizing adversarial examples generated by TextAttack.
-- [TextAttack-WebDemo Github](https://github.com/QData/TextAttack-WebDemo) \ No newline at end of file diff --git a/docs/3recipes/attack_recipes_cmd.md b/docs/3recipes/attack_recipes_cmd.md index c5a57b7a..5a5d451f 100644 --- a/docs/3recipes/attack_recipes_cmd.md +++ b/docs/3recipes/attack_recipes_cmd.md @@ -36,7 +36,7 @@ textattack attack --recipe textfooler --model bert-base-uncased-mr --num-example *DeepWordBug on DistilBERT trained on the Quora Question Pairs paraphrase identification dataset*: ```bash -textattack attack --model distilbert-base-uncased-qqp --recipe deepwordbug --num-examples 100 +textattack attack --model distilbert-base-uncased-cola --recipe deepwordbug --num-examples 100 ``` *Beam search with beam width 4 and word embedding transformation and untargeted goal function on an LSTM*: diff --git a/docs/3recipes/augmenter_recipes_cmd.md b/docs/3recipes/augmenter_recipes_cmd.md index 39c91b9f..c1d49614 100644 --- a/docs/3recipes/augmenter_recipes_cmd.md +++ b/docs/3recipes/augmenter_recipes_cmd.md @@ -38,7 +38,10 @@ and the number of augmentations per input example. It outputs a CSV in the same "it's a mystery how the movie could be released in this condition .", 0 ``` -The command `textattack augment --csv examples.csv --input-column text --recipe embedding --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original` +The command +``` +textattack augment --input-csv examples.csv --output-csv output.csv --input-column text --recipe embedding --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original +``` will augment the `text` column by altering 10% of each example's words, generating twice as many augmentations as original inputs, and exclude the original inputs from the output CSV. (All of this will be saved to `augment.csv` by default.) 
diff --git a/examples/attack/attack_camembert.py b/examples/attack/attack_camembert.py index 16d5d39e..f50fde38 100644 --- a/examples/attack/attack_camembert.py +++ b/examples/attack/attack_camembert.py @@ -4,6 +4,7 @@ import numpy as np from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline +from textattack import Attacker from textattack.attack_recipes import PWWSRen2019 from textattack.datasets import HuggingFaceDataset from textattack.models.wrappers import ModelWrapper @@ -20,11 +21,11 @@ class HuggingFaceSentimentAnalysisPipelineWrapper(ModelWrapper): [[0.218262017, 0.7817379832267761] """ - def __init__(self, pipeline): - self.pipeline = pipeline + def __init__(self, model): + self.model = model def __call__(self, text_inputs): - raw_outputs = self.pipeline(text_inputs) + raw_outputs = self.model(text_inputs) outputs = [] for output in raw_outputs: score = output["score"] @@ -55,7 +56,6 @@ def __call__(self, text_inputs): recipe.transformation.language = "fra" dataset = HuggingFaceDataset("allocine", split="test") -for idx, result in enumerate(recipe.attack_dataset(dataset)): - print(("-" * 20), f"Result {idx+1}", ("-" * 20)) - print(result.__str__(color_method="ansi")) - print() + +attacker = Attacker(recipe, dataset) +results = attacker.attack_dataset() diff --git a/examples/attack/attack_from_components.sh b/examples/attack/attack_from_components.sh index 5bbcbdd9..79544ff3 100755 --- a/examples/attack/attack_from_components.sh +++ b/examples/attack/attack_from_components.sh @@ -3,5 +3,5 @@ # model on the Yelp dataset. 
textattack attack --attack-n --goal-function untargeted-classification \ --model bert-base-uncased-yelp --num-examples 8 --transformation word-swap-wordnet \ - --constraints edit-distance^12 max-words-perturbed:max_percent=0.75 repeat stopword \ + --constraints edit-distance^12 max-words-perturbed^max_percent=0.75 repeat stopword \ --search greedy \ No newline at end of file diff --git a/examples/augmentation/augment.csv b/examples/augmentation/augment.csv index f970689f..670ec674 100644 --- a/examples/augmentation/augment.csv +++ b/examples/augmentation/augment.csv @@ -1,11 +1,11 @@ text,label -"the rock is destined to be the 21st century's novel conan and that he's go to make a splash yet greater than arnold schwarzenegger , jean- claud van damme or steven segal.",1 -"the rock is destined to be the 21st century's novo conan and that he's going to make a splash yet greater than arnold schwarzenegger , jean- claud van damme or stephens segal.",1 -the gorgeously elaborate continuation of 'the lord of the rings' triad is so massive that a column of words cannot adequately describe co-writer/director pete jackson's expanded vision of j . r . r . tolkien's middle-earth .,1 -the gorgeously elaborate continuation of 'the lordy of the rings' trilogy is so huge that a column of words cannot adequately describe co-writer/superintendent peter jackson's enlargements vision of j . r . r . tolkien's middle-earth .,1 -take care of my cat offers a cheerfully different slice of asian cinema .,1 -take care of my cat offers a refreshingly different slice of asian cinemas .,1 -a technically well-made suspenser . . . but its abrupt fall in iq points as it races to the finish line demonstrating simply too discouraging to let slide .,0 -a technologically well-made suspenser . . . 
but its abrupt dip in iq points as it races to the finish line proves simply too discouraging to let slide .,0 -it's a mystery how the cinematography could be released in this condition .,0 -it's a mystery how the movies could be released in this condition .,0 +"the rock is destined to be the new conan and that he's going to make a splash even greater than arnold , jean- claud van damme or steven segal.",1 +"the rock is destined to be the 21st century's new conan and that he's going to caravan make a splash even greater than arnold schwarzenegger , jean- claud van damme or steven segal.",1 +the gorgeously rarify continuation of 'the lord of the rings' trilogy is so huge that a column of give-and-take cannot adequately describe co-writer/director shaft jackson's expanded vision of j . r . r . tolkien's middle-earth .,1 +the gorgeously elaborate of 'the of the rings' trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded of j . r . r . tolkien's middle-earth .,1 +take care different my cat offers a refreshingly of slice of asian cinema .,1 +take care of my cat offers a different slice of asian cinema .,1 +a technically well-made suspenser . . . but its abrupt drop in iq points as it races to the finish IT line proves simply too discouraging to let slide .,0 +a technically well-made suspenser . . . 
but its abrupt drop in iq points as it races to the finish demarcation proves plainly too discouraging to let slide .,0 +it's pic a mystery how the movie could be released in this condition .,0 +it's a mystery how the movie could in released be this condition .,0 diff --git a/examples/augmentation/augment.sh b/examples/augmentation/augment.sh index 0cc0c524..3b3d3ee3 100755 --- a/examples/augmentation/augment.sh +++ b/examples/augmentation/augment.sh @@ -1,2 +1,2 @@ #!/bin/bash -textattack augment --csv examples.csv --input-column text --recipe eda --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original --overwrite +textattack augment --input-csv examples.csv --output-csv output.csv --input-column text --recipe eda --pct-words-to-swap .1 --transformations-per-example 2 --exclude-original --overwrite diff --git a/examples/augmentation/example.csv b/examples/augmentation/example.csv deleted file mode 100644 index d408b7f9..00000000 --- a/examples/augmentation/example.csv +++ /dev/null @@ -1,2 +0,0 @@ -"text",label -"it's a mystery how the movie could be released in this condition .", 0 diff --git a/examples/train/train_lstm_imdb_sentiment_classification.sh b/examples/train/train_lstm_imdb_sentiment_classification.sh new file mode 100755 index 00000000..f65eeeca --- /dev/null +++ b/examples/train/train_lstm_imdb_sentiment_classification.sh @@ -0,0 +1,4 @@ +#!/bin/bash +# Trains an LSTM on the IMDB dataset for 50 epochs. This is a basic +# demonstration of our training script and `datasets` integration. 
+textattack train --model-name-or-path lstm --dataset imdb --epochs 50 --learning-rate 1e-5 \ No newline at end of file diff --git a/examples/train/train_lstm_rotten_tomatoes_sentiment_classification.sh b/examples/train/train_lstm_rotten_tomatoes_sentiment_classification.sh index 5bf74fd9..19d2f9b8 100755 --- a/examples/train/train_lstm_rotten_tomatoes_sentiment_classification.sh +++ b/examples/train/train_lstm_rotten_tomatoes_sentiment_classification.sh @@ -1,4 +1,4 @@ #!/bin/bash # Trains `bert-base-cased` on the STS-B task for 3 epochs. This is a basic # demonstration of our training script and `datasets` integration. -textattack train --model-name-or-path lstm --dataset rotten_romatoes --epochs 50 --learning-rate 1e-5 \ No newline at end of file +textattack train --model-name-or-path lstm --dataset rotten_tomatoes --epochs 50 --learning-rate 1e-5 \ No newline at end of file diff --git a/textattack/training_args.py b/textattack/training_args.py index 4f21b8a0..14b23327 100644 --- a/textattack/training_args.py +++ b/textattack/training_args.py @@ -360,6 +360,7 @@ def _add_parser_args(cls, parser): # Arguments that are needed if we want to create a model to train. parser.add_argument( "--model-name-or-path", + "--model", type=str, required=True, help='Name or path of the model we want to create. "lstm" and "cnn" will create TextAttack\'s LSTM and CNN models while'