There seems to be an issue with a few of the adversaries. For example, a claimed adversary from mr_bert.txt is:
orig sent (0): to portray modern women the way director davis has done is just unthinkable
adv sent (1): to portray modern women the way director davis has done is just imaginable
"unthinkable" and "imaginable" are antonyms that erroneously have high cosine similarity, which incorrectly suggests they are synonyms. I suggest such examples should not be counted when evaluating the attack success rate, since human evaluation would clearly label the adversarial sentence as positive (1) rather than negative.
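To make the failure mode concrete, here is a minimal sketch of the cosine-similarity check that typically gates word substitutions in these attacks. The vectors below are toy values invented for illustration (the attack presumably uses pretrained word embeddings and some similarity threshold, both assumed here), but they show how two antonyms can still clear a synonym threshold:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity: dot product normalized by the vector magnitudes.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embedding vectors for illustration only -- not the
# actual embeddings used by the attack.
unthinkable = np.array([0.9, 0.1, 0.3])
imaginable = np.array([0.8, 0.2, 0.35])

sim = cosine_similarity(unthinkable, imaginable)
# Antonyms sharing most of their context can score above a typical
# synonym threshold (e.g. 0.7), so "imaginable" would be accepted
# as a replacement even though it flips the sentiment.
print(sim)
```

Because cosine similarity over distributional embeddings measures contextual relatedness rather than sameness of meaning, antonyms (which occur in near-identical contexts) routinely score high, which is exactly the problem in the example above.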
The reported ~13% after-attack accuracy counts such examples as successes, which they actually are not. I think a human-evaluation filter should ultimately govern the after-attack accuracy. Please correct me if I am wrong. Thanks.