# fake-text-detection

Fake Text Detection Toy Project in ADS5035 (Data-driven Security and Privacy)
This is experiment code on WebText and the gpt2-output-dataset.
The functions in dataset.py and util.py are forked from gpt2-output-detector.

## Fine-tuned language model classifier

| Model   | Train   | Top-k 40 | Nucleus | Random |
|---------|---------|----------|---------|--------|
| BERT    | Top-k   | 89.79%   | 72.22%  | 43.79% |
| BERT    | Nucleus | 82.68%   | 78.84%  | 64.23% |
| BERT    | Random  | 47.3%    | 53.9%   | 80.45% |
| RoBERTa | Top-k   | 98.35%   | 69.47%  | 49.22% |
| RoBERTa | Nucleus | 90.84%   | 88.36%  | 75.43% |
| RoBERTa | Random  | 51.17%   | 58.75%  | 91.34% |
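Each cell above is the accuracy of a detector trained on fake text from one sampling strategy and evaluated against another. A minimal sketch of that cross-evaluation loop (the `predict` callable and the example format are hypothetical illustrations, not code from this repo):

```python
def accuracy(predict, examples):
    """Fraction of (text, label) pairs the detector classifies correctly."""
    correct = sum(predict(text) == label for text, label in examples)
    return correct / len(examples)


def cross_eval(predict, test_sets):
    """Evaluate one trained detector on each sampling strategy's test set.

    test_sets maps a strategy name ("Top-k 40", "Nucleus", "Random")
    to a list of (text, label) pairs, where label 1 = fake, 0 = real.
    """
    return {name: accuracy(predict, examples)
            for name, examples in test_sets.items()}
```

Running `cross_eval` once per trained model fills in one row of the table.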

## Token Probability-based Classifier

### Baseline Usage

Before running this code, construct the dataset files data/webtext.{train,dev,test}.jsonl and data/xl-1542M-{k40,nucleus}.{train,dev,test}.jsonl with this format.
Then you can run:

```shell
python baseline.py \
  --max-epochs=2 \
  --batch-size=32 \
  --max-sequence-length=128 \
  --data-dir='data' \
  --real-dataset='webtext' \
  --fake-dataset='xl-1542M-nucleus' \
  --save-dir='logs' \
  --learning-rate=2e-5 \
  --weight-decay=0 \
  --model-name='bert-base-cased' \
  --wandb=True
```
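For reference, each dataset file is JSON Lines with one sample per line. A minimal sketch of writing and reading that layout (the field names here mirror the gpt2-output-dataset release and should be checked against your download):

```python
import json
import os

os.makedirs("data", exist_ok=True)

# One JSON object per line; "text" holds the article the classifier reads.
samples = [
    {"id": 0, "ended": True, "length": 7, "text": "An example WebText article."},
]

with open("data/webtext.train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Reading it back: one detector example per line.
with open("data/webtext.train.jsonl") as f:
    texts = [json.loads(line)["text"] for line in f]
```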

### Probability/Rank Extractor Usage

Extract the probability and rank of each token with 16 worker threads. `--num-train-pairs 50000` means "Real:Fake = 50,000:50,000".

```shell
python prob_extract.py \
  --batch-size=32 \
  --max-sequence-length=128 \
  --seed 10 \
  --num-workers 16 \
  --num-train-pairs 50000
```
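Conceptually, the extractor records, for every observed token, its probability and rank under the language model's next-token distribution. A toy sketch of that computation over raw logits (pure Python standing in for the actual GPT-2 forward pass):

```python
import math


def token_prob_and_rank(logits, token_id):
    """Probability and rank of the observed token under next-token logits."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]  # softmax numerators
    prob = exps[token_id] / sum(exps)
    # Rank 1 = the model's single most likely token.
    rank = 1 + sum(1 for l in logits if l > logits[token_id])
    return prob, rank


# Toy 4-token vocabulary: the observed token (id 0) is the model's top choice,
# so its rank is 1 and its probability is high.
prob, rank = token_prob_and_rank([2.0, 0.5, -1.0, 0.0], token_id=0)
```

These per-token (probability, rank) pairs are the features the token probability-based classifier is trained on.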
