software-vulnerability-detection-imbalance

This project is the PyTorch implementation of the paper "An Empirical Study of the Imbalance Issue in Software Vulnerability Detection".

Project Overview

  1. Dataset
  2. Source code for CodeBERT
  3. Source code for GraphCodeBERT

Environment

 Python==3.7
 pytorch==1.7.1
 torchvision==0.8.2
 tree-sitter==0.20.1
 transformers==4.24.0
 tqdm
 numpy

Dataset

All datasets provide function-level source code. The data come from three open-source repositories:

CodeXGlue provides the devign dataset.

Devign provides the ffmpeg and qemu datasets.

Lin2018 provides the Asterisk, FFmpeg, LibPNG, LibTIFF, Pidgin, and VLC datasets.

Each dataset includes training, validation, and test splits (*_train.jsonl, *_valid.jsonl, *_test.jsonl).
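
As a quick sanity check of the imbalance ratio, a split can be inspected with a few lines of Python. This is only a minimal sketch: the label field name target (1 = vulnerable, 0 = non-vulnerable) and the example file path are assumptions, so adjust them to the actual jsonl schema and folder layout.

# inspect_imbalance.py -- minimal sketch; the "target" field name and the file
# path are assumptions, adjust them to the actual dataset files.
import json
from collections import Counter

def label_counts(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            counts[example["target"]] += 1  # assumed label field
    return counts

counts = label_counts("dataset/function-level/devign/qemu_train.jsonl")  # hypothetical path
print(counts, "imbalance ratio:", counts[0] / max(counts[1], 1))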

Run

For GraphCodeBERT, tree-sitter must first be built to parse code snippets and extract variable names. Build it with the following commands:

cd graphcoderbert/python_parser/parser_folder
bash build.sh
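
Once built, the compiled grammar can be loaded through the tree-sitter Python bindings, roughly as sketched below. The shared-library name my-languages.so and the c grammar are assumptions about what build.sh produces, not guaranteed by this repository.

# parse_sketch.py -- minimal sketch of loading the built parser; the shared-library
# name and the "c" grammar are assumptions about what build.sh produces.
from tree_sitter import Language, Parser

C_LANGUAGE = Language('graphcoderbert/python_parser/parser_folder/my-languages.so', 'c')  # assumed path
parser = Parser()
parser.set_language(C_LANGUAGE)

tree = parser.parse(b"int add(int a, int b) { return a + b; }")
print(tree.root_node.sexp())  # syntax tree used to locate identifiers/variables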

CodeBERT and GraphCodeBERT use the same commands for training and testing; CodeBERT is used as the example below.

Fine-tuning

python run.py \
    --do_train \
    --training standard \
    --data_root devign \
    --project_name qemu \
    --epochs 50 \
    --evaluate_during_training \
    --seed 123456

Validation

python run.py \
    --do_eval \
    --training standard \
    --data_root devign \
    --project_name qemu

Test

python run.py \
    --do_test \
    --training standard \
    --data_root devign \
    --project_name qemu

Parameter setting:

  • --training: the solution used to address the imbalance issue.
    • Choices:
      • standard: use the default setting of CodeBERT and GraphCodeBERT
      • weight: use the mean false error loss
      • cbl: use the class-balanced loss
      • augmentation: use the adversarial attack-based augmentation (the re-sampled data are already provided in the dataset folder; they can also be regenerated with dataset/function-level/identifyP/augment.py)
      • down: use random down-sampling
      • focal: use the focal loss (see the sketch after this list)
      • over: use random over-sampling (the re-sampled data are already provided in the dataset folder; they can also be regenerated with dataset/function-level/identifyP/augment_du.py)
      • threshold: use threshold-moving
  • --data_root: the source of the data
    • Choices: codexglue, devign, lin2018
  • --project_name: the name of the dataset
    • Choices: see the folder names in dataset/function-level/ for each source.