This is the source code for the paper "ComFact: A Benchmark for Linking Contextual Commonsense Knowledge".
Start by creating a Python 3.6 virtual environment and installing the dependencies in requirements.txt.
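For example, on Linux/macOS (assuming a python3.6 binary is on your PATH):
python3.6 -m venv venv
source venv/bin/activate
pip install -r requirements.txt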
Our ComFact dataset can be downloaded from this link; please place data/ under this root directory.
Pretrained GloVe embeddings can be downloaded from this link; please place glove/ under the data/ directory and unzip glove.6B.zip inside it.
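A shell sketch of the placement and unzip step, assuming glove.6B.zip itself was downloaded into the current directory (adjust the paths to match your download):
mkdir -p data/glove
mv glove.6B.zip data/glove/
unzip data/glove/glove.6B.zip -d data/glove/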
Data portions:
- Persona-Atomic data portion: persona/
- Mutual-Atomic data portion: mutual/
- Roc-Atomic data portion: roc/
- Movie-Atomic data portion: movie/
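After the downloads above, the layout under the root directory should look roughly like this (a sketch; the four glove.6B.*d.txt files appear once the archive is unzipped):
data/
  persona/
  mutual/
  roc/
  movie/
  glove/
    glove.6B.zip
    glove.6B.50d.txt / 100d / 200d / 300d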
Data preprocessing:
python data_preprocessing_main.py
Prepare directory:
mkdir pred
mkdir runs
Training:
bash train_baseline.sh
Parameters:
- language model ${lm}: "deberta-large" | "deberta-base" | "roberta-large" | "roberta-base" | "bert-large" | "bert-base" | "distilbert-base" | "lstm"
- data portion ${portion}: "persona" | "mutual" | "roc" | "movie" | "all" (training on the union of all four data portions)
- context window ${window}: "nlg" (half window without future context) | "nlu" (full context window)
- linking task ${task}: "fact_full" (direct setting) | "head" (head entity linking, sub-task in pipeline setting) | "fact_cut" (fact linking of relevant head entities, sub-task in pipeline setting)
- evaluation set ${eval_set}: "val" (validation set) | "test" (testing set)
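A hypothetical invocation (whether train_baseline.sh reads these as environment variables, positional arguments, or variables set inside the script depends on the script itself, so check its header first):
lm=roberta-large portion=persona window=nlu task=fact_full eval_set=val bash train_baseline.sh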
Evaluating the direct setting or individual sub-tasks of the pipeline setting:
bash run_baseline.sh
Parameters: refer to Training.
Fine-grained analysis of fact linking results (after evaluating with run_baseline.sh):
python evaluate_linking.py --model ${lm} --window ${window} --portion ${portion} --linking ${task}
Parameters: refer to Training; ${task} must be "fact_full" | "fact_cut".
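For example, to analyze a roberta-large run on the persona portion in the direct setting:
python evaluate_linking.py --model roberta-large --window nlg --portion persona --linking fact_full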
Evaluating full pipeline setting:
bash run_baseline_pipeline.sh
Parameters: refer to Training.
Evaluating head entity linkers in fact linking:
bash run_baseline_head_linker.sh
Parameters: refer to Training.
Cross evaluation:
bash cross_evaluation.sh
Parameters:
- source data portion providing training set ${source_portion}: "persona" | "mutual" | "roc" | "movie" | "all"
- target data portion providing validation or testing set ${target_portion}: "persona" | "mutual" | "roc" | "movie" | "all"
Other parameters: refer to Training.
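A hypothetical invocation, hedged as in Training (check cross_evaluation.sh for how it actually reads the parameters):
source_portion=roc target_portion=persona lm=roberta-large window=nlg task=fact_full eval_set=test bash cross_evaluation.sh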
Plot a heatmap of the cross evaluation results (lm: roberta-large, window: nlg, task: fact_full):
python plot_cross_evaluation.py
Set up the NLG evaluation toolkit:
pip install git+https://github.com/Maluuba/nlg-eval.git@master
nlg-eval --setup
Download the CEM data from this link and place data/ under the CEM/ directory.
- Original preprocessed CEM data: ED/dataset_preproc.p
- Our preprocessed CEM data with ComFact-refined knowledge (also included): ED/dataset_preproc_link.p
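To verify the placement (a quick check, assuming the downloaded data/ lands at CEM/data/ with the ED/ folder inside):
python -c "import os; print(os.path.exists('CEM/data/ED/dataset_preproc.p'), os.path.exists('CEM/data/ED/dataset_preproc_link.p'))"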
Prepare directory:
mkdir CEM/saved
mkdir CEM/vectors
Copy glove.6B.zip from data/glove/ to the CEM/vectors/ directory.
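As a shell command:
cp data/glove/glove.6B.zip CEM/vectors/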
Training fact linker for CEM knowledge refinement:
python preprocessing_rel_tail_link_x.py
bash train_baseline_rel_tail_link_x.sh
Extracting CEM data and preprocessing for knowledge refinement:
python cem_data_extract.py
python preprocess_cem_link.py
The extracted data will be placed at data/cem/rel_tail/nlg/test/${split}_data.json, where ${split} is "train" | "val" | "test".
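Each split can be inspected with a quick one-liner that prints the container type and size (the record schema is whatever the two scripts above emit):
python -c "import json; d = json.load(open('data/cem/rel_tail/nlg/test/val_data.json')); print(type(d), len(d))"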
Knowledge refinement with the fact linker, i.e., labeling the relevance of the knowledge in the extracted CEM data:
bash run_baseline_cem_link_x.sh
python label_cem.py
Write the refined knowledge back into the CEM data format:
python cem_data_back.py
Switch to the CEM folder:
cd CEM
Training CEM dialogue model:
python main.py --model cem --dataset ${dataset} --save_path ${save} --model_path ${save} --cuda
Parameters:
- data source ${dataset}: dataset_preproc.p (original CEM dataset) | dataset_preproc_link.p (CEM dataset with ComFact-refined knowledge)
- ${save}: your directory for saving the model and results.
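For example, training on the ComFact-refined data (the saved/cem_link directory name here is only an illustration for ${save}):
python main.py --model cem --dataset dataset_preproc_link.p --save_path saved/cem_link --model_path saved/cem_link --cuda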
Testing CEM dialogue model:
python main.py --test --model cem --dataset ${dataset} --save_path ${save} --model_path ${save} --cuda
NLG Evaluation:
Move the obtained results.txt from your result-saving directory (${save}) to the results/ directory, rename it to ${name}.txt, and then run:
python src/scripts/evaluate.py --results ${name}
Parameters:
- ${name}: name of the results file, e.g., CEM_link
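For example, with ${name} set to CEM_link:
mv ${save}/results.txt results/CEM_link.txt
python src/scripts/evaluate.py --results CEM_link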
We include dialogue generation results under the results/ directory: CEM_ori.txt (from the original CEM) and CEM_link.txt (from CEM trained with ComFact-refined knowledge).