
HeterSUMGraph (extractive summarization)

This repository presents and compares HeterSUMGraph and variants using GATConv, GATv2Conv and a combination of HeterSUMGraph and SummaRuNNer (using HeterSUMGraph as a sentence encoder).

The datasets are CNN-DailyMail and NYT50.

Paper: HeterSUMGraph ("Heterogeneous Graph Neural Networks for Extractive Document Summarization", Wang et al., 2020)

Clone project

git clone https://github.com/Baragouine/HeterSUMGraph.git

Enter into the directory

cd HeterSUMGraph

Create the environment

conda create --name HeterSUMGraph python=3.9

Activate the environment

conda activate HeterSUMGraph

Install dependencies

pip install -r requirements.txt

Install nltk data

To install nltk data:

  • Open a Python console.
  • Type import nltk; nltk.download().
  • Download all data.
  • Close the Python console.
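The download can also be scripted instead of using the GUI. This is a sketch assuming network access; the package names shown are illustrative (the steps above simply say to download all data):

```python
import nltk

def download_nltk_data(packages=("punkt", "stopwords")):
    """Fetch the given NLTK data packages without opening the GUI downloader."""
    for pkg in packages:
        nltk.download(pkg, quiet=True)

# Usage: download_nltk_data()  # or nltk.download("all") to mirror the GUI step
```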

Convert the NYT zip to NYT50 JSON and preprocess it

Here, preprocessing means cleaning, labeling, etc., not the preprocessing applied at training time.

  • Download the raw NYT zip from https://catalog.ldc.upenn.edu/LDC2008T19 to data/.
  • Run 00-00-convert_nyt_to_json.ipynb (convert the zip to JSON).
  • Run 00-01-nyt_filter_short_summaries.ipynb (keep only documents whose summaries contain at least 50 distinct words).
  • Run 00-02-compute_nyt_labels.ipynb (compute labels).
  • Run python scripts/compute_tfidf_dataset.py -input data/nyt_corpus_LDC2008T19_50.json -output data/nyt50_dataset_tfidf.json -docs_col_name docs (compute TF-IDF over the whole dataset).
  • Run python scripts/compute_tfidf_sent_dataset.py -input data/nyt_corpus_LDC2008T19_50.json -output data/compute_tfidf_sent_dataset.json -docs_col_name docs (compute TF-IDF for each document).
  • Run 00-03-split_NYT50.ipynb (split NYT50 into train, val, test).
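The label-computation step (00-02-compute_nyt_labels.ipynb) is not detailed here. A common approach for building extractive labels is to greedily mark the sentences that best cover the reference summary; the sketch below uses that assumption, with plain word overlap standing in for a ROUGE-based gain (the notebook's exact criterion may differ):

```python
def greedy_labels(sentences, summary, max_selected=3):
    """Greedily label sentences 1/0 by how many still-uncovered summary
    words each one contributes (a stand-in for greedy ROUGE selection)."""
    summary_words = set(summary.lower().split())
    labels = [0] * len(sentences)
    covered = set()
    for _ in range(max_selected):
        best, best_gain = None, 0
        for i, sent in enumerate(sentences):
            if labels[i]:
                continue
            gain = len((set(sent.lower().split()) & summary_words) - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no sentence adds coverage; stop early
            break
        labels[best] = 1
        covered |= set(sentences[best].lower().split()) & summary_words
    return labels
```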

Computing TF-IDF is only necessary for HeterSUMGraph-based models.
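As a rough sketch of what the compute_tfidf_dataset.py step does, this fits a single TF-IDF model over the whole corpus (assuming a scikit-learn-style vectorizer; the script's actual implementation may differ, and the per-sentence script instead computes TF-IDF per document):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_dataset_tfidf(docs):
    """Fit one TF-IDF model over the whole corpus and return the
    document-term matrix plus the fitted vocabulary."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs)
    return matrix, vectorizer.vocabulary_
```

In HeterSUMGraph these TF-IDF values weight the word-sentence edges of the heterogeneous graph.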

CNN-DailyMail preprocessing

Here, preprocessing means cleaning, labeling, etc., not the preprocessing applied at training time.

  • Follow the CNN-DailyMail preprocessing instructions at: https://github.com/Baragouine/SummaRuNNer/tree/master.
  • After the labels are computed, run 00-03-merge_cnn_dailymail.ipynb to merge CNN-DailyMail into one JSON file.
  • Run python scripts/compute_tfidf_dataset.py -input data/cnn_dailymail.json -output data/cnn_dailymail_dataset_tfidf.json -docs_col_name article (compute TF-IDF over the whole dataset).
  • Run python scripts/compute_tfidf_sent_dataset.py -input data/cnn_dailymail.json -output data/cnn_dailymail_sent_tfidf.json -docs_col_name article (compute TF-IDF for each document).

Computing TF-IDF is only necessary for HeterSUMGraph-based models.

Embeddings

Training requires the 300-dimensional GloVe embeddings, which must be located at the following path: data/glove.6B/glove.6B.300d.txt
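Loading those embeddings can be sketched as below (a minimal loader for the GloVe text format, where each line is a token followed by its vector; the training notebooks may use their own loader):

```python
import numpy as np

def load_glove(path):
    """Read a GloVe .txt file into a dict of token -> float32 vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            embeddings[token] = np.asarray(values, dtype=np.float32)
    return embeddings

# Usage: vectors = load_glove("data/glove.6B/glove.6B.300d.txt")
```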

Training

For CNN/DailyMail, the maximum document length is 100 sentences, not 50 as in the paper (the same maximum as SummaRuNNer, so the two models can be compared). Run one of the notebooks below to train and evaluate the associated model:

  • 01-train_HeterSUMGraph_CNN_DailyMail.ipynb: paper model on CNN-DailyMail.
  • 02-train_HeterSUMGraph_NYT50.ipynb: paper model on NYT50.
  • 03-train_HeterSUMGraph_CNN_DailyMail_TG_GATConv.ipynb: HeterSUMGraph with torch_geometric GATConv layer on CNN-DailyMail.
  • 04-train_HeterSUMGraph_NYT50_TG_GATConv.ipynb: HeterSUMGraph with torch_geometric GATConv layer on NYT50.
  • 05-train_HeterSUMGraph_CNN_DailyMail_TG_GATv2Conv.ipynb: HeterSUMGraph with torch_geometric GATv2Conv layer on CNN-DailyMail.
  • 06-train_HeterSUMGraph_NYT50_TG_GATv2Conv.ipynb: HeterSUMGraph with torch_geometric GATv2Conv layer on NYT50.
  • 07-train_HSGRNN_CNN_DailyMail_TG_GATv2Conv.ipynb: HeterSUMGraph with torch_geometric GATv2Conv layer + SummaRuNNer on CNN-DailyMail.
  • 08-train_HSGRNN_NYT50_TG_GATv2Conv.ipynb: HeterSUMGraph with torch_geometric GATv2Conv layer + SummaRuNNer on NYT50.
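The GATv2Conv variants replace GAT's static attention with the dynamic scoring of GATv2 (Brody et al.), where the LeakyReLU is applied before the attention vector. Below is a minimal single-head NumPy sketch of that scoring, not the repository's torch_geometric implementation; it assumes every node has at least one neighbour (e.g. a self-loop):

```python
import numpy as np

def gatv2_attention(h, adj, W_l, W_r, a, negative_slope=0.2):
    """Single-head GATv2-style attention: score(i, j) =
    a^T LeakyReLU(W_l h_i + W_r h_j), softmaxed over i's neighbours."""
    n = h.shape[0]
    z_l, z_r = h @ W_l, h @ W_r            # (n, d_out) each
    scores = np.full((n, n), -np.inf)      # -inf masks non-edges
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                s = z_l[i] + z_r[j]
                s = np.where(s > 0, s, negative_slope * s)  # LeakyReLU
                scores[i, j] = a @ s
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))  # row softmax
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ z_r                     # weighted neighbour aggregation
```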

Results

NYT50 (limited-length ROUGE Recall)

| model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| HeterSUMGraph (Wang) | 46.89 | 26.26 | 42.58 |
| HeterSUMGraph (ours) | 45.5 ± 0.0 | 24.2 ± 0.0 | 34.1 ± 0.0 |
| HSG GATConv | 45.4 ± 0.0 | 24.2 ± 0.0 | 34.0 ± 0.0 |
| HSG GATv2Conv | 47.2 ± 0.0 | 26.5 ± 0.0 | 35.5* ± 0.0 |
| HSGRNN GATv2Conv | 46.9 ± 0.0 | 26.3 ± 0.0 | 35.3 ± 0.0 |

*: the ROUGE-L computation may have changed in the rouge library used here.

CNN/DailyMail (full-length ROUGE F1)

| model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| SummaRuNNer (Nallapati) | 39.6 ± 0.2 | 16.2 ± 0.2 | 35.3 ± 0.2 |
| HeterSUMGraph (ours) | 38.2 ± 0.0 | 15.1 ± 0.0 | 24.1 ± 0.0 |
| HSG GATConv | 39.8 ± 0.0 | 16.3 ± 0.0 | 24.6 ± 0.0 |
| HSG GATv2Conv | 39.9 ± 0.0 | 16.4 ± 0.0 | 24.7* ± 0.0 |
| HSGRNN GATv2Conv | 39.5 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |

*: the ROUGE-L computation may have changed in the rouge library used here.
