State-of-the-art Entity Matching (EM) approaches rely on transformer architectures, such as BERT, to generate highly contextualized embeddings of terms. These embeddings are then used to predict whether pairs of entity descriptions refer to the same real-world entity. BERT-based EM models have proven to be effective, but they act as black boxes for their users, who have limited insight into the motivations behind their decisions. In this repository, we perform a multi-facet analysis of the components of pre-trained and fine-tuned BERT architectures applied to an EM task.
For a detailed description of the work, please read our paper. Please cite the paper if you use the code from this repository in your work.
@article{DBLP:journals/pvldb/PaganelliBBG22,
author = {Matteo Paganelli and
Francesco Del Buono and
Andrea Baraldi and
Francesco Guerra},
title = {Analyzing How {BERT} Performs Entity Matching},
journal = {Proc. {VLDB} Endow.},
volume = {15},
number = {8},
pages = {1726--1738},
year = {2022}
}
- Python 3.*
- Packages: see requirements.txt
$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install -r requirements.txt
This amounts to creating a binary classifier on top of the BERT model. For purely demonstrative purposes, below we train the EM model only on the Structured_Fodors-Zagats dataset.
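For intuition, the setup can be sketched with the HuggingFace transformers API; the snippet below is an illustration under assumed names (the repository's actual model code lives in the utils modules), not the exact implementation.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A binary classification head (match / non-match) on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# With the sent_pair tokenization, the two serialized entity descriptions
# are fed to BERT as a sentence pair (attribute values here are made up).
left = "name: Fodor's city: Los Angeles"
right = "name: Fodors city: LA"
inputs = tokenizer(left, right, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()  # assumed convention: 1 = match, 0 = non-match
```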
Option 1: pre-trained EM model. Only the classification layer is fine-tuned on the EM task.
python -m utils.bert_em_pretrain --use_cases Structured_Fodors-Zagats --tok sent_pair --experiment compute_features
python -m utils.bert_em_pretrain --use_cases Structured_Fodors-Zagats --tok sent_pair --experiment train
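Roughly, compute_features extracts representations from the frozen BERT model and train fits the classification layer on them. A sketch of that split, with a stand-in scikit-learn classifier and made-up record pairs (the repository's feature extraction and classifier may differ):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()  # kept frozen

def cls_feature(left: str, right: str):
    """Return the [CLS] embedding of a serialized record pair."""
    inputs = tokenizer(left, right, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state[0, 0].numpy()  # [CLS] token vector

# Toy features; the repository computes them for the whole EM dataset.
X = [cls_feature("name: Fodors", "name: Fodor's"),
     cls_feature("name: Fodors", "name: Zagats")]
y = [1, 0]  # match / non-match labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
```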
Option 2: fine-tuned EM model. Both the BERT architecture and the classification layer are fine-tuned on the EM task.
python -m utils.bert_em_fine_tuning --fit True --use_cases Structured_Fodors-Zagats --tok sent_pair
The model will be stored in the directory results/models/.
Pre-trained EM model
python -m utils.bert_em_pretrain --use_cases all --tok sent_pair --experiment eval
python -m utils.bert_em_pretrain --use_cases all --tok attr_pair --experiment eval
Fine-tuned EM model
python -m utils.bert_em_fine_tuning --fit False --use_cases all --tok sent_pair
python -m utils.bert_em_fine_tuning --fit False --use_cases all --tok attr_pair
python -m experiments.fine_tuning_impact_on_attention --use_cases all
python -m experiments.fine_tuning_impact_on_embeddings --use_cases all
Prerequisites
python -m experiments.get_attention_weights --use_cases all --multi_process True --attn_extractor token_extractor --special_tokens True --agg_metric mean --fine_tune False
python -m experiments.get_attention_weights --use_cases all --multi_process True --attn_extractor token_extractor --special_tokens True --agg_metric mean --fine_tune True
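These commands pre-compute the attention weights of the pre-trained and fine-tuned models. In HuggingFace terms, the extraction roughly amounts to the sketch below (the aggregation mirrors --agg_metric mean; the repository's extractors may aggregate differently):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("name: Fodors", "name: Fodor's", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions holds one tensor per layer: (batch, heads, seq_len, seq_len).
attn = torch.stack(out.attentions)  # (layers, batch, heads, seq_len, seq_len)
mean_attn = attn.mean(dim=2)        # average over heads, cf. --agg_metric mean
```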
Run the experiment
python -m experiments.e2e_attention --use_cases Structured_Amazon-Google Structured_Beer Textual_Abt-Buy Dirty_Walmart-Amazon --experiment comparison --comparison tune --small_plot True
python -m experiments.e2e_attention --use_cases Structured_Amazon-Google Structured_Beer Textual_Abt-Buy Dirty_Walmart-Amazon --fine_tune True --experiment simple --small_plot True
Prerequisites
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric max --fine_tune False --attn_tester attr_tester
Run the experiment
python -m experiments.attention.attention_test --use_cases all --attn_extractor attr_extractor --agg_metric max --fine_tune False --attn_tester attr_tester --analysis_target benchmark --analysis_type multi --plot_params attr_attn_3_last
Prerequisites
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric mean --fine_tune False --tok sent_pair --attn_tester attr_pattern_tester
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric mean --fine_tune False --tok attr_pair --attn_tester attr_pattern_tester
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric mean --fine_tune True --tok sent_pair --attn_tester attr_pattern_tester
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric mean --fine_tune True --tok attr_pair --attn_tester attr_pattern_tester
Run the experiment
python -m experiments.attention.attention_patterns --use_cases all --attn_extractor attr_extractor --agg_metric mean --attn_tester attr_pattern_tester --experiment all_freq --analysis_type comparison --comparison_param tune_tok
Prerequisites
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric mean --fine_tune False --tok sent_pair --attn_tester attr_pattern_tester
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric mean --fine_tune True --tok sent_pair --attn_tester attr_pattern_tester
Run the experiment
python -m experiments.attention.attention_patterns --use_cases all --attn_extractor attr_extractor --agg_metric mean --attn_tester attr_pattern_tester --experiment match_freq_by_layer --analysis_type comparison --comparison_param tune
Prerequisites
python -m experiments.attention.analyze_attention_weights --use_cases all --attn_extractor attr_extractor --agg_metric max --fine_tune True --attn_tester attr_tester
Run the experiment
python -m experiments.attention.attention_head_pruning --use_cases all --attn_extractor attr_extractor --agg_metric max --attn_tester attr_tester --task compute --prune_or_mask_methods importance maa random --prune_or_mask_amounts 5 10 20 50 100 --prune True
python -m experiments.attention.attention_head_pruning --use_cases all --attn_extractor attr_extractor --agg_metric max --attn_tester attr_tester --task visualize --prune_or_mask_methods importance maa random --prune_or_mask_amounts 5 10 20 50 100 --prune True
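For intuition, head pruning can be expressed with the prune_heads utility of HuggingFace models, as sketched below (the layer/head indices are made up; the experiment above selects them by criteria such as importance, maa, or random):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
# Remove heads 0 and 3 of layer 0 and head 5 of layer 11; the pruned
# model can then be re-evaluated on the EM task.
model.prune_heads({0: [0, 3], 11: [5]})
```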
Prerequisites
python -m experiments.attention.get_attention_weights --use_cases all --multi_process True --attn_extractor attr_extractor --special_tokens True --agg_metric max --fine_tune True
python -m experiments.attention.get_attention_weights --use_cases all --multi_process True --attn_extractor attr_extractor --special_tokens True --agg_metric max --fine_tune False
Run the experiment
python -m experiments.attention.cls_to_attr_attention --use_cases Structured_Fodors-Zagats Structured_DBLP-GoogleScholar Structured_DBLP-ACM Dirty_DBLP-ACM --attn_extractor attr_extractor --agg_metric max --experiment comparison --comparison tune --small_plot True
python -m experiments.attention.cls_to_attr_attention --use_cases Structured_Fodors-Zagats Structured_DBLP-GoogleScholar Structured_DBLP-ACM Dirty_DBLP-ACM --attn_extractor attr_extractor --agg_metric max --experiment simple --small_plot True --fine_tune True --data_categories all_pos all_neg
Prerequisites
python -m experiments.gradient.get_grads --use_cases all --grad_text_units attrs --multi_process True
Run the experiment
python -m experiments.gradient.plot_grads --use_cases all --grad_text_units attrs
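Conceptually, the gradient analysis back-propagates the model's decision to the inputs and reads off per-unit relevance. A minimal saliency sketch under assumed HuggingFace APIs (the repository's gradient extractor works at the attribute or word level and may differ):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("name: Fodors", "name: Fodor's", return_tensors="pt")
embeds = model.bert.embeddings.word_embeddings(inputs["input_ids"])
embeds.retain_grad()  # non-leaf tensor: keep its gradient
logits = model(inputs_embeds=embeds,
               attention_mask=inputs["attention_mask"],
               token_type_ids=inputs["token_type_ids"]).logits
logits[0, 1].backward()              # assumed convention: class 1 = match
saliency = embeds.grad.norm(dim=-1)  # one relevance score per input token
```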
Prerequisites
Download the fastText embeddings (wiki-news-300d-1M) from the fastText website and save them in the data folder.
python -m experiments.get_attention_weights --use_cases all --multi_process True --attn_extractor word_extractor --special_tokens True --agg_metric mean --fine_tune False
python -m experiments.get_attention_weights --use_cases all --multi_process True --attn_extractor word_extractor --special_tokens True --agg_metric mean --fine_tune True
Run the experiment
python -m experiments.attention_to_similar_words --use_cases all --sim_metric cosine --sem_embs fasttext --fine_tune False --task compute
python -m experiments.attention_to_similar_words --use_cases all --sim_metric cosine --sem_embs fasttext --fine_tune True --task compute
python -m experiments.attention_to_similar_words --use_cases all --sim_metric cosine --task visualize
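The experiment checks whether attention concentrates on semantically similar words, with word similarity taken from the fastText vectors. A sketch of the similarity side (the path follows the prerequisite above; the word pair is illustrative and assumed to be within the loaded vocabulary):

```python
import numpy as np

def load_vectors(path, limit=50_000):
    """Load the first `limit` word vectors from a .vec file."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # header line: "<num_words> <dim>"
        for i, line in enumerate(f):
            if i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            vecs[word] = np.asarray(values, dtype=np.float32)
    return vecs

vecs = load_vectors("data/wiki-news-300d-1M.vec")
a, b = vecs["restaurant"], vecs["cafe"]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # cf. --sim_metric cosine
```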
(Optional) Prerequisites
Download the fastText embeddings (wiki-news-300d-1M) from the fastText website and save them in the data folder.
If the experiment in Sec. 4.3 was run with the flag --save_embs, the following experiment can avoid re-computing the embeddings by specifying the option --precomputed_embs True.
Run the experiment
python -m experiments.emb_sym_analysis --use_cases all --sim_metric cosine --sem_embs fasttext --fine_tune False --task compute
python -m experiments.emb_sym_analysis --use_cases all --sim_metric cosine --sem_embs fasttext --fine_tune True --task compute
python -m experiments.emb_sym_analysis --use_cases all --sim_metric cosine --task visualize
Prerequisites
Download the fastText embeddings (wiki-news-300d-1M) from the fastText website and save them in the data folder.
python -m experiments.gradient.get_grads --use_cases all --grad_text_units words --multi_process True
Run the experiment
python -m experiments.gradient.gradient_embeddings_comparison --use_cases all --grad_text_units words --sim_metric cosine --sem_embs fasttext --task compute
python -m experiments.gradient.gradient_embeddings_comparison --use_cases all --grad_text_units words --sim_metric cosine --sem_embs fasttext --task visualize
Option 1: pre-trained EM model. Only the classification layer is fine-tuned on the EM task.
python -m utils.bert_em_pretrain --use_cases Structured_Fodors-Zagats --tok sent_pair --bert_model sentence-transformers/nli-bert-base --experiment compute_features
python -m utils.bert_em_pretrain --use_cases Structured_Fodors-Zagats --tok sent_pair --bert_model sentence-transformers/nli-bert-base --experiment train
Option 2: fine-tuned EM model. Both the SBERT architecture and the classification layer are fine-tuned on the EM task.
python -m utils.bert_em_fine_tuning --fit True --use_cases Structured_Fodors-Zagats --bert_model sentence-transformers/nli-bert-base --tok sent_pair
The model will be stored in the directory results/models/.
Pre-trained EM model
python -m utils.bert_em_pretrain --use_cases all --tok sent_pair --bert_model sentence-transformers/nli-bert-base --experiment eval
python -m utils.bert_em_pretrain --use_cases all --tok attr_pair --bert_model sentence-transformers/nli-bert-base --experiment eval
Fine-tuned EM model
python -m utils.bert_em_fine_tuning --fit False --use_cases all --tok sent_pair --bert_model sentence-transformers/nli-bert-base
python -m utils.bert_em_fine_tuning --fit False --use_cases all --tok attr_pair --bert_model sentence-transformers/nli-bert-base
This experiment evaluates the impact of the fine-tuning process on the ability of the BERT and SBERT models to recognize which attributes match between the entity descriptions.
python -m struct_experiments_gmask --data_dir Structured_Fodors-Zagats
python -m struct_experiments_gmask --data_dir Structured_Fodors-Zagats --model_name_or_path nli-bert-base
Mask the tokens of the BERT, SBERT, Ditto, and SupCon models with multiple criteria (e.g., random, maskSyn, or maskSem); a toy sketch of the masking step follows the commands below.
python -m experiments.masking.word_masking --use_cases all --bert_model bert-base-uncased --approach bert --max-len <MAX_LEN> --output_dir <OUTPUT_DIR>
python -m experiments.masking.word_masking --use_cases all --bert_model sentence-transformers/nli-bert-base --approach sbert --max-len <MAX_LEN> --output_dir <OUTPUT_DIR>
python -m experiments.masking.word_masking --use_cases all --bert_model roberta-base --approach ditto --max-len <MAX_LEN> --output_dir <OUTPUT_DIR>
python -m experiments.masking.word_masking --use_cases all --bert_model roberta-base --approach supcon --max-len <MAX_LEN> --output_dir <OUTPUT_DIR>
python -m experiments.masking.analyze_masking_results
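As a toy illustration, the random criterion boils down to replacing a fraction of the tokens with [MASK], as sketched below (maskSyn and maskSem presumably select the tokens to mask by syntactic and semantic similarity instead; the fraction here is made up):

```python
import random

def random_mask(text: str, frac: float = 0.3, seed: int = 0) -> str:
    """Replace a random fraction of the whitespace tokens with [MASK]."""
    rng = random.Random(seed)
    tokens = text.split()
    n = max(1, int(len(tokens) * frac))
    for i in rng.sample(range(len(tokens)), n):
        tokens[i] = "[MASK]"
    return " ".join(tokens)

print(random_mask("name: Fodor's city: Los Angeles"))
```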
Evaluate the correlation between the Jaccard sentence similarity and the cosine similarity of the BERT, SBERT, Ditto, and SupCon embeddings (a toy sketch follows the commands below).
python -m experiments.sent_sim.model_sent_corr --use_cases all --bert_model bert-base-uncased --approach bert --train_type pt --output_dir <OUTPUT_DIR>
python -m experiments.sent_sim.model_sent_corr --use_cases all --bert_model sentence-transformers/nli-bert-base --approach sbert --train_type pt --output_dir <OUTPUT_DIR>
python -m experiments.sent_sim.model_sent_corr --use_cases all --bert_model bert-base-uncased --approach bert --train_type ft --output_dir <OUTPUT_DIR>
python -m experiments.sent_sim.model_sent_corr --use_cases all --bert_model sentence-transformers/nli-bert-base --approach sbert --train_type ft --output_dir <OUTPUT_DIR>
python -m experiments.sent_sim.model_sent_corr --use_cases all --bert_model roberta-base --approach ditto --output_dir <OUTPUT_DIR>
python -m experiments.sent_sim.model_sent_corr --use_cases all --bert_model roberta-base --approach supcon --output_dir <OUTPUT_DIR>
python -m experiments.sent_sim.load_sent_corr_results
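A toy sketch of the correlation being measured, with random vectors standing in for the model embeddings (the scripts above compute it on real EM records):

```python
import numpy as np
from scipy.stats import pearsonr

def jaccard(s1: str, s2: str) -> float:
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
sentences = ["a b c", "a b d", "x y z"]
embed = {s: rng.normal(size=768) for s in sentences}  # stand-in embeddings

pairs = [("a b c", "a b d"), ("a b c", "x y z"), ("a b d", "x y z")]
jac = [jaccard(s1, s2) for s1, s2 in pairs]
cos = [cosine(embed[s1], embed[s2]) for s1, s2 in pairs]
r, _ = pearsonr(jac, cos)  # the experiment reports this correlation per model
```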
Degradation tests: LeRF (Least Relevant First) and MoRF (Most Relevant First).
python -m utils.exp_degradation --fit False --use_cases Structured_Fodors-Zagats
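For intuition, a toy sketch of the degradation procedure (tokens and attribution scores are made up; the script above scores real EM pairs and tracks how performance degrades):

```python
def degrade(tokens, relevance, most_first=True):
    """Yield texts with tokens progressively masked in relevance order
    (MoRF: most relevant first; LeRF: least relevant first)."""
    order = sorted(range(len(tokens)), key=lambda i: relevance[i],
                   reverse=most_first)
    masked = list(tokens)
    for step, i in enumerate(order, start=1):
        masked[i] = "[MASK]"
        yield step, " ".join(masked)  # re-score the EM pair at each step

tokens = ["fodors", "restaurant", "los", "angeles"]
relevance = [0.9, 0.2, 0.4, 0.3]  # made-up attribution scores
for step, text in degrade(tokens, relevance, most_first=True):
    print(step, text)
```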
Evaluate how many cliques are correctly recognized by an EM model.
python -m experiments.cliques.cluster_matching_records --bert_model bert-base-uncased --approach bert --output_dir <OUTPUT_DIR>
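A minimal sketch of the clique check, assuming networkx as a stand-in for the repository's implementation (record ids and predictions are made up):

```python
import networkx as nx

# Edges are the pairs the EM model predicts as matches.
predicted_matches = [("a1", "b1"), ("b1", "c1"), ("a1", "c1"), ("a2", "b2")]
g = nx.Graph(predicted_matches)
cliques = [c for c in nx.find_cliques(g) if len(c) >= 3]
print(cliques)  # e.g. [['a1', 'b1', 'c1']]: a fully recognized group
```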
Evaluate robustness to token injection.
python -m experiments.robustness.robustness_test --use_cases all --repeat 5 --output_dir <OUTPUT_DIR>
python -m experiments.robustness.load_word_occ_hacking_results
Evaluate out-of-distribution effectiveness.
python -m experiments.robustness.ood_experiment --bert_model bert-base-uncased --approach bert --output_dir <OUTPUT_DIR>
python -m experiments.robustness.ood_experiment --bert_model sentence-transformers/nli-bert-base --approach sbert --output_dir <OUTPUT_DIR>
python -m experiments.robustness.ood_experiment --bert_model roberta-base --approach ditto --output_dir <OUTPUT_DIR>
python -m experiments.robustness.ood_experiment --bert_model roberta-base --approach supcon --output_dir <OUTPUT_DIR>
python -m experiments.robustness.load_ood_results