Ferenc Béres, Rita Csoma, Tamás Vilmos Michaletzky, and András A. Benczúr
In this work, we intended to develop techniques that are able to efficiently differentiate between pro-vaxxer and vax-skeptic Twitter content related to COVID19 vaccines. After multiple data preprocessing steps, we evaluated Tweet content and user interaction network classification by combining text classifiers with several node embedding and community detection techniques.
Our initial results were published at the 10th International Conference on Complex Networks and Their Applications under the title Vaccine skepticism detection by network embedding. Our final work with additional findings and results was published in the Applied Network Science journal.
- UNIX environment
- Create a new conda environment with the related Python (3.8) dependencies using the provided YAML file.
- Activate the new conda environment (covid_vax_network)
- Install additional dependencies for downloading Twitter data
conda env create -f env.yml
conda activate covid_vax_network
cd install
bash install_twittercrawler.sh
After installing all dependencies, you can test your setup by executing the provided tests:
pytest
If you want to access code coverage statistics then execute:
pytest --cov
If you use our code or the Twitter data set that we collected on Covid vaccination, please cite our paper:
@Article{béres2023network,
author={Ferenc Béres and Tamás Vilmos Michaletzky and Rita Csoma and András A. Benczúr},
title={Network embedding aided vaccine skepticism detection},
journal="Applied Network Science",
year="2023",
volume="8",
number="11",
pages="21",
}
To comply data publication policy of Twitter, we cannot share the raw data. Instead, we publish the generated feature components that we used to train models for the task of vaccine skepticism detection.
We provide a bash script (download_data.sh
) to download our Twitter data set related to COVID19 vaccine skepticism.
./scripts/download_data.sh
The feature components (public_data_2022-10-27
) are downloaded and decompressed into the data
subfolder.
By executing the former bash script (download_data.sh
), the files containing Tweet identifiers are also downloaded (tweet_ids_2022-11-14
) into the data
subfolder.
We provide a script example for downloading original Tweet data using the provided Tweet identifiers:
cd scripts
python download_tweets_by_id.py <PATH_TO_API_KEY_FILE> ../data/tweet_ids_2022-11-14/seed_preprocessed/seed_tweet_ids.txt ../data/tweet_ids_2022-11-14/seed_preprocessed/valid_thread_seeds.txt 50000
where you must specify your Twitter API credentials as described here.
If you have successfully downloaded the raw Twitter data, then use the following script to generate feature components from raw data:
cd scripts
bash generate_components_all.sh ../data/tweet_ids_2022-11-14 Vax-skeptic
bash generate_components_all.sh ../data/tweet_ids_2022-11-14 Pro-vaxxer
We prepared a script to train and evaluate the models for every experiment. Specify the following parameters to run the script:
- First argument: specify the folder for generated feature components
- Second argument: specify the target vaccine view for model training (e.g. Pro-vaxxer, Vax-skeptic)
- Third argument: Number of independent experiments to run for calculating model performance
- Fourth argument (optional): Assign model training to a given GPU (e.g. cuda:1)
- Fifth argument (optional):
For example use, the following command to train models for vaccine skepticism detection:
cd scripts
bash train_eval_models.sh ../data/public_data_2022-10-27 Vax-skeptic 10
OR use a different command to detect pro-vaccine content:
cd scripts
bash train_eval_models.sh ../data/public_data_2022-10-27 Pro-vaxxer 10
Find the results for these experiments in the data/public_data_2022-10-27/{Vax-skeptic_results OR Pro-vaxxer_results}
, respectively.
For reproducability we share a JSON configuration file containing the parameters that we used to train our models.
Note that this file can be easily downloaded with the same script that is provided for downloading our Twitter data set.
We provide a script to train your own node embedding model on the Twitter reply graph.
First, you need to preprocess the network. For example, in the command below we only use the first 100K edges and exclude nodes with less than 5 connections:
python ./scripts/node_embedding.py preprocess ./data/tweet_ids_2022-11-14/seed_preprocessed/reply_network.txt --con 3 --rows 100000
Next, train a node embedding model (e.g. DeepWalk) on the preprocessed network. The input file (3core_100000.csv) was created in the previous step.
python ./scripts/node_embedding.py fit 3core_100000.csv --model DeepWalk
The 5-dimensional user representations are exported to a CSV file in your working directory.
head -3 DeepWalk_dim5_*.csv
The first column is the user identifier and the rest contains the representation for each node (Twitter user) of the reply network.
897201317223038976,-1.7705172,-2.9218996,3.7667134,-1.5700843,1.1719122
1280867032347664384,-3.7421749,-4.201378,1.2413,-2.1019526,2.6107976
1346645698176049152,-1.8614192,-3.9294648,1.6497265,-2.1258717,1.7651175
These user representations can be fed to the vaccine view classifier. However, training node embedding models on the whole Twitter reply network with millions of edges can take several days.
Open access funding provided by ELKH Institute for Computer Science and Control. Support by the the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory Grant no RRF-2.3.1-21-2022-00004.