VoterFraud2020

VoterFraud2020 is a multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets from 2.6M users related to voter fraud claims.

Hydrating the data

The tweets and user objects in the dataset can be hydrated using Twarc or Hydrator.

Note: tweets from suspended users will not be available for hydration. We believe it's in the public interest to make these tweets available. We will share those tweets with published academic researchers; email us for details.

Hydrating using Hydrator (GUI)

Navigate to the Hydrator GitHub repository and follow the installation instructions in its README. To use the GUI, you must first extract the tweet IDs from the CSVs in this repository into a tweet ID file.
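The extraction step can be done with a few lines of Python. The following is a minimal sketch, assuming the tweet CSVs are uncompressed, have a header row, and expose the `tweet_id` column described under "Data description". The function name `extract_tweet_ids` and the paths in the usage comment are hypothetical.

```python
import csv
from pathlib import Path


def extract_tweet_ids(csv_dir, out_path, id_column="tweet_id"):
    """Write one tweet ID per line from every CSV in csv_dir; return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort the file list so the output order is stable across runs.
        for csv_path in sorted(Path(csv_dir).glob("*.csv")):
            with open(csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    tweet_id = (row.get(id_column) or "").strip()
                    if tweet_id:  # skip rows with a blank ID cell
                        out.write(tweet_id + "\n")
                        count += 1
    return count


# Example call (hypothetical paths; adjust to your local checkout):
# extract_tweet_ids("data/tweets", "tweet_ids.txt")
```

The resulting text file, with one ID per line, is the kind of input that Hydrator and Twarc both accept.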

Hydrating using Twarc (CLI, Python 3)

First, install Twarc and tqdm:

pip3 install twarc tqdm

Configure Twarc with your Twitter API tokens. You must first apply for a Twitter developer account to obtain them. If you cannot configure Twarc through the CLI, you can set the API tokens in the script instead.

twarc configure

Run the script. The hydrated tweets are stored in the same folder as the tweet ID file, as a compressed jsonl file.

python3 hydrate.py
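For readers who want to see the shape of this step, here is a rough sketch of hydrating a tweet ID file into a gzip-compressed jsonl file. It is not the repository's `hydrate.py`; the function names and file names are hypothetical. It assumes the twarc v1 client (`twarc.Twarc` and its `hydrate` method) and that `Twarc()` picks up the credentials saved by `twarc configure`.

```python
import gzip
import json


def write_hydrated(id_path, out_path, hydrate_fn):
    """Hydrate the IDs in id_path and write one JSON object per line, gzipped."""
    written = 0
    with open(id_path, encoding="utf-8") as ids, \
            gzip.open(out_path, "wt", encoding="utf-8") as out:
        # Stream IDs lazily so large ID files are never held in memory.
        tweet_ids = (line.strip() for line in ids if line.strip())
        for tweet in hydrate_fn(tweet_ids):
            out.write(json.dumps(tweet) + "\n")
            written += 1
    return written


def twarc_hydrate(tweet_ids):
    """Return an iterator of tweet objects from twarc's v1 client.

    Assumption: credentials were stored beforehand with `twarc configure`.
    """
    from twarc import Twarc  # imported lazily so the writer works offline
    return Twarc().hydrate(tweet_ids)


# Example call (hypothetical file names):
# write_hydrated("tweet_ids.txt", "tweet_ids.jsonl.gz", twarc_hydrate)
```

Passing the hydration function as an argument keeps the file handling separate from the API client, so the writer can be exercised with a stand-in function before any API calls are made.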

This guide was inspired by the #Election2020 Dataset Repository.

Data description

The columns in the data are described below. See the paper for more details, or explore the project website for additional descriptive statistics.

Tweets (7.6M)

Total count: 7,603,103
Original tweets: 3,781,524
Quote tweets: 3,821,579

The tweets are split into daily chunks.

Data Column Description
tweet_id The ID of the tweet.
user_community The community of the tweet's author in the retweet graph, which is found using the Infomap community detection algorithm with default parameters. Values: 0, 1, 2, 3, 4, null
user_active_status The active status of the tweet's author (as of January 10th). Values: 'active', 'suspended', 'deleted' (not found)
retweet_count_metadata The number of retweets the tweet has received according to the tweet object's metadata (as of December 16th).
quote_count_metadata The number of quotes the tweet has received according to the tweet object's metadata (as of December 16th).
retweet_count_by_community_X The number of retweets the tweet received from users in community X (X=0-4).
quote_count_by_community_X The number of quotes the tweet received from users in community X (X=0-4).
retweet_count_by_suspended_users The number of retweets the tweet received from suspended users.
quote_count_by_suspended_users The number of quotes the tweet received from suspended users.
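As an illustration of how these columns can be used, the sketch below counts the tweets in one daily chunk by author community and account status. It assumes the chunk is an uncompressed CSV with a header row and that a null community is stored as an empty cell. The function name `status_by_community` is hypothetical.

```python
import csv
from collections import Counter


def status_by_community(csv_path):
    """Count tweets per (user_community, user_active_status) pair in one chunk."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumption: a null community appears as an empty cell.
            community = row["user_community"].strip() or "null"
            status = row["user_active_status"].strip()
            counts[(community, status)] += 1
    return counts


# Example call (hypothetical file name):
# for key, n in status_by_community("tweets_chunk.csv").most_common():
#     print(key, n)
```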

Retweets (25.6M)

Total count: 25,566,698

The retweets are split into daily chunks.

Data Column Description
retweeted_id The ID of the retweeted tweet.
user_id The ID of the user that retweeted.
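Because each row is a single retweet event, retweet counts within the dataset can be recovered by tallying `retweeted_id` across the daily chunks. The sketch below assumes uncompressed CSVs with a header row. The function name `count_retweets` is hypothetical.

```python
import csv
from collections import Counter


def count_retweets(csv_paths):
    """Tally how many retweet rows reference each retweeted tweet ID."""
    counts = Counter()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                counts[row["retweeted_id"]] += 1
    return counts


# Example call (hypothetical file names):
# top = count_retweets(["retweets_day1.csv", "retweets_day2.csv"]).most_common(10)
```

These tallies cover only retweets present in the dataset, so they may differ from the `retweet_count_metadata` value reported in the tweet object.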

Users (2.6M)

Total count: 2,559,018

The users are split into 5 chunks, sorted by user id (ascending).

Data Column Description
user_id The ID of the user.
user_community The community of the user in the retweet graph, which is found using the Infomap community detection algorithm with default parameters. Values: 0, 1, 2, 3, 4, null
user_active_status The active status of the user (as of January 10th). Values: 'active', 'suspended', 'deleted' (not found)
closeness_centrality_detractor_cluster Normalized closeness centrality of the top 10,000 users in the detractor cluster (computed using Networkit).
closeness_centrality_promoter_cluster Normalized closeness centrality of the top 10,000 users in the promoter cluster (computed using Networkit).
retweet_count_by_community_X Aggregated count of the retweets the user received from other users in community X (X=0-4).
quote_count_by_community_X Aggregated count of the quotes the user received from other users in community X (X=0-4).
retweet_count_by_suspended_users Aggregated count of the retweets the user received from suspended users.
quote_count_by_suspended_users Aggregated count of the quotes the user received from suspended users.
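The per-community columns can be combined to rank users. The sketch below lists the most-retweeted users in one community by summing `retweet_count_by_community_0` through `retweet_count_by_community_4`. It assumes that `X` in the column names is replaced literally by the community number, that blank cells mean zero, and that community values may be written as `1` or `1.0`. The function names are hypothetical.

```python
import csv

COMMUNITIES = range(5)  # communities 0-4, as listed in the column description


def retweets_received(row):
    """Sum retweet_count_by_community_0..4 for one user row; blanks count as 0."""
    total = 0
    for c in COMMUNITIES:
        value = (row.get(f"retweet_count_by_community_{c}") or "").strip()
        total += int(float(value)) if value else 0
    return total


def top_users(csv_path, community, n=10):
    """Return the n (user_id, retweets) pairs with the most retweets in a community."""
    ranked = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            raw = row["user_community"].strip()
            # Accept "1" and "1.0"; skip users with a null community.
            if raw and int(float(raw)) == community:
                ranked.append((row["user_id"], retweets_received(row)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Example call (hypothetical file name):
# print(top_users("users_chunk.csv", community=1, n=5))
```

The sum covers only retweets from users assigned to communities 0-4; retweets from users with a null community are not included in these columns.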

Images

Total count: 167,696

The image perceptual hash values were calculated using the ImageHash python package.

Data Column Description
unique_id Unique identifier of the image.
tweet_id The ID of the tweet that contained the image.
a_hash The Average hash of the image.
p_hash The Perceptual hash of the image.
w_hash The Wavelet hash of the image.
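Perceptual hashes are typically compared by Hamming distance: the fewer bits that differ, the more visually similar two images are likely to be. The sketch below assumes the hash columns hold equal-length hexadecimal strings, which is how ImageHash renders a hash as text. The function names and the `max_distance` threshold of 4 bits are illustrative assumptions, not values from the dataset.

```python
from itertools import combinations


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def near_duplicates(rows, field="p_hash", max_distance=4):
    """Return (id_a, id_b, distance) for image pairs within max_distance bits.

    Compares every pair, so it is only practical for small subsets.
    """
    pairs = []
    usable = [r for r in rows if r.get(field)]  # skip rows with no hash
    for a, b in combinations(usable, 2):
        distance = hamming_distance(a[field], b[field])
        if distance <= max_distance:
            pairs.append((a["unique_id"], b["unique_id"], distance))
    return pairs
```

The pairwise loop is quadratic in the number of images, so for the full image table a bucketing or indexing scheme would be needed before comparing hashes.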

URLs

Data Column Description
url The URL.
domain The domain of the URL.
tweet_count Aggregated count of the tweets that contained the URL.
retweet_count_metadata Aggregated count of the retweets that tweets containing the URL received according to the tweet object's metadata (as of December 16th).
quote_count_metadata Aggregated count of the quotes that tweets containing the URL received according to the tweet object's metadata (as of December 16th).
tweet_count_by_community_X Aggregated count of tweets that contained the URL by users in community X (X=0-4).
retweet_count_by_community_X Aggregated count of the retweets that tweets containing the URL received from users in community X (X=0-4).
quote_count_by_community_X Aggregated count of the quotes that tweets containing the URL received from users in community X (X=0-4).
tweet_count_by_suspended_users Aggregated count of tweets that contained the URL by suspended users.
retweet_count_by_suspended_users Aggregated count of the retweets that tweets containing the URL received from suspended users.
quote_count_by_suspended_users Aggregated count of the quotes that tweets containing the URL received from suspended users.

YouTube Videos

Data Column Description
video_id ID of the YouTube video.
video_title Title of the video (as of January 1st).
channel_id Channel ID of the channel where the video was posted.
channel_title Channel title of the channel where the video was posted (as of January 1st).
published_at Timestamp of when the video was published.
tweet_count Aggregated count of the tweets that contained the video.
retweet_count_metadata Aggregated count of the retweets that tweets containing the video received according to the tweet object's metadata (as of December 16th).
quote_count_metadata Aggregated count of the quotes that tweets containing the video received according to the tweet object's metadata (as of December 16th).
tweet_count_by_community_X Aggregated count of tweets that contained the video by users in community X (X=0-4).
retweet_count_by_community_X Aggregated count of the retweets that tweets containing the video received from users in community X (X=0-4).
quote_count_by_community_X Aggregated count of the quotes that tweets containing the video received from users in community X (X=0-4).
tweet_count_by_suspended_users Aggregated count of tweets that contained the video by suspended users.
retweet_count_by_suspended_users Aggregated count of the retweets that tweets containing the video received from suspended users.
quote_count_by_suspended_users Aggregated count of the quotes that tweets containing the video received from suspended users.
