COMP3222 Machine Learning coursework with the aim of designing an ML algorithm to classify Twitter posts as fake or real.

edelmans/Twitter-news-post-ML-classification


Social Media Post Classification within the MediaEval 2015 dataset

TASKS

  • Data Analysis
  • Algorithm Design
  • Evaluation

MODULE

Machine Learning Technologies (COMP3222)

DESCRIPTION

This individual project explored ways of automatically classifying news-related Twitter content as real or fake. For this coursework, I designed a machine learning algorithm to classify Twitter posts from the MediaEval 2015 "Verifying Multimedia Use" challenge dataset.

The project was built in Python with Jupyter Notebook, using the scikit-learn, NumPy, pandas and deep-translator libraries. The dataset contained 14,277 training entries and 3,755 testing entries, and each entry had the following set of features:

tweetId / tweetText / userId / imageId / username / timestamp / label

DATA ANALYSIS

Here are some of the graphs produced throughout the data analysis.

[Figures: four data analysis graphs]

ALGORITHM DESIGN

The algorithm design began with preprocessing. Tweets were cleaned by removing punctuation, lowercasing the text, removing stop words and emojis, and translating the text. After preprocessing, the tweetText feature was vectorized and transformed into a term frequency-inverse document frequency (TF-IDF) matrix.
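The cleaning steps described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `clean_tweet` helper, the tiny `STOP_WORDS` set and the emoji ranges in `EMOJI_RE` are all assumptions made for the example, and the translation step is left out because it needs the deep-translator package and network access.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the project's real list is not given.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "this"}

# Rough emoji matcher covering common Unicode emoji blocks (an approximation).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often trails an emoji
    "]+"
)

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Remove emojis and punctuation, lowercase, and drop stop words."""
    text = EMOJI_RE.sub(" ", text)      # emoji removal
    text = text.lower()                 # lowercasing
    text = text.translate(PUNCT_TABLE)  # punctuation removal
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet("This is a SHARK in the street!!! 😱"))  # shark street
```

The cleaned strings would then be passed to a vectorizer such as scikit-learn's `TfidfVectorizer` to obtain the TF-IDF matrix.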

Considering the constraints and characteristics of the data, three starting classifiers were chosen: MultinomialNB / LinearSVC / SGDClassifier.
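As a rough sketch of how the three candidates could be trained on the same TF-IDF representation, the snippet below wraps each one in a scikit-learn pipeline. The example texts and labels are invented for illustration and are not taken from the MediaEval data; the variable names and default hyperparameters are assumptions as well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy examples standing in for cleaned tweetText values and labels.
train_texts = [
    "shark swimming flooded street",
    "shark highway hurricane photo",
    "soldiers guarding tomb storm",
    "official report flooding subway",
    "mayor confirms power outage",
    "rescue teams arrive city",
]
train_labels = ["fake", "fake", "fake", "real", "real", "real"]
test_texts = ["shark flooded subway", "mayor report outage"]

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "SGDClassifier": SGDClassifier(random_state=0),
}

fitted = {}
predictions = {}
for name, clf in candidates.items():
    # Each classifier gets its own TF-IDF vectorizer inside a pipeline.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    fitted[name] = model
    predictions[name] = model.predict(test_texts)

for name, pred in predictions.items():
    print(name, list(pred))
```

Keeping the vectorizer inside the pipeline means it is fitted only on the training texts, so the vocabulary and document frequencies do not leak from the test set.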

EVALUATION AND RESULT

The classifiers were evaluated, and the strongest learner (MultinomialNB in this project) was chosen for further hyperparameter tuning through GridSearch. Additionally, other features such as imageId and username were added in an iterative process.
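A grid search over a TF-IDF plus MultinomialNB pipeline might look like the following sketch, which uses scikit-learn's `GridSearchCV`. The parameter grid, the two-fold cross-validation and the toy data are assumptions chosen to keep the example small; the grid actually searched in the project is not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data; real inputs would be the preprocessed tweetText column.
texts = [
    "shark swimming flooded street",
    "shark highway hurricane photo",
    "soldiers guarding tomb storm",
    "fake photo shark subway",
    "official report flooding subway",
    "mayor confirms power outage",
    "rescue teams arrive city",
    "official storm warning issued",
]
labels = ["fake", "fake", "fake", "fake", "real", "real", "real", "real"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Hypothetical grid: n-gram range for the vectorizer, smoothing for Naive Bayes.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

# cv=2 only because the toy set is tiny; a larger fold count suits real data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```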

The best performance was achieved by using the tweetText and username features, which resulted in an accuracy score of 89.26%.

After the project was submitted, I received detailed feedback from the module leader highlighting the strengths of this work, mainly the data analysis and code quality, as well as areas for improvement such as additional feature selection. This project was awarded a first-class mark of 70%.