This is a PySpark NLP project.
The project implements feature engineering in PySpark.
It builds n-gram, TF-IDF, and CountVectorizer models in PySpark.
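
For readers new to these features, here is a minimal plain-Python sketch of what n-grams, term counts, and TF-IDF weights compute. It is an illustration only, not the project's code: the helper names (`ngrams`, `count_vectors`, `idf_weights`) and the toy tweets are hypothetical, and the smoothed IDF form log((N + 1) / (df + 1)) is the one given in the Spark MLlib documentation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vectors(docs):
    """Map each document (a list of terms) to a term -> count mapping."""
    return [Counter(doc) for doc in docs]


def idf_weights(counts):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1))."""
    n_docs = len(counts)
    # Iterating a Counter yields each distinct term once per document,
    # so this counts how many documents contain each term.
    df = Counter(term for c in counts for term in c)
    return {term: math.log((n_docs + 1) / (d + 1)) for term, d in df.items()}


# Toy tweets, invented for illustration (not taken from Sentiment140).
tweets = ["i love this phone", "i hate this phone", "love love love"]
tokens = [t.split() for t in tweets]

bigrams = [ngrams(t, 2) for t in tokens]      # n-gram features
counts = count_vectors(tokens)                # CountVectorizer-style counts
idf = idf_weights(counts)                     # one weight per vocabulary term
tfidf = [{term: tf * idf[term] for term, tf in c.items()} for c in counts]
```

Terms that appear in many tweets (such as "i") get a smaller IDF weight than rare ones (such as "hate"), which is the reason for using TF-IDF rather than raw counts.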
These models are combined with logistic regression to evaluate the effectiveness of the resulting classifier.
The dataset used is "Sentiment140", which contains information about 1.6 million tweets.
More info on the dataset can be found at the link >>
The dataset can be downloaded from the link below.