This is a PySpark NLP project.
-
Implementing feature engineering using PySpark
-
Implementing n-gram, TF-IDF, and CountVectorizer features using PySpark
-
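To make these three feature types concrete, here is a minimal plain-Python sketch of what each one computes. It is an illustration only, not this project's code: the toy documents and the helper name `ngrams` are made up for the example, and the IDF line uses the smoothed formula `log((m + 1) / (df + 1))` that Spark's `IDF` transformer documents.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return space-joined n-grams of consecutive tokens (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus, assumed for illustration only.
docs = ["good day", "bad day", "good good movie"]
tokenized = [d.split() for d in docs]

# Count-vectorizer idea: raw term counts per document.
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for c in counts for term in c)

# Smoothed inverse document frequency, with m = number of documents.
m = len(docs)
idf = {term: math.log((m + 1) / (df_t + 1)) for term, df_t in doc_freq.items()}

# TF-IDF weight = term count in the document * IDF of the term.
tfidf = [{term: tf * idf[term] for term, tf in c.items()} for c in counts]

print(ngrams(tokenized[2], 2))
print(tfidf[2])
```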
These features are used with a Logistic Regression classifier, whose effectiveness is then evaluated.
-
The dataset used is "Sentiment140", which contains 1.6 million tweets
-
More information on the dataset is available at http://help.sentiment140.com/for-students/ and the dataset itself can be downloaded from http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip