PYSPARK NLP MODELLING

This is a PySpark NLP project for sentiment classification of tweets.

OBJECTIVE

Implement feature engineering using PySpark.
Build n-gram, TF-IDF, and CountVectorizer feature models using PySpark.
Use these features with a logistic regression classifier and evaluate how well the classifier performs.
The dataset is "Sentiment140", which contains 1.6 million tweets.
More information on the dataset is available at http://help.sentiment140.com/for-students/
The dataset can be downloaded from http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
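
To make the n-gram and TF-IDF features named in the objective concrete, here is a minimal plain-Python sketch of what those transforms compute on a few made-up tweets. It is not the code from the notebooks. The helper names `ngrams` and `tf_idf` and the sample tweets are hypothetical, and the smoothed IDF formula log((N + 1) / (df + 1)) is assumed to match the one documented for Spark's `IDF` stage. In PySpark the corresponding stages would be `pyspark.ml.feature.NGram`, `CountVectorizer`, and `IDF`, followed by `pyspark.ml.classification.LogisticRegression`.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Returns one {term: weight} dict per document, where
    weight = raw term count * log((N + 1) / (df + 1)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Made-up sample tweets, for illustration only.
tweets = ["i love this phone", "i hate this phone", "love love love"]
tokenized = [tweet.split() for tweet in tweets]

bigrams = [ngrams(tokens, 2) for tokens in tokenized]
weights = tf_idf(tokenized)

print(bigrams[0])
print(weights[1])
```

Terms that appear in fewer tweets (here "hate") receive a larger IDF factor than terms shared by several tweets, which is the property the classifier relies on when these weights are used as features.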
