Can we use unsupervised learning and Natural Language Processing to determine whether a news story is from the New York Times or the Onion? Two datasets from Kaggle will be used:
https://www.kaggle.com/datasets/undefinenull/satirical-news-from-the-onion
https://www.kaggle.com/datasets/tmishinev/nyt-headlines-20102021
There are 40,051 news stories from the New York Times and 6,789 from the Onion. Each row in each dataset contains both the headline and the text of the article. The target will be a column added to the combined dataset indicating whether the story is from the NYT or the Onion.
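As a rough sketch of how the combined dataset and its target column might be built, the snippet below tags each row with its source and checks the class balance. The field names ("headline", "text"), the label values ("nyt", "onion"), and the helper functions are assumptions made for illustration, not the actual Kaggle schema; the toy rows are placeholders, not real data.

```python
from collections import Counter

# Hypothetical label values; the proposal only says the target marks NYT vs. the Onion.
NYT_LABEL = "nyt"
ONION_LABEL = "onion"


def combine_with_target(nyt_rows, onion_rows):
    """Return one list of rows, each copied and tagged with a 'target' key."""
    combined = [{**row, "target": NYT_LABEL} for row in nyt_rows]
    combined += [{**row, "target": ONION_LABEL} for row in onion_rows]
    return combined


def majority_baseline(rows):
    """Accuracy obtained by always predicting the most common target value."""
    counts = Counter(row["target"] for row in rows)
    return max(counts.values()) / len(rows)


# Toy rows standing in for the Kaggle files (field names and text are assumed placeholders).
nyt_sample = [
    {"headline": "placeholder NYT headline 1", "text": "placeholder body"},
    {"headline": "placeholder NYT headline 2", "text": "placeholder body"},
    {"headline": "placeholder NYT headline 3", "text": "placeholder body"},
]
onion_sample = [
    {"headline": "placeholder Onion headline", "text": "placeholder body"},
]

combined = combine_with_target(nyt_sample, onion_sample)
print(Counter(row["target"] for row in combined))
print(majority_baseline(combined))

# The same baseline using the story counts stated in the proposal.
full_baseline = 40_051 / (40_051 + 6_789)
print(round(full_baseline, 3))
```

With the counts given above, about 85.5% of the combined rows come from the New York Times, so any model would need to be judged against that imbalance rather than on raw accuracy alone.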
The plan is to follow the project workflow outlined in Canvas and to rely primarily on Python text processing libraries and tools, along with visualization tools such as Plotly and seaborn. The only other tool that might be used is MongoDB.
The MVP should contain some preliminary analysis and/or a path forward to the end result: a predictive model.