Author: Wonuola Abimbola
Movies are one of the most popular means of entertainment. There are large volumes of movie data being generated and shared on the internet every second.The genre of a movie can be deciphered from its synopsis much of the time. This project is a multiclass text classification problem which uses movie plot summaries to predict movie genres. NLP preprocessing techniques were implemented on the summaries to improve predictive capability. The dataset is from Kaggle which contains about 34.8k plot summary descriptions scraped from Wikipedia. In this project, only movies that were assigned one genre were used.
Columns:
- Release Year - year of release
- Title - title of the movies
- Origin/Ethnicity - country of origin of the movies
- Director - director names associated with the movies
- Cast - cast name associated with the movies
- Wiki Page - wikipedia page of the movies
- Plot - plot summary of the movies
- The data was subset to include only the movies that have one genre.
- I label-encoded the genres.
- The 'Plot' column was then 'cleaned' using NLP preprocessing techniques (removed stopwords, punctuations, digits, changed text to lower case, lemmatization, tokenization)
Frequency distribution of the genres showed the most prevalent genre was 'drama' with 'comedy' following close behind. The least common genre was 'romance'.
The visuals below show the word frequency distribution of plot summaries per genre. There were quite a few commonly occuring words among the genres.
The metric used in this project is the 'accuracy score'
The first simple model which predicted the most frequent class achieved an accuracy score of 41% which served as the baseline.
Both Logistic regression performed the best, however logisitic regression performed better on the test set.
Best model was logistic regression with tfidf transformed summaries with the following parameters:
- C = 1.0
- class_weight = 'balanced'
- multi_class = 'ovr'
- penalty = 'l2'
- solver = 'liblinear'
I implemented the model on test data and achieved an acurracy score of 64% which is a significant increase from the baseline
This showed that the plot words in each genre were highly similar to the others
- The confusion matrix above confirms what I suspected when looking at the most common words among the genres.
- Due to the fact that some of the genres had words in common with another genre, false predictions of those genres were mostly as the genres they had common words with.
- The decision tree models performed the least favorably.
├── data <- folder containing data
│ ├── wiki_movie_plots_deduped.csv <- contains movie plot data scraped from Wikipedia
├── images <- Visualizations used in README and pdf file
├── .gitignore <- files to ignore
├── README.md <- This README file
├── predict_movie_genre_from_plot_summary.ipynb <- final notebook
└── predict_movie_genre_from_plot_summary.pdf <- final presentation
With questions or feedback on this repository, please reach out via: