This repository contains an analysis of a recipe dataset in which each recipe lists its ingredients, with the goal of predicting a recipe's cuisine from those ingredients. The analysis covers data exploration, visualization, and classification modeling with several algorithms: Multinomial Naive Bayes, XGBoost, a CNN, and Random Forest.
The primary objective of this project is to explore the relationship between ingredients and cuisine types, and develop machine learning models to classify recipes into different cuisine categories.
- Notebook (recipe_ingredients_classification.ipynb):
  - This Jupyter notebook contains the entire analysis, including data preprocessing, exploratory data analysis (EDA), feature extraction, and machine learning model training and evaluation.
  - It covers:
    - Data Import and Extraction: Loading and extracting the dataset.
    - Data Preprocessing: Cleaning and organizing data.
    - Data Exploration: Visualizing ingredient distributions and cuisine counts.
    - Modeling: Using multiple classifiers to predict the cuisine of recipes based on ingredients.
- Data:
  - The dataset is downloaded from Kaggle using the kaggle API and consists of recipes, each with a cuisine label and a list of ingredients.
- Output:
  - The notebook generates visualizations, including count plots and word clouds to explore the data, and confusion matrices to assess the performance of different models.
- You need Python 3.x with the following libraries installed:
  - pandas
  - matplotlib
  - seaborn
  - scikit-learn
  - xgboost
  - keras
  - tensorflow
  - wordcloud
  - nltk
  - numpy
To install the required libraries, run the following command:
pip install pandas matplotlib seaborn scikit-learn xgboost keras tensorflow wordcloud nltk numpy
To download the dataset from Kaggle, you must set up the Kaggle API in your environment. The commands prefixed with ! are meant to be run from a notebook cell, and the /content/ path assumes Google Colab; adjust it to wherever you uploaded the file.
- Create a Kaggle account if you don't have one.
- Go to the Kaggle API settings and create a new API key (a kaggle.json file).
- Upload the kaggle.json file to your environment and set the Kaggle credentials path:
  !mkdir -p ~/.kaggle
  !cp /content/kaggle.json ~/.kaggle/
- Install the Kaggle package:
  pip install kaggle
- Download the dataset:
  !kaggle datasets download -d kaggle/recipe-ingredients-dataset
  !unzip -q recipe-ingredients-dataset.zip
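Outside a notebook, the credential step above can be sketched in plain Python. This is a minimal sketch, not part of the Kaggle package: the helper name install_kaggle_credentials and its parameters are hypothetical. It assumes only that the Kaggle client looks for ~/.kaggle/kaggle.json and warns when that file is readable by other users.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(token_path, home_dir=None):
    """Copy a kaggle.json token into <home>/.kaggle and restrict its permissions.

    The function name and parameters are illustrative, not a Kaggle API.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p ~/.kaggle`

    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like `cp kaggle.json ~/.kaggle/`
    os.chmod(target, 0o600)                        # owner read/write only
    return target
```

In Colab this might be called as install_kaggle_credentials("/content/kaggle.json"); the home_dir parameter exists only so the helper can be tried against a scratch directory.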
- train.json: The training dataset containing recipes with their ingredients and corresponding cuisine labels.
- test.json: The test dataset containing recipes with their ingredients but no cuisine labels; these are the recipes to predict.
Each entry in the dataset consists of:
- ingredients: A list of ingredients used in the recipe.
- cuisine: The cuisine type for the recipe (only available in the training set).
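To make the entry structure concrete, here is a minimal sketch that parses recipes in this shape using only the standard library. The sample recipes are made up for illustration, and load_recipes is a hypothetical helper; only the ingredients and cuisine keys come from the description above.

```python
import json

# A tiny in-memory stand-in for train.json (invented recipes).
SAMPLE_TRAIN_JSON = """
[
  {"cuisine": "italian", "ingredients": ["olive oil", "garlic", "tomatoes", "basil"]},
  {"cuisine": "mexican", "ingredients": ["corn tortillas", "avocado", "lime"]},
  {"cuisine": "italian", "ingredients": ["pasta", "garlic", "parmesan cheese"]}
]
"""


def load_recipes(json_text):
    """Parse a JSON array of recipes and check each one has an ingredients list."""
    recipes = json.loads(json_text)
    for recipe in recipes:
        if not isinstance(recipe.get("ingredients"), list):
            raise ValueError("every recipe needs an 'ingredients' list")
    return recipes


recipes = load_recipes(SAMPLE_TRAIN_JSON)

# Entries from test.json have no "cuisine" key, so .get() yields None for them.
labels = [recipe.get("cuisine") for recipe in recipes]
```

With the real files, the text would come from reading train.json or test.json from disk instead of the inline sample.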
- Data Exploration:
  - Loading the dataset and performing basic checks.
  - Visualizations:
    - Count plot of cuisines.
    - Distribution of the number of ingredients.
    - Boxplot of the number of ingredients by cuisine.
    - Word cloud representation of the most frequent ingredients per cuisine.
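The numbers behind these plots can be computed before any plotting library is involved. The following is a rough sketch using only the standard library and made-up recipes; the notebook itself may compute these quantities differently (for example with pandas).

```python
from collections import Counter
from statistics import mean

# Made-up recipes in the dataset's shape (cuisine label + ingredient list).
recipes = [
    {"cuisine": "italian", "ingredients": ["olive oil", "garlic", "tomatoes", "basil"]},
    {"cuisine": "mexican", "ingredients": ["corn tortillas", "avocado", "lime"]},
    {"cuisine": "italian", "ingredients": ["pasta", "garlic", "parmesan cheese"]},
    {"cuisine": "indian", "ingredients": ["garlic", "cumin", "turmeric", "onions", "rice"]},
]

# Recipes per cuisine: the data behind a count plot.
cuisine_counts = Counter(recipe["cuisine"] for recipe in recipes)

# Ingredient-list lengths grouped by cuisine: the data behind a boxplot.
lengths_by_cuisine = {}
for recipe in recipes:
    lengths_by_cuisine.setdefault(recipe["cuisine"], []).append(len(recipe["ingredients"]))
mean_length = {cuisine: mean(lengths) for cuisine, lengths in lengths_by_cuisine.items()}

# Most frequent ingredients overall: the data behind a top-N bar plot.
ingredient_counts = Counter(
    ingredient for recipe in recipes for ingredient in recipe["ingredients"]
)
top_ingredients = ingredient_counts.most_common(3)
```

Restricting ingredient_counts to the recipes of one cuisine gives the frequencies a per-cuisine word cloud would be drawn from.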
- Feature Engineering:
  - Creating new features like the number of ingredients per recipe.
  - Vectorizing the ingredients list into a bag-of-words representation using CountVectorizer.
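As an illustration of the two feature-engineering steps above, here is a minimal sketch with scikit-learn's CountVectorizer. The ingredient lists are made up, and joining each list into one space-separated string is an assumption about the preprocessing, not necessarily what the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up ingredient lists for illustration.
ingredient_lists = [
    ["soy sauce", "rice", "ginger"],
    ["olive oil", "garlic", "rice"],
]

# Extra feature: number of ingredients per recipe.
num_ingredients = [len(items) for items in ingredient_lists]

# CountVectorizer expects one string per document, so each list is joined
# into a single space-separated string (assumed preprocessing).
documents = [" ".join(items) for items in ingredient_lists]

vectorizer = CountVectorizer()           # lowercases text and splits it into word tokens
X = vectorizer.fit_transform(documents)  # sparse matrix of shape (recipes, vocabulary size)

vocabulary = sorted(vectorizer.vocabulary_)
```

Note that this splits multi-word ingredients such as "soy sauce" into separate words. If whole ingredients should be kept as single tokens, one option is to pass the lists directly and give CountVectorizer a callable analyzer that returns each list unchanged.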
- Machine Learning Models:
  - Multinomial Naive Bayes: A simple probabilistic baseline that works directly on ingredient count features.
  - XGBoost: A gradient-boosted decision tree model for classification.
  - CNN (Convolutional Neural Network): A deep learning approach using Keras to predict cuisines based on ingredients.
  - Random Forest: An ensemble of decision trees for predicting cuisines.
For each model, the notebook includes:
- Model training
- Predictions
- Evaluation metrics (accuracy, confusion matrix, classification report)
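The train, predict, and evaluate steps listed above might look like the following sketch, shown here for Multinomial Naive Bayes only. The documents and labels are made up, and the variable names are illustrative rather than taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Made-up, already-joined ingredient strings and labels.
train_docs = [
    "olive oil garlic tomatoes basil",
    "pasta garlic parmesan cheese",
    "corn tortillas avocado lime",
    "black beans corn tortillas salsa",
]
train_labels = ["italian", "italian", "mexican", "mexican"]
test_docs = ["pasta tomatoes basil", "avocado salsa lime"]
test_labels = ["italian", "mexican"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learn the vocabulary on training data only
X_test = vectorizer.transform(test_docs)        # reuse that vocabulary for held-out data

model = MultinomialNB()
model.fit(X_train, train_labels)                # model training
predictions = model.predict(X_test)             # predictions

# Evaluation metrics
accuracy = accuracy_score(test_labels, predictions)
matrix = confusion_matrix(test_labels, predictions, labels=["italian", "mexican"])
report = classification_report(test_labels, predictions, zero_division=0)
```

Fitting the vectorizer on the training split only keeps held-out vocabulary from leaking into the features, which matters more on the real dataset than on this toy example.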
- Model Evaluation:
  - Confusion matrix visualization for each model.
  - Accuracy comparison across different models.
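One way to sketch the accuracy comparison is to fit each classifier on the same features and collect the scores in a dictionary. Only the two scikit-learn models are shown; XGBoost and the Keras CNN are left out to keep the example dependency-light. The data, the hyperparameters, and the names models and accuracies are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up, already-joined ingredient strings and labels.
train_docs = [
    "olive oil garlic tomatoes basil",
    "pasta garlic parmesan cheese",
    "corn tortillas avocado lime",
    "black beans corn tortillas salsa",
]
train_labels = ["italian", "italian", "mexican", "mexican"]
test_docs = ["pasta tomatoes basil", "avocado salsa lime"]
test_labels = ["italian", "mexican"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Hypothetical model settings; the notebook may configure these differently.
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    accuracies[name] = accuracy_score(test_labels, model.predict(X_test))

# Print models from most to least accurate on the held-out recipes.
for name, score in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```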
- Clone the repository:
  git clone https://github.com/yourusername/recipe-ingredients-cuisine-classification.git
  cd recipe-ingredients-cuisine-classification
- Install the required libraries:
  pip install -r requirements.txt
- Run the Jupyter notebook:
  jupyter notebook recipe_ingredients_classification.ipynb
- Follow the instructions in the notebook to explore the dataset and run the models.
- Distribution of Cuisines: Visualizing how the recipes are distributed across different cuisine categories.
- Number of Ingredients: Analyzing how the number of ingredients varies across different cuisines.
- Top 20 Ingredients: A bar plot showing the most common ingredients in the dataset.
- Word Clouds: One word cloud per cuisine showing its most frequent ingredients.
- Confusion Matrices: For each model, a confusion matrix to evaluate the classification performance.
- Hyperparameter Tuning: Use grid search or random search to tune the hyperparameters for each model and improve accuracy.
- Cross-validation: Implement cross-validation to ensure the robustness of the models.
- Deep Learning: Experiment with more complex deep learning architectures like LSTM or Transformers for ingredient-based classification.
- Data Augmentation: Apply data augmentation techniques to expand the dataset and improve model performance.
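As a rough illustration of the hyperparameter tuning and cross-validation ideas above, the sketch below runs a grid search over the smoothing parameter of Multinomial Naive Bayes with stratified 2-fold cross-validation. The data, the parameter grid, and the fold count are assumptions chosen to keep the example small, not settings from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already-joined ingredient strings and labels.
docs = [
    "olive oil garlic tomatoes basil",
    "pasta garlic parmesan cheese",
    "tomatoes mozzarella basil olive oil",
    "pasta tomatoes garlic",
    "corn tortillas avocado lime",
    "black beans corn tortillas salsa",
    "avocado salsa lime cilantro",
    "corn tortillas black beans cilantro",
]
labels = ["italian"] * 4 + ["mexican"] * 4

# Keeping the vectorizer inside the pipeline refits the vocabulary on each
# training fold, so validation folds do not influence the features.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Hypothetical grid over the Naive Bayes smoothing strength.
param_grid = {"classifier__alpha": [0.1, 0.5, 1.0]}

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(docs, labels)

best_alpha = search.best_params_["classifier__alpha"]
best_cv_accuracy = search.best_score_
```

On the real dataset a larger fold count, such as 5, and a wider grid would be more typical. RandomizedSearchCV follows the same pattern for random search.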
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset sourced from Kaggle: Recipe Ingredients Dataset
- Libraries used: pandas, matplotlib, seaborn, scikit-learn, xgboost, keras, tensorflow, wordcloud, nltk, and numpy.
If you have any questions or suggestions, feel free to open an issue or contact the repository owner at [[email protected]].