This project demonstrates the effectiveness of the Transformers library in text classification tasks, specifically applied to news data. The objective is to classify news articles into six predefined categories using state-of-the-art Transformer models.
The project is divided into two main parts:
- Generating Labels with Natural Language Inference (NLI)
- Fine-tuning BERT for Text Classification
The steps for the first part, generating labels with NLI, are:
- Obtain news data from a Git repository.
- Process the data into a DataFrame.
- Clean the data.
- Provide the cleaned data to the `bert-base-nli-mean-tokens` Sentence Transformer model.
- Distribute the news articles into six different categories: "Realistic", "Investigative", "Artistic", "Social", "Enterprising", and "Conventional" based on embeddings and cosine similarity.
- Combine every three sentences of labeled data into a single row along with their respective labels.
- Split the combined categories by commas and expand them into separate columns.
- Generate binary columns for each unique value, where 1 indicates the presence of the category and 0 indicates absence.
- Export the processed data to a single CSV file.
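
The labeling and encoding steps above can be sketched without any dependencies. This is a minimal illustration, not the project's actual code: the toy vectors stand in for the embeddings that `bert-base-nli-mean-tokens` would produce, the function names are hypothetical, and it assumes each sentence receives the single most similar category and that duplicate labels within a three-sentence group are merged.

```python
import math

CATEGORIES = ["Realistic", "Investigative", "Artistic",
              "Social", "Enterprising", "Conventional"]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_label(sentence_vec, category_vecs):
    """Return the category whose embedding is most similar to the sentence."""
    return max(category_vecs,
               key=lambda name: cosine_similarity(sentence_vec, category_vecs[name]))

def combine_in_threes(sentences, labels):
    """Merge every three labeled sentences into one row with comma-joined labels."""
    rows = []
    for i in range(0, len(sentences), 3):
        text = " ".join(sentences[i:i + 3])
        # Assumption: duplicate labels inside a group are collapsed.
        joined = ",".join(sorted(set(labels[i:i + 3])))
        rows.append({"text": text, "labels": joined})
    return rows

def to_binary_columns(rows):
    """Add one 0/1 column per category to each row."""
    for row in rows:
        present = set(row["labels"].split(","))
        for name in CATEGORIES:
            row[name] = 1 if name in present else 0
    return rows

# Toy stand-ins for real embeddings: one axis per category (assumption).
category_vecs = {name: [1.0 if i == j else 0.0 for j in range(6)]
                 for i, name in enumerate(CATEGORIES)}
sentences = ["s1", "s2", "s3", "s4"]
sentence_vecs = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.2, 0.0],
]

labels = [assign_label(vec, category_vecs) for vec in sentence_vecs]
rows = to_binary_columns(combine_in_threes(sentences, labels))
```

With pandas, the last step could likely be replaced by `Series.str.get_dummies(sep=",")` on the comma-joined label column, which produces the same kind of binary columns.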
The steps for the second part, fine-tuning BERT, are:
- Load the labeled data produced in the first part from the CSV file into a DataFrame.
- Split the data into training and testing sets.
- Process the training and testing data using `bert-base-uncased` by tokenizing it with the model's tokenizer and embedding it.
- Build the model by creating layers for each category.
- Apply a Dropout layer to the [CLS] token embedding for each category to introduce regularization and prevent overfitting.
- Use a Dense layer to map the dropout output to a single output value for each category.
- Use a sigmoid activation on each output, so that every category is predicted as an independent binary (present or absent) decision.
- Use binary cross-entropy loss, computed separately for each category's output.
- Train the model.
- Test the model and evaluate accuracy scores for each category.
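
The per-category output, loss, and accuracy described above can be illustrated numerically. This is a dependency-free sketch rather than the project's TensorFlow code: the logits are made-up values standing in for the Dense layer outputs before activation, the function names are hypothetical, and the 0.5 decision threshold is an assumption.

```python
import math

CATEGORIES = ["Realistic", "Investigative", "Artistic",
              "Social", "Enterprising", "Conventional"]

def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one binary decision; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def per_category_loss(logits, targets):
    """Compute the loss of each category head independently for one article."""
    return {name: binary_cross_entropy(targets[name], sigmoid(logits[name]))
            for name in CATEGORIES}

def per_category_accuracy(batch_logits, batch_targets, threshold=0.5):
    """Fraction of articles predicted correctly, reported per category."""
    accuracy = {}
    for name in CATEGORIES:
        correct = sum(
            int((sigmoid(logits[name]) >= threshold) == bool(targets[name]))
            for logits, targets in zip(batch_logits, batch_targets)
        )
        accuracy[name] = correct / len(batch_targets)
    return accuracy

# Two hypothetical articles: raw head outputs and their true labels.
batch_logits = [
    dict(zip(CATEGORIES, [2.0, -1.0, -3.0, 0.5, -2.0, -0.5])),
    dict(zip(CATEGORIES, [-1.5, 1.0, -2.0, 3.0, -0.5, -1.0])),
]
batch_targets = [
    dict(zip(CATEGORIES, [1, 0, 0, 0, 0, 0])),
    dict(zip(CATEGORIES, [0, 1, 0, 1, 0, 0])),
]

losses = per_category_loss(batch_logits[0], batch_targets[0])
accuracy = per_category_accuracy(batch_logits, batch_targets)
```

In Keras, the equivalent head for one category would presumably be a `Dropout` layer followed by `Dense(1, activation="sigmoid")`, compiled with the `"binary_crossentropy"` loss; the exact layer sizes and dropout rate are not stated in this README.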
- Google Colab (Python runtime)
- Transformers library
- Pandas
- Scikit-learn
- TensorFlow
- Clone the repository.
- Run each part of the project sequentially as described in the code or documentation.
- deepakat002 for making such a great tutorial
- Contributors to this Git repository for implementing code for this project.
- The Transformers library by Hugging Face for providing pre-trained models and utilities for natural language processing tasks.