Natural Language Inference for Bahasa Indonesia Using PyTorch

TL;DR: A Decomposable Attention Model implemented from scratch in PyTorch for Natural Language Inference in Bahasa Indonesia, trained on the IndoNLI Dataset with pretrained fastText word embeddings. See NLIIndonesia.ipynb for the full pipeline and results.

Introduction

Welcome to the "Natural Language Inference for Bahasa Indonesia" project! This project focuses on a Natural Language Processing (NLP) task known as Natural Language Inference (NLI) for the Bahasa Indonesia language. NLI involves determining the relationship between a pair of sentences, typically categorized as entailment, contradiction, or neutral. We leverage the IndoNLI Dataset, a valuable resource for NLI in Bahasa Indonesia. The full results are available in the notebook NLIIndonesia.ipynb.
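
To make the task concrete, the sketch below shows what single NLI examples look like as Python dictionaries. The sentences are illustrative only and are not taken from the IndoNLI Dataset; the field names follow the common premise/hypothesis/label convention.

# Illustrative NLI examples (not from IndoNLI); field names follow the
# usual premise/hypothesis/label convention.
examples = [
    {
        "premise": "Seorang anak sedang bermain bola di lapangan.",       # "A child is playing ball on the field."
        "hypothesis": "Seorang anak sedang berolahraga di luar ruangan.", # "A child is exercising outdoors."
        "label": "entailment",
    },
    {
        "premise": "Seorang anak sedang bermain bola di lapangan.",
        "hypothesis": "Lapangan itu kosong.",                             # "The field is empty."
        "label": "contradiction",
    },
    {
        "premise": "Seorang anak sedang bermain bola di lapangan.",
        "hypothesis": "Anak itu bermain bersama teman-temannya.",         # "The child is playing with friends."
        "label": "neutral",
    },
]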

Dataset

The dataset used for this project is the IndoNLI Dataset, which can be downloaded via the link in the References section below. It consists of over 18,000 sentence pairs, with more than 12,000 pairs for training and 5,000 pairs for testing. Each pair is labeled with its relationship (entailment, contradiction, or neutral), making it suitable for training and evaluating NLI models.
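
A minimal loading sketch is shown below. It assumes the downloaded splits are JSON Lines files with premise, hypothesis, and label fields; the file paths and field names are assumptions, so adjust them to match your copy of the dataset.

import json

def load_indonli(path):
    """Read one IndoNLI split stored as JSON Lines into a list of dicts."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            pairs.append(
                {
                    "premise": row["premise"],
                    "hypothesis": row["hypothesis"],
                    "label": row["label"],  # entailment / contradiction / neutral
                }
            )
    return pairs

# Hypothetical path; point this at wherever you placed the downloaded files.
train_pairs = load_indonli("data/train.jsonl")
print(len(train_pairs), train_pairs[0])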

Model

We employ the Decomposable Attention Model, originally introduced by Parikh et al. (2016) for NLI tasks. The model consists of three main stages, attending, comparing, and aggregating, implemented here from scratch in PyTorch. It has proven effective at capturing relationships between sentence pairs, making it a suitable choice for our Bahasa Indonesia NLI task. You can find the original paper detailing this model here.
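
For readers who want a picture of the three stages in code, below is a minimal, self-contained sketch of the attend-compare-aggregate pipeline in PyTorch. The layer sizes, dropout, and module names are illustrative assumptions; the actual implementation lives in NLIIndonesia.ipynb and may differ in detail.

import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, hidden_dim, dropout=0.2):
    """Small feed-forward block reused by all three stages."""
    return nn.Sequential(
        nn.Dropout(dropout), nn.Linear(in_dim, hidden_dim), nn.ReLU(),
        nn.Dropout(dropout), nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
    )

class DecomposableAttention(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim=200, num_classes=3):
        super().__init__()
        # Frozen pretrained embeddings (e.g. fastText vectors for Bahasa Indonesia).
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        embed_dim = embedding_matrix.size(1)
        self.attend = mlp(embed_dim, hidden_dim)          # F in the paper
        self.compare = mlp(2 * embed_dim, hidden_dim)     # G in the paper
        self.aggregate = mlp(2 * hidden_dim, hidden_dim)  # H in the paper
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, premise_ids, hypothesis_ids):
        a = self.embedding(premise_ids)     # (batch, len_a, embed_dim)
        b = self.embedding(hypothesis_ids)  # (batch, len_b, embed_dim)

        # Attend: soft-align every token in a with every token in b.
        e = torch.bmm(self.attend(a), self.attend(b).transpose(1, 2))  # (batch, len_a, len_b)
        beta = torch.bmm(F.softmax(e, dim=-1), b)                      # b aligned to a
        alpha = torch.bmm(F.softmax(e.transpose(1, 2), dim=-1), a)     # a aligned to b

        # Compare: each token against its aligned counterpart.
        v_a = self.compare(torch.cat([a, beta], dim=-1))
        v_b = self.compare(torch.cat([b, alpha], dim=-1))

        # Aggregate: sum over tokens, then classify into the three labels.
        v = torch.cat([v_a.sum(dim=1), v_b.sum(dim=1)], dim=-1)
        return self.classifier(self.aggregate(v))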

Word Embeddings

For word embeddings, we use pretrained fastText embeddings trained for the Bahasa Indonesia language, which can be downloaded from the fastText website. Pretrained embeddings help the model capture the semantics and contextual information of Bahasa Indonesia words, enhancing its performance. Download the pretrained vectors and place them in the pretrained directory.
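
A sketch of turning the downloaded vectors into an embedding matrix is shown below. The file name (cc.id.300.vec) and the vocabulary handling are assumptions; adapt them to the file you downloaded and to how the notebook builds its vocabulary. The resulting tensor can be passed to nn.Embedding.from_pretrained as in the model sketch above.

import numpy as np
import torch

def build_embedding_matrix(vocab, path="pretrained/cc.id.300.vec", dim=300):
    """Map every word in `vocab` (word -> index) to its fastText vector.

    Words missing from the fastText file keep a small random vector.
    """
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line: "<num_words> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            if word in vocab:
                matrix[vocab[word]] = np.asarray(parts[1:], dtype="float32")
    return torch.tensor(matrix)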

Training Results

After training the Decomposable Attention Model on the IndoNLI Dataset, we obtained the following results:

  • Training Loss: 0.368
  • Training Accuracy: 0.832
  • Test Accuracy: 0.441

These results, while a valuable step forward, are lower than those typically reported for models trained on much larger English NLI datasets. The gap is expected given the difference in dataset size: English NLI corpora often exceed 550,000 labeled sentence pairs. Nonetheless, this project represents a meaningful milestone in advancing NLP for Bahasa Indonesia, as NLI applications are diverse and demand further research for this language.

Usage

Install the dependencies with:

conda install --file requirements.txt

Then open and run the notebook NLIIndonesia.ipynb.
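
For orientation, the following is a minimal sketch of the kind of training loop used to produce the numbers above. The optimizer, learning rate, number of epochs, and batching are assumptions, not a transcript of the notebook.

import torch
import torch.nn as nn

# Assumed setup: `model` is the DecomposableAttention module sketched above and
# `train_loader` yields (premise_ids, hypothesis_ids, labels) batches of LongTensors.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for premise_ids, hypothesis_ids, labels in train_loader:
        premise_ids = premise_ids.to(device)
        hypothesis_ids = hypothesis_ids.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        logits = model(premise_ids, hypothesis_ids)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        seen += labels.size(0)
    print(f"epoch {epoch}: loss {total_loss / seen:.3f}, acc {correct / seen:.3f}")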

Challenges and Future Work

The challenges encountered in this project, such as limited dataset size and language-specific nuances, highlight the need for further research and resource development in Bahasa Indonesia NLP. Future work can include:

  • Expansion of the IndoNLI Dataset.
  • Exploring more advanced NLI architectures.
  • Fine-tuning models for specific NLI subtasks.

References and Citation

  • Original Paper on Decomposable Attention Model: Link
  • IndoNLI Dataset: Link
@inproceedings{mahendra-etal-2021-indonli,
    title = "{I}ndo{NLI}: A Natural Language Inference Dataset for {I}ndonesian",
    author = "Mahendra, Rahmad and Aji, Alham Fikri and Louvan, Samuel and Rahman, Fahrurrozi and Vania, Clara",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.821",
    pages = "10511--10527",
}
  • Pretrained Word Embeddings for Bahasa Indonesia: Link
@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

Feel free to explore the code and resources in this repository to gain a deeper understanding of our NLI project for Bahasa Indonesia. If you have any questions or feedback, please don't hesitate to reach out. Thank you for your interest in our NLP research!
