This repository contains the dataset used in our paper: Codeswitched Sentence Creation using Dependency Parsing.
The sentiment data for Hindi and Marathi are provided in the respective directories. We use the code provided in Sub-word-LSTM for training and testing.
Given a sentence, we extract independent phrases using the Stanford Parser. These phrases are translated using Google's On-device NMT and transliterated using Indic trans. The original phrases are then replaced by their translated and transliterated counterparts in such a way that the Code-Mixing Index (CMI) of the resulting codemixed sentence is maximized.
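
The replacement step can be illustrated with a minimal sketch. It is not the code used in the paper. It assumes the commonly used CMI definition, CMI = 100 * (1 - max(w_i) / (n - u)), where n is the number of tokens, u the number of language-independent tokens, and max(w_i) the token count of the dominant language. The function names `cmi` and `best_mix`, the language tags "en", "hi" and "univ", and the brute-force search over phrase subsets are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def cmi(tags):
    """Code-Mixing Index for a list of per-token language tags.

    Tokens tagged "univ" are treated as language-independent
    (an assumed convention for this sketch).
    """
    n = len(tags)
    u = sum(1 for t in tags if t == "univ")
    if n == u:
        return 0.0
    counts = Counter(t for t in tags if t != "univ")
    return 100.0 * (1 - max(counts.values()) / (n - u))


def best_mix(phrases):
    """Pick which phrases to replace so that the sentence CMI is highest.

    phrases: list of (original_tokens, translated_tokens) pairs, in
    sentence order. Every subset of phrases is tried (brute force),
    which is only practical for a handful of phrases per sentence.
    Returns (score, tokens) for the first best-scoring combination.
    """
    best = (-1.0, [])
    for choice in product([False, True], repeat=len(phrases)):
        tokens, tags = [], []
        for (orig, trans), swap in zip(phrases, choice):
            chosen = trans if swap else orig
            tokens.extend(chosen)
            # Hypothetical tags: "hi" for replaced phrases, "en" otherwise.
            tags.extend(["hi" if swap else "en"] * len(chosen))
        score = cmi(tags)
        if score > best[0]:
            best = (score, tokens)
    return best


if __name__ == "__main__":
    # Made-up phrase pairs, for illustration only.
    example = [
        (["i", "love"], ["mujhe", "pasand"]),
        (["this", "movie"], ["yeh", "film"]),
    ]
    print(best_mix(example))
```

Under this definition, a sentence split evenly between two languages scores 50, and a monolingual sentence scores 0, so the search favours replacing roughly half of the tokens.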
We use the test set of Sentiment140 as the original dataset for generating the codemixed dataset.