Create dataset loader for id-en-code-mixed #303

Open
SamuelCahyawijaya opened this issue Oct 2, 2022 · 2 comments

@SamuelCahyawijaya
Member

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_en_code_mixed

Dataset id_en_code_mixed
Description This dataset contains 825 Indonesian-English tweet instances covering four NLP tasks: tokenization, language identification, lexical normalization, and word translation. The data for the lexical normalization task is curated in MultiLexNorm (already in Nusa Catalogue), but the data for the other tasks is not. Tokenizing social media data is not as trivial as splitting tokens on a whitespace delimiter. In this dataset, language identification is performed at token-level granularity.
License CC-BY-NC-SA 4.0
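
To illustrate the tokenization point in the description, here is a minimal sketch comparing a whitespace split with a simple regex tokenizer. The tweet is invented for illustration and is not taken from the dataset; the `TOKEN_PATTERN` regex and the `tokenize` helper are assumptions of this sketch, not the dataset's actual tokenization scheme.

```python
import re

# Hypothetical code-mixed tweet, invented for illustration (not from the dataset).
tweet = "Besok meeting jam 9,jangan telat ya guys!!! @budi #kerja"

# Naive approach: split on whitespace only.
whitespace_tokens = tweet.split()

# Assumed pattern for this sketch: keep mentions and hashtags whole,
# keep runs of word characters together, and split off each punctuation mark.
TOKEN_PATTERN = re.compile(r"[@#]\w+|\w+|[^\w\s]")


def tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


regex_tokens = tokenize(tweet)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split leaves `9,jangan` and `guys!!!` as single tokens, while the regex version separates the punctuation from the words. A real loader should expose the gold tokens shipped with the dataset rather than re-tokenizing the text.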
@VanillaMacchiato
Contributor

#self-assign

@haryoa
Contributor

haryoa commented Dec 20, 2022

#self-assign
