How to preprocess text for CTM? #29

Closed
aneesha opened this issue Sep 3, 2021 · 3 comments

Comments

@aneesha

aneesha commented Sep 3, 2021

I read that CTM uses both the preprocessed text for the BOW representation and the full text for the BERT embeddings. How can I create this as a Dataset for the CTM model? Does saving an OCTIS dataset do this automatically?

Many thanks

@silviatti
Collaborator

Hi,
at the moment there is no way to do that in OCTIS. The modification would not require much effort, but we need to think about the format of the saved file that represents the corpus (currently a .tsv file with some columns that may be optional). It would also require re-preprocessing the available datasets in this new format. Alternatively, the unpreprocessed text could simply be stored in an additional file (similar to the vocabulary file).

I am open to discussing this point.

Silvia
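
For readers following along, here is a minimal sketch of the "additional file" idea, not OCTIS code. It assumes the goal is two index-aligned views of the corpus: a preprocessed one for the BOW and the original text for the contextual embeddings. The names `preprocess`, `build_aligned_corpus`, `save_corpus`, the tiny stop-word list, and the `raw_corpus.txt` file name are all hypothetical, and the single-column `corpus.tsv` written here is a simplification of the real format.

```python
import re
from pathlib import Path

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def build_aligned_corpus(raw_docs):
    """Return (raw, preprocessed) lists that are aligned by index."""
    raw_kept, bow_kept = [], []
    for doc in raw_docs:
        cleaned = preprocess(doc)
        if not cleaned:
            # A document that becomes empty is dropped from BOTH views,
            # otherwise line i of one file would not match line i of the other.
            continue
        # Flatten tabs/newlines so each raw document fits on one line.
        raw_kept.append(re.sub(r"\s+", " ", doc).strip())
        bow_kept.append(cleaned)
    return raw_kept, bow_kept


def save_corpus(folder, raw_docs, bow_docs):
    """Write the two views as sibling files, one document per line."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "corpus.tsv").write_text("\n".join(bow_docs) + "\n", encoding="utf-8")
    (folder / "raw_corpus.txt").write_text("\n".join(raw_docs) + "\n", encoding="utf-8")
```

Keeping the raw text in a separate file leaves the existing corpus file untouched, at the cost of having to guarantee that both files always contain the same documents in the same order.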

@lfmatosm
Contributor

Hi @silviatti. So, if I understand correctly, there is currently no way to load the unpreprocessed corpus documents into OCTIS' CTM while using its optimizer, in a manner similar to what is shown in the standalone CTM's README?

@silviatti
Collaborator

I'm closing this issue because the discussion has moved here: #46
