How to preprocess text for CTM? #29

aneesha · 2021-09-03T05:51:02Z

I read that CTM uses both the preprocessed text (for the BOW) and the full text (for the BERT embedding). How can I create a Dataset for the CTM model that contains both? Does saving an OCTIS dataset do this automatically?

Many thanks

silviatti · 2021-09-08T12:04:54Z

Hi,
at the moment there is no way to do that in OCTIS. The modification would not require much effort, but we need to think about the format of the saved file that represents the corpus (currently a .tsv file in which some columns are optional). Changing that format would also require re-preprocessing the available datasets. Alternatively, the unpreprocessed text could be stored in an additional file (similar to the vocabulary file).
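To make the "additional file" option concrete, here is a minimal sketch in plain Python, not OCTIS code. It assumes the raw documents are stored one per line in a sidecar file that is line-aligned with the corpus file. The file names `corpus.tsv` and `raw_corpus.txt`, and the helpers `save_with_raw_text` and `load_pairs`, are hypothetical and chosen only for illustration; the real corpus file may carry further optional columns.

```python
import csv
from pathlib import Path


def save_with_raw_text(folder, preprocessed_docs, raw_docs):
    """Write the preprocessed corpus and a line-aligned file of raw documents.

    preprocessed_docs: list of token lists (text for the BOW).
    raw_docs: list of original strings (text for the contextual embedding).
    File names are hypothetical, not an OCTIS convention.
    """
    if len(preprocessed_docs) != len(raw_docs):
        raise ValueError("preprocessed and raw documents must be aligned 1:1")
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # One preprocessed document per row, single column.
    with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for tokens in preprocessed_docs:
            writer.writerow([" ".join(tokens)])

    # Sidecar file: collapse all whitespace (including newlines) so that
    # line i of this file always corresponds to row i of corpus.tsv.
    with open(folder / "raw_corpus.txt", "w", encoding="utf-8") as f:
        for doc in raw_docs:
            f.write(" ".join(doc.split()) + "\n")


def load_pairs(folder):
    """Return a list of (preprocessed_text, raw_text) tuples."""
    folder = Path(folder)
    with open(folder / "corpus.tsv", newline="", encoding="utf-8") as f:
        bow_texts = [row[0] for row in csv.reader(f, delimiter="\t")]
    with open(folder / "raw_corpus.txt", encoding="utf-8") as f:
        raw_texts = [line.rstrip("\n") for line in f]
    if len(bow_texts) != len(raw_texts):
        raise ValueError("corpus and raw text files are out of sync")
    return list(zip(bow_texts, raw_texts))
```

With this layout the existing datasets would not need to be re-preprocessed into a new format. The trade-off is that documents removed during preprocessing must also be removed from the sidecar file, otherwise the two files drift out of alignment.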

I am open to discussing this point.

Silvia

lfmatosm · 2021-11-21T19:08:00Z

Hi @silviatti. So, if I understand correctly, there is currently no way to load the unprocessed corpus documents into OCTIS's CTM while using its optimizer, the way the standalone CTM's README does?

silviatti · 2021-12-09T09:02:50Z

I'm closing this issue because the discussion has moved here: #46

lfmatosm mentioned this issue Nov 29, 2021

Loading unprocessed corpus documents with CTM and Optimizer #46

Open

silviatti closed this as completed Dec 9, 2021
