Filtering and cleaning google news "processed text" field #182
-
Hello guys, I love obsei, first of all! I would like to know if there is a way to get cleaner text (i.e. the article body) when using the Google News source. The returned "processed text" field comes with lots of links and some parts of the text that I would like to remove. I do not know if this can be done with obsei or if I have to use an extra tool for that task. Thanks anyway, All the best,
Replies: 1 comment 5 replies
-
@edumagol Thank you for the kind words.
Yes, we have some basic cleaning capabilities. Currently they do not include removing hyperlinks, but adding a single cleaner function to obsei for this would not take much time.
Please refer to the following links:
Tutorial to showcase use of cleaner - https://colab.research.google.com/github/obsei/obsei/blob/master/tutorials/04_GoogleNews_Cleaner_Splitter_Classification_Aggregator.ipynb
All the current supported cleaner functions: https://github.com/obsei/obsei/blob/master/obsei/preprocessor/text_cleaning_function.py
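Until such a cleaner function exists, hyperlinks can be stripped from the processed text with a small standalone helper. The sketch below is only an illustration and is not part of obsei: `URL_PATTERN` and `remove_hyperlinks` are hypothetical names, and the regular expression is a simple assumption about what counts as a link (anything starting with `http://`, `https://` or `www.` up to the next whitespace).

```python
import re

# Assumption: a "link" is any run of non-whitespace characters that starts
# with http://, https:// or www. This is a heuristic, not a full URL parser.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def remove_hyperlinks(text: str) -> str:
    """Hypothetical helper: drop URLs, then collapse leftover whitespace."""
    without_links = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_links).strip()


if __name__ == "__main__":
    sample = "Read more at https://example.com/a?b=1 today"
    print(remove_hyperlinks(sample))
```

Note that punctuation attached directly to a link (for example a trailing period) is removed together with it, so the pattern may need tuning for your articles.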