How are texts with "dont", etc. handled? #26
Comments
I think that if we start adding commonly misspelled words, we'd need to add many of the other commonly misspelled words as well, at which point performance may become a problem (considering that a large regular expression is used to process the text). Many online web scrapers already use the library, and I'm sure users won't be happy if there is a sudden, unexpected drop in performance after performing a composer upgrade.
That being said, one solution could be to have two sets of language files for each language. The first would contain the common stop words, as it currently does, and would be used by default. The second set could contain the original stop words plus an extended set of stop words, such as commonly misspelled words. That way a user can choose to use the extended set by specifying the language.
If the problem is serious enough and performance isn't that much of a concern, you can already do this. Copy the
After that, simply load your own custom
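For illustration, the two-list idea described above could be sketched roughly as follows. This is a language-agnostic sketch written in Python, not the library's actual API or file format: the word lists, `load_stopwords`, `build_pattern`, and `candidate_phrases` are all hypothetical names, and punctuation handling is left out to keep it short.

```python
import re

# Hypothetical word lists for illustration only; not the library's real language files.
DEFAULT_STOPWORDS = {"i", "don't", "the", "a", "to", "is", "and"}
EXTENDED_EXTRAS = {"dont", "doesnt", "isnt"}  # common apostrophe-less variants


def load_stopwords(extended=False):
    """Return the default list, or the default list plus extras when requested."""
    words = set(DEFAULT_STOPWORDS)
    if extended:
        words |= EXTENDED_EXTRAS
    return words


def build_pattern(stopwords):
    """Compile a single alternation regex that matches whole stop words only."""
    ordered = sorted(stopwords, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # The lookarounds stop "i" from matching inside "like" or "don't".
    return re.compile(r"(?<![\w'])(?:" + alternation + r")(?![\w'])")


def candidate_phrases(text, extended=False):
    """Split lower-cased text at stop words and return the remaining phrases."""
    pattern = build_pattern(load_stopwords(extended))
    pieces = pattern.split(text.lower())
    return [" ".join(p.split()) for p in pieces if p.strip()]


print(candidate_phrases("I dont like the old house"))
print(candidate_phrases("I dont like the old house", extended=True))
```

With the default list, "dont" survives into the first phrase; with the extended list it is treated as a stop word. Because the extended list is opt-in, the default pattern stays as small as it is today.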
Yeah, I can see it would expand quite a bit. I've opted to replace some cases before sending the text to RakePlus. The idea of two separate lists is neat, as it would give users a choice. Do you think this is something you would want in general?
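A pre-processing step like the one mentioned above (replacing some cases before the text reaches the extractor) might look roughly like this. It is only a sketch in Python under stated assumptions: the `CONTRACTION_FIXES` mapping and `normalize_contractions` are hypothetical, and the exact replacements actually used are not given in this thread.

```python
import re

# Hypothetical mapping; extend it with whichever variants occur in your texts.
CONTRACTION_FIXES = {
    "dont": "don't",
    "doesnt": "doesn't",
    "didnt": "didn't",
    "isnt": "isn't",
}

# One pattern that matches any of the variants as a whole word, case-insensitively.
_VARIANTS = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in CONTRACTION_FIXES) + r")\b",
    re.IGNORECASE,
)


def normalize_contractions(text):
    """Rewrite apostrophe-less variants so an unchanged stop word list catches them."""
    def fix(match):
        word = match.group(0)
        fixed = CONTRACTION_FIXES[word.lower()]
        # Keep a leading capital, e.g. "Dont" becomes "Don't".
        return fixed.capitalize() if word[0].isupper() else fixed

    return _VARIANTS.sub(fix, text)


print(normalize_contractions("Dont worry, it doesnt matter"))
```

The normalized text can then be passed to the keyword extractor as usual. Variants that are also real words (for example "wont") are deliberately left out of the mapping, since replacing them blindly would change the meaning.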
Yes, I'm sure it would be helpful to have an extended set of stop words, and perhaps I can add it in the next release. I do think, however, that it will still not be enough, and perhaps in the future a better text processing library can use some clever A.I. trickery to improve both the speed and the end results of what this library achieves.
Using AI or similar to identify typos sounds like next level and probably
won't happen any time soon, I guess.
I wouldn't actually include typos (such as "huose" instead of "house") for
now. That is too much. Only common variations that are actually used frequently
(e.g. "dont" instead of "don't"). Otherwise the list will be massive.
On Wed, 4 Aug 2021 at 13:57, Don Schoeman ***@***.***> wrote:
Do you think this is something you would want in general?
Yes, I'm sure it would be helpful to have an extended set of stop words, and
perhaps I can add it in the next release. I do think, however, that it will
still not be enough, and perhaps in the future a better text processing
library can use some clever A.I. trickery to improve both the speed and the
end results of what this library achieves.
There is already AI for exactly this type of thing: Google "BERT for extractive text summarization". The problem is getting hold of the trained datasets, plus the additional complexity of setting up and interacting with external/non-PHP AI-based libraries on your servers. In fact, when it comes to using AI to solve this problem, we are probably going to have to use some kind of paid online service. Unless someone provides this kind of service for free, most of us will have to make do with libraries such as RakePHP and others in the meantime.
Yeah, sure, there are services/APIs for this. I'm just not sure if this is something I would use with the package. I prefer to keep it local for performance and privacy reasons.
Hello @Donatello-za,
I was wondering what you think is the correct approach to handling texts with incorrect spelling, such as "dont" instead of "don't". "Dont" isn't filtered out and ends up in the keywords, while "don't" is filtered. I feel "dont" should be included in the stop words to improve the keyword extraction.
Cheers,
Peter