How are texts with "dont", etc. handled? #26

spekulatius · 2021-08-04T09:45:25Z

I was wondering what you think is the correct approach to handling texts with incorrect spelling, such as "dont" instead of "don't". "Dont" isn't filtered out and ends up in the keywords, while "don't" is filtered. I feel "dont" should be included in the stop words to improve the keyword extraction.

Cheers,
Peter

Donatello-za · 2021-08-04T11:25:28Z

I think that if we start adding commonly misspelled words we'd need to add many of the other commonly misspelled words as well, at which point performance may become a problem (considering that a large regular expression is used to process the text). Many online web-scrapers use the library already and I'm sure users won't be happy if there is a sudden unexpected drop in performance after performing a composer upgrade.

That being said, one solution could be to have two sets of language files for each language. The first would contain the common stop words as it currently does and would be used by default. The second set could contain the original stop words plus an extended set of stop words, such as commonly misspelled words.

That way a user can then choose to use the extended set by specifying the language .pattern file or .php file manually (as shown in the docs).
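As a library-agnostic illustration of the two-list idea, the following Python sketch builds one regular expression from either a default or an extended word list. It is not RakePlus code: the word lists and the function names (`build_stop_word_pattern`, `remove_stop_words`) are hypothetical and only meant to show the shape of the approach.

```python
import re

# Hypothetical word lists for illustration only; these are NOT the
# contents of the library's en_US language files.
BASE_STOP_WORDS = ["a", "the", "is", "don't", "can't"]
EXTENDED_EXTRAS = ["dont", "cant"]  # common apostrophe-less variants


def build_stop_word_pattern(words):
    """Compile one case-insensitive regex that matches any of the words."""
    # Sort longest first so a longer word is tried before a shorter one.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # The lookarounds stop a stop word from matching inside a longer token.
    return re.compile(r"(?<![\w'])(?:" + alternatives + r")(?![\w'])", re.IGNORECASE)


def remove_stop_words(text, extended=False):
    """Return the remaining words, using the default or the extended list."""
    words = BASE_STOP_WORDS + (EXTENDED_EXTRAS if extended else [])
    pattern = build_stop_word_pattern(words)
    return [w for w in re.split(r"\s+", pattern.sub(" ", text).strip()) if w]
```

With the default list, "dont" survives as a candidate keyword; with `extended=True` it is filtered like "don't". The extended pattern is larger, which is where the performance trade-off mentioned above comes from.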

If the problem is serious enough and performance isn't that much of a concern, you can already do this. Copy the lang/en_US.pattern and lang/en_US.php files to your own directory and simply add the additional words you'd like to have. Perhaps look at this Wikipedia page.

Tip: You can add the additional words to your copy of the en_US.php file first and then use the /console/extractor.php tool to create a new custom en_US.pattern file for it.

After that, simply load your own custom .php or .pattern file when creating the new instance of the RakePlus class, as shown in Example 5.

spekulatius · 2021-08-04T11:46:30Z

Yeah, I can see it would expand quite a bit. I've opted to replace some cases before sending it to RakePlus. The idea with two separate lists is neat as it would bring a choice. Do you think this is something you would want in general?
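The pre-processing workaround mentioned here (replacing some cases before the text reaches the extractor) could be sketched roughly as follows. This is an assumption about what such a step might look like, not the code actually used; the mapping `CONTRACTION_FIXES` and the function `restore_apostrophes` are hypothetical names.

```python
import re

# Hypothetical mapping of apostrophe-less variants to their correct forms.
# Only unambiguous cases are listed; "wont" and "cant", for example, are
# also real words and are left out on purpose.
CONTRACTION_FIXES = {
    "dont": "don't",
    "doesnt": "doesn't",
    "didnt": "didn't",
    "isnt": "isn't",
}

_VARIANTS = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in CONTRACTION_FIXES) + r")\b",
    re.IGNORECASE,
)


def restore_apostrophes(text):
    """Rewrite known apostrophe-less variants before keyword extraction."""

    def fix(match):
        word = match.group(0)
        fixed = CONTRACTION_FIXES[word.lower()]
        # Keep a leading capital so sentence starts still look right.
        return fixed.capitalize() if word[0].isupper() else fixed

    return _VARIANTS.sub(fix, text)
```

The normalized text can then be handed to the keyword extractor, whose default stop word list already covers the correctly spelled forms.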

Donatello-za · 2021-08-04T11:56:56Z

> Do you think this is something you would want in general?

Yes, I'm sure it would be helpful to have an extended set of stop words, and perhaps I can add it in the next release. I do think, however, that it will still not be enough, and perhaps in the future a better text processing library can use some clever A.I. trickery to improve both the speed and the end results of what this library achieves.

spekulatius · 2021-08-04T12:32:23Z

Using AI or similar to identify typos sounds like next level and probably won't happen any time soon, I guess. I wouldn't actually include typos (such as "huose" instead of "house") for now. That is too much. Only common variations that are actually used frequently (e.g. "dont" instead of "don't"). Otherwise the list will be massive.

Donatello-za · 2021-08-04T12:50:07Z

> Using AI or similar to identify typos sounds like next level and probably won't happen any time soon I guess.

There is already AI for exactly this type of thing; Google "BERT for extractive text summarization". The problem is getting hold of the trained datasets and the additional complexity of setting up and interacting with external/non-PHP AI-based libraries on your servers. In fact, when it comes to using AI to solve this problem, we are probably going to have to use some kind of paid online service. Unless someone provides this kind of service for free, most of us will have to make do with libraries such as RakePHP and others in the meantime.

spekulatius · 2021-08-04T13:35:44Z

Yeah, sure, there are services/APIs for this. I'm just not sure if this is something I would use with the package. I prefer to keep it local for performance and privacy reasons.
