comprehensive french tokenizer without exceptions list #13378
Open: thjbdvlt wants to merge 12 commits into explosion:master from thjbdvlt:quelquhui
I apologize for all these failed tests! This is the first time I have contributed to a project (I'm not a programmer: I study French literature), and I only just understood that I could run these tests myself. They no longer fail. Sorry again, and thanks for having a look at my pull request :)
The current French tokenizer doesn't handle hyphens and apostrophes very well. It relies on a gigantic list (15600 entries) of hyphenated words that must not be split on the hyphen. This list is not only huge (full of village names such as Minaucourt-le-Mesnil-lès-Hurlus or Beaujeu-Saint-Vallier-Pierrejux-et-Quitteur), but also very incomplete.

The list has no chance of ever becoming exhaustive, because the number of French common nouns and proper names that contain a hyphen and must not be split by the tokenizer is virtually infinite. In French the hyphen is called trait d'union (literally, "line of union"): it unifies, joining separate words into one semantic word (and one token). For example, the verb porter (to carry) produces nouns such as porte-clé (a thing used to carry keys) and porte-manteau, and new words can be coined this way at will (with porter or any other word). There is also inclusive language (relecteur-rice-s). And of course there are names of people and places, which often contain hyphens, combining existing names or words into new and longer names.

On the other hand, there are cases where a hyphen must split a string into two words. These cases are easily handled with a simple regex because, unlike the unbounded exceptions, they are not very diverse: a) verb-subject inversion where the subject is a pronoun; b) verb-object forms where the object is a pronoun; for a total of 21 words (suffixes).

This pull request replaces the tokenizer exceptions with a new 're_infixes' function that easily handles each of the 15600 exceptions, and many more. It reverses the rule-exception relation: rule = keep words containing a hyphen as one token; exception = split a word containing a hyphen if the hyphen is followed by one of the registered words (a pronominal subject or object).
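To make the reversed rule concrete, here is a minimal sketch in plain Python (standard library only). It is not the code from this pull request: the function name `split_hyphenated`, the pronoun list (an illustrative subset, not the 21 registered suffixes), the handling of the euphonic "-t-", and the choice to keep the hyphen attached to the pronoun are all assumptions made for illustration.

```python
import re

# Hypothetical, illustrative pronoun list; NOT the PR's actual 21 suffixes.
PRONOUNS = (
    "je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles", "ce",
    "moi", "toi", "lui", "leur", "le", "la", "les", "en", "y",
)
PRON = "|".join(sorted(PRONOUNS, key=len, reverse=True))

# Zero-width split point: just before a hyphen that starts a chain of
# pronouns running to the end of the word (optionally with a euphonic "t-").
# The lookbehind avoids a second split inside "-t-il".
SPLIT_POINT = re.compile(
    rf"(?<!-t)(?=-(?:t-)?(?:{PRON})(?:-(?:{PRON}))*$)",
    re.IGNORECASE,
)


def split_hyphenated(word):
    """Rule: keep a hyphenated word whole.
    Exception: split before each trailing '-pronoun' suffix."""
    return [part for part in SPLIT_POINT.split(word) if part]


for w in ("porte-clé", "Minaucourt-le-Mesnil-lès-Hurlus", "donne-le-moi", "va-t-il"):
    print(w, "->", split_hyphenated(w))
```

Because the pronoun chain must run to the end of the word, the "le" inside Minaucourt-le-Mesnil-lès-Hurlus does not trigger a split, while donne-le-moi yields three pieces. One known limitation of this sketch: a noun such as rendez-vous ends in a pronoun and would be split, so a real implementation needs some way to handle such cases.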