Plural is not chosen over similar word #41
Comments
I am also getting a fuzzy result preferred over an exact match if the list contains similar names.
Ok, so here's what I found out. If I'm using SpaczzRuler with a reasonably large (~1000) list of items where some are similar words to others, it often finds something that is not what I expect.
I understand fuzzy matching can't always handle things perfectly, but this is a bit extreme... So, questions:
Thanks a lot for this package, looking forward to understanding how to deal with these issues!
Hi @brunobg. It's exciting to see more interest in spaczz! As I mentioned to you in issue #40, I am close to a new release for spaczz adding a token-based matcher, hopefully addressing the bug in issue #40, and a few other changes. After that is complete I will dig into your questions here. Just wanted you to know that I've seen this and will definitely address your issues in the near future.
Thanks @gandersen101! I'm willing to test your new code and help debug any issues, so ping me if you think it's stable enough for that. Really looking forward to the new version.
Hey @brunobg, the code in master is stable so feel free to play around with it. There is a decent bit of refactoring and new features in there but most of the preexisting components behave mostly the same. I'm hoping to have the next release out tomorrow, Jan 19th, towards the end of the day EST, but it could take another couple days.
@gandersen101 I tested it, but I'm still getting very weird results, particularly with exact matches being missed and other matches returned instead, even with single words. My idea of using spaczz is to handle typos for words that are not in dictionaries, so a spell checker would be useless. I have a list of around 2k terms in 8 categories. I'm happy to help with testing and bug fixing, but I'm a bit worried about these failures. Do you think spaczz could be used in production at this time?
Hey @brunobg, so I hadn't actually had a lot of time to explore your questions here yet, so the code in master shouldn't produce results much different from what you're seeing in the latest release. I will caution you that spaczz is a one-person passion project that is still pre v1.0.0. Any production use should strictly be at your own risk. I use spaczz in a batch process at my job and haven't run into issues yet, but obviously bugs/issues will pop up as more people use spaczz and put it through more stress and different use-cases.

I did have a little time to play with some of your examples this morning, and I am a little confused by some of the issues you are facing because I am either seeing different results or spaczz is working as intended. I am using the code on master for these tests.

Test 1: Blank English Model, No Kwarg/Default Modifications

import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.blank("en")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Eggo Riesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "COUNTRY", "pattern": "Argentina", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Egito. And a one two"""
)
print([ent for ent in doc.ents])
The fuzzy matcher will find the highest-ratio fuzzy match for each pattern. Sometimes patterns will overlap when you're looking through many patterns (if you're searching for "Riesling" and "Cape Riesling", "Riesling" is part of both of those patterns), and as long as a match meets the ratio cutoff, the SpaczzRuler keeps the longest one. So in the above example we see that "Cape Riesling" has a match ratio >= 75 to "Eggo Riesling", so it is matched because it is longer than just "Riesling". We can enforce stricter matching by modifying the min_r2 kwarg for that pattern.

Test 2: Blank English Model with Kwarg Modifications

import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.blank("en")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "COUNTRY", "pattern": "Argentina", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Egito. And a one two"""
)
print([ent for ent in doc.ents])
"Cape Riesling" is no longer matched from "Eggo Riesling" but the "Riesling" part of "Cape Riesling" matches the "Riesling" pattern. A couple other tips based on your questions. If you want to prioritize exact matches via the spaCy Lastly be careful with the statistical NER model matches when testing spaczz matches. For example if I use the medium English model instead of a blank English model in the example we've been using: Test 3: Medium English Model with Kwarg Modifications import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.load("en_core_web_md")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "COUNTRY", "pattern": "Argentina", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Egito. And a one two"""
)
print([(ent, ent._.spaczz_span) for ent in doc.ents])
"Egito" is picked up by the model's NER component but is not a match via spaczz. If you intend to use spaczz as a kind of "spell-checker" I think you'll get better results by turning up the As a side note Lastly, the token-based matcher that I will be adding to v0.4.0 will have a syntax similar to spaCy's |
Wow, thank you for the long and detailed reply. This is incredibly helpful. I need to go back to my code now and modify it with your hints, then I'll reply with my findings. A couple of remarks:
Thanks for the great project. I'll try to devote some time to help you. One-person projects are the bread and butter of open source.
You're welcome @brunobg! I tried using the pt_core_news_sm model with the example below.

import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.load("pt_core_news_sm")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "GPE", "pattern": "Egito", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Argentina. And a one two"""
)
print([(ent, ent._.spaczz_span) for ent in doc.ents])
The model's NER component thinks "They sell many" is an entity, and it doesn't capture "one two", but I didn't get a match on "Argentina" for "Egito". The fuzzy matching doesn't match the pattern "one-two" to "one two" in this case because the pt_core_news_sm model does not tokenize "one-two" into three tokens but rather one. By increasing the flex value (set to "max" in the example below), we can widen the search window and recover that match. Decreasing the flex value (min of 0) is another way to control the quality of matches at the expense of missing some potential matches (as we just saw).

import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.load("pt_core_news_sm")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "GPE", "pattern": "Egito", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy", "kwargs": {"flex": "max"}},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Argentina. And a one two"""
)
print([(ent, ent._.spaczz_span) for ent in doc.ents])
Lastly, I wanted to provide you with a little intuition for how the FuzzyMatcher produces matches, of which the SpaczzRuler ultimately picks the longest (along with matches from the regex and token matchers, making sure there are no duplicates). So, in your use case, if you prefer prioritizing ratio over length of matches, you could write a callback to incorporate that functionality. As I mentioned before, though, you will likely miss out on good, longer matches if part of the phrase matches perfectly. Perhaps some sort of weighting between match ratio and length would give the best results for your use case. I am going back to using a blank English model and getting rid of kwargs for the sake of illustration.

import spacy
from spaczz.matcher import FuzzyMatcher
nlp = spacy.blank("en")
matcher = FuzzyMatcher(nlp.vocab)
wines = ["Goldriesling", "Riesling", "Eggo Riesling", "Pinot Noir"]
matcher.add("WINE", [nlp(wine) for wine in wines])
matcher.add("GPE", [nlp("Egito")])
matcher.add("TEST", [nlp("one-two")])
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Argentina. And a one two"""
)
for match_id, start, end, ratio in matcher(doc):
    print((doc[start:end], start, end, ratio, match_id))

This produces the following:

(Rieslings, 3, 4, 76, 'WINE') # Match from "Goldriesling"
(Rieslings, 3, 4, 94, 'WINE') # Match from "Riesling"; the higher ratio would be written to custom attributes in the SpaczzRuler
(Pinot Nero, 9, 11, 80, 'WINE') # Match from "Pinot Noir"
(Pinot Noir, 12, 14, 100, 'WINE') # match from "Pinot Noir"
(Cape Riesling, 17, 19, 77, 'WINE') # Longer match from "Eggo Riesling" but we see it barely makes the 75% cutoff.
(Riesling, 18, 19, 100, 'WINE') # Match from "Riesling" but the SpaczzRuler will pick the above because it is longer and makes the cutoff.
(Riesling, 18, 19, 80, 'WINE') # Match from "Goldriesling"
(one two, 27, 29, 86, 'TEST') # Match from "one-two"

I'm sure there are opportunities to improve the matching logic and/or provide more options, but hopefully that helps for now. I will also note that it is very hard for me to test and code for other languages as I unfortunately only speak English. If you find that Portuguese (or any other language) needs its own set of considerations, I will need help in accounting for them.
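For illustration, here is a small post-processing sketch (plain Python over the (label, start, end, ratio) tuples the matcher yields, not spaczz API) of the "prioritize ratio over length" idea: keep the highest-ratio matches and drop anything overlapping them.

def best_by_ratio(matches):
    # matches: (label, start, end, ratio) tuples as yielded by matcher(doc)
    kept = []
    for label, start, end, ratio in sorted(matches, key=lambda m: -m[3]):
        # keep this match only if it does not overlap an already-kept span
        if all(end <= s or start >= e for _, s, e, _ in kept):
            kept.append((label, start, end, ratio))
    return sorted(kept, key=lambda m: m[1])

matches = [("WINE", 3, 4, 76), ("WINE", 3, 4, 94),
           ("WINE", 17, 19, 77), ("WINE", 18, 19, 100)]
print(best_by_ratio(matches))  # -> keeps only the 94 and 100 ratio spans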
Thanks. I think the mismatches I reported are caused by a mix of the pt model and the size of the pattern list. It's probably not that easy to reproduce in a small, contained test.

I am a bit worried about the performance of spaczz for practical use. I could help you profile and see if there are obvious optimizations, because the ~100x slowdown becomes unfeasible (I was expecting up to ~10x, which would still be fast enough for me; but a 3k-character text is taking around 30 seconds with my list of ~1k patterns). I'd need to understand your code to suggest improvements and that might take a while; I don't really have time to develop another open-source project, but I can help with tests, suggestions and minor fixes.

But here are some thoughts about that. Could you tell me roughly what the algorithm in spaczz is? My guess is that most of the time comes from comparing neighbor tokens to see if they match. So in theory it seems that if I accept that the tokenization has no errors and all entities are either one word long or won't mix and match with others (so I can never have "Cape Riesling" found for "Riesling"), that could be a good optimization at the expense of completely missing a few rarer mistakes (it'd never catch "Gold riesling" if you add "Goldriesling" as a pattern). But it'd work well enough for me if it's fast. It'd be something like this:
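A minimal sketch of that per-token idea, assuming single-word patterns and using rapidfuzz's process.extractOne as the hypothetical nearest_match() step (this is illustrative, not spaczz code):

from rapidfuzz import fuzz, process

patterns = {"Goldriesling": "WINE", "Riesling": "WINE", "Argentina": "COUNTRY"}

def match_tokens(tokens, patterns, cutoff=75):
    results = []
    for i, token in enumerate(tokens):
        # "nearest_match()": the best-scoring pattern for this token, or None
        best = process.extractOne(token, patterns.keys(),
                                  scorer=fuzz.ratio, score_cutoff=cutoff)
        if best is not None:
            pattern, ratio, _ = best
            results.append((i, token, pattern, patterns[pattern], ratio))
    return results

print(match_tokens("They sell many Rieslings".split(), patterns))
# e.g. [(3, 'Rieslings', 'Riesling', 'WINE', 94.1...)]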
It's a little bit more complicated than that for multi-word entities, but you just match the first term and, if it works, you check the next one, etc. Now, I don't know the complexity and performance of the nearest_match() algorithm you're using (hopefully it's O(log n) instead of O(n); what is it?), but if I were implementing something like spaczz that'd be my first attempt. This is much simpler than what you're currently doing, of course. So, is this something you already tried? Or is this something that can be done with spaczz right now by changing its settings?
Short answer for now as I am trying to wrap up v0.4.0, and the potential changes you're asking for could be patch version changes after that. A combination of ideas you've presented and settings tweaking could definitely optimize things considerably. I'll give a more detailed update later today or tomorrow. Attempting some of these optimizations will be a priority for me after the next release.
Right. Look, I could perhaps help by adding a more "real life" example: load patterns (a list of country names could work well, ~200 entries, easy to find a good one) and a few longer texts. This would be useful as a harder test to complement the unit ones, since it's more likely to catch a weird combination like #40, and it could also serve as a basic benchmark. Is it something you'd like? If so, let me know about it. I think it could go in an examples directory.
Hey @brunobg, that would be much appreciated and helpful. An examples directory sounds like a good place for it.

As far as the computational complexity of the search algorithm I'm using, I'm a little embarrassed to say I'm not sure. I'm familiar with the concepts of O notation, but I only started coding a few years ago for a University Data Science program. Most of my engineering skills have been picked up on my own or on the job. What I do know, now that I've been thinking through the algorithm more, is that there are a few glaring inefficiencies I can pretty easily address.

Let's take the pattern "United Kingdom", for example. The search, by default, takes two steps. First it iterates through the doc in spans the length of the pattern, fuzzy matching each span against the pattern. Then, for promising spans, it "flexes" the span boundaries by up to the flex number of tokens, looking for a better match. The algorithm would work fastest with the flex value set to 0, because the second step is skipped entirely. In the case of flex > 0, there are three optimizations I could make.
After these modifications, spaczz still won't be spaCy-fast, but I think we'll see a very significant improvement.
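To make the two-step search above concrete, here is a simplified sketch (illustration only, not spaczz's actual implementation; rapidfuzz's fuzz.ratio stands in for the fuzzy ratio, and treating min_r1 as the first-pass threshold is an assumption):

from rapidfuzz import fuzz

def fuzzy_search(tokens, pattern_tokens, min_r1=25, min_r2=75, flex=1):
    """Step 1: scan the doc in windows the length of the pattern.
    Step 2: for windows scoring at least min_r1, flex the boundaries by up
    to `flex` tokens and keep the best span if it reaches min_r2."""
    pattern = " ".join(pattern_tokens)
    n = len(pattern_tokens)
    matches = set()
    for i in range(len(tokens) - n + 1):
        r1 = fuzz.ratio(pattern, " ".join(tokens[i:i + n]))
        if r1 < min_r1:
            continue  # not promising enough to refine further
        best = (i, i + n, r1)
        for start in range(max(0, i - flex), i + flex + 1):
            for end in range(max(start + 1, i + n - flex),
                             min(len(tokens), i + n + flex) + 1):
                r = fuzz.ratio(pattern, " ".join(tokens[start:end]))
                if r > best[2]:
                    best = (start, end, r)
        if best[2] >= min_r2:
            matches.add(best)
    return sorted(matches)

print(fuzzy_search("I am from the Untied Kingdom".split(), ["United", "Kingdom"]))
# -> [(4, 6, 92.85...)]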
I suggest you read up briefly on algorithmic complexity. This is the most important thing when optimizing code: if the algorithm is bad, no good implementation will be able to compete with a slower implementation of a better algorithm. I think your suggestions are very sound and should make things much faster. spaczz will always be slower than spaCy, since a fuzzy comparison will always be slower, but it shouldn't be so slow as to be unusable in practice. I'm a bit swamped here, but if you want to keep the discussion going I'll be happy to help. I'll see if I can send that example to help with benchmarks.
PR with examples in #47.
@brunobg thank you for the examples in #47. I will pull that soon and start experimenting with the examples and using them as benchmarks. I will also try to spend some time on algorithmic complexity to hopefully further improve spaczz's speed. I am about to release v0.4.1, which incorporates some of the optimizations I discussed above.
See the performance section in the readme for parameter changes that will further increase speed.
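For example (an illustrative guess at the kind of parameter changes meant, based on the flex / min_r2 discussion in this thread rather than the readme itself), tighter pattern-level kwargs trade recall for speed:

patterns = [
    # flex 0 skips the boundary-flexing second step (faster, but misses
    # matches whose token span length differs from the pattern's);
    # min_r2 90 only keeps very close final matches.
    {"label": "WINE", "pattern": "Riesling", "type": "fuzzy",
     "kwargs": {"flex": 0, "min_r2": 90}},
]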
Hi @brunobg, I know it's been a little while but I wanted to provide some updates. I reformatted some of the examples you provided, but the core of them remains the same, and they helped me identify a few issues in the SpaczzRuler. I believe I took care of most of the quirks in the SpaczzRuler's matching process in v0.4.2. There are additional improvements to be made, but I think I can break these into multiple smaller issues.
Let me know if there is anything I am missing. If not, I would like to close this and work on the above in more focused issues. Thanks for your help and interest! P.S. I'll provide some more information on your issue #45 separately soon.
Hey! It all looks great. The documentation on the example is very good and clear. Thank you and congratulations!
How to reproduce: fuzzy pattern with "Goldriesling", "Riesling", default fuzzy_func. Search on a phrase like "They sell many Rieslings". Found: "Goldriesling". Expected: "Riesling".
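A runnable version of this reproduction, using the FuzzyMatcher as in the examples above (exact ratios may vary by spaczz version):

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
matcher = FuzzyMatcher(nlp.vocab)
matcher.add("WINE", [nlp("Goldriesling"), nlp("Riesling")])

doc = nlp("They sell many Rieslings")
for match_id, start, end, ratio in matcher(doc):
    print((doc[start:end], start, end, ratio, match_id))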