# Implement multiprocessing speedup for `suggest` in absence of python-Levenshtein #178
Honestly, it's still not great. The best next step appears to be taking the list of inventory items and using `fuzzywuzzy` directly for more control over how the query is scored against each item. In short, `sphobjinv` does enough for a proof-of-concept, but how its `.suggest()` method works isn't quite search-engine-y enough.
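To sketch the "take the list of inventory items and score the query against each one directly" idea: the loop below is a minimal stand-in, using stdlib `difflib` in place of `fuzzywuzzy` (which may not be installed everywhere). The item strings and the `score_all` helper are illustrative only, not sphobjinv's actual API.

```python
# Minimal sketch: score a query directly against each inventory item.
# difflib stands in for fuzzywuzzy's WRatio scorer; items are illustrative.
from difflib import SequenceMatcher

def score_all(query, items):
    """Return (item, 0-100 score) pairs, best match first."""
    sm = SequenceMatcher()
    sm.set_seq2(query)  # difflib caches info about the second sequence
    scored = []
    for item in items:
        sm.set_seq1(item.lower())
        scored.append((item, round(100 * sm.ratio())))
    return sorted(scored, key=lambda pair: -pair[1])

items = [
    "sphobjinv.inventory.Inventory.suggest",
    "sphobjinv.data.DataObjStr",
    "sphobjinv.fileops.readjson",
]
print(score_all("dataobj", items))
```

Scoring item-by-item like this gives full control over the scorer and over thresholding, which is exactly the control `.suggest()` currently hides.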
If I'm going to have to re-implement …
Scorer should make its own internal decision about multi or not, but `suggest` should have an option to coerce single-process or nproc if the user prefers, or if the multi-detector is making a bad determination. The scorer should have a fully inspectable/introspectable API, with a defined interface for how `suggest` will call it. It should take the most general, information-rich inputs possible, which is probably the search term and the inventory itself! Provide index and score flags, too? Threshold? I could see a scorer knowing enough about its own properties to be able to make a quick first pass and discard sufficiently poor matches. Might as well provide all the information possible. Have to set this up so that it's easy to add more information if something new comes up.
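One way the interface described above could look, as a hypothetical sketch (every name here, `ScoreRequest`, `Scorer`, `DifflibScorer`, is illustrative, not sphobjinv's actual API). The request object bundles the search term, the inventory items, threshold, and index/score flags so that more fields can be added later without breaking the call signature, and the example scorer uses `SequenceMatcher.real_quick_ratio()` as the cheap first-pass discard:

```python
# Hypothetical pluggable-scorer interface; all names are illustrative.
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Protocol, Sequence, List

@dataclass
class ScoreRequest:
    """Everything suggest() knows, so a scorer can use what it needs."""
    term: str                    # the user's search term
    items: Sequence[str]         # flattened inventory objects, as strings
    threshold: int = 50          # discard matches scoring below this
    with_index: bool = False     # caller wants item indices back
    with_score: bool = False     # caller wants raw scores back
    nproc: Optional[int] = None  # None: scorer decides; 1: force serial

@dataclass
class ScoredItem:
    index: int
    item: str
    score: int

class Scorer(Protocol):
    """Interface any scorer, builtin or plugged-in, would implement."""
    def score(self, request: ScoreRequest) -> List[ScoredItem]:
        ...

class DifflibScorer:
    """Simple serial scorer; ignores request.nproc by its own decision."""
    def score(self, request: ScoreRequest) -> List[ScoredItem]:
        sm = SequenceMatcher()
        sm.set_seq2(request.term)  # cache the search term (difflib docs)
        out = []
        for i, item in enumerate(request.items):
            sm.set_seq1(item)
            # Cheap first pass: real_quick_ratio() is an upper bound on
            # ratio(), so anything below threshold here can be discarded.
            if round(100 * sm.real_quick_ratio()) < request.threshold:
                continue
            s = round(100 * sm.ratio())
            if s >= request.threshold:
                out.append(ScoredItem(i, item, s))
        return sorted(out, key=lambda si: -si.score)
```

Because the request carries the whole inventory rather than pre-chewed strings, a future scorer could exploit structure (domain, role) that this one ignores.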
Best practice... recommend that any scorer, builtin or plugged-in, define …
As part of the new implementation of the multiprocessing-enabled … E.g., if …
Either way, keep the …
Can consider using https://pypi.org/project/editdistance/ as a new … Or, if it's fast enough with the …
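The editdistance package exposes `editdistance.eval(a, b)` for raw Levenshtein distance; since it may not be installed, here is a minimal pure-Python Levenshtein as a stand-in, plus a hypothetical `distance_score` helper (not sphobjinv API) showing how a raw distance could be normalized to the 0-100 scale the other scorers use:

```python
# Pure-Python Levenshtein distance, as a stand-in for editdistance.eval().
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distance_score(term: str, item: str) -> int:
    """Normalize edit distance to a 0-100 similarity score."""
    longest = max(len(term), len(item)) or 1
    return round(100 * (1 - levenshtein(term, item) / longest))
```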
Sprinkling this in various issues: …
NOTE: With the deprecation of the `python-Levenshtein` speedup for the `suggest` functionality (see #211 & #218), identifying other methods to increase performance is a priority. This multiprocessing-based approach is the best one I've thought of so far. If anyone has another suggestion, please open a new issue to discuss it.

Very early POC on the suggest-multiproc branch. It suggests some speedup is possible, but it's not a slam-dunk whether it's worth it, given the sub-2s processing times for most inventories. Properly exploiting `difflib.SequenceMatcher`'s caching behavior may change this, however.

For comparison, `python-Levenshtein` is a better speedup for less internal complexity, and doesn't appear to benefit at all from multiproc.
Notes

- `pool.map()`, without the context manager
- `multiprocessing.cpu_count()` would be a reasonable default pool size
- `Inventory.suggest()`, likely; also the `suggest` subparser
- If `nproc == 1`, then skip `multiprocessing` entirely
- Check at `sphobjinv` import-time whether `multiprocessing` is available
- Per the `difflib` docs, implementing a bespoke scoring function directly with `difflib.SequenceMatcher` may allow significant speed gains, due to the ability to cache the `suggest` search term
- `difflib.SequenceMatcher` does not give good suggestions for some searches (e.g., "dataobj" in the sphobjinv inventory), compared to the default, `WRatio`-based partial-matcher in `fuzzywuzzy`
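Pulling several of these notes together, here is a hypothetical sketch (none of these names are sphobjinv's actual API) of what the scoring core could look like: `multiprocessing.cpu_count()` as the default pool size, `nproc == 1` skipping `multiprocessing` entirely, `pool.map()` for the parallel path, and the search term cached once per worker via `SequenceMatcher.set_seq2()` as the `difflib` docs recommend:

```python
# Hypothetical multiprocessing-enabled scorer for suggest; names are
# illustrative (suggest_scores, _score_chunk, nproc), not sphobjinv API.
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool, cpu_count

def _score_chunk(items, term):
    """Score each item against term, reusing one SequenceMatcher.

    Per the difflib docs, SequenceMatcher caches detailed information
    about its second sequence, so the search term is set once via
    set_seq2() and only seq1 varies across items.
    """
    sm = SequenceMatcher()
    sm.set_seq2(term)
    scores = []
    for item in items:
        sm.set_seq1(item)
        scores.append(round(100 * sm.ratio()))
    return scores

def suggest_scores(term, items, nproc=None):
    """Return a 0-100 score per item; nproc=1 bypasses multiprocessing."""
    if nproc is None:
        nproc = cpu_count()  # reasonable default pool size
    if nproc == 1:
        return _score_chunk(items, term)  # skip multiprocessing entirely
    # One strided chunk per process, scored in parallel via pool.map()
    chunks = [items[i::nproc] for i in range(nproc)]
    with Pool(nproc) as pool:
        results = pool.map(partial(_score_chunk, term=term), chunks)
    # Re-interleave chunked results back into original item order
    scores = [0] * len(items)
    for offset, chunk_scores in enumerate(results):
        for j, s in enumerate(chunk_scores):
            scores[offset + j * nproc] = s
    return scores
```

Whether the pool pays for its fork/pickle overhead on sub-2s inventories is exactly the open question above; the `nproc == 1` escape hatch keeps the serial path free of that cost.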