Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea for a further feature: similarity matrices #130

Open
trichie opened this issue May 27, 2020 · 2 comments
Open

Idea for a further feature: similarity matrices #130

trichie opened this issue May 27, 2020 · 2 comments

Comments

@trichie
Copy link

trichie commented May 27, 2020

Dear James,

One of the bottlenecks of fuzzy comparison between two large lists of size n and m is that it has O(n*m) runtime as the comparison function -- e.g. jellyfish.jaro_winkler -- has to be called n * m times. So far I have not been able to vectorize this operation, i.e. to push the looping through the lists into the highly performant faster C++ implementation cjellyfish as the jellyfish package seems not to support vectorized operations e.g. on NumPy arrays yet (as far as I was able to figure out). Also, just-in-time compilation packages such as Numba seem to have issues with the jellyfish function calls (i.e. it doesn't seem to work that way either).

I am aware that this is no general cure for the issue: O(n*m) just explodes with growing size and from a certain size on one needs to rethink the entire approach e.g. by somehow making an intelligent pre-split of the list. However, having e.g. a factor 16 speed gain (I see roughly factor 22 on my machine when %timeit-ing the c-compiled jaro_winkler vs the python jaro_winkler) would still make it possible to match 4 times larger lists in the same runtime (which would already help me a lot for the kind of lists I normally have to deal with).

Below you find some idea that I had for what could be added into the precompiled cjellyfish part of the package for significant speed gain when searching matches in lists. The return object would be a n*m matrix containing all the individual distances that can be processed further. Please note that -- if a list of strings is compared against itself (e.g. for duplicate scans of that list) -- one only needs to calculate the upper triangle of the matrix for symmetry reasons. this is achieved by setting flag = 1 as then the inner loop starts at i + 1 instead of 0. This uses up only half the runtime which is valuable as the whole reason for this exercise is runtime optimization ;-)

Please see the attached .txt file (I was not allowed to upload a .py file) for a python code suggestion of how to maybe implement it. But i am no pro programmer, just an actuary-turning-data-scientist with less than half a year of Python experience, so I am sure that my implementation still has some flaws (e.g. no type checking, no error handling, etc.) and is also dependent on numpy (which you might not want). It is just meant to show you the basic idea.

Regards Thomas / trichie

similarity_matrix.txt

@jamesturk
Copy link
Owner

Hi, thanks for taking the time to write-

I'm definitely open to a patch for this if there were an optional numpy dependency :) I'm not a numpy user myself, so someone with experience there would have to do that.

@trichie
Copy link
Author

trichie commented Jul 11, 2020

Hi James, meanwhile I got a bit more accustomed to Python ;-) This should actually do the trick without numpy, returning a 2d list.

import jellyfish as jf
def similarity_matrix(lst: list, oth = None, sim_func = jf.jaro_winkler_similarity, ignore_case = True, **kwargs):
lst = [x.lower() if isinstance(x, str) else x for x in lst] if ignore_case else lst
oth = lst if oth is None else [x.lower() if isinstance(x, str) else x for x in oth] if ignore_case else oth
flag = 1 if oth == lst else 0
sim_mat = [[sim_func(lst[i], oth[j], **kwargs) if j >= flag * (i + 1) else 0 for j in range(len(oth))]
for i in range(len(lst))]
return sim_mat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants