add more "advanced" search options #12
Comments
Are we thinking of doing this by hand or programmatically? TF-IDF and word-vector-based approaches spring to mind. Perhaps @andreassot10 could advise.
I thought that we could look at cosine similarities between word embeddings. As it turns out, that may not be such a great solution after all, because "meals" may be highly similar to many other words that have nothing to do with eating. Thus, words like "food" and "drink" wouldn't necessarily appear at the top of the similarity list (sorted in descending order). There's a workaround though: we can manually specify the list of words that we believe are associated with the search word ("food" in this example) and return results in the following way: the results appearing first are the ones for the word with which "food" has the highest cosine similarity, followed by the results for the word with the second highest similarity with "food", and so on. So what's returned would be based on a table that looks like this:
I've done some experimentation with the implementation of Facebook's
Note that this is early days and there's probably a much better way of tackling this issue. It smells like @ChrisBeeley, you said some approaches spring to mind. It would be good to share any links with us. |
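The ordering described above (a manually specified word list, sorted by cosine similarity to the search word) could be sketched roughly as follows. This is only an illustration: the three-dimensional vectors, the example comments and the function names `rank_related_words` and `search` are all made up here, and real embeddings would come from a trained model rather than a hard-coded dictionary.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Real word embeddings would come from a trained model.
EMBEDDINGS = {
    "food": [0.9, 0.1, 0.0],
    "meal": [0.8, 0.2, 0.1],
    "drink": [0.6, 0.5, 0.0],
    "nurse": [0.0, 0.1, 0.9],
}

# Hypothetical feedback comments.
COMMENTS = [
    "The nurse was kind",
    "Nothing to drink on the ward",
    "Every meal arrived late",
    "The food was cold",
]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_related_words(search_word, related_words, embeddings):
    """Sort the manually chosen related words by similarity, highest first."""
    scored = [
        (word, cosine_similarity(embeddings[search_word], embeddings[word]))
        for word in related_words
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def search(search_word, related_words, embeddings, comments):
    """Return matches for the search word first, then for each related
    word in descending order of similarity, without duplicates."""
    ranked = rank_related_words(search_word, related_words, embeddings)
    ordered_terms = [search_word] + [word for word, _ in ranked]
    results = []
    for term in ordered_terms:
        for comment in comments:
            if term in comment.lower().split() and comment not in results:
                results.append(comment)
    return results


ranked_words = rank_related_words("food", ["drink", "meal"], EMBEDDINGS)
results = search("food", ["drink", "meal"], EMBEDDINGS, COMMENTS)
```

With these toy vectors "meal" ranks above "drink", so comments mentioning "meal" are returned before those mentioning "drink", and the "nurse" comment is not returned at all because "nurse" is not in the manual list.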
Package |
Nothing clever particularly. Just the word vector thing you mentioned, and also maybe looking at tf-idf values within particular categories (e.g. the words with the top 5 tf-idf scores from the "food" theme). In general I guess it would be fairly easy to pick up certain words with high tf-idf in each theme and bring back the whole theme for them (users could obviously turn this off because it would be quite over-inclusive). Could tweak it a bit I suppose depending on how much comes back: if there isn't a lot you could scrape the barrel a bit, like Google does with searches.
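As a rough sketch of the "top tf-idf words per theme" idea, the snippet below treats all comments in a theme as one document and scores each word with a plain, unsmoothed tf-idf. The comments, theme assignments and the function name `top_tfidf_terms` are invented for illustration; a library implementation (scikit-learn's, for example, which smooths the idf by default) would give different scores and possibly a slightly different ranking.

```python
import math
from collections import Counter

# Hypothetical comments grouped by theme, invented for illustration.
THEMED_COMMENTS = {
    "Food": ["the food was cold", "meal portions were small", "food tasted bland"],
    "Staff": ["the nurse was kind", "staff were helpful", "the nurse listened"],
    "Access": ["parking was hard", "the wait was long"],
}


def top_tfidf_terms(themed_comments, n=5):
    """Return the n highest-scoring tf-idf terms for each theme.

    Each theme is treated as a single document:
    tf = term count / total terms in the theme,
    idf = log(number of themes / number of themes containing the term).
    """
    counts = {
        theme: Counter(" ".join(texts).lower().split())
        for theme, texts in themed_comments.items()
    }
    n_themes = len(counts)
    # Document frequency: in how many themes does each term occur?
    doc_freq = Counter(term for c in counts.values() for term in c)

    top_terms = {}
    for theme, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (count / total) * math.log(n_themes / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
        top_terms[theme] = [term for term, _ in ranked[:n]]
    return top_terms


top_terms = top_tfidf_terms(THEMED_COMMENTS, n=5)
```

Words that occur in every theme (such as "the" here) get an idf of zero and drop to the bottom, which is what makes the remaining high scorers usable as theme keywords.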
I'm confused. First of all, by "theme" you mean the code (Access, Miscellaneous etc.)? Second, what do you mean by "[...] bring back the whole theme [...]?" I need a clear explanation of what you two are after. |
"First of all, by "theme" you mean the code (Access, Miscellaneous etc.)?"
Yes.
"Second, what do you mean by "[...] bring back the whole theme [...]?""
Bring back everything tagged to that theme. If they search the word "nurse" they get back the whole "staff" theme.
Incidentally, we have talked about fitting models for some of the subthemes; "food" (from "Environment/ facilities") might be a good candidate for this. I imagine the TF-IDF would be reasonably different for food subthemes than for the rest of the Environment/ facilities category.
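A minimal sketch of "bring back the whole theme", including the option to turn the expansion off, might look like this. The keyword sets, the comment records and the `expand_to_theme` flag are all hypothetical names chosen for the example; in practice the keywords per theme could come from tf-idf or be curated by hand.

```python
# Hypothetical keyword sets per theme and hypothetical tagged comments.
THEME_KEYWORDS = {
    "Staff": {"nurse", "doctor"},
    "Food": {"food", "meal"},
}

COMMENTS = [
    {"text": "The nurse was kind", "theme": "Staff"},
    {"text": "Reception were rude", "theme": "Staff"},
    {"text": "The food was cold", "theme": "Food"},
]


def search_comments(query, comments, theme_keywords, expand_to_theme=True):
    """Return comments containing the query word and, if expand_to_theme
    is True, every other comment tagged with a theme that lists the
    query as a keyword. Literal matches come first."""
    query = query.lower()
    literal = [c for c in comments if query in c["text"].lower().split()]
    if not expand_to_theme:
        return literal
    matched_themes = {
        theme for theme, keywords in theme_keywords.items() if query in keywords
    }
    extra = [
        c for c in comments if c["theme"] in matched_themes and c not in literal
    ]
    return literal + extra


expanded = search_comments("nurse", COMMENTS, THEME_KEYWORDS)
literal_only = search_comments("nurse", COMMENTS, THEME_KEYWORDS, expand_to_theme=False)
```

Here a search for "nurse" returns both "Staff" comments when expansion is on, and only the comment that literally mentions "nurse" when it is off.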
Thanks @ChrisBeeley. Before delving into subthemes like "Food" from "Environment/ facilities", I thought it'd be a good idea to show you a process for relating words to themes that you may find useful. It uses
This looks really helpful. I can't process anything else before Wednesday; can we discuss on a call some time? Maybe when we're testing the Python pipeline. At first glance it looks as though > 0.7 does not discriminate very well, but > 0.9 would. Please let's bring this and related matters on Wednesday for discussion.
Create categories: for example, if a user looks up the word "meal", the app could also look for the terms "food" and "drink".