add more "advanced" search options #12

Open
milanwiedemann opened this issue Jan 29, 2021 · 8 comments
Labels
enhancement New feature or request

@milanwiedemann
Contributor

Create categories: for example, if a user looks up the word "meal", the app could also look for the terms "food" and "drink".
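
A minimal sketch of the hand-built variant of this idea, in base R. The `categories` list and the `expand_search_terms()` helper are hypothetical names, not part of the app; only the meal/food/drink grouping comes from the example above.

```r
# Hypothetical hand-curated categories: a term expands to every term in its group.
categories <- list(
  eating = c("meal", "food", "drink")
)

# Return the search term plus every other term that shares a category with it.
expand_search_terms <- function(term, categories) {
  term <- tolower(term)
  hits <- Filter(function(terms) term %in% terms, categories)
  unique(c(term, unlist(hits, use.names = FALSE)))
}

expand_search_terms("meal", categories)
# "meal" "food" "drink"
```

A term that belongs to no category is returned unchanged, so the search falls back to a plain lookup.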

@milanwiedemann milanwiedemann added the enhancement New feature or request label Feb 4, 2021
@milanwiedemann milanwiedemann added this to the 0.3.0 milestone Feb 4, 2021
@ChrisBeeley
Member

Are we thinking of doing this by hand or programmatically? There are tf-idf and word-vector-based approaches that spring to mind. Perhaps @andreassot10 could advise.

@andreassot10

I thought we could look at cosine similarities between word embeddings. As it turns out, that may not be such a great solution after all, because "meals" may be highly similar to many other words that are irrelevant to eating. Words like "food" and "drink" therefore wouldn't necessarily appear at the top of the similarity list (sorted in descending order).

There's a workaround though: we can manually specify the list of words that we believe are associated with the search word ("food" in this example) and order the results by cosine similarity. Results for the word most similar to "food" appear first, followed by results for the second most similar word, and so on. What's returned would be based on a table that looks like this:

word      food
meals     0.9942232
meal      0.9700867
drink     0.9208235

I've done some experimentation with ruimtehol, the R implementation of Facebook's StarSpace. Download this data and run this:

library(magrittr)

text_data_starspace <- text_data

# Text data in StarSpace-friendly format
text_data_starspace <- text_data_starspace %>% 
  dplyr::mutate(
    feedback = feedback %>% 
      strsplit(., "\\W") %>% 
      purrr::map_chr(
        ~ paste(setdiff(.x, ""), collapse = " ")
      ) %>% 
      tolower()
  )


# Calculate word embeddings. Can be slow, so set maxTrainTime to a few minutes (in seconds).
# Model will probably be rubbish, but should give you an idea.
model_wordspace <- ruimtehol::embed_wordspace(x = text_data_starspace$feedback, 
                                              model = "wordspace.bin",
                                              early_stopping = 0.8,
                                              validationPatience = 10,
                                              dim = 50,
                                              lr = 0.01, 
                                              epoch = 60, 
                                              loss = "softmax", 
                                              adagrad = TRUE, 
                                              similarity = "cosine", 
                                              negSearchLimit = 50,
                                              ngrams = 5, 
                                              minCount = 5,
                                              maxTrainTime = 3 * 60)

plot(model_wordspace)

# Matrix of word vectors
wordvectors <- as.matrix(model_wordspace)

# Data frame of cosine similarities between all word vectors
word_similarities <- wordvectors %>% 
  ruimtehol::embedding_similarity(wordvectors) %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column() %>% 
  dplyr::rename(word = rowname)

# Table of cosine similarities between search word and possibly related words
word1 <- "food"
word2 <- c("meal", "meals", "drink")
corr_threshold <- 0.7
word_similarities %>% 
  # all_of() and .data[[ ]] look up the column named by the string in word1
  dplyr::select(word, dplyr::all_of(word1)) %>% 
  dplyr::arrange(dplyr::desc(.data[[word1]])) %>%
  dplyr::filter(
    .data[[word1]] >= corr_threshold,
    !word %in% tidytext::stop_words$word,
    !word %in% word1,
    word %in% word2
  )

Note that this is early days and there's probably a much better way of tackling this issue. It smells like Python to me!

@ChrisBeeley, you said some approaches spring to mind. It would be good to share any links with us.

@andreassot10

Package tm could also be useful?
https://fredgibbs.net/tutorials/document-similarity-with-r.html

@ChrisBeeley
Member

Nothing clever particularly. Just the word vector thing you mentioned, and also maybe looking at tf-idf values within particular categories (e.g. the words with the top 5 tf-idf from the "food" theme). In general I guess it would be fairly easy to pick up certain words with high tf-idf in each theme and bring back the whole theme for them (they could obviously turn this off because it would be quite over-inclusive).

Could tweak it a bit I suppose, depending on how much comes back: if there isn't a lot you could scrape the barrel a bit, like Google does with searches.
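
A rough sketch of the "top tf-idf words per theme" idea, in base R. The `toy` data frame, its themes and the `top_tf_idf()` helper are invented for illustration, and each theme is treated as a single document, which is an assumption about how the idea would be set up. On the real data something like `tidytext::bind_tf_idf()` may be more convenient.

```r
# Toy feedback, one row per comment; themes and text are made up for illustration.
toy <- data.frame(
  theme = c("Food", "Food", "Staff", "Staff"),
  feedback = c("the food was cold", "the meal and the drink were nice",
               "the nurse was kind", "the staff and the nurse were helpful"),
  stringsAsFactors = FALSE
)

# Treat each theme as one "document": pool its words and count them.
words_by_theme <- lapply(split(toy$feedback, toy$theme), function(x) {
  unlist(strsplit(tolower(x), "\\W+"))
})
counts <- lapply(words_by_theme, table)

# idf = log(number of themes / number of themes containing the word)
vocab <- unique(unlist(lapply(counts, names)))
themes_with_word <- sapply(vocab, function(w) {
  sum(sapply(counts, function(cnt) w %in% names(cnt)))
})
idf <- log(length(counts) / themes_with_word)

# tf = share of the theme's words; keep the n highest tf-idf words per theme.
top_tf_idf <- function(cnt, n = 5) {
  tf <- cnt / sum(cnt)
  score <- as.numeric(tf) * idf[names(cnt)]
  names(score) <- names(cnt)
  head(sort(score, decreasing = TRUE), n)
}
top_words <- lapply(counts, top_tf_idf)
top_words
```

Words that occur in every theme (such as "the") get an idf of zero, so they drop to the bottom of each theme's list without needing a stop-word list.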

@andreassot10

> Nothing clever particularly. Just the word vector thing you mentioned, and also maybe looking at tf-idf values within particular categories (e.g. the words with the top 5 tf-idf from the "food" theme). In general I guess it would be fairly easy to pick up certain words with high tf-idf in each theme and bring back the whole theme for them (they could obviously turn this off because it would be quite over-inclusive).
>
> Could tweak it a bit I suppose, depending on how much comes back: if there isn't a lot you could scrape the barrel a bit, like Google does with searches.

I'm confused.

First of all, by "theme" do you mean the code (Access, Miscellaneous etc.)?

Second, what do you mean by "[...] bring back the whole theme [...]?"

I need a clear explanation of what you two are after.

@ChrisBeeley
Member

> First of all, by "theme" you mean the code (Access, Miscellaneous etc.)?

Yes.

> Second, what do you mean by "[...] bring back the whole theme [...]?"

Bring back everything tagged to that theme. If they search the word "nurse" they get back the whole "staff" theme.
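
A minimal sketch of what "bring back the whole theme" could look like, in base R. The `comments` data frame, the `word_to_theme` lookup and the `search_feedback()` helper are all hypothetical; the lookup stands in for whatever method (tf-idf, embeddings) ends up linking words to themes. The `include_theme` flag reflects the suggestion that users could turn the expansion off.

```r
# Toy tagged comments; text and tags are invented for illustration.
comments <- data.frame(
  feedback = c("the nurse was kind", "doctors listened to me",
               "parking was difficult", "the food was cold"),
  label = c("Staff", "Staff", "Access", "Environment/facilities"),
  stringsAsFactors = FALSE
)

# Hypothetical word-to-theme lookup.
word_to_theme <- c(nurse = "Staff", food = "Environment/facilities")

# Literal matches plus, optionally, every comment tagged with the word's theme.
search_feedback <- function(word, comments, word_to_theme, include_theme = TRUE) {
  literal <- grepl(word, comments$feedback, fixed = TRUE)
  theme <- word_to_theme[word]
  in_theme <- include_theme & !is.na(theme) & comments$label == theme
  comments[literal | in_theme, , drop = FALSE]
}

search_feedback("nurse", comments, word_to_theme)
```

With the toy data, searching "nurse" returns both comments tagged "Staff", including the one that never mentions the word.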

Incidentally, we have talked about fitting models for some of the subthemes. "Food" (from "Environment/facilities") might be a good candidate for this. I imagine the tf-idf values would be reasonably different for the food subthemes than for the rest of the Environment/facilities category.

@andreassot10

Thanks @ChrisBeeley.

Before delving into subthemes like "Food" from "Environment/facilities", I thought it'd be a good idea to show you a process for relating words to themes that you may find useful. It uses ruimtehol again, only this time it builds a supervised model:

library(magrittr)

text_data_starspace <- text_data # Download from https://github.com/CDU-data-science-team/pxtextminingdashboard/blob/master/data/text_data.rda

# Text data in StarSpace-friendly format
text_data_starspace <- text_data_starspace %>% 
  dplyr::mutate(
    feedback = feedback %>% 
      strsplit(., "\\W") %>% 
      purrr::map_chr(
        ~ paste(setdiff(.x, ""), collapse = " ")
      ) %>% 
      tolower(),
    label = label %>% 
      as.character() %>%  
      strsplit(split = ",") %>% 
      purrr::map(~ gsub(" ", "-", .x))
  )

# Build supervised model
model_supervised <- ruimtehol::embed_tagspace(x = text_data_starspace$feedback, y = text_data_starspace$label,
                                   early_stopping = 0.8,
                                   validationPatience = 10,
                                   dim = 50,
                                   lr = 0.01, 
                                   epoch = 60, 
                                   loss = "softmax", 
                                   adagrad = TRUE, 
                                   similarity = "cosine", 
                                   negSearchLimit = 50,
                                   ngrams = 5, 
                                   minCount = 5)

plot(model_supervised)

# Dictionary (we won't be needing it- I'm just demonstrating it can be done)
dict <- ruimtehol::starspace_dictionary(model_supervised)
str(dict)

# Get embeddings of the dictionary of words as well as the categories
embedding_words <- as.matrix(model_supervised, type = "words")
embedding_labels <- as.matrix(model_supervised, type = "label")

# Find cosine similarities between words and themes
corr_threshold <- 0.7
words <- c('nurse', 'ward')
embedding_labels %>% 
  ruimtehol::embedding_similarity(embedding_words) %>% 
  as.data.frame() %>%
  tibble::rownames_to_column() %>% 
  dplyr::rename(label = rowname) %>% 
  dplyr::mutate(label = sub("__label__", "", label)) %>% 
  tidyr::pivot_longer(cols = -1, names_to = "word") %>%
  dplyr::filter(
    !word %in% tidytext::stop_words$word,
    word %in% words,
    value >= corr_threshold
  ) %>% 
  dplyr::group_by(label) %>% 
  # dplyr is not attached, so desc() needs its namespace
  dplyr::arrange(word, dplyr::desc(value))

# A tibble: 10 x 3
# Groups:   label [7]
#   label                   word  value
#   <chr>                   <chr> <dbl>
# 1 Staff                   nurse 0.961
# 2 Care-received           nurse 0.841
# 3 Dignity                 nurse 0.756
# 4 Staff                   ward  0.948
# 5 Care-received           ward  0.902
# 6 Dignity                 ward  0.841
# 7 Environment/facilities  ward  0.789
# 8 Access                  ward  0.762
# 9 Transition/coordination ward  0.758
#10 Communication           ward  0.758

@ChrisBeeley
Member

This looks really helpful. I can't process anything else before Wednesday. Can we discuss it on a call some time? Maybe when we're testing the Python pipeline.

At first glance it looks as though a threshold of 0.7 does not discriminate very well, but 0.9 would.

Please let's bring this and related matters to Wednesday's discussion.
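
The threshold comparison can be checked against the posted output with a few lines of base R. The similarity values are copied from the tibble above; the `similarities` data frame and the `themes_per_word()` helper are illustrative names only.

```r
# Similarities copied from the tibble posted above.
similarities <- data.frame(
  label = c("Staff", "Care-received", "Dignity",
            "Staff", "Care-received", "Dignity", "Environment/facilities",
            "Access", "Transition/coordination", "Communication"),
  word = c(rep("nurse", 3), rep("ward", 7)),
  value = c(0.961, 0.841, 0.756,
            0.948, 0.902, 0.841, 0.789, 0.762, 0.758, 0.758),
  stringsAsFactors = FALSE
)

# Number of themes each word maps to at a given threshold.
themes_per_word <- function(threshold) {
  kept <- similarities[similarities$value >= threshold, ]
  table(kept$word)
}

themes_per_word(0.7)  # nurse: 3, ward: 7
themes_per_word(0.9)  # nurse: 1, ward: 2
```

At 0.9 "nurse" maps only to Staff, and "ward" only to Staff and Care-received.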

@ChrisBeeley ChrisBeeley modified the milestones: 0.3.0, 0.4.0 Jun 20, 2021
@ChrisBeeley ChrisBeeley assigned yiwen-h and asegun-cod and unassigned ChrisBeeley Dec 1, 2022
@ChrisBeeley ChrisBeeley modified the milestones: Year 2, 0.6 Dec 1, 2022
@ChrisBeeley ChrisBeeley modified the milestones: 0.6, 0.7 Jan 31, 2023
@ChrisBeeley ChrisBeeley modified the milestones: 0.7, Year two Mar 15, 2023
@ChrisBeeley ChrisBeeley removed this from the Year two milestone Aug 4, 2023
@ChrisBeeley ChrisBeeley added this to the Year three milestone Aug 4, 2023

5 participants