add more "advanced" search options #12

milanwiedemann · 2021-01-29T17:44:58Z

create categories, for example if user looks up word "meal" the app could also look for the terms "food" and "drink"

ChrisBeeley · 2021-03-05T13:42:27Z

Are we thinking of doing this by hand or programmatically? There are tf-idf and word vector based approaches that spring to mind. Perhaps @andreassot10 could advise

andreassot10 · 2021-05-10T11:39:40Z

I thought that we could look at cosine similarities between word embeddings. As it turns out, it may not be that great a solution after all, because "meals" may be highly correlated with many other words that are irrelevant to eating. Thus, words like "food" and "drink" wouldn't necessarily appear at the top of the similarity list (sorted in descending order).

There's a workaround though: we can manually specify the list of words that we believe are associated with the search word ("food" in this example) and order the results by similarity. The results appearing first are the ones for the word with which "food" has the highest cosine similarity, followed by the results for the word with the second highest similarity, and so on. So what's returned would be based on a table that looks like this:

word      food
meals     0.9942232
meal      0.9700867
drink     0.9208235
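
To make the ranking idea concrete, here is a minimal base-R sketch. The embeddings are made-up toy numbers (not output from any model), and `cosine_similarity()` and `rank_related_words()` are hypothetical helper names, not part of ruimtehol or the dashboard.

```r
# Toy 3-dimensional embeddings; the numbers are invented for illustration only.
embeddings <- rbind(
  food  = c(0.90, 0.10, 0.30),
  meal  = c(0.80, 0.20, 0.40),
  meals = c(0.85, 0.15, 0.30),
  drink = c(0.60, 0.50, 0.20),
  ward  = c(0.10, 0.90, 0.70)
)

# Cosine similarity between two numeric vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical helper: rank a hand-picked list of related words by their
# cosine similarity with the search word, highest first.
rank_related_words <- function(search_word, related_words, embeddings) {
  sims <- vapply(
    related_words,
    function(w) cosine_similarity(embeddings[search_word, ], embeddings[w, ]),
    numeric(1)
  )
  sims <- sort(sims, decreasing = TRUE)
  data.frame(word = names(sims), similarity = unname(sims),
             stringsAsFactors = FALSE)
}

ranked <- rank_related_words("food", c("meal", "meals", "drink"), embeddings)
print(ranked)
```

The search results would then be returned in the order of `ranked$word`. Because the related words are picked by hand, unrelated words such as "ward" never enter the ranking, whatever their similarity.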

I've done some experimentation with the implementation of Facebook's StarSpace in R (ruimtehol). Download this data and run this:

library(magrittr)

text_data_starspace <- text_data

# Text data in StarSpace-friendly format
text_data_starspace <- text_data_starspace %>% 
  dplyr::mutate(
    feedback = feedback %>% 
      strsplit(., "\\W") %>% 
      purrr::map_chr(
        ~ paste(setdiff(.x, ""), collapse = " ")
      ) %>% 
      tolower()
  )


# Calculate word embeddings. Training can be slow, so cap it with maxTrainTime (in seconds).
# Model will probably be rubbish, but should give you an idea.
model_wordspace <- ruimtehol::embed_wordspace(x = text_data_starspace$feedback, 
                                              model = "wordspace.bin",
                                              early_stopping = 0.8,
                                              validationPatience = 10,
                                              dim = 50,
                                              lr = 0.01, 
                                              epoch = 60, 
                                              loss = "softmax", 
                                              adagrad = TRUE, 
                                              similarity = "cosine", 
                                              negSearchLimit = 50,
                                              ngrams = 5, 
                                              minCount = 5,
                                              maxTrainTime = 3 * 60)

plot(model_wordspace)

# Matrix of word vectors
wordvectors <- as.matrix(model_wordspace)

# Data frame of cosine similarities between all word vectors
word_similarities <- wordvectors %>% 
  ruimtehol::embedding_similarity(wordvectors) %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column() %>% 
  dplyr::rename(word = rowname)

# Table of cosine similarities between search word and possibly related words
word1 <- 'food'
word2 <- c('meal', 'meals', 'drink')
corr_threshold <- 0.7
# all_of() and .data[[ ]] select the column whose name is stored in word1
word_similarities %>% 
  dplyr::select(word, dplyr::all_of(word1)) %>% 
  dplyr::arrange(dplyr::desc(.data[[word1]])) %>%
  dplyr::filter(
    .data[[word1]] >= corr_threshold,
    !word %in% tidytext::stop_words$word,
    !word %in% word1,
    word %in% word2
  )

Note that this is early days and there's probably a much better way of tackling this issue. It smells like Python to me!

@ChrisBeeley, you said some approaches spring to mind. It would be good to share any links with us.

andreassot10 · 2021-05-10T11:42:37Z

Package tm could also be useful?
https://fredgibbs.net/tutorials/document-similarity-with-r.html

ChrisBeeley · 2021-05-10T12:12:11Z

Nothing clever particularly. Just the word vector thing you mentioned, and also maybe looking at tf-idf values within particular categories (e.g. the words with the top 5 tf-idf from the "food" theme). In general I guess it would be fairly easy to pick up certain words with high tf-idf in each theme and bring back the whole theme for them (they could obviously turn this off because it would be quite over-inclusive).

Could tweak it a bit I suppose depending on how much comes back: if there isn't a lot you could scrape the barrel a bit, like Google does with searches.
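
A minimal base-R sketch of the tf-idf idea, assuming each theme is treated as one "document" made of all its comments pooled together (that pooling is an assumption, as are the toy comments and the hypothetical helper `top_terms()`):

```r
# Toy feedback comments with hand-assigned themes (invented for illustration).
feedback <- data.frame(
  theme = c("Staff", "Staff", "Environment/facilities", "Environment/facilities"),
  text  = c("the nurse was kind",
            "the doctor and the nurse listened",
            "the food was cold",
            "the ward was clean and the food was good"),
  stringsAsFactors = FALSE
)

# Pool the comments of each theme and split them into lower-case tokens.
words_by_theme <- lapply(
  split(feedback$text, feedback$theme),
  function(txt) {
    tokens <- unlist(strsplit(tolower(txt), "\\W+"))
    tokens[tokens != ""]
  }
)

# Term counts: one row per word, one column per theme.
vocabulary <- sort(unique(unlist(words_by_theme)))
counts <- vapply(
  words_by_theme,
  function(w) as.numeric(table(factor(w, levels = vocabulary))),
  numeric(length(vocabulary))
)
rownames(counts) <- vocabulary

# tf = share of the theme's tokens; idf = log(n themes / n themes containing the word).
tf    <- sweep(counts, 2, colSums(counts), "/")
idf   <- log(ncol(counts) / rowSums(counts > 0))
tfidf <- tf * idf

# Hypothetical helper: the n words with the highest tf-idf in a theme.
top_terms <- function(tfidf, theme, n = 5) {
  scores <- sort(tfidf[, theme], decreasing = TRUE)
  head(names(scores), n)
}

top_terms(tfidf, "Staff", n = 1)
top_terms(tfidf, "Environment/facilities", n = 1)
```

Words that occur in every theme (such as "the") get a tf-idf of zero, so only theme-specific words can end up in the top list.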

andreassot10 · 2021-05-10T12:25:28Z

> Nothing clever particularly. Just the word vector thing you mentioned, and also maybe looking at tf-idf values within particular categories (e.g. the words with the top 5 tf-idf from the "food" theme). In general I guess it would be fairly easy to pick up certain words with high tf-idf in each theme and bring back the whole theme for them (they could obviously turn this off because it would be quite over inclusive).
>
> Could tweak it a bit I suppose depending on how much comes back- if there isn't a lot you could scrape the barrel a bit, like Google does with searches.

I'm confused.

First of all, by "theme" you mean the code (Access, Miscellaneous etc.)?

Second, what do you mean by "[...] bring back the whole theme [...]?"

I need a clear explanation of what you two are after.

ChrisBeeley · 2021-05-10T12:35:53Z

> First of all, by "theme" you mean the code (Access, Miscellaneous etc.)?

Yes.

> Second, what do you mean by "[...] bring back the whole theme [...]?"

Bring back everything tagged to that theme. If they search the word "nurse" they get back the whole "staff" theme.
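
As a rough illustration of "bringing back the whole theme", here is a base-R sketch. The comments, the keyword-to-theme lookup `theme_keywords` and the function `search_feedback()` are all hypothetical; in practice the keywords could come from tf-idf or from the embedding similarities discussed in this thread.

```r
# Toy tagged comments (invented for illustration).
feedback <- data.frame(
  theme = c("Staff", "Staff", "Access"),
  text  = c("The nurse was kind",
            "Doctors listened to me",
            "Parking was hard to find"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of the words that should trigger each theme.
theme_keywords <- list(
  Staff  = c("nurse", "doctor"),
  Access = c("parking")
)

# Return comments containing the query; optionally also return every comment
# tagged with a theme that the query is a keyword for.
search_feedback <- function(query, feedback, theme_keywords,
                            expand_to_theme = TRUE) {
  query <- tolower(query)
  direct_hit <- grepl(query, tolower(feedback$text), fixed = TRUE)
  if (!expand_to_theme) {
    return(feedback[direct_hit, ])
  }
  is_keyword <- vapply(theme_keywords, function(k) query %in% k, logical(1))
  matched_themes <- names(theme_keywords)[is_keyword]
  feedback[direct_hit | feedback$theme %in% matched_themes, ]
}

search_feedback("nurse", feedback, theme_keywords)
search_feedback("nurse", feedback, theme_keywords, expand_to_theme = FALSE)
```

The `expand_to_theme` switch reflects the point made earlier that users should be able to turn the expansion off, since it is over-inclusive by design.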

Incidentally, we have talked about fitting models for some of the subthemes. "Food" (from "Environment/facilities") might be a good candidate for this. I imagine the TF-IDF would be reasonably different for food subthemes than for the rest of the Environment/facilities category.

andreassot10 · 2021-05-10T13:50:57Z

Thanks @ChrisBeeley.

Before delving into subthemes like "Food" from "Environment/facilities", I thought it'd be a good idea to show you a process for relating words to themes that you may find useful. It uses ruimtehol again, only it builds a supervised model this time:

library(magrittr)

text_data_starspace <- text_data # Download from https://github.com/CDU-data-science-team/pxtextminingdashboard/blob/master/data/text_data.rda

# Text data in StarSpace-friendly format
text_data_starspace <- text_data_starspace %>% 
  dplyr::mutate(
    feedback = feedback %>% 
      strsplit(., "\\W") %>% 
      purrr::map_chr(
        ~ paste(setdiff(.x, ""), collapse = " ")
      ) %>% 
      tolower(),
    label = label %>% 
      as.character() %>%  
      strsplit(split = ",") %>% 
      purrr::map(~ gsub(" ", "-", .x))
  )

# Build supervised model
model_supervised <- ruimtehol::embed_tagspace(x = text_data_starspace$feedback, y = text_data_starspace$label,
                                   early_stopping = 0.8,
                                   validationPatience = 10,
                                   dim = 50,
                                   lr = 0.01, 
                                   epoch = 60, 
                                   loss = "softmax", 
                                   adagrad = TRUE, 
                                   similarity = "cosine", 
                                   negSearchLimit = 50,
                                   ngrams = 5, 
                                   minCount = 5)

plot(model_supervised)

# Dictionary (we won't be needing it; I'm just demonstrating it can be done)
dict <- ruimtehol::starspace_dictionary(model_supervised)
str(dict)

# Get embeddings of the dictionary of words as well as the categories
embedding_words <- as.matrix(model_supervised, type = "words")
embedding_labels <- as.matrix(model_supervised, type = "labels")

# Find correlations between words and themes
corr_threshold <- 0.7
words <- c('nurse', 'ward')
embedding_labels %>% 
  ruimtehol::embedding_similarity(embedding_words) %>% 
  as.data.frame() %>%
  tibble::rownames_to_column() %>% 
  dplyr::rename(label = rowname) %>% 
  dplyr::mutate(label = sub("__label__", "", label)) %>% 
  tidyr::pivot_longer(cols = -1, names_to = "word") %>%
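  dplyr::filter(
    !word %in% tidytext::stop_words$word,
    word %in% words,
    value >= corr_threshold  # use the threshold defined above instead of a hard-coded 0.7
  ) %>% 
  dplyr::group_by(label) %>% 
  dplyr::arrange(word, dplyr::desc(value))  # dplyr is not attached, so namespace desc()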

# A tibble: 10 x 3
# Groups:   label [7]
#   label                   word  value
#   <chr>                   <chr> <dbl>
# 1 Staff                   nurse 0.961
# 2 Care-received           nurse 0.841
# 3 Dignity                 nurse 0.756
# 4 Staff                   ward  0.948
# 5 Care-received           ward  0.902
# 6 Dignity                 ward  0.841
# 7 Environment/facilities  ward  0.789
# 8 Access                  ward  0.762
# 9 Transition/coordination ward  0.758
#10 Communication           ward  0.758

ChrisBeeley · 2021-05-10T14:36:58Z

This looks really helpful. I can't process anything else before Wednesday. Can we discuss on a call some time? Maybe when we're testing the Python pipeline.

It looks at first glance as though a threshold of > 0.7 is not discriminating very well, but > 0.9 would.
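
A quick way to check this is to count how many themes each word maps to at either threshold. The sketch below is base R; the values are copied from the tibble printed in the previous comment, and `themes_per_word()` is a hypothetical helper name.

```r
# Similarities copied from the tibble output earlier in this thread.
similarities <- data.frame(
  label = c("Staff", "Care-received", "Dignity",
            "Staff", "Care-received", "Dignity", "Environment/facilities",
            "Access", "Transition/coordination", "Communication"),
  word  = c(rep("nurse", 3), rep("ward", 7)),
  value = c(0.961, 0.841, 0.756,
            0.948, 0.902, 0.841, 0.789, 0.762, 0.758, 0.758),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of themes each word is linked to above a threshold.
themes_per_word <- function(similarities, threshold) {
  kept <- similarities[similarities$value > threshold, ]
  table(factor(kept$word, levels = unique(similarities$word)))
}

themes_per_word(similarities, 0.7)  # nurse: 3 themes, ward: 7 themes
themes_per_word(similarities, 0.9)  # nurse: 1 theme,  ward: 2 themes
```

At > 0.9 "nurse" maps only to Staff, while "ward" still maps to both Staff and Care-received, so the cut-off may need tuning per word or per theme.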

Please let's bring this and related matters on Wednesday for discussion

milanwiedemann added the enhancement New feature or request label Feb 4, 2021

milanwiedemann added this to the 0.3.0 milestone Feb 4, 2021

ChrisBeeley assigned ChrisBeeley, milanwiedemann and andreassot10 Mar 5, 2021

ChrisBeeley unassigned milanwiedemann May 7, 2021

ChrisBeeley modified the milestones: 0.3.0, 0.4.0 Jun 20, 2021

ChrisBeeley mentioned this issue Jun 30, 2021

Produce a better search tool #28

Closed

ChrisBeeley modified the milestones: 0.4.0, Ongoing development Jun 30, 2021

ChrisBeeley unassigned andreassot10 Aug 20, 2021

ChrisBeeley assigned yiwen-h and asegun-cod and unassigned ChrisBeeley Dec 1, 2022

ChrisBeeley modified the milestones: Year 2, 0.6 Dec 1, 2022

ChrisBeeley modified the milestones: 0.6, 0.7 Jan 31, 2023

ChrisBeeley modified the milestones: 0.7, Year two Mar 15, 2023

ChrisBeeley removed this from the Year two milestone Aug 4, 2023

ChrisBeeley added this to the Year three milestone Aug 4, 2023

ChrisBeeley unassigned yiwen-h and asegun-cod Aug 4, 2023
