Recurrent neural networks and string arrays #871

Open
purplnay opened this issue Jan 2, 2023 · 1 comment
purplnay commented Jan 2, 2023


Summary

Improvement in input/output formats when dealing with arrays of strings / tokens for RNN.

Basic example

Currently the LSTM model, for example, accepts several types of input and training data, including arrays and strings. The out-of-the-box support for strings is great, but some of its defaults do not seem to fit many use cases.

Current behavior

  • Given raw strings:
const data = [
  {
    input: 'hello I am an input',
    output: 'hi I am the output'
  }
]

This implicitly uses character tokenization, behaving like input.split('') on the way in and output.join('') on the way out. See #799, which I believe is a common case of LSTM usage: string input and label output. Because of the default behavior, the labels are treated as actual text.
It is possible to preprocess the data by splitting everything into arrays, so that the model maps words to neurons instead of characters.
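
To make the difference concrete, here is a minimal sketch in plain JavaScript that does not call brain.js at all. The helper buildVocabulary is hypothetical and only illustrates how the set of tokens changes between character-level and word-level splitting; the library's internal data formatter may well work differently.

```javascript
// Hypothetical helper, not part of brain.js: collect the unique tokens
// found in a list of token arrays.
function buildVocabulary(tokenLists) {
  return [...new Set(tokenLists.flat())];
}

const text = 'hello I am an input';

// Character-level tokens, equivalent to the implicit input.split('').
const charTokens = text.split('');
// Word-level tokens, produced by preprocessing the string yourself.
const wordTokens = text.split(' ');

const charVocabulary = buildVocabulary([charTokens]);
const wordVocabulary = buildVocabulary([wordTokens]);

console.log(charTokens.length, charVocabulary.length); // 19 13
console.log(wordTokens.length, wordVocabulary.length); // 5 5
```

With characters, the sequence is long and every token carries little meaning; with words, the sequence is short and each token is a whole label or word.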

  • Given arrays of strings:
const data = [
  {
    input: ['I', 'split', 'my', 'input', 'into', 'tokens'],
    output: ['so', 'I', 'expect', 'tokens', 'as', 'output']
  }
]

Here, we are using words instead of characters. That gives the model simpler data to learn from, and it also guarantees that the output uses exactly the label names we expect. The problem right now is that my output would be soIexpecttokensasoutput because of the output.join('') behavior. Working with string arrays is not documented anywhere, so it surely causes trouble for some users of the library.
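
The effect of joining with an empty separator can be reproduced without the library. This is only a plain JavaScript illustration of the behavior described above, not the actual brain.js code path:

```javascript
const outputTokens = ['so', 'I', 'expect', 'tokens', 'as', 'output'];

// Tokens concatenated with no separator, as described in this issue.
const joined = outputTokens.join('');
console.log(joined); // soIexpecttokensasoutput

// The word boundaries are gone, so splitting on spaces recovers nothing.
const recovered = joined.split(' ');
console.log(recovered.length); // 1
```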

Current workaround

For now, the most basic workaround I can think of is to append a space to every word in the array so that the output is readable and processable.

const data = [
  {
    input: ['I ', 'split ', 'my ', 'input ', 'into ', 'tokens '],
    output: ['so ', 'I ', 'expect ', 'tokens ', 'as ', 'output ']
  }
]
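
Rather than typing the spaces by hand, the workaround can be wrapped in two small helpers. Both function names are made up for this sketch, and simulatedOutput merely stands in for the string a trained model would return:

```javascript
// Hypothetical helpers for the workaround; names are illustrative only.

// Append a space to every token so that joined output stays readable.
function withTrailingSpaces(tokens) {
  return tokens.map((token) => `${token} `);
}

// Turn a space-separated output string back into an array of tokens.
function toTokens(outputString) {
  return outputString
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

const padded = withTrailingSpaces(['so', 'I', 'expect', 'tokens', 'as', 'output']);

// Stand-in for the model's output, which is joined with an empty separator.
const simulatedOutput = padded.join('');

console.log(simulatedOutput); // "so I expect tokens as output " (note the trailing space)
console.log(toTokens(simulatedOutput));
```

The cost of this approach is that 'tokens' and 'tokens ' become different vocabulary entries, so every piece of training and inference data has to go through the same padding step.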

Possible improvement

I think the best improvement would be for the output to have the same format as the input. In the case of #799, where strings are used as input and arrays as output, the library should probably throw an explicit error saying that the input is a string while the output is an array. Preprocessing the data by splitting it into words with input.split(' ') would be a pretty easy step for the user, much easier than figuring out both how the input and output are mapped to neurons and how the output is formatted.
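
As a rough sketch of what such a check could look like (written in user land, since nothing like this exists in brain.js as far as this issue describes), the functions formatOf and assertMatchingFormats below are hypothetical, and the 'greeting' label is invented for the example:

```javascript
// Hypothetical validation, not an existing brain.js function.
function formatOf(value) {
  if (typeof value === 'string') return 'string';
  if (Array.isArray(value) && value.every((item) => typeof item === 'string')) {
    return 'string[]';
  }
  return 'unsupported';
}

// Throw an explicit error when input and output use different formats.
function assertMatchingFormats(data) {
  data.forEach(({ input, output }, index) => {
    const inputFormat = formatOf(input);
    const outputFormat = formatOf(output);
    if (inputFormat !== outputFormat) {
      throw new TypeError(
        `Training item ${index}: input is ${inputFormat} but output is ${outputFormat}`
      );
    }
  });
}

// Mixed formats, similar to the situation described in #799.
const mixed = [{ input: 'hello I am an input', output: ['greeting'] }];

let errorMessage = null;
try {
  assertMatchingFormats(mixed);
} catch (error) {
  errorMessage = error.message;
}
console.log(errorMessage);

// Splitting the input into words makes both sides string arrays.
const fixed = mixed.map(({ input, output }) => ({ input: input.split(' '), output }));
assertMatchingFormats(fixed); // does not throw
```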

Motivation

Apart from issue #799, I have been dealing with this problem myself and spent some time on it before ending up with a lot of extra code as a workaround. String arrays with RNNs are not really documented, and they offer little customization.

A lot of RNN (specifically LSTM) usage seems to be for reading questions or requests and writing a response using words related to a topic, within a fairly limited vocabulary. Another common use is mapping labels to a specific textual input, which is not well supported at the moment and relies on very implicit behavior. We could provide a more flexible RNN whose treatment of input and output is easier to customize, which would make it suitable for more use cases.

@robertleeplummerjr
Contributor

Working on this now. Not as easy as it seems, but I have started abstracting the DataFormatter into a different type that can work with TypeScript generics and the original API. This is a priority for me because this is an existing feature that isn't alpha/beta.
