Recurrent neural networks and string arrays #871
Working on this now. It is not as easy as it seems, but I have started abstracting the DataFormatter into a different type that can work with TypeScript generics and the original API. This is a priority for me because this is an existing feature that isn't alpha/beta.
Summary
Improvement in input/output formats when dealing with arrays of strings / tokens for RNN.
Basic example
Currently the LSTM model, for example, accepts different types of data as input and as training data: arrays and strings. While the out-of-the-box support for strings is great, it has some defaults that don't seem to fit many use cases.
Current behavior
String data implicitly uses character tokenization, behaving as `input.split('')` and `output.join('')`. See #799, which I believe is a common case of LSTM usage: string input and label output. Because of the default behavior, the labels are treated as actual text.
It is possible to preprocess the data by splitting everything into arrays, so that the model maps words to neurons instead of characters.
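A minimal sketch of that preprocessing step in plain JavaScript, without calling brain.js itself. The sample sentence, the label, and the `toWordTokens` helper are hypothetical; they only illustrate the difference between the implicit character split and an explicit word split.

```javascript
// Hypothetical training sample: a sentence mapped to a label.
const sample = { input: 'turn on the light', output: 'lights_on' };

// What the implicit string handling amounts to: one token per character.
const charTokens = sample.input.split('');

// Hypothetical helper: split on whitespace so that each word is one token.
function toWordTokens(text) {
  return text.trim().split(/\s+/);
}

// Preprocessed sample: words as input tokens, the whole label as a single output token.
const preprocessed = {
  input: toWordTokens(sample.input),
  output: [sample.output],
};

console.log(charTokens.length);   // 17
console.log(preprocessed.input);  // [ 'turn', 'on', 'the', 'light' ]
console.log(preprocessed.output); // [ 'lights_on' ]
```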
Here, we are using words instead of characters. That gives our model simpler data to learn from, and it also ensures that the output is exactly what we expect in terms of label names. The problem right now is that my output would be `soIexpecttokensasoutput` because of the `output.join('')` behavior. Working with string arrays is not documented anywhere, so it surely causes trouble for some users of the library.
Current workaround
For now the most basic workaround I can think of is to add a space to every word in an array so that the output is readable and processable.
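The effect of the join and of the workaround can be reproduced in plain JavaScript. This is only a sketch: the token list stands in for what a network would return, and `padTokens` is a hypothetical helper name, not something the library provides.

```javascript
// Word tokens standing in for the result of a run on string-array data.
const tokens = ['so', 'I', 'expect', 'tokens', 'as', 'output'];

// Joining without a separator glues the words together.
const joined = tokens.join('');
console.log(joined); // soIexpecttokensasoutput

// Workaround: append a space to every word (hypothetical helper),
// applied to the data before it is used for training.
function padTokens(words) {
  return words.map((word) => word + ' ');
}

// With padded tokens, the same join yields readable text.
const readable = padTokens(tokens).join('').trim();
console.log(readable); // so I expect tokens as output

// The readable output can be split back into the original tokens.
const recovered = readable.split(' ');
console.log(recovered.length); // 6
```

The cost of this approach is that every token in the vocabulary carries a trailing space, which has to be trimmed or split away again after each run.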
Possible improvement
I think the best improvement would be to have the same output format as the input. In the case of #799, where strings are used as input and arrays as output, it should probably throw an explicit error about the input being a string while the output is an array. Preprocessing the data by splitting it into words with `input.split(' ')` would be a pretty easy step for the user, rather than figuring out both how the input and output are mapped to neurons and how the output is formatted.
Motivation
Apart from issue #799, I have been dealing with this problem myself and spent some time on it before ending up with a lot of extra code as a workaround. Using string arrays with RNNs is not really documented and does not offer much customization.
A lot of RNN (specifically LSTM) usage seems to be for reading questions or requests and writing a response using words related to a topic in a fairly limited vocabulary, or for mapping labels to a specific textual input. The latter is not well supported at the moment and has very implicit behavior. We could provide a more flexible RNN that is more easily customizable in how it treats input and output, and better suited to more use cases.
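As a rough illustration of the behavior suggested under "Possible improvement" (same output format as the input, plus an explicit error when the two differ), here is a sketch in plain JavaScript. Both `assertSameFormat` and `formatLikeInput` are hypothetical names made up for this example and are not part of the brain.js API.

```javascript
// Hypothetical helpers sketching the proposal; none of this is existing brain.js API.

// Reject samples that mix a string with an array, as in the #799 case.
function assertSameFormat(sample) {
  const kindOf = (value) => (Array.isArray(value) ? 'an array' : 'a string');
  if (Array.isArray(sample.input) !== Array.isArray(sample.output)) {
    throw new TypeError(
      `input is ${kindOf(sample.input)} but output is ${kindOf(sample.output)}; ` +
        'use the same format for both'
    );
  }
}

// Format predicted tokens the same way the input was given:
// string in, string out; array in, array out.
function formatLikeInput(input, predictedTokens) {
  return Array.isArray(input) ? [...predictedTokens] : predictedTokens.join('');
}

const predicted = ['so', 'I', 'expect', 'tokens', 'as', 'output'];
console.log(formatLikeInput(['some', 'words'], predicted)); // stays an array of tokens
console.log(formatLikeInput('some text', ['l', 'a', 'b', 'e', 'l'])); // label

try {
  assertSameFormat({ input: 'a plain string', output: ['label'] });
} catch (error) {
  console.log(error.message);
}
```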