Recurrent neural networks and string arrays #871

purplnay · 2023-01-02T18:31:03Z

Summary

Improve the input/output formats used when dealing with arrays of strings (tokens) in RNNs.

Basic example

Currently the LSTM model, for example, accepts different types of data as input and as training data: arrays and strings. While the out-of-the-box support for strings is great, its defaults don't fit many use cases.

Current behavior

Given raw strings:

const data = [
  {
    input: 'hello I am an input',
    output: 'hi I am the output'
  }
]

This implicitly uses character tokenization, behaving like input.split('') on the way in and output.join('') on the way out. See #799, which I believe is a common case of LSTM usage: string input and label output. Because of the default behavior, the labels are treated as actual text.
It is possible to preprocess the data by splitting everything into arrays, so that the model maps words to neurons instead of characters.
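To make the difference concrete, here is a minimal plain-JavaScript sketch of the two tokenizations described above. It makes no brain.js calls, and the variable names are only illustrative.

```javascript
// Plain JavaScript only; this mimics the described behavior, it is not brain.js code.
const example = {
  input: 'hello I am an input',
  output: 'hi I am the output'
};

// Character tokenization: one token per character, spaces included.
const charTokens = example.input.split('');

// Word tokenization done by hand before training: one token per word.
const wordTokens = example.input.split(' ');

console.log(charTokens.length); // 19
console.log(wordTokens); // [ 'hello', 'I', 'am', 'an', 'input' ]
```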

Given arrays of strings:

const data = [
  {
    input: ['I', 'split', 'my', 'input', 'into', 'tokens'],
    output: ['so', 'I', 'expect', 'tokens', 'as', 'output']
  }
]

Here, we are using words instead of characters. This gives the model simpler data to learn from, and it also guarantees that the output is exactly what we expect in terms of label names. The problem right now is that my output would be soIexpecttokensasoutput because of the output.join('') behavior. Working with string arrays is not documented anywhere, so it surely causes trouble for some users of the library.
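A small plain-JavaScript illustration of why joining with an empty string is lossy. It assumes nothing beyond the join('') behavior described above, and the variable names are made up for the example.

```javascript
// Two different token sequences...
const expectedTokens = ['so', 'I', 'expect', 'tokens', 'as', 'output'];
const otherTokens = ['soI', 'expect', 'tokensas', 'output'];

// ...collapse to the same string once joined without a separator,
// so the original token boundaries cannot be recovered from the result.
const joinedExpected = expectedTokens.join('');
const joinedOther = otherTokens.join('');

console.log(joinedExpected); // soIexpecttokensasoutput
console.log(joinedExpected === joinedOther); // true
```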

Current workaround

For now the most basic workaround I can think of is to add a space to every word in an array so that the output is readable and processable.

const data = [
  {
    input: ['I ', 'split ', 'my ', 'input ', 'into ', 'tokens '],
    output: ['so ', 'I ', 'expect ', 'tokens ', 'as ', 'output ']
  }
]
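The workaround above can be wrapped in two small helpers, sketched here in plain JavaScript. The names withTrailingSpaces and toTokens are hypothetical and not part of brain.js. The join('') call only stands in for the output formatting described earlier.

```javascript
// Hypothetical helper: append a space to every token before training.
function withTrailingSpaces(tokens) {
  return tokens.map((token) => `${token} `);
}

// Hypothetical helper: turn the space-separated output string back into tokens.
function toTokens(text) {
  const trimmed = text.trim();
  return trimmed === '' ? [] : trimmed.split(/\s+/);
}

const padded = withTrailingSpaces(['so', 'I', 'expect', 'tokens', 'as', 'output']);

// Stand-in for the joined string the model would return.
const simulatedOutput = padded.join('');

console.log(simulatedOutput.trim()); // so I expect tokens as output
console.log(toTokens(simulatedOutput).length); // 6
```

This only works when no token contains whitespace itself, which is one reason it is a workaround rather than a fix.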

Possible improvement

I think the best improvement would be for the output to have the same format as the input. In the case of #799, where strings are used as input and arrays as output, it should probably throw an explicit error saying that the input is a string while the output is an array. Preprocessing the data by splitting it into words with input.split(' ') would be a pretty easy step for the user, compared with figuring out both how the input and output are mapped to neurons and how the output is formatted.
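A rough sketch of what such an explicit check could look like, written as user-side plain JavaScript. The functions describeFormat and assertMatchingFormats are hypothetical names for illustration; they are not an existing brain.js API and not necessarily how the library would implement it.

```javascript
// Hypothetical helper: classify a training value as 'string', 'array', or its typeof.
function describeFormat(value) {
  if (typeof value === 'string') return 'string';
  if (Array.isArray(value)) return 'array';
  return typeof value;
}

// Hypothetical check: throw if any item mixes formats between input and output.
function assertMatchingFormats(data) {
  data.forEach((item, index) => {
    const inputFormat = describeFormat(item.input);
    const outputFormat = describeFormat(item.output);
    if (inputFormat !== outputFormat) {
      throw new TypeError(
        `Training item ${index}: input format is "${inputFormat}" ` +
        `but output format is "${outputFormat}"; use the same format for both.`
      );
    }
  });
}

// A string input with an array output, as in #799, is rejected...
const mixed = [{ input: 'I split my input', output: ['label'] }];

// ...and splitting the input into words first makes the formats match.
const fixed = mixed.map((item) => ({
  input: item.input.split(' '),
  output: item.output
}));

assertMatchingFormats(fixed); // does not throw
```

Calling assertMatchingFormats(mixed) would throw a TypeError that names the offending item and both formats.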

Motivation

Apart from the #799 issue, I have been dealing with this problem myself, and spent a fair amount of time on it before writing a lot of extra code as a workaround. Using string arrays with RNNs is not really documented and does not offer much customization.

A lot of RNN (specifically LSTM) usage seems to be for reading questions or requests and writing a response using words related to a topic within a fairly limited vocabulary. Another common use is mapping labels to a specific textual input, which is not well supported at the moment and relies on very implicit behavior. We could provide a more flexible RNN that makes it easier to customize how input and output are treated, and that is better suited to a wider range of uses.


robertleeplummerjr · 2023-04-25T14:40:16Z

Working on this now. It is not as easy as it seems, but I have started abstracting the DataFormatter into a different type that can work with TypeScript generics and the original API. This is a priority for me because this is an existing feature that isn't alpha/beta.

purplnay added the enhancement label Jan 2, 2023

purplnay mentioned this issue Jan 2, 2023

Getting gibberish predictions when using recurrent LSTM and arrays of strings as output training data #799

Open

robertleeplummerjr self-assigned this Apr 25, 2023

robertleeplummerjr added the 2 - Working <= 5 label Apr 25, 2023
