

Data Grouping and Usage of Past Data in instruction_following.py

We organize the training data into distinct groups, based primarily on the format of each instance and on whether it comes from a past or a new dataset. The rationale behind this grouping and the specific role of past data are explained below.

Why Group Data?

Data grouping lets us manage and process each type of data on its own terms. Each group represents a distinct kind of input that the model is expected to handle. For instance, we have groups such as:

  • image-text in-context: Image-text data in which each training instance is accompanied by related in-context examples. Past and new instances are loaded from separate paths. This format suits tasks where an image-text pair only makes sense given surrounding context.
  • image-text: Regular image-text data, where each instance is a single image-text pair. Like the previous group, it supports loading both past and new instances.
  • text: Text-only data, used for tasks that deal solely with text, such as text classification or text generation.
  • video-text: Video-text data, used for tasks where text describes or explains video content.

Grouping data like this lets us address the needs of each data type separately, which keeps the pipeline organized and yields a model that can handle multiple types of input effectively.
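To make the structure concrete, here is a minimal sketch of how these groups could be described in code. The GroupSpec class, the load paths, and the DATA_GROUPS list are all illustrative assumptions, not the actual contents of instruction_following.py:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroupSpec:
    """One data group: a format name plus paths for new and past instances."""
    name: str                        # e.g. "image-text in-context"
    new_path: str                    # path to newly collected instances
    past_path: Optional[str] = None  # path to past instances, if any

# One spec per group described above; all paths are placeholders.
DATA_GROUPS = [
    GroupSpec("image-text in-context", "data/new/in_context.json", "data/past/in_context.json"),
    GroupSpec("image-text", "data/new/image_text.json", "data/past/image_text.json"),
    GroupSpec("text", "data/new/text.json", "data/past/text.json"),
    GroupSpec("video-text", "data/new/video_text.json", "data/past/video_text.json"),
]
```

Keeping the past path optional mirrors the fact that some groups may have no earlier data to replay.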

Role of Past Data

Past data plays a crucial role in training our models. As new data comes in and models get updated, there is a risk that the model could "forget" the patterns and structures it learned from the past data. This phenomenon, known as "catastrophic forgetting", is a common challenge in training neural networks, especially when the model is expected to keep learning over time.

To counteract this issue, we "resample" the past data: when we train the model on new data, we also include a subset of the past data in the training set. This reminds the model of the patterns it learned before, helping it avoid catastrophic forgetting.
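As a rough illustration, resampling can be as simple as drawing a random subset of the past instances and shuffling them in with the new ones. The mix_in_past_data helper below is a sketch under that assumption, not the script's actual implementation:

```python
import random

def mix_in_past_data(new_data, past_data, past_subset_ratio, seed=0):
    """Return new_data plus a random subset of past_data.

    The subset holds past_subset_ratio * len(past_data) instances,
    so a ratio of 0.1 replays 10% of the past dataset each cycle.
    """
    rng = random.Random(seed)
    k = int(len(past_data) * past_subset_ratio)
    mixed = list(new_data) + rng.sample(past_data, k)
    rng.shuffle(mixed)
    return mixed
```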

The fraction of the past dataset that is resampled can be adjusted with the past_subset_ratio argument. Changing this ratio controls how much past data is incorporated into each new training cycle.
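If past_subset_ratio is exposed as a command-line flag (an assumption here; only the argument name comes from the description above), adjusting it per run might look like this:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--past_subset_ratio", type=float, default=0.1,  # default is illustrative
    help="fraction of each past dataset resampled into the training set",
)
args = parser.parse_args()

# e.g. train_set = mix_in_past_data(new_data, past_data, args.past_subset_ratio)
```

A small ratio keeps training focused on new data while still replaying enough past instances to guard against forgetting; a larger ratio weights retention more heavily.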

In conclusion, grouping the data by format and resampling past data into ongoing training cycles are how we keep the model up to date while retaining its ability to apply past learning to new data. This approach allows continual learning while minimizing the risk of catastrophic forgetting.