Is your feature request related to a problem? Please describe.
Delta releases are great, and since v13.0 they contain correct data. However, incorporating them into one's workflow is not straightforward, because it involves merging both clips AND metadata, and then re-splitting.
The idea behind delta releases is like this in general:
version [N] dataset (which you already have & extracted) + version [N+1] DELTA (that you just downloaded)
=> version [N+1] dataset
To get there, you have to do the following (manually or using a script):
1. You have v[N] extracted.
2. Download v[N+1] DELTA and extract it somewhere else.
3. Create a v[N+1] directory and copy the clips directory contents from v[N] and v[N+1] DELTA into it.
4. Merge v[N] validated.tsv with v[N+1] DELTA validated.tsv and write the result to v[N+1] validated.tsv. The same works for invalidated.tsv, but NOT for other.tsv: some clips from the previous version may have moved to other buckets (validated/invalidated). So you cannot fully reconstruct the new dataset, unless perhaps (!) you write some code to check whether they moved.
5. Run CorporaCreator on v[N+1] validated.tsv (after renaming it to clips.tsv), which then creates the train/dev/test splits.
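For reference, steps 3 and 4 could be scripted roughly as below. This is only a sketch under assumptions: the function names `merge_tsv` and `build_release` are made up for illustration, it assumes every TSV has a `path` column holding the clip file name and that both versions share the same header, and it deliberately does not touch other.tsv or the train/dev/test splits.

```python
import shutil
from pathlib import Path


def merge_tsv(old_tsv: Path, delta_tsv: Path, out_tsv: Path, key: str = "path") -> int:
    """Concatenate two TSVs with identical headers, keeping the first row per key.

    Lines are copied verbatim, so quotes inside sentences are left untouched.
    Returns the number of rows written.
    """
    seen = set()
    header = None
    key_idx = -1
    with out_tsv.open("w", encoding="utf-8") as out:
        for tsv in (old_tsv, delta_tsv):
            with tsv.open(encoding="utf-8") as f:
                file_header = f.readline().rstrip("\n")
                if header is None:
                    header = file_header
                    key_idx = header.split("\t").index(key)
                    out.write(header + "\n")
                elif file_header != header:
                    # Assumption: columns did not change between the two versions.
                    raise ValueError(f"header mismatch in {tsv}")
                for line in f:
                    line = line.rstrip("\n")
                    if not line:
                        continue
                    clip = line.split("\t")[key_idx]
                    if clip not in seen:
                        seen.add(clip)
                        out.write(line + "\n")
    return len(seen)


def build_release(old_dir: Path, delta_dir: Path, new_dir: Path) -> dict:
    """Step 3 (copy clips) and step 4 (merge validated/invalidated metadata)."""
    (new_dir / "clips").mkdir(parents=True, exist_ok=True)
    for src in (old_dir, delta_dir):
        shutil.copytree(src / "clips", new_dir / "clips", dirs_exist_ok=True)
    counts = {}
    for name in ("validated.tsv", "invalidated.tsv"):
        counts[name] = merge_tsv(old_dir / name, delta_dir / name, new_dir / name)
    return counts
```

Calling `build_release(Path("cv-N"), Path("cv-N+1-delta"), Path("cv-N+1"))` (hypothetical directory names) would leave step 5 and the other.tsv problem still to be solved by hand, which is exactly the pain point.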
Very few people do that, because of steps 4 and 5, so the whole point of having delta releases gets lost.
So I propose this:
Describe the solution you'd like
Include in v[N+1] DELTA the same metadata that ships with the full version of v[N+1].
This way one can just copy the files and be done.
Describe alternatives you've considered
I can think of two different workflows here:
1. Using the newly created dataset, you train a model from scratch.
2. You have a model based on v[N] and a good amount of new recordings in the v[N+1] DELTA, so you choose to fine-tune (this may be preferred for the largest datasets).
The solution I suggested works for the first workflow, and a programmer can easily get what they need from the clips file list, so the second can also be solved.
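To illustrate the second workflow under the proposed layout: if the delta shipped the full v[N+1] metadata, the rows belonging only to the new clips could be picked out by matching against the file names in the delta's clips directory. A minimal sketch, assuming a `path` column that holds the clip file name (the function name `rows_for_clips` is hypothetical):

```python
from pathlib import Path


def rows_for_clips(full_tsv: Path, clips_dir: Path, key: str = "path"):
    """Return (header, rows) for the metadata rows whose clip file exists in clips_dir."""
    names = {p.name for p in clips_dir.iterdir() if p.is_file()}
    rows = []
    with full_tsv.open(encoding="utf-8") as f:
        header = f.readline().rstrip("\n")
        key_idx = header.split("\t").index(key)
        for line in f:
            line = line.rstrip("\n")
            # Keep the raw line so the output can be written back unchanged.
            if line and line.split("\t")[key_idx] in names:
                rows.append(line)
    return header, rows
```

The returned rows could then be written to a fine-tuning TSV as they are.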
Alternatively, both full and delta metadata can be put into the distribution, if desired.
Additional context
I don't have numbers on how many people use the second workflow, but I suspect very few download the deltas.
I find delta releases important. They are much smaller than the full releases, so using them saves a lot of bandwidth and reduces the carbon footprint of the whole system, which I consider of the utmost importance.