Storage of integer features + Feature extraction best practice #1407
For storage you can use:

```python
with NumpyFilesWriter(...) as w:
    for cut in cuts:
        array = extract_codebook(cut)
        cut.codebook = w.store_array(cut.id, array, temporal_dim=..., frame_shift=...)
```

You can also do it shorter, without writing to disk:

```python
cut = cut.attach_tensor("codebook", extract_codebook(cut), temporal_dim=..., frame_shift=...)
```

in which case everything is kept in memory. If you want to keep everything in separate files, I suggest looking into the Lhotse Shar format, which allows that (the various fields are combined on the fly). This lets you have multiple versions of codebooks etc. and easily switch between them if you're experimenting with different models.
There I have a problem where during loading lhotse converts the features back to float32. I made a custom class with the following line:
You might want to replace
You can do:

```python
class LazyFeatureAttacher:
    def __init__(self, cuts, features):
        self.cuts = cuts
        self.features = features

    def __iter__(self):
        for c, f in zip(self.cuts, self.features):
            c.features = f
            yield c

cuts = CutSet.from_file(...)
features = FeatureSet.from_file(...)
cuts = CutSet(LazyFeatureAttacher(cuts, features))
```
Do you know if this last operation is possible with lazy CutSets and FeatureSets?
Yes, this would work with lazy datasets.
I think unfortunately this won't work, because if you have filters in the CutSet, or the two manifests are saved in a different order, then the attached features correspond to a different utterance...
Make sure the feature set is sorted according to the cut set, and apply any filters only after you attach the features.
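A defensive variant of the attacher sketched in plain Python, with dicts standing in for cuts and feature manifests (real lhotse objects would match on `recording_id`/`start` rather than a bare `id`): indexing the features by id first makes the pairing immune to a shuffled or filtered feature file, and a missing entry fails loudly instead of attaching a neighbor's features.

```python
# Stand-ins for cut and feature manifests; note the orders differ.
cuts = [{"id": "utt-b"}, {"id": "utt-a"}]
features = [{"id": "utt-a", "mat": "feats_a"},
            {"id": "utt-b", "mat": "feats_b"}]

# Index features by id so on-disk order no longer matters.
feats_by_id = {f["id"]: f for f in features}

def attach_features(cuts, feats_by_id):
    for cut in cuts:
        feat = feats_by_id.get(cut["id"])
        if feat is None:
            # Fail loudly instead of silently mispairing utterances.
            raise KeyError(f"No features for cut {cut['id']}")
        cut["features"] = feat
        yield cut

attached = list(attach_features(cuts, feats_by_id))
assert attached[0]["features"]["mat"] == "feats_b"
```

The trade-off versus plain `zip` is that the index must fit in memory; for very large feature sets, sorting both manifests once up front (as suggested above) avoids that cost.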
Hello, I have some features that come from codebooks and are int16 values. What is the best way to store them?
Also, suppose I already have a manifest file with Recordings and Supervisions and I want to extract features without saving a new manifest file. Can I attach the existing features at runtime to the CutSet created from that manifest file?
In general, what would be the best way to preprocess a dataset? The manifests that ship with lhotse include CutSets. In order to extract features and use them, do I have to create new CutSets that include the features? Or can I have separate CutSets with recordings + supervisions and separate files for the features?
Would it be better to always have separate recording manifests, supervision manifests and feature manifests and just combine them during dataloading with the from_manifests function?