Storage of integer features + Feature extraction best practice #1407

Open
njellinas opened this issue Oct 22, 2024 · 7 comments

Comments

@njellinas

Hello, I have some features that come from codebooks and are int16 values. What is the best way to store them?

Also, suppose I already have a manifest file with Recordings and Supervisions and I want to extract features, but not save a new manifest file. Can I attach the existing features at runtime to the CutSet created from that manifest file?
In general, what would be the best way to preprocess a dataset? The available manifests provided with lhotse include CutSets. In order to extract features and use them, do I have to create new CutSets that include the features? Can't I have separate CutSets with recordings + supervisions and separate files for features?
Would it be better to always have separate recording manifests, supervision manifests, and feature manifests, and just combine them during dataloading with the from_manifests function?

@njellinas njellinas changed the title Storage of integer features Storage of integer features + Feature extraction best practice Oct 22, 2024
@pzelasko
Collaborator

For storage you can use NumpyFilesWriter, something like:

from lhotse.features.io import NumpyFilesWriter

with NumpyFilesWriter(...) as w:
    for cut in cuts:
        array = extract_codebook(cut)
        # store_array() writes the array to disk and returns a manifest entry
        # that gets attached to the cut as a custom field.
        cut.codebook = w.store_array(cut.id, array, temporal_dim=..., frame_shift=...)

Then, if you save with cuts.to_file(), the codebook manifest will be present in the CutSet.

You can also do it more concisely, without writing to disk:

cut = cut.attach_tensor("codebook", extract_codebook(cut), temporal_dim=..., frame_shift=...)

in which case everything is kept in memory.

If you want to keep everything in separate files, I suggest looking into the Lhotse Shar format, which allows that (the various fields are combined on the fly). This lets you have multiple versions of codebooks etc. and easily switch between them if you're experimenting with different models.
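
A rough sketch of that workflow (the field name "codebook" and the paths are just placeholders, and the exact to_shar/from_shar arguments may differ between Lhotse versions):

from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")
# Export the cuts; each field goes into its own set of sharded files,
# e.g. the audio as wav and the custom "codebook" arrays as numpy.
cuts.to_shar("data/shar", fields={"recording": "wav", "codebook": "numpy"}, shard_size=1000)

# Later, read everything back; the per-field files are combined on the fly.
cuts = CutSet.from_shar(in_dir="data/shar")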

@njellinas
Author

njellinas commented Oct 23, 2024

  1. I would like to utilize the lhotse compute_and_store_features function in order to be compatible with other types of features, i.e. to interchange these codebooks with mels for other models, but with the same interface.
    I managed to pad the wav so that my own custom feature extractor passes lhotse's validation checks for the number of frames, so the features are stored (I used the hdf5 writer).

There I have a problem: during loading, lhotse converts the features back to float32. I made a custom class with the following line: self.hdf.create_dataset(key, data=value, dtype=value.dtype), so that the value is stored with its original dtype, but during loading with the PrecomputedFeatures strategy it loads them back as float32. (A sketch of that custom class is shown after this list.)

  2. Let's say I have downloaded the libri-tts CutSet files from the lhotse download scripts. Then I perform feature extraction with cuts = self.cuts.compute_and_store_features and save the features to disk.
    Then I want to load the same libri-tts CutSet that I downloaded and attach the existing features. Is this possible? Or must I save the new cuts that resulted from the command above?
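
For reference, a minimal sketch of that custom class (the class name is arbitrary; it assumes NumpyHdf5Writer.write() creates one dataset per key and returns the storage key):

from lhotse.features.io import NumpyHdf5Writer

class DtypePreservingHdf5Writer(NumpyHdf5Writer):
    def write(self, key: str, value) -> str:
        # Pass the array's own dtype so that e.g. int16 stays int16 on disk.
        self.hdf.create_dataset(key, data=value, dtype=value.dtype)
        return key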

@pzelasko
Collaborator

but during loading with the PrecomputedFeatures strategy it loads them back as float32.

You might want to replace PrecomputedFeatures with something like collate_matrices(c.load_features() for c in cuts), or modify PrecomputedFeatures to keep the original dtype (I'd be OK with a PR for this change).
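
For example, a minimal sketch of wrapping that into a collate function (the function name is arbitrary; it assumes the cuts already have feature manifests attached):

import torch
from lhotse import CutSet
from lhotse.dataset.collation import collate_matrices

def collate_feats(cuts: CutSet) -> torch.Tensor:
    # Load each cut's precomputed features (in whatever dtype they were stored)
    # and pad them into a single (batch, time, dim) tensor.
    return collate_matrices(c.load_features() for c in cuts)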

cuts = self.cuts.compute_and_store_features and I save the features in the disk.

You can do FeatureSet(c.features for c in cuts).to_file("my_feats.jsonl.gz") and later:

from lhotse import CutSet, FeatureSet

class LazyFeatureAttacher:
    """Lazily attach precomputed Features to cuts; assumes both manifests
    are in the same order."""

    def __init__(self, cuts, features):
        self.cuts = cuts
        self.features = features

    def __iter__(self):
        for c, f in zip(self.cuts, self.features):
            c.features = f
            yield c

cuts = CutSet.from_file(...)
features = FeatureSet.from_file(...)
cuts = CutSet(LazyFeatureAttacher(cuts, features))

@njellinas
Author

Do you know if this last operation is possible with lazy CutSets and FeatureSets?
E.g. I have the libri CutSets that don't have features. Then I save the features as FeatureSets, and then I want to combine them lazily in order to use a DynamicBucketingSampler.
I could save the cuts produced by compute_and_store_features, but this is not ideal, because if e.g. the front-end changes, then I would have to recompute every feature for just some changes in the corpus.

@pzelasko
Collaborator

Yes this would work with lazy datasets.

@njellinas
Author

I think unfortunately this won't work, because if you have filters in the CutSet, or the cuts are saved in a different order, then the attached features correspond to a different utterance...
So the only way to be sure is to save the CutSet produced by the feature extraction, I guess?

@pzelasko
Collaborator

Make sure the feature set is sorted in the same order as the cut set, and apply any filter only after you attach the features.
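
Roughly, building on the LazyFeatureAttacher sketch above (the file names and the filter are just examples):

from lhotse import CutSet, FeatureSet

cuts = CutSet.from_file("libritts_cuts.jsonl.gz")
feats = FeatureSet.from_file("my_feats.jsonl.gz")  # written in the same order as the cuts

# Attach first, while the two manifests are still aligned one-to-one...
cuts = CutSet(LazyFeatureAttacher(cuts, feats))

# ...and only then filter/subset, so cut/feature pairs never go out of sync.
cuts = cuts.filter(lambda c: c.duration <= 20.0)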
