Storage of integer features + Feature extraction best practice #1407

Open
njellinas opened this issue Oct 22, 2024 · 7 comments

Comments

@njellinas

Hello, I have some features that come from codebooks and are int16 values. What is the best way to store them?

Also, suppose I already have a manifest file with Recordings and Supervisions and I want to extract features, but not save a new manifest file. Can I attach the existing features at runtime to the CutSet created from that manifest file?
In general, what would be the best way to preprocess a dataset? The available manifests provided with lhotse include CutSets. In order to extract features and use them, do I have to create new CutSets that include the features? Can't I have separate CutSets with recordings + supervisions and separate files for features?
Would it be better to always have separate recording manifests, supervision manifests, and feature manifests, and just combine them during dataloading with the from_manifests function?

@njellinas njellinas changed the title Storage of integer features Storage of integer features + Feature extraction best practice Oct 22, 2024
@pzelasko
Collaborator

For storage you can use NumpyFilesWriter, something like:

from lhotse.features.io import NumpyFilesWriter

with NumpyFilesWriter(...) as w:
    for cut in cuts:
        array = extract_codebook(cut)
        # store_array() writes the array to disk and returns a manifest entry
        # that gets attached to the cut as a custom field.
        cut.codebook = w.store_array(cut.id, array, temporal_dim=..., frame_shift=...)

Then, if you save with cuts.to_file(), the codebook manifest will be present in the CutSet.

You can also do it more concisely, without writing to disk:

cut = cut.attach_tensor("codebook", extract_codebook(cut), temporal_dim=..., frame_shift=...)

in which case everything is kept in memory.

If you want to keep everything in separate files, I suggest looking into the Lhotse Shar format, which allows that (the various fields are combined on the fly). This lets you have multiple versions of codebooks etc. and easily switch between them if you're experimenting with different models.
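
A rough sketch of that workflow (the field name "codebook" and the paths are just placeholders, and the exact to_shar/from_shar arguments may differ between Lhotse versions):

from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")
# Export the cuts; each field goes into its own set of sharded files,
# e.g. the audio as wav and the custom "codebook" arrays as numpy.
cuts.to_shar("data/shar", fields={"recording": "wav", "codebook": "numpy"}, shard_size=1000)

# Later, read everything back; the per-field files are combined on the fly.
cuts = CutSet.from_shar(in_dir="data/shar")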

@njellinas
Author

njellinas commented Oct 23, 2024

  1. I would like to utilize the lhotse compute_and_store_features function in order to be compatible with other types of features, i.e. to interchange these codebooks with mels for other models, but with the same interface.
    I managed to pad the wav so that my own custom feature extractor passes lhotse's validation checks for the number of frames, so the features are stored (I used the hdf5 writer).

There I have a problem: during loading, lhotse converts the features back to float32. I made a custom class with the following line: self.hdf.create_dataset(key, data=value, dtype=value.dtype), so that the value is stored with its original dtype, but during loading with the PrecomputedFeatures strategy it loads them back as float32. (A sketch of that custom class is shown after this list.)

  2. Let's say I have downloaded the libri-tts CutSet files from the lhotse download scripts. Then I perform feature extraction with cuts = self.cuts.compute_and_store_features and save the features to disk.
    Then I want to load the same libri-tts CutSet that I downloaded and attach the existing features. Is this possible? Or must I save the new cuts that resulted from the command above?
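
For reference, a minimal sketch of that custom class (the class name is arbitrary; it assumes NumpyHdf5Writer.write() creates one dataset per key and returns the storage key):

from lhotse.features.io import NumpyHdf5Writer

class DtypePreservingHdf5Writer(NumpyHdf5Writer):
    def write(self, key: str, value) -> str:
        # Pass the array's own dtype so that e.g. int16 stays int16 on disk.
        self.hdf.create_dataset(key, data=value, dtype=value.dtype)
        return key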

@pzelasko
Collaborator

but during loading with the PrecomputedFeatures strategy it loads them back as float32.

You might want to replace PrecomputedFeatures with something like collate_matrices(c.load_features() for c in cuts), or modify PrecomputedFeatures to keep the original dtype (I'd be OK with a PR for this change).
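
For example, a minimal sketch of wrapping that into a collate function (the function name is arbitrary; it assumes the cuts already have feature manifests attached):

import torch
from lhotse import CutSet
from lhotse.dataset.collation import collate_matrices

def collate_feats(cuts: CutSet) -> torch.Tensor:
    # Load each cut's precomputed features (in whatever dtype they were stored)
    # and pad them into a single (batch, time, dim) tensor.
    return collate_matrices(c.load_features() for c in cuts)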

cuts = self.cuts.compute_and_store_features and I save the features in the disk.

You can do FeatureSet(c.features for c in cuts).to_file("my_feats.jsonl.gz") and later:

from lhotse import CutSet, FeatureSet

class LazyFeatureAttacher:
    """Lazily attach precomputed Features to cuts; assumes both manifests
    are in the same order."""

    def __init__(self, cuts, features):
        self.cuts = cuts
        self.features = features

    def __iter__(self):
        for c, f in zip(self.cuts, self.features):
            c.features = f
            yield c

cuts = CutSet.from_file(...)
features = FeatureSet.from_file(...)
cuts = CutSet(LazyFeatureAttacher(cuts, features))

@njellinas
Author

Do you know if this last operation is possible with lazy CutSets and FeatureSets?
E.g. I have the libri CutSets that don't have features. Then I save the features as FeatureSets, and then I want to combine them lazily in order to use a DynamicBucketingSampler.
I could save the cuts produced by compute_and_store_features, but this is not ideal, because if e.g. the front-end changes, then I would have to recompute every feature for just some changes in the corpus.

@pzelasko
Collaborator

Yes this would work with lazy datasets.

@njellinas
Author

I think unfortunately this won't work, because if you have filters in the CutSet, or the cuts are saved in a different order, then the attached features correspond to a different utterance...
So the only way to be sure is to save the CutSet produced by the feature extraction, I guess?

@pzelasko
Collaborator

Make sure the feature set is sorted in the same order as the cut set, and apply any filter only after you attach the features.
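
Roughly, building on the LazyFeatureAttacher sketch above (the file names and the filter are just examples):

from lhotse import CutSet, FeatureSet

cuts = CutSet.from_file("libritts_cuts.jsonl.gz")
feats = FeatureSet.from_file("my_feats.jsonl.gz")  # written in the same order as the cuts

# Attach first, while the two manifests are still aligned one-to-one...
cuts = CutSet(LazyFeatureAttacher(cuts, feats))

# ...and only then filter/subset, so cut/feature pairs never go out of sync.
cuts = cuts.filter(lambda c: c.duration <= 20.0)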
