As shown in the code snippet above, we can use the GigaSpeech dataset without downloading it to a local machine by setting streaming=True. I am interested in combining the Hugging Face streaming datasets feature with Lhotse functionality, such as the Dynamic Sampler.
It would be nice to have an HF dataset adapter for Lhotse. We could call it HFDatasetIterator. Since HF datasets don't share a common schema, we would need to support a user-defined mapping from the items in an HF example to the fields of a Lhotse Cut.
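To make the mapping idea concrete, here is a minimal sketch of what such an adapter might look like. It is not an existing Lhotse API: `HFDatasetIterator` is only the name proposed above, `CutFields` is a hypothetical plain-Python stand-in for a Lhotse Cut, and the column names (`segment_id`, `audio`, `text`) are assumptions about the HF example layout. The fake stream stands in for what `load_dataset(..., streaming=True)` would yield, so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


@dataclass
class CutFields:
    """Hypothetical stand-in for the fields that would feed a Lhotse Cut."""
    id: str
    audio: Any
    sampling_rate: int
    text: Optional[str] = None


# A getter pulls one target field out of a raw HF example dict.
Getter = Callable[[Dict[str, Any]], Any]


class HFDatasetIterator:
    """Lazily maps HF-style example dicts to CutFields via a user-defined mapping."""

    def __init__(self, dataset: Iterable[Dict[str, Any]], mapping: Dict[str, Getter]):
        self.dataset = dataset
        self.mapping = mapping

    def __iter__(self) -> Iterator[CutFields]:
        # One example at a time, so a streaming dataset is never materialized.
        for example in self.dataset:
            fields = {name: getter(example) for name, getter in self.mapping.items()}
            yield CutFields(**fields)


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for a streaming HF dataset; column names are assumed."""
    for i in range(2):
        yield {
            "segment_id": f"seg-{i}",
            "audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000},
            "text": f"utterance {i}",
        }


# User-defined mapping from HF example items to cut fields.
mapping: Dict[str, Getter] = {
    "id": lambda ex: ex["segment_id"],
    "audio": lambda ex: ex["audio"]["array"],
    "sampling_rate": lambda ex: ex["audio"]["sampling_rate"],
    "text": lambda ex: ex["text"],
}

cuts = list(HFDatasetIterator(fake_stream(), mapping))
```

In a real integration, the body of `__iter__` would build actual Lhotse objects instead of `CutFields`, and the resulting lazy iterable could then be handed to a sampler. Keeping the mapping as plain callables means each dataset only needs its own small dictionary rather than a dedicated adapter class.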
I noticed there are features in Lhotse like
However, the Hugging Face datasets are formatted differently, such as:
https://huggingface.co/datasets/speechcolab/gigaspeech/blob/main/data/audio/m_files_additional/m_chunks_0000.tar.gz
I am looking for a way to integrate these two approaches effectively. Let me know if you need any further adjustments!