How to combine with huggingface audio datasets? #1366

Open
yuekaizhang opened this issue Jul 2, 2024 · 1 comment

Comments

@yuekaizhang
Contributor

from datasets import load_dataset
ds = load_dataset(
    "speechcolab/gigaspeech",
    "xl",
    split="train",
    trust_remote_code=True,
    streaming=True,
)

As shown in the code snippet above, we can use the GigaSpeech dataset without downloading it to a local machine by setting streaming=True. I am interested in combining the Hugging Face streaming datasets feature with Lhotse functionality such as the Dynamic Sampler.
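For context, here is a minimal sketch of how a lazy, Shar-backed CutSet is typically paired with Lhotse's dynamic sampler; the URLs are placeholders and max_duration is an arbitrary example value:

from lhotse import CutSet
from lhotse.dataset import DynamicCutSampler

# Lazily stream cuts and audio from remote Shar shards (placeholder URLs).
cuts = CutSet.from_shar(
    fields={
        "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz"],
        "recording": ["pipe:curl https://my.page/recording.000000.tar"],
    }
)

# The sampler keeps drawing cuts until a mini-batch holds ~100s of audio.
sampler = DynamicCutSampler(cuts, max_duration=100.0, shuffle=True)
for batch in sampler:
    ...  # each batch is itself a CutSet

The open question is how to build an equally lazy CutSet on top of a streaming Hugging Face dataset instead of Shar shards.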

I noticed there are features in Lhotse like

    >>> cuts = LazySharIterator({
    ...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz"],
    ...     "recording": ["pipe:curl https://my.page/recording.000000.tar"],
    ... })

However, the Hugging Face datasets are stored in a different format, e.g.:
https://huggingface.co/datasets/speechcolab/gigaspeech/blob/main/data/audio/m_files_additional/m_chunks_0000.tar.gz

I am looking for a way to integrate these two approaches effectively.

@pzelasko
Collaborator

pzelasko commented Jul 3, 2024

Hi Yuekai,

It would be nice to have an HF dataset adapter for Lhotse. We may call it HFDatasetIterator. Since HF datasets don't share a common schema, we need to support a user-defined mapping from the items in an HF example to the fields of a lhotse Cut.

pseudo code:

from typing import Dict


class HFDatasetIterator:
    def __init__(self, *hf_dataset_or_args, field_map: Dict[str, str], **hf_kwargs) -> None:
        self.dataset = hf_dataset_or_args
        self.hf_kwargs = hf_kwargs
        self.field_map = field_map

    def __iter__(self):
        from datasets import Dataset, load_dataset

        if len(self.dataset) == 1 and isinstance(self.dataset[0], Dataset):
            # The user passed an already-constructed HF Dataset object.
            dataset = self.dataset[0]
        else:
            # Otherwise treat the args as arguments for load_dataset(...).
            dataset = load_dataset(*self.dataset, **self.hf_kwargs)

        for example in dataset:
            # create_cut / update_field are placeholders to be implemented.
            cut = create_cut(example)
            for field in example:
                tgt_field = self.field_map[field]
                update_field(cut, tgt_field, example[field])
            yield cut

then expose a new CutSet constructor:

from datasets import load_dataset
ds = load_dataset(...)
cuts = CutSet.from_huggingface(ds, field_map=field_map)

# alternatively
cuts = CutSet.from_huggingface("speechcolab/gigaspeech", "xl", split="train", ..., field_map=field_map)

It would be good to reuse as much as possible of the existing HF dataset integration in NeMo, to simplify the later integration: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb

Let me know if you'd like to contribute that, otherwise I'll try to find some time later.
