How to combine with huggingface audio datasets? #1366

Open
yuekaizhang opened this issue Jul 2, 2024 · 1 comment

Comments

@yuekaizhang
Contributor

from datasets import load_dataset
ds = load_dataset(
    "speechcolab/gigaspeech",
    "xl",
    split="train",
    trust_remote_code=True,
    streaming=True,
)

As shown in the code snippet above, we can use the GigaSpeech dataset without downloading it to a local machine by setting streaming=True. I am interested in combining the Hugging Face streaming datasets feature with Lhotse functionality such as the Dynamic Sampler.
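For context, here is a minimal sketch of how a lazy, Shar-backed CutSet is typically paired with Lhotse's dynamic sampler; the URLs are placeholders and max_duration is an arbitrary example value:

from lhotse import CutSet
from lhotse.dataset import DynamicCutSampler

# Lazily stream cuts and audio from remote Shar shards (placeholder URLs).
cuts = CutSet.from_shar(
    fields={
        "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz"],
        "recording": ["pipe:curl https://my.page/recording.000000.tar"],
    }
)

# The sampler keeps drawing cuts until a mini-batch holds ~100s of audio.
sampler = DynamicCutSampler(cuts, max_duration=100.0, shuffle=True)
for batch in sampler:
    ...  # each batch is itself a CutSet

The open question is how to build an equally lazy CutSet on top of a streaming Hugging Face dataset instead of Shar shards.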

I noticed there are features in Lhotse like

    >>> cuts = LazySharIterator({
    ...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz"],
    ...     "recording": ["pipe:curl https://my.page/recording.000000.tar"],
    ... })

However, the Hugging Face datasets are stored in a different format, e.g.:
https://huggingface.co/datasets/speechcolab/gigaspeech/blob/main/data/audio/m_files_additional/m_chunks_0000.tar.gz

I am looking for a way to integrate these two approaches effectively.

@pzelasko
Collaborator

pzelasko commented Jul 3, 2024

Hi Yuekai,

It would be nice to have an HF dataset adapter for Lhotse. We may call it HFDatasetIterator. Since HF datasets don't share a common schema, we need to support a user-defined mapping from the items in an HF example to the fields of a lhotse Cut.

pseudo code:

from typing import Dict


class HFDatasetIterator:
    def __init__(self, *hf_dataset_or_args, field_map: Dict[str, str], **hf_kwargs) -> None:
        self.dataset = hf_dataset_or_args
        self.hf_kwargs = hf_kwargs
        self.field_map = field_map

    def __iter__(self):
        from datasets import Dataset, load_dataset

        if len(self.dataset) == 1 and isinstance(self.dataset[0], Dataset):
            # The user passed an already-constructed HF Dataset object.
            dataset = self.dataset[0]
        else:
            # Otherwise treat the args as arguments for load_dataset(...).
            dataset = load_dataset(*self.dataset, **self.hf_kwargs)

        for example in dataset:
            # create_cut / update_field are placeholders to be implemented.
            cut = create_cut(example)
            for field in example:
                tgt_field = self.field_map[field]
                update_field(cut, tgt_field, example[field])
            yield cut

then expose a new CutSet constructor:

from datasets import load_dataset
ds = load_dataset(...)
cuts = CutSet.from_huggingface(ds, field_map=field_map)

# alternatively
cuts = CutSet.from_huggingface("speechcolab/gigaspeech", "xl", split="train", ..., field_map=field_map)

It would be good to reuse as much as possible of the existing HF dataset integration in NeMo, to simplify the later integration: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb

Let me know if you'd like to contribute that, otherwise I'll try to find some time later.
