Adding duration for each AudioSource #1069

popcornell · 2023-05-23T09:43:11Z

Hi,

would it be sensible to add an optional duration field to each AudioSource?
I know that, by design, you chose to keep the duration on the Recording only, with the tolerance field handling differences in length between AudioSources.
But in some cases you may also want to check the duration of each AudioSource.

E.g.
You are training a (single-channel, for simplicity) ASR model. You are "unlucky" and sample the "wrong" AudioSource, one whose audio ends before the current utterance. With tolerance it will be padded, as I understand it, so now the input is all zeros and the model still has to learn to predict the utterance.
There are other workarounds right now, such as discarding the truncated supervision. I just want to bring this up.
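
To make the scenario concrete, here is a minimal sketch of the check that a per-source duration would allow. The `source_durations` mapping and the `channels_covering` helper are hypothetical and not part of Lhotse, and the numbers are made up for illustration.

```python
def channels_covering(source_durations, start, duration, tolerance=0.0):
    """Return the channels whose audio covers [start, start + duration].

    source_durations: hypothetical mapping of channel id -> duration in seconds.
    tolerance: seconds of missing audio (i.e. zero padding) we are willing to accept.
    """
    end = start + duration
    return [
        channel
        for channel, source_duration in sorted(source_durations.items())
        if source_duration + tolerance >= end
    ]


# Made-up durations: channel 2 is a few seconds shorter than the others.
source_durations = {0: 600.0, 1: 599.2, 2: 595.5}

# An utterance close to the end of the meeting (594.0 s to 598.0 s).
print(channels_covering(source_durations, start=594.0, duration=4.0))  # [0, 1]
```

With such a list, a random channel could be drawn only from the channels that actually contain the utterance, instead of discarding the supervision.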

desh2608 · 2023-05-23T14:45:10Z

desh2608
May 23, 2023
Collaborator

I usually deal with recording source issues at manifest creation time. For example, you can check fix_manifests() in lhotse/qa.py.

5 replies

popcornell May 23, 2023
Author

Another workaround is to take the minimum over all AudioSources as the Recording duration. That is actually what I am doing right now.
Maybe it is worth specifying that in my case this issue arises because I sample random channels during training and also sample random windows inside the whole recording. I need both because the application and method are quite peculiar, you know what I am talking about ;).
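
A minimal sketch of this workaround, assuming the per-source sample counts have already been measured somehow (for example, by reading the file headers at manifest creation time). `SourceInfo`, `common_num_samples` and `common_duration` are hypothetical names used only for illustration; they are not Lhotse classes or functions.

```python
from dataclasses import dataclass


@dataclass
class SourceInfo:
    """Hypothetical per-source metadata, measured outside of Lhotse."""
    channel: int
    num_samples: int


def common_num_samples(sources):
    """Number of samples that every source is guaranteed to have."""
    return min(source.num_samples for source in sources)


def common_duration(sources, sampling_rate):
    """Duration in seconds that every source is guaranteed to cover."""
    return common_num_samples(sources) / sampling_rate


# Made-up example: three devices at 16 kHz with slightly different lengths.
sources = [
    SourceInfo(channel=0, num_samples=9_600_000),
    SourceInfo(channel=1, num_samples=9_587_200),
    SourceInfo(channel=2, num_samples=9_528_000),
]
print(common_num_samples(sources))       # 9528000
print(common_duration(sources, 16_000))  # 595.5
```

Using these values as the recording-level duration and sample count means any window sampled inside the recording is covered by every channel, at the cost of dropping the tail that only the longer sources contain.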

pzelasko May 23, 2023
Maintainer

Why do you have different AudioSource durations? I'm wondering if you should treat each audio source as a separate recording and use cuts to work with them.

popcornell May 23, 2023
Author

CHiME-6, DiPCo and other datasets consist of full meetings recorded by multiple devices. The recordings can have different durations due to clock drift and packet loss. If you treat each AudioSource as a separate recording, you "lose" multi-channel processing capability.

pzelasko May 23, 2023
Maintainer

Ahhh, I wasn't aware of that. But do you expect these differences to be significant? I'd expect we can set Recording.duration/num_samples to the minimum duration across all AudioSources, and then, if the differences are not too large, Recording.load_audio() would just work anyway (because of the duration tolerance). If that's not the case, can you post a specific example of this issue?

popcornell May 23, 2023
Author

They can be several seconds.
Recording.load_audio() definitely works, and one way to avoid this is to load all channels. At least one of the channels will have the speech corresponding to the supervision; the others will be padded, as far as I understand.
The problem may arise when you randomly sample the channels (if you train a multi-channel model you usually sample a random number of channels, and if you are unlucky you end up sampling a subset that has only zero padding).

Maybe the best place to handle this is actually the dataset class: I can check whether the sampled AudioSource segment is all zeros and re-sample another AudioSource/channel.
If I had the duration I could have used that instead of loading the audio. But, frankly, this issue does not happen that often, as the meeting recordings are several minutes long.
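
A rough sketch of that dataset-side check, in plain Python. `load_channel` is a hypothetical callable that returns the samples of the current segment for a given channel; inside a real dataset class it would wrap whatever audio loading is already in place. The sketch assumes that a segment consisting of exact zeros can only come from padding, since real recordings contain at least some noise.

```python
import random


def is_all_zeros(samples):
    """True if every sample is exactly zero, which we take to mean padding."""
    return not any(sample != 0 for sample in samples)


def sample_nonsilent_channel(load_channel, channels, rng=random):
    """Try the channels in random order and return the first usable one.

    load_channel: hypothetical callable, channel id -> sequence of samples.
    Returns a (channel, samples) pair; raises ValueError if every channel
    is zero-padded for this segment.
    """
    candidates = list(channels)
    rng.shuffle(candidates)
    for channel in candidates:
        samples = load_channel(channel)
        if not is_all_zeros(samples):
            return channel, samples
    raise ValueError("all candidate channels are zero-padded for this segment")


# Made-up segment: only channel 1 still has audio at this point in the meeting.
fake_segment = {0: [0.0, 0.0, 0.0], 1: [0.1, -0.2, 0.05], 2: [0.0, 0.0, 0.0]}
channel, samples = sample_nonsilent_channel(fake_segment.__getitem__, [0, 1, 2])
print(channel)  # 1
```

The drawback, as noted above, is that each rejected channel costs one audio load; a stored per-source duration would let the same decision be made from metadata alone.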
