-
Try something like the following (you might need to develop it further, as I did not test it; it's just a starting point):

```python
import random
from typing import Optional

from lhotse import SupervisionSegment
from lhotse.cut import Cut
from lhotse.utils import Seconds, compute_num_samples, overspans, TimeSpan, ifnone


def sample_alignment_segment(
    cut: Cut,
    min_duration: Seconds,
    max_duration: Seconds,
    seed: Optional[int] = None,
) -> Cut:
    """
    Given a cut that has possibly multiple supervisions with alignments,
    create a sub-cut with a single supervision that may combine text from several supervisions.
    We use the word-level alignment to determine the output cut's transcript.
    The output cut's duration is sampled uniformly between ``min_duration`` and ``max_duration``.

    Example usage::

        >>> cuts = CutSet.from_file(...)  # long cuts with multiple supervisions
        >>> segment_cuts = cuts.repeat().map(sample_alignment_segment)  # infinite cut set of segments
    """
    assert all(
        s.alignment is not None and "word" in s.alignment for s in cut.supervisions
    ), "We require that every supervision has a word-level alignment available."
    if cut.duration < min_duration:
        return cut
    rng = random if seed is None else random.Random(seed)

    def _quantize(dur: Seconds) -> Seconds:
        # Avoid potential numerical issues later on
        num_samples = compute_num_samples(dur, cut.sampling_rate)
        return num_samples / cut.sampling_rate

    start = _quantize(rng.random() * (cut.duration - min_duration))
    duration = rng.uniform(min_duration, max_duration)
    alignment_items = [
        ai
        for s in cut.supervisions
        for ai in ifnone(s.alignment, {}).get("word", ())
        if overspans(TimeSpan(start, start + duration), ai)
    ]
    supervision = SupervisionSegment(
        id=cut.id,
        recording_id=cut.recording_id,
        start=0,
        duration=duration,
        text=" ".join(ai.symbol for ai in alignment_items),
    )
    new_cut = cut.truncate(offset=start, duration=duration)
    new_cut.supervisions = [supervision]
    return new_cut
```
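As an aside, the `_quantize` helper above snaps a duration to a whole number of audio samples so that later seconds-to-samples conversions cannot drift. A self-contained sketch of the same idea (using `round` as a simplified stand-in for Lhotse's `compute_num_samples`):

```python
def quantize(dur: float, sampling_rate: int) -> float:
    # Snap a duration to a whole number of samples, so later
    # seconds <-> samples conversions agree exactly.
    # `round` is a simplified stand-in for lhotse's compute_num_samples.
    num_samples = round(dur * sampling_rate)
    return num_samples / sampling_rate

print(quantize(1.00003, 16000))  # -> 1.0 (snapped onto the 1/16000 s grid)
```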
-
@pzelasko's code works perfectly (with a minor addition to ensure we don't exceed the cut's original duration):

```python
# ensure that cut end time is not exceeded
duration = min(duration, cut.duration - start)
```

Follow-up question: I created a dynamic CutSet using the above function as follows:

```python
cuts = load_manifest_lazy(self.args.manifest_dir / "cuts_train_full.jsonl.gz")
fn = partial(
    sample_alignment_segment,
    min_duration=min_duration,
    max_duration=max_duration,
)
return cuts.repeat().map(fn)
```

If I use this new CutSet in icefall-based DDP training, would it automatically generate a different random batch for every worker and every epoch?
-
Consider an ASR task on a dataset such as TedLium, which contains long recordings (e.g. 15 minutes). Usually, we train and test on segments obtained from an oracle VAD.
Now, suppose we do not have access to oracle VAD at inference time, i.e., we want to perform a long-form ASR task. One way to do this is to use the trained model to perform overlapping inference (say 30s chunks, like Whisper). The 30s chunk may contain both speech and silences, which creates a train-test mismatch since the model was trained only on speech segments.
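For reference, the overlapping inference mentioned above amounts to sliding a fixed-size window whose hop is smaller than the window, so consecutive chunks overlap. A toy sketch (the 30 s chunk and 5 s overlap are placeholder values, not Whisper's exact scheme):

```python
def chunk_spans(total_duration: float, chunk: float = 30.0, overlap: float = 5.0):
    """Yield (start, end) spans that cover a long recording with overlapping chunks."""
    hop = chunk - overlap
    start = 0.0
    while start < total_duration:
        yield (start, min(start + chunk, total_duration))
        if start + chunk >= total_duration:
            break  # the last chunk already reaches the end of the recording
        start += hop

spans = list(chunk_spans(70.0))  # -> [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```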
To alleviate this, we can do the following. We have the full train recordings along with segments, and also have word-level alignments for the segments. Instead of training on the segments, we can dynamically create segments using the alignments during training, which could span multiple segments from the original segmentation. These dynamic segments would be ideally created on-the-fly during training.
I am wondering what would be the best way to achieve this in Lhotse?