-
Try something like the following (you might need to develop it further, as I did not test it; it's just a starting point):

```python
import random
from typing import Optional

from lhotse import SupervisionSegment
from lhotse.cut import Cut
from lhotse.utils import Seconds, compute_num_samples, overspans, TimeSpan, ifnone


def sample_alignment_segment(
    cut: Cut,
    min_duration: Seconds,
    max_duration: Seconds,
    seed: Optional[int] = None,
) -> Cut:
    """
    Given a cut that has possibly multiple supervisions with alignments,
    create a sub-cut with a single supervision that may combine text from several supervisions.
    We use the word-level alignment to determine the output cut's transcript.
    The output cut's duration is sampled uniformly between ``min_duration`` and ``max_duration``.

    Example usage::

        >>> cuts = CutSet.from_file(...)  # long cuts with multiple supervisions
        >>> segment_cuts = cuts.repeat().map(sample_alignment_segment)  # infinite cut set of segments
    """
    assert all(
        s.alignment is not None and "word" in s.alignment for s in cut.supervisions
    ), "We require that every supervision has a word-level alignment available."
    if cut.duration < min_duration:
        return cut
    rng = random if seed is None else random.Random(seed)

    def _quantize(dur: Seconds) -> Seconds:
        # Avoid potential numerical issues later on
        num_samples = compute_num_samples(dur, cut.sampling_rate)
        return num_samples / cut.sampling_rate

    start = _quantize(rng.random() * (cut.duration - min_duration))
    duration = rng.uniform(min_duration, max_duration)
    alignment_items = [
        ai
        for s in cut.supervisions
        for ai in ifnone(s.alignment, {}).get("word", ())
        if overspans(TimeSpan(start, start + duration), ai)
    ]
    supervision = SupervisionSegment(
        id=cut.id,
        recording_id=cut.recording_id,
        start=0,
        duration=duration,
        text=" ".join(ai.symbol for ai in alignment_items),
    )
    new_cut = cut.truncate(offset=start, duration=duration)
    new_cut.supervisions = [supervision]
    return new_cut
```
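As an aside, the `_quantize` helper above snaps a duration to a whole number of audio samples so that later seconds-to-samples conversions cannot drift. A self-contained sketch of the same idea (using `round` as a simplified stand-in for Lhotse's `compute_num_samples`):

```python
def quantize(dur: float, sampling_rate: int) -> float:
    # Snap a duration to a whole number of samples, so later
    # seconds <-> samples conversions agree exactly.
    # `round` is a simplified stand-in for lhotse's compute_num_samples.
    num_samples = round(dur * sampling_rate)
    return num_samples / sampling_rate

print(quantize(1.00003, 16000))  # -> 1.0 (snapped onto the 1/16000 s grid)
```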
-
@pzelasko's code works perfectly (with a minor addition to ensure we don't exceed the cut's original duration):

```python
# ensure that cut end time is not exceeded
duration = min(duration, cut.duration - start)
```

Follow-up question: I created a dynamic CutSet using the above function as follows:

```python
cuts = load_manifest_lazy(self.args.manifest_dir / "cuts_train_full.jsonl.gz")
fn = partial(
    sample_alignment_segment,
    min_duration=min_duration,
    max_duration=max_duration,
)
return cuts.repeat().map(fn)
```

If I use this new CutSet in icefall-based DDP training, would it automatically generate a different random batch for every worker and every epoch?
-
Consider an ASR task on a dataset such as TedLium, which contains long recordings (e.g. 15 minutes). Usually, we train and test on segments obtained from an oracle VAD.
Now, suppose we do not have access to oracle VAD at inference time, i.e., we want to perform a long-form ASR task. One way to do this is to use the trained model to perform overlapping inference (say 30s chunks, like Whisper). The 30s chunk may contain both speech and silences, which creates a train-test mismatch since the model was trained only on speech segments.
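For reference, the overlapping inference mentioned above amounts to sliding a fixed-size window whose hop is smaller than the window, so consecutive chunks overlap. A toy sketch (the 30 s chunk and 5 s overlap are placeholder values, not Whisper's exact scheme):

```python
def chunk_spans(total_duration: float, chunk: float = 30.0, overlap: float = 5.0):
    """Yield (start, end) spans that cover a long recording with overlapping chunks."""
    hop = chunk - overlap
    start = 0.0
    while start < total_duration:
        yield (start, min(start + chunk, total_duration))
        if start + chunk >= total_duration:
            break  # the last chunk already reaches the end of the recording
        start += hop

spans = list(chunk_spans(70.0))  # -> [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```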
To alleviate this, we can do the following. We have the full train recordings along with segments, and also have word-level alignments for the segments. Instead of training on the segments, we can dynamically create segments using the alignments during training, which could span multiple segments from the original segmentation. These dynamic segments would be ideally created on-the-fly during training.
I am wondering what would be the best way to achieve this in Lhotse?