Utterance stacker algorithm #1446

Open

johnwdubois opened this issue Apr 18, 2024 · 0 comments

johnwdubois commented Apr 18, 2024

The utterance stacker produces some bad utterances. In some cases they are way too long (see #1431).

Proposal

Keep the current utterance concatenation rule (which concatenates successive units by the same speaker into one utterance), with the following exceptions:

Follow the concatenation rule only while all units in the sequence are verbal (unitType = verbal), NOT when they are non-verbal (see below)
Concatenate only while gapUnits < 6 (otherwise, start a new utterance)

Classify units as {verbal, laugh, pause, vocalism, annotation, other}.

If a unit contains at least one word (kind = word), then unitType = verbal
Else, if it contains a laugh, then unitType = laugh
Else, if it contains a pause or in-breath (or both), then unitType = pause
Else, if it contains a vocalism, then unitType = vocalism
Else, if it contains ONLY annotation (e.g. transcriber's comments, glosses, etc.), then unitType = annotation
Else, unitType = other
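
The cascade above could be sketched as a single function. This is only a minimal sketch, assuming each unit is available as a list of token kind strings. Only `word` is named in this proposal; the other kind names (`laugh`, `pause`, `inbreath`, `vocalism`, `annotation`) and the function name `classify_unit` are hypothetical and may not match the corpus.

```python
def classify_unit(kinds):
    """Return the unitType for a unit, given the kinds of its tokens.

    `kinds` is a list of strings. Only "word" comes from the proposal;
    the other kind names are assumed labels for illustration.
    """
    present = set(kinds)
    if "word" in present:
        return "verbal"
    if "laugh" in present:
        return "laugh"
    if present & {"pause", "inbreath"}:
        return "pause"
    if "vocalism" in present:
        return "vocalism"
    # "ONLY annotation": non-empty and nothing but annotation tokens.
    if present == {"annotation"}:
        return "annotation"
    return "other"
```

Because the tests run in order, a unit containing both a word and a laugh comes out as verbal, and an empty unit falls through to other.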

Assign utteranceType based on the unitType:

if all units in an utterance are verbal (unitType = verbal), then utteranceType = verbal

If a unit is non-verbal (unitType != verbal), then

If the next unit by the same participant has a unitType matching the current utteranceType, and gapUnits = 0, then extend the utterance to include it, and assign utteranceType to be the same as its component unitType value(s)
If the next unit has a different unitType, end the utterance, and assign utteranceType to be the same as its component unitType value(s)
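
One possible reading of the stacking rules, as a rough sketch rather than the actual stacker code. It assumes each unit already carries its unitType and a `gap_units` value measuring the gap since the same participant's previous unit; this issue does not define gapUnits, so that reading is an assumption. The names `Unit`, `Utterance`, `can_extend`, and `stack_utterances` are hypothetical.

```python
from dataclasses import dataclass, field

MAX_VERBAL_GAP = 6  # verbal units join only while gapUnits < 6


@dataclass
class Unit:
    participant: str
    unit_type: str   # verbal, laugh, pause, vocalism, annotation, other
    gap_units: int   # assumed: gap since this participant's previous unit


@dataclass
class Utterance:
    participant: str
    utterance_type: str
    units: list = field(default_factory=list)


def can_extend(utterance, unit):
    """Decide whether `unit` may join the participant's open utterance."""
    if unit.unit_type != utterance.utterance_type:
        return False
    if unit.unit_type == "verbal":
        return unit.gap_units < MAX_VERBAL_GAP
    # Non-verbal units join only when directly adjacent (gapUnits = 0).
    return unit.gap_units == 0


def stack_utterances(units):
    """Group units into utterances, ordered by each utterance's first unit."""
    utterances = []
    open_by_participant = {}
    for unit in units:
        current = open_by_participant.get(unit.participant)
        if current is None or not can_extend(current, unit):
            # Start a new utterance; its type is that of its first unit.
            current = Utterance(unit.participant, unit.unit_type)
            utterances.append(current)
            open_by_participant[unit.participant] = current
        current.units.append(unit)
    return utterances
```

Since a unit only joins an utterance of its own type, every utterance ends up with a single utteranceType equal to its component unitType.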

If the corpus transcription data lacks annotation for one or more of the features referenced above, create an automatic classification algorithm that achieves the same effect. For example, replace:

kind=word => algorithm that tests for the presence/absence of alphabetic characters (or other strategy, depending on the language)
kind=laugh => algorithm that tests for the presence of @ sign, or another user-specified symbol for laughter
etc.
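
A minimal sketch of such fallback tests, assuming tokens are plain strings. The function name `guess_kind` and the `laugh_symbol` parameter are hypothetical, and the alphabetic-character test is only one strategy that may not suit every language.

```python
def guess_kind(token, laugh_symbol="@"):
    """Guess a token's kind from its surface form when `kind` is missing.

    Returns "word", "laugh", or None when neither heuristic applies.
    """
    # str.isalpha() is Unicode-aware, so letters outside ASCII also count.
    if any(ch.isalpha() for ch in token):
        return "word"
    # The laughter symbol is user-specified; "@" is the default here.
    if laugh_symbol in token:
        return "laugh"
    return None
```

The word test runs first, so a token mixing letters with the laughter symbol is treated as a word, in line with the precedence of verbal over laugh in the classification above.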

johnwdubois added the enhancement label

This was referenced Apr 18, 2024

Utterance stacker: Quick improvements #1431

Open

Utterance (a.k.a. prosodic sentence) as unit or stack #214

Open
