Utterance stacker algorithm #1446

Open
johnwdubois opened this issue Apr 18, 2024 · 0 comments
Labels
enhancement New feature or request

The utterance stacker produces some bad utterances. In some cases they are far too long (see #1431).

Proposal

  1. Keep the current utterance concatenation rule (which concatenates successive units by the same speaker into one utterance), with the following exceptions:
  • Apply the concatenation rule only while every unit in the sequence is verbal (unitType = verbal). Non-verbal units are handled separately in step 4.
  • Apply it only while gapUnits < 6. Otherwise, start a new utterance.
  2. Classify units as {verbal, laugh, pause, vocalism, annotation, other}, using the first rule that matches:
  • If a unit contains at least one word (kind = word), then unitType = verbal
  • Else, if it contains a laugh, then unitType = laugh
  • Else, if it contains a pause or in-breath (or both), then unitType = pause
  • Else, if it contains a vocalism, then unitType = vocalism
  • Else, if it contains ONLY annotation (e.g. transcriber's comments, glosses, etc.), then unitType = annotation
  • Else, unitType = other
  3. Assign utteranceType based on the unitType:
  • If all units in an utterance are verbal (unitType = verbal), then utteranceType = verbal
  4. If a unit is non-verbal (unitType != verbal), then:
  • If the next unit by the same participant has the same unitType, and gapUnits = 0, then extend the utterance to include it, and assign utteranceType to be the same as its component unitType value(s)
  • If the next unit has a different unitType, end the utterance, and assign utteranceType to be the same as its component unitType value(s)
  5. If the corpus transcription data lacks annotation for one or more of the features referenced above, create an automatic classification algorithm that achieves the same effect. For example, replace:
  • kind=word => algorithm that tests for the presence/absence of alphabetic characters (or other strategy, depending on the language)
  • kind=laugh => algorithm that tests for the presence of @ sign, or another user-specified symbol for laughter
  • etc.
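
To make the proposal concrete, here is a minimal Python sketch of the unit classification, the concatenation and extension rules, and the text-based fallback for laughter and words. It is not the project's implementation. The `Unit` class, its `kinds` and `gap_units` fields, the kind labels (`"inbreath"` etc.), and all function names are hypothetical. The sketch takes `gap_units` as given on each unit, since the proposal does not say how gapUnits is computed. It also assumes that a non-verbal unit of the same type with gapUnits > 0 starts a new utterance, a case the proposal leaves open.

```python
from dataclasses import dataclass, field

MAX_GAP_UNITS = 6  # verbal units concatenate only while gap_units < 6

# Checked in order, so a unit with both a word and a laugh counts as verbal.
KIND_PRIORITY = [
    ("word", "verbal"),
    ("laugh", "laugh"),
    ("pause", "pause"),
    ("inbreath", "pause"),   # in-breaths are grouped with pauses
    ("vocalism", "vocalism"),
]


@dataclass
class Unit:
    speaker: str
    kinds: list = field(default_factory=list)  # hypothetical: one kind label per token
    gap_units: int = 0                         # hypothetical: taken as given, not computed here


def guess_kind(token, laugh_symbol="@"):
    """Fallback for corpora without kind annotation (assumed heuristic)."""
    if any(ch.isalpha() for ch in token):
        return "word"
    if laugh_symbol in token:
        return "laugh"
    return "other"


def classify_unit(unit):
    """Return the unitType of a unit from the kinds of its tokens."""
    for kind, unit_type in KIND_PRIORITY:
        if kind in unit.kinds:
            return unit_type
    if unit.kinds and all(kind == "annotation" for kind in unit.kinds):
        return "annotation"
    return "other"


def stack_utterances(units):
    """Group units into utterances whose units all share one unitType."""
    utterances = []   # each: {"speaker": ..., "type": ..., "units": [...]}
    latest = {}       # speaker -> that speaker's most recent utterance
    for unit in units:
        unit_type = classify_unit(unit)
        if unit_type == "verbal":
            gap_ok = unit.gap_units < MAX_GAP_UNITS
        else:
            gap_ok = unit.gap_units == 0  # non-verbal units extend only with no gap
        current = latest.get(unit.speaker)
        if current is not None and current["type"] == unit_type and gap_ok:
            current["units"].append(unit)
        else:
            current = {"speaker": unit.speaker, "type": unit_type, "units": [unit]}
            utterances.append(current)
            latest[unit.speaker] = current
    return utterances
```

Because an utterance only ever absorbs units of its own type, each utterance's `type` doubles as its utteranceType. Tracking the most recent utterance per speaker (rather than only the last utterance overall) is a design choice of this sketch, so that the gapUnits threshold, not an intervening unit by another speaker, decides where a verbal utterance ends.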