Utterance stacker: Quick improvements #1431

Open
kayaulai opened this issue Jun 13, 2023 · 3 comments

@kayaulai
Collaborator

kayaulai commented Jun 13, 2023

  • Currently, lines with no actual words are all put in one gigantic Utterance. I suggest simply removing this Utterance.
  • Very large gapUnits, say above 5, should be disallowed. Currently, the Utterance stacker produces some ridiculous gapUnits, which are difficult to detect without looking at the gapUnits values themselves (a naive annotator might just assume those lines got assigned to different utterances).
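
A minimal sketch of the two suggestions above, in Python for illustration only (it is not the stacker's actual code). The record layout is an assumption: each utterance is a dict with a `units` list, and each unit carries hypothetical `has_word` and `gap_units` fields. Wordless utterances are dropped, and any utterance containing a gapUnits value above 5 is flagged for review rather than silently accepted.

```python
MAX_GAP_UNITS = 5  # threshold suggested above ("say above 5")


def audit_utterances(utterances, max_gap=MAX_GAP_UNITS):
    """Drop wordless utterances; separate the rest into kept and flagged.

    `utterances` is a list of dicts with a "units" list. Each unit is a dict
    with hypothetical keys "has_word" (bool) and "gap_units" (int).
    """
    kept, flagged = [], []
    for utt in utterances:
        units = utt["units"]
        if not any(u["has_word"] for u in units):
            continue  # no actual words anywhere in the utterance: remove it
        if any(u["gap_units"] > max_gap for u in units):
            flagged.append(utt)  # suspiciously large gap: needs review
        else:
            kept.append(utt)
    return kept, flagged
```

Returning the flagged utterances separately keeps the oversized gaps visible, so an annotator does not have to inspect gapUnits by hand to find them.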

JWD: See detailed comments below.

@kayaulai kayaulai added the enhancement New feature or request label Jun 13, 2023
@kayaulai kayaulai self-assigned this Jun 13, 2023
@johnwdubois johnwdubois added this to To do in Core via automation Jun 13, 2023
@johnwdubois johnwdubois added polish This issue is about a small item of polishing, mainly aesthetics. and removed enhancement New feature or request labels Jun 13, 2023
@johnwdubois johnwdubois moved this from To do to In progress in Core Jun 13, 2023
@johnwdubois
Owner

johnwdubois commented Jun 15, 2023

My suggestion: (see #1446 )

  1. Keep the current utterance concatenation rule (which concatenates successive units by the same speaker into one utterance), with the following exceptions:
  • follow the concatenation rule only while all units in the sequence are verbal (unitType = verbal), NOT when they are non-verbal (see step 4)
  • require gapUnits < 6 (otherwise, start a new utterance)
  2. Classify units as {verbal, laugh, pause, vocalism, annotation, other}.
  • If a unit contains at least one word (kind = word), then unitType = verbal
  • Else, if it contains a laugh, then unitType = laugh
  • Else, if it contains a pause or in-breath (or both), then unitType = pause
  • Else, if it contains a vocalism, then unitType = vocalism
  • Else, if it contains ONLY annotation (e.g. transcriber's comments, glosses, etc.), then unitType = annotation
  • Else, unitType = other
  3. Assign utteranceType based on the unitType:
  • if all units in an utterance are verbal (unitType = verbal), then utteranceType = verbal
  4. If a unit is nonverbal (unitType != verbal), then
  • if the next unit by the same participant has the same unitType and gapUnits = 0, extend the utterance to include it, and set utteranceType to the unitType shared by its component units
  • if the next unit has a different unitType, end the utterance, and set utteranceType to the unitType shared by its component units
    (see Utterance stacker algorithm #1446)
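
The four steps above can be sketched roughly as follows. This is one possible reading of the proposal in Python, not the stacker's actual implementation. The `Unit` and `Utterance` classes, the `kinds` set of token kinds, and the per-unit `gap_units` field are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

MAX_VERBAL_GAP = 6  # verbal units are joined only while gapUnits < 6


@dataclass
class Unit:
    speaker: str
    kinds: set          # hypothetical: kinds of the tokens in this unit
    gap_units: int = 0  # hypothetical: gapUnits value for this unit


@dataclass
class Utterance:
    speaker: str
    utterance_type: str
    units: list = field(default_factory=list)


def classify_unit(kinds):
    """Step 2: the first matching rule decides the unitType."""
    if "word" in kinds:
        return "verbal"
    if "laugh" in kinds:
        return "laugh"
    if "pause" in kinds or "inbreath" in kinds:
        return "pause"
    if "vocalism" in kinds:
        return "vocalism"
    if kinds and kinds <= {"annotation"}:
        return "annotation"  # ONLY annotation, nothing else
    return "other"


def stack_utterances(units):
    """Steps 1, 3, 4: group units into utterances of a single type."""
    utterances = []
    open_by_speaker = {}  # most recent utterance for each speaker
    for unit in units:
        unit_type = classify_unit(unit.kinds)
        current = open_by_speaker.get(unit.speaker)
        if unit_type == "verbal":
            gap_ok = unit.gap_units < MAX_VERBAL_GAP
        else:
            gap_ok = unit.gap_units == 0  # nonverbal: adjacent units only
        if current is not None and current.utterance_type == unit_type and gap_ok:
            current.units.append(unit)
        else:
            current = Utterance(unit.speaker, unit_type, [unit])
            utterances.append(current)
            open_by_speaker[unit.speaker] = current
    return utterances
```

Because an utterance is extended only by units of the same type, utteranceType always equals the unitType of its members, which covers step 3 without a separate pass.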

@johnwdubois johnwdubois added enhancement New feature or request and removed polish This issue is about a small item of polishing, mainly aesthetics. labels Jun 15, 2023
@johnwdubois johnwdubois moved this from In progress to To do in Core Jun 19, 2023
@kayaulai
Collaborator Author

I'm uncertain about using kind = word, because I fear that will make the stacker too SBC-specific.

@johnwdubois
Owner

johnwdubois commented Jul 2, 2023

Point taken. Still, reference to "kind = word" is just one way to describe the algorithm/pseudocode.
The same effect can be gotten by writing a little routine that does the same thing (presumably with a higher error rate, but all you really need is to recognize one word per IU to get the main benefit).
(see #1446)
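
As a rough illustration of such a routine (a sketch in Python, not anything taken from the codebase), the heuristic below treats a token as a word if it still contains a letter once parenthesized material is stripped. It assumes that non-word material such as breaths, timed pauses, and comments is written in parentheses and that laughter uses non-letter symbols. Those are assumptions about the transcription conventions, and other corpora may need different rules.

```python
import re

# Assumption: non-word material is wrapped in (...) or ((...)).
BRACKETED = re.compile(r"\(+[^()]*\)+")


def looks_like_word(token):
    """True if the token has a letter left after removing bracketed spans."""
    stripped = BRACKETED.sub("", token)
    return any(ch.isalpha() for ch in stripped)


def unit_has_word(tokens):
    """One recognizable word is enough to treat the unit as verbal."""
    return any(looks_like_word(t) for t in tokens)
```

For example, `unit_has_word(["(H)", "..", "okay"])` is true because of `okay`, while `unit_has_word(["((COUGH))", "@"])` is false.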

@johnwdubois johnwdubois changed the title from "Quick improvements to Utterance stacker" to "Utterance stacker: Quick improvements" Apr 18, 2024