Fixes Issue #803 #804

aravindMahadevan · 2024-06-11T02:15:55Z

Modifies the _decode_asr method to support decoding user-defined tokens in Whisper-based models. Addresses the issue in #803.

xenova

Looks good! Just one thing:

src/tokenizers.js

Co-authored-by: Joshua Lochner <[email protected]>

aravindMahadevan · 2024-06-11T19:59:28Z

@xenova, I committed the fix. Let me know if anything else is needed!

avinashr175

lgtm!

xenova · 2024-06-13T12:28:07Z

Thanks! Will merge after tests pass. By the way, do you have an example of a Whisper model with such tokens? It might be good to add a test.

HuggingFaceDocBuilderDev · 2024-06-13T12:28:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

aravindMahadevan · 2024-06-13T18:20:40Z

Hi @xenova, there are two tests failing due to a few test models not having the <30.00> token. Here's one such example: https://huggingface.co/Xenova/whisper-small/resolve/output_attentions/tokenizer.json.

Do you have any suggestions on how to overcome this issue? This token does exist in all of the base whisper models.

xenova · 2024-06-13T22:15:59Z

We could probably use the time precision (0.02) to calculate the offset: 30/0.02 + 1 = 1501 tokens (50364 -> 51864). Another fix is to simply update the tokenizer.json.
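To make the arithmetic above concrete, here is a minimal sketch of the offset calculation. The constant and function names are hypothetical and are not taken from src/tokenizers.js; the numeric values are the multilingual ones quoted in this comment.

```javascript
// Illustrative constants (assumed names); values come from the comment above.
const TIME_PRECISION = 0.02;          // seconds covered by one timestamp step
const MAX_TIMESTAMP_SECONDS = 30.0;   // the final timestamp token represents 30.00
const TIMESTAMP_BEGIN_ID = 50364;     // id of the 0.00 token (multilingual Whisper)

// Number of steps between 0.00 and 30.00; Math.round guards against float error.
const timestampSteps = Math.round(MAX_TIMESTAMP_SECONDS / TIME_PRECISION); // 1500
// Counting both endpoints (0.00 and 30.00) gives one more token than steps.
const timestampTokenCount = timestampSteps + 1; // 1501
const TIMESTAMP_END_ID = TIMESTAMP_BEGIN_ID + timestampSteps; // 51864

// Hypothetical helper: map a token id to seconds, or null if it is not a timestamp token.
function timestampTokenToSeconds(tokenId) {
  if (tokenId < TIMESTAMP_BEGIN_ID || tokenId > TIMESTAMP_END_ID) {
    return null;
  }
  return (tokenId - TIMESTAMP_BEGIN_ID) * TIME_PRECISION;
}
```

With these values, any id above 51864 falls outside the timestamp range, which is where user-defined tokens would sit.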

Also, can you provide a model which does have added tokens after the final timestamps token?

aravindMahadevan · 2024-06-14T02:58:06Z

Hi @xenova, that was a good suggestion. I have updated the logic to the following:
const total_timestamp_tokens = (30.00 - 0.00) / 0.02
const timestamp_end = timestamp_begin + total_timestamp_tokens

This logic should work on both the English-only and the multilingual Whisper variants. The first timestamp token, 0.00, has token id 50363 in the English-only variants and 50364 in the multilingual variants. Likewise, the final timestamp token, 30.00, has token id 51863 in the English-only variants and 51864 in the multilingual variants. In both cases the final timestamp token sits at an offset of 1500 from the first one, which is exactly what total_timestamp_tokens evaluates to.
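As a rough illustration of the bounded check (a sketch only, not the code in this PR), the helpers below are hypothetical. They assume the upper bound is inclusive, since 30.00 is itself a timestamp token.

```javascript
// Hypothetical helper: compute the inclusive id range of timestamp tokens.
function getTimestampRange(timestampBegin, maxSeconds = 30.0, timePrecision = 0.02) {
  // Math.round guards against floating-point error in the division.
  const totalTimestampTokens = Math.round((maxSeconds - 0.0) / timePrecision);
  return { begin: timestampBegin, end: timestampBegin + totalTimestampTokens };
}

// Hypothetical helper: true only for ids inside the timestamp range.
// Ids above range.end (for example user-defined tokens) are not treated as timestamps.
function isTimestampToken(tokenId, range) {
  return tokenId >= range.begin && tokenId <= range.end;
}

// Token ids of the 0.00 token, as stated in the comment above.
const englishOnlyRange = getTimestampRange(50363);
const multilingualRange = getTimestampRange(50364);

console.log(englishOnlyRange.end);                       // 51863
console.log(multilingualRange.end);                      // 51864
console.log(isTimestampToken(51864, multilingualRange)); // true
console.log(isTimestampToken(51865, multilingualRange)); // false
```

An unbounded check such as tokenId >= range.begin would classify id 51865 as a timestamp, which is the situation the upper bound is meant to avoid.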

I do not have a model with added tokens that I can share publicly. I did find this model https://huggingface.co/oza75/whisper-bambara-asr-001 which has added tokens after the final timestamp but it's a special token.

support user defined tokens by bounding timestamp token if statement

58b8346

xenova requested changes Jun 11, 2024

Update src/tokenizers.js

7d0cdbf

Co-authored-by: Joshua Lochner <[email protected]>

avinashr175 approved these changes Jun 11, 2024

calculate timestamp_end instead of hardcoding

e6e7e93

Update tokenizers.js

03b2c40
