Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes Issue #803 #804

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

aravindMahadevan
Copy link

Modify the _decode_asr method to support decoding user defined token in Whisper based models. Addresses issue in #803

Copy link
Owner

@xenova xenova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just one thing:

src/tokenizers.js Outdated Show resolved Hide resolved
Co-authored-by: Joshua Lochner <[email protected]>
@aravindMahadevan
Copy link
Author

@xenova committed the fix, let me know if there is anything else needed!

Copy link

@avinashr175 avinashr175 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@xenova
Copy link
Owner

xenova commented Jun 13, 2024

Thanks! Will merge after tests pass. By the way, do you have an example of a whisper model which such tokens? Might be good to add a test.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@aravindMahadevan
Copy link
Author

Hi @xenova, there are two tests failing due to a few test models not having the <30.00> token. Here's one such example: https://huggingface.co/Xenova/whisper-small/resolve/output_attentions/tokenizer.json.

Do you have any suggestions on how to overcome this issue? This token does exist in all of the base whisper models.

@xenova
Copy link
Owner

xenova commented Jun 13, 2024

We could probably use the time precision (0.02) to calculate the offset: 30/0.02 + 1 = 1501 tokens (50364 -> 51864). Another fix is to simply update the tokenizer.json.

Also, can you provide a model which does have added tokens after the final timestamps token?

@aravindMahadevan
Copy link
Author

Hi @xenova , that was a good suggestion and I have updated the logic to the following: const timestamp_end = timestamp_begin + total_timestamp_tokens
const total_timestamp_tokens = (30.00 - 0.00) / 0.02

This logic should work on both English only Whisper and multilingual Whisper variants. The beginning timestamp of 0.00 is equal to token id 50363 and 50364 in the English only Whisper variants and Multilingual Whisper variants respectively. Similarly, the final timestamp token id of 30.00 is equal to 51863 and 51864 in the English only Whisper variants and Multilingual Whisper variants respectively as well. In both cases, the final timestamp token id occurs at an offset of 1500 which is what total_timestamp_tokens evaluates to.

I do not have a model with added tokens that I can share publicly. I did find this model https://huggingface.co/oza75/whisper-bambara-asr-001 which has added tokens after the final timestamp but it's a special token.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants