Initial transcription support #494
Sounds as though this PR should be marked WIP? And... very exciting!!!! |
I've added the WIP flag, but this is misleading since I'm not currently able to work on it much. With the limited time I get, I'd rather focus on improving transcriptions and maybe preparing a POC with search, which was my initial goal. The current code needs some love to improve the looks (a little HTML, some CSS, and maybe JavaScript to set the correct time in a web player). It seems like a perfect opportunity for someone to pick up a nice task. I'd be more than happy to "mentor" as much as I can. |
Just a thought: do we want this on a separate page, or should it ideally be embedded in the episode page? Embedding would also be better for the JS interaction. I was thinking maybe tabs: "show notes" and "transcript". |
Since it has not received much attention, I've picked it up. For now, setting playback time based on a timestamp is not possible, but the folks at Podverse will soon add it: podverse/podverse-web#1071 (reply in thread) 💪 |
Over the weekend I gave it a run and uploaded fresh transcripts (90ish) for different episodes: As mentioned above, it is currently impossible to set the current time in the Podverse online player (well, unless we proxy the Podverse player on the same domain, but this would open a can of worms). Transcripts are imperfect but are easily editable by users, even using an online GitHub client 👍 For the newest ones I'll definitely want to run the large.en model. @gerbrent @ChanceM @pagdot @noblepayne (people mentioned in other issues) please provide feedback 😄 |
Can you also upload the code to run the transcriptions? Could imagine it also being in another repository, but it would allow others to also work on it or just be inspired :) |
The sources are available in my repository, Jupiter search. I still have a lot of work to do there, but on an x86 machine it should be as simple as downloading a model and running inference using the Docker image. |
I definitely agree with @ChanceM (src):
for a final solution we should have some type of separate page or tabbed area. For now though, I think that it just being inline is fine for an initial PoC. @FlakM, once this is merged (or even before then) could you collect a list of enhancements that we could do for transcriptions? Maybe this one we'll consider as closing #301 and we have another one for enhancing the transcription experience. Then we can reference the old issue in the new one, so anyone that wants to make that leap (from PoC -> enhanced) has that link. We can eventually break each of those tasks out in their own GH issues (to allow individuals to work on them separately), but for now I think just doing a single issue would be nice (till we do some spring cleaning on some of these issues 😅 ) |
Hello @elreydetoda, as I've mentioned in the Matrix (hehe), this work has been taken over by the JB crew and, if I'm not mistaken, they have different ideas about how the transcripts are to be generated. If there is an actual decision to host transcripts on S3, not in a GitHub repo, then this MR should probably get closed and a new one should be created to ingest data from S3. As for further improvements, here is the list of my personal acceptance criteria that I would add:
|
Hello @FlakM 😁 Yep, no problem, I remember seeing it (thank you for the reminder 🙃). I just wanted my feedback about the longer-term goal to be added to the PR/issue about this feature.
I completely agree with all of these criteria/features! Whatever I can do to convey these points, I'll definitely try to get them all included (if I'm asked/consulted about this feature). I can't guarantee it'll happen (since in the end it's up to the JB team), but IMO 1, 2, & 5 (of the points you listed above) should be considered critical (even for an MVP) to ensure the transcript doesn't offend someone and reflect badly on JB. If they were included, anyone in the community could quickly fix a bad transcript, and that would fix it for everything/everyone. Honestly (just thinking out loud here), do you think a good alternative to an S3 bucket could be something like GH Pages to host just the raw text of the transcripts (in whatever format they need to be in)? Plus, IIRC, GH Pages already has some type of CDN in front of it. If it doesn't, since it's just text, we could put Cloudflare in front of it too. |
Hi! So this is the initial MR for getting the ball rolling on incorporating the transcriptions created for issue #301. The idea is that each transcription should be a plain JSON file, and it should be displayed only on the pages where the relevant transcription is already present.
This is in no way finished code, just an initial setup; maybe someone will have an easier time picking it up now 👍
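To make the "plain JSON file, shown only where it exists" idea a bit more concrete, here is a rough Python sketch. It is not the code from this MR: the directory layout (`<root>/<show>/<episode>.json`), the function names, and the segment fields (`start`, `end`, `text`) are all assumptions picked for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def load_transcript(root: Path, show: str, episode: int) -> Optional[list]:
    """Return the transcript segments, or None when no transcript file exists.

    Assumed (hypothetical) layout: <root>/<show>/<episode>.json holding a
    list of {"start": seconds, "end": seconds, "text": str} objects.
    """
    path = root / show / f"{episode}.json"
    if not path.is_file():
        # No transcript yet: the episode page simply skips the section.
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def format_timestamp(seconds: float) -> str:
    """Format a second offset as HH:MM:SS, dropping fractional seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_lines(segments: list) -> list:
    """Turn segments into '[HH:MM:SS] text' lines for display."""
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    ]
```

A page generator could call `load_transcript` per episode and render the transcript block only when the result is not `None`. Keeping the start offset on every segment leaves room for jumping the player to a timestamp once a player supports it.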
Features I'd like to see:
Unfortunately, Whisper is currently cutting the sentences strangely; this should be fixed sometime in the future.
I'd be happy to rerun them then and backport modifications.