Realtime with direct audio input on Windows #214

RndyP · 2022-12-02T02:46:22Z

RndyP
Dec 2, 2022

Great work here, the accuracy is unbelievable! I would like to get realtime support and will be following progress here closely. I am familiar with the Win32 waveform audio (waveIn) APIs and want to stream direct audio input in real time. Basically, I will call ::waveInOpen, ::waveInPrepareHeader, ::waveInAddBuffer, etc. to fill buffers continuously and pass them on to Whisper as they arrive. Obviously I need to bypass all the WAV file handling and hook in somewhere after it. I then need to parse the text coming out of an output buffer.

I did read some of the discussion here about the "chunking" issues with realtime streaming, and I am not up to speed enough to comment at this point. From the code, it appears there are issues with how the stream is pieced together (what if a chunk boundary falls in the middle of a word?) and with the processing algorithm, which seems to be tuned to 30-second chunks. So perhaps I am "early to the party" with regard to realtime streaming.
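For anyone following along, here is a minimal sketch of the hand-off step between the capture buffers and Whisper. It assumes the waveIn device was opened as 16-bit mono PCM at 16 kHz, and relies on whisper_full() taking 32-bit float samples. The helper name append_pcm16_as_float is hypothetical and not part of whisper.cpp or Win32; the capture callback itself is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of whisper.cpp or Win32).
// Assumption: `pcm` points at the data of a filled capture buffer that holds
// signed 16-bit mono samples. Each sample is scaled into [-1, 1) and appended
// to `out`, which accumulates audio until there is enough for one chunk.
void append_pcm16_as_float(const int16_t* pcm, std::size_t n,
                           std::vector<float>& out) {
    out.reserve(out.size() + n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<float>(pcm[i]) / 32768.0f);
    }
}
```

Once `out` holds a full chunk, its data() and size() could be passed to whisper_full() in place of the samples that the examples read from a WAV file.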

ggerganov · 2022-12-02T18:26:48Z

ggerganov
Dec 2, 2022
Maintainer

Sounds like you want to achieve the same thing as in the stream example?
If yes, I think you just need to replace the SDL audio input with the "direct audio input" that you want to use on Windows.

1 reply

RndyP Dec 5, 2022
Author

I've replaced SDL with the Windows wave API. There are a couple of design issues here. First, the chunk processing seems to have a fixed floor time: about 1.5 seconds with the tiny model and about 3 seconds with base (on my i7-8550U). This means you can't chunk in real time with a chunk length of less than a couple of seconds, otherwise the system chokes on CPU usage. Second, the "steal segment" method (grab a fixed-length piece from the previous chunk) is sub-optimal in my opinion: if the stolen piece starts in the middle of a long word, that word is lost. I hate to dive into this, but I think the best solution is to pre-process the stream by identifying the gaps between words via thresholding, eliminating them, and passing the processed chunks to whisper_full(). This also has the benefit of relieving the heavy CPU load during long silences.
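A rough sketch of the thresholding idea, for illustration only. It splits the audio into fixed-size frames, measures each frame's RMS level, and drops frames below a threshold. The names frame_rms and drop_silence, and the frame length, threshold and hangover parameters, are all hypothetical and would need tuning per microphone. The hangover is an addition not mentioned above: it keeps a few frames after each loud one so that quiet word endings are not clipped.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square level of one frame of float samples.
float frame_rms(const float* x, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(x[i]) * x[i];
    }
    return n ? static_cast<float>(std::sqrt(sum / n)) : 0.0f;
}

// Hypothetical gap remover: keeps every frame whose RMS is at or above
// `threshold`, plus up to `hangover` frames that follow a loud frame.
// All other frames are dropped. Returns the concatenated kept samples.
std::vector<float> drop_silence(const std::vector<float>& in,
                                std::size_t frame_len,
                                float threshold,
                                std::size_t hangover) {
    std::vector<float> out;
    if (frame_len == 0) return out;

    std::size_t keep = 0;  // frames still to keep, counting the current one
    for (std::size_t pos = 0; pos < in.size(); pos += frame_len) {
        const std::size_t n = std::min(frame_len, in.size() - pos);
        if (frame_rms(&in[pos], n) >= threshold) {
            keep = hangover + 1;
        }
        if (keep > 0) {
            const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
            out.insert(out.end(), first, first + static_cast<std::ptrdiff_t>(n));
            --keep;
        }
    }
    return out;
}
```

If the result is empty, the chunk was all silence and the call to whisper_full() can be skipped, which is where the CPU relief would come from. One caveat: removing gaps shifts any timestamps relative to the original stream, so this only fits if plain text output is all that is needed.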
