Adding audio output to gonbui #139
Very cool. A couple of suggestions come to mind:

Do you want to create the PR?
I am still experimenting with this. It is rather hard to make this into something more useful beyond just generating one output. The current use case was to create a "Podcast" like NotebookLM does. My code currently outputs multiple audio cues one after another in a loop. To make this feasible I added some code to delay the next output for roughly as long as the audio plays.

Having all of them saved with the notebook was kind of expected, though. That they all play together is not so nice. I guess one could use JS and events to create some smart "play once" and "play chained" logic, but I would not want to go further into that, and would rather write a real application outside the notebook. For my experiment I now use

I am not sure if something like this should actually be included. Probably it is better to just have an example of the basic functionality somewhere?
I just made some improvements to the code (there was some AI involved). This can only handle WAV, but it can autoplay and also optionally keep the audio:

```go
import (
	"bytes"
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
	"time"

	"github.com/janpfeifer/gonb/gonbui"
)

// getWAVDuration parses a PCM WAV header and returns the play duration.
func getWAVDuration(data []byte) (time.Duration, error) {
	if len(data) < 44 {
		return 0, errors.New("data too short to be a valid WAV file")
	}
	if string(data[0:4]) != "RIFF" {
		return 0, errors.New("invalid WAV file: missing 'RIFF' header")
	}
	if string(data[8:12]) != "WAVE" {
		return 0, errors.New("invalid WAV file: missing 'WAVE' header")
	}

	// Positions of the "fmt " and "data" chunks, once found.
	var fmtChunkPos, dataChunkPos int
	var dataChunkSize uint32

	// Start parsing chunks after the first 12 bytes ("RIFF" and "WAVE" headers).
	pos := 12
	for pos < len(data)-8 {
		chunkID := string(data[pos : pos+4])
		chunkSize := binary.LittleEndian.Uint32(data[pos+4 : pos+8])
		switch chunkID {
		case "fmt ":
			fmtChunkPos = pos + 8
		case "data":
			dataChunkPos = pos + 8
			dataChunkSize = chunkSize
			// Once we've found the data chunk, we can stop scanning.
			pos = len(data)
		}
		// Move to the next chunk (8 bytes for the chunk header + chunkSize).
		pos += 8 + int(chunkSize)
	}
	if fmtChunkPos == 0 {
		return 0, errors.New("invalid WAV file: missing 'fmt ' chunk")
	}
	if dataChunkPos == 0 {
		return 0, errors.New("invalid WAV file: missing 'data' chunk")
	}

	// Read the format parameters.
	audioFormat := binary.LittleEndian.Uint16(data[fmtChunkPos : fmtChunkPos+2])
	numChannels := binary.LittleEndian.Uint16(data[fmtChunkPos+2 : fmtChunkPos+4])
	sampleRate := binary.LittleEndian.Uint32(data[fmtChunkPos+4 : fmtChunkPos+8])
	bitsPerSample := binary.LittleEndian.Uint16(data[fmtChunkPos+14 : fmtChunkPos+16])
	if audioFormat != 1 {
		return 0, errors.New("unsupported audio format (only PCM is supported)")
	}

	// Duration = total samples / sample rate.
	bytesPerSample := bitsPerSample / 8
	totalSamples := dataChunkSize / (uint32(bytesPerSample) * uint32(numChannels))
	durationSeconds := float64(totalSamples) / float64(sampleRate)
	return time.Duration(durationSeconds * float64(time.Second)), nil
}

// PlayAudio displays an <audio> element for the given WAV bytes. With
// autoplay, it starts playing immediately, blocks for the duration of the
// clip, and then removes the transient player. With keep, a player is
// also saved with the notebook output.
func PlayAudio(audioBytes []byte, autoplay bool, keep bool) {
	var b bytes.Buffer
	htmlCellId := "PlayAudioAndWait_" + gonbui.UniqueId()
	w := base64.NewEncoder(base64.StdEncoding, &b)
	if _, err := w.Write(audioBytes); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	dataURL := "data:audio/wav;base64," + b.String()
	if autoplay {
		gonbui.UpdateHtml(htmlCellId, `<audio src="`+dataURL+`" controls autoplay></audio>`)
		sleep, err := getWAVDuration(audioBytes)
		if err != nil {
			// Unknown duration: don't block, just keep the player visible.
			// (A rough fallback estimate would be
			// time.Duration(len(audioBytes))*21*time.Microsecond.)
			keep = true
		} else {
			fmt.Println(sleep.String())
			time.Sleep(sleep)
			gonbui.UpdateHtml(htmlCellId, "")
		}
	}
	if keep {
		// Save a (non-autoplaying) player with the notebook output.
		gonbui.DisplayHtml(`<audio src="`+dataURL+`" controls></audio>`)
	}
}
```
Indeed, something more complex like chaining of audio could likely either be local to your application or live in a separate library. And I agree, it's easier to have the chaining of playback done in Javascript. There is an experimental

Btw, talking about NotebookLM, did you see the new Illuminate? It seems super cool as well. Although, I confess, I haven't tried NotebookLM much either.

From the code, one missing part is support for other audio MIME types often used in the browser (mpg, ogg, aac, flac, webm). I'm not sure if the MIME type can be guessed automatically from the
The approach above works quite well. Adding a "chain play" feature with some Javascript should not be hard, but I don't actually need that. My experiment worked out quite well; it is amazing what local models can do.

It can be guessed from the bytes, but for that one needs to add more packages. It may be too special to be added to GONB?

About WASM: we are using https://github.com/maxence-charriere/go-app/ quite a lot in production to create PWA-based applications, and I also have some private AI tools written with it, like my personal SD frontend that creates and delivers random images to the Divoom Pixoo-64 in my home office. I did not yet try it with GONB. Playing audio from WASM in the browser is something I solved some time ago, but that is also pretty complicated and it is much easier to just use

The Illuminate functionality seems very close to NotebookLM. To me, most AI stuff is only really interesting if I can keep my privacy when using it. This limits what I want to feed into public models.
That's really cool. I looked at https://github.com/maxence-charriere/go-app/ a long time ago; it seems to have improved quite significantly. I should try it in my next project! I have an old project of a multi-player RTS-like game (with few units) in Go purely in Wasm, but directly using https://github.com/go-webapi/webapi . It uses WebGL for graphics and it runs super smooth, it's really nice. But ... I had to do lots of heavy lifting of the basics. I also set up background music on the main menus of the game, but I haven't added sound effects yet. I should resume and open-source it...

The privacy issue in AI is not trivial ... since we are not able to run large models on our own computers.

Btw, what are you using for text-to-speech? After I submit the new Docker-with-tools PR, I'll take a stab at the audio one.
I'm using xtts-api-server for text-to-speech. It occasionally generates some strange audio artifacts or hallucinates funny sounds, but overall, I'm quite happy with it. For podcast generation, I use solar:10.7b-instruct-v1-q8_0, which works surprisingly well. I tested my idea on 20+ models and picked the ones that performed best, especially for generating JSON output. The main prompt is surprisingly literal:
After grabbing a website with go-readability (cached by I can assign specific voices to the hosts, like ones I created for Hannah Fry and Derek Muller for added realism. Then, I generate an image using Stable Diffusion, prompted by the LLM, with a straightforward prompt that keeps some recognizable style:
The generated line is also passed to CoquiTTS for voice generation, then combined with timing adjustments to produce the final output. I even translate the content into other languages (typically German). It works fairly well, and even when it messes up, it's still informative and fun. For even more fun, I use CoquiTTS to add accents, like Polish, and change the characters' interpretation of the story to be more playful or critical.

P.S.: I might release the notebook, though it requires backend services and models to function properly. (Edit: Used AI for grammar xD)
This is amazing Hans! I'm curious to see the notebook!! When it's ready, can I add a link/screenshot to the GoNB homepage?

Btw, the LLM solar:10.7b-instruct-v1-q8_0 is pretty large, so I assume it won't work on normal GPUs (maybe it works on an NVidia 4090). Do you run it on CPU? But since it only needs to run once to generate the
I use one RTX 3090 Ti (24 GB VRAM) for all of my AI stuff. It is quite fast: some seconds for the podcast script and then a few for the single images and TTS. I currently do not even use the audio output time for generating the next part.
As I am playing with TTS right now: how about adding a standard way to play audio?

This does the trick for me: