
Adding audio output to gonbui #139

Open
oderwat opened this issue Oct 12, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@oderwat

oderwat commented Oct 12, 2024

As I am playing with TTS right now: how about adding a standard way to play audio?

This does the trick for me:

func PlayAudio(data []byte) {
    var b bytes.Buffer
    w := base64.NewEncoder(base64.StdEncoding, &b)
    if _, err := w.Write(data); err != nil {
        panic(err)
    }
    if err := w.Close(); err != nil {
        panic(err)
    }
    dataURL := "data:audio/wav;base64," + b.String()
    gonbui.DisplayHtml(`<audio src="` + dataURL + `" controls autoplay></audio>`)
}
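
For example, called from a notebook cell (the file name here is hypothetical):

data, err := os.ReadFile("speech.wav")
if err != nil {
    panic(err)
}
PlayAudio(data)
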
@janpfeifer
Owner

janpfeifer commented Oct 13, 2024

Very cool! A couple of suggestions come to mind:

  • Should we add the audio format as an extra parameter (maybe a protocol.MIMEType, like we did for SendAsDownload)?
  • Maybe use gonbui.UpdateHtml instead, so it uses a transient cell output: this way the sound doesn't get saved as part of the notebook (or of its HTML export). Otherwise, I fear that when loading the notebook, all cell outputs that have sound would play at once 😄 One can use a dynamically created unique id (gonbui.UniqueID()) for each call.
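
A minimal sketch of that transient variant, assuming the gonbui.UpdateHtml and gonbui.UniqueID calls named above (the function name is made up):

func PlayAudioTransient(data []byte) {
    // A fresh id per call gives every clip its own transient output block.
    id := "audio_" + gonbui.UniqueID()
    dataURL := "data:audio/wav;base64," + base64.StdEncoding.EncodeToString(data)
    gonbui.UpdateHtml(id, `<audio src="`+dataURL+`" controls autoplay></audio>`)
}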

Do you want to create the PR?

@oderwat
Author

oderwat commented Oct 13, 2024

I am still experimenting with this. It is rather hard to turn this into something more useful than just generating a single output.

The current use case was to create a "Podcast" like NotebookLM does. My code currently outputs multiple audio cues one after the other in a loop. To make this feasible, I added some code that delays the next output for roughly as long as the audio plays.

Having all of them saved with the notebook was kind of expected, though. That they all play at once is not so nice. I guess one could use JS and events to create some smart "play once" and "play chained" logic. But I would not want to go further into that, and would rather write a real application outside the notebook.
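
For illustration, a "play chained" setup could look roughly like this (the ids and dataURL variables are hypothetical, and whether the inline script actually executes depends on the notebook frontend's HTML sanitization):

gonbui.DisplayHtml(`
<audio id="clip1" src="` + dataURL1 + `" controls autoplay></audio>
<audio id="clip2" src="` + dataURL2 + `" controls></audio>
<script>
// Start the second clip when the first one finishes.
document.getElementById("clip1").addEventListener("ended", () => {
    document.getElementById("clip2").play();
});
</script>`)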

For my experiment I now use UpdateHTML() and the rough timing calculation, which works because I know the sample rate and the number of bytes. This would break with other sample rates or compressed audio.
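
That rough estimate amounts to something like the following (the PCM parameters are assumptions, matching a typical 24 kHz, 16-bit mono TTS stream):

// Rough playback-time estimate for raw PCM with known, fixed parameters.
// Breaks for other sample rates or compressed audio, as noted above.
const sampleRate, channels, bytesPerSample = 24000, 1, 2
bytesPerSecond := sampleRate * channels * bytesPerSample
sleep := time.Duration(len(audioBytes)) * time.Second / time.Duration(bytesPerSecond)
time.Sleep(sleep)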

I am not sure if something like this should actually be included. It is probably better to just have an example of the basic functionality somewhere?

@oderwat
Author

oderwat commented Oct 13, 2024

I just made some improvements to the code (there was some AI involved). It can only handle "wav", but it can autoplay and also optionally keep the audio.

func getWAVDuration(data []byte) (time.Duration, error) {
    if len(data) < 44 {
        return 0, errors.New("data too short to be a valid WAV file")
    }
    if string(data[0:4]) != "RIFF" {
        return 0, errors.New("invalid WAV file: missing 'RIFF' header")
    }
    if string(data[8:12]) != "WAVE" {
        return 0, errors.New("invalid WAV file: missing 'WAVE' header")
    }

    // Initialize variables to hold format and data chunk positions
    var fmtChunkPos, dataChunkPos int
    var dataChunkSize uint32

    // Start parsing chunks after the first 12 bytes ("RIFF" and "WAVE" headers)
    pos := 12
    for pos < len(data)-8 {
        chunkID := string(data[pos : pos+4])
        chunkSize := binary.LittleEndian.Uint32(data[pos+4 : pos+8])

        switch chunkID {
        case "fmt ":
            fmtChunkPos = pos + 8
        case "data":
            dataChunkPos = pos + 8
            dataChunkSize = chunkSize
            // Once we've found the data chunk, we can break the loop
            pos = len(data)
        }

        // Move to the next chunk (8 bytes for chunk header + chunkSize)
        pos += 8 + int(chunkSize)
    }

    if fmtChunkPos == 0 {
        return 0, errors.New("invalid WAV file: missing 'fmt ' chunk")
    }
    if dataChunkPos == 0 {
        return 0, errors.New("invalid WAV file: missing 'data' chunk")
    }

    // Read format parameters
    audioFormat := binary.LittleEndian.Uint16(data[fmtChunkPos : fmtChunkPos+2])
    numChannels := binary.LittleEndian.Uint16(data[fmtChunkPos+2 : fmtChunkPos+4])
    sampleRate := binary.LittleEndian.Uint32(data[fmtChunkPos+4 : fmtChunkPos+8])
    bitsPerSample := binary.LittleEndian.Uint16(data[fmtChunkPos+14 : fmtChunkPos+16])

    if audioFormat != 1 {
        return 0, errors.New("unsupported audio format (only PCM is supported)")
    }

    // Calculate the duration
    bytesPerSample := bitsPerSample / 8
    totalSamples := dataChunkSize / uint32(bytesPerSample*uint16(numChannels))
    durationSeconds := float64(totalSamples) / float64(sampleRate)
    duration := time.Duration(durationSeconds * float64(time.Second))
    return duration, nil
}

func PlayAudio(audioBytes []byte, autoplay bool, keep bool) {
    var b bytes.Buffer
    htmlCellId := "PlayAudioAndWait_" + gonbui.UniqueID()
    w := base64.NewEncoder(base64.StdEncoding, &b)
    if _, err := w.Write(audioBytes); err != nil {
        panic(err)
    }
    if err := w.Close(); err != nil {
        panic(err)
    }
    dataURL := "data:audio/wav;base64," + b.String()

    if autoplay {
        gonbui.UpdateHtml(htmlCellId, `<audio src="`+dataURL+`" controls autoplay></audio>`)
        sleep, err := getWAVDuration(audioBytes)
        if err != nil {
            // Cannot compute the duration, so keep the player instead of sleeping.
            // A rough fallback would be:
            // sleep := time.Duration(len(audioBytes)) * time.Microsecond * 21
            keep = true
        } else {
            fmt.Println(sleep.String())
            time.Sleep(sleep)
            gonbui.UpdateHtml(htmlCellId, "") // clear the transient output
        }
    }

    if keep {
        // Emit a permanent player that is saved with the notebook, without autoplay.
        gonbui.DisplayHtml(`<audio src="`+dataURL+`" controls></audio>`)
    }
}
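
Usage then looks like this, for example (wavBytes being the WAV data):

// Autoplay transiently and block until the clip should have finished:
PlayAudio(wavBytes, true, false)
// Or just embed a player that stays saved with the notebook:
PlayAudio(wavBytes, false, true)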

@janpfeifer
Owner

Indeed, something more complex like chaining of audio could likely either be local to your application, or live in a separate library.

And I agree, it's easier to do the chaining of playback in Javascript. There is an experimental %wasm feature, which compiles and runs the cell as wasm in the notebook, if you are feeling brave -- but then the whole cell program will run in the browser.

Btw, talking about NotebookLM, did you see the new Illuminate? It seems super cool as well. Although, I confess, I haven't tried NotebookLM much either.

From the code, one missing part is support for the other audio mime types often used in the browser (mpg, ogg, aac, flac, webm). Not sure if the mime type can be guessed automatically from the audioBytes, or if it should be passed as a parameter ...
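
(For what it's worth, the standard library's net/http can sniff a few of those signatures -- RIFF/WAVE, ID3-tagged MP3, Ogg, AIFF -- though FLAC or raw AAC would need an extra package. A sketch, under that assumption:)

// DetectContentType inspects at most the first 512 bytes and falls back
// to "application/octet-stream" when no signature matches.
func guessAudioMIME(data []byte) string {
    return http.DetectContentType(data)
}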

@oderwat
Author

oderwat commented Oct 13, 2024

The approach above works quite well. Adding a "chain play" feature with some Javascript should not be hard, but I do not actually need that. My experiment worked out quite well. It is amazing what local models can do.

It can be guessed from the bytes, but for that one needs to add more packages. It may be too special to be added to GoNB?

About WASM: we are using https://github.com/maxence-charriere/go-app/ quite a lot in production to create PWA-based applications, and I also have some private AI tools written with it, like my personal SD frontend that creates and delivers random images to the Divoom Pixoo-64 in my home office. I have not yet tried it with GoNB. Playing audio from WASM in the browser is something I solved some time ago, but that is also pretty complicated, and it is much easier to just use <audio>.

The Illuminate functionality seems very close to NotebookLM. To me, most AI stuff is only really interesting if I can keep my privacy while using it. This limits what I want to feed into public models.

@janpfeifer
Owner

That's really cool. I looked at https://github.com/maxence-charriere/go-app/ a long time ago; it seems to have improved quite significantly. I should try it in my next project! I have an old project of a multi-player RTS-like game (with few units) in Go, purely in Wasm, but directly using https://github.com/go-webapi/webapi . It uses WebGL for graphics and runs super smoothly; it's really nice. But ... I had to do lots of heavy lifting for the basics. I also set up background music on the main menus of the game, but I haven't added sound effects yet. I should resume and open-source it...

The privacy issue in AI is not trivial ... since we are not able to run large models on our own computers. Btw, what are you using for text-to-speech?

After I submit the new Docker with tools PR, I'll take a stab at the audio one.

@janpfeifer janpfeifer added the enhancement New feature or request label Oct 15, 2024
@oderwat
Author

oderwat commented Oct 15, 2024

I'm using xtts-api-server for text-to-speech. It occasionally generates some strange audio artifacts or hallucinates funny sounds, but overall, I'm quite happy with it.

For podcast generation, I use solar:10.7b-instruct-v1-q8_0, which works surprisingly well. I tested my idea on 20+ models and picked the ones that performed best, especially for generating JSON output. The main prompt is surprisingly literal:

var Person1 = "Hannah"
var Persona1 = "Friendly Female"
var Person2 = "Derek"
var Persona2 = "Mindful male"
var Language = "de"
var PodcastLanguage = "German"
var PodcastStyle = "Let them discuss the story in detail and what they see as the most interesting aspects"

prompt := `The story:

` + Story + `

Take this story and create a podcast with ` + Person1 + ` (` + Persona1 + `) and ` + Person2 + ` (` + Persona2 + `) as hosts.
` + PodcastStyle + `
Output the conversation as a transcript using JSON like in the following example:

[
  { 
    "person": "` + Person1 + `", 
    "text": "first line" 
  },{
    "person": "` + Person2 + `",
    "text": "second line"
  },{
    "person": "` + Person1 + `",
    "text": "third line"
  }
]

Use ` + PodcastLanguage + ` language for the conversation.
`

After grabbing a website with go-readability (cached by gonbcache), I unmarshal the result. For some models, I make small adjustments to the output, like adding missing outer brackets ([]) or cutting out extra JSON block markers.
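
A minimal sketch of that cleanup and unmarshalling step (the helper and struct names are made up; the JSON fields follow the example in the prompt above):

type Line struct {
    Person string `json:"person"`
    Text   string `json:"text"`
}

func parseTranscript(raw string) ([]Line, error) {
    s := strings.TrimSpace(raw)
    // Some models wrap the JSON in ```-style block markers; strip them.
    s = strings.TrimPrefix(s, "```json")
    s = strings.TrimPrefix(s, "```")
    s = strings.TrimSuffix(s, "```")
    s = strings.TrimSpace(s)
    // Others drop the outer brackets; add them back if missing.
    if !strings.HasPrefix(s, "[") {
        s = "[" + s + "]"
    }
    var lines []Line
    err := json.Unmarshal([]byte(s), &lines)
    return lines, err
}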

I can assign specific voices to the hosts, like ones I created for Hannah Fry and Derek Muller for added realism.

Then, I generate an image using Stable Diffusion, prompted by the LLM, with a straightforward prompt that keeps some recognizable style:

prompt = `Use the quoted text to create a brief image prompt that illustrates the content: "` + line.Text + `".
        
Image Prompt: `

// ollama here

prompt = "(plain colorful sketch illustration:1.3), " + prompt + ", (colorful sketch illustration:1.3)"

// sd-webui here (using "realcartoonRealistic_v13" or other nice models)

The generated line is also passed to CoquiTTS for voice generation, then combined with timing adjustments to produce the final output.

I even translate the content into other languages (typically German). It works fairly well, and even when it messes up, it’s still informative and fun. For even more fun, I use CoquiTTS to add accents, like Polish, and change the characters’ interpretation of the story to be more playful or critical.

P.S.: I might release the notebook, though it requires backend services and models to function properly.


(Edit: Used AI for grammar xD)

@janpfeifer
Owner

This is amazing, Hans! I'm curious to see the notebook!! When it's ready, can I add a link/screenshot to the GoNB homepage?

Btw, the LLM solar:10.7b-instruct-v1-q8_0 is pretty large, so I assume it won't work on normal GPUs (maybe it works on an NVidia 4090). Do you run it on CPU? Since it only needs to run once to generate the story, I suppose it's ok for it to be slow?

@oderwat
Author

oderwat commented Oct 15, 2024

I use one RTX 3090 Ti (24 GB VRAM) for all of my AI stuff. It is quite fast: a few seconds for the podcast script, and then a few more for the individual images and TTS. I currently do not even use the audio playback time to generate the next part.
