feat: Pure Python audio chat app with Multimodal Live API #1551
Hi @freddyaboulton and reviewers,
Gemini is currently reviewing this pull request; a full review should follow in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:
This PR adds a pure-Python web application built with Gradio that enables audio input/output streaming for the multimodal API. The goal is to provide a more accessible entry point for developers who prefer Python, eliminating the need for HTML and JavaScript knowledge to get started with the multimodal API. The main changes involve creating a new app.py file (gemini/gradio-voice/app.py) which contains a complete Gradio application for voice interaction with the Gemini API. The requirements.txt file (gemini/gradio-voice/requirements.txt) has also been updated to include the necessary libraries: gradio_webrtc, librosa, and python-dotenv.
The app.py file implements a WebSocket connection to the Gemini API, handles audio encoding and decoding, and manages the real-time audio streaming between the user and the Gemini model. The Gradio interface provides a simple user experience for interacting with the voice chat functionality. The API key is securely handled via a password-protected textbox.
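To illustrate what "audio encoding and decoding" over a JSON WebSocket typically involves, here is a minimal sketch that packs 16-bit PCM samples into base64 and back. It is not taken from app.py: the helper names are hypothetical, and the `realtime_input` / `media_chunks` envelope is an assumption about the message shape rather than something this thread confirms.

```python
import base64
import json
import struct


def encode_pcm16(samples):
    """Pack ints in [-32768, 32767] as little-endian 16-bit PCM, then base64."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(raw).decode("ascii")


def decode_pcm16(payload):
    """Reverse encode_pcm16: base64 text back to a list of ints."""
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


def audio_message(samples, mime_type="audio/pcm"):
    """Wrap one audio chunk in a JSON envelope (field names are assumed)."""
    chunk = {"mime_type": mime_type, "data": encode_pcm16(samples)}
    return json.dumps({"realtime_input": {"media_chunks": [chunk]}})
```

JSON cannot carry raw bytes, which is why the audio is base64-encoded before it is sent and decoded again on receipt.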
If there's anything unclear or missing from this summary, please let me know. I'll provide a more detailed review shortly.
And now, a little haiku to lighten the mood:
Python's gentle grace,
Streams of audio, flowing free,
Gemini speaks back.
This pull request introduces a Python-based Gradio web application that facilitates real-time voice chat with Gemini. This is a great addition, as it allows developers to interact with Gemini using a familiar language and a user-friendly interface. The code is generally well-structured, but there are some areas where improvements can be made to enhance robustness, error handling, and user experience.
gemini/gradio-voice/app.py
Outdated
    assert self.config, "Config not set"
    try:
        self.ws = websockets.sync.client.connect(self.config.ws_url, timeout=30)
        initial_request = {
            "setup": {
                "model": self.config.model,
            }
        }
        self.ws.send(json.dumps(initial_request))
        setup_response = json.loads(self.ws.recv())
        print(f"Setup response: {setup_response}")
    except websockets.exceptions.WebSocketException as e:
        print(f"WebSocket connection failed: {str(e)}")
        self.ws = None
    except Exception as e:
        print(f"Setup failed: {str(e)}")
        self.ws = None
The error handling in _initialize_websocket could be improved. Currently, both WebSocketException and generic exceptions are caught and printed, but the function doesn't return any indication of failure. This can make it difficult for calling functions to handle connection errors appropriately. Consider raising the caught exceptions after printing the error message, or returning an error status. Additionally, it's a good practice to log the exception details for debugging purposes. How would you modify the code to propagate or handle these errors more effectively?
    def _initialize_websocket(self):
        assert self.config, "Config not set"
        try:
            self.ws = websockets.sync.client.connect(self.config.ws_url, timeout=30)
            initial_request = {
                "setup": {
                    "model": self.config.model,
                }
            }
            self.ws.send(json.dumps(initial_request))
            setup_response = json.loads(self.ws.recv())
            print(f"Setup response: {setup_response}")
            return setup_response
        except websockets.exceptions.WebSocketException as e:
            print(f"WebSocket connection failed: {str(e)}")
            raise
        except Exception as e:
            print(f"Setup failed: {str(e)}")
            raise
gemini/gradio-voice/app.py
Outdated
    if not self.ws:
        self._initialize_websocket()
gemini/gradio-voice/app.py
Outdated
        continue

    try:
        message = self.ws.recv(timeout=5)
gemini/gradio-voice/app.py
Outdated
    try:
        message = self.ws.recv(timeout=5)
        msg = json.loads(message)
gemini/gradio-voice/requirements.txt
Outdated
    @@ -0,0 +1,3 @@
    gradio_webrtc==0.0.23
Pinning the gradio_webrtc version is a good practice for reproducibility. However, it's generally recommended to specify a version range rather than a single version to allow for bug fixes and minor updates. Consider using a compatible version range like gradio-webrtc>=0.0.23,<0.1.0 to allow for updates while avoiding potentially breaking changes.
    gradio_webrtc>=0.0.23,<0.1.0
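To make the effect of the suggested range concrete, the sketch below checks which versions `>=0.0.23,<0.1.0` admits. It is a simplification that assumes plain dotted numeric versions; `parse` and `in_range` are hypothetical helpers, and real installers such as pip apply the full PEP 440 rules (pre-releases, post-releases, and so on).

```python
def parse(version):
    """Turn '0.0.23' into (0, 0, 23). Numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lower="0.0.23", upper="0.1.0"):
    """True if lower <= version < upper, mirroring '>=lower,<upper'."""
    return parse(lower) <= parse(version) < parse(upper)


for candidate in ["0.0.22", "0.0.23", "0.0.31", "0.1.0"]:
    print(candidate, in_range(candidate))
```

The loop reports False, True, True, False: patch releases from 0.0.23 onward are accepted, while 0.1.0 and anything older than 0.0.23 are rejected. For final releases, the compatible-release form `gradio_webrtc~=0.0.23` expresses the same range.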
Please also resolve the spelling errors and respond to the Gemini review comments. Deferring to Zack/Katie for further review.
gemini/gradio-voice/app.py
Outdated
    def _initialize_websocket(self) -> None:
        assert self.config, "Config not set"
        try:
            self.ws = websockets.sync.client.connect(self.config.ws_url, timeout=30)
            initial_request = {
                "setup": {
                    "model": self.config.model,
                }
            }
            self.ws.send(json.dumps(initial_request))
            setup_response = json.loads(self.ws.recv())
            print(f"Setup response: {setup_response}")
        except websockets.exceptions.WebSocketException as e:
            print(f"WebSocket connection failed: {str(e)}")
            self.ws = None
Can you try using the genai.live.start_stream method instead of raw websockets calls? https://googleapis.github.io/python-genai/genai.html#module-genai.live
Thank you @holtskinner! I will address the Gemini comments today/tomorrow. I think there is something wrong with the spellcheck as […]
Hi @freddyaboulton, Regarding your question about the spellcheck flagging […]
Hi @holtskinner @katiemn @ZackAkil - I switched the demo to use […]
gemini/gradio-voice/app.py
Outdated
    client = genai.Client(api_key=api_key)
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(
        model="models/gemini-2.0-flash-exp", config=config
It looks like this demo connects to Gemini via Google AI Studio. This repository is for demos and code samples for Vertex AI. Please check out this repo to contribute Google AI Studio samples, https://github.com/google-gemini/cookbook/
Thanks for pointing that out, Katie! This PR is indeed using the Gemini API, and it's a valid point that this repo is focused on Vertex AI. I'll work with freddyaboulton to determine the best next steps for this demo. It might be best suited for the Google Gemini cookbook as you suggested.
Alternatively, if you can switch this demo to use Vertex AI instead, we're happy to host it in this repo.
Hi @holtskinner @katiemn - just switched over to use Vertex AI!
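For context on the switch discussed above: with the google-genai SDK, the same `genai.Client` class can target either the Gemini Developer API (via an API key) or Vertex AI (via a project and location). The sketch below is not the PR's code. `client_kwargs` and `make_client` are hypothetical helpers, and the default region is a placeholder assumption.

```python
def client_kwargs(*, api_key=None, project=None, location="us-central1"):
    """Build keyword arguments for genai.Client for one of the two backends."""
    if api_key:
        # Gemini Developer API (Google AI Studio): authenticate with an API key.
        return {"api_key": api_key}
    if not project:
        raise ValueError("Provide api_key, or project for Vertex AI")
    # Vertex AI: uses Google Cloud credentials plus a project and region.
    return {"vertexai": True, "project": project, "location": location}


def make_client(**kwargs):
    """Create the client; needs the google-genai package installed."""
    from google import genai  # imported lazily so the helper above has no deps

    return genai.Client(**client_kwargs(**kwargs))
```

With this split, moving a demo between backends changes only the arguments passed to `make_client`, for example `make_client(project="my-project")` with a placeholder project ID.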
Description
The multimodal-live-app requires knowledge of HTML and JavaScript. Adding a pure-Python web app built with Gradio that supports audio input/output streaming will help developers get started building with the Multimodal Live API without leaving their preferred language.
[…] CONTRIBUTING Guide.
[…] CODEOWNERS for the file(s).
[…] nox -s format from the repository root to format).