added Gemini Multimodal Live API Developer Guide #160

Open
wants to merge 18 commits into
base: main

Conversation

heiko-hotz
Pull-Request Template

Thank you for your contribution! Please provide a brief description of your changes and ensure you've completed the checklist below.

Description

What does this PR do? Why is it necessary?

Fixes # (if applicable)

Checklist

  • Contribution Guidelines: I have read the Contribution Guidelines.
  • CLA: I have signed the CLA.
  • Authorship: I am listed as the author (if applicable).
  • Conventional Commits: My PR title and commit messages follow the Conventional Commits spec.
  • Code Format: I have run nox -s format to format the code.
  • Spelling: I have fixed any spelling errors, and added false positives to .github/actions/spelling/allow.txt if necessary.
  • Template: I have followed the aaie_notebook_template.ipynb if submitting a new Jupyter notebook.
  • Sync: My Fork is synced with the upstream.
  • Documentation: I have updated relevant documentation (if applicable) in the docs folder.

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@code-review-assist code-review-assist bot left a comment

Hi team,

I'm currently reviewing this pull request and will post my full review in a few minutes. In the meantime, here's a quick summary for you and other reviewers to get up to speed:

This pull request adds a Gemini Multimodal Live API Developer Guide to the applied-ai-engineering-samples repository. The guide is intended to provide a comprehensive, hands-on learning experience for developers looking to build real-time applications using Gemini's multimodal capabilities. The guide includes multiple parts:

  • Part 1: Introduction to Gemini's Multimodal Live API (basic concepts and SDK usage)
  • Part 2: WebSocket Development with Gemini Developer API (building a production-ready multimodal AI assistant)
  • Part 3: WebSocket Development with Vertex AI API (enterprise-grade implementation)

The changes include:

  • Addition of a new directory: genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide
  • Creation of several Markdown files explaining the project and its different parts
  • Creation of a .gitignore file to manage ignored files
  • Addition of a CONTRIBUTING.md file
  • Addition of a LICENSE file
  • Addition of several Jupyter notebooks and Python scripts to demonstrate different aspects of the Gemini AI capabilities

I haven't been able to fully assess the correctness of the code examples yet, but the overall structure and organization of the guide seem well-planned. I'll provide more detailed feedback in my full review.


Here's a little haiku to lighten the mood:

Code flows like a stream,
Reviews guide it to the sea,
Perfect, polished code.

@code-review-assist code-review-assist bot left a comment

This pull request adds a Gemini Multimodal Live API Developer Guide, which is a great addition. The documentation is comprehensive and well-structured, covering various aspects of building real-time multimodal applications with Gemini. The README files provide clear explanations and the patches include necessary files like .gitignore, CONTRIBUTING.md, and LICENSE. I have a few suggestions based on Google style guides and general best practices that would enhance clarity and maintainability.

Summary of Google style guide elements referenced:

  • Filenames should be lowercase with hyphens as word separators.
  • Use descriptive names for functions and variables.
  • Comments should be clear and concise.
  • Use sentence-style capitalization for commit messages.
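The filename rule in the list above can be checked mechanically. The following is a minimal sketch, not part of the PR or of any Google tooling: the regular expression and the helper names `follows_naming_rule` and `find_violations` are assumptions made for illustration, and the pattern deliberately ignores conventional exceptions such as `README.md` or dotfiles like `.gitignore`.

```python
import re

# Hypothetical pattern: lowercase alphanumeric words joined by single hyphens,
# optionally followed by dot-separated extensions (e.g. "audio-to-audio.py").
KEBAB_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*(?:\.[a-z0-9]+)*")


def follows_naming_rule(filename: str) -> bool:
    """Return True if the whole name is lowercase with hyphen separators."""
    return KEBAB_NAME.fullmatch(filename) is not None


def find_violations(filenames):
    """Return the names that do not match the pattern, in input order."""
    return [name for name in filenames if not follows_naming_rule(name)]


if __name__ == "__main__":
    # Names taken from this PR; underscores and uppercase letters are flagged.
    for name in find_violations(["audio-to-audio.py", "part_2_dev_api", "README.md"]):
        print(f"does not match the naming rule: {name}")
```

A real check would probably carry an allowlist for names such as `README.md`, `LICENSE`, and `CONTRIBUTING.md`, which are uppercase by convention.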

@code-review-assist code-review-assist bot left a comment

Hi team,

I'm currently reviewing this pull request and will post my full review in a few minutes. In the meantime, here's a quick summary for you and other reviewers to get up to speed:

This pull request adds a Gemini Multimodal Live API Developer Guide to the applied-ai-engineering-samples repository. The guide is designed to help developers build real-time applications that can see, hear, and interact naturally using Gemini's multimodal capabilities.

Here's a breakdown of the changes:

  • New Directory: A new directory, genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide, has been created to house the guide's content.
  • .gitignore: A comprehensive .gitignore file has been added to exclude common build artifacts and temporary files.
  • CONTRIBUTING.md: A CONTRIBUTING.md file has been added, outlining the contribution guidelines (currently indicating that contributions are not being accepted at this time).
  • LICENSE: An Apache 2.0 License file has been included.
  • README.md: A README.md file provides an overview of the guide, its structure, key concepts covered, and prerequisites.
  • Part 1 (Introduction): This section introduces basic concepts and SDK usage, covering text and audio interactions.
  • Part 2 (Gemini Developer API): This section focuses on WebSocket development using the Gemini Developer API, building towards a production-ready multimodal AI assistant.
  • Part 3 (Vertex AI API): This section demonstrates the same functionalities as Part 2, but using the Vertex AI API.
  • Individual Chapters: Each part contains several chapters with detailed explanations, code examples, and Jupyter notebooks.

The guide covers a wide range of topics, including real-time communication, audio processing, video integration, and production deployment considerations.

I hope this summary is helpful. Let me know if you have any questions.


And now, a little haiku to brighten your day:

Code flows like a stream,
Reviews guide the path to clear,
Gems of code emerge.

@code-review-assist code-review-assist bot left a comment

This pull request adds a Gemini Multimodal Live API Developer Guide, which is a valuable contribution. The documentation is comprehensive and well-structured, covering various aspects of building real-time multimodal applications with Gemini. The README files provide clear explanations, helpful guidance, and a good overview of the project. The inclusion of a .gitignore file is also a good practice.

I've reviewed the changes with a focus on correctness, efficiency, maintainability, and security, considering common best practices and language idioms. I also checked for adherence to standard style guides, leveraging Markdown for better readability.

Here's a summary of the style guide elements I referenced, which are common across multiple style guides:

  • File naming: Descriptive filenames in lowercase with hyphens.
  • Headings: Sentence-case for headings.
  • Lists: Hyphens for unordered lists.
  • Links: Descriptive link text.
  • Code blocks: Backticks for inline code and triple backticks for code blocks.
  • Images: Descriptive alt text for images.

A few suggestions are provided below to further enhance clarity, consistency, and address potential areas for improvement.

Comment on lines 71 to 78
| Feature | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter |
|---------|----------------|-------------------|
| Basic WebSocket Setup | [Chapter 3](part_2_dev_api/chapter_03) | - |
| Text-to-Speech | [Chapter 4](part_2_dev_api/chapter_04) | - |
| Real-time Audio Chat | [Chapter 5](part_2_dev_api/chapter_05) | [Chapter 9](part_3_vertex_api/chapter_09) |
| Multimodal (Audio + Video) | [Chapter 6](part_2_dev_api/chapter_06) | [Chapter 10](part_3_vertex_api/chapter_10) |
| Function Calling & Instructions | [Chapter 7](part_2_dev_api/chapter_07) | [Chapter 11](part_3_vertex_api/chapter_11) |
| Production Deployment (Project Pastra)| [Chapter 8](part_2_dev_api/chapter_08) | [Chapter 12](part_3_vertex_api/chapter_12) |

low

The table would be more readable with aligned columns. Most Markdown style guides recommend aligning columns in tables for better readability.

Suggested change
| Feature | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter |
|---------|--------------------------|--------------------------|
| Basic WebSocket Setup | [Chapter 3](part_2_dev_api/chapter_03) | - |
| Text-to-Speech | [Chapter 4](part_2_dev_api/chapter_04) | - |
| Real-time Audio Chat | [Chapter 5](part_2_dev_api/chapter_05) | [Chapter 9](part_3_vertex_api/chapter_09) |
| Multimodal (Audio + Video) | [Chapter 6](part_2_dev_api/chapter_06) | [Chapter 10](part_3_vertex_api/chapter_10) |
| Function Calling & Instructions | [Chapter 7](part_2_dev_api/chapter_07) | [Chapter 11](part_3_vertex_api/chapter_11) |
| Production Deployment (Project Pastra)| [Chapter 8](part_2_dev_api/chapter_08) | [Chapter 12](part_3_vertex_api/chapter_12) |
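The column alignment requested in the comment above can be produced by padding every cell to the width of the widest cell in its column. The sketch below is one way to do that, under stated assumptions: the function name `align_markdown_table` is hypothetical, the input is a simple pipe table whose second line is the separator row, every row has the same number of cells, and no cell contains an escaped pipe.

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def align_markdown_table(table: str) -> str:
    """Pad cells so that the pipes of every row line up vertically."""
    rows = [split_row(line) for line in table.strip().splitlines()]
    # Row 1 is assumed to be the separator; it does not influence the widths.
    content_rows = rows[:1] + rows[2:]
    widths = [
        max(3, max(len(row[i]) for row in content_rows))
        for i in range(len(rows[0]))
    ]
    aligned = []
    for index, row in enumerate(rows):
        if index == 1:
            cells = ["-" * width for width in widths]
        else:
            cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        aligned.append("| " + " | ".join(cells) + " |")
    return "\n".join(aligned)


if __name__ == "__main__":
    sample = (
        "| Feature | Part 2 - Dev API Chapter |\n"
        "|---------|----------------|\n"
        "| Basic WebSocket Setup | [Chapter 3](part_2_dev_api/chapter_03) |\n"
    )
    print(align_markdown_table(sample))
```

Alignment affects only the raw Markdown source; the rendered table looks the same either way.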

- Chapter 4: http://localhost:8000/chapter_04/
And so on...

3. Begin with Chapter 3 to understand the fundamentals of WebSocket communication with Gemini. Each subsequent chapter builds upon previous concepts, gradually introducing more complex features and capabilities. By Chapter 8, you'll have transformed the development prototype into Project Pastra - a production-ready AI assistant that demonstrates the future of human-AI interaction.

low

This sentence could be improved for clarity and conciseness. Consider splitting it into two shorter sentences, which is generally recommended for better readability.

Suggested change
Begin with Chapter 3 to understand the fundamentals of WebSocket communication with Gemini. Each subsequent chapter builds upon previous concepts, gradually introducing more complex features and capabilities. By Chapter 8, you'll have built Project Pastra, a production-ready AI assistant that demonstrates the future of human-AI interaction.

@@ -0,0 +1,36 @@
# How to Contribute

At this time, we're not looking to take contributions to this project.

low

While the project isn't accepting contributions right now, it's good to keep the CONTRIBUTING.md welcoming. Perhaps suggest how people can still engage, like reporting issues or discussing ideas.

Suggested change
While we are not actively accepting code contributions at this time, we welcome your feedback, bug reports, and feature requests. Please open an issue to share your thoughts or report any problems you encounter.

@code-review-assist code-review-assist bot left a comment

Hi team,

I'm currently reviewing this pull request and will post my full review in a few minutes. In the meantime, here's a quick summary for you and other reviewers to get up to speed:

This pull request adds a Gemini Multimodal Live API Developer Guide to the applied-ai-engineering-samples repository. The guide is designed to help developers build real-time applications that can see, hear, and interact naturally using Gemini's multimodal capabilities.

Here's a breakdown of the changes:

  • New Directory: A new directory, genai-on-vertex-ai/gemini_2_0/gemini-multimodal-live-api-dev-guide, has been created to house the guide's content.
  • .gitignore: A comprehensive .gitignore file has been added to manage ignored files and directories within the guide's directory.
  • CONTRIBUTING.md: A CONTRIBUTING.md file has been added, outlining the contribution process (currently contributions are not being accepted).
  • LICENSE: An Apache 2.0 License file has been added.
  • README.md: A README.md file provides an overview of the guide, including what you'll learn, key concepts covered, guide structure, feature implementation, prerequisites, and key differences between the Development API and Vertex AI implementations.
  • Part 1 (Introduction): This section provides a foundational introduction to working with Google's Gemini AI model.
  • Part 2 (WebSocket Development with Gemini API): This section demonstrates how to work directly with the Gemini API using WebSockets, progressively building towards Project Pastra.
  • Part 3 (WebSocket Development with Vertex AI API): This section demonstrates how to work with Vertex AI using WebSockets, progressively building towards Project Pastra.
  • Chapters: Each part contains multiple chapters with detailed instructions and code examples.

I haven't fully reviewed the code yet, but this summary should give you a good starting point. I'll provide more detailed feedback shortly.


Here's a little poem to brighten your day:

Code flows like a river,
Tests like rapids, strong and swift,
Reviews guide the way.

@code-review-assist code-review-assist bot left a comment

This pull request adds a Gemini Multimodal Live API Developer Guide, which is a great addition. The documentation is well-written and comprehensive. The structure of the guide, progressing from basic concepts to advanced implementations, including both Development API and Vertex AI versions, is logical and helpful for developers. The Feature Implementation Guide and the Key Differences Between Dev API and Vertex AI sections are particularly useful for navigating the different implementations. A few minor suggestions below.

Summary of Google Python Style Guide elements referenced:

@@ -0,0 +1,36 @@
# How to Contribute

At this time, we're not looking to take contributions to this project.

low

While contributions may not be actively sought at this time, it's generally good practice to keep contribution guidelines open-ended. This encourages future contributions and provides a clear process if someone does want to contribute. Consider rephrasing to something like, "We welcome contributions! At this time, we are prioritizing X, Y, and Z, so contributions in those areas may be reviewed more quickly. Please ensure all contributions adhere to the following guidelines:"

Suggested change
We welcome contributions! At this time, we are prioritizing improvements to the documentation and examples, so contributions in those areas may be reviewed more quickly. Please ensure all contributions adhere to the following guidelines:

Comment on lines 71 to 78
| Feature | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter |
|---------|----------------|-------------------|
| Basic WebSocket Setup | [Chapter 3](part_2_dev_api/chapter_03) | - |
| Text-to-Speech | [Chapter 4](part_2_dev_api/chapter_04) | - |
| Real-time Audio Chat | [Chapter 5](part_2_dev_api/chapter_05) | [Chapter 9](part_3_vertex_api/chapter_09) |
| Multimodal (Audio + Video) | [Chapter 6](part_2_dev_api/chapter_06) | [Chapter 10](part_3_vertex_api/chapter_10) |
| Function Calling & Instructions | [Chapter 7](part_2_dev_api/chapter_07) | [Chapter 11](part_3_vertex_api/chapter_11) |
| Production Deployment (Project Pastra)| [Chapter 8](part_2_dev_api/chapter_08) | [Chapter 12](part_3_vertex_api/chapter_12) |

low

The Feature Implementation Guide is helpful. Consider adding a column for "Part 1 - Intro Chapter" to show where basic concepts are introduced. Also, consider adding rows for other features like "Interruption Handling", "Error Handling", "Production Deployment", "Security Considerations", etc., to make the guide even more comprehensive.

heiko-hotz and others added 17 commits December 31, 2024 17:30
…uide/CONTRIBUTING.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_1_intro/chapter_02/audio-to-audio.py

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_1_intro/chapter_02/README.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/README.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_1_intro/README.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_1_intro/chapter_02/README.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_3_vertex_api/README.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_2_dev_api/chapter_04/README.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_2_dev_api/chapter_03/README.md

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
…uide/part_1_intro/chapter_02/audio-to-audio.py

Co-authored-by: code-review-assist[bot] <182814678+code-review-assist[bot]@users.noreply.github.com>
2 participants