
Next.js Gemini Audio Stream Realtime 🎙️

A real-time voice interface for Gemini 2.0, leveraging the Multimodal Live API. Built with Next.js, this project enables seamless voice, webcam, screen-sharing, and text interactions with Google's most advanced AI model.

Demo

🚀 Live Demo

Check out the live demo here

🛠️ Getting Started

Prerequisites

  • Google Cloud SDK installed and configured
  • Node.js or Bun runtime
  • Basic familiarity with Next.js

Setup

Clone and Install Dependencies:

git clone https://github.com/LivioGama/nextjs-gemini-audio-stream-realtime.git
cd nextjs-gemini-audio-stream-realtime
bun install

⚙️ Configure Environment

  1. Copy .env.example to .env:
cp .env.example .env
  2. Generate a service account:

Go to Google Cloud Console and select or create your project. Ensure Vertex AI API is enabled.

Create a new service account with the roles/aiplatform.user role and download its JSON key file.
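
If you prefer the CLI, the equivalent setup looks roughly like this with gcloud; the service account name gemini-realtime and the $PROJECT_ID variable are placeholders, adjust them to your project:

# Enable Vertex AI and create the service account (names are placeholders)
gcloud services enable aiplatform.googleapis.com --project=$PROJECT_ID
gcloud iam service-accounts create gemini-realtime --project=$PROJECT_ID

# Grant the required role and download the JSON key file
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:gemini-realtime@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"
gcloud iam service-accounts keys create service-account.json \
  --iam-account=gemini-realtime@$PROJECT_ID.iam.gserviceaccount.com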

  3. Set Required Environment Variables:
NEXT_PUBLIC_PROXY_URL='ws://localhost:3000/gemini-ws'
NEXT_PUBLIC_MODEL='gemini-2.0-flash-exp'
NEXT_PUBLIC_API_HOST='us-central1-aiplatform.googleapis.com'
NEXT_PUBLIC_PROJECT_ID=<your-gcp-project-id>
GOOGLE_CLIENT_EMAIL="<retrieve-from-downloaded-service-account-json>"
GOOGLE_PRIVATE_KEY="<retrieve-from-downloaded-service-account-json>"

⚠️ If you get an ERR_OSSL_UNSUPPORTED error, be very careful with the GOOGLE_PRIVATE_KEY formatting: it must be a single-line string between quotes, with the real newlines converted to literal \n sequences.

I actually had to run this JS snippet to get the correct value:

// Paste the multi-line private key from the downloaded JSON key file here
const original = '-----BEGIN PRIVATE KEY-----\n...'
// Prints a single-line version with literal \n sequences, ready to paste into .env
console.log(original.replace(/\n/g, '\\n'))
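
For context, here is a minimal sketch of how the server side can turn these two variables into a Vertex AI access token, assuming the google-auth-library package; it mirrors the .env format above but is not necessarily the project's exact code:

import {JWT} from 'google-auth-library'

// Restore the real newlines that were collapsed into literal \n for the .env file
const privateKey = (process.env.GOOGLE_PRIVATE_KEY ?? '').replace(/\\n/g, '\n')

const client = new JWT({
  email: process.env.GOOGLE_CLIENT_EMAIL,
  key: privateKey,
  scopes: ['https://www.googleapis.com/auth/cloud-platform'],
})

// The resulting bearer token is what ultimately authenticates requests against Vertex AI
const {token} = await client.getAccessToken()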

🚀 Start Development Server

bun run dev

Visit http://localhost:3000 to see the application in action!

🔍 Caveats

This project is still in development. Known issues:

  • The camera devices only work on an HTTPS-deployed version.
  • Cannot switch to Text Response without reconnecting.
  • The quota and rate limits are exceeded insanely quickly; you might have to take a 10-minute break while using it.
  • The WebSocket deployed on Vercel does not seem to work. It does work when deployed to a custom VPS though.
  • To check your deployed WebSocket, you can use a tool like wscat by running:
wscat -c wss://gemini-ws.liviogama.com/gemini-ws
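
If you would rather script the check than install wscat, a small client using the ws package does the same thing; the URL below is the demo endpoint from the command above:

import WebSocket from 'ws'

const socket = new WebSocket('wss://gemini-ws.liviogama.com/gemini-ws')

socket.on('open', () => {
  console.log('WebSocket reachable')
  socket.close()
})

socket.on('error', err => {
  console.error('WebSocket failed:', err.message)
})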

🤝 Contributing

Feel free to contribute if you have some free time! 🍻

🌐 API and architecture considerations

This project DOES NOT USE wss://${HOST}/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key= but rather wss://${HOST}/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent.

Therefore, it cannot be used with a Gemini API key, but only with a Google Cloud service account (by design).
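
To make the distinction concrete, here is a simplified sketch of how a client would target each endpoint; the apiKey and accessToken values are placeholders, and the exact URL construction in this project may differ:

import WebSocket from 'ws'

// Placeholders: an API key would come from Google AI Studio, the access token
// from the service-account flow sketched in the setup section above
const apiKey = '<gemini-api-key>'
const accessToken = '<service-account-access-token>'

// Gemini API (Google AI) endpoint, authenticated with an API key (NOT what this project uses)
const geminiApiUrl =
  'wss://generativelanguage.googleapis.com/ws/' +
  `google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey}`

// Vertex AI endpoint used here, which only accepts Google Cloud credentials
const host = process.env.NEXT_PUBLIC_API_HOST // e.g. us-central1-aiplatform.googleapis.com
const vertexUrl = `wss://${host}/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent`

const socket = new WebSocket(vertexUrl, {
  headers: {Authorization: `Bearer ${accessToken}`},
})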


Here is a comparison breakdown from Gemini itself:

Analogy: Imagine a conversation:

  • GenerativeService.BidiGenerateContent: Like writing a full letter, sending it, and receiving a reply letter. There's no back and forth during the process.
  • LlmBidiService/BidiGenerateContent: Like having a live phone conversation where you can talk, interrupt each other, and change the direction of the conversation.

Why the Difference?

  • Different APIs, Different Focus:
    • GenerativeService (part of the Gemini API) is primarily focused on providing a simplified interface for general text generation tasks.
    • LlmBidiService (part of the Vertex AI Platform) is designed for more complex AI application development and provides more granular control, including support for real-time, conversational interaction.
  • Streaming Requirements: Real-time interaction requires streaming. LlmBidiService is built from the ground up to handle this, while GenerativeService is not.

The key differences in a table:

| Feature | google.ai.generativelanguage.v1alpha.GenerativeService | google.cloud.aiplatform.v1beta1.LlmBidiService |
| --- | --- | --- |
| Service | Generative Language API (Google AI) | Vertex AI (Google Cloud Platform) |
| Version | v1alpha (Alpha) | v1beta1 (Beta) |
| Purpose | General-purpose text generation | LLMs within the Google Cloud environment |
| Authentication | Likely API key | Likely OAuth 2.0 (Google Cloud credentials) |
| Likely Models | PaLM 2 or earlier generative language models provided directly by Google AI | Gemini, PaLM 2 (via Vertex AI), and other models available in Vertex AI |
| Infrastructure | Not directly tied to Google Cloud infrastructure | Tightly integrated with Google Cloud project management, model deployment, monitoring, etc. |
| Use Case | Quick prototyping and experimentation with generative AI | Production-grade applications, leveraging Google Cloud's features and infrastructure |