Vertex AI: Gemini Evaluations Playbook

Experiment, Evaluate & Analyze model performance for your use cases

✨ Overview

The Gemini Evaluations Playbook provides recipes to streamline the experimentation and evaluation of Generative AI models for your use cases using the Vertex Generative AI Evaluation service. This lets you track and align model performance with your objectives, and provides insights for optimizing the model under different conditions and configurations.

📏 Experimentation and evaluation workflow

Prompting strategies and best practices are essential for getting started with Gemini, but they're only the first step. To ensure your Generative AI solution with Gemini delivers repeatable and scalable performance, you need a systematic experimentation and evaluation process. This involves meticulous tracking of each experimental configuration, including prompt templates (system instructions, context, and few-shot learning examples), and model parameters like temperature and max output tokens.
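As a minimal sketch of what tracking each experimental configuration could look like, the record below bundles the prompt template pieces and model parameters and derives a stable identifier from them. The class name `ExperimentConfig`, its field names, and the `config_id` helper are assumptions made for illustration; they are not taken from the playbook's code.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class ExperimentConfig:
    """One tracked configuration (hypothetical field names)."""
    system_instruction: str
    prompt_template: str
    few_shot_examples: tuple = ()
    temperature: float = 0.0
    max_output_tokens: int = 1024

    def config_id(self) -> str:
        # Hash the full configuration so that identical setups always
        # map to the same identifier and any change yields a new one.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


baseline = ExperimentConfig(
    system_instruction="You are a concise summarizer.",
    prompt_template="Summarize the following text:\n{text}",
    temperature=0.2,
)
# Same prompt, different temperature: a distinct configuration.
variant = ExperimentConfig(
    system_instruction="You are a concise summarizer.",
    prompt_template="Summarize the following text:\n{text}",
    temperature=0.8,
)
```

Keying results on an identifier derived from the whole configuration makes it harder to accidentally compare runs that differ in an untracked parameter.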

Your evaluation should report granular metrics for each experiment, not just the aggregate results of the overall evaluation exercise.

By following this process, you'll not only maximize your GenAI solution's performance but also identify anti-patterns and system-level design improvements early on. This proactive approach is far more efficient than discovering issues after deployment.

Note

To add automation to your experimentation workflow, refer to the Vertex AI Prompt Optimizer.

📏 Architecture

The following diagram depicts the architecture of the Gemini Evaluations Playbook. The architecture leverages:

Vertex Generative AI Evaluation service for running evaluations.
Google BigQuery for logging prompts, experiments, and eval runs.

🧩 Key Features

The Gemini Evaluations Playbook (referred to as the Evals Playbook) provides the following key features:

✅ Define, track and compare experiments

Define and track a hierarchical structure of tasks, experiments, and evaluation runs to systematically organize and track your evaluation efforts.

✅ Log evaluation results with prompts and responses

Manage and log experiment configurations and results to BigQuery, enabling comprehensive analysis.

✅ Customize evaluation runs

Customize evaluations by configuring prompt templates, generation parameters, safety settings, and evaluation metrics to match your specific use case.

✅ Comprehensive metrics

Track a range of built-in and custom metrics to gain a holistic understanding of model performance.

✅ Iterative refinement

Analyze insights from evaluation to iteratively refine prompts, model configurations, and fine-tuning to achieve optimal outcomes.
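The hierarchy of tasks, experiments, and evaluation runs described in the features above can be sketched as nested records that flatten into table-friendly rows. All class names, field names, and metric names here are hypothetical and chosen for illustration; the playbook's actual BigQuery schema is defined in its SQL file and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRun:
    run_id: str
    metrics: dict  # metric name -> score for this run


@dataclass
class Experiment:
    experiment_id: str
    description: str
    runs: list = field(default_factory=list)


@dataclass
class Task:
    task_id: str
    experiments: list = field(default_factory=list)


def flatten_to_rows(task: Task) -> list:
    """Flatten the hierarchy into one dict per (run, metric) pair,
    a shape that is easy to load into a table for later analysis."""
    rows = []
    for experiment in task.experiments:
        for run in experiment.runs:
            for metric_name, score in run.metrics.items():
                rows.append({
                    "task_id": task.task_id,
                    "experiment_id": experiment.experiment_id,
                    "run_id": run.run_id,
                    "metric": metric_name,
                    "score": score,
                })
    return rows


task = Task("summarization", [
    Experiment("exp-baseline", "default prompt", [
        EvalRun("run-1", {"fluency": 4.0, "coherence": 3.5}),
    ]),
    Experiment("exp-short-prompt", "shorter system instruction", [
        EvalRun("run-1", {"fluency": 4.5, "coherence": 4.0}),
        EvalRun("run-2", {"fluency": 4.2, "coherence": 3.9}),
    ]),
])
rows = flatten_to_rows(task)
```

Keeping the task and experiment identifiers on every row means each logged result can be traced back to the configuration that produced it.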

🏁 Getting Started

STEP 1. Clone the repository

git clone https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && cd applied-ai-engineering-samples/genai-on-vertex-ai/gemini/evals_playbook

STEP 2. Prepare your environment

Start with the 0_gemini_evals_playbook_setup notebook to install the required libraries (using Poetry) and configure the necessary resources on Google Cloud. This includes setting up a BigQuery dataset and saving configuration parameters.
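To illustrate the idea of saving configuration parameters for reuse, here is a small sketch using Python's standard `configparser` module. The section names, keys, and values are assumptions for illustration only; the setup notebook decides what config.ini actually contains.

```python
import configparser
import os
import tempfile

# Hypothetical sections and keys; the real config.ini may use others.
config = configparser.ConfigParser()
config["gcp"] = {"project_id": "my-project", "location": "us-central1"}
config["bigquery"] = {"dataset_id": "my_evals_dataset"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.ini")

    # Save the parameters once, for example at the end of setup.
    with open(path, "w") as config_file:
        config.write(config_file)

    # Later notebooks can read the same file instead of redefining values.
    loaded = configparser.ConfigParser()
    loaded.read(path)
    project_id = loaded["gcp"]["project_id"]
    dataset_id = loaded["bigquery"]["dataset_id"]
```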

STEP 3. Experiment, evaluate, and analyze

Run the 1_gemini_evals_playbook_evaluate notebook to design experiments, assess model performance on your generative AI tasks, and analyze evaluation results including side-by-side comparison of results across different experiments and runs.
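A side-by-side comparison of two runs can be sketched as averaging per-example scores for each run and reporting the difference per metric. The function names, metric names, and scores below are made up for illustration and do not come from the notebook.

```python
from statistics import mean


def summarize(rows, metric_names):
    """Average each metric over the per-example rows of one run."""
    return {name: mean(row[name] for row in rows) for name in metric_names}


def side_by_side(summary_a, summary_b):
    """Pair up the metrics of two runs and compute b minus a."""
    return {
        name: {
            "run_a": summary_a[name],
            "run_b": summary_b[name],
            "delta": summary_b[name] - summary_a[name],
        }
        for name in summary_a
    }


# Hypothetical per-example scores for two evaluation runs.
run_a_rows = [{"rouge": 0.5, "fluency": 3.0}, {"rouge": 0.25, "fluency": 4.0}]
run_b_rows = [{"rouge": 0.75, "fluency": 4.0}, {"rouge": 0.5, "fluency": 5.0}]

metric_names = ["rouge", "fluency"]
comparison = side_by_side(
    summarize(run_a_rows, metric_names),
    summarize(run_b_rows, metric_names),
)
for name, values in comparison.items():
    print(name, values)
```

Reporting the per-metric delta next to the raw averages makes regressions on one metric visible even when another metric improves.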

STEP 4. Optimize with grid search

Run the 2_gemini_evals_playbook_gridsearch notebook to systematically explore different experiment configurations. It tests various prompt templates, model settings (such as temperature), or combinations of these using a grid-search style approach.
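A grid-search style sweep over prompt templates and temperatures can be sketched with `itertools.product`. The `stub_evaluate` function is a deterministic placeholder invented for this example; in practice it would be replaced by a call that generates responses and scores them with the evaluation service. Template names and values are likewise hypothetical.

```python
from itertools import product

prompt_templates = {
    "plain": "Summarize: {text}",
    "role": "You are an editor. Summarize: {text}",
}
temperatures = [0.0, 0.5, 1.0]


def stub_evaluate(template: str, temperature: float) -> float:
    """Placeholder scorer, NOT a real evaluation: it favors the 'editor'
    template and temperatures close to 0.5 so the example has a clear winner."""
    bonus = 0.1 if "editor" in template else 0.0
    return bonus - abs(temperature - 0.5)


# Evaluate every combination of template and temperature.
results = []
for (template_name, template), temperature in product(
    prompt_templates.items(), temperatures
):
    results.append({
        "template_name": template_name,
        "temperature": temperature,
        "score": stub_evaluate(template, temperature),
    })

best = max(results, key=lambda result: result["score"])
print(len(results), "configurations evaluated; best:", best)
```

The number of combinations grows multiplicatively with each added parameter, so it is worth checking quota before sweeping large grids.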

🧬 Repository Structure

.
├── bigquery_sqls
│   └── evals_bigquery.sql
├── docs
├── notebooks
│   ├── 0_gemini_evals_playbook_setup.ipynb
│   ├── 1_gemini_evals_playbook_evaluate.ipynb
│   └── 2_gemini_evals_playbook_gridsearch.ipynb
├── utils
│   ├── config.py
│   └── evals_playbook.py
├── config.ini
└── pyproject.toml

Navigating the repository structure

/bigquery_sqls/evals_bigquery.sql: SQL queries to create BigQuery datasets and tables
/notebooks: Notebooks demonstrating the usage of the Evals Playbook
/utils: Utility and helper functions for running the notebooks
/config.ini: Saves configuration parameters created in 0_gemini_evals_playbook_setup for reuse
/docs: Documentation explaining key concepts

📄 Documentation

🚧 Quotas and limits

Verify you have sufficient quota to run experiments and evaluations:

🪪 License

Distributed with the Apache-2.0 license.

Also contains code derived from the following third-party packages:

🙋 Getting Help

If you have questions or find problems with this repository, please report them through GitHub issues.
