Add Evaluation Support to Arcee Python SDK #84

Open

rivinduw wants to merge 4 commits into main

Conversation

@rivinduw commented Oct 7, 2024

This PR introduces support for evaluations in the Arcee Python SDK.
Added a start_evaluation function to arcee/api.py:

  • Allows users to initiate various types of evaluation jobs, including LLM-as-a-judge and lm-eval-harness benchmarks.
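
For orientation, a minimal sketch of what the client-side wrapper could look like is below. The route path, auth header name, and response shape are assumptions for illustration, not the SDK's actual internals.

import os
import requests

def start_evaluation(evaluations_name, eval_type, qa_set_name,
                     judge_model=None, deployment_model=None, reference_model=None):
    # Hypothetical sketch: POST the evaluation config to the platform API.
    # The "/v2/evaluations" route and the "X-Token" header are assumptions.
    url = f"{os.environ['ARCEE_API_URL']}/v2/evaluations"
    headers = {"X-Token": os.environ["ARCEE_API_KEY"]}
    payload = {
        "evaluations_name": evaluations_name,
        "eval_type": eval_type,
        "qa_set_name": qa_set_name,
        "judge_model": judge_model,
        "deployment_model": deployment_model,
        "reference_model": reference_model,
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()  # assumed to contain an "evaluations_id" field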

Usage Example for testing

import os

# Point the SDK at the dev environment (credentials redacted).
os.environ['ARCEE_API_URL'] = 'https://arcee-dev.dev.arcee.ai/api'
os.environ['ARCEE_ORG'] = 'rivinduorg'
os.environ['ARCEE_API_KEY'] = ''

openai_api_key = ''

import arcee

# LLM-as-a-judge evaluation: gpt-4o judges gpt-4o-mini (deployment model)
# against gpt-3.5-turbo-0125 (reference model) on the mmlu_20q QA set.
evaluation_params = {
    'evaluations_name': 'evals_test_oct7',
    'eval_type': 'llm_as_a_judge',
    'qa_set_name': 'mmlu_20q',
    'judge_model': {
        'model_name': 'gpt-4o',
        'custom_prompt': 'Evaluate which response better adheres to factual accuracy, clarity, and relevance.',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'deployment_model': {
        'model_name': 'gpt-4o-mini',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'reference_model': {
        'model_name': 'gpt-3.5-turbo-0125',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
}

result = arcee.start_evaluation(**evaluation_params)
eval_status = arcee.get_evaluation_status(result['evaluations_id'])
[screenshot: output of get_evaluation_status]
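
As a rough usage note (not part of this PR), polling until the job finishes might look like the following; the 'status' key and its terminal values are assumptions about the response shape.

import time

# Hypothetical polling loop; 'status', 'completed', and 'failed' are assumed names.
while True:
    eval_status = arcee.get_evaluation_status(result['evaluations_id'])
    if eval_status.get('status') in ('completed', 'failed'):
        break
    time.sleep(30)

print(eval_status)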

@rivinduw changed the title from "Evaluations api" to "Add Evaluation Support to Arcee Python SDK" on Oct 7, 2024
@Jacobsolawetz (Contributor) commented Oct 7, 2024

Noticed that an evaluation with different params but the same name resolves to the same ID; this should error.

@rivinduw (Author) commented Oct 7, 2024

Noticed that an evaluation with different params but the same name resolves to the same ID; this should error.

Yup, the params would currently get overwritten, so we don't end up with two evaluations that have the same name but different IDs.
Should we error here or in the platform? I think start pretraining might have the same behavior.
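
For concreteness, erroring on the platform side could look roughly like the sketch below, with an in-memory dict standing in for the real datastore; every name here is illustrative, not existing platform code.

# Illustrative only: reject a second start_evaluation call that reuses a name.
class DuplicateEvaluationError(Exception):
    pass

_evaluations: dict[str, dict] = {}  # stand-in for the evaluations table

def create_evaluation(params: dict) -> dict:
    name = params["evaluations_name"]
    if name in _evaluations:
        # Erroring here avoids silently overwriting the earlier params.
        raise DuplicateEvaluationError(f"Evaluation '{name}' already exists")
    _evaluations[name] = params
    return {"evaluations_id": name, **params}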

@rivinduw (Author) commented Oct 8, 2024

I have a local branch of platform that raises an error when evaluations have duplicate names, but I'm thinking we should be consistent across all the other services too.

Currently the corpus uploader (https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/corpus.py#L171) has the same logic of updating with the new params.

Pretraining (https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/pretraining.py#L65), deployment, etc. seem to either assume the existing params have not changed, or look up each field in supabase separately and throw an "X with this name does not exist" error.

Any thoughts on the best consistent way to deal with repeated start_x calls, @mryave @nason?
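
One way to keep repeated start_x calls consistent across evaluations, pretraining, corpus, deployment, etc. would be a single shared policy that each service opts into; the sketch below only illustrates that idea, with invented names rather than existing platform code.

from enum import Enum

class OnDuplicate(Enum):
    ERROR = "error"    # reject a repeated start_x call with the same name
    UPSERT = "upsert"  # current corpus/evaluations behavior: overwrite the params

def start_job(store: dict, name: str, params: dict,
              on_duplicate: OnDuplicate = OnDuplicate.ERROR) -> dict:
    # Shared guard every start_x service could call before inserting a record.
    if name in store and on_duplicate is OnDuplicate.ERROR:
        raise ValueError(f"A job named '{name}' already exists")
    store[name] = params  # insert, or overwrite under UPSERT
    return {"id": name, **params}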
