Add Evaluation Support to Arcee Python SDK #84

Open

rivinduw wants to merge 4 commits into main

Conversation

@rivinduw commented Oct 7, 2024

This PR introduces support for evaluations in the Arcee Python SDK.
Added a start_evaluation function to arcee/api.py:

  • Allows users to initiate various types of evaluation jobs, including LLM-as-a-judge and lm-eval-harness benchmarks.
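
For orientation, a minimal sketch of what the client-side wrapper could look like is below. The route path, auth header name, and response shape are assumptions for illustration, not the SDK's actual internals.

import os
import requests

def start_evaluation(evaluations_name, eval_type, qa_set_name,
                     judge_model=None, deployment_model=None, reference_model=None):
    # Hypothetical sketch: POST the evaluation config to the platform API.
    # The "/v2/evaluations" route and the "X-Token" header are assumptions.
    url = f"{os.environ['ARCEE_API_URL']}/v2/evaluations"
    headers = {"X-Token": os.environ["ARCEE_API_KEY"]}
    payload = {
        "evaluations_name": evaluations_name,
        "eval_type": eval_type,
        "qa_set_name": qa_set_name,
        "judge_model": judge_model,
        "deployment_model": deployment_model,
        "reference_model": reference_model,
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()  # assumed to contain an "evaluations_id" field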

Usage Example for testing

import os

# Point the SDK at the dev environment (credentials redacted).
os.environ['ARCEE_API_URL'] = 'https://arcee-dev.dev.arcee.ai/api'
os.environ['ARCEE_ORG'] = 'rivinduorg'
os.environ['ARCEE_API_KEY'] = ''

openai_api_key = ''

import arcee

# LLM-as-a-judge evaluation: gpt-4o judges gpt-4o-mini (deployment model)
# against gpt-3.5-turbo-0125 (reference model) on the mmlu_20q QA set.
evaluation_params = {
    'evaluations_name': 'evals_test_oct7',
    'eval_type': 'llm_as_a_judge',
    'qa_set_name': 'mmlu_20q',
    'judge_model': {
        'model_name': 'gpt-4o',
        'custom_prompt': 'Evaluate which response better adheres to factual accuracy, clarity, and relevance.',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'deployment_model': {
        'model_name': 'gpt-4o-mini',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'reference_model': {
        'model_name': 'gpt-3.5-turbo-0125',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
}

result = arcee.start_evaluation(**evaluation_params)
eval_status = arcee.get_evaluation_status(result['evaluations_id'])
[screenshot: output of get_evaluation_status]
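
As a rough usage note (not part of this PR), polling until the job finishes might look like the following; the 'status' key and its terminal values are assumptions about the response shape.

import time

# Hypothetical polling loop; 'status', 'completed', and 'failed' are assumed names.
while True:
    eval_status = arcee.get_evaluation_status(result['evaluations_id'])
    if eval_status.get('status') in ('completed', 'failed'):
        break
    time.sleep(30)

print(eval_status)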

@rivinduw changed the title from "Evaluations api" to "Add Evaluation Support to Arcee Python SDK" on Oct 7, 2024
@Jacobsolawetz (Contributor) commented Oct 7, 2024

Noticed that an evaluation with different params but the same name resolves to the same ID; this should error.

@rivinduw (Author) commented Oct 7, 2024

Noticed that an evaluation with different params but the same name resolves to the same ID; this should error.

Yup, the params would currently get overwritten, so we don't end up with two evaluations that have the same name but different IDs.
Should we error here or in the platform? I think start pretraining might have the same behavior.
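
For concreteness, erroring on the platform side could look roughly like the sketch below, with an in-memory dict standing in for the real datastore; every name here is illustrative, not existing platform code.

# Illustrative only: reject a second start_evaluation call that reuses a name.
class DuplicateEvaluationError(Exception):
    pass

_evaluations: dict[str, dict] = {}  # stand-in for the evaluations table

def create_evaluation(params: dict) -> dict:
    name = params["evaluations_name"]
    if name in _evaluations:
        # Erroring here avoids silently overwriting the earlier params.
        raise DuplicateEvaluationError(f"Evaluation '{name}' already exists")
    _evaluations[name] = params
    return {"evaluations_id": name, **params}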

@rivinduw (Author) commented Oct 8, 2024

I have a local branch of platform that raises an error when evaluations have duplicate names, but I'm thinking we should be consistent across all the other services too.

Currently the corpus uploader (https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/corpus.py#L171) has the same logic of updating with the new params.

Pretraining (https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/pretraining.py#L65), deployment, etc. seem to either assume the existing params have not changed, or look up each field in supabase separately and throw an "X with this name does not exist" error.

Any thoughts on the best consistent way to deal with repeated start_x calls, @mryave @nason?
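
One way to keep repeated start_x calls consistent across evaluations, pretraining, corpus, deployment, etc. would be a single shared policy that each service opts into; the sketch below only illustrates that idea, with invented names rather than existing platform code.

from enum import Enum

class OnDuplicate(Enum):
    ERROR = "error"    # reject a repeated start_x call with the same name
    UPSERT = "upsert"  # current corpus/evaluations behavior: overwrite the params

def start_job(store: dict, name: str, params: dict,
              on_duplicate: OnDuplicate = OnDuplicate.ERROR) -> dict:
    # Shared guard every start_x service could call before inserting a record.
    if name in store and on_duplicate is OnDuplicate.ERROR:
        raise ValueError(f"A job named '{name}' already exists")
    store[name] = params  # insert, or overwrite under UPSERT
    return {"id": name, **params}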
