evaluation
Here are 1,078 public repositories matching this topic...
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Updated May 17, 2024 - TypeScript
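Since this entry is about catching prompt regressions with CI/CD integration, here is a tool-agnostic sketch of what such a check can look like in plain Python. Every name below (CASES, get_completion, run_regression_suite) is hypothetical and is not taken from the repository above; get_completion() is a stand-in for whichever provider SDK you actually call.

```python
# Hypothetical sketch of a prompt-regression check run as a CI step.
import sys

CASES = [
    {"prompt": "Translate to French: Hello", "must_contain": "bonjour"},
    {"prompt": "What is 2 + 2?", "must_contain": "4"},
]

def get_completion(prompt: str) -> str:
    # Replace this stub with a real model call (OpenAI, Anthropic, a local model, ...).
    return f"stub response for: {prompt}"

def run_regression_suite() -> int:
    failures = []
    for case in CASES:
        output = get_completion(case["prompt"])
        if case["must_contain"].lower() not in output.lower():
            failures.append((case["prompt"], output))
    for prompt, output in failures:
        print(f"REGRESSION: {prompt!r} -> {output!r}", file=sys.stderr)
    return 1 if failures else 0  # a non-zero exit fails the CI job

if __name__ == "__main__":
    sys.exit(run_regression_suite())
```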
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Updated May 16, 2024 - TypeScript
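For the observability side, a minimal sketch of logging one request as a trace is shown below. It assumes the Langfuse Python SDK (the v2-era client) with credentials supplied via LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY; treat the exact method names as an assumption and check the project's current docs.

```python
from langfuse import Langfuse  # assumes `pip install langfuse`

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY (and optionally LANGFUSE_HOST) from the environment.
langfuse = Langfuse()

# Record one request as a trace with a nested generation (model call) and a score.
trace = langfuse.trace(name="qa-request", user_id="user-123")
generation = trace.generation(
    name="answer",
    model="gpt-4o",
    input=[{"role": "user", "content": "What does LLM observability mean?"}],
)
generation.end(output="Observability means capturing traces, costs, and scores for each call.")
trace.score(name="user-feedback", value=1)

langfuse.flush()  # make sure buffered events are sent before the process exits
```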
The production toolkit for LLMs. Observability, prompt management and evaluations.
Updated May 17, 2024 - TypeScript
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM app ✅ Get a full picture of the quality and performance of your LLM app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM app.
Updated May 16, 2024 - TypeScript
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Updated May 16, 2024 - Jupyter Notebook
LangSmith Client SDK Implementations
Updated May 17, 2024 - Python
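As a rough sketch of what using the Python client SDK can look like, the example below assumes `pip install langsmith`, an API key exported as LANGSMITH_API_KEY, and tracing enabled via the usual environment variables; the dataset and example contents are made up for illustration, and signatures should be verified against the SDK docs.

```python
from langsmith import Client, traceable

client = Client()  # picks up the API key from the environment

# Trace an application function so its runs show up in LangSmith.
@traceable(name="summarize")
def summarize(text: str) -> str:
    # Placeholder: call your model provider here.
    return text[:100]

summarize("LangSmith records inputs, outputs, and latency for this call.")

# Build a small evaluation dataset programmatically (names are illustrative).
dataset = client.create_dataset(dataset_name="summaries-smoke-test")
client.create_example(
    inputs={"text": "A long document ..."},
    outputs={"summary": "A short summary."},
    dataset_id=dataset.id,
)
```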
LLMOps with Prompt Flow is an LLMOps template and guidance to help you build LLM-infused apps using Prompt Flow. It offers a range of features including Centralized Code Hosting, Lifecycle Management, Variant and Hyperparameter Experimentation, A/B Deployment, and reporting for all runs and experiments.
Updated May 17, 2024 - Python
Official code for paper "TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks" (TMLR 2024)
Updated May 16, 2024 - Jupyter Notebook
Python SDK for running evaluations on LLM-generated responses
Updated May 16, 2024 - Python
Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
Updated May 16, 2024 - Python
Documentation for langsmith
Updated May 16, 2024 - MDX
Official implementation of the ACL 2024 paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs" (https://arxiv.org/abs/2402.11199).
Updated May 16, 2024 - Python
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the code-generation quality of LLMs.
Updated May 16, 2024 - Go
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ Hugging Face models, and 20+ benchmarks.
Updated May 16, 2024 - Python
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released LLM data processing library datatrove and the LLM training library nanotron.
Updated May 16, 2024 - Python
LangEvals aggregates various language model evaluators into a single platform, providing a standard interface for a multitude of scores and LLM guardrails so you can protect and benchmark your LLM pipelines.
Updated May 16, 2024 - Python
A version of eval for R that returns more information about what happened
Updated May 16, 2024 - R
Toolkit for evaluating and monitoring AI models in clinical settings
Updated May 16, 2024 - Python
The RAG Experiment Accelerator is a versatile tool designed to speed up and simplify experiments and evaluations using Azure Cognitive Search and the RAG pattern.
Updated May 16, 2024 - Python