evaluation
Here are 1,078 public repositories matching this topic...
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Updated May 17, 2024 - TypeScript
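Since this entry is about catching prompt regressions with CI/CD integration, here is a tool-agnostic sketch of what such a check can look like in plain Python. Every name below (CASES, get_completion, run_regression_suite) is hypothetical and is not taken from the repository above; get_completion() is a stand-in for whichever provider SDK you actually call.

```python
# Hypothetical sketch of a prompt-regression check run as a CI step.
import sys

CASES = [
    {"prompt": "Translate to French: Hello", "must_contain": "bonjour"},
    {"prompt": "What is 2 + 2?", "must_contain": "4"},
]

def get_completion(prompt: str) -> str:
    # Replace this stub with a real model call (OpenAI, Anthropic, a local model, ...).
    return f"stub response for: {prompt}"

def run_regression_suite() -> int:
    failures = []
    for case in CASES:
        output = get_completion(case["prompt"])
        if case["must_contain"].lower() not in output.lower():
            failures.append((case["prompt"], output))
    for prompt, output in failures:
        print(f"REGRESSION: {prompt!r} -> {output!r}", file=sys.stderr)
    return 1 if failures else 0  # a non-zero exit fails the CI job

if __name__ == "__main__":
    sys.exit(run_regression_suite())
```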
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Updated May 16, 2024 - TypeScript
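For the observability side, a minimal sketch of logging one request as a trace is shown below. It assumes the Langfuse Python SDK (the v2-era client) with credentials supplied via LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY; treat the exact method names as an assumption and check the project's current docs.

```python
from langfuse import Langfuse  # assumes `pip install langfuse`

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY (and optionally LANGFUSE_HOST) from the environment.
langfuse = Langfuse()

# Record one request as a trace with a nested generation (model call) and a score.
trace = langfuse.trace(name="qa-request", user_id="user-123")
generation = trace.generation(
    name="answer",
    model="gpt-4o",
    input=[{"role": "user", "content": "What does LLM observability mean?"}],
)
generation.end(output="Observability means capturing traces, costs, and scores for each call.")
trace.score(name="user-feedback", value=1)

langfuse.flush()  # make sure buffered events are sent before the process exits
```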
The production toolkit for LLMs. Observability, prompt management and evaluations.
Updated May 17, 2024 - TypeScript
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM app ✅ Get a full picture of the quality and performance of your LLM app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM app.
Updated May 16, 2024 - TypeScript
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Updated May 16, 2024 - Jupyter Notebook
LangSmith Client SDK Implementations
Updated May 17, 2024 - Python
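As a rough sketch of what using the Python client SDK can look like, the example below assumes `pip install langsmith`, an API key exported as LANGSMITH_API_KEY, and tracing enabled via the usual environment variables; the dataset and example contents are made up for illustration, and signatures should be verified against the SDK docs.

```python
from langsmith import Client, traceable

client = Client()  # picks up the API key from the environment

# Trace an application function so its runs show up in LangSmith.
@traceable(name="summarize")
def summarize(text: str) -> str:
    # Placeholder: call your model provider here.
    return text[:100]

summarize("LangSmith records inputs, outputs, and latency for this call.")

# Build a small evaluation dataset programmatically (names are illustrative).
dataset = client.create_dataset(dataset_name="summaries-smoke-test")
client.create_example(
    inputs={"text": "A long document ..."},
    outputs={"summary": "A short summary."},
    dataset_id=dataset.id,
)
```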
LLMOps with Prompt Flow is an LLMOps template and guidance to help you build LLM-infused apps using Prompt Flow. It offers a range of features including Centralized Code Hosting, Lifecycle Management, Variant and Hyperparameter Experimentation, A/B Deployment, and reporting for all runs and experiments.
Updated May 17, 2024 - Python
Official code for paper "TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks" (TMLR 2024)
Updated May 16, 2024 - Jupyter Notebook
Python SDK for running evaluations on LLM-generated responses
Updated May 16, 2024 - Python
Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
Updated May 16, 2024 - Python
Documentation for langsmith
Updated May 16, 2024 - MDX
Official implementation of the ACL 2024 paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs" (https://arxiv.org/abs/2402.11199).
Updated May 16, 2024 - Python
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the code-generation quality of LLMs.
Updated May 16, 2024 - Go
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ Hugging Face models, and 20+ benchmarks.
Updated May 16, 2024 - Python
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released LLM data processing library datatrove and the LLM training library nanotron.
Updated May 16, 2024 - Python
LangEvals aggregates various language model evaluators into a single platform, providing a standard interface for a multitude of scores and LLM guardrails so you can protect and benchmark your LLM pipelines.
Updated May 16, 2024 - Python
A version of eval for R that returns more information about what happened
Updated May 16, 2024 - R
Toolkit for evaluating and monitoring AI models in clinical settings
Updated May 16, 2024 - Python
The RAG Experiment Accelerator is a versatile tool designed to speed up and simplify experiments and evaluations using Azure Cognitive Search and the RAG pattern.
Updated May 16, 2024 - Python