A collection of LLM eval examples using the CircleCI Evals Orb.
Before running any of the examples, you'll need:
- A CircleCI account connected to your code. You can sign up for free.
- An OpenAI account. Sign up for an OpenAI account at openai.com to access their platform and API. Once logged into your OpenAI account, generate your API key. Make note of the `API Key` and `Organization ID`.
Depending on your choice of evaluation provider, you will also need one of the following:
- A Braintrust account. Sign up for a Braintrust account at braintrustdata.com to access their platform and API. Once logged into your Braintrust account, generate an `API Key` and make note of it.
- A LangSmith account. Sign up for a LangSmith account at langsmith.com to access their platform and API. Once logged into your LangSmith account, go to the API Keys page in your account settings to generate an API key. Copy this key to authenticate when using the LangSmith API.
These API keys allow you to authenticate against and interact with the APIs of your LLMOps tools. See each provider's documentation for more details on capabilities and usage.
Fork this repo to run evaluations on an LLM-based application using the evals orb.
This repository includes evaluations that can be run on two evaluation platforms: Braintrust and LangSmith. Each example folder contains instructions and sample code to run evaluations.
To get started:
- Enter your credentials into CircleCI; they are stored as environment variables in a new context.
- Update the CircleCI configuration file with your newly-created context.
- Select the evaluation platform on which you want to run evaluations.
Entering your OpenAI, Braintrust, and LangSmith credentials into CircleCI is easy: navigate to **Project Settings > LLMOps** and fill out the form by clicking **Set up Integration**.
This will create a context with environment variables for the credentials you've set up above. The context will be given a name (e.g. `ai-llm-eval-examples`), which you'll use in the next step to update the `context` value in the CircleCI configuration file.
💡 You can also optionally store a `GITHUB_TOKEN` as an environment variable on this context, if you'd like your pipelines to post summarized eval job results as comments on GitHub pull requests.
Once your credentials have been entered, make sure you update the `context` parameter of the eval workflows in the `.circleci/run_evals_config.yml` file with the name of the context you just created in Step 1. This ensures that your credentials are picked up by the evaluation jobs in the following steps.
```yaml
# WORKFLOWS
workflows:
  braintrust-evals:
    when: << pipeline.parameters.run-braintrust-evals >>
    jobs:
      - run-braintrust-evals:
          context:
            - ai-llm-evals-orb-examples # Replace this with your context name
  langsmith-evals:
    when: << pipeline.parameters.run-langsmith-evals >>
    jobs:
      - run-langsmith-evals:
          context:
            - ai-llm-evals-orb-examples # Replace this with your context name
```
The Braintrust example imports an evaluation dataset of news articles from HuggingFace and uses ChatGPT to classify each article into a category. The dataset contains both the news articles and the expected category for each of them. As an evaluation metric, we use the Levenshtein distance, which measures how far the answer provided by ChatGPT is from the expected answer. Each individual test case is scored, and a summary score for the whole dataset is also available.
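For reference, a Braintrust eval along these lines might look roughly like the sketch below. This is a simplified approximation, not the exact contents of `braintrust/eval_tutorial.py`; the HuggingFace dataset (`ag_news`), model, prompt, and project name are illustrative placeholders.

```python
# Simplified sketch of a Braintrust eval with a Levenshtein scorer (illustrative,
# not the repository's exact script). Requires BRAINTRUST_API_KEY and OPENAI_API_KEY.
from braintrust import Eval
from autoevals import Levenshtein
from datasets import load_dataset  # HuggingFace datasets
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CATEGORIES = ["World", "Sports", "Business", "Sci/Tech"]

def load_examples(n: int = 20):
    # "ag_news" is one HuggingFace news-classification dataset; the repo may use another.
    rows = load_dataset("ag_news", split="test").select(range(n))
    return [{"input": row["text"], "expected": CATEGORIES[row["label"]]} for row in rows]

def classify(article: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Classify this news article into one of {CATEGORIES}. "
                       f"Reply with the category only.\n\n{article}",
        }],
    )
    return resp.choices[0].message.content.strip()

# Each test case is scored with the Levenshtein distance between the model's answer
# and the expected category; Braintrust also aggregates a summary score.
Eval(
    "news-classification",  # Braintrust project name (illustrative)
    data=load_examples,
    task=classify,
    scores=[Levenshtein],
)
```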
In the LangSmith example, we create the dataset ourselves. Before triggering your evaluation via CircleCI, run the following commands:
```bash
cd ./experiments/ai-langsmith
pip install -r ./requirements.txt
python dataset.py
```
The dataset contains a list of topics which we want ChatGPT to write poems about. For each topic, it also contains a letter or word which should not appear in the poem. In our evaluation, we use the LangSmith `ConstraintEvaluator` to verify whether our LLM has avoided using that letter or word. On the LangSmith platform, you can then view the scores for each individual test case.
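For reference, creating a dataset along these lines might look roughly like the sketch below. This is an illustrative approximation rather than the repository's actual `dataset.py`; the dataset name, topics, and field names are placeholders.

```python
# Rough sketch of a LangSmith dataset-creation script (illustrative, not the
# repository's exact dataset.py). Requires a LangSmith API key in the environment.
from langsmith import Client

client = Client()

# Topics to write poems about, each paired with a letter or word the poem must avoid.
CASES = [
    {"topic": "the ocean", "forbidden": "e"},
    {"topic": "autumn", "forbidden": "leaf"},
    {"topic": "a city at night", "forbidden": "o"},
]

dataset = client.create_dataset(
    dataset_name="poem-constraints",  # illustrative name
    description="Topics paired with a letter/word the poem must not contain",
)

for case in CASES:
    client.create_example(
        inputs={"topic": case["topic"], "forbidden": case["forbidden"]},
        outputs={"forbidden": case["forbidden"]},  # reference used by the evaluator
        dataset_id=dataset.id,
    )
```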
Whichever evaluation platform you choose, when evals are run through the evals orb, CircleCI stores a summary of the eval results as a job artifact.
If a `GITHUB_TOKEN` has been set up, the orb will also post summarized eval results as a PR comment.
The `.circleci/run_evals_config.yml` file uses the evals orb to define jobs that run the evaluation code in each example folder. The orb handles setting up the evaluation environment, executing the evaluations, and collecting the results.
For example, the Braintrust job runs the Python script in `braintrust/eval_tutorial.py` by passing it as the `cmd` parameter. It saves the evaluation results to the location specified with `evals_result_location`.
Similarly, the LangSmith job runs the Python script in `langsmith/eval.py`.
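For reference, a LangSmith evaluation along these lines might look roughly like the sketch below. This is an illustrative approximation rather than the repository's actual `eval.py`: a hand-rolled constraint check stands in for the `ConstraintEvaluator` mentioned above, and the dataset and field names follow the earlier illustrative sketch.

```python
# Illustrative sketch of a LangSmith evaluation with a simple constraint check
# (not the repository's exact eval.py). Requires LangSmith and OpenAI API keys.
from langsmith.evaluation import evaluate
from openai import OpenAI

openai_client = OpenAI()

def write_poem(inputs: dict) -> dict:
    resp = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short poem about {inputs['topic']} "
                       f"without using '{inputs['forbidden']}'.",
        }],
    )
    return {"poem": resp.choices[0].message.content}

def constraint_check(run, example) -> dict:
    # Score 1 if the forbidden letter/word does not appear in the poem, else 0.
    poem = run.outputs["poem"].lower()
    forbidden = example.outputs["forbidden"].lower()
    return {"key": "constraint", "score": int(forbidden not in poem)}

# Runs the target against every example in the dataset and records per-case scores,
# which can then be browsed in the LangSmith UI.
evaluate(
    write_poem,
    data="poem-constraints",  # dataset name from the earlier sketch
    evaluators=[constraint_check],
)
```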
To change where the results of the evaluation are saved, go to the `evals/eval` step and add the parameter `evals_result_location`. Note: the evals orb will create the directory if it does not exist.
```yaml
- evals/eval:
    circle_pipeline_id: << pipeline.id >>
    eval_platform: ...
    evals_result_location: "./my-results-here"
    cmd: ...
```
To enable the evals orb to post eval job summaries on GitHub pull requests:
- Generate a GitHub Personal Access Token with `repo` scope.
- Add this token as the environment variable `GITHUB_TOKEN` in CircleCI project settings. Alternatively, you could include this secret in the context that was created when you entered your credentials as part of the LLMOps integration.
The examples included in this repository use dynamic configuration to selectively run only the evaluations defined in the folder that changed. So, for changes committed to the `braintrust` folder, only your Braintrust evaluations will be run; for changes committed to the `langsmith` folder, only your LangSmith evaluations will be run.
The repository is structured as follows:

```
.
├── README.md
├── braintrust
│   ├── eval_tutorial.py
│   ├── README.md
│   └── requirements.txt
└── langsmith
    ├── dataset.py
    ├── eval.py
    ├── README.md
    └── requirements.txt
```
The CircleCI Evals Orb simplifies the definition and execution of evaluation jobs using popular third-party tools, and generates reports of evaluation results.
Given the volatile nature of evaluations, evals orchestrated by the evals orb do not halt the pipeline if an evaluation fails. This ensures that the inherent flakiness of evaluations does not disrupt the development cycle.
Instead, a summary of the evaluation results can optionally be presented:
- as an artifact within the CircleCI pipeline job
- as a comment on the corresponding GitHub pull request (requires a GitHub Personal Access Token)
The evals orb accepts the following parameters (some are optional, depending on the eval platform being used):
- `circle_pipeline_id`: CircleCI Pipeline ID
- `cmd`: Command to run the evaluation
- `eval_platform`: Evaluation platform (e.g. `braintrust`, `langsmith`, etc.; default: `braintrust`)
- `evals_result_location`: Location to save evaluation results (default: `./results`)
- `braintrust_experiment_name` (optional): Braintrust experiment name
  - If no value is provided, an experiment name will be auto-generated based on an MD5 hash of `<CIRCLE_PIPELINE_ID>_<CIRCLE_WORKFLOW_ID>`.
- `langsmith_endpoint` (optional): LangSmith API endpoint (default: `https://api.smith.langchain.com`)
- `langsmith_experiment_name` (optional): LangSmith experiment name
  - If no value is provided, an experiment name will be auto-generated based on an MD5 hash of `<CIRCLE_PIPELINE_ID>_<CIRCLE_WORKFLOW_ID>`.
Let us know if you have any feedback after trying these out.
Submit an issue on GitHub, or reach out to us at ai-feedback@circleci.com.