
LLM Eval Examples

A collection of LLM eval examples using the CircleCI Evals Orb 1.x.x.

Prerequisites

Before running any of the examples, you'll need:

  • A CircleCI account connected to your code. You can sign up for free.
  • An OpenAI account. Sign up for an OpenAI account at openai.com to access their platform and API. Once logged into your OpenAI account, generate your API key. Make note of the API Key and Organization ID.

Depending on your choice of evaluation provider, you will also need one of the following:

  • A Braintrust account. Sign up for a Braintrust account at braintrustdata.com to access their platform and API. Once logged into your Braintrust account, generate an API Key and make note of it.
  • A LangSmith account. Sign up for a LangSmith account at langsmith.com to use their evaluation platform. Once logged into your LangSmith account, go to the API Keys page in your account settings to generate an API key. Copy this key so you can authenticate when using the LangSmith API.

These API keys allow you to authenticate with and call the APIs of your LLMOps tools.

See their documentation for more details on capabilities and usage.

Getting started

Fork this repo to run evaluations on an LLM-based application using the CircleCI Evals Orb 1.x.x.

This repository includes evaluations that can be run on two evaluation platforms: Braintrust and LangSmith. Each example folder contains instructions and sample code to run evaluations.

Here's the process...

  1. Enter your credentials into CircleCI, which get stored as environment variables on a new context.
  2. Update the CircleCI configuration file with your newly-created context.
  3. Select an evaluation platform where you want to run evaluations.

Step 1. Enter credentials into CircleCI

Entering your OpenAI, Braintrust, and LangSmith credentials into CircleCI is straightforward.

Navigate to Project Settings > LLMOps, click Set up Integration, and fill out the form.

(Screenshot: Create Context)

This will create a context with environment variables for the credentials you've set up above.

⚠️ Please take note of the generated context name (e.g. ai-llm-eval-examples). It will be used in the next step to update the context value in the CircleCI configuration file.

(Screenshot: LLMOps Integration Context)

💡 You can also optionally store a GITHUB_TOKEN as an environment variable on this context, if you'd like your pipelines to post summarized eval job results as comments on GitHub pull requests (only available for projects integrated through GitHub OAuth).

Step 2. Update CircleCI config with your newly-created context

Once your credentials have been entered, update the context values in the .circleci/run_evals_config.yml file with the name of the context you just created in Step 1.

This will ensure that your credentials get used properly by the evaluation scripts in the following steps.

# WORKFLOWS
workflows:
  braintrust-evals:
    when: << pipeline.parameters.run-braintrust-evals >>
    jobs:
      - run-braintrust-evals:
          context:
            - ai-llm-evals-orb-examples # Replace this with your context name
            - slack-notification-access-token # Replace this with your context name where SLACK_ACCESS_TOKEN is stored
  langsmith-evals:
    when: << pipeline.parameters.run-langsmith-evals >>
    jobs:
      - run-langsmith-evals:
          context:
            - ai-llm-evals-orb-examples # Replace this with your context name
            - slack-notification-access-token # Replace this with your context name where SLACK_ACCESS_TOKEN is stored
  custom-command-evals:
    when: << pipeline.parameters.run-custom-evals >>
    jobs:
      - run-custom-evals:
          context:
            - ai-llm-eval-examples # Replace this with your context name
            - slack-notification-access-token # Replace this with your context name where SLACK_ACCESS_TOKEN is stored

Step 3. Select an evaluation platform

The evals orb supports running evaluations on two evaluation platforms: Braintrust and LangSmith. The orb also supports running a custom evaluation command without specifying an eval_platform; in that case, the evals_result_location parameter is optional.
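
For example, a custom-command step might look like the following sketch; the pytest invocation is illustrative only and is not the repository's actual custom-command job:

- evals/eval:
    circle_pipeline_id: << pipeline.id >>
    # no eval_platform: the orb simply runs the command below
    # evals_result_location may be omitted for custom commands
    cmd: cd custom-command && pip install -r requirements.txt && pytest tests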

Braintrust

The Braintrust example imports an evaluation dataset of news articles from HuggingFace and uses ChatGPT to classify each article into a category. The dataset contains both the article text and the expected category for each article. As an evaluation metric, we use the Levenshtein distance, which measures how far the answer provided by ChatGPT is from the expected answer. Each individual test case is scored, and a summary score for the whole dataset is also available.
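
As described under "Configure evals output location" below, the corresponding CircleCI job passes the script to the orb via the cmd parameter. A rough sketch of that step (see .circleci/run_evals_config.yml for the actual definition):

- evals/eval:
    circle_pipeline_id: << pipeline.id >>
    eval_platform: braintrust
    cmd: python ./braintrust/eval_tutorial.py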


LangSmith

In the LangSmith example, we create the dataset ourselves. Before triggering your evaluation via CircleCI, run the following commands:

cd ./langsmith
pip install -r ./requirements.txt
python dataset.py

The dataset contains a list of topics we want ChatGPT to write poems about. For each topic, it also contains a letter or word that should not appear in the poem. In our evaluation, we use the LangSmith ConstraintEvaluator to verify that our LLM avoided the forbidden letter or word. On the LangSmith platform, you can view the scores for each test case.
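
Once the dataset exists in LangSmith, the eval itself is triggered through the orb just like the Braintrust example, with langsmith/eval.py as the command. A rough sketch (the actual step lives in .circleci/run_evals_config.yml):

- evals/eval:
    circle_pipeline_id: << pipeline.id >>
    eval_platform: langsmith
    cmd: python ./langsmith/eval.py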

The Results

Whichever evaluation platform you choose, when evals are run through the CircleCI Evals Orb 1.x.x, CircleCI stores a summary of the eval results as a job artifact.

(Screenshot: eval results summary stored as a job artifact)

If a GITHUB_TOKEN has been set up, the orb will also post summarized eval results as a PR comment:

(Screenshot: summarized eval results posted as a PR comment)

Configure evals output location

The .circleci/run_evals_config.yml file uses the CircleCI Evals Orb 1.x.x to define jobs that run the evaluation code in each example folder. The orb handles setting up the evaluation environment, executing the evaluations, and collecting the results.

For example, the Braintrust job runs the Python script in braintrust/eval_tutorial.py by passing it as the cmd parameter. It saves the evaluation results to the location specified with evals_result_location.

Similarly, the LangSmith job runs the Python script in langsmith/eval.py.

To change where the results of the evaluation are being saved, go to the evals/eval step, and add the parameter evals_result_location:

Note: the CircleCI Evals Orb 1.x.x will make the directory if it does not exist.

- evals/eval:
    circle_pipeline_id: << pipeline.id >>
    eval_platform: ...
    evals_result_location: "./my-results-here"
    cmd: ...

Configure the CircleCI Evals Orb 1.x.x to post eval job summaries on GitHub pull requests

Warning

Currently, this feature is available only to GitHub projects integrated through OAuth. To find out which GitHub account type you have, refer to the GitHub OAuth integration page of our Docs. To enable PR comments, store a GITHUB_TOKEN environment variable on the context used by your eval jobs (see Step 1).

A note about dynamic configuration

The examples included in this repository use dynamic configuration to selectively run only the evaluations defined in the folder that changed. So, for changes committed to the folder braintrust, only your Braintrust evaluations will be run; for changes committed to the folder langsmith, only your LangSmith evaluations will be run.

.
├── README.md
├── braintrust
│   ├── README.md
│   ├── eval_tutorial.py
│   └── requirements.txt
├── custom-command
│   ├── conftest.py
│   ├── requirements.txt
│   └── tests
└── langsmith
    ├── README.md
    ├── dataset.py
    ├── eval.py
    └── requirements.txt
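
The setup workflow that performs this selection is typically wired with CircleCI's path-filtering orb. The following is a minimal sketch, not the repository's actual setup config; the mapping keys correspond to the run-*-evals pipeline parameters used in .circleci/run_evals_config.yml:

version: 2.1
setup: true

orbs:
  path-filtering: circleci/path-filtering@1.0.0  # pin to a current released version

workflows:
  generate-config:
    jobs:
      - path-filtering/filter:
          base-revision: main
          # continue the pipeline with the evals config shown above
          config-path: .circleci/run_evals_config.yml
          # set a pipeline parameter to true when files in the matching folder change
          mapping: |
            braintrust/.* run-braintrust-evals true
            langsmith/.* run-langsmith-evals true
            custom-command/.* run-custom-evals true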

The CircleCI Evals Orb

The CircleCI Evals Orb 1.x.x simplifies the definition and execution of evaluation jobs using popular third-party tools, and generates reports of evaluation results.

Given the volatile nature of evaluations, evaluations orchestrated by the CircleCI Evals Orb 1.x.x do not halt the pipeline if an evaluation fails. This approach ensures that the inherent flakiness of evaluations does not disrupt the development cycle.

Instead, a summary of the evaluation results can optionally be presented as a job artifact and, if a GITHUB_TOKEN is configured, as a comment on the GitHub pull request.

Orb Parameters

The CircleCI Evals Orb 1.x.x accepts the following parameters; some are optional depending on the eval platform being used.

Common parameters

  • circle_pipeline_id: CircleCI Pipeline ID

  • cmd: Command to run the evaluation

  • eval_platform: Evaluation platform (e.g. braintrust or langsmith; default: braintrust)

  • evals_result_location: Location to save evaluation results (default: ./results)

Braintrust-specific parameters

  • braintrust_experiment_name (optional): Braintrust experiment name
    • If no value is provided, an experiment name will be auto-generated based on an MD5 hash of <CIRCLE_PIPELINE_ID>_<CIRCLE_WORKFLOW_ID>.

LangSmith-specific parameters

  • langsmith_endpoint (optional): LangSmith API endpoint (default: https://api.smith.langchain.com)

  • langsmith_experiment_name (optional): LangSmith experiment name

    • If no value is provided, an experiment name will be auto-generated based on an MD5 hash of <CIRCLE_PIPELINE_ID>_<CIRCLE_WORKFLOW_ID>.
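
Putting these together, a step that pins an explicit Braintrust experiment name and results location might look like this sketch (the experiment name is illustrative):

- evals/eval:
    circle_pipeline_id: << pipeline.id >>
    eval_platform: braintrust
    evals_result_location: "./results"                 # default
    braintrust_experiment_name: news-classification-ci # auto-generated from an MD5 hash if omitted
    cmd: python ./braintrust/eval_tutorial.py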

Happy Evaluating!

Let us know if you have any feedback trying these out.

Submit an issue on GitHub, or reach out to us at ai-feedback@circleci.com.
