Running Benchmarks

  1. Before you can start running benchmarks, provide the following information. It will be included in the report generated at the end of the run.

    • Name (Required): A unique name to identify this benchmark run by.
      Example: GPT4 vs Claude on safety benchmarks
    • Description: Describes the purpose and scope of this benchmark run.
      Example: Comparing GPT4 and Claude to determine which model is safer as a chatbot
    • Run a smaller set: The number of prompts per dataset (as specified in the recipe) to run. Entering 0 runs the full set. Before running the full recommended set, you may want to run a smaller number of prompts from each recipe as a sanity check.
      Example: 5
  2. Click ‘Run’ to start running the benchmarks.

    Benchmark Initialisation

  3. You can click on ‘See Details’ to review what is currently being run.

    Progress of Benchmark

  4. A report will be generated once the run is completed. Meanwhile, you can:

    • Start Red Teaming to discover new vulnerabilities
    • Create a custom cookbook by curating your own set of recipes
    • Go back to home
  5. To view the progress of the run, click on the bell icon, then select the specific benchmark run.

    Progress of Multiple Runs

  6. Once the run is completed, click on ‘View Report’.

    View Report

  7. One report will be generated for each endpoint tested. Use the dropdown to switch between the reports. You can also download the HTML report and the detailed results as a JSON file (see the sketch after this list for one way to explore the JSON).

    Download Benchmark Report

  8. You can also view the details of previous runs in one of two ways:

    1. Click on the ‘history’ icon, then ‘View Past Runs’
    2. Click on the ‘benchmarking’ icon, then ‘View Past Runs’
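If you want to explore the detailed JSON results from step 7 programmatically, the minimal Python sketch below prints the key layout of the downloaded file so you can see which fields your export actually contains before writing any analysis. The filename is a placeholder, and the script makes no assumptions about the export's schema; it simply walks whatever JSON file you point it at.

    import json
    from pathlib import Path

    # Placeholder filename -- substitute the path of the JSON file you
    # downloaded from the report page in step 7.
    results_path = Path("benchmark-results.json")

    with results_path.open(encoding="utf-8") as f:
        results = json.load(f)

    def summarise(node, indent=0, max_depth=3):
        """Print the key layout of a parsed JSON value down to max_depth levels."""
        prefix = "  " * indent
        if indent >= max_depth:
            print(prefix + "...")
            return
        if isinstance(node, dict):
            for key, value in node.items():
                print(f"{prefix}{key}: {type(value).__name__}")
                summarise(value, indent + 1, max_depth)
        elif isinstance(node, list) and node:
            print(f"{prefix}[{len(node)} items]; first item:")
            summarise(node[0], indent + 1, max_depth)

    summarise(results)

Once you know the layout, you can filter or aggregate the per-prompt results with ordinary dict and list operations, or load them into a dataframe to compare results across endpoints.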