Running Benchmarks
-
Before you can start running benchmarks, provide the following information. These details will be included in the report generated at the end of the run.
| Name | Description | Example |
| --- | --- | --- |
| Name (Required) | A unique name for you to identify this benchmark run by | GPT4 vs Claude on safety benchmarks |
| Description | Describe the purpose and scope of this benchmark run | Comparing GPT4 and Claude to determine which model is safer as a chatbot |
| Run a smaller set | The number of prompts per dataset, as specified in the recipe, to be run. Indicating 0 will run the full set. | |
* Before running the full recommended set, you may want to run a smaller number of prompts from each recipe to do a sanity check.
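To make the ‘Run a smaller set’ behaviour concrete, here is a minimal illustrative sketch of how the prompt count is interpreted. It is not the tool's actual implementation: the function name and arguments are hypothetical, and only the 0-means-full-set rule comes from the table above.

```python
# Illustrative sketch only: a hypothetical helper that mirrors the rule above
# (0 = run the full set, N > 0 = run only N prompts per dataset).
def prompts_to_run(dataset_prompts: list[str], smaller_set: int) -> list[str]:
    if smaller_set == 0:
        return dataset_prompts            # 0 means run every prompt in the dataset
    return dataset_prompts[:smaller_set]  # otherwise cap the prompts per dataset

# Example: a quick sanity check with 5 prompts per dataset before the full run.
subset = prompts_to_run([f"prompt-{i}" for i in range(100)], smaller_set=5)
print(len(subset))  # -> 5
```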
Click ‘Run’ to start running the benchmarks.
You can click on ‘See Details’ to review what is currently being run.
A report will be generated once the run is completed. Meanwhile, you can:
- Start Red Teaming to discover new vulnerabilities
- Create a custom cookbook by curating your own set of recipes
- Go back to home
To view the progress of the run, click on the bell icon, then select the specific benchmark run.
Once the run is completed, you can click on ‘View Report’.
One report will be generated for each endpoint tested. Use the dropdown to switch between the reports displayed. You can also download the HTML report and the detailed results as a JSON file.
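If you plan to post-process the downloaded JSON results, a short script can help you inspect them before building anything on top. The sketch below is illustrative only: the file name is a placeholder and no particular schema is assumed, so adjust it to match the file you actually download.

```python
import json

# Illustrative sketch: "benchmark-results.json" is a placeholder for the
# detailed-results file you download; no specific schema is assumed here.
with open("benchmark-results.json", encoding="utf-8") as f:
    results = json.load(f)

# Print the nested keys (to a small depth) so you can see how the results
# are organised before writing any further processing.
def summarise(node, depth=0, max_depth=2):
    if depth > max_depth:
        return
    if isinstance(node, dict):
        for key, value in node.items():
            print("    " * depth + f"{key}: {type(value).__name__}")
            summarise(value, depth + 1, max_depth)
    elif isinstance(node, list) and node:
        summarise(node[0], depth + 1, max_depth)

summarise(results)
```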
You can also view the details of previous runs in either of two ways:
- Click on ‘history’ icon, then ‘View Past Runs’
- Click on ‘benchmarking’ icon, then ‘View Past Runs’