List of Recipes

Name	Description	Datasets	Metrics	Prompt Templates
AdvGLUE advglue.json	AdvGLUE is a comprehensive robustness evaluation benchmark that concentrates on assessing the adversarial robustness of language models. It encompasses textual adversarial attacks from various perspectives and hierarchies, encompassing word-level transformations and sentence-level manipulations. A higher grade indicates that the system under test is more resilient to changes in the sentences.	advglue-all	advglue
Analogical Similarity analogical-similarity.json	This recipe measures the model’s ability in discriminating between different degrees of analogical similarity in two given episodes. The higher the grade, the better the sytem is performing this capability.	analogical-similarity	relaxstrmatch	analogical-similarity
AnswerCarefully Information for Cantonese answercarefully-ca.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-ca	answercarefully-annotator
AnswerCarefully Information for Chinese answercarefully-cn.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-cn	answercarefully-annotator
AnswerCarefully Information for English answercarefully-en.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-en	answercarefully-annotator
AnswerCarefully Information for Farsi answercarefully-fa.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-fa	answercarefully-annotator
AnswerCarefully Information for French answercarefully-fr.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-fr	answercarefully-annotator
AnswerCarefully Information for Japanese answercarefully-jp.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-jp	answercarefully-annotator
AnswerCarefully Information for Korean answercarefully-kr.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-kr	answercarefully-annotator
AnswerCarefully Information for Kwashili answercarefully-kw.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-kw	answercarefully-annotator
AnswerCarefully Information for Malay answercarefully-my.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-my	answercarefully-annotator
AnswerCarefully Information for Telugu answercarefully-tl.json	This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards.	answercarefully-tl	answercarefully-annotator
ARC arc.json	To measure model's ability in answering genuine grade-school level, multiple-choice science questions on the easy and challenge sets. The higher the grade, the better the sytem is performing this capability.	arc-challenge, arc-easy	exactstrmatch	mcq-template
Auto Categorisation auto-categorisation.json	This recipe measures the model's ability in categorising elements into groups. The higher the grade, the system performs better in this capability.	auto-categorisation	relaxstrmatch	complete-sentence
Bias Benchmark for QA bbq.json	This recipe measures the model's ability in attesting social biases against people belonging to protected classes along nine social dimensions relevant for US English-speaking context with age ambiguous and disambiguous questions. The higher the grade, the better the system is in handling social biases.	bbq-lite-age-ambiguous, bbq-lite-age-disamb, bbq-lite-disability-status-ambiguous, bbq-lite-disability-status-disamb, bbq-lite-gender-ambiguous, bbq-lite-gender-disamb, bbq-lite-nationality-ambiguous, bbq-lite-nationality-disamb, bbq-lite-physical-appearance-ambiguous, bbq-lite-physical-appearance-disamb, bbq-lite-race-ethnicity-ambiguous, bbq-lite-race-ethnicity-disamb, bbq-lite-race-x-gender-ambiguous, bbq-lite-race-x-gender-disamb, bbq-lite-race-x-ses-ambiguous, bbq-lite-race-x-ses-disamb, bbq-lite-religion-ambiguous, bbq-lite-religion-disamb, bbq-lite-ses-ambiguous, bbq-lite-ses-disamb, bbq-lite-sexual-orientation-ambiguous, bbq-lite-sexual-orientation-disamb	exactstrmatch	mcq-template
Bias - Occupation bias-occupation.json	This recipe tests if a targeted system thinks that a certain demographic group can handle a job role better than others.	occupation	gpt4annotator	job-role-bias
Cause and Effect cause-and-effect.json	This recipe measure the model's ability in understanding the cause and effect in a sentence. The higher the grade, the better the system is in performing this capability.	cause-and-effect	relaxstrmatch
Chinese Bias Becnhmark for QA - Lite cbbq-lite.json	This recipe measures a system's ability to respond to safety-related, biased, and stereotypical questions in Chinese. A higher grade indicates that the system under test is safer, less biased, and less stereotypical.	cbbq-lite-age-ambiguous, cbbq-lite-age-disamb, cbbq-lite-disability-ambiguous, cbbq-lite-disability-disamb, cbbq-lite-disease-ambiguous, cbbq-lite-disease-disamb, cbbq-lite-educational-qualification-ambiguous, cbbq-lite-educational-qualification-disamb, cbbq-lite-ethnicity-ambiguous, cbbq-lite-ethnicity-disamb, cbbq-lite-gender-ambiguous, cbbq-lite-gender-disamb, cbbq-lite-household-registration-ambiguous, cbbq-lite-household-registration-disamb, cbbq-lite-nationality-ambiguous, cbbq-lite-nationality-disamb, cbbq-lite-physical-appearance-ambiguous, cbbq-lite-physical-appearance-disamb, cbbq-lite-race-ambiguous, cbbq-lite-race-disamb, cbbq-lite-region-ambiguous, cbbq-lite-region-disamb, cbbq-lite-SES-ambiguous, cbbq-lite-SES-disamb, cbbq-lite-sexual-orientation-ambiguous, cbbq-lite-sexual-orientation-disamb	exactstrmatch	chinese-mcq-template
ChallengingToxicityPrompts challenging-toxicity-prompts-completion.json	This recipe measures the model's ability in generating toxic phrases or sentences when given a toxic phrase. The higher the grade, the less toxic the system is.	challenging-toxicity-prompts	toxicity-classifier	complete-sentence
Chinese Linguistics & Cognition Challenge (CLCC) clcc.json	Created by Beijing Academy of AI, This benchmark is designed to test for various aspects of natural language understanding and cognitive processing in the context of the Chinese language. It may involve evaluating AI models' abilities to interpret, analyze, and generate Chinese text, understand semantics and syntax. By running this benchmark, you consent to sending your application's predicted results to BAAI for evaluation.	clcc	flagjudgeannotator	clcc-template
Commonsense Morality (Easy) commonsense-morality-easy.json	This recipe measures whether how the system performs when given prompts embedded the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs.	commonsense-morality-easy-variation1	exactstrmatch
Commonsense Morality (Hard) commonsense-morality-hard.json	This recipe measures whether how the system performs when given prompts embedded the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs.	commonsense-morality-hard-variation1	exactstrmatch
Contextual Parametric Knowledge Conflicts contextual-parametric-knowledge-conflicts.json	This recipe measures the model’s ability in making informed predictions. The higher the grade, the better the system can perform for this capability.	contextual-parametric-knowledge-conflicts	relaxstrmatch	mcq-template
COQA Conversational Q&A coqa-conversational-qna.json	To measure the ability of machines to understand a text passage and answer a series of interconnected questions. The higher the grade, the better the system performs for this capability.	coqa-conversational-qna	exactstrmatch	answer-template
CValues - The benchmark for measuring alignment in LLMs using Chinese Langauges. cvalues.json	This recipe measures the alignment ability of LLMs in both safety and responsibility in Chinese Language. A higher grade means that the LLMs are more aligned to human's values.	cvalues	exactstrmatch	chinese-mcq-template
Cybersecurity Evaluation - Prompt Injection for Chinese cyberseceval-cn.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval-promptinjection2-cn	cybersecevalannotator2
CyberSecEval- Prompt Injections 3 cyberseceval-en.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval-promptinjection2-en	cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for French cyberseceval-fr.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval-promptinjection2-fr	cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Japanese cyberseceval-jp.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval-promptinjection2-jp	cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Korean cyberseceval-kr.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval-promptinjection2-kr	cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Kwashili cyberseceval-kw.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval-promptinjection2-kw	cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Malay cyberseceval-my.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval-promptinjection2-my	cybersecevalannotator2
prompt injection cyberseceval.json	This recipe is created to test model's susceptibility to prompt injection	cyberseceval_promptinjection	cybersecevalannotator
EnronEmail enron-email.json	The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation.	privacy-enronemails	leakagerate	enronemail-templatea, enronemail-templateb, enronemail-templatec, enronemail-templated
Gender Bias genderbias-text2image-prompts.json	The gender occupational bias is a set of gender neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759	gender-occupation-text2-image-prompts	genderbias_metric
GRE Reading Comprehension gre-reading-comprehension.json	To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. The higher the grade, the better the system is in performing this capability.	gre-reading-comprehension	relaxstrmatch	answer-template
GSM8K-TEST gsm8k-test.json	To measure model's ability in solving grade school math problems created by human. These problems take between 2 and 8 steps to solve, and usually requires a sequence of elementary calculations. The higher the grade, the better the system can perform this capability. This recipe is based on the test set of GSM8K	gsm8k-test	exactstrmatch-gsm8k-test	gsm8k-test-template
GSM8K gsm8k.json	To measure model's ability in solvinggrade school math problems created by human. These problems take between 2 and 8 steps to solve, and usually requires a sequence of elementary calculations. The higher the grade, the better the system can perform this capability.	gsm8k	exactstrmatch	mcq-template
HellaSwag hellaswag.json	To measure commonsense with questions that are trivial for humans but difficult for statistical models. The higher the grade, the better the system performs this capability.	hellaswag	exactstrmatch	mcq-template
I2P i2p-text2image-prompts.json	The I2P dataset consists of prompts that are designed to query for inappropriate images.	i2p-text2image-prompts	nudenet_metric, q16_metric
Jailbreak-DAN jailbreak-dan.json	This recipe assesses whether the system will be jailbroken using the common jailbreak methods. The higher the grade, the better the system is protected against common jailbreaks.	jailbreak-dan	reverseexactstrmatch
Jailbreak Prompts jailbreak-prompts.json	This recipe assesses the model's or application's ability to identify attempted jailbreaks and reject the generation of harmful queries.	prompt_injection_jailbreak	refusal
Medical MCQ medical-mcq.json	In this recipe, we measure how well a system is able to answer healthcare questions using various datasets related to healthcare. The higher the grade, the better the system performs in answering healthare questions.	medmcqa, medqa-us, pubmedqa	exactstrmatch	mcq-template
MLCommons AI Safety Benchmark - Child Abuse and Exploitation mlc-cae.json	This recipe measures the model's responses in dealing with prompts that contain, describe, enable, encourage, or endorse the sexual abuse of children. A high grade indicates that the system under test handles the response well.	mlc-cae-malicious-user, mlc-cae-typical-user	llamaguardannotator
MLCommons AI Safety Benchmark - Chemical, Biological, Radiological, Nuclear, and Explosives mlc-cbr.json	This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the creation of weapons of mass destruction. A high grade indicates that the system under test handles the response well.	mlc-cbr-malicious-user, mlc-cbr-typical-user	llamaguardannotator
MLCommons AI Safety Benchmark - Hate mlc-hat.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-hat-malicious-user, mlc-hat-typical-user	llamaguardannotator
ML Commons Ailuminte - Intellectual Property Violation - Cantonese mlc-ipv-ca.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-ca	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - Chinese mlc-ipv-cn.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-cn	mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - English mlc-ipv-en.json	This recipe evaluates the application's or model's responses that violate the intellectual property rights of any party.A high grade indicates that the system under test handles the response well.	mlc-ipv-en	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - Farsi mlc-ipv-fa.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-fa	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - French mlc-ipv-fr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-fr	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - Japanese mlc-ipv-jp.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-jp	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - Korean mlc-ipv-kr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-kr	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - Kwashili mlc-ipv-kw.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-kw	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - Malay mlc-ipv-my.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-my	mlcipv-annotator
ML Commons Ailuminte - Intellectual Property Violation - Telugu mlc-ipv-tl.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ipv-tl	mlcipv-annotator
ML Commons Ailuminte - Non-Violent Crimes - Chinese mlc-ncr-cn.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-cn	mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - English mlc-ncr-en.json	This recipe evaluates the application's or model's responses that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well.	mlc-ncr-en	mlcncr-annotator
ML Commons Ailuminte - Non-Violent Crimes - Farsi mlc-ncr-fa.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-fa	mlcncr-annotator
ML Commons Ailuminte - Non-Violent Crimes - French mlc-ncr-fr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-fr	mlcncr-annotator
ML Commons Ailuminte - Non-Violent Crimes - Japanese mlc-ncr-jp.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-jp	mlcncr-annotator
ML Commons Ailuminte - Non-Violent Crimes - Korean mlc-ncr-kr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-kr	mlcncr-annotator
ML Commons Ailuminte - Non-Violent Crimes - Kwashili mlc-ncr-kw.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-kw	mlcncr-annotator
ML Commons Ailuminte - Non-Violent Crimes - Malay mlc-ncr-my.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-my	mlcncr-annotator
ML Commons Ailuminte - Non-Violent Crimes - Telugu mlc-ncr-tl.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-ncr-tl	mlcncr-annotator
MLCommons AI Safety Benchmark - Non-Violent Crimes mlc-nvc.json	This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well.	mlc-nvc-malicious-user, mlc-nvc-typical-user	llamaguardannotator
ML Commons Ailuminte - Privacy - Chinese mlc-prv-cn.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-prv-cn	mlcprv-annotator
MLCommons AILuminate - Privacy - English mlc-prv-en.json	This recipe evaluates the application's or model's responses in dealing with prompts that contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well.	mlc-prv-en	mlcprv-annotator
ML Commons Ailuminte - Privacy - French mlc-prv-fr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-prv-fr	mlcprv-annotator
ML Commons Ailuminte - Privacy - Japanese mlc-prv-jp.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-prv-jp	mlcprv-annotator
ML Commons Ailuminte - Privacy - Korean mlc-prv-kr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-prv-kr	mlcprv-annotator
ML Commons Ailuminte - Privacy - Kwashili mlc-prv-kw.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-prv-kw	mlcprv-annotator
ML Commons Ailuminte - Privacy - Malay mlc-prv-my.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-prv-my	mlcprv-annotator
ML Commons Ailuminte - Privacy - Telugu mlc-prv-tl.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-prv-tl	mlcprv-annotator
MLCommons AI Safety Benchmark - Sex-related Crimes mlc-scr.json	This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of sex-related crimes. A high grade indicates that the system under test handles the response well.	mlc-scr-malicious-user, mlc-scr-typical-user	llamaguardannotator
MLCommons AI Safety Benchmark - Suicide & Self-Harm mlc-ssh.json	This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse acts of intentional self-harm. A high grade indicates that the system under test handles the response well.	mlc-ssh-vulnerable-user, mlc-ssh-typical-user	llamaguardannotator
ML Commons Ailuminte - Violent Crimes - Chinese mlc-vcr-cn.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-vcr-cn	mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - English mlc-vcr-en.json	This recipe evaluates the application's or model's responses that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well.	mlc-vcr-en	mlcvcr-annotator
ML Commons Ailuminte - Violent Crimes - French mlc-vcr-fr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-vcr-fr	mlcvcr-annotator
ML Commons Ailuminte - Violent Crimes - Japanese mlc-vcr-jp.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-vcr-jp	mlcvcr-annotator
ML Commons Ailuminte - Violent Crimes - Korean mlc-vcr-kr.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-vcr-kr	mlcvcr-annotator
ML Commons Ailuminte - Violent Crimes - Kwashili mlc-vcr-kw.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-vcr-kw	mlcvcr-annotator
ML Commons Ailuminte - Violent Crimes - Malay mlc-vcr-my.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-vcr-my	mlcvcr-annotator
ML Commons Ailuminte - Violent Crimes - Telugu mlc-vcr-tl.json	This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well.	mlc-vcr-tl	mlcvcr-annotator
MLCommons AI Safety Benchmark - Violent Crimes mlc-vcr.json	This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well.	mlc-vcr-malicious-user, mlc-vcr-typical-user	llamaguardannotator
MMLU - Medical related questions. mmlu-medical.json	MMLU benchmark measures model's ability in answering accurately across 57 tasks including elementary mathemathics, US history, computer science, law, and more. In this recipe, we extracted all the tasks related to healthare. The higher the grade, the better the system performs in answering healthcare related questions.	mmlu-anatomy, mmlu-clinicalknowledge, mmlu-collegebiology, mmlu-collegemedicine, mmlu-medicalgenetics, mmlu-professionalmedicine	exactstrmatch	mcq-template
MMLU mmlu.json	To measure model's ability in answering accurately across 57 tasks including elementary mathemathics, US history, computer science, law, and more. The higher the grade, the better the system performs.	mmlu-all	exactstrmatch	mmlu
RealToxicityPrompts (Completion) real-toxicity-prompts-completion.json	RealToxicityPrompts aims to measure the model's ability in identify toxic sentences. The higher the grade, the less toxic the system is.	real-toxicity-prompts	toxicity-classifier	real-toxicity-prompt-template2
RealtimeQA realtime-qa.json	RealTime QA is a dynamic question answering (QA) platform that inquires about the present. The higher the grade, the better the system is handling hallucination.	realtimeqa-past	exactstrmatch
Singapore Legal Glossary sg-legal-glossary.json	To measure the model's ability in understanding legal term used commonly in Singapore.	sg-legal-glossary	bertscore	legal-term-template
University Tutorial Questions in Singapore's Legal Classes sg-university-tutorial-questions-legal.json	To measure the model's ability in understanding legal context in Singapore.	sg-university-tutorial-questions-legal	bertscore	answer-template
Facts about Singapore MCQ singapore-facts-mcq.json	This recipe assesses the accuracy of factual statements about Singapore using a multiple-choice format. A higher score reflects the target's ability to answer questions about Singapore with a high level of precision and detail. By achieving a high grade, the target demonstrates a thorough understanding and knowledge of Singapore's key events, transport system, and facts.	singapore-transport-system, singapore-iconic-places, singapore-political-history, singapore-public-housing	exactstrmatch	singapore-facts-mcq
Facts about Singapore TF singapore-facts-tf.json	This recipe assesses the accuracy of factual statements about Singapore using a True/False format. A higher score reflects the target's ability to answer questions about Singapore with a high level of precision and detail. By achieving a high grade, the target demonstrates a thorough understanding and knowledge of Singapore's key events, transport system, and facts.	singapore-facts-tnf, singapore-places-tnf, singapore-food-tnf	exactstrmatch	singapore-facts-tf
Facts about Singapore singapore-facts.json	This recipe evaluates the accuracy of facts about Singapore. A higher score reflects the target's ability to answer questions about Singapore with a high level of precision and detail. By achieving a high grade, the target demonstrates a thorough understanding and knowledge of Singapore's key events, transport system, and facts.	singapore-transport-system, singapore-iconic-places, singapore-political-history, singapore-public-housing, singapore-facts-tnf, singapore-places-tnf, singapore-food-tnf	exactstrmatch
POMFA specific corrections singapore-pofma-true.json	This recipe evaluates the accuracy of corrective POFMA statements that are facts about Singapore. This is a sanity check!	singapore-pofma-true-statements	relaxstrmatch
Singapore POMFA False Statements singapore-pofma.json	This recipe evaluates the accuracy of statements deemed to false and in violation of POFMA. A higher score reflects the target's ability to answer correctly that the statements are false.	singapore-pofma-statements-2024, singapore-pofma-statements-2023	relaxstrmatch
Safety questions written in Singapore's context singapore-safety.json	This recipe evaluates whether a system under test can perform well when a harmful question is written in Singapore's context. By achieving a high grade, the target demonstrates that it is lower in risk.	singapore-safety-questions	llamaguardannotator
squad-shifts-tnf squad-shifts-tnf.json	Zero-shot reading comprehension on paragraphs and questions from squadshifts. Augmented to true/false statement. The higher the grade, the better the model performs this capability.	squad-shifts-tnf	relaxstrmatch
SQUAD-V2 squad-v2.json	Hallucination evaluation with SQuAD2.0	squad-v2	exactstrmatch-f1-squad-v2	squad-v2-template
TAMIL-KURAL-CLASSIFICATION tamil-kural-classification.json	This recipe is used to test the comprehension abilities for the Thirukkural. Thirukkural is a classic Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) that are grouped into 133 chapters, each containing 10 couplets. The higher the grade, the better the model performs in this task.	tamil-kural-classification	exactstrmatch	tamil-templatekuralclassification
TAMIL-NEWS-CLASSIFICATION tamil-tamilnews-classification.json	This recipe is to classify a static set of Tamil News. The task is to classify news to its respective category. The higher the grade, the better the model performs in this task.	tamil-tamilnews-classification	exactstrmatch	tamil-templatenewsclassification
TAMIL-TANGLISH-TWEETS tamil-tanglish-tweets.json	This recipe is used to perform sentiment analysis on code-mixed Tamil and English tweets. The higher the grade, the better the model performs in this task.	tamil-tanglish-tweets	exactstrmatch	tamil-templatetanglishsa
TruthfulQA (MCQ) truthfulqa-mcq.json	TruthfulQA aims to measure the truthfulness of a model. The higher the grade, the more truthful the system is.	truthfulqa-mcq	exactstrmatch	mcq-template
TruthfulQA truthfulqa-open-ended.json	TruthfulQA aims to measure the truthfulness of a model. This is an experimental recipe using Rouge Score	truthfulqa-open-ended	rougescorer
UCI Adult Dataset uciadult.json	This recipe contains the easy variation of UCI adult dataset where LLM is expected to classify the attributes correctly. The higher the score, the better the model is.	uciadult	exactstrmatch	uciadult-template
Winobias winobias.json	Winobias dataset is used to test representational harm in LLMs. LLM should refuse to answer. The higher the score, the better the model is.	winobias-type1	reverseexactstrmatch
Winograde winogrande.json	To measure model's ability in answering commonsense reasoning, expert-crafted pronoun resolution problems that are designed to be unsolvable for statistical models. The higher the score, the better the model is.	winogrande	exactstrmatch	mcq-template