
List of Recipes

Each entry below gives the recipe name, its recipe file, a description, and the datasets, metrics, and prompt templates it uses. A brief sketch of how such a recipe definition might be represented in code appears after the list.
AdvGLUE
advglue.json
AdvGLUE is a comprehensive benchmark that assesses the adversarial robustness of language models. It covers textual adversarial attacks at multiple levels, including word-level transformations and sentence-level manipulations. A higher grade indicates that the system under test is more resilient to changes in the sentences. advglue-all advglue
Analogical Similarity
analogical-similarity.json
This recipe measures the model's ability to discriminate between different degrees of analogical similarity in two given episodes. The higher the grade, the better the system performs this capability. analogical-similarity relaxstrmatch analogical-similarity
AnswerCarefully Information for Cantonese
answercarefully-ca.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-ca answercarefully-annotator
AnswerCarefully Information for Chinese
answercarefully-cn.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-cn answercarefully-annotator
AnswerCarefully Information for English
answercarefully-en.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-en answercarefully-annotator
AnswerCarefully Information for Farsi
answercarefully-fa.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-fa answercarefully-annotator
AnswerCarefully Information for French
answercarefully-fr.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-fr answercarefully-annotator
AnswerCarefully Information for Japanese
answercarefully-jp.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-jp answercarefully-annotator
AnswerCarefully Information for Korean
answercarefully-kr.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-kr answercarefully-annotator
AnswerCarefully Information for Kiswahili
answercarefully-kw.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-kw answercarefully-annotator
AnswerCarefully Information for Malay
answercarefully-my.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-my answercarefully-annotator
AnswerCarefully Information for Telugu
answercarefully-tl.json
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. answercarefully-tl answercarefully-annotator
ARC
arc.json
To measure the model's ability to answer genuine grade-school-level, multiple-choice science questions from the easy and challenge sets. The higher the grade, the better the system performs this capability. arc-challenge, arc-easy exactstrmatch mcq-template
Auto Categorisation
auto-categorisation.json
This recipe measures the model's ability to categorise elements into groups. The higher the grade, the better the system performs in this capability. auto-categorisation relaxstrmatch complete-sentence
Bias Benchmark for QA
bbq.json
This recipe measures the model's handling of attested social biases against people belonging to protected classes along nine social dimensions relevant to a US English-speaking context, using both ambiguous and disambiguated questions. The higher the grade, the better the system is at handling social biases. bbq-lite-age-ambiguous, bbq-lite-age-disamb, bbq-lite-disability-status-ambiguous, bbq-lite-disability-status-disamb, bbq-lite-gender-ambiguous, bbq-lite-gender-disamb, bbq-lite-nationality-ambiguous, bbq-lite-nationality-disamb, bbq-lite-physical-appearance-ambiguous, bbq-lite-physical-appearance-disamb, bbq-lite-race-ethnicity-ambiguous, bbq-lite-race-ethnicity-disamb, bbq-lite-race-x-gender-ambiguous, bbq-lite-race-x-gender-disamb, bbq-lite-race-x-ses-ambiguous, bbq-lite-race-x-ses-disamb, bbq-lite-religion-ambiguous, bbq-lite-religion-disamb, bbq-lite-ses-ambiguous, bbq-lite-ses-disamb, bbq-lite-sexual-orientation-ambiguous, bbq-lite-sexual-orientation-disamb exactstrmatch mcq-template
Bias - Occupation
bias-occupation.json
This recipe tests if a targeted system thinks that a certain demographic group can handle a job role better than others. occupation gpt4annotator job-role-bias
Cause and Effect
cause-and-effect.json
This recipe measures the model's ability to understand cause and effect in a sentence. The higher the grade, the better the system performs this capability. cause-and-effect relaxstrmatch
Chinese Bias Benchmark for QA - Lite
cbbq-lite.json
This recipe measures a system's ability to respond to safety-related, biased, and stereotypical questions in Chinese. A higher grade indicates that the system under test is safer, less biased, and less stereotypical. cbbq-lite-age-ambiguous, cbbq-lite-age-disamb, cbbq-lite-disability-ambiguous, cbbq-lite-disability-disamb, cbbq-lite-disease-ambiguous, cbbq-lite-disease-disamb, cbbq-lite-educational-qualification-ambiguous, cbbq-lite-educational-qualification-disamb, cbbq-lite-ethnicity-ambiguous, cbbq-lite-ethnicity-disamb, cbbq-lite-gender-ambiguous, cbbq-lite-gender-disamb, cbbq-lite-household-registration-ambiguous, cbbq-lite-household-registration-disamb, cbbq-lite-nationality-ambiguous, cbbq-lite-nationality-disamb, cbbq-lite-physical-appearance-ambiguous, cbbq-lite-physical-appearance-disamb, cbbq-lite-race-ambiguous, cbbq-lite-race-disamb, cbbq-lite-region-ambiguous, cbbq-lite-region-disamb, cbbq-lite-SES-ambiguous, cbbq-lite-SES-disamb, cbbq-lite-sexual-orientation-ambiguous, cbbq-lite-sexual-orientation-disamb exactstrmatch chinese-mcq-template
ChallengingToxicityPrompts
challenging-toxicity-prompts-completion.json
This recipe measures the model's tendency to generate toxic phrases or sentences when given a toxic prompt. The higher the grade, the less toxic the system is. challenging-toxicity-prompts toxicity-classifier complete-sentence
Chinese Linguistics & Cognition Challenge (CLCC)
clcc.json
Created by the Beijing Academy of Artificial Intelligence (BAAI), this benchmark is designed to test various aspects of natural language understanding and cognitive processing in the context of the Chinese language. It may involve evaluating an AI model's ability to interpret, analyze, and generate Chinese text, and to understand its semantics and syntax. By running this benchmark, you consent to sending your application's predicted results to BAAI for evaluation. clcc flagjudgeannotator clcc-template
Commonsense Morality (Easy)
commonsense-morality-easy.json
This recipe measures how the system performs when given prompts that embed the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs. commonsense-morality-easy-variation1 exactstrmatch
Commonsense Morality (Hard)
commonsense-morality-hard.json
This recipe measures how the system performs when given prompts that embed the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs. commonsense-morality-hard-variation1 exactstrmatch
Contextual Parametric Knowledge Conflicts
contextual-parametric-knowledge-conflicts.json
This recipe measures the model's ability to make informed predictions. The higher the grade, the better the system performs this capability. contextual-parametric-knowledge-conflicts relaxstrmatch mcq-template
COQA Conversational Q&A
coqa-conversational-qna.json
To measure the ability of machines to understand a text passage and answer a series of interconnected questions. The higher the grade, the better the system performs for this capability. coqa-conversational-qna exactstrmatch answer-template
CValues - The benchmark for measuring alignment in LLMs in the Chinese language.
cvalues.json
This recipe measures the alignment of LLMs in both safety and responsibility in the Chinese language. A higher grade means that the LLM is more aligned with human values. cvalues exactstrmatch chinese-mcq-template
Cybersecurity Evaluation - Prompt Injection for Chinese
cyberseceval-cn.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval-promptinjection2-cn cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for English
cyberseceval-en.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval-promptinjection2-en cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for French
cyberseceval-fr.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval-promptinjection2-fr cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Japanese
cyberseceval-jp.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval-promptinjection2-jp cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Korean
cyberseceval-kr.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval-promptinjection2-kr cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Kiswahili
cyberseceval-kw.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval-promptinjection2-kw cybersecevalannotator2
Cybersecurity Evaluation - Prompt Injection for Malay
cyberseceval-my.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval-promptinjection2-my cybersecevalannotator2
Prompt Injection
cyberseceval.json
This recipe is created to test a model's susceptibility to prompt injection. cyberseceval_promptinjection cybersecevalannotator
EnronEmail
enron-email.json
The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. privacy-enronemails leakagerate enronemail-templatea, enronemail-templateb, enronemail-templatec, enronemail-templated
Gender Bias
genderbias-text2image-prompts.json
This recipe measures gender occupational bias using a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 gender-occupation-text2-image-prompts genderbias_metric
GRE Reading Comprehension
gre-reading-comprehension.json
To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. The higher the grade, the better the system is in performing this capability. gre-reading-comprehension relaxstrmatch answer-template
GSM8K-TEST
gsm8k-test.json
To measure the model's ability to solve grade-school math problems created by humans. These problems take between 2 and 8 steps to solve and usually require a sequence of elementary calculations. The higher the grade, the better the system performs this capability. This recipe is based on the test set of GSM8K. gsm8k-test exactstrmatch-gsm8k-test gsm8k-test-template
GSM8K
gsm8k.json
To measure the model's ability to solve grade-school math problems created by humans. These problems take between 2 and 8 steps to solve and usually require a sequence of elementary calculations. The higher the grade, the better the system performs this capability. gsm8k exactstrmatch mcq-template
HellaSwag
hellaswag.json
To measure commonsense with questions that are trivial for humans but difficult for statistical models. The higher the grade, the better the system performs this capability. hellaswag exactstrmatch mcq-template
I2P
i2p-text2image-prompts.json
The I2P dataset consists of prompts that are designed to query for inappropriate images. i2p-text2image-prompts nudenet_metric, q16_metric
Jailbreak-DAN
jailbreak-dan.json
This recipe assesses whether the system will be jailbroken using the common jailbreak methods. The higher the grade, the better the system is protected against common jailbreaks. jailbreak-dan reverseexactstrmatch
Jailbreak Prompts
jailbreak-prompts.json
This recipe assesses the model's or application's ability to identify attempted jailbreaks and reject the generation of harmful queries. prompt_injection_jailbreak refusal
Medical MCQ
medical-mcq.json
In this recipe, we measure how well a system is able to answer healthcare questions using various datasets related to healthcare. The higher the grade, the better the system performs in answering healthcare questions. medmcqa, medqa-us, pubmedqa exactstrmatch mcq-template
MLCommons AI Safety Benchmark - Child Abuse and Exploitation
mlc-cae.json
This recipe measures the model's responses in dealing with prompts that contain, describe, enable, encourage, or endorse the sexual abuse of children. A high grade indicates that the system under test handles the response well. mlc-cae-malicious-user, mlc-cae-typical-user llamaguardannotator
MLCommons AI Safety Benchmark - Chemical, Biological, Radiological, Nuclear, and Explosives
mlc-cbr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the creation of weapons of mass destruction. A high grade indicates that the system under test handles the response well. mlc-cbr-malicious-user, mlc-cbr-typical-user llamaguardannotator
MLCommons AI Safety Benchmark - Hate
mlc-hat.json
This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well. mlc-hat-malicious-user, mlc-hat-typical-user llamaguardannotator
MLCommons AILuminate - Intellectual Property Violation - Cantonese
mlc-ipv-ca.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-ca mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - Chinese
mlc-ipv-cn.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-cn mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - English
mlc-ipv-en.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-en mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - Farsi
mlc-ipv-fa.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-fa mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - French
mlc-ipv-fr.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-fr mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - Japanese
mlc-ipv-jp.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-jp mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - Korean
mlc-ipv-kr.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-kr mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - Kiswahili
mlc-ipv-kw.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-kw mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - Malay
mlc-ipv-my.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-my mlcipv-annotator
MLCommons AILuminate - Intellectual Property Violation - Telugu
mlc-ipv-tl.json
This recipe measures the model's responses in dealing with prompts that seek content which may violate or infringe the intellectual property rights of a third party. A high grade indicates that the system under test handles the response well. mlc-ipv-tl mlcipv-annotator
MLCommons AILuminate - Non-Violent Crimes - Chinese
mlc-ncr-cn.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-cn mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - English
mlc-ncr-en.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-en mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - Farsi
mlc-ncr-fa.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-fa mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - French
mlc-ncr-fr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-fr mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - Japanese
mlc-ncr-jp.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-jp mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - Korean
mlc-ncr-kr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-kr mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - Kiswahili
mlc-ncr-kw.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-kw mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - Malay
mlc-ncr-my.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-my mlcncr-annotator
MLCommons AILuminate - Non-Violent Crimes - Telugu
mlc-ncr-tl.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-ncr-tl mlcncr-annotator
MLCommons AI Safety Benchmark - Non-Violent Crimes
mlc-nvc.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. mlc-nvc-malicious-user, mlc-nvc-typical-user llamaguardannotator
MLCommons AILuminate - Privacy - Chinese
mlc-prv-cn.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-cn mlcprv-annotator
MLCommons AILuminate - Privacy - English
mlc-prv-en.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-en mlcprv-annotator
MLCommons AILuminate - Privacy - French
mlc-prv-fr.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-fr mlcprv-annotator
MLCommons AILuminate - Privacy - Japanese
mlc-prv-jp.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-jp mlcprv-annotator
MLCommons AILuminate - Privacy - Korean
mlc-prv-kr.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-kr mlcprv-annotator
MLCommons AILuminate - Privacy - Kiswahili
mlc-prv-kw.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-kw mlcprv-annotator
MLCommons AILuminate - Privacy - Malay
mlc-prv-my.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-my mlcprv-annotator
MLCommons AILuminate - Privacy - Telugu
mlc-prv-tl.json
This recipe measures the model's responses in dealing with prompts that seek to expose sensitive, non-public personal information that could undermine a person's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. mlc-prv-tl mlcprv-annotator
MLCommons AI Safety Benchmark - Sex-related Crimes
mlc-scr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of sex-related crimes. A high grade indicates that the system under test handles the response well. mlc-scr-malicious-user, mlc-scr-typical-user llamaguardannotator
MLCommons AI Safety Benchmark - Suicide & Self-Harm
mlc-ssh.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse acts of intentional self-harm. A high grade indicates that the system under test handles the response well. mlc-ssh-vulnerable-user, mlc-ssh-typical-user llamaguardannotator
MLCommons AILuminate - Violent Crimes - Chinese
mlc-vcr-cn.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-cn mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - English
mlc-vcr-en.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-en mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - French
mlc-vcr-fr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-fr mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - Japanese
mlc-vcr-jp.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-jp mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - Korean
mlc-vcr-kr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-kr mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - Kiswahili
mlc-vcr-kw.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-kw mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - Malay
mlc-vcr-my.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-my mlcvcr-annotator
MLCommons AILuminate - Violent Crimes - Telugu
mlc-vcr-tl.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-tl mlcvcr-annotator
MLCommons AI Safety Benchmark - Violent Crimes
mlc-vcr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. mlc-vcr-malicious-user, mlc-vcr-typical-user llamaguardannotator
MMLU - Medical-related questions.
mmlu-medical.json
The MMLU benchmark measures a model's ability to answer accurately across 57 tasks, including elementary mathematics, US history, computer science, law, and more. In this recipe, we extracted all the tasks related to healthcare. The higher the grade, the better the system performs in answering healthcare-related questions. mmlu-anatomy, mmlu-clinicalknowledge, mmlu-collegebiology, mmlu-collegemedicine, mmlu-medicalgenetics, mmlu-professionalmedicine exactstrmatch mcq-template
MMLU
mmlu.json
To measure a model's ability to answer accurately across 57 tasks, including elementary mathematics, US history, computer science, law, and more. The higher the grade, the better the system performs. mmlu-all exactstrmatch mmlu
Ragas RAG Evaluation Recipe
ragas-evaluation.json
This recipe evaluates retrieval-augmented generation (RAG) responses on a sample dataset using the Ragas measures of answer correctness, answer relevance, context precision, context recall, and faithfulness; a sketch of how such metrics might be computed appears after this list. rag-sample-dataset answercorrectness, answerrelevance, contextprecision, contextrecall, faithfulness
RealToxicityPrompts (Completion)
real-toxicity-prompts-completion.json
RealToxicityPrompts aims to measure the model's tendency to generate toxic completions when given a prompt. The higher the grade, the less toxic the system is. real-toxicity-prompts toxicity-classifier real-toxicity-prompt-template2
RealtimeQA
realtime-qa.json
RealTime QA is a dynamic question answering (QA) platform that inquires about the present. The higher the grade, the better the system handles hallucination. realtimeqa-past exactstrmatch
Singapore Legal Glossary
sg-legal-glossary.json
To measure the model's ability to understand legal terms commonly used in Singapore. sg-legal-glossary bertscore legal-term-template
University Tutorial Questions in Singapore's Legal Classes
sg-university-tutorial-questions-legal.json
To measure the model's ability to understand the legal context in Singapore. sg-university-tutorial-questions-legal bertscore answer-template
Facts about Singapore
singapore-facts.json
This recipe evaluates the accuracy of facts about Singapore. A higher score reflects the target's ability to answer questions about Singapore with a high level of precision and detail. By achieving a high grade, the target demonstrates a thorough understanding and knowledge of Singapore's key events, transport system, and facts. singapore-transport-system, singapore-iconic-places, singapore-political-history, singapore-public-housing, singapore-facts-tnf, singapore-places-tnf, singapore-food-tnf exactstrmatch
Safety questions written in Singapore's context
singapore-safety.json
This recipe evaluates whether a system under test can perform well when a harmful question is written in Singapore's context. By achieving a high grade, the target demonstrates that it poses a lower risk. singapore-safety-questions llamaguardannotator
squad-shifts-tnf
squad-shifts-tnf.json
Zero-shot reading comprehension on paragraphs and questions from SquadShifts, augmented into true/false statements. The higher the grade, the better the model performs this capability. squad-shifts-tnf relaxstrmatch
SQUAD-V2
squad-v2.json
Hallucination evaluation with SQuAD2.0. squad-v2 exactstrmatch-f1-squad-v2 squad-v2-template
TAMIL-KURAL-CLASSIFICATION
tamil-kural-classification.json
This recipe is used to test comprehension of the Thirukkural, a classic work of Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) grouped into 133 chapters, each containing 10 couplets. The higher the grade, the better the model performs in this task. tamil-kural-classification exactstrmatch tamil-templatekuralclassification
TAMIL-NEWS-CLASSIFICATION
tamil-tamilnews-classification.json
This recipe classifies a static set of Tamil news articles into their respective categories. The higher the grade, the better the model performs in this task. tamil-tamilnews-classification exactstrmatch tamil-templatenewsclassification
TAMIL-TANGLISH-TWEETS
tamil-tanglish-tweets.json
This recipe is used to perform sentiment analysis on code-mixed Tamil and English tweets. The higher the grade, the better the model performs in this task. tamil-tanglish-tweets exactstrmatch tamil-templatetanglishsa
TruthfulQA (MCQ)
truthfulqa-mcq.json
TruthfulQA aims to measure the truthfulness of a model. The higher the grade, the more truthful the system is. truthfulqa-mcq exactstrmatch mcq-template
TruthfulQA
truthfulqa-open-ended.json
TruthfulQA aims to measure the truthfulness of a model. This is an experimental recipe using the ROUGE score. truthfulqa-open-ended rougescorer
UCI Adult Dataset
uciadult.json
This recipe contains the easy variation of the UCI Adult dataset, where the LLM is expected to classify the attributes correctly. The higher the score, the better the model is. uciadult exactstrmatch uciadult-template
Winobias
winobias.json
The Winobias dataset is used to test representational harm in LLMs; the LLM should refuse to answer. The higher the score, the better the model is. winobias-type1 reverseexactstrmatch
Winogrande
winogrande.json
To measure a model's ability to answer expert-crafted, commonsense pronoun-resolution problems that are designed to be unsolvable for statistical models. The higher the score, the better the model is. winogrande exactstrmatch mcq-template
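
Each recipe above is stored as a standalone JSON file (for example, advglue.json) that ties together the fields shown in this list. The exact schema is not documented on this page, so the Python sketch below is only an illustration: the field names are assumptions that mirror the list's columns, not the project's actual recipe schema.

```python
# Hypothetical sketch only: the field names below are assumptions inferred
# from the columns of this list, not the project's actual recipe schema.
import json

advglue_recipe = {
    "name": "AdvGLUE",                # recipe name
    "description": "Adversarial robustness evaluation of language models.",
    "datasets": ["advglue-all"],      # datasets the recipe runs on
    "metrics": ["advglue"],           # metrics used to grade responses
    "prompt_templates": [],           # prompt templates (none listed for AdvGLUE)
}

# Serialise the definition to a recipe file such as advglue.json.
with open("advglue.json", "w", encoding="utf-8") as f:
    json.dump(advglue_recipe, f, indent=2, ensure_ascii=False)
```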
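
The Ragas RAG Evaluation Recipe above lists five metrics but no further detail. As a point of reference, here is a minimal sketch of how such measures could be computed with the open-source ragas Python package; the function and metric names follow that package's public API (which varies between versions), and this is not necessarily how the recipe itself is implemented.

```python
# Minimal sketch, assuming the open-source `ragas` and `datasets` packages;
# not the recipe's actual implementation. ragas' LLM-based metrics also need
# a judge model (e.g. an OpenAI API key) configured in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,   # cf. the recipe's "answercorrectness" metric
    answer_relevancy,     # cf. "answerrelevance"
    context_precision,    # cf. "contextprecision"
    context_recall,       # cf. "contextrecall"
    faithfulness,         # cf. "faithfulness"
)

# One illustrative RAG sample: question, retrieved contexts, generated
# answer, and a reference answer to grade against.
sample = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "contexts": [["Paris has been the capital of France since 508 CE."]],
    "answer": ["The capital of France is Paris."],
    "ground_truth": ["Paris"],
})

scores = evaluate(
    sample,
    metrics=[answer_correctness, answer_relevancy,
             context_precision, context_recall, faithfulness],
)
print(scores)  # per-metric scores for the sample
```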