List of Recipes
Name | Description | Datasets | Metrics | Prompt Templates |
---|---|---|---|---|
AdvGLUE advglue.json |
AdvGLUE is a comprehensive robustness evaluation benchmark that focuses on assessing the adversarial robustness of language models. It covers textual adversarial attacks at various levels, including word-level transformations and sentence-level manipulations. A higher grade indicates that the system under test is more resilient to changes in the sentences. | advglue-all | advglue | |
Analogical Similarity analogical-similarity.json |
This recipe measures the model's ability to discriminate between different degrees of analogical similarity in two given episodes. The higher the grade, the better the system performs this capability. | analogical-similarity | relaxstrmatch | analogical-similarity |
AnswerCarefully Information for Cantonese answercarefully-ca.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-ca | answercarefully-annotator | |
AnswerCarefully Information for Chinese answercarefully-cn.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-cn | answercarefully-annotator | |
AnswerCarefully Information for English answercarefully-en.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-en | answercarefully-annotator | |
AnswerCarefully Information for Farsi answercarefully-fa.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-fa | answercarefully-annotator | |
AnswerCarefully Information for French answercarefully-fr.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-fr | answercarefully-annotator | |
AnswerCarefully Information for Japanese answercarefully-jp.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-jp | answercarefully-annotator | |
AnswerCarefully Information for Korean answercarefully-kr.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-kr | answercarefully-annotator | |
AnswerCarefully Information for Kwashili answercarefully-kw.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-kw | answercarefully-annotator | |
AnswerCarefully Information for Malay answercarefully-my.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-my | answercarefully-annotator | |
AnswerCarefully Information for Telugu answercarefully-tl.json |
This dataset includes a subset of the safety-focused testing dataset 'AnswerCarefully'. The subset focuses on Information Hazards. | answercarefully-tl | answercarefully-annotator | |
ARC arc.json |
To measure the model's ability to answer genuine grade-school-level, multiple-choice science questions from the easy and challenge sets. The higher the grade, the better the system performs this capability. | arc-challenge, arc-easy | exactstrmatch | mcq-template |
Auto Categorisation auto-categorisation.json |
This recipe measures the model's ability to categorise elements into groups. The higher the grade, the better the system performs this capability. | auto-categorisation | relaxstrmatch | complete-sentence |
Bias Benchmark for QA bbq.json |
This recipe measures the model's handling of social biases against people belonging to protected classes along nine social dimensions relevant to a US English-speaking context, using both ambiguous and disambiguated questions. The higher the grade, the better the system handles social biases. | bbq-lite-age-ambiguous, bbq-lite-age-disamb, bbq-lite-disability-status-ambiguous, bbq-lite-disability-status-disamb, bbq-lite-gender-ambiguous, bbq-lite-gender-disamb, bbq-lite-nationality-ambiguous, bbq-lite-nationality-disamb, bbq-lite-physical-appearance-ambiguous, bbq-lite-physical-appearance-disamb, bbq-lite-race-ethnicity-ambiguous, bbq-lite-race-ethnicity-disamb, bbq-lite-race-x-gender-ambiguous, bbq-lite-race-x-gender-disamb, bbq-lite-race-x-ses-ambiguous, bbq-lite-race-x-ses-disamb, bbq-lite-religion-ambiguous, bbq-lite-religion-disamb, bbq-lite-ses-ambiguous, bbq-lite-ses-disamb, bbq-lite-sexual-orientation-ambiguous, bbq-lite-sexual-orientation-disamb | exactstrmatch | mcq-template |
Bias - Occupation bias-occupation.json |
This recipe tests if a targeted system thinks that a certain demographic group can handle a job role better than others. | occupation | gpt4annotator | job-role-bias |
Cause and Effect cause-and-effect.json |
This recipe measures the model's ability to understand cause and effect in a sentence. The higher the grade, the better the system performs this capability. | cause-and-effect | relaxstrmatch | |
Chinese Bias Benchmark for QA - Lite cbbq-lite.json |
This recipe measures a system's ability to respond to safety-related, biased, and stereotypical questions in Chinese. A higher grade indicates that the system under test is safer, less biased, and less stereotypical. | cbbq-lite-age-ambiguous, cbbq-lite-age-disamb, cbbq-lite-disability-ambiguous, cbbq-lite-disability-disamb, cbbq-lite-disease-ambiguous, cbbq-lite-disease-disamb, cbbq-lite-educational-qualification-ambiguous, cbbq-lite-educational-qualification-disamb, cbbq-lite-ethnicity-ambiguous, cbbq-lite-ethnicity-disamb, cbbq-lite-gender-ambiguous, cbbq-lite-gender-disamb, cbbq-lite-household-registration-ambiguous, cbbq-lite-household-registration-disamb, cbbq-lite-nationality-ambiguous, cbbq-lite-nationality-disamb, cbbq-lite-physical-appearance-ambiguous, cbbq-lite-physical-appearance-disamb, cbbq-lite-race-ambiguous, cbbq-lite-race-disamb, cbbq-lite-region-ambiguous, cbbq-lite-region-disamb, cbbq-lite-SES-ambiguous, cbbq-lite-SES-disamb, cbbq-lite-sexual-orientation-ambiguous, cbbq-lite-sexual-orientation-disamb | exactstrmatch | chinese-mcq-template |
ChallengingToxicityPrompts challenging-toxicity-prompts-completion.json |
This recipe measures the model's tendency to generate toxic phrases or sentences when given a toxic prompt. The higher the grade, the less toxic the system is. | challenging-toxicity-prompts | toxicity-classifier | complete-sentence |
Chinese Linguistics & Cognition Challenge (CLCC) clcc.json |
Created by the Beijing Academy of Artificial Intelligence (BAAI), this benchmark is designed to test various aspects of natural language understanding and cognitive processing in the context of the Chinese language. It may involve evaluating AI models' abilities to interpret, analyze, and generate Chinese text, and to understand semantics and syntax. By running this benchmark, you consent to sending your application's predicted results to BAAI for evaluation. | clcc | flagjudgeannotator | clcc-template |
Commonsense Morality (Easy) commonsense-morality-easy.json |
This recipe measures how the system performs when given prompts that embed the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs. | commonsense-morality-easy-variation1 | exactstrmatch | |
Commonsense Morality (Hard) commonsense-morality-hard.json |
This recipe measures how the system performs when given prompts that embed the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs. | commonsense-morality-hard-variation1 | exactstrmatch | |
Contextual Parametric Knowledge Conflicts contextual-parametric-knowledge-conflicts.json |
This recipe measures the model's ability to make informed predictions. The higher the grade, the better the system performs this capability. | contextual-parametric-knowledge-conflicts | relaxstrmatch | mcq-template |
COQA Conversational Q&A coqa-conversational-qna.json |
To measure the ability of machines to understand a text passage and answer a series of interconnected questions. The higher the grade, the better the system performs for this capability. | coqa-conversational-qna | exactstrmatch | answer-template |
CValues - The benchmark for measuring alignment of LLMs in the Chinese language. cvalues.json |
This recipe measures the alignment of LLMs on both safety and responsibility in the Chinese language. A higher grade means that the LLM is more aligned with human values. | cvalues | exactstrmatch | chinese-mcq-template |
Cybersecurity Evaluation - Prompt Injection for Chinese cyberseceval-cn.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval-promptinjection2-cn | cybersecevalannotator2 | |
Cybersecurity Evaluation - Prompt Injection for English cyberseceval-en.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval-promptinjection2-en | cybersecevalannotator2 | |
Cybersecurity Evaluation - Prompt Injection for French cyberseceval-fr.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval-promptinjection2-fr | cybersecevalannotator2 | |
Cybersecurity Evaluation - Prompt Injection for Japanese cyberseceval-jp.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval-promptinjection2-jp | cybersecevalannotator2 | |
Cybersecurity Evaluation - Prompt Injection for Korean cyberseceval-kr.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval-promptinjection2-kr | cybersecevalannotator2 | |
Cybersecurity Evaluation - Prompt Injection for Kwashili cyberseceval-kw.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval-promptinjection2-kw | cybersecevalannotator2 | |
Cybersecurity Evaluation - Prompt Injection for Malay cyberseceval-my.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval-promptinjection2-my | cybersecevalannotator2 | |
Prompt Injection cyberseceval.json |
This recipe is created to test a model's susceptibility to prompt injection. | cyberseceval_promptinjection | cybersecevalannotator | |
EnronEmail enron-email.json |
The Enron email dataset is aimed at evaluating whether there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. | privacy-enronemails | leakagerate | enronemail-templatea, enronemail-templateb, enronemail-templatec, enronemail-templated |
Gender Bias genderbias-text2image-prompts.json |
The gender occupational bias recipe uses a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 | gender-occupation-text2-image-prompts | genderbias_metric | |
GRE Reading Comprehension gre-reading-comprehension.json |
To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. The higher the grade, the better the system is in performing this capability. | gre-reading-comprehension | relaxstrmatch | answer-template |
GSM8K-TEST gsm8k-test.json |
To measure the model's ability to solve grade school math problems created by humans. These problems take between 2 and 8 steps to solve and usually require a sequence of elementary calculations. The higher the grade, the better the system performs this capability. This recipe is based on the test set of GSM8K. | gsm8k-test | exactstrmatch-gsm8k-test | gsm8k-test-template |
GSM8K gsm8k.json |
To measure the model's ability to solve grade school math problems created by humans. These problems take between 2 and 8 steps to solve and usually require a sequence of elementary calculations. The higher the grade, the better the system performs this capability. | gsm8k | exactstrmatch | mcq-template |
HellaSwag hellaswag.json |
To measure commonsense with questions that are trivial for humans but difficult for statistical models. The higher the grade, the better the system performs this capability. | hellaswag | exactstrmatch | mcq-template |
I2P i2p-text2image-prompts.json |
The I2P dataset consists of prompts that are designed to query for inappropriate images. | i2p-text2image-prompts | nudenet_metric, q16_metric | |
Jailbreak-DAN jailbreak-dan.json |
This recipe assesses whether the system will be jailbroken using the common jailbreak methods. The higher the grade, the better the system is protected against common jailbreaks. | jailbreak-dan | reverseexactstrmatch | |
Jailbreak Prompts jailbreak-prompts.json |
This recipe assesses the model's or application's ability to identify attempted jailbreaks and reject the generation of harmful queries. | prompt_injection_jailbreak | refusal | |
Medical MCQ medical-mcq.json |
In this recipe, we measure how well a system is able to answer healthcare questions using various datasets related to healthcare. The higher the grade, the better the system performs in answering healthcare questions. | medmcqa, medqa-us, pubmedqa | exactstrmatch | mcq-template |
MLCommons AI Safety Benchmark - Child Abuse and Exploitation mlc-cae.json |
This recipe measures the model's responses in dealing with prompts that contain, describe, enable, encourage, or endorse the sexual abuse of children. A high grade indicates that the system under test handles the response well. | mlc-cae-malicious-user, mlc-cae-typical-user | llamaguardannotator | |
MLCommons AI Safety Benchmark - Chemical, Biological, Radiological, Nuclear, and Explosives mlc-cbr.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the creation of weapons of mass destruction. A high grade indicates that the system under test handles the response well. | mlc-cbr-malicious-user, mlc-cbr-typical-user | llamaguardannotator | |
MLCommons AI Safety Benchmark - Hate mlc-hat.json |
This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well. | mlc-hat-malicious-user, mlc-hat-typical-user | llamaguardannotator | |
MLCommons AILuminate - Intellectual Property Violation - Cantonese mlc-ipv-ca.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-ca | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - Chinese mlc-ipv-cn.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-cn | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - English mlc-ipv-en.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-en | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - Farsi mlc-ipv-fa.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-fa | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - French mlc-ipv-fr.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-fr | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - Japanese mlc-ipv-jp.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-jp | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - Korean mlc-ipv-kr.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-kr | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - Kwashili mlc-ipv-kw.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-kw | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - Malay mlc-ipv-my.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-my | mlcipv-annotator | |
MLCommons AILuminate - Intellectual Property Violation - Telugu mlc-ipv-tl.json |
This recipe measures the model's responses in dealing with prompts that encourage or facilitate the violation of intellectual property rights. A high grade indicates that the system under test handles the response well. | mlc-ipv-tl | mlcipv-annotator | |
MLCommons AILuminate - Non-Violent Crimes - Chinese mlc-ncr-cn.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-cn | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - English mlc-ncr-en.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-en | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - Farsi mlc-ncr-fa.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-fa | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - French mlc-ncr-fr.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-fr | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - Japanese mlc-ncr-jp.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-jp | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - Korean mlc-ncr-kr.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-kr | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - Kwashili mlc-ncr-kw.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-kw | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - Malay mlc-ncr-my.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-my | mlcncr-annotator | |
MLCommons AILuminate - Non-Violent Crimes - Telugu mlc-ncr-tl.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-ncr-tl | mlcncr-annotator | |
MLCommons AI Safety Benchmark - Non-Violent Crimes mlc-nvc.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | mlc-nvc-malicious-user, mlc-nvc-typical-user | llamaguardannotator | |
MLCommons AILuminate - Privacy - Chinese mlc-prv-cn.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-cn | mlcprv-annotator | |
MLCommons AILuminate - Privacy - English mlc-prv-en.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-en | mlcprv-annotator | |
MLCommons AILuminate - Privacy - French mlc-prv-fr.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-fr | mlcprv-annotator | |
MLCommons AILuminate - Privacy - Japanese mlc-prv-jp.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-jp | mlcprv-annotator | |
MLCommons AILuminate - Privacy - Korean mlc-prv-kr.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-kr | mlcprv-annotator | |
MLCommons AILuminate - Privacy - Kwashili mlc-prv-kw.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-kw | mlcprv-annotator | |
MLCommons AILuminate - Privacy - Malay mlc-prv-my.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-my | mlcprv-annotator | |
MLCommons AILuminate - Privacy - Telugu mlc-prv-tl.json |
This recipe measures the model's responses in dealing with prompts that solicit or expose sensitive personal information. A high grade indicates that the system under test handles the response well. | mlc-prv-tl | mlcprv-annotator | |
MLCommons AI Safety Benchmark - Sex-related Crimes mlc-scr.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of sex-related crimes. A high grade indicates that the system under test handles the response well. | mlc-scr-malicious-user, mlc-scr-typical-user | llamaguardannotator | |
MLCommons AI Safety Benchmark - Suicide & Self-Harm mlc-ssh.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse acts of intentional self-harm. A high grade indicates that the system under test handles the response well. | mlc-ssh-vulnerable-user, mlc-ssh-typical-user | llamaguardannotator | |
MLCommons AILuminate - Violent Crimes - Chinese mlc-vcr-cn.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-cn | mlcvcr-annotator | |
MLCommons AILuminate - Violent Crimes - English mlc-vcr-en.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-en | mlcvcr-annotator | |
MLCommons AILuminate - Violent Crimes - French mlc-vcr-fr.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-fr | mlcvcr-annotator | |
MLCommons AILuminate - Violent Crimes - Japanese mlc-vcr-jp.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-jp | mlcvcr-annotator | |
MLCommons AILuminate - Violent Crimes - Korean mlc-vcr-kr.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-kr | mlcvcr-annotator | |
MLCommons AILuminate - Violent Crimes - Kwashili mlc-vcr-kw.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-kw | mlcvcr-annotator | |
MLCommons AILuminate - Violent Crimes - Malay mlc-vcr-my.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-my | mlcvcr-annotator | |
MLCommons AILuminate - Violent Crimes - Telugu mlc-vcr-tl.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-tl | mlcvcr-annotator | |
MLCommons AI Safety Benchmark - Violent Crimes mlc-vcr.json |
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | mlc-vcr-malicious-user, mlc-vcr-typical-user | llamaguardannotator | |
MMLU - Medical related questions. mmlu-medical.json |
The MMLU benchmark measures a model's ability to answer questions accurately across 57 tasks, including elementary mathematics, US history, computer science, law, and more. In this recipe, we extracted all the tasks related to healthcare. The higher the grade, the better the system performs in answering healthcare-related questions. | mmlu-anatomy, mmlu-clinicalknowledge, mmlu-collegebiology, mmlu-collegemedicine, mmlu-medicalgenetics, mmlu-professionalmedicine | exactstrmatch | mcq-template |
MMLU mmlu.json |
To measure a model's ability to answer questions accurately across 57 tasks, including elementary mathematics, US history, computer science, law, and more. The higher the grade, the better the system performs. | mmlu-all | exactstrmatch | mmlu |
Ragas RAG Evaluation Recipe ragas-evaluation.json |
This recipe evaluates retrieval-augmented generation (RAG) outputs using the Ragas evaluation metrics. | rag-sample-dataset | answercorrectness, answerrelevance, contextprecision, contextrecall, faithfulness | |
RealToxicityPrompts (Completion) real-toxicity-prompts-completion.json |
RealToxicityPrompts aims to measure the model's tendency to generate toxic continuations when given a prompt. The higher the grade, the less toxic the system is. | real-toxicity-prompts | toxicity-classifier | real-toxicity-prompt-template2 |
RealtimeQA realtime-qa.json |
RealTime QA is a dynamic question answering (QA) platform that inquires about the present. The higher the grade, the better the system handles hallucination. | realtimeqa-past | exactstrmatch | |
Singapore Legal Glossary sg-legal-glossary.json |
To measure the model's ability to understand legal terms commonly used in Singapore. | sg-legal-glossary | bertscore | legal-term-template |
University Tutorial Questions in Singapore's Legal Classes sg-university-tutorial-questions-legal.json |
To measure the model's ability to understand legal context in Singapore. | sg-university-tutorial-questions-legal | bertscore | answer-template |
Facts about Singapore singapore-facts.json |
This recipe evaluates the accuracy of facts about Singapore. A higher score reflects the target's ability to answer questions about Singapore with a high level of precision and detail. By achieving a high grade, the target demonstrates a thorough understanding and knowledge of Singapore's key events, transport system, and facts. | singapore-transport-system, singapore-iconic-places, singapore-political-history, singapore-public-housing, singapore-facts-tnf, singapore-places-tnf, singapore-food-tnf | exactstrmatch | |
Safety questions written in Singapore's context singapore-safety.json |
This recipe evaluates whether a system under test performs well when a harmful question is written in Singapore's context. By achieving a high grade, the target demonstrates that it poses a lower risk. | singapore-safety-questions | llamaguardannotator | |
squad-shifts-tnf squad-shifts-tnf.json |
Zero-shot reading comprehension on paragraphs and questions from SquadShifts, augmented into true/false statements. The higher the grade, the better the model performs this capability. | squad-shifts-tnf | relaxstrmatch | |
SQUAD-V2 squad-v2.json |
Hallucination evaluation with SQuAD2.0 | squad-v2 | exactstrmatch-f1-squad-v2 | squad-v2-template |
TAMIL-KURAL-CLASSIFICATION tamil-kural-classification.json |
This recipe is used to test comprehension of the Thirukkural. The Thirukkural is a classic work of Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) grouped into 133 chapters, each containing 10 couplets. The higher the grade, the better the model performs in this task. | tamil-kural-classification | exactstrmatch | tamil-templatekuralclassification |
TAMIL-NEWS-CLASSIFICATION tamil-tamilnews-classification.json |
This recipe classifies a static set of Tamil news articles into their respective categories. The higher the grade, the better the model performs in this task. | tamil-tamilnews-classification | exactstrmatch | tamil-templatenewsclassification |
TAMIL-TANGLISH-TWEETS tamil-tanglish-tweets.json |
This recipe is used to perform sentiment analysis on code-mixed Tamil and English tweets. The higher the grade, the better the model performs in this task. | tamil-tanglish-tweets | exactstrmatch | tamil-templatetanglishsa |
TruthfulQA (MCQ) truthfulqa-mcq.json |
TruthfulQA aims to measure the truthfulness of a model. The higher the grade, the more truthful the system is. | truthfulqa-mcq | exactstrmatch | mcq-template |
TruthfulQA truthfulqa-open-ended.json |
TruthfulQA aims to measure the truthfulness of a model. This is an experimental recipe using the ROUGE score. | truthfulqa-open-ended | rougescorer | |
UCI Adult Dataset uciadult.json |
This recipe contains the easy variation of the UCI Adult dataset, where the LLM is expected to classify the attributes correctly. The higher the score, the better the model is. | uciadult | exactstrmatch | uciadult-template |
Winobias winobias.json |
The WinoBias dataset is used to test for representational harm in LLMs. The LLM should refuse to answer. The higher the score, the better the model is. | winobias-type1 | reverseexactstrmatch | |
Winogrande winogrande.json |
To measure the model's ability to answer expert-crafted pronoun-resolution problems that require commonsense reasoning and are designed to be unsolvable for statistical models. The higher the score, the better the model is. | winogrande | exactstrmatch | mcq-template |
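For orientation, below is a minimal sketch of how one of these recipe files could be inspected programmatically. It assumes the recipe JSON exposes fields that mirror the table columns above (`name`, `description`, `datasets`, `metrics`, `prompt_templates`) and that the files live under a local `recipes/` folder; both the field names and the path are assumptions to be checked against the actual files in this repository.

```python
# Illustrative sketch only: summarise a recipe file using fields assumed to
# mirror the table columns above. The field names ("datasets", "metrics",
# "prompt_templates") and the recipes/ path are assumptions, not a documented schema.
import json
from pathlib import Path


def summarise_recipe(path: str) -> None:
    recipe = json.loads(Path(path).read_text(encoding="utf-8"))
    print(recipe.get("name", Path(path).stem))
    print("  description     :", recipe.get("description", ""))
    print("  datasets        :", ", ".join(recipe.get("datasets", [])))
    print("  metrics         :", ", ".join(recipe.get("metrics", [])))
    print("  prompt templates:", ", ".join(recipe.get("prompt_templates", [])))


if __name__ == "__main__":
    summarise_recipe("recipes/advglue.json")  # hypothetical local path
```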