Skip to content

List of Recipes

Name Description Datasets Metrics Prompt Templates
RealtimeQA RealTime QA is a dynamic question answering (QA) platform that inquires about the present. The higher the grade, the better the system is handling hallucination. 1. realtimeqa-past 1. exactstrmatch -
squad-shifts-tnf Zero-shot reading comprehension on paragraphs and questions from squadshifts. Augmented to true/false statement. The higher the grade, the better the model performs this capability. 1. squad-shifts-tnf 1. relaxstrmatch -
MLCommons AI Safety Benchmark - Non-Violent Crimes This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. 1. mlc-nvc-malicious-user
2. mlc-nvc-typical-user
1. llamaguardannotator -
TAMIL-KURAL-CLASSIFICATION This recipe is used to test the comprehension abilities for the Thirukkural. Thirukkural is a classic Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) that are grouped into 133 chapters, each containing 10 couplets. The higher the grade, the better the model performs in this task. 1. tamil-kural-classification 1. exactstrmatch 1. tamil-templatekuralclassification
University Tutorial Questions in Singapore's Legal Classes To measure the model's ability in understanding legal context in Singapore. 1. sg-university-tutorial-questions-legal 1. bertscore 1. answer-template
MLCommons AI Safety Benchmark - Child Abuse and Exploitation This recipe measures the model's responses in dealing with prompts that contain, describe, enable, encourage, or endorse the sexual abuse of children. A high grade indicates that the system under test handles the response well. 1. mlc-cae-malicious-user
2. mlc-cae-typical-user
1. llamaguardannotator -
TAMIL-TANGLISH-TWEETS This recipe is used to perform sentiment analysis on code-mixed Tamil and English tweets. The higher the grade, the better the model performs in this task. 1. tamil-tanglish-tweets 1. exactstrmatch 1. tamil-templatetanglishsa
ChallengingToxicityPrompts This recipe measures the model's ability in generating toxic phrases or sentences when given a toxic phrase. The higher the grade, the less toxic the system is. 1. challenging-toxicity-prompts 1. toxicity-classifier 1. complete-sentence
Singapore Legal Glossary To measure the model's ability in understanding legal term used commonly in Singapore. 1. sg-legal-glossary 1. bertscore 1. legal-term-template
EnronEmail The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. 1. privacy-enronemails 1. leakagerate 1. enronemail-templatea
2. enronemail-templateb
3. enronemail-templatec
4. enronemail-templated
Bias Benchmark for QA This recipe measures the model's ability in attesting social biases against people belonging to protected classes along nine social dimensions relevant for US English-speaking context with age ambiguous and disambiguous questions. The higher the grade, the better the system is in handling social biases. 1. bbq-lite-age-ambiguous
2. bbq-lite-age-disamb
3. bbq-lite-disability-status-ambiguous
4. bbq-lite-disability-status-disamb
5. bbq-lite-gender-ambiguous
6. bbq-lite-gender-disamb
7. bbq-lite-nationality-ambiguous
8. bbq-lite-nationality-disamb
9. bbq-lite-physical-appearance-ambiguous
10. bbq-lite-physical-appearance-disamb
11. bbq-lite-race-ethnicity-ambiguous
12. bbq-lite-race-ethnicity-disamb
13. bbq-lite-race-x-gender-ambiguous
14. bbq-lite-race-x-gender-disamb
15. bbq-lite-race-x-ses-ambiguous
16. bbq-lite-race-x-ses-ambiguous
17. bbq-lite-religion-ambiguous
18. bbq-lite-religion-disamb
19. bbq-lite-ses-ambiguous
20. bbq-lite-ses-disamb
21. bbq-lite-sexual-orientation-ambiguous
22. bbq-lite-sexual-orientation-disamb
1. exactstrmatch 1. mcq-template
Chinese Bias Becnhmark for QA - Lite This recipe measures a system's ability to respond to safety-related, biased, and stereotypical questions in Chinese. A higher grade indicates that the system under test is safer, less biased, and less stereotypical. 1. cbbq-lite-age-ambiguous
2. cbbq-lite-age-disamb
3. cbbq-lite-disability-ambiguous
4. cbbq-lite-disability-disamb
5. cbbq-lite-disease-ambiguous
6. cbbq-lite-disease-disamb
7. cbbq-lite-educational-qualification-ambiguous
8. cbbq-lite-educational-qualification-disamb
9. cbbq-lite-ethnicity-ambiguous
10. cbbq-lite-ethnicity-disamb
11. cbbq-lite-gender-ambiguous
12. cbbq-lite-gender-disamb
13. cbbq-lite-household-registration-ambiguous
14. cbbq-lite-household-registration-disamb
15. cbbq-lite-nationality-ambiguous
16. cbbq-lite-nationality-disamb
17. cbbq-lite-physical-appearance-ambiguous
18. cbbq-lite-physical-appearance-disamb
19. cbbq-lite-race-ambiguous
20. cbbq-lite-race-disamb
21. cbbq-lite-region-ambiguous
22. cbbq-lite-region-disamb
23. cbbq-lite-SES-ambiguous
24. cbbq-lite-SES-disamb
25. cbbq-lite-sexual-orientation-ambiguous
26. cbbq-lite-sexual-orientation-disamb
1. exactstrmatch 1. chinese-mcq-template
Cause and Effect This recipe measure the model's ability in understanding the cause and effect in a sentence. The higher the grade, the better the system is in performing this capability. 1. cause-and-effect 1. relaxstrmatch -
TruthfulQA (MCQ) TruthfulQA aims to measure the truthfulness of a model. The higher the grade, the more truthful the system is. 1. truthfulqa-mcq 1. exactstrmatch 1. mcq-template
GRE Reading Comprehension To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. The higher the grade, the better the system is in performing this capability. 1. gre-reading-comprehension 1. relaxstrmatch 1. answer-template
Safety questions written in Singapore's context This recipe evaluates whether a system under test can perform well when a harmful question is written in Singapore's context. By achieving a high grade, the target demonstrates that it is lower in risk. 1. singapore-safety-questions 1. llamaguardannotator -
COQA Conversational Q&A To measure the ability of machines to understand a text passage and answer a series of interconnected questions. The higher the grade, the better the system performs for this capability. 1. coqa-conversational-qna 1. exactstrmatch 1. answer-template
Bias - Occupation This recipe tests if a targeted system thinks that a certain demographic group can handle a job role better than others. 1. occupation 1. gpt4annotator 1. job-role-bias
MLCommons AI Safety Benchmark - Violent Crimes This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. 1. mlc-vcr-malicious-user
2. mlc-vcr-typical-user
1. llamaguardannotator -
Commonsense Morality (Easy) This recipe measures whether how the system performs when given prompts embedded the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs. 1. commonsense-morality-easy-variation1 1. exactstrmatch -
MLCommons AI Safety Benchmark - Hate This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well. 1. mlc-hat-malicious-user
2. mlc-hat-typical-user
1. llamaguardannotator -
Winograde To measure model's ability in answering commonsense reasoning, expert-crafted pronoun resolution problems that are designed to be unsolvable for statistical models. The higher the score, the better the model is. 1. winogrande 1. exactstrmatch 1. mcq-template
ARC To measure model's ability in answering genuine grade-school level, multiple-choice science questions on the easy and challenge sets. The higher the grade, the better the sytem is performing this capability. 1. arc-challenge
2. arc-easy
1. exactstrmatch 1. mcq-template
Contextual Parametric Knowledge Conflicts This recipe measures the model’s ability in making informed predictions. The higher the grade, the better the system can perform for this capability. 1. contextual-parametric-knowledge-conflicts 1. relaxstrmatch 1. mcq-template
Commonsense Morality (Hard) This recipe measures whether how the system performs when given prompts embedded the body of moral standards and principles that most people intuitively accept. The higher the grade, the better the system performs. 1. commonsense-morality-hard-variation1 1. exactstrmatch -
RealToxicityPrompts (Completion) RealToxicityPrompts aims to measure the model's ability in identify toxic sentences. The higher the grade, the less toxic the system is. 1. real-toxicity-prompts 1. toxicity-classifier 1. real-toxicity-prompt-template2
Analogical Similarity This recipe measures the model’s ability in discriminating between different degrees of analogical similarity in two given episodes. The higher the grade, the better the sytem is performing this capability. 1. analogical-similarity 1. relaxstrmatch 1. analogical-similarity
GSM8K To measure model's ability in solvinggrade school math problems created by human. These problems take between 2 and 8 steps to solve, and usually requires a sequence of elementary calculations. The higher the grade, the better the system can perform this capability. 1. gsm8k 1. exactstrmatch 1. mcq-template
MLCommons AI Safety Benchmark - Sex-related Crimes This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of sex-related crimes. A high grade indicates that the system under test handles the response well. 1. mlc-scr-malicious-user
2. mlc-scr-typical-user
1. llamaguardannotator -
MLCommons AI Safety Benchmark - Chemical, Biological, Radiological, Nuclear, and Explosives This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the creation of weapons of mass destruction. A high grade indicates that the system under test handles the response well. 1. mlc-cbr-malicious-user
2. mlc-cbr-typical-user
1. llamaguardannotator -
Jailbreak-DAN This recipe assesses whether the system will be jailbroken using the common jailbreak methods. The higher the grade, the better the system is protected against common jailbreaks. 1. jailbreak-dan 1. reverseexactstrmatch -
Auto Categorisation This recipe measures the model's ability in categorising elements into groups. The higher the grade, the system performs better in this capability. 1. auto-categorisation 1. relaxstrmatch 1. complete-sentence
Medical MCQ In this recipe, we measure how well a system is able to answer healthcare questions using various datasets related to healthcare. The higher the grade, the better the system performs in answering healthare questions. 1. medmcqa
2. medqa-us
3. pubmedqa
1. exactstrmatch 1. mcq-template
TAMIL-NEWS-CLASSIFICATION This recipe is to classify a static set of Tamil News. The task is to classify news to its respective category. The higher the grade, the better the model performs in this task. 1. tamil-tamilnews-classification 1. exactstrmatch 1. tamil-templatenewsclassification
CValues - The benchmark for measuring alignment in LLMs using Chinese Langauges. This recipe measures the alignment ability of LLMs in both safety and responsibility in Chinese Language. A higher grade means that the LLMs are more aligned to human's values. 1. cvalues 1. exactstrmatch 1. chinese-mcq-template
TruthfulQA TruthfulQA aims to measure the truthfulness of a model. This is an experimental recipe using Rouge Score 1. truthfulqa-open-ended 1. rougescorer 1. mcq-template
Facts about Singapore This recipe evaluates the accuracy of facts about Singapore. A higher score reflects the target's ability to answer questions about Singapore with a high level of precision and detail. By achieving a high grade, the target demonstrates a thorough understanding and knowledge of Singapore's key events, transport system, and facts. 1. singapore-transport-system
2. singapore-iconic-places
3. singapore-political-history
4. singapore-public-housing
5. singapore-facts-tnf
6. singapore-places-tnf
7. singapore-food-tnf
1. exactstrmatch -
AdvGLUE AdvGLUE is a comprehensive robustness evaluation benchmark that concentrates on assessing the adversarial robustness of language models. It encompasses textual adversarial attacks from various perspectives and hierarchies, encompassing word-level transformations and sentence-level manipulations. A higher grade indicates that the system under test is more resilient to changes in the sentences. 1. advglue-all 1. advglue -
MMLU To measure model's ability in answering accurately across 57 tasks including elementary mathemathics, US history, computer science, law, and more. The higher the grade, the better the system performs. 1. mmlu-all 1. exactstrmatch 1. mmlu
HellaSwag To measure commonsense with questions that are trivial for humans but difficult for statistical models. The higher the grade, the better the system performs this capability. 1. hellaswag 1. exactstrmatch 1. mcq-template
UCI Adult Dataset This recipe contains the easy variation of UCI adult dataset where LLM is expected to classify the attributes correctly. The higher the score, the better the model is. 1. uciadult 1. exactstrmatch 1. uciadult-template
MMLU - Medical related questions. MMLU benchmark measures model's ability in answering accurately across 57 tasks including elementary mathemathics, US history, computer science, law, and more. In this recipe, we extracted all the tasks related to healthare. The higher the grade, the better the system performs in answering healthcare related questions. 1. mmlu-anatomy
2. mmlu-clinicalknowledge
3. mmlu-collegebiology
4. mmlu-collegemedicine
5. mmlu-medicalgenetics
6. mmlu-professionalmedicine
1. exactstrmatch 1. mcq-template
MLCommons AI Safety Benchmark - Suicide & Self-Harm This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse acts of intentional self-harm. A high grade indicates that the system under test handles the response well. 1. mlc-ssh-vulnerable-user
2. mlc-ssh-typical-user
1. llamaguardannotator -
Winobias Winobias dataset is used to test representational harm in LLMs. LLM should refuse to answer. The higher the score, the better the model is. 1. winobias-type1 1. reverseexactstrmatch -