List of Recipes
Name | Description | Datasets | Metrics | Prompt Templates |
---|---|---|---|---|
RealtimeQA | RealTime QA is a dynamic question answering (QA) platform that inquires about the present. The higher the grade, the better the system is handling hallucination. | 1. realtimeqa-past | 1. exactstrmatch | - |
squad-shifts-tnf | Zero-shot reading comprehension on paragraphs and questions from SquadShifts, augmented into true/false statements. The higher the grade, the better the model performs this capability. | 1. squad-shifts-tnf | 1. relaxstrmatch | - |
MLCommons AI Safety Benchmark - Non-Violent Crimes | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | 1. mlc-nvc-malicious-user 2. mlc-nvc-typical-user | 1. llamaguardannotator | - |
TAMIL-KURAL-CLASSIFICATION | This recipe tests the model's comprehension of the Thirukkural, a classic Tamil literary work composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) grouped into 133 chapters of 10 couplets each. The higher the grade, the better the model performs in this task. | 1. tamil-kural-classification | 1. exactstrmatch | 1. tamil-templatekuralclassification |
University Tutorial Questions in Singapore's Legal Classes | To measure the model's ability in understanding the legal context in Singapore. | 1. sg-university-tutorial-questions-legal | 1. bertscore | 1. answer-template |
MLCommons AI Safety Benchmark - Child Abuse and Exploitation | This recipe measures the model's responses in dealing with prompts that contain, describe, enable, encourage, or endorse the sexual abuse of children. A high grade indicates that the system under test handles the response well. | 1. mlc-cae-malicious-user 2. mlc-cae-typical-user | 1. llamaguardannotator | - |
TAMIL-TANGLISH-TWEETS | This recipe is used to perform sentiment analysis on code-mixed Tamil and English tweets. The higher the grade, the better the model performs in this task. | 1. tamil-tanglish-tweets | 1. exactstrmatch | 1. tamil-templatetanglishsa |
ChallengingToxicityPrompts | This recipe measures the model's tendency to generate toxic phrases or sentences when given a toxic phrase. The higher the grade, the less toxic the system is. | 1. challenging-toxicity-prompts | 1. toxicity-classifier | 1. complete-sentence |
Singapore Legal Glossary | To measure the model's ability in understanding legal terms commonly used in Singapore. | 1. sg-legal-glossary | 1. bertscore | 1. legal-term-template |
EnronEmail | The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. | 1. privacy-enronemails | 1. leakagerate | 1. enronemail-templatea 2. enronemail-templateb 3. enronemail-templatec 4. enronemail-templated |
Bias Benchmark for QA | This recipe measures the model's ability to handle social biases against people belonging to protected classes along nine social dimensions relevant to a US English-speaking context, with both ambiguous and disambiguated questions. The higher the grade, the better the system is at handling social biases. | 1. bbq-lite-age-ambiguous 2. bbq-lite-age-disamb 3. bbq-lite-disability-status-ambiguous 4. bbq-lite-disability-status-disamb 5. bbq-lite-gender-ambiguous 6. bbq-lite-gender-disamb 7. bbq-lite-nationality-ambiguous 8. bbq-lite-nationality-disamb 9. bbq-lite-physical-appearance-ambiguous 10. bbq-lite-physical-appearance-disamb 11. bbq-lite-race-ethnicity-ambiguous 12. bbq-lite-race-ethnicity-disamb 13. bbq-lite-race-x-gender-ambiguous 14. bbq-lite-race-x-gender-disamb 15. bbq-lite-race-x-ses-ambiguous 16. bbq-lite-race-x-ses-disamb 17. bbq-lite-religion-ambiguous 18. bbq-lite-religion-disamb 19. bbq-lite-ses-ambiguous 20. bbq-lite-ses-disamb 21. bbq-lite-sexual-orientation-ambiguous 22. bbq-lite-sexual-orientation-disamb | 1. exactstrmatch | 1. mcq-template |
Chinese Bias Benchmark for QA - Lite | This recipe measures a system's ability to respond to safety-related, biased, and stereotypical questions in Chinese. A higher grade indicates that the system under test is safer, less biased, and less stereotypical. | 1. cbbq-lite-age-ambiguous 2. cbbq-lite-age-disamb 3. cbbq-lite-disability-ambiguous 4. cbbq-lite-disability-disamb 5. cbbq-lite-disease-ambiguous 6. cbbq-lite-disease-disamb 7. cbbq-lite-educational-qualification-ambiguous 8. cbbq-lite-educational-qualification-disamb 9. cbbq-lite-ethnicity-ambiguous 10. cbbq-lite-ethnicity-disamb 11. cbbq-lite-gender-ambiguous 12. cbbq-lite-gender-disamb 13. cbbq-lite-household-registration-ambiguous 14. cbbq-lite-household-registration-disamb 15. cbbq-lite-nationality-ambiguous 16. cbbq-lite-nationality-disamb 17. cbbq-lite-physical-appearance-ambiguous 18. cbbq-lite-physical-appearance-disamb 19. cbbq-lite-race-ambiguous 20. cbbq-lite-race-disamb 21. cbbq-lite-region-ambiguous 22. cbbq-lite-region-disamb 23. cbbq-lite-SES-ambiguous 24. cbbq-lite-SES-disamb 25. cbbq-lite-sexual-orientation-ambiguous 26. cbbq-lite-sexual-orientation-disamb | 1. exactstrmatch | 1. chinese-mcq-template |
Cause and Effect | This recipe measures the model's ability to understand cause and effect in a sentence. The higher the grade, the better the system is in performing this capability. | 1. cause-and-effect | 1. relaxstrmatch | - |
TruthfulQA (MCQ) | TruthfulQA aims to measure the truthfulness of a model. The higher the grade, the more truthful the system is. | 1. truthfulqa-mcq | 1. exactstrmatch | 1. mcq-template |
GRE Reading Comprehension | To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. The higher the grade, the better the system is in performing this capability. | 1. gre-reading-comprehension | 1. relaxstrmatch | 1. answer-template |
Safety questions written in Singapore's context | This recipe evaluates whether a system under test can perform well when a harmful question is written in Singapore's context. By achieving a high grade, the target demonstrates that it is lower in risk. | 1. singapore-safety-questions | 1. llamaguardannotator | - |
COQA Conversational Q&A | To measure the ability of machines to understand a text passage and answer a series of interconnected questions. The higher the grade, the better the system performs for this capability. | 1. coqa-conversational-qna | 1. exactstrmatch | 1. answer-template |
Bias - Occupation | This recipe tests if a targeted system thinks that a certain demographic group can handle a job role better than others. | 1. occupation | 1. gpt4annotator | 1. job-role-bias |
MLCommons AI Safety Benchmark - Violent Crimes | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | 1. mlc-vcr-malicious-user 2. mlc-vcr-typical-user | 1. llamaguardannotator | - |
Commonsense Morality (Easy) | This recipe measures how the system performs when given prompts that embed the body of moral standards and principles most people intuitively accept. The higher the grade, the better the system performs. | 1. commonsense-morality-easy-variation1 | 1. exactstrmatch | - |
MLCommons AI Safety Benchmark - Hate | This recipe measures the model's responses in dealing with prompts that demean or dehumanize people on the basis of their sensitive, personal characteristics. A high grade indicates that the system under test handles the response well. | 1. mlc-hat-malicious-user 2. mlc-hat-typical-user | 1. llamaguardannotator | - |
Winogrande | To measure the model's commonsense reasoning on expert-crafted pronoun-resolution problems that are designed to be unsolvable for statistical models. The higher the score, the better the model is. | 1. winogrande | 1. exactstrmatch | 1. mcq-template |
ARC | To measure the model's ability in answering genuine grade-school level, multiple-choice science questions on the easy and challenge sets. The higher the grade, the better the system is performing this capability. | 1. arc-challenge 2. arc-easy | 1. exactstrmatch | 1. mcq-template |
Contextual Parametric Knowledge Conflicts | This recipe measures the model’s ability in making informed predictions. The higher the grade, the better the system can perform for this capability. | 1. contextual-parametric-knowledge-conflicts | 1. relaxstrmatch | 1. mcq-template |
Commonsense Morality (Hard) | This recipe measures how the system performs when given prompts that embed the body of moral standards and principles most people intuitively accept. The higher the grade, the better the system performs. | 1. commonsense-morality-hard-variation1 | 1. exactstrmatch | - |
RealToxicityPrompts (Completion) | RealToxicityPrompts aims to measure the model's tendency to generate toxic sentences when completing a prompt. The higher the grade, the less toxic the system is. | 1. real-toxicity-prompts | 1. toxicity-classifier | 1. real-toxicity-prompt-template2 |
Analogical Similarity | This recipe measures the model's ability in discriminating between different degrees of analogical similarity in two given episodes. The higher the grade, the better the system is performing this capability. | 1. analogical-similarity | 1. relaxstrmatch | 1. analogical-similarity |
GSM8K | To measure the model's ability in solving grade-school math problems created by humans. These problems take between 2 and 8 steps to solve and usually require a sequence of elementary calculations. The higher the grade, the better the system can perform this capability. | 1. gsm8k | 1. exactstrmatch | 1. mcq-template |
MLCommons AI Safety Benchmark - Sex-related Crimes | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of sex-related crimes. A high grade indicates that the system under test handles the response well. | 1. mlc-scr-malicious-user 2. mlc-scr-typical-user | 1. llamaguardannotator | - |
MLCommons AI Safety Benchmark - Chemical, Biological, Radiological, Nuclear, and Explosives | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the creation of weapons of mass destruction. A high grade indicates that the system under test handles the response well. | 1. mlc-cbr-malicious-user 2. mlc-cbr-typical-user | 1. llamaguardannotator | - |
Jailbreak-DAN | This recipe assesses whether the system can be jailbroken using common jailbreak methods. The higher the grade, the better the system is protected against common jailbreaks. | 1. jailbreak-dan | 1. reverseexactstrmatch | - |
Auto Categorisation | This recipe measures the model's ability in categorising elements into groups. The higher the grade, the better the system performs in this capability. | 1. auto-categorisation | 1. relaxstrmatch | 1. complete-sentence |
Medical MCQ | In this recipe, we measure how well a system is able to answer healthcare questions using various datasets related to healthcare. The higher the grade, the better the system performs in answering healthcare questions. | 1. medmcqa 2. medqa-us 3. pubmedqa | 1. exactstrmatch | 1. mcq-template |
TAMIL-NEWS-CLASSIFICATION | This recipe classifies a static set of Tamil news articles into their respective categories. The higher the grade, the better the model performs in this task. | 1. tamil-tamilnews-classification | 1. exactstrmatch | 1. tamil-templatenewsclassification |
CValues - The benchmark for measuring alignment in LLMs using the Chinese language. | This recipe measures the alignment of LLMs in both safety and responsibility in the Chinese language. A higher grade means that the LLM is more aligned with human values. | 1. cvalues | 1. exactstrmatch | 1. chinese-mcq-template |
TruthfulQA | TruthfulQA aims to measure the truthfulness of a model. This is an experimental recipe using the ROUGE score. | 1. truthfulqa-open-ended | 1. rougescorer | 1. mcq-template |
Facts about Singapore | This recipe evaluates the accuracy of facts about Singapore. A higher score reflects the target's ability to answer questions about Singapore with a high level of precision and detail. By achieving a high grade, the target demonstrates a thorough understanding and knowledge of Singapore's key events, transport system, and facts. | 1. singapore-transport-system 2. singapore-iconic-places 3. singapore-political-history 4. singapore-public-housing 5. singapore-facts-tnf 6. singapore-places-tnf 7. singapore-food-tnf | 1. exactstrmatch | - |
AdvGLUE | AdvGLUE is a comprehensive robustness evaluation benchmark that concentrates on assessing the adversarial robustness of language models. It encompasses textual adversarial attacks from various perspectives and hierarchies, including word-level transformations and sentence-level manipulations. A higher grade indicates that the system under test is more resilient to changes in the sentences. | 1. advglue-all | 1. advglue | - |
MMLU | To measure the model's ability in answering accurately across 57 tasks including elementary mathematics, US history, computer science, law, and more. The higher the grade, the better the system performs. | 1. mmlu-all | 1. exactstrmatch | 1. mmlu |
HellaSwag | To measure commonsense with questions that are trivial for humans but difficult for statistical models. The higher the grade, the better the system performs this capability. | 1. hellaswag | 1. exactstrmatch | 1. mcq-template |
UCI Adult Dataset | This recipe contains the easy variation of the UCI Adult dataset, where the LLM is expected to classify the attributes correctly. The higher the score, the better the model is. | 1. uciadult | 1. exactstrmatch | 1. uciadult-template |
MMLU - Medical related questions | The MMLU benchmark measures the model's ability in answering accurately across 57 tasks including elementary mathematics, US history, computer science, law, and more. In this recipe, we extracted all the tasks related to healthcare. The higher the grade, the better the system performs in answering healthcare-related questions. | 1. mmlu-anatomy 2. mmlu-clinicalknowledge 3. mmlu-collegebiology 4. mmlu-collegemedicine 5. mmlu-medicalgenetics 6. mmlu-professionalmedicine | 1. exactstrmatch | 1. mcq-template |
MLCommons AI Safety Benchmark - Suicide & Self-Harm | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse acts of intentional self-harm. A high grade indicates that the system under test handles the response well. | 1. mlc-ssh-vulnerable-user 2. mlc-ssh-typical-user | 1. llamaguardannotator | - |
Winobias | The WinoBias dataset is used to test representational harm in LLMs; the LLM should refuse to answer. The higher the score, the better the model is. | 1. winobias-type1 | 1. reverseexactstrmatch | - |
CLCC Recipe | This recipe tests the model's Chinese language capability. | 1. clcc | 1. exactstrmatch | - |
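
Most of the recipes above grade responses with simple string-match metrics (`exactstrmatch`, `relaxstrmatch`) against a dataset of expected answers. As a rough illustration of the difference between the two, here is a minimal Python sketch; the function names and the normalisation rules for the relaxed match are assumptions made for illustration, not the framework's actual implementation.

```python
# Illustrative sketch only: the exactstrmatch / relaxstrmatch metrics named in
# the table live in the evaluation framework; the normalisation rules below
# are assumptions chosen to show the difference between the two approaches.
import string


def exact_str_match(predicted: str, target: str) -> bool:
    """Scores a hit only when the model output equals the expected answer."""
    return predicted == target


def relaxed_str_match(predicted: str, target: str) -> bool:
    """Tolerates case, surrounding whitespace, and edge punctuation,
    so 'Paris.' still matches a target of 'paris'."""
    def normalise(s: str) -> str:
        return s.strip().strip(string.punctuation).lower()
    return normalise(predicted) == normalise(target)


# A recipe's grade is then the fraction of prompts whose response matches
# the expected answer under the chosen metric.
cases = [("Paris.", "paris"), ("4", "4"), ("true ", "True")]
for predicted, target in cases:
    print(exact_str_match(predicted, target), relaxed_str_match(predicted, target))
```

Reverse metrics such as `reverseexactstrmatch` (used by Jailbreak-DAN and Winobias) presumably invert this logic, scoring a hit when the response does not reproduce the disallowed string, while annotator-based metrics such as `llamaguardannotator` delegate grading to a safety-classifier model rather than to string comparison.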