| Name | Description | License | Reference |
| --- | --- | --- | --- |
| squad-shifts-tnf | Zero-shot reading comprehension on paragraphs and questions from SQuADShifts. | - | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/squad_shifts |
| BBQ-lite on religion - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on religion, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
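In BBQ, the ambiguous questions are those whose context does not give enough information to answer, so the correct response is the "unknown" option; the disambiguated questions add context that makes one answer correct. Below is a minimal scoring sketch under that assumption for the BBQ-lite entries in this table; the record fields (`context`, `question`, `choices`, `label`) and the `ask_model` callable are hypothetical placeholders for whatever harness loads the data and queries the model.

```python
# Minimal sketch of scoring BBQ-lite ambiguous questions, where the gold
# answer is always the "unknown" option. The field names and the `ask_model`
# callable are hypothetical placeholders, not a real harness API.

def score_ambiguous(records, ask_model):
    """Return accuracy: the fraction of items where the model abstains correctly."""
    correct = 0
    for rec in records:
        pred = ask_model(rec["context"], rec["question"], rec["choices"])
        if pred == rec["label"]:  # `label` indexes the "unknown" option here
            correct += 1
    return correct / len(records)
```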
| Name | Description | License | Reference |
| --- | --- | --- | --- |
| advglue | Adversarial GLUE (AdvGLUE) is a comprehensive benchmark focused on the adversarial robustness of language models. | - | https://github.com/AI-secure/adversarial-glue |
| Food in Singapore | Contains prompts that test the model's understanding of food in Singapore, in True/False format. | Apache-2.0 | IMDA |
| MedMCQA | MedMCQA is a large-scale, multiple-choice question answering dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects. | MIT License | https://github.com/MedMCQA/MedMCQA |
| tamil-thirukural | This dataset is used to test comprehension of the Thirukkural, a classic work of Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) grouped into 133 chapters, each containing 10 couplets. | Creative Commons Attribution 4.0 International | https://github.com/vijayanandrp/Thirukkural-Tamil-Dataset |
| RealtimeQA | RealTime QA is a dynamic question answering (QA) platform that inquires about the present. | - | https://github.com/realtimeqa/realtimeqa_public |
| sg-university-tutorial-questions-legal | Contains tutorial questions and answers from Singapore's universities to test the model's ability to understand the legal context in Singapore. | - | - |
| Chinese Version - Bias Benchmark for QA | This dataset aims to measure social biases across 9 categories in the Chinese language. | - | https://github.com/YFHuangxxxx/CBBQ |
| BBQ-lite on gender - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on gender, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| ARC-Challenge | A dataset of genuine grade-school level, multiple-choice science questions for advanced question answering. This is the challenge set. | CC BY-SA | https://allenai.org/data/arc |
| BBQ-lite on nationality - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on nationality, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| m3exam-vietnamese | This dataset contains local exam questions for testing Vietnamese capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam |
| BBQ-lite on physical-appearance - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on physical-appearance, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| tanglish-tweets-SA | Code-mixed Tamil and English tweets curated for the sentiment analysis task. | CC0: Public Domain | - |
| Ethics-Commonsense-Hard | This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ |
| uciadult | The UCI Adult dataset has been widely used to assess fairness. | Creative Commons Attribution 4.0 International | https://archive.ics.uci.edu/dataset/2/adult |
| BBQ-lite on age - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on age, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| MLC Baseline - Suicide & Self-Harm | Responses that enable, encourage, or endorse acts of intentional self-harm. | - | - |
| sg-legal-glossary | A list of Singapore legal terms extracted from the SICC and Judiciary websites. | - | https://www.sicc.gov.sg/glossary-of-legal-terms |
| MLC Baseline - Hate | Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. | - | - |
| BBQ-lite on age - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on age, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| cause_and_effect | This recipe measures whether the model is able to distinguish and understand cause and effect. The higher the grade, the better the system performs this capability. | - | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cause_and_effect |
| BBQ-lite on race-x-gender - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-gender, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| TruthfulQA (MCQ Version) | TruthfulQA aims to measure the truthfulness of a model. This dataset uses a multiple-choice format. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA |
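Multiple-choice truthfulness is commonly scored by likelihood: the model is credited when the truthful option receives the highest conditional log-probability. A minimal sketch with Hugging Face `transformers` follows; the `gpt2` checkpoint is only an illustration, and the prompt/choice token split is approximate at the tokenization boundary.

```python
# Hedged sketch: likelihood-based scoring of a multiple-choice item, the
# common way MC-style accuracy is computed. The gpt2 checkpoint is only an
# illustration; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Sum of token log-probs of `choice` conditioned on `question`."""
    prompt_len = tok(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix; keep only the choice tokens.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    return token_lp[prompt_len - 1:].sum().item()

def pick_answer(question: str, choices: list[str]) -> int:
    """Index of the choice the model assigns the highest log-probability."""
    return max(range(len(choices)),
               key=lambda i: choice_logprob(question, choices[i]))
```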
| Name | Description | License | Reference |
| --- | --- | --- | --- |
| PubMedQA | PubMedQA is a biomedical question answering dataset collected from PubMed abstracts. The task is to answer research questions with yes/no/maybe using the corresponding abstracts. | MIT License | https://github.com/pubmedqa/pubmedqa |
| Facts about Singapore in True and False | Contains prompts with facts about Singapore, in True/False format. | Apache-2.0 | IMDA |
| RealToxicityPrompts | RealToxicityPrompts aims to measure the model's ability to identify toxic sentences. | Apache-2.0 | https://github.com/allenai/real-toxicity-prompts/ |
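Toxicity of model continuations is typically scored with an external classifier. The sketch below uses the open-source `detoxify` package as a stand-in scorer (the original RealToxicityPrompts work scored with the Perspective API); the 0.5 threshold is an arbitrary assumption.

```python
# Hedged sketch: scoring continuations for toxicity with the open-source
# `detoxify` classifier (pip install detoxify). Detoxify is a convenient
# local stand-in for the Perspective API used in the original work, and
# the 0.5 threshold is an arbitrary assumption.
from detoxify import Detoxify

scorer = Detoxify("original")

def fraction_toxic(continuations: list[str], threshold: float = 0.5) -> float:
    """Fraction of continuations whose toxicity score crosses the threshold."""
    scores = scorer.predict(continuations)["toxicity"]
    return sum(s >= threshold for s in scores) / len(continuations)
```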
| Name | Description | License | Reference |
| --- | --- | --- | --- |
| gre_reading_comprehension | To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. | - | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/gre_reading_comprehension |
| MMLU College Medicine | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on college medicine. | MIT License | https://github.com/hendrycks/test |
| BBQ-lite on race-x-ses - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-ses, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| Singapore Political History | Contains questions about key historical events in Singapore's political history. | Apache-2.0 | IMDA |
| ChallengingToxicityPrompts | ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences. | CC BY-SA 4.0 | https://github.com/AI-secure/DecodingTrust |
| MLC Baseline - Violent Crimes | Responses that enable, encourage, or endorse the commission of violent crimes. | - | - |
| MMLU Clinical Knowledge | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on clinical knowledge. | MIT License | https://github.com/hendrycks/test |
| coqa_conversational_question_answering | To measure the ability of machines to understand a text passage and answer a series of interconnected questions. | - | - |
| Singapore Public Housing | Contains questions about Singapore's public housing system. | Apache-2.0 | IMDA |
| BBQ-lite on race-x-ses - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-ses, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives | Responses that enable, encourage, or endorse the creation of weapons of mass destruction. | - | - |
| BBQ-lite on gender - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on gender, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| BBQ-lite on sexual-orientation - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on sexual-orientation, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| Winogrande | This dataset is used for commonsense reasoning: expert-crafted pronoun resolution problems designed to be unsolvable for statistical models. | Apache-2.0 | https://github.com/allenai/winogrande |
| Contextual Parametric Knowledge Conflicts | To measure the model's ability to make informed predictions. | - | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/contextual_parametric_knowledge_conflicts |
| MMLU Anatomy | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on anatomy. | MIT License | https://github.com/hendrycks/test |
| Singapore Transport System | Contains questions about Singapore's transport system. | Apache-2.0 | IMDA |
| truthfulqa | TruthfulQA aims to measure the truthfulness of a model. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA |
| ARC-Easy | A dataset of genuine grade-school level, multiple-choice science questions for advanced question answering. This is the easy set. | CC BY-SA | https://allenai.org/data/arc |
| Iconic Places in Singapore | Contains questions about Singapore's iconic places. | Apache-2.0 | IMDA |
| MLC Baseline - Sex-Related Crimes | Responses that enable, encourage, or endorse the commission of sex-related crimes. | - | - |
| Analogical Similarity | To measure the model's ability to discriminate between different degrees of analogical similarity in two given episodes. | - | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/analogical_similarity |
| BBQ-lite on race-ethnicity - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-ethnicity, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| GSM8K | This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations. | MIT License | https://github.com/openai/grade-school-math |
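GSM8K reference solutions end with a final line of the form `#### <number>`, so exact-match evaluation usually reduces to extracting that number from the reference and from the model's output. A minimal sketch:

```python
# Hedged sketch: GSM8K reference solutions end with "#### <number>", so
# exact-match scoring extracts the final number from both the reference
# and the model output and compares them.
import re

def final_number(text: str) -> str | None:
    m = re.search(r"####\s*(-?[\d,.]+)", text)
    if m:
        return m.group(1).replace(",", "")
    # Fallback for model outputs without the marker: take the last number.
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return nums[-1].replace(",", "") if nums else None

def is_correct(model_output: str, reference: str) -> bool:
    return final_number(model_output) == final_number(reference)
```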
| Name | Description | License | Reference |
| --- | --- | --- | --- |
| MLC Baseline - Child Abuse and Exploitation | Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. | - | - |
| m3exam-indonesian | This dataset contains local exam questions for testing Indonesian capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam |
| BBQ-lite on race-x-gender - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-gender, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| BBQ-lite on disability-status - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on disability-status, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| MMLU College Biology | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on college biology. | MIT License | https://github.com/hendrycks/test |
| BBQ-lite on religion - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on religion, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| MLC Baseline - Non-Violent Crimes | Responses that enable, encourage, or endorse the commission of non-violent crimes. | - | - |
| List of Occupations | This dataset contains a list of occupations collected from various websites (e.g., Singapore's Ministry of Manpower). The dataset can be used to test job-related applications. | MIT License | IMDA |
| In-the-wild Jailbreak Dataset | This dataset contains jailbreak prompts that were collected from various places (e.g., Discord). | MIT License | https://github.com/verazuo/jailbreak_llms |
| auto_categorization | To measure the model's ability to categorise elements into groups. | - | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/auto_categorization |
| BBQ-lite on sexual-orientation - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on sexual-orientation, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| tamil-news-classification | This dataset is used to classify a static set of Tamil news. The task is to classify news into its respective category. The dataset has 6 news categories: "tamilnadu", "india", "cinema", "sports", "politics", "world". | GNU General Public License v3.0 | - |
| C-Values - Values of Chinese Large Language Models | This dataset contains prompts that measure the alignment of LLMs on both safety and responsibility in the Chinese language. | Apache-2.0 | - |
| BBQ-lite on physical-appearance - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on physical-appearance, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| enronemail | The Enron email dataset is aimed at evaluating whether there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. | Apache-2.0 | https://github.com/ParakweetLabs/EmailIntentDataSet |
| m3exam-thai | This dataset contains local exam questions for testing Thai capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam |
| MMLU Medical Genetics | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on medical genetics. | MIT License | https://github.com/hendrycks/test |
| BBQ-lite on ses - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on ses, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| Places in Singapore | Contains prompts that test the model's understanding of places in Singapore, in True/False format. | Apache-2.0 | IMDA |
| BBQ-lite on nationality - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on nationality, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| MMLU Professional Medicine | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on professional medicine. | MIT License | https://github.com/hendrycks/test |
| BBQ-lite on race-ethnicity - Ambiguous Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-ethnicity, containing only the ambiguous questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| winobias-variation1 | This dataset probes gender bias using professions from the Labor Force Statistics (https://www.bls.gov/cps/cpsaat11.htm), which show skewed gender distributions. | MIT License | https://github.com/uclanlp/corefBias/tree/master/WinoBias/wino |
| HellaSwag | This dataset is used to evaluate commonsense reasoning with questions that are trivial for humans but difficult for state-of-the-art models. | MIT License | https://github.com/rowanz/hellaswag |
| BBQ-lite on ses - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on ses, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| Safety Benchmark (Singapore Context) | Contains prompts that test safety in the Singapore context. | Apache-2.0 | IMDA |
| MedQA (US) | MedQA is a free-form multiple-choice OpenQA dataset for solving medical problems, collected from professional medical board exams. This entry contains the MCQ questions from the US subset. | MIT License | https://github.com/jind11/MedQA |
| uciadult | The UCI Adult dataset, created in 1996, is used to train models to predict whether a person's income will exceed $50K/yr based on census data. Also known as the "Census Income" dataset. | Creative Commons Attribution 4.0 International | https://archive.ics.uci.edu/dataset/2/adult |
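As the uciadult entries note, this dataset is a common fairness testbed. A minimal demographic-parity check compares the rate of `>50K` labels across the `sex` groups; the sketch below assumes the dataset's documented column layout and the classic UCI download URL.

```python
# Hedged sketch: a demographic parity check on UCI Adult -- compare the
# rate of ">50K" labels across the `sex` groups. Column names follow the
# dataset's documented layout; the URL is the classic UCI mirror.
import pandas as pd

COLS = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "income"]

df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=COLS, skipinitialspace=True)

rates = df.groupby("sex")["income"].apply(lambda s: (s == ">50K").mean())
print(rates)
print("demographic parity difference:", abs(rates.max() - rates.min()))
```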
| Name | Description | License | Reference |
| --- | --- | --- | --- |
| MMLU | This dataset covers 57 tasks including elementary mathematics, US history, computer science, law, and more. | MIT License | https://github.com/hendrycks/test |
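MMLU items are four-option multiple-choice questions; zero-shot evaluation typically renders each item as the question followed by lettered options and asks for a single letter. The template below is one common convention, not a canonical format.

```python
# Hedged sketch: formatting an MMLU item into a zero-shot prompt. MMLU items
# are four-option multiple choice; the exact template varies by harness, so
# this layout is just one common convention.
LETTERS = ["A", "B", "C", "D"]

def format_mmlu(question: str, choices: list[str], subject: str) -> str:
    lines = [f"The following is a multiple choice question about {subject}.",
             "", question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines += ["", "Answer:"]
    return "\n".join(lines)

print(format_mmlu(
    "Which structure carries oxygenated blood from the lungs to the heart?",
    ["Pulmonary artery", "Pulmonary vein", "Aorta", "Vena cava"],
    "anatomy"))
```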
| Name | Description | License | Reference |
| --- | --- | --- | --- |
| BBQ-lite on disability-status - Disambiguated Questions | This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on disability-status, containing only the disambiguated questions. | CC-BY-4.0 | https://arxiv.org/pdf/2110.08193v2 |
| Chinese Linguistics & Cognition Challenge (CLCC) | This dataset is a subjective benchmark created by the BAAI FlagEval group. | CC-BY-4.0 | https://flageval.baai.ac.cn/ |