List of Datasets

Name Description License Reference
squad-shifts-tnf Zero-shot reading comprehension on paragraphs and questions from squadshifts - https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/squad_shifts
BBQ-lite on religion - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on religion, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
advglue Adversarial GLUE Benchmark (AdvGLUE) is a comprehensive robustness evaluation benchmark that focuses on the adversarial robustness evaluation of language models. - https://github.com/AI-secure/adversarial-glue
Food in Singapore Contains prompts that test the model's understanding of food, in True/False format Apache-2.0 IMDA
MedMCQA MedMCQA is a large-scale, multiple-choice question answering dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects. MIT License https://github.com/MedMCQA/MedMCQA
tamil-thirukural This dataset is used to test comprehension of the Thirukkural. The Thirukkural is a classic work of Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) that are grouped into 133 chapters, each containing 10 couplets. Creative Commons Attribution 4.0 International https://github.com/vijayanandrp/Thirukkural-Tamil-Dataset
RealtimeQA RealTime QA is a dynamic question answering (QA) platform that inquires about the present. - https://github.com/realtimeqa/realtimeqa_public
sg-university-tutorial-questions-legal Contains tutorial questions and answers from Singapore's universities to test the model's ability to understand the legal context in Singapore - -
Chinese Version - Bias Benchmark for QA This aims to measure social biases across 9 different categories in the Chinese language. - https://github.com/YFHuangxxxx/CBBQ
BBQ-lite on gender - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on gender, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
ARC-Challenge A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the challenge set. CC BY-SA https://allenai.org/data/arc
BBQ-lite on nationality - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on nationality, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
m3exam-vietnamese This dataset contains local exam questions for testing Vietnamese capability. CC BY-NC-SA https://github.com/DAMO-NLP-SG/M3Exam
BBQ-lite on physical-appearance - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on physical-appearance, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
tanglish-tweets-SA Code-mixed Tamil and English tweets curated for the sentiment analysis task. CC0: Public Domain -
Ethics-Commonsense-Hard This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
uciadult The UCI Adult dataset has been widely used to assess fairness. Creative Commons Attribution 4.0 International https://archive.ics.uci.edu/dataset/2/adult
BBQ-lite on age - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on age, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
MLC Baseline - Suicide & Self-Harm Responses that enable, encourage, or endorse acts of intentional self-harm. - -
sg-legal-glossary A list of Singapore legal terms extracted from the SICC and Judiciary websites. - https://www.sicc.gov.sg/glossary-of-legal-terms
MLC Baseline - Hate Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. - -
BBQ-lite on age - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on age, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
cause_and_effect This recipe measures if the model is able to distinguish and understand cause and effect. The higher the grade, the better the system can perform this capability. - https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cause_and_effect
BBQ-lite on race-x-gender - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on race-x-gender, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
TruthfulQA (MCQ Version) TruthfulQA aims to measure the truthfulness of a model. This dataset uses a multiple choice format. Apache-2.0 https://github.com/sylinrl/TruthfulQA
Ethics-Commonsense-Hard This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
PubMedQA PubMedQA is a biomedical question answering dataset built from PubMed abstracts. The task is to answer research questions with yes, no, or maybe. MIT License https://github.com/pubmedqa/pubmedqa
Facts about Singapore in True and False Contains prompts stating facts about Singapore, in True/False format Apache-2.0 IMDA
RealToxicityPrompts RealToxicityPrompts aims to measure the model's ability to identify toxic sentences. Apache-2.0 https://github.com/allenai/real-toxicity-prompts/
gre_reading_comprehension To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. - https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/gre_reading_comprehension
MMLU College Medicine MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Medicine. MIT License https://github.com/hendrycks/test
BBQ-lite on race-x-ses - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on race-x-ses, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
Singapore Political History Contains questions about key events in Singapore's political history. Apache-2.0 IMDA
ChallengingToxicityPrompts ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences. CC BY-SA 4.0 https://github.com/AI-secure/DecodingTrust
MLC Baseline - Violent Crimes Responses that enable, encourage, or endorse the commission of violent crimes. - -
MMLU Clinical Knowledge MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Clinical Knowledge. MIT License https://github.com/hendrycks/test
coqa_conversational_question_answering To measure the ability of machines to understand a text passage and answer a series of interconnected questions - -
Singapore Public Housing Contains questions about Singapore's public housing system Apache-2.0 IMDA
ChallengingToxicityPrompts ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences. CC BY-SA 4.0 https://github.com/AI-secure/DecodingTrust
BBQ-lite on race-x-ses - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on race-x-ses, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives Responses that enable, encourage, or endorse the creation of weapons of mass destruction. - -
BBQ-lite on gender - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on gender, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on sexual-orientation - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on sexual-orientation, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives Responses that enable, encourage, or endorse the creation of weapons of mass destruction. - -
MLC Baseline - Hate Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. - -
Winogrande This dataset tests commonsense reasoning with expert-crafted pronoun resolution problems designed to be unsolvable for statistical models. Apache-2.0 https://github.com/allenai/winogrande
Contextual Parametric Knowledge Conflicts To measure the model’s ability to make informed predictions - https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/contextual_parametric_knowledge_conflicts
MMLU Anatomy MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Anatomy. MIT License https://github.com/hendrycks/test
Singapore Transport System Contains questions about Singapore's transport system Apache-2.0 IMDA
truthfulqa TruthfulQA aims to measure the truthfulness of a model. Apache-2.0 https://github.com/sylinrl/TruthfulQA
ARC-Easy A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the easy set. CC BY-SA https://allenai.org/data/arc
Iconic Places in Singapore Contains questions about Singapore's iconic places. Apache-2.0 IMDA
MLC Baseline - Sex-Related Crimes Responses that enable, encourage, or endorse the commission of sex-related crimes. - -
Analogical Similarity To measure the model’s ability to discriminate between different degrees of analogical similarity in two given episodes - https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/analogical_similarity
BBQ-lite on race-ethnicity - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on race-ethnicity, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
GSM8K This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations. MIT License https://github.com/openai/grade-school-math
MLC Baseline - Child Abuse and Exploitation Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. - -
m3exam-indonesian This dataset contains local exam questions for testing Indonesian capability. CC BY-NC-SA https://github.com/DAMO-NLP-SG/M3Exam
BBQ-lite on race-x-gender - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on race-x-gender, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
MLC Baseline - Sex-Related Crimes Responses that enable, encourage, or endorse the commission of sex-related crimes. - -
BBQ-lite on disability-status - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on disability-status, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
MMLU College Biology MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Biology. MIT License https://github.com/hendrycks/test
ChallengingToxicityPrompts ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences. CC BY-SA 4.0 https://github.com/AI-secure/DecodingTrust
BBQ-lite on religion - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on religion, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
MLC Baseline - Child Abuse and Exploitation Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. - -
MLC Baseline - Non-Violent Crimes Responses that enable, encourage, or endorse the commission of non-violent crimes. - -
MLC Baseline - Non-Violent Crimes Responses that enable, encourage, or endorse the commission of non-violent crimes. - -
List of Occupations This dataset contains a list of occupations collected from various websites (e.g., MoM of Singapore). The dataset can be used to test job-related applications. MIT License IMDA
In-the-wild Jailbreak Dataset This dataset contains jailbreak prompts that were collected from various places (e.g., Discord). MIT license https://github.com/verazuo/jailbreak_llms
MLC Baseline - Suicide & Self-Harm Responses that enable, encourage, or endorse acts of intentional self-harm. - -
MLC Baseline - Violent Crimes Responses that enable, encourage, or endorse the commission of violent crimes. - -
Ethics-Commonsense-Hard This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
auto_categorization To measure the model's ability to categorise elements into groups. - https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/auto_categorization
BBQ-lite on sexual-orientation - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on sexual-orientation, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
tamil-news-classification This dataset is used to classify a static set of Tamil News. The task is to classify news to its respective category. The dataset has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". GNU General Public License v3.0 -
C-Values - Values of Chinese Large Language Models This dataset contains prompts that can measure the alignment ability of LLMs in both safety and responsibility in the Chinese language. Apache-2.0 license -
BBQ-lite on physical-appearance - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on physical-appearance, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
enronemail The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. Apache-2.0 https://github.com/ParakweetLabs/EmailIntentDataSet
m3exam-thai This dataset contains local exam questions for testing Thai capability. CC BY-NC-SA https://github.com/DAMO-NLP-SG/M3Exam
MMLU Medical Genetics MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Medical Genetics. MIT License https://github.com/hendrycks/test
BBQ-lite on ses - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on ses, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
truthfulqa TruthfulQA aims to measure the truthfulness of a model. Apache-2.0 https://github.com/sylinrl/TruthfulQA
Ethics-Commonsense-Hard This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
Places in Singapore Contains prompts that test the model's understanding of places in Singapore, in True/False format Apache-2.0 IMDA
BBQ-lite on nationality - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on nationality, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
MMLU Professional Medicine MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Professional Medicine. MIT License https://github.com/hendrycks/test
BBQ-lite on race-ethnicity - Ambiguous Questions This dataset is an excerpt from Bias Benchmark for QA on race-ethnicity, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
winobias-variation1 This dataset tests gender bias using professions drawn from the Labor Force Statistics (https://www.bls.gov/cps/cpsaat11.htm), which themselves contain some gender bias. MIT License https://github.com/uclanlp/corefBias/tree/master/WinoBias/wino
HellaSwag This dataset is used to evaluate commonsense with questions that are trivial for humans but difficult for state-of-the-art models. MIT License https://github.com/rowanz/hellaswag
BBQ-lite on ses - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on ses, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
Safety Benchmark (Singapore Context) Contains prompts that test safety in the Singapore context Apache-2.0 IMDA
MedQA (US) MedQA is a free-form multiple-choice OpenQA dataset for solving medical problems. These are collected from the professional medical board exams. We extracted the list of MCQ questions from the US in this dataset. MIT License https://github.com/jind11/MedQA
uciadult The UCI adult dataset, created in 1996, is used to train models to predict whether a person's income will exceed $50K/yr based on census data. Also known as "Census Income" dataset. Creative Commons Attribution 4.0 International https://archive.ics.uci.edu/dataset/2/adult
MMLU This dataset covers 57 tasks including elementary mathematics, US history, computer science, law, and more. MIT license https://github.com/hendrycks/test
BBQ-lite on disability-status - Disambiguated Questions This dataset is an excerpt from Bias Benchmark for QA on disability-status, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
Chinese Linguistics & Cognition Challenge (CLCC) This dataset is a subjective benchmark created by BAAI FlagEval group. CC-BY-4.0 license https://flageval.baai.ac.cn/