
List of Datasets

Name Description License Reference
advglue
advglue-all.json
Adversarial GLUE Benchmark (AdvGLUE) is a comprehensive robustness evaluation benchmark that focuses on the adversarial robustness evaluation of language models. CC-BY-4.0 license https://github.com/AI-secure/adversarial-glue
Analogical Similarity
analogical-similarity.json
To measure the model's ability to discriminate between different degrees of analogical similarity in two given episodes. Apache 2.0 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/analogical_similarity
Answercarefully Information Cantonese
answercarefully-ca.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Chinese
answercarefully-cn.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information English
answercarefully-en.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Farsi
answercarefully-fa.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information French
answercarefully-fr.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Japanese
answercarefully-jp.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Korean
answercarefully-kr.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Kiswahili
answercarefully-kw.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Malay
answercarefully-my.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Telugu
answercarefully-tl.json
A dataset of security-related questions and answers. Dataset from NII-LLMC working group - subset created for AISI testing.
ARC-Challenge
arc-challenge.json
A dataset of genuine grade-school level, multiple-choice science questions for advanced question answering. This is the challenge set. CC BY-SA https://allenai.org/data/arc
ARC-Easy
arc-easy.json
A dataset of genuine grade-school level, multiple-choice science questions for advanced question answering. This is the easy set. CC BY-SA https://allenai.org/data/arc
auto_categorization
auto-categorisation.json
To measure the model's ability to categorise elements into groups. Apache 2.0 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/auto_categorization
BBQ-lite on age - Ambiguous Questions
bbq-lite-age-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on age, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on age - Disambiguated Questions
bbq-lite-age-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on age, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on disability-status - Ambiguous Questions
bbq-lite-disability-status-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on disability-status, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on disability-status - Disambiguated Questions
bbq-lite-disability-status-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on disability-status, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on gender - Ambiguous Questions
bbq-lite-gender-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on gender, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on gender - Disambiguated Questions
bbq-lite-gender-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on gender, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on nationality - Ambiguous Questions
bbq-lite-nationality-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on nationality, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on nationality - Disambiguated Questions
bbq-lite-nationality-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on nationality, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on physical-appearance - Ambiguous Questions
bbq-lite-physical-appearance-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on physical-appearance, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on physical-appearance - Disambiguated Questions
bbq-lite-physical-appearance-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on physical-appearance, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-ethnicity - Ambiguous Questions
bbq-lite-race-ethnicity-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on race-ethnicity, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-ethnicity - Disambiguated Questions
bbq-lite-race-ethnicity-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on race-ethnicity, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-x-gender - Ambiguous Questions
bbq-lite-race-x-gender-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on race-x-gender, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-x-gender - Disambiguated Questions
bbq-lite-race-x-gender-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on race-x-gender, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-x-ses - Ambiguous Questions
bbq-lite-race-x-ses-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on race-x-ses, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-x-ses - Disambiguated Questions
bbq-lite-race-x-ses-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on race-x-ses, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on religion - Ambiguous Questions
bbq-lite-religion-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on religion, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on religion - Disambiguated Questions
bbq-lite-religion-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on religion, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on ses - Ambiguous Questions
bbq-lite-ses-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on ses, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on ses - Disambiguated Questions
bbq-lite-ses-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on ses, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on sexual-orientation - Ambiguous Questions
bbq-lite-sexual-orientation-ambiguous.json
This dataset is an excerpt from Bias Benchmark for QA on sexual-orientation, containing only the ambiguous questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BBQ-lite on sexual-orientation - Disambiguated Questions
bbq-lite-sexual-orientation-disamb.json
This dataset is an excerpt from Bias Benchmark for QA on sexual-orientation, containing only the disambiguated questions. CC-BY-4.0 license https://arxiv.org/pdf/2110.08193v2
BIPIA - abstract QA - English
bipia-abstract-test.json
Abstract QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on XSum dataset (BBC articles). Fake summaries or call-to-action embedded in articles. MIT License https://github.com/microsoft/BIPIA/tree/main
BIPIA - abstract QA - English
bipia-abstract-train.json
Abstract QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on XSum dataset (BBC articles). Fake summaries or call-to-action embedded in articles. MIT License https://github.com/microsoft/BIPIA/tree/main
BIPIA - email QA - English
bipia-email-test.json
Email QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on OpenAI Evals (real-world emails with questions and answers). Malicious instructions may be hidden in the email body. MIT License https://github.com/microsoft/BIPIA/tree/main
BIPIA - email QA - English
bipia-email-train.json
Email QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on OpenAI Evals (real-world emails with questions and answers). Malicious instructions may be hidden in the email body. MIT License https://github.com/microsoft/BIPIA/tree/main
BIPIA - News QA - English
bipia-news.json
News QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. News data is based on NewsQA dataset. Malicious content injected into search snippets or web pages. MIT License https://www.kaggle.com/datasets/nagendra048/newsqa-dataset, https://github.com/microsoft/BIPIA/tree/main
BIPIA - table QA - English
bipia-table-test.json
Table QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on WikiTableQuestions dataset. Attacks embedded as cell notes or footnotes. MIT license https://github.com/microsoft/BIPIA/tree/main
BIPIA - table QA - English
bipia-table-train.json
Table QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on WikiTableQuestions dataset. Attacks embedded as cell notes or footnotes. MIT License https://github.com/microsoft/BIPIA/tree/main
cause_and_effect
cause-and-effect.json
This recipe measures if the model is able to distinguish and understand cause and effect. The higher the grade, the better the system can perform this capability. Apache 2.0 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cause_and_effect
Chinese Version - Bias Benchmark for QA
cbbq-lite-SES-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-SES-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-age-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-age-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-disability-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-disability-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-disease-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-disease-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-educational-qualification-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-educational-qualification-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-ethnicity-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-ethnicity-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-gender-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-gender-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-household-registration-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-household-registration-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-nationality-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-nationality-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-physical-appearance-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-physical-appearance-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-race-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-race-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-region-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-region-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-religion-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-religion-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-sexual-orientation-ambiguous.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA
cbbq-lite-sexual-orientation-disamb.json
This dataset aims to measure social biases across 9 different categories in the Chinese language. CC-BY-SA-4.0 license https://github.com/YFHuangxxxx/CBBQ
ChallengingToxicityPrompts
challenging-toxicity-prompts-variation1.json
ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences. CC BY-SA 4.0 https://github.com/AI-secure/DecodingTrust
ChallengingToxicityPrompts
challenging-toxicity-prompts-variation2.json
ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences. CC BY-SA 4.0 https://github.com/AI-secure/DecodingTrust
ChallengingToxicityPrompts
challenging-toxicity-prompts.json
ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences. CC BY-SA 4.0 https://github.com/AI-secure/DecodingTrust
Chinese Linguistics & Cognition Challenge (CLCC)
clcc.json
This dataset is a subjective benchmark created by BAAI FlagEval group. CC-BY-4.0 license https://flageval.baai.ac.cn/
CNN/DailyMail
cnn-dailymail.json
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. This dataset contains the test split. Apache 2.0 https://huggingface.co/datasets/abisee/cnn_dailymail
Ethics-Commonsense-Easy
commonsense-morality-easy-variation1.json
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
Ethics-Commonsense-Easy
commonsense-morality-easy-variation2.json
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
Ethics-Commonsense-Hard
commonsense-morality-hard-variation1.json
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
Ethics-Commonsense-Hard
commonsense-morality-hard-variation2.json
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. MIT License https://github.com/hendrycks/ethics/
Contextual Parametric Knowledge Conflicts
contextual-parametric-knowledge-conflicts.json
To measure the model's ability to make informed predictions. Apache 2.0 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/contextual_parametric_knowledge_conflicts
coqa_conversational_question_answering
coqa-conversational-qna.json
To measure the ability of machines to understand a text passage and answer a series of interconnected questions Apache 2.0 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/coqa_conversational_question_answering
C-Values - Values of Chinese Large Language Models
cvalues.json
This dataset contains prompts that can measure the alignment ability of LLMs in both safety and responsibility in Chinese Language. Apache-2.0 license https://github.com/X-PLUG/CValues
CyberSecEval Prompt Injection 2 Chinese
cyberseceval-promptinjection2-cn.json
A dataset of security-related questions and answers. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 English
cyberseceval-promptinjection2-en.json
A dataset of security-related questions and answers. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 French
cyberseceval-promptinjection2-fr.json
A dataset of security-related questions and answers. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Japanese
cyberseceval-promptinjection2-jp.json
A dataset of security-related questions and answers. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Korean
cyberseceval-promptinjection2-kr.json
A dataset of security-related questions and answers. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Kiswahili
cyberseceval-promptinjection2-kw.json
A dataset of security-related questions and answers. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Malay
cyberseceval-promptinjection2-my.json
A dataset of security-related questions and answers. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval 3 Interpreter - English
cyberseceval3-interpreter.json
These tests evaluate the security risks posed by integrating LLMs with code interpreters, specifically assessing how effectively an LLM can prevent malicious attempts to exploit the system or execute harmful code. MIT License https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks
CyberSecEval Prompt Injection
cyberseceval_promptinjection.json
Checks if the LLM is susceptible to prompt injection. MIT license https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
DecodingTrust - Privacy Leakage during Conversations
decodingtrust-private-conversations.json
To test if applications can leak privacy-sensitive information which is provided during interactive conversations in the inference stage. CC BY-SA 4.0 license https://github.com/AI-secure/DecodingTrust/tree/main
Facts about Asia Pacific in True and False in 4 languages (Chinese, Malay, Tamil and English)
facticity-apac-multilungual-facts.json
Contains prompts with facts about Asia Pacific and Singapore, in True/False format. Apache-2.0 facticity.ai
FACTS Grounding
facts-grounding.json
FACTS Grounding is a benchmark from Google DeepMind and Google Research designed to measure the performance of AI models on factuality and grounding. CC-BY-4.0 license https://huggingface.co/datasets/google/FACTS-grounding-public
uciadult
fairness-uciadult.json
The UCI Adult dataset has been widely used to assess fairness. Creative Commons Attribution 4.0 International https://archive.ics.uci.edu/dataset/2/adult
Gender Occupational Bias
gender-occupation-text2-image-prompts.json
The gender occupational bias dataset is a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 MIT License https://arxiv.org/abs/2211.03759
Gender Occupational Bias
gender-text2-image-prompts.json
The gender occupational bias dataset is a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 MIT License https://arxiv.org/abs/2211.03759
gre_reading_comprehension
gre-reading-comprehension.json
To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. Apache 2.0 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/gre_reading_comprehension
GSM8K-TEST
gsm8k-test.json
This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations. The dataset includes only the test set of GSM8K. MIT License https://github.com/openai/grade-school-math
GSM8K
gsm8k.json
This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations. MIT License https://github.com/openai/grade-school-math
HellaSwag
hellaswag.json
This dataset is used to evaluate commonsense with questions that are trivial for humans but difficult for state-of-the-art models. MIT License https://github.com/rowanz/hellaswag
Inappropriate Image Prompts (I2P)
i2p-text2image-prompts.json
The I2P benchmark contains real user prompts for generative text-to-image models that are disproportionately likely to produce inappropriate images. I2P was introduced in the 2023 CVPR paper Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. This benchmark is not specific to any approach or model, but was designed to evaluate mitigating measures against inappropriate degeneration in Stable Diffusion. MIT License https://huggingface.co/datasets/AIML-TUDA/i2p
In-the-wild Jailbreak Dataset
jailbreak-dan.json
This dataset contains jailbreak prompts that were collected from various places (e.g., Discord). MIT license https://github.com/verazuo/jailbreak_llms
m3exam-indonesian
m3exam-indonesian-test.json
This dataset contains local exam questions for testing Indonesian capability. CC BY-NC-SA https://github.com/DAMO-NLP-SG/M3Exam
m3exam-thai
m3exam-thai-test.json
This dataset contains local exam questions for testing Thai capability. CC BY-NC-SA https://github.com/DAMO-NLP-SG/M3Exam
m3exam-vietnamese
m3exam-vietnamese-test.json
This dataset contains local exam questions for testing Vietnamese capability. CC BY-NC-SA https://github.com/DAMO-NLP-SG/M3Exam
MedMCQA
medmcqa.json
MedMCQA is a large-scale, multiple-choice question answering dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects. MIT License https://github.com/MedMCQA/MedMCQA
MedQA (US)
medqa-us.json
MedQA is a free-form multiple-choice OpenQA dataset for solving medical problems. These are collected from the professional medical board exams. We extracted the list of MCQ questions from the US in this dataset. MIT License https://github.com/jind11/MedQA
MLCAiluminate - Child Sexual Exploitation
mlc-ailuminate-cse.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Defamation
mlc-ailuminate-dfm.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Hate
mlc-ailuminate-hte.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Intellectual Property Violations
mlc-ailuminate-ipv.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Indiscriminate Weapons (CBRNE)
mlc-ailuminate-iwp.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Non-violent Crimes
mlc-ailuminate-ncr.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Privacy Violations
mlc-ailuminate-prv.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Specialized Advice (Elections)
mlc-ailuminate-spc-ele.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Specialized Advice (Financial)
mlc-ailuminate-spc-fin.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Specialized Advice (Health)
mlc-ailuminate-spc-hlt.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Specialized Advice (Legal)
mlc-ailuminate-spc-lgl.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Sex-related Crimes
mlc-ailuminate-src.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Suicide and Self-Harm
mlc-ailuminate-ssh.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Sexual Content (Pornographic)
mlc-ailuminate-sxc-prn.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLCAiluminate - Violent Crimes
mlc-ailuminate-vcr.json
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Child Abuse and Exploitation
mlc-cae-malicious-user.json
Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Child Abuse and Exploitation
mlc-cae-typical-user.json
Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives
mlc-cbr-malicious-user.json
Responses that enable, encourage, or endorse the creation of weapons of mass destruction. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives
mlc-cbr-typical-user.json
Responses that enable, encourage, or endorse the creation of weapons of mass destruction. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Hate
mlc-hat-malicious-user.json
Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Hate
mlc-hat-typical-user.json
Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
ML Commons AILuminate - Intellectual Property Violation - Cantonese
mlc-ipv-ca.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Chinese
mlc-ipv-cn.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - English
mlc-ipv-en.json
Test whether responses violate the intellectual property rights of any party. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Farsi
mlc-ipv-fa.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - French
mlc-ipv-fr.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Japanese
mlc-ipv-jp.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Korean
mlc-ipv-kr.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Kiswahili
mlc-ipv-kw.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Malay
mlc-ipv-my.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Telugu
mlc-ipv-tl.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Chinese
mlc-ncr-cn.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - English
mlc-ncr-en.json
Test whether responses enable, encourage, or endorse the commission of non-violent crimes. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Farsi
mlc-ncr-fa.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - French
mlc-ncr-fr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Japanese
mlc-ncr-jp.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Korean
mlc-ncr-kr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Kiswahili
mlc-ncr-kw.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Malay
mlc-ncr-my.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Telugu
mlc-ncr-tl.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
MLC Baseline - Non-Violent Crimes
mlc-nvc-malicious-user.json
Responses that enable, encourage, or endorse the commission of non-violent crimes. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Non-Violent Crimes
mlc-nvc-typical-user.json
Responses that enable, encourage, or endorse the commission of non-violent crimes. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
ML Commons AILuminate - Privacy - Chinese
mlc-prv-cn.json
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - English
mlc-prv-en.json
Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Farsi
mlc-prv-fr.json
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Japanese
mlc-prv-jp.json
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Korean
mlc-prv-kr.json
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Kiswahili
mlc-prv-kw.json
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Malay
mlc-prv-my.json
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Telugu
mlc-prv-tl.json
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
MLC Baseline - Sex-Related Crimes
mlc-scr-malicious-user.json
Responses that enable, encourage, or endorse the commission of sex-related crimes. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Sex-Related Crimes
mlc-scr-typical-user.json
Responses that enable, encourage, or endorse the commission of sex-related crimes. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Suicide & Self-Harm
mlc-ssh-typical-user.json
Responses that enable, encourage, or endorse acts of intentional self-harm. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MLC Baseline - Suicide & Self-Harm
mlc-ssh-vulnerable-user.json
Responses that enable, encourage, or endorse acts of intentional self-harm. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
ML Commons AILuminate - Violent Crimes - Chinese
mlc-vcr-cn.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - English
mlc-vcr-en.json
Test whether responses enable, encourage, or endorse the commission of violent crimes. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - French
mlc-vcr-fr.json
A dataset of security-related questions and answers. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Japanese
mlc-vcr-jp.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Korean
mlc-vcr-kr.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Kiswahili
mlc-vcr-kw.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
MLC Baseline - Violent Crimes
mlc-vcr-malicious-user.json
Responses that enable, encourage, or endorse the commission of violent crimes. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
ML Commons AILuminate - Violent Crimes - Malay
mlc-vcr-my.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Telugu
mlc-vcr-tl.json
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. CC-BY-4.0 license https://mlcommons.org/ailuminate/
MLC Baseline - Violent Crimes
mlc-vcr-typical-user.json
Responses that enable, encourage, or endorse the commission of violent crimes. CC-BY-4.0 license https://github.com/mlcommons/ailuminate
MMLU
mmlu-all.json
This dataset covers 57 tasks including elementary mathematics, US history, computer science, law, and more. MIT license https://github.com/hendrycks/test
MMLU Anatomy
mmlu-anatomy.json
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Anatomy. MIT License https://github.com/hendrycks/test
MMLU Clinical Knowledge
mmlu-clinicalknowledge.json
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Clinical Knowledge. MIT License https://github.com/hendrycks/test
MMLU College Biology
mmlu-collegebiology.json
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Biology. MIT License https://github.com/hendrycks/test
MMLU College Medicine
mmlu-collegemedicine.json
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Medicine. MIT License https://github.com/hendrycks/test
MMLU Medical Genetics
mmlu-medicalgenetics.json
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Medical Genetics. MIT License https://github.com/hendrycks/test
MMLU Professional Medicine
mmlu-professionalmedicine.json
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Professional Medicine. MIT License https://github.com/hendrycks/test
List of Occupations
occupation.json
This dataset contains a list of occupations collected from various websites (e.g., MoM of Singapore). The dataset can be used to test job-related applications. MIT License IMDA
enronemail
privacy-enronemails.json
The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. Apache-2.0 https://github.com/ParakweetLabs/EmailIntentDataSet
Jailbreak Prompts
prompt_injection_jailbreak.json
This dataset evaluates whether a model/app detects attempted jailbreaks or can be induced to generate unsafe content. These jailbreak techniques are collected from X (e.g., Pliny) and public repositories (e.g., in-the-wild jailbreak prompts). Apache 2.0 OWASP
PubMedQA
pubmedqa.json
PubMedQA is a biomedical question answering dataset built from PubMed abstracts, where the task is to answer research questions with yes/no/maybe based on the corresponding abstract. MIT License https://github.com/pubmedqa/pubmedqa
mock-dataset
rag-sample-dataset.json
This is a sample of a dataset to be used for RAG Evaluation.
RealToxicityPrompts
real-toxicity-prompts.json
RealToxicityPrompts aims to measure the model's ability to identify toxic sentences. Apache-2.0 https://github.com/allenai/real-toxicity-prompts/
RealtimeQA
realtimeqa-past.json
RealTime QA is a dynamic question answering (QA) platform that inquires about the present. https://github.com/realtimeqa/realtimeqa_public
SafetyBench-PII (Test)
safetybench-privacy-full.json
SafetyBench is a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. MIT License https://huggingface.co/datasets/thu-coai/SafetyBench
SafetyBench-PII (dev)
safetybench-privacy-small.json
SafetyBench is a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. MIT license https://huggingface.co/datasets/thu-coai/SafetyBench
sg-legal-glossary
sg-legal-glossary.json
A list of Singapore legal terms extracted from the SICC and Judiciary websites. https://www.sicc.gov.sg/glossary-of-legal-terms
sg-university-tutorial-questions-legal
sg-university-tutorial-questions-legal.json
Contains tutorial questions and answers from Singapore's universities to test the model's ability to understand the legal context in Singapore.
SGHateCheck - Malay
sghatecheck-ms.json
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. MIT license https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases
SGHateCheck - Singlish
sghatecheck-ss.json
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. MIT license https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases
SGHateCheck - Tamil
sghatecheck-ta.json
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. MIT license https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases
SGHateCheck - Chinese
sghatecheck-zh.json
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. MIT license https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases
Facts about Singapore in True and False
singapore-facts-tnf.json
Contains prompts with facts about Singapore, in True/False format. Apache-2.0 IMDA
Food in Singapore
singapore-food-tnf.json
Contains prompts that test the model's understanding of food, in True/False format. Apache-2.0 IMDA
Iconic Places in Singapore
singapore-iconic-places.json
Contains questions about Singapore's iconic places. Apache-2.0 IMDA
Places in Singapore
singapore-places-tnf.json
Contains prompts that test the model's understanding of places in Singapore, in True/False format. Apache-2.0 IMDA
Singapore POFMA Statements
singapore-pofma-statements-2023.json
Statements that are false under POFMA in Singapore for 2023. Apache-2.0 Genue
Singapore POFMA Statements
singapore-pofma-statements-2024.json
Statements that are false under POFMA in Singapore for 2024. Apache-2.0 Genue
Singapore POFMA True Statements
singapore-pofma-true-statements.json
Statements that are true under POFMA in Singapore. For sanity check only. Apache-2.0 Genue
Singapore Political History
singapore-political-history.json
Contains questions about key events in Singapore's political history. Apache-2.0 IMDA
Singapore Public Housing
singapore-public-housing.json
Contains questions about Singapore's public housing system. Apache-2.0 IMDA
Safety Benchmark (Singapore Context)
singapore-safety-questions.json
Contains prompts that test safety in the Singapore context. Apache-2.0 IMDA
Singapore Transport System
singapore-transport-system.json
Contains questions about Singapore's transport system. Apache-2.0 IMDA
squad-shifts-tnf
squad-shifts-tnf.json
Zero-shot reading comprehension on paragraphs and questions from squadshifts Apache 2.0 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/squad_shifts
squad-v2
squad-v2.json
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. CC BY-SA 4.0 https://huggingface.co/datasets/rajpurkar/squad_v2
tamil-thirukural
tamil-kural-classification.json
This dataset is used to test the comprehension abilities for the Thirukkural. Thirukkural is a classic Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) that are grouped into 133 chapters, each containing 10 couplets. Creative Commons Attribution 4.0 International https://github.com/vijayanandrp/Thirukkural-Tamil-Dataset
tamil-news-classification
tamil-tamilnews-classification.json
This dataset is used to classify a static set of Tamil News. The task is to classify news to its respective category. The dataset has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". GNU General Public License v3.0 https://github.com/vanangamudi/tamil-news-classification/tree/master/dataset/news
tanglish-tweets-SA
tamil-tanglish-tweets.json
Code-mixed Tamil and English tweets curated for the sentiment analysis task. CC0: Public Domain https://www.kaggle.com/datasets/vyombhatia/tanglish-comments-for-sentiment-ananlysis/data
TruthfulQA (MCQ Version)
truthfulqa-mcq.json
TruthfulQA aims to measure the truthfulness of a model. This dataset uses a multiple choice format. Apache-2.0 https://github.com/sylinrl/TruthfulQA
truthfulqa
truthfulqa-multiple-open-ended.json
TruthfulQA aims to measure the truthfulness of a model. Apache-2.0 https://github.com/sylinrl/TruthfulQA
truthfulqa
truthfulqa-open-ended.json
TruthfulQA aims to measure the truthfulness of a model. Apache-2.0 https://github.com/sylinrl/TruthfulQA
uciadult
uciadult.json
The UCI adult dataset, created in 1996, is used to train models to predict whether a person's income will exceed $50K/yr based on census data. Also known as "Census Income" dataset. Creative Commons Attribution 4.0 International https://archive.ics.uci.edu/dataset/2/adult
winobias-variation1
winobias-type1.json
This dataset contains gender-bias test cases based on the professions from the Labor Force Statistics (https://www.bls.gov/cps/cpsaat11.htm), which exhibit some gender imbalance. MIT License https://github.com/uclanlp/corefBias/tree/master/WinoBias/wino
Winogrande
winogrande.json
This dataset is used for commonsense reasoning, expert-crafted pronoun resolution problems designed to be unsolvable for statistical models. Apache-2.0 https://github.com/allenai/winogrande
XSTest (Privacy related only)
xstest-privacy-subset.json
XSTest test suite highlights systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models Creative Commons Attribution 4.0 International license https://huggingface.co/datasets/walledai/XSTest
XSTest
xstest.json
XSTest test suite highlights systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models CC-BY-4.0 license https://huggingface.co/datasets/walledai/XSTest