List of Datasets
| Name | Description | License | Reference | 
|---|---|---|---|
| advglue  advglue-all.json  | 
Adversarial GLUE (AdvGLUE) is a comprehensive benchmark for evaluating the adversarial robustness of language models. | CC-BY-4.0 license | https://github.com/AI-secure/adversarial-glue | 
| Analogical Similarity  analogical-similarity.json  | 
To measure the model’s ability to discriminate between different degrees of analogical similarity in two given episodes. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/analogical_similarity | 
| Answercarefully Information Cantonese  answercarefully-ca.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. | 
| Answercarefully Information Chinese  answercarefully-cn.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing | Dataset from NII-LLMC working group - subset created for AISI testing | 
| Answercarefully Information English  answercarefully-en.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing | Dataset from NII-LLMC working group - subset created for AISI testing | 
| Answercarefully Information Farsi  answercarefully-fa.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. | 
| Answercarefully Information French  answercarefully-fr.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. | 
| Answercarefully Information Japanese  answercarefully-jp.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing | Dataset from NII-LLMC working group - subset created for AISI testing | 
| Answercarefully Information Korean  answercarefully-kr.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing | Dataset from NII-LLMC working group - subset created for AISI testing | 
| Answercarefully Information Kiswahili  answercarefully-kw.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. | 
| Answercarefully Information Malay  answercarefully-my.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. | 
| Answercarefully Information Telugu  answercarefully-tl.json  | 
A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. | 
| ARC-Challenge  arc-challenge.json  | 
A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the challenge set. | CC BY-SA | https://allenai.org/data/arc | 
| ARC-Easy  arc-easy.json  | 
A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the easy set. | CC BY-SA | https://allenai.org/data/arc | 
| auto_categorization  auto-categorisation.json  | 
To measure the model's ability to categorise elements into groups. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/auto_categorization | 
| BBQ-lite on age - Ambiguous Questions  bbq-lite-age-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on age, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on age - Disambiguated Questions  bbq-lite-age-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on age, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on disability-status - Ambiguous Questions  bbq-lite-disability-status-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on disability-status, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on disability-status - Disambiguated Questions  bbq-lite-disability-status-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on disability-status, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on gender - Ambiguous Questions  bbq-lite-gender-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on gender, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on gender - Disambiguated Questions  bbq-lite-gender-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on gender, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on nationality - Ambiguous Questions  bbq-lite-nationality-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on nationality, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on nationality - Disambiguated Questions  bbq-lite-nationality-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on nationality, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on physical-appearance - Ambiguous Questions  bbq-lite-physical-appearance-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on physical-appearance, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on physical-appearance - Disambiguated Questions  bbq-lite-physical-appearance-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on physical-appearance, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on race-ethnicity - Ambiguous Questions  bbq-lite-race-ethnicity-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-ethnicity, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on race-ethnicity - Disambiguated Questions  bbq-lite-race-ethnicity-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-ethnicity, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on race-x-gender - Ambiguous Questions  bbq-lite-race-x-gender-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-gender, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on race-x-gender - Disambiguated Questions  bbq-lite-race-x-gender-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-gender, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on race-x-ses - Ambiguous Questions  bbq-lite-race-x-ses-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-ses, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on race-x-ses - Disambiguated Questions  bbq-lite-race-x-ses-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on race-x-ses, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on religion - Ambiguous Questions  bbq-lite-religion-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on religion, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on religion - Disambiguated Questions  bbq-lite-religion-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on religion, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on ses - Ambiguous Questions  bbq-lite-ses-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on ses, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on ses - Disambiguated Questions  bbq-lite-ses-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on ses, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on sexual-orientation - Ambiguous Questions  bbq-lite-sexual-orientation-ambiguous.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on sexual-orientation, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BBQ-lite on sexual-orientation - Disambiguated Questions  bbq-lite-sexual-orientation-disamb.json  | 
This dataset is an excerpt from the Bias Benchmark for QA (BBQ) on sexual-orientation, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 | 
| BIPIA - abstract QA - English  bipia-abstract-test.json  | 
Abstract QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on XSum dataset (BBC articles). Fake summaries or call-to-action embedded in articles. | MIT License | https://github.com/microsoft/BIPIA/tree/main | 
| BIPIA - abstract QA - English  bipia-abstract-train.json  | 
Abstract QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on XSum dataset (BBC articles). Fake summaries or call-to-action embedded in articles. | MIT License | https://github.com/microsoft/BIPIA/tree/main | 
| BIPIA - email QA - English  bipia-email-test.json  | 
Email QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on OpenAI Evals (real-world emails with questions and answers). Malicious instructions may be hidden in the email body. | MIT License | https://github.com/microsoft/BIPIA/tree/main | 
| BIPIA - email QA - English  bipia-email-train.json  | 
Email QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on OpenAI Evals (real-world emails with questions and answers). Malicious instructions may be hidden in the email body. | MIT License | https://github.com/microsoft/BIPIA/tree/main | 
| BIPIA - News QA - English  bipia-news.json  | 
News QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. News data is based on NewsQA dataset. Malicious content injected into search snippets or web pages. | MIT License | https://www.kaggle.com/datasets/nagendra048/newsqa-dataset, https://github.com/microsoft/BIPIA/tree/main | 
| BIPIA - table QA - English  bipia-table-test.json  | 
Table QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on WikiTableQuestions dataset. Attacks embedded as cell notes or footnotes. | MIT license | https://github.com/microsoft/BIPIA/tree/main | 
| BIPIA - table QA - English  bipia-table-train.json  | 
Table QA from paper - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. Based on WikiTableQuestions dataset. Attacks embedded as cell notes or footnotes. | MIT License | https://github.com/microsoft/BIPIA/tree/main | 
| cause_and_effect  cause-and-effect.json  | 
This recipe measures if the model is able to distinguish and understand cause and effect. The higher the grade, the better the system can perform this capability. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cause_and_effect | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-SES-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-SES-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-age-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-age-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-disability-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-disability-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-disease-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-disease-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-educational-qualification-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-educational-qualification-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-ethnicity-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-ethnicity-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-gender-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-gender-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-household-registration-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-household-registration-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-nationality-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-nationality-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-physical-appearance-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-physical-appearance-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-race-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-race-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-region-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-region-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-religion-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-religion-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-sexual-orientation-ambiguous.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| Chinese Version - Bias Benchmark for QA  cbbq-lite-sexual-orientation-disamb.json  | 
This aims to measure social biases across 9 different categories in Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ | 
| ChallengingToxicityPrompts  challenging-toxicity-prompts-variation1.json  | 
ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim to measure the model's ability to identify toxic sentences. | CC BY-SA 4.0 | https://github.com/AI-secure/DecodingTrust | 
| ChallengingToxicityPrompts  challenging-toxicity-prompts-variation2.json  | 
ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim to measure the model's ability to identify toxic sentences. | CC BY-SA 4.0 | https://github.com/AI-secure/DecodingTrust | 
| ChallengingToxicityPrompts  challenging-toxicity-prompts.json  | 
ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim to measure the model's ability to identify toxic sentences. | CC BY-SA 4.0 | https://github.com/AI-secure/DecodingTrust | 
| Chinese Linguistics & Cognition Challenge (CLCC)  clcc.json  | 
This dataset is a subjective benchmark created by BAAI FlagEval group. | CC-BY-4.0 license | https://flageval.baai.ac.cn/ | 
| CNN/DailyMail  cnn-dailymail.json  | 
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. This dataset contains the test split. | Apache 2.0 | https://huggingface.co/datasets/abisee/cnn_dailymail | 
| Ethics-Commonsense-Easy  commonsense-morality-easy-variation1.json  | 
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ | 
| Ethics-Commonsense-Easy  commonsense-morality-easy-variation2.json  | 
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ | 
| Ethics-Commonsense-Hard  commonsense-morality-hard-variation1.json  | 
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ | 
| Ethics-Commonsense-Hard  commonsense-morality-hard-variation2.json  | 
This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ | 
| Contextual Parametric Knowledge Conflicts  contextual-parametric-knowledge-conflicts.json  | 
To measure the model's ability to make informed predictions. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/contextual_parametric_knowledge_conflicts | 
| coqa_conversational_question_answering  coqa-conversational-qna.json  | 
To measure the ability of machines to understand a text passage and answer a series of interconnected questions | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/coqa_conversational_question_answering | 
| C-Values - Values of Chinese Large Language Models  cvalues.json  | 
This dataset contains prompts that can measure the alignment ability of LLMs in both safety and responsibility in Chinese Language. | Apache-2.0 license | https://github.com/X-PLUG/CValues | 
| CyberSecEval Prompt Injection 2 Chinese  cyberseceval-promptinjection2-cn.json  | 
A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| CyberSecEval Prompt Injection 2 English  cyberseceval-promptinjection2-en.json  | 
A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| CyberSecEval Prompt Injection 2 French  cyberseceval-promptinjection2-fr.json  | 
A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| CyberSecEval Prompt Injection 2 Japanese  cyberseceval-promptinjection2-jp.json  | 
A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| CyberSecEval Prompt Injection 2 Korean  cyberseceval-promptinjection2-kr.json  | 
A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| CyberSecEval Prompt Injection 2 Kiswahili  cyberseceval-promptinjection2-kw.json  | 
A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| CyberSecEval Prompt Injection 2 Malay  cyberseceval-promptinjection2-my.json  | 
A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| CyberSecEval 3 Interpreter - English  cyberseceval3-interpreter.json  | 
These tests evaluate the security risks posed by integrating LLMs with code interpreters, specifically assessing how effectively an LLM can prevent malicious attempts to exploit the system or execute harmful code. | MIT License | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks | 
| CyberSecEval Prompt Injection  cyberseceval_promptinjection.json  | 
Checks if an LLM is susceptible to prompt injection. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection | 
| DecodingTrust - Privacy Leakage during Conversations  decodingtrust-private-conversations.json  | 
To test if applications can leak privacy-sensitive information which is provided during interactive conversations in the inference stage. | CC BY-SA 4.0 license | https://github.com/AI-secure/DecodingTrust/tree/main | 
| Facts about Asia Pacific in True and False in 4 languages (Chinese, Malay, Tamil and English)  facticity-apac-multilungual-facts.json  | 
Contains prompts with facts about Asia Pacific and Singapore, in True/False format. | Apache-2.0 | facticity.ai | 
| FACTS Grounding  facts-grounding.json  | 
FACTS Grounding is a benchmark from Google DeepMind and Google Research designed to measure the performance of AI models on factuality and grounding. | CC-BY-4.0 license | https://huggingface.co/datasets/google/FACTS-grounding-public | 
| uciadult  fairness-uciadult.json  | 
The UCI Adult dataset has been widely used to assess fairness. | Creative Commons Attribution 4.0 International | https://archive.ics.uci.edu/dataset/2/adult | 
| Gender Occupational Bias  gender-occupation-text2-image-prompts.json  | 
The gender occupational bias dataset is a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 | MIT License | https://arxiv.org/abs/2211.03759 | 
| Gender Occupational Bias  gender-text2-image-prompts.json  | 
The gender occupational bias dataset is a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 | MIT License | https://arxiv.org/abs/2211.03759 | 
| gre_reading_comprehension  gre-reading-comprehension.json  | 
To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/gre_reading_comprehension | 
| GSM8K-TEST  gsm8k-test.json  | 
This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations. The dataset includes only the test set of GSM8K. | MIT License | https://github.com/openai/grade-school-math | 
| GSM8K  gsm8k.json  | 
This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations. | MIT License | https://github.com/openai/grade-school-math | 
| HellaSwag  hellaswag.json  | 
This dataset is used to evaluate commonsense with questions that are trivial for humans but difficult for state-of-the-art models. | MIT License | https://github.com/rowanz/hellaswag | 
| Inappropriate Image Prompts (I2P)  i2p-text2image-prompts.json  | 
The I2P benchmark contains real user prompts for generative text-to-image models that are disproportionately likely to produce inappropriate images. I2P was introduced in the 2023 CVPR paper Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. This benchmark is not specific to any approach or model, but was designed to evaluate mitigating measures against inappropriate degeneration in Stable Diffusion. | MIT License | https://huggingface.co/datasets/AIML-TUDA/i2p | 
| In-the-wild Jailbreak Dataset  jailbreak-dan.json  | 
This dataset contains jailbreak prompts that were collected from various places (e.g., Discord). | MIT license | https://github.com/verazuo/jailbreak_llms | 
| m3exam-indonesian  m3exam-indonesian-test.json  | 
This dataset contains local exam questions for testing Indonesian capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam | 
| m3exam-thai  m3exam-thai-test.json  | 
This dataset contains local exam questions for testing Thai capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam | 
| m3exam-vietnamese  m3exam-vietnamese-test.json  | 
This dataset contains local exam questions for testing Vietnamese capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam | 
| MedMCQA  medmcqa.json  | 
MedMCQA is a large-scale, multiple-choice question answering dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects. | MIT License | https://github.com/MedMCQA/MedMCQA | 
| MedQA (US)  medqa-us.json  | 
MedQA is a free-form multiple-choice OpenQA dataset for solving medical problems. These are collected from the professional medical board exams. We extracted the list of MCQ questions from the US in this dataset. | MIT License | https://github.com/jind11/MedQA | 
| MLCAiluminate - Child Sexual Exploitation  mlc-ailuminate-cse.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Defamation  mlc-ailuminate-dfm.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Hate  mlc-ailuminate-hte.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Intellectual Property Violations  mlc-ailuminate-ipv.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Indiscriminate Weapons (CBRNE)  mlc-ailuminate-iwp.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Non-violent Crimes  mlc-ailuminate-ncr.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Privacy Violations  mlc-ailuminate-prv.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Specialized Advice (Elections)  mlc-ailuminate-spc-ele.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Specialized Advice (Financial)  mlc-ailuminate-spc-fin.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Specialized Advice (Health)  mlc-ailuminate-spc-hlt.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Specialized Advice (Legal)  mlc-ailuminate-spc-lgl.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Sex-related Crimes  mlc-ailuminate-src.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Suicide and Self-Harm  mlc-ailuminate-ssh.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Sexual Content (Pornographic)  mlc-ailuminate-sxc-prn.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLCAiluminate - Violent Crimes  mlc-ailuminate-vcr.json  | 
The MLCommons AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories | CC-BY-4.0 license | https://github.com/mlcommons/ailuminate | 
| MLC Baseline - Child Abuse and Exploitation  mlc-cae-malicious-user.json  | 
Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Child Abuse and Exploitation  mlc-cae-typical-user.json  | 
Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives  mlc-cbr-malicious-user.json  | 
Responses that enable, encourage, or endorse the creation of weapons of mass destruction. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives  mlc-cbr-typical-user.json  | 
Responses that enable, encourage, or endorse the creation of weapons of mass destruction. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Hate  mlc-hat-malicious-user.json  | 
Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Hate  mlc-hat-typical-user.json  | 
Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| ML Commons AILuminate - Intellectual Property Violation - Cantonese  mlc-ipv-ca.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - Chinese  mlc-ipv-cn.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - English  mlc-ipv-en.json  | 
Test whether responses violate the intellectual property rights of any party. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - Farsi  mlc-ipv-fa.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - French  mlc-ipv-fr.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - Japanese  mlc-ipv-jp.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - Korean  mlc-ipv-kr.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - Kiswahili  mlc-ipv-kw.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - Malay  mlc-ipv-my.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Intellectual Property Violation - Telugu  mlc-ipv-tl.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - Chinese  mlc-ncr-cn.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - English  mlc-ncr-en.json  | 
Test whether responses enable, encourage, or endorse the commission of non-violent crimes. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - Farsi  mlc-ncr-fa.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - French  mlc-ncr-fr.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - Japanese  mlc-ncr-jp.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - Korean  mlc-ncr-kr.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - Kiswahili  mlc-ncr-kw.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - Malay  mlc-ncr-my.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Non-Violent Crimes - Telugu  mlc-ncr-tl.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| MLC Baseline - Non-Violent Crimes  mlc-nvc-malicious-user.json  | 
Responses that enable, encourage, or endorse the commission of non-violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Non-Violent Crimes  mlc-nvc-typical-user.json  | 
Responses that enable, encourage, or endorse the commission of non-violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| ML Commons AILuminate - Privacy - Chinese  mlc-prv-cn.json  | 
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Privacy - English  mlc-prv-en.json  | 
Test whether responses contain sensitive, nonpublic personal information that could undermine someone’s physical, digital, or financial security. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Privacy - Farsi  mlc-prv-fr.json  | 
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Privacy - Japanese  mlc-prv-jp.json  | 
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Privacy - Korean  mlc-prv-kr.json  | 
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Privacy - Kiswahili  mlc-prv-kw.json  | 
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Privacy - Malay  mlc-prv-my.json  | 
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Privacy - Telugu  mlc-prv-tl.json  | 
This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| MLC Baseline - Sex-Related Crimes  mlc-scr-malicious-user.json  | 
Responses that enable, encourage, or endorse the commission of sex-related crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Sex-Related Crimes  mlc-scr-typical-user.json  | 
Responses that enable, encourage, or endorse the commission of sex-related crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Suicide & Self-Harm  mlc-ssh-typical-user.json  | 
Responses that enable, encourage, or endorse acts of intentional self-harm. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MLC Baseline - Suicide & Self-Harm  mlc-ssh-vulnerable-user.json  | 
Responses that enable, encourage, or endorse acts of intentional self-harm. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| ML Commons AILuminate - Violent Crimes - Chinese  mlc-vcr-cn.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Violent Crimes - English  mlc-vcr-en.json  | 
Test whether responses enable, encourage, or endorse the commission of violent crimes. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Violent Crimes - French  mlc-vcr-fr.json  | 
A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Violent Crimes - Japanese  mlc-vcr-jp.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Violent Crimes - Korean  mlc-vcr-kr.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Violent Crimes - Kiswahili  mlc-vcr-kw.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| MLC Baseline - Violent Crimes  mlc-vcr-malicious-user.json  | 
Responses that enable, encourage, or endorse the commission of violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| ML Commons AILuminate - Violent Crimes - Malay  mlc-vcr-my.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| ML Commons AILuminate - Violent Crimes - Telugu  mlc-vcr-tl.json  | 
This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ | 
| MLC Baseline - Violent Crimes  mlc-vcr-typical-user.json  | 
Responses that enable, encourage, or endorse the commission of violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 | 
| MMLU  mmlu-all.json  | 
This dataset covers 57 tasks including elementary mathematics, US history, computer science, law, and more. | MIT license | https://github.com/hendrycks/test | 
| MMLU Anatomy  mmlu-anatomy.json  | 
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Anatomy. | MIT License | https://github.com/hendrycks/test | 
| MMLU Clinical Knowledge  mmlu-clinicalknowledge.json  | 
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Clinical Knowledge. | MIT License | https://github.com/hendrycks/test | 
| MMLU College Biology  mmlu-collegebiology.json  | 
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Biology. | MIT License | https://github.com/hendrycks/test | 
| MMLU College Medicine  mmlu-collegemedicine.json  | 
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Medicine. | MIT License | https://github.com/hendrycks/test | 
| MMLU Medical Genetics  mmlu-medicalgenetics.json  | 
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Medical Genetics. | MIT License | https://github.com/hendrycks/test | 
| MMLU Professional Medicine  mmlu-professionalmedicine.json  | 
MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Professional Medicine. | MIT License | https://github.com/hendrycks/test | 
| List of Occupations  occupation.json  | 
This dataset contains a list of occupations collected from various websites (e.g., MoM of Singapore). The dataset can be used to test job-related applications. | MIT License | IMDA | 
| enronemail  privacy-enronemails.json  | 
The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. | Apache-2.0 | https://github.com/ParakweetLabs/EmailIntentDataSet | 
| Jailbreak Prompts  prompt_injection_jailbreak.json  | 
This dataset evaluates whether a model/app detects attempted jailbreaks or generates unsafe content. These jailbreak techniques are collected from X (e.g., Pliny) and public repositories (e.g., in-the-wild jailbreak prompts). | Apache 2.0 | OWASP | 
| PubMedQA  pubmedqa.json  | 
PubMedQA is a biomedical question answering dataset built from PubMed abstracts, where the task is to answer research questions with yes, no, or maybe based on the corresponding abstracts. | MIT License | https://github.com/pubmedqa/pubmedqa | 
| mock-dataset  rag-sample-dataset.json  | 
This is a sample of a dataset to be used for RAG Evaluation. | ||
| RealToxicityPrompts  real-toxicity-prompts.json  | 
RealToxicityPrompts aims to measure the model's ability to identify toxic sentences. | Apache-2.0 | https://github.com/allenai/real-toxicity-prompts/ | 
| RealtimeQA  realtimeqa-past.json  | 
RealTime QA is a dynamic question answering (QA) platform that inquires about the present. | | https://github.com/realtimeqa/realtimeqa_public | 
| SafetyBench-PII (Test)  safetybench-privacy-full.json  | 
SafetyBench is a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. | MIT License | https://huggingface.co/datasets/thu-coai/SafetyBench | 
| SafetyBench-PII (dev)  safetybench-privacy-small.json  | 
SafetyBench is a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. | MIT license | https://huggingface.co/datasets/thu-coai/SafetyBench | 
| sg-legal-glossary  sg-legal-glossary.json  | 
A list of Singapore legal terms extracted from the SICC and Judiciary websites. | | https://www.sicc.gov.sg/glossary-of-legal-terms | 
| sg-university-tutorial-questions-legal  sg-university-tutorial-questions-legal.json  | 
Contains tutorial questions and answers from Singapore's universities to test the model's ability to understand the legal context in Singapore. | ||
| SGHateCheck - Malay  sghatecheck-ms.json  | 
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. | MIT license | https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases | 
| SGHateCheck - Singlish  sghatecheck-ss.json  | 
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. | MIT license | https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases | 
| SGHateCheck - Tamil  sghatecheck-ta.json  | 
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. | MIT license | https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases | 
| SGHateCheck - Chinese  sghatecheck-zh.json  | 
SGHateCheck is a hate speech benchmark tailored for Singapore's socio-linguistic context. | MIT license | https://github.com/Social-AI-Studio/SGHateCheck/tree/main/testcases | 
| Facts about Singapore in True and False  singapore-facts-tnf.json  | 
Contains prompts with facts about Singapore, in True/False format. | Apache-2.0 | IMDA | 
| Food in Singapore  singapore-food-tnf.json  | 
Contains prompts that test the model's understanding of food in Singapore, in True/False format. | Apache-2.0 | IMDA | 
| Iconic Places in Singapore  singapore-iconic-places.json  | 
Contains questions about Singapore's iconic places. | Apache-2.0 | IMDA | 
| Places in Singapore  singapore-places-tnf.json  | 
Contains prompts that test the model's understanding of places in Singapore, in True/False format. | Apache-2.0 | IMDA | 
| Singapore POFMA Statements  singapore-pofma-statements-2023.json  | 
Statements that are false under POFMA in Singapore for 2023. | Apache-2.0 | Genue | 
| Singapore POFMA Statements  singapore-pofma-statements-2024.json  | 
Statements that are false under POFMA in Singapore for 2024. | Apache-2.0 | Genue | 
| Singapore POFMA True Statements  singapore-pofma-true-statements.json  | 
Statements that are true under POFMA in Singapore. For sanity check only. | Apache-2.0 | Genue | 
| Singapore Political History  singapore-political-history.json  | 
Contains questions about Singapore's key historical events in politics. | Apache-2.0 | IMDA | 
| Singapore Public Housing  singapore-public-housing.json  | 
Contains questions about Singapore's public housing system. | Apache-2.0 | IMDA | 
| Safety Benchmark (Singapore Context)  singapore-safety-questions.json  | 
Contains prompts that test safety in the Singapore context. | Apache-2.0 | IMDA | 
| Singapore Transport System  singapore-transport-system.json  | 
Contains questions about Singapore's transport system. | Apache-2.0 | IMDA | 
| squad-shifts-tnf  squad-shifts-tnf.json  | 
Zero-shot reading comprehension on paragraphs and questions from SQuADShifts. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/squad_shifts | 
| squad-v2  squad-v2.json  | 
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. | CC BY-SA 4.0 | https://huggingface.co/datasets/rajpurkar/squad_v2 | 
| tamil-thirukural  tamil-kural-classification.json  | 
This dataset is used to test the comprehension abilities for the Thirukkural. Thirukkural is a classic Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) that are grouped into 133 chapters, each containing 10 couplets. | Creative Commons Attribution 4.0 International | https://github.com/vijayanandrp/Thirukkural-Tamil-Dataset | 
| tamil-news-classification  tamil-tamilnews-classification.json  | 
This dataset is used to classify a static set of Tamil News. The task is to classify news to its respective category. The dataset has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". | GNU General Public License v3.0 | https://github.com/vanangamudi/tamil-news-classification/tree/master/dataset/news | 
| tanglish-tweets-SA  tamil-tanglish-tweets.json  | 
Code-mixed Tamil and English tweets curated for the sentiment analysis task. | CC0: Public Domain | https://www.kaggle.com/datasets/vyombhatia/tanglish-comments-for-sentiment-ananlysis/data | 
| TruthfulQA (MCQ Version)  truthfulqa-mcq.json  | 
TruthfulQA aims to measure the truthfulness of a model. This dataset uses a multiple choice format. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA | 
| truthfulqa  truthfulqa-multiple-open-ended.json  | 
TruthfulQA aims to measure the truthfulness of a model. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA | 
| truthfulqa  truthfulqa-open-ended.json  | 
TruthfulQA aims to measure the truthfulness of a model. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA | 
| uciadult  uciadult.json  | 
The UCI adult dataset, created in 1996, is used to train models to predict whether a person's income will exceed $50K/yr based on census data. Also known as "Census Income" dataset. | Creative Commons Attribution 4.0 International | https://archive.ics.uci.edu/dataset/2/adult | 
| winobias-variation1  winobias-type1.json  | 
This dataset contains gender-bias test cases based on professions from the Labor Force Statistics (https://www.bls.gov/cps/cpsaat11.htm), which reflect some gender bias. | MIT License | https://github.com/uclanlp/corefBias/tree/master/WinoBias/wino | 
| Winogrande  winogrande.json  | 
This dataset is used for commonsense reasoning, consisting of expert-crafted pronoun resolution problems designed to be unsolvable for statistical models. | Apache-2.0 | https://github.com/allenai/winogrande | 
| XSTest (Privacy related only)  xstest-privacy-subset.json  | 
XSTest test suite highlights systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models | Creative Commons Attribution 4.0 International license | https://huggingface.co/datasets/walledai/XSTest | 
| XSTest  xstest.json  | 
XSTest test suite highlights systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models | CC-BY-4.0 license | https://huggingface.co/datasets/walledai/XSTest |
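
Each dataset in the table above is distributed as a JSON file. As a quick way to inspect one locally, the sketch below loads a file and prints its top-level structure and a sample record. This is a minimal example, assuming a local copy of the file; the path `datasets/arc-easy.json` and the `examples` field name are illustrative assumptions and may differ from the actual schema, so the code inspects the structure rather than relying on a fixed layout.

```python
import json
from pathlib import Path

# Illustrative path; point this at any dataset file listed in the table above.
dataset_path = Path("datasets/arc-easy.json")

with dataset_path.open(encoding="utf-8") as f:
    data = json.load(f)

# Inspect the structure before relying on any particular field name.
if isinstance(data, dict):
    print("Top-level keys:", list(data.keys()))
    # "examples" is an assumed field name; fall back to the first list value found.
    records = data.get("examples")
    if records is None:
        records = next((v for v in data.values() if isinstance(v, list)), [])
else:
    records = data  # some files may simply be a list of records

print("Number of records:", len(records))
if records:
    # Print the first record (truncated) to see which fields it actually uses.
    print("First record:", json.dumps(records[0], ensure_ascii=False, indent=2)[:500])
```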