List of Metrics

AnswerCarefullyAnnotator
answercarefully-annotator.py
Calculates the number of undesired, not-undesired, and unknown responses using GPT-4o.
AnswerCorrectness
answercorrectness.py
Answer correctness in the context of Ragas involves evaluating the accuracy of a generated answer compared to the ground truth. This process assesses both the semantic and factual similarities between the answer and the ground truth. Scores range from 0 to 1, where a higher score indicates a closer match, thus higher correctness.
AnswerRelevance
answerrelevance.py
The Answer Relevancy metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, while higher scores indicate better relevancy. This metric is computed using the question, the context, and the answer.
Attack Success Rate
advglue.py
Attack success rate measures how often a minimally perturbed prompt succeeds against the system under test. A high score indicates that the system is highly sensitive to small changes in a prompt.
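
As a rough illustration (not necessarily the advglue.py implementation), the rate can be computed as the fraction of perturbed prompts that flip the model's prediction; the helper below is a hypothetical sketch:

```python
def attack_success_rate(original_preds, perturbed_preds):
    """Fraction of perturbed prompts that flip the model's prediction.

    Hypothetical sketch: an attack counts as successful whenever the
    prediction changes between the original and perturbed prompt.
    """
    flips = sum(o != p for o, p in zip(original_preds, perturbed_preds))
    return flips / len(original_preds) if original_preds else 0.0
```
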
BertScore
bertscore.py
BertScore uses BERT to measure the embedding similarity between two sentences.
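
A minimal sketch using the bert-score package, which is an assumption about the backend and may differ from what bertscore.py does internally:

```python
from bert_score import score  # pip install bert-score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Precision, recall, and F1 are derived from pairwise cosine similarity
# of contextual BERT token embeddings between candidate and reference.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```
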
BleuScore
bleuscore.py
BleuScore uses BLEU to score the n-gram overlap between the output and the expected target.
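
For reference, a minimal BLEU computation with NLTK (an assumed backend; bleuscore.py may use a different one):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

# BLEU scores the n-gram overlap between candidate and reference(s);
# smoothing avoids zero scores when longer n-grams have no matches.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(bleu)
```
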
ContextPrecision
contextprecision.py
Context Precision is a metric that evaluates whether all of the ground-truth-relevant items present in the contexts are ranked near the top. Ideally, all the relevant chunks must appear at the top ranks. This metric is computed using the question, ground_truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
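
A sketch of the Ragas-style formula, averaging precision@k over the ranks where relevant chunks appear (the helper below is hypothetical, not contextprecision.py itself):

```python
def context_precision(relevance):
    """Mean precision@k over the ranks of relevant retrieved chunks.

    `relevance` is a list of 0/1 flags, one per retrieved chunk in rank
    order (1 = relevant to the ground truth). Hypothetical sketch.
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at each relevant rank
    return total / hits if hits else 0.0
```

For instance, relevance flags [1, 0, 1] yield (1/1 + 2/3) / 2, roughly 0.83.
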
ContextRecall
contextrecall.py
Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed using the question, the ground truth, and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.
EntityProcessor
entity_processor.py
This metric processes and analyses text to identify entities using the spaCy library. It also searches for entities in a given text that do not appear in a source text, labeling them as "unmatched" or "hallucinated" entities.
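
A simplified sketch of the unmatched-entity idea with spaCy (the model name and the case-insensitive matching rule are assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def unmatched_entities(generated: str, source: str) -> list[str]:
    """Entities in the generated text that never appear in the source text.

    Simplified sketch: such entities are flagged as potentially hallucinated.
    """
    source_ents = {ent.text.lower() for ent in nlp(source).ents}
    return [ent.text for ent in nlp(generated).ents
            if ent.text.lower() not in source_ents]
```
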
ExactStrMatch
exactstrmatch.py
ExactStrMatch compares the output from the language model with a single target or multiple expected targets.
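
A minimal sketch of exact-string matching against one or more targets (a hypothetical helper, not the actual file):

```python
def exact_str_match(output: str, targets) -> float:
    """Return 1.0 if the output exactly matches any expected target, else 0.0.

    `targets` may be a single string or a list of acceptable strings.
    """
    if isinstance(targets, str):
        targets = [targets]
    return 1.0 if output in targets else 0.0
```
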
ExactStrMatchF1SquadV2
exactstrmatch-f1-squad-v2.py
ExactStrMatch compares the output from the language model with a single target or multiple expected targets (variant for the SQuAD v2 exact-match/F1 evaluation).
ExactStrMatchGSM8k
exactstrmatch-gsm8k-test.py
ExactStrMatch compares the output from the language model with a single target or multiple expected targets (variant for the GSM8K test set).
Faithfulness
faithfulness.py
This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0,1) range; higher is better.
FlagJudge Annotator
flagjudgeannotator.py
Calculates the number of correct responses using FlagJudge (a judge model from the FlagEval group).
GenderBiasMetric
genderbias_metric.py
This metric uses a CLIP-based gender classifier to determine the proportion of females, returning a score between 0 and 0.5: score = |female_frac - 0.5|.
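
A minimal sketch of the formula, assuming the classifier emits "female"/"male" labels per image:

```python
def gender_bias_score(genders: list[str]) -> float:
    """Distance of the female fraction from parity: |female_frac - 0.5|.

    0.0 means a perfectly balanced set of images; 0.5 means every image
    was classified as the same gender. Label strings are an assumption.
    """
    if not genders:
        return 0.0
    female_frac = sum(g == "female" for g in genders) / len(genders)
    return abs(female_frac - 0.5)
```
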
LlamaGuard Annotator
llamaguardannotator.py
Calculates the number of safe, unsafe, and unknown responses using LlamaGuard.
LeakageRate
leakagerate.py
LeakageRate compares the LCS between two strings: the output and the target.
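
A sketch assuming LCS here means the longest common subsequence, with a hypothetical normalization by target length:

```python
def lcs_length(a: str, b: str) -> int:
    """Longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def leakage_rate(output: str, target: str) -> float:
    # Hypothetical normalization: share of the target recoverable from the output.
    return lcs_length(output, target) / len(target) if target else 0.0
```
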
MLCPRVAnnotator
mlcprv-annotator.py
Calculates the number of acceptable, unacceptable, and unknown responses using GPT-4o.
MLCIPV Annotator
mlcipv-annotator.py
Calculates the number of acceptable, unacceptable, and unknown responses using GPT-4o.
MLCNCR Annotator
mlcncr-annotator.py
Calculates the number of acceptable, unacceptable, and unknown responses using GPT-4o.
MLCVCR Annotator
mlcvcr-annotator.py
Calculates the number of acceptable, unacceptable, and unknown responses using GPT-4o.
NudeNetMetric
nudenet_metric.py
NudeNet is a classifier used to estimate the probabilities of exposed body parts in images.
Q16Metric
q16_metric.py
Q16 is a binary classifier used to determine whether an image is inappropriate (https://github.com/ml-research/Q16)
ReadabilityScore
readabilityscore.py
ReadabilityScore uses the Flesch Reading Ease formula to compute the readability of the output.
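
For reference, the Flesch Reading Ease formula is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). A minimal sketch using the textstat package, which is an assumption about how readabilityscore.py computes it:

```python
import textstat  # pip install textstat

text = "The quick brown fox jumps over the lazy dog."

# Higher Flesch Reading Ease scores mean easier text; dense technical
# prose can score near or below zero.
print(textstat.flesch_reading_ease(text))
```
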
RelaxStrMatch
relaxstrmatch.py
RelaxStrMatch removes symbols and spaces before comparing the output from the language model with the expected target.
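
A sketch of one plausible normalization (the actual stripping rules in relaxstrmatch.py may differ):

```python
import string

def relax_str_match(output: str, target: str) -> float:
    """Compare strings after dropping punctuation, whitespace, and case."""
    def normalize(s: str) -> str:
        return "".join(ch for ch in s.lower()
                       if ch not in string.punctuation and not ch.isspace())
    return 1.0 if normalize(output) == normalize(target) else 0.0
```
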
ReverseExactStrMatch
reverseexactstrmatch.py
ReverseExactStrMatch compares the output from the language model with the expected target.
RougeScorer
rougescorer.py
RougeScorer returns the various ROUGE scores.
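
A minimal sketch using Google's rouge-score package, assumed here to be the backend behind rougescorer.py:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
# score(target, prediction) returns precision/recall/F1 per ROUGE variant.
scores = scorer.score("the cat sat on the mat",
                      "a cat was sitting on the mat")
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))
```
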
SpellingScore
spelling.py
SpellingScore uses Levenshtein Distance to find permutations within an edit distance of 2 from the original word before comparing to known words in a word frequency list.
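
A sketch of the classic edit-distance candidate generation (Norvig-style), assuming spelling.py works along these lines:

```python
import string

def edits1(word: str) -> set[str]:
    """All strings one edit (delete, transpose, replace, insert) from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word: str, known_words: set[str]) -> set[str]:
    # Edit distance 2 = one more edit applied to each distance-1 string.
    within1 = edits1(word)
    within2 = set().union(*(edits1(e) for e in within1))
    return (within1 | within2) & known_words
```
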
Toxicity Classifier
toxicity-classifier.py
This classifier measures how toxic a given input is and calculates the number of toxic sentences detected.
Cybersec Eval Prompt Injection v2
cybersecevalannotator2.py
Calculates the number of successful and unsuccessful replies as determined by the judge LLM.
General Judge LLM
cybersecevalannotator.py
Calculates the number of "yes" and "no" replies from the judge LLM ("yes" means the prompt injection succeeded).