Running a Test
Overview
Testing in Clavata.ai lets you safely validate Policies before applying them to live content analysis. Use your own labeled test datasets for the most relevant results, or choose from Clavata’s available datasets. While images and text must be tested separately, both can be evaluated under the same Policy.
Labeled datasets improve testing by providing key metrics like accuracy, precision, recall, and false positive/negative rates. They also help identify mismatches between your labels and AI-detected content, highlighting areas where your Policy may need adjustments for better coverage.
For details on the dataset format required to use your own test data, refer to this article.
Running Tests
Running tests should be a normal part of policy creation and updating. Clavata provides a number of metrics to help you understand how well your policy is performing against a dataset.
Label match
True
- If the Policy label and the User label are both blank, the row returns as True because Clavata correctly did not apply a label
- If the Policy label contains every label within the User label, the row returns as True because Clavata correctly applied the right labels
False
- If either the Policy label or the User label contains a label and the other does not, the row returns as False because there was no match
- If the Policy label does not contain every label within the User label, the row returns as False because the User labels were not a subset of the Policy labels
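To make the rule concrete, here is a minimal sketch of the matching logic described above, assuming each row carries the set of labels the Policy applied and the set of labels the user assigned. The label names in the example calls are purely illustrative, and this is not Clavata's actual implementation:

```python
def label_match(policy_labels: set[str], user_labels: set[str]) -> bool:
    """Return True when the Policy result agrees with the user's labels."""
    # Both blank: Clavata correctly applied no label.
    if not policy_labels and not user_labels:
        return True
    # One side has labels and the other does not: no match.
    if not policy_labels or not user_labels:
        return False
    # Otherwise the Policy must have applied every label the user assigned.
    return user_labels.issubset(policy_labels)

print(label_match(set(), set()))                    # True
print(label_match({"Violence"}, set()))             # False
print(label_match({"Violence", "Hate"}, {"Hate"}))  # True
```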
Testing Metrics Overview
This section gives an overview of Accuracy, Precision, Recall, False Negative Rate, and False Positive Rate: how each is calculated and what it means for your Policy and results.
Accuracy
Accuracy measures how often Clavata’s assessment matches the intended outcome, from 0% (completely inaccurate) to 100% (perfectly accurate).
It’s calculated by dividing the number of matches between Clavata’s results and your labels by the total dataset size.
- True = User labeled content as "harmful."
- False = User left content unlabeled (benign).
Example: If Clavata correctly matches 10 out of 20 items, accuracy is 50%.
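For instance, assuming you had a list of per-row match results (True where Clavata’s label matched yours, False where it did not), the calculation would look roughly like this:

```python
# Hypothetical per-row outcomes: True = label match, False = mismatch.
row_matches = [True] * 10 + [False] * 10

accuracy = sum(row_matches) / len(row_matches)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 50%
```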
Precision
- Precision is a measure of accuracy in classification. It calculates the proportion of correctly identified positive cases out of all cases predicted as positive. In simple terms, it tells you how often a positive prediction is actually correct.
- Precision measures how well your policy correctly flags labelled content. A high precision score means fewer false positives and more true positives, i.e. the content you actually want labels applied to.
- For example, if the AI labels 100 items and 90 are correct (true positives) while 10 are incorrect (false positives), the precision is 90%—meaning 90% of flagged content was correctly identified.
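Using the counts from that example, a rough sketch of the calculation is:

```python
true_positives = 90   # items the AI labelled that should have been labelled
false_positives = 10  # items the AI labelled that should not have been

precision = true_positives / (true_positives + false_positives)
print(f"Precision: {precision:.0%}")  # Precision: 90%
```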
Recall
- Recall measures how well a system identifies all actual positive cases. It calculates the proportion of correctly identified positives out of all true positives that exist. In simple terms, it tells you how often the system catches what it’s supposed to find.
- Recall measures how well your policy flags all the content you wanted flagged from your dataset. A high recall means fewer false negatives, ensuring most harmful content is flagged.
- For example, if there are 100 items that should have labels and the AI correctly flags 80 but misses 20, recall is 80%—meaning 80% of harmful content was identified.
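The same example expressed as a quick calculation:

```python
true_positives = 80   # items that should be labelled and were flagged
false_negatives = 20  # items that should be labelled but were missed

recall = true_positives / (true_positives + false_negatives)
print(f"Recall: {recall:.0%}")  # Recall: 80%
```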
Clavata offers a structure for achieving both high Precision and Recall, but at the upper levels there is a trade-off: you can label content very precisely at the cost of letting some content that should be labelled go unlabelled, or label everything that should be labelled at the cost of labelling some content that shouldn't be.
False Negative
- A False Negative happens when the AI misses something that should have been labelled. Having a low percentage rate here is a good thing.
- The rate is calculated by dividing the number of False Negatives by all the content in the dataset.
False Positive
- A False Positive occurs when the AI incorrectly labels content. Having a low percentage rate here is a good thing.
- The rate is calculated by dividing the number of False Positives by all the content in the dataset.
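As a rough sketch of both rates as described above (note that, per the descriptions in this article, each count is divided by the size of the whole dataset; the numbers below are purely illustrative):

```python
false_negatives = 5  # items that should have been labelled but were missed
false_positives = 3  # items that were labelled but should not have been
dataset_size = 200   # all rows in the test dataset

false_negative_rate = false_negatives / dataset_size
false_positive_rate = false_positives / dataset_size
print(f"False Negative Rate: {false_negative_rate:.1%}")  # 2.5%
print(f"False Positive Rate: {false_positive_rate:.1%}")  # 1.5%
```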
Condition information
After running a test, Clavata will provide information in the editor on how often a specific Condition was triggered. You can safely assume that a Condition with a high hit count has more impact on your testing metrics than one with a low hit count.
If a condition does not have a hit count, then that means that condition was not triggered by anything in your dataset.
Hits
Hits show how often a condition returns True, which indicates how much content an assertion is finding in your given dataset.
Precision
Precision shows how good your assertion is at finding the correct content. As above, it measures how good a condition is at finding labelled content. A higher percentage here is a positive; a lower percentage means the condition should be changed or an UNLESS statement may be needed.
Negative Predictive Value (NPV)
The purpose of an EXCEPT WHEN or UNLESS statement is to help exclude content from an assertion. For example, if your policy is intended to catch the sale or discussion of drugs, you might add something like 'legal drugs' to your EXCEPT WHEN or UNLESS statement to ensure that legal drugs can still be discussed. NPV measures how effectively the statement excludes only non-labelled data. A high NPV score means it is working well, while a low NPV score means content you would want labelled is being excluded.
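As a rough sketch, NPV can be thought of as the share of content the condition left unlabelled that was genuinely not meant to be labelled. The standard formula is shown below with illustrative numbers; treat it as an approximation of how the score is reported:

```python
true_negatives = 95  # items left unlabelled that your dataset also leaves unlabelled
false_negatives = 5  # items left unlabelled that your dataset says should be labelled

npv = true_negatives / (true_negatives + false_negatives)
print(f"NPV: {npv:.0%}")  # NPV: 95%
```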
Downloading your results
Click the download button to generate a CSV with a full breakdown of your data.
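If you want to analyse the export yourself, a short script like the following can load it (the file name below is a placeholder, and the column names vary, so inspect the header row of your own download):

```python
import csv

# "test_results.csv" is a placeholder name for the downloaded export.
with open("test_results.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} rows loaded")
print(rows[0].keys())  # inspect the actual column names in your export
```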
Need more help? Contact our support team at support@clavata.ai