Running a Test
Overview
Testing in Clavata.ai lets you safely validate Policies before applying them to live content analysis. Use your own labeled test datasets for the most relevant results, or choose from Clavata’s available datasets. While images and text must be tested separately, both can be evaluated under the same Policy.
Labeled datasets improve testing by providing key metrics like accuracy, precision, recall, and false positive/negative rates. They also help identify mismatches between your labels and AI-detected content, highlighting areas where your Policy may need adjustments for better coverage.
For details on the dataset format required to use your own test data, refer to this article.
Running Tests
Running tests should be a normal part of policy creation and updating. Clavata provides a number of metrics to help you understand how well your policy is performing against a dataset.
Label match
True
- If the Policy label and the User label are both blank, the row returns as True because Clavata correctly did not apply a label
- If the Policy label contains every label within the User label, the row returns as True because Clavata correctly applied the right labels
False
- If either the Policy label or the User label contains a label and the other does not, the row returns as False because there was no match
- If the Policy label does not contain every label within the User label, the row returns as False because the User labels were not a subset of the Policy labels
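To make the rule concrete, here is a minimal sketch of the matching logic described above, assuming each row carries the set of labels the Policy applied and the set of labels the user assigned. The label names in the example calls are purely illustrative, and this is not Clavata's actual implementation:

```python
def label_match(policy_labels: set[str], user_labels: set[str]) -> bool:
    """Return True when the Policy result agrees with the user's labels."""
    # Both blank: Clavata correctly applied no label.
    if not policy_labels and not user_labels:
        return True
    # One side has labels and the other does not: no match.
    if not policy_labels or not user_labels:
        return False
    # Otherwise the Policy must have applied every label the user assigned.
    return user_labels.issubset(policy_labels)

print(label_match(set(), set()))                    # True
print(label_match({"Violence"}, set()))             # False
print(label_match({"Violence", "Hate"}, {"Hate"}))  # True
```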
Testing Metrics Overview
This section gives an overview of Accuracy, Precision, Recall, False Negative Rate, and False Positive Rate: how each is calculated and what it means for your Policy and results.
Accuracy
Accuracy measures how often Clavata’s assessment matches the intended outcome, from 0% (completely inaccurate) to 100% (perfectly accurate).
It’s calculated by dividing the number of matches between Clavata’s results and your labels by the total dataset size.
- True = User labeled content as "harmful."
- False = User left content unlabeled (benign).
Example: If Clavata correctly matches 10 out of 20 items, accuracy is 50%.
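For instance, assuming you had a list of per-row match results (True where Clavata’s label matched yours, False where it did not), the calculation would look roughly like this:

```python
# Hypothetical per-row outcomes: True = label match, False = mismatch.
row_matches = [True] * 10 + [False] * 10

accuracy = sum(row_matches) / len(row_matches)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 50%
```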
Precision
- Precision is a measure of accuracy in classification. It calculates the proportion of correctly identified positive cases out of all cases predicted as positive. In simple terms, it tells you how often a positive prediction is actually correct.
- Precision measures how well your policy correctly flags labelled content. A high precision score means fewer false positives and more true positives, i.e. the content you actually want labels applied to.
- For example, if the AI labels 100 items and 90 are correct (true positives) while 10 are incorrect (false positives), the precision is 90%—meaning 90% of flagged content was correctly identified.
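Using the counts from that example, a rough sketch of the calculation is:

```python
true_positives = 90   # items the AI labelled that should have been labelled
false_positives = 10  # items the AI labelled that should not have been

precision = true_positives / (true_positives + false_positives)
print(f"Precision: {precision:.0%}")  # Precision: 90%
```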
Recall
- Recall measures how well a system identifies all actual positive cases. It calculates the proportion of correctly identified positives out of all true positives that exist. In simple terms, it tells you how often the system catches what it’s supposed to find.
- Recall measures how well your policy flags all the content you wanted flagged from your dataset. A high recall means fewer false negatives, ensuring most harmful content is flagged.
- For example, if there are 100 items that should have labels and the AI correctly flags 80 but misses 20, recall is 80%—meaning 80% of harmful content was identified.
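The same example expressed as a quick calculation:

```python
true_positives = 80   # items that should be labelled and were flagged
false_negatives = 20  # items that should be labelled but were missed

recall = true_positives / (true_positives + false_negatives)
print(f"Recall: {recall:.0%}")  # Recall: 80%
```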
Clavata offers a structure for achieving both high Precision and Recall, but at the upper levels there is a trade-off: you can label content very precisely at the cost of letting some content that should be labelled go unlabelled, or label everything that should be labelled at the cost of labelling some content that shouldn't be.
False Negative
- A False Negative happens when the AI misses something that should have been labelled. Having a low percentage rate here is a good thing.
- The rate is calculated by dividing the number of False Negatives by all the content in the dataset.
False Positive
- A False Positive occurs when the AI incorrectly labels content. Having a low percentage rate here is a good thing.
- The rate is calculated by dividing the number of False Positives by all the content in the dataset.
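As a rough sketch of both rates as described above (note that, per the descriptions in this article, each count is divided by the size of the whole dataset; the numbers below are purely illustrative):

```python
false_negatives = 5  # items that should have been labelled but were missed
false_positives = 3  # items that were labelled but should not have been
dataset_size = 200   # all rows in the test dataset

false_negative_rate = false_negatives / dataset_size
false_positive_rate = false_positives / dataset_size
print(f"False Negative Rate: {false_negative_rate:.1%}")  # 2.5%
print(f"False Positive Rate: {false_positive_rate:.1%}")  # 1.5%
```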
Condition information
After running a test, Clavata will provide information in the editor on how often a specific Condition was triggered. You can safely assume that a Condition with a high hit count has more impact on your testing metrics than one with a low hit count.
If a condition does not have a hit count, then that means that condition was not triggered by anything in your dataset.
Hits
Hits show how often a condition returns True, which indicates how much content an assertion is finding in your given dataset.
Precision
Precision shows how good your assertion is at finding the correct content. As above, it measures how good a condition is at finding labelled content. A higher percentage here is a positive; a lower percentage means the condition should be changed or an UNLESS statement may be needed.
Negative Predictive Value (NPV)
The purpose of an EXCEPT WHEN or UNLESS statement is to help exclude content from an assertion. For example, if your policy is intended to catch the sale or discussion of drugs, you might add something like 'legal drugs' to your EXCEPT WHEN or UNLESS statement to ensure that legal drugs can still be discussed. NPV measures how effectively the statement excludes only non-labelled data. A high NPV score means it is working well, while a low NPV score means content you would want labelled is being excluded.
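As a rough sketch, NPV can be thought of as the share of content the condition left unlabelled that was genuinely not meant to be labelled. The standard formula is shown below with illustrative numbers; treat it as an approximation of how the score is reported:

```python
true_negatives = 95  # items left unlabelled that your dataset also leaves unlabelled
false_negatives = 5  # items left unlabelled that your dataset says should be labelled

npv = true_negatives / (true_negatives + false_negatives)
print(f"NPV: {npv:.0%}")  # NPV: 95%
```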
Downloading your results
Click the download button to generate a CSV with a full breakdown of your data.
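If you want to analyse the export yourself, a short script like the following can load it (the file name below is a placeholder, and the column names vary, so inspect the header row of your own download):

```python
import csv

# "test_results.csv" is a placeholder name for the downloaded export.
with open("test_results.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} rows loaded")
print(rows[0].keys())  # inspect the actual column names in your export
```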
Need more help? Contact our support team at support@clavata.ai