Text validation framework for Python: test AI outputs against BLEU, ROUGE, semantic similarity, and other metrics.
AI-generated text is hard to validate. You can't assertEqual on free-form output, and eyeballing samples doesn't scale. You need metrics that capture different quality dimensions, with thresholds you can tune per use case.
Veritext provides composable validators for text quality. Pick the metrics you care about, set thresholds, compose them with boolean logic. Use cases range from chatbot output quality to summariser fidelity to content generator consistency. It plugs into existing Python test infrastructure via pytest, so adopting it doesn't mean learning a new framework.
Five metric families: BLEU (n-gram precision against a reference), ROUGE (recall-oriented overlap), lexical similarity (edit distance, Jaccard), readability (Flesch-Kincaid and similar), and semantic similarity (sentence-transformers embeddings, cosine similarity). Each metric is a standalone validator with a configurable threshold.
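As a minimal sketch of the "standalone validator with a configurable threshold" idea, here is a Jaccard-similarity check (one of the lexical metrics) wrapped into a pass/fail validator. The function names are illustrative, not Veritext's actual API:

```python
def jaccard_score(candidate: str, reference: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B| (a lexical-similarity metric)."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    if not (a | b):
        return 1.0
    return len(a & b) / len(a | b)

def make_validator(metric, threshold):
    """Wrap a scoring function into a pass/fail check with a tunable threshold."""
    def validator(candidate, reference):
        score = metric(candidate, reference)
        return score >= threshold, score
    return validator

# Tune the threshold per use case, as described above.
jaccard_ok = make_validator(jaccard_score, threshold=0.5)
passed, score = jaccard_ok("the cat sat on the mat", "a cat sat on a mat")
```

Keeping each metric a plain scoring function makes the threshold the only per-use-case knob.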
Compose validators with all_of (every metric must pass) and any_of (at least one must pass) for complex validation rules. Semantic similarity catches paraphrases that lexical metrics miss entirely, but it's opt-in because the model download is large and inference is slow. Users who only need lexical metrics don't pay for it.
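The composition idea can be sketched with `all_of`/`any_of` as higher-order functions over validators; the combinator names mirror the ones described above, but the implementation is illustrative rather than Veritext's:

```python
def all_of(*validators):
    """Pass only if every named validator passes; keep per-metric results."""
    def combined(candidate, reference):
        results = {name: fn(candidate, reference) for name, fn in validators}
        return all(ok for ok, _ in results.values()), results
    return combined

def any_of(*validators):
    """Pass if at least one named validator passes."""
    def combined(candidate, reference):
        results = {name: fn(candidate, reference) for name, fn in validators}
        return any(ok for ok, _ in results.values()), results
    return combined

# Two toy validators, each returning (passed, score):
def jaccard(c, r):
    a, b = set(c.split()), set(r.split())
    score = len(a & b) / len(a | b) if (a | b) else 1.0
    return score >= 0.5, score

def length_ratio(c, r):
    score = min(len(c), len(r)) / max(len(c), len(r))
    return score >= 0.8, score

check = all_of(("jaccard", jaccard), ("length", length_ratio))
ok, results = check("the cat sat", "a cat sat")
```

Because the combinator returns per-metric results alongside the overall verdict, failure reporting can say exactly which metric fell short.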
Ships as a pytest plugin: validate_text() integrates with standard test discovery, so adding text quality assertions feels like writing normal tests rather than adopting a separate tool. Structured failure messages tell you which metrics failed and by how much, not just 'assertion failed'.
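A sketch of what a validate_text()-style assertion with structured failure messages could look like; this is not the plugin's actual implementation, and the metric is a stand-in:

```python
def jaccard_score(candidate: str, reference: str) -> float:
    """Toy lexical metric standing in for BLEU/ROUGE/etc."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def validate_text(candidate, reference, validators):
    """Run every (name, metric_fn, threshold); raise with per-metric detail."""
    failures = []
    for name, fn, threshold in validators:
        score = fn(candidate, reference)
        if score < threshold:
            failures.append(f"{name}: score {score:.3f} < threshold {threshold}")
    if failures:
        raise AssertionError("text validation failed:\n  " + "\n  ".join(failures))

# Passes silently when every metric clears its threshold:
validate_text("the cat sat on the mat", "a cat sat on a mat",
              [("jaccard", jaccard_score, 0.5)])
```

On failure, the message names each offending metric and the margin by which it missed, which is the "which metrics failed and by how much" behaviour described above.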
The regression detection workflow benchmarks a set of outputs, then checks future versions against the benchmark. Quality regressions get caught the same way unit tests catch functional regressions: a failing test in CI.
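The benchmark-then-compare workflow can be sketched as: score a set of outputs once, store the scores as a baseline, then fail any later run that drops below the baseline by more than a tolerance. The file format, function names, and toy metric here are illustrative assumptions:

```python
import json
import tempfile
from pathlib import Path

def length_ratio(candidate: str, reference: str) -> float:
    """Toy metric standing in for BLEU/ROUGE/semantic similarity."""
    return min(len(candidate), len(reference)) / max(len(candidate), len(reference))

def benchmark(outputs, references, metric, path):
    """Score the current outputs and store them as the baseline."""
    scores = [metric(o, r) for o, r in zip(outputs, references)]
    Path(path).write_text(json.dumps(scores))
    return scores

def check_regression(outputs, references, metric, path, tolerance=0.05):
    """Return indices whose score dropped below baseline - tolerance."""
    baseline = json.loads(Path(path).read_text())
    current = [metric(o, r) for o, r in zip(outputs, references)]
    return [i for i, (old, new) in enumerate(zip(baseline, current))
            if new < old - tolerance]

refs = ["a cat sat on a mat", "the dog barked"]
good = ["the cat sat on the mat", "the dog barked loudly"]
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = f"{tmp}/baseline.json"
    benchmark(good, refs, length_ratio, baseline_path)
    # A degraded model version produces much shorter outputs:
    regressions = check_regression(["cat", "dog"], refs, length_ratio, baseline_path)
```

Wiring `check_regression` into a test makes a quality drop surface exactly like a failing unit test in CI.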
A command-line tool handles batch validation against JSONL files, where each line is an input/reference/output tuple. This is useful for validating a dataset of outputs before deploying a new model version or after a prompt change. The tool reports per-metric scores and aggregate pass/fail, and can be wired into CI pipelines alongside unit tests.
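The batch flow can be sketched as: parse one record per JSONL line, run every validator over it, and aggregate pass/fail. The field names and report shape are illustrative assumptions, not the CLI's actual format:

```python
import json

def jaccard(candidate: str, reference: str) -> float:
    a, b = set(candidate.split()), set(reference.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def validate_jsonl(lines, validators):
    """validators: (name, metric_fn, threshold) tuples; one record per line."""
    report = []
    for line in lines:
        rec = json.loads(line)
        scores = {name: fn(rec["output"], rec["reference"])
                  for name, fn, _ in validators}
        passed = all(scores[name] >= t for name, _, t in validators)
        report.append({"input": rec["input"], "scores": scores, "passed": passed})
    return report

lines = [
    json.dumps({"input": "q1", "reference": "a cat sat on a mat",
                "output": "the cat sat on the mat"}),
    json.dumps({"input": "q2", "reference": "the dog barked",
                "output": "completely unrelated text"}),
]
report = validate_jsonl(lines, [("jaccard", jaccard, 0.5)])
```

A CI job can then fail the build if any record (or more than some fraction) comes back with `passed` false.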
Pydantic models define validator configuration: metric type, threshold, and composition rules (all_of, any_of). Each metric is a stateless function that takes text and returns a score. The pytest plugin wraps validators in a validate_text() assertion that integrates with standard test discovery. The CLI reads JSONL inputs and runs the same validators in batch, outputting structured results.
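A sketch of what the Pydantic-backed configuration described above might look like; the class and field names are illustrative, not Veritext's actual models:

```python
from typing import Literal

from pydantic import BaseModel

class MetricConfig(BaseModel):
    """One metric plus its pass threshold (hypothetical field names)."""
    metric: Literal["bleu", "rouge", "lexical", "readability", "semantic"]
    threshold: float = 0.5

class ValidatorConfig(BaseModel):
    """A composition rule over a list of metric checks."""
    mode: Literal["all_of", "any_of"] = "all_of"
    metrics: list[MetricConfig]

config = ValidatorConfig(
    mode="all_of",
    metrics=[
        MetricConfig(metric="bleu", threshold=0.3),
        MetricConfig(metric="semantic", threshold=0.8),
    ],
)
```

Validating configuration up front means a typo'd metric name or out-of-range threshold fails at load time rather than mid-run.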
Multiple metrics: BLEU, ROUGE, lexical, readability, semantic similarity
Composable validators (all_of, any_of for complex checks)
Native pytest integration with validate_text() assertion
Quality benchmarking with regression detection
CLI with JSONL batch processing
Composable validators were the key design decision. Early versions had a monolithic validate() function with too many parameters. Splitting into individual metrics composed with all_of/any_of made the API much more natural and the error messages much more useful.
Semantic similarity via sentence-transformers catches paraphrases that BLEU and ROUGE miss entirely, but it's slow and the model download is large. Making it opt-in kept the library lightweight for users who only need lexical metrics.
Building this as a pytest plugin instead of a standalone tool was the right call. Developers already have test suites; adding text quality assertions should feel like writing any other test, not learning a new framework.