We Benchmarked Our Detection Engine Against 2,369 Samples from 7 Peer-Reviewed Datasets
Most AI security vendors tell you their tool "catches prompt injection." Few publish numbers. Fewer still test against independent, peer-reviewed datasets where they don't control the data.
We ran PromptGuard's full detection pipeline against 2,369 samples from seven independent datasets - including benchmarks published at NeurIPS, ACM CCS, NAACL, and ICLR. We compared it against a leading open-source ML classifier deployed standalone, without our pipeline. Every prediction is saved per-sample to JSONL for full reproducibility.
Here's what we found.
The Setup
Three configurations tested on identical data:
- PromptGuard Full: Our production pipeline: adversarial text normalization → regex pattern matching → ML classification → content safety classification → multi-turn intent drift detection → policy evaluation
- Standalone ML classifier: A leading open-source injection classifier called directly, no preprocessing
- PromptGuard Regex-Only: Pattern matching only, no ML (baseline)
Seven datasets, zero cherry-picking:
| Dataset | Source | Samples | What it tests |
|---|---|---|---|
| TensorTrust | ICLR 2024 | 500 | Human-generated injection attacks from an adversarial game |
| In-the-Wild Jailbreaks | ACM CCS 2024 | 500 | Real jailbreak prompts from Reddit and Discord |
| deepset/prompt-injections | Community | 500 | Mixed injection + benign prompts |
| JailbreakBench | NeurIPS 2024 | 200 | Harmful behavior requests + benign baselines |
| XSTest | NAACL 2024 | 250 | Safe prompts that look dangerous (false positive test) |
| Benign Corpus | Internal | 298 | Business, educational, and conversational prompts |
| Evasion Suite | Internal | 100 | 10 attack seeds × 10 adversarial encoding techniques |
| Total | 2,369 | 1,378 attack + 991 benign |
Aggregate Results
| Approach | F1 | 95% CI | Precision | Recall | FPR |
|---|---|---|---|---|---|
| PromptGuard Full | 0.887 | [0.874, 0.900] | 99.1% | 80.3% | 1.01% |
| Standalone ML classifier | 0.850 | [0.834, 0.864] | 99.5% | 74.2% | 0.50% |
| Regex-Only (baseline) | 0.527 | [0.498, 0.554] | 99.0% | 35.9% | 0.50% |
The confidence intervals don't overlap. The pipeline improvement over the standalone classifier is statistically significant at the 95% level.
What does this mean in practice? PromptGuard catches 6 more attacks out of every 100 that the standalone model misses, while maintaining 99.1% precision (almost zero false alarms).
Where the Pipeline Wins: Evasion Robustness
This is the result that matters most for production security. Attackers don't send clean English injection prompts - they encode them.
We took 10 canonical injection prompts and applied 10 evasion techniques to each:
| Evasion Technique | PromptGuard Full | Standalone ML classifier |
|---|---|---|
| Base64 encoding | 10/10 (100%) | 3/10 (30%) |
| Leetspeak | 10/10 (100%) | 1/10 (10%) |
| Text reversal | 10/10 (100%) | 7/10 (70%) |
| Unicode homoglyphs | 10/10 (100%) | 9/10 (90%) |
| Zero-width chars | 10/10 (100%) | 10/10 (100%) |
| Case alternation | 10/10 (100%) | 10/10 (100%) |
| Whitespace injection | 10/10 (100%) | 10/10 (100%) |
| Markdown wrapping | 10/10 (100%) | 10/10 (100%) |
| XML wrapping | 10/10 (100%) | 10/10 (100%) |
| Benign prefix | 10/10 (100%) | 10/10 (100%) |
| Total | 100/100 (100%) | 80/100 (80%) |
Standalone ML classifiers — even strong ones trained on clean text — fail catastrophically on base64 (30%) and leetspeak (10%). These are trivial encoding techniques that any motivated attacker will try first.
PromptGuard's adversarial text normalization layer strips the encoding before the text reaches the ML model, restoring its full detection capability. The model doesn't need to learn every encoding - the normalization layer handles it.
Per-Dataset Breakdown
| Dataset | PG Full F1 | Baseline F1 | Delta |
|---|---|---|---|
| TensorTrust (ICLR 2024) | 0.992 | 0.992 | 0.0% |
| In-the-Wild (ACM CCS 2024) | 0.902 | 0.841 | +7.3% |
| Internal Red Team | 1.000 | 0.865 | +15.6% |
| Evasion Suite | 1.000 | 0.889 | +12.5% |
| deepset/prompt-injections | 0.639 | 0.612 | +4.3% |
On in-the-wild jailbreak prompts — the real adversarial distribution from Reddit and Discord — PromptGuard outperforms the standalone classifier by 7.3%. On our internal red team vectors, the gap is 15.6%.
The largest gains come from the normalization layer catching encoded attacks that the raw model can't parse.
What About False Positives?
We tested against 250 safe prompts from XSTest (NAACL 2024) - prompts that deliberately use language similar to unsafe content (homonyms, figurative language, safe contexts). A poorly calibrated detector would flag these.
- PromptGuard Full: 0.4% FPR (1 out of 250)
- Standalone ML classifier: 0.0% FPR
- Benign corpus (298 prompts): 0.0% FPR for all detectors
99.6% of safe-but-tricky prompts pass through correctly. No legitimate business prompts are blocked.
The Ablation: What Each Layer Contributes
| Configuration | F1 | Recall | Precision |
|---|---|---|---|
| Regex only (baseline) | 0.527 | 35.9% | 99.0% |
| Full pipeline (norm + regex + ML) | 0.887 | 80.3% | 99.1% |
Regex catches known patterns with 99% precision but only 36% recall. Adding ML doubles recall while maintaining precision. The normalization layer closes the evasion gap completely.
Reproducibility
Every prediction from this benchmark is saved to per-sample JSONL files with timestamps, confidence scores, and latencies. The benchmark script supports:
- Resume capability: re-run skips already-computed samples
- Eval-only mode: recompute metrics from saved predictions without re-running inference
- Structured run directories: full provenance for every experiment
The code, dataset loaders, and results are available at github.com/promptguard.
What This Means
If you're deploying an LLM-powered application and relying on a standalone ML classifier for injection detection, you have a 20% blind spot on adversarial evasion. Base64 encoding and leetspeak — techniques any script kiddie can apply — drop even strong classifiers to 30% and 10% detection respectively.
PromptGuard's multi-layered architecture closes this gap completely. The normalization layer defeats encoding evasion. The regex layer catches known patterns with near-zero latency. The ML layer generalizes to novel attacks. Together, they achieve what no single layer can.
Want to test PromptGuard against your own adversarial prompts? Start free or contact us for enterprise evaluation.
Continue Reading
The LiteLLM Compromise: What a Three-Hour Window Reveals About AI Infrastructure Security
A technical analysis of the LiteLLM supply chain attack - how a compromised security scanner led to credential theft across the AI ecosystem, what the three-stage payload did, and what it means for anyone building on LLM infrastructure.
Read more EngineeringHow Our Multi-Model ML Ensemble Detects Attacks Without Adding Latency
A technical deep dive into how PromptGuard's ensemble of specialized classifiers detects threats — covering parallel inference, weighted voting, category-specific thresholds, confidence calibration, and why multiple small models beat one large one.
Read more EngineeringOne MCP Server to Secure Every AI Tool
PromptGuard's MCP server works with Cursor, Claude, VS Code Copilot, Windsurf, Cline, Roo Code, Continue, Zed, Goose, Lovable, and every other MCP-compatible tool. Here's how one install protects everything.
Read more