
We Benchmarked Our Detection Engine Against 2,369 Samples from 7 Peer-Reviewed Datasets. Here Are the Results.

PromptGuard's multi-layered detection achieves F1 = 0.887 with 99.1% precision across TensorTrust (ICLR 2024), In-the-Wild Jailbreaks (ACM CCS 2024), JailbreakBench (NeurIPS 2024), XSTest (NAACL 2024), and more — with 100% evasion robustness where standalone classifiers achieve only 80%.



Most AI security vendors tell you their tool "catches prompt injection." Few publish numbers. Fewer still test against independent, peer-reviewed datasets where they don't control the data.

We ran PromptGuard's full detection pipeline against 2,369 samples from seven independent datasets - including benchmarks published at NeurIPS, ACM CCS, NAACL, and ICLR. We compared it against a leading open-source ML classifier deployed standalone, without our pipeline. Every prediction is saved per-sample to JSONL for full reproducibility.

Here's what we found.

The Setup

Three configurations tested on identical data:

  • PromptGuard Full: our production pipeline (adversarial text normalization → regex pattern matching → ML classification → content safety classification → multi-turn intent drift detection → policy evaluation)
  • Standalone ML classifier: A leading open-source injection classifier called directly, no preprocessing
  • PromptGuard Regex-Only: Pattern matching only, no ML (baseline)
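As a rough illustration, the full-pipeline configuration behaves like a chain of short-circuiting checks: normalize first, then let a cheap high-precision layer decide before falling back to ML. Everything below (the pattern list, the stand-in normalizer and classifier, the 0.5 threshold) is a hypothetical simplification, not the production code:

```python
import re

# Illustrative pattern list; a real rule set would be far larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def normalize(text: str) -> str:
    # Stand-in for adversarial text normalization (decoding, homoglyph
    # folding, zero-width stripping).
    return text.strip()

def ml_score(text: str) -> float:
    # Stand-in for an ML classifier that returns P(attack).
    return 0.9 if "jailbreak" in text.lower() else 0.1

def detect(text: str, threshold: float = 0.5) -> bool:
    text = normalize(text)
    # Cheap, high-precision regex layer runs first and can short-circuit.
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return True
    # The ML layer generalizes to attacks the patterns miss.
    return ml_score(text) >= threshold
```

The ordering matters: known patterns are decided at regex cost, and only ambiguous inputs pay for ML inference.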

Seven datasets, zero cherry-picking:

Dataset                     Source         Samples   What it tests
TensorTrust                 ICLR 2024          500   Human-generated injection attacks from an adversarial game
In-the-Wild Jailbreaks      ACM CCS 2024       500   Real jailbreak prompts from Reddit and Discord
deepset/prompt-injections   Community          500   Mixed injection + benign prompts
JailbreakBench              NeurIPS 2024       200   Harmful behavior requests + benign baselines
XSTest                      NAACL 2024         250   Safe prompts that look dangerous (false positive test)
Benign Corpus               Internal           298   Business, educational, and conversational prompts
Evasion Suite               Internal           100   10 attack seeds × 10 adversarial encoding techniques
Total                                        2,369   1,378 attack + 991 benign

Aggregate Results

Approach                   F1      95% CI           Precision   Recall   FPR
PromptGuard Full           0.887   [0.874, 0.900]   99.1%       80.3%    1.01%
Standalone ML classifier   0.850   [0.834, 0.864]   99.5%       74.2%    0.50%
Regex-Only (baseline)      0.527   [0.498, 0.554]   99.0%       35.9%    0.50%

The confidence intervals don't overlap. The pipeline improvement over the standalone classifier is statistically significant at the 95% level.

What does this mean in practice? Out of every 100 attacks, PromptGuard catches roughly six more than the standalone model (80.3% vs. 74.2% recall), while maintaining 99.1% precision (almost zero false alarms).
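To make the arithmetic behind those headline numbers concrete, here is F1 computed from confusion-matrix counts. The counts below are back-solved approximations from the reported precision and recall over 1,378 attack samples, not values exported from the benchmark run:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Approximate counts consistent with the reported aggregates:
# full pipeline ~1,107 true positives, 10 false positives, 271 misses;
# standalone classifier ~1,023 true positives, 5 false positives, 355 misses.
full_f1 = f1(1107, 10, 271)       # ~0.887
standalone_f1 = f1(1023, 5, 355)  # ~0.850
```

Note how the standalone classifier's slightly better precision cannot offset its six-point recall deficit in the harmonic mean.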

Where the Pipeline Wins: Evasion Robustness

This is the result that matters most for production security. Attackers don't send clean English injection prompts - they encode them.

We took 10 canonical injection prompts and applied 10 evasion techniques to each:
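Several of these techniques are mechanical enough to sketch in a few lines. The specific leetspeak mapping and benign-prefix wording below are assumptions for illustration; the actual suite may differ:

```python
import base64

LEET = str.maketrans("aeiost", "43105+")  # one common leetspeak mapping

def evasion_variants(seed: str) -> dict:
    """Apply a handful of the encoding techniques to one attack seed."""
    return {
        "base64": base64.b64encode(seed.encode()).decode(),
        "leetspeak": seed.translate(LEET),
        "reversal": seed[::-1],
        "zero_width": "\u200b".join(seed),  # zero-width space between chars
        "benign_prefix": "Can you summarize this for me? " + seed,
    }
```

Each variant preserves the attack's semantics for the target LLM while changing its surface form, which is exactly what defeats classifiers trained on clean text.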

Evasion Technique      PromptGuard Full   Standalone ML classifier
Base64 encoding        10/10 (100%)       3/10 (30%)
Leetspeak              10/10 (100%)       1/10 (10%)
Text reversal          10/10 (100%)       7/10 (70%)
Unicode homoglyphs     10/10 (100%)       9/10 (90%)
Zero-width chars       10/10 (100%)       10/10 (100%)
Case alternation       10/10 (100%)       10/10 (100%)
Whitespace injection   10/10 (100%)       10/10 (100%)
Markdown wrapping      10/10 (100%)       10/10 (100%)
XML wrapping           10/10 (100%)       10/10 (100%)
Benign prefix          10/10 (100%)       10/10 (100%)
Total                  100/100 (100%)     80/100 (80%)

Standalone ML classifiers — even strong ones trained on clean text — fail catastrophically on base64 (30%) and leetspeak (10%). These are trivial encoding techniques that any motivated attacker will try first.

PromptGuard's adversarial text normalization layer strips the encoding before the text reaches the ML model, restoring its full detection capability. The model doesn't need to learn every encoding - the normalization layer handles it.
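A minimal sketch of what such a normalization layer might do, assuming NFKC folding for homoglyphs, a whole-message base64 heuristic, and a small leetspeak map. All three are illustrative choices, not PromptGuard's implementation (the leet step is deliberately naive and would also rewrite legitimate digits):

```python
import base64
import re
import unicodedata

ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])
UNLEET = str.maketrans("43105+", "aeiost")  # inverse of a common leet map

def normalize(text: str) -> str:
    # Remove zero-width characters used to split tokens apart.
    text = text.translate(ZERO_WIDTH)
    # Fold many Unicode homoglyphs down to their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # If the whole message looks like base64, try to decode it.
    candidate = text.strip()
    if re.fullmatch(r"[A-Za-z0-9+/]{16,}={0,2}", candidate):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
            if decoded.isprintable():
                text = decoded
        except (ValueError, UnicodeDecodeError):
            pass  # not valid base64 after all; keep the original text
    # Undo simple leetspeak substitutions.
    return text.translate(UNLEET)
```

The key property is that normalization runs before classification, so the downstream model only ever sees the canonical form of the attack.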

Per-Dataset Breakdown

Dataset                      PG Full F1   Standalone ML F1   Delta
TensorTrust (ICLR 2024)      0.992        0.992              0.0%
In-the-Wild (ACM CCS 2024)   0.902        0.841              +7.3%
Internal Red Team            1.000        0.865              +15.6%
Evasion Suite                1.000        0.889              +12.5%
deepset/prompt-injections    0.639        0.612              +4.3%

On in-the-wild jailbreak prompts — the real adversarial distribution from Reddit and Discord — PromptGuard outperforms the standalone classifier by 7.3%. On our internal red team vectors, the gap is 15.6%.

The largest gains come from the normalization layer catching encoded attacks that the raw model can't parse.

What About False Positives?

We tested against 250 safe prompts from XSTest (NAACL 2024) - prompts that deliberately use language similar to unsafe content (homonyms, figurative language, safe contexts). A poorly calibrated detector would flag these.

  • PromptGuard Full: 0.4% FPR (1 out of 250)
  • Standalone ML classifier: 0.0% FPR
  • Benign corpus (298 prompts): 0.0% FPR for all detectors

99.6% of safe-but-tricky prompts pass through correctly. No legitimate business prompts are blocked.

The Ablation: What Each Layer Contributes

Configuration                       F1      Recall   Precision
Regex only (baseline)               0.527   35.9%    99.0%
Full pipeline (norm + regex + ML)   0.887   80.3%    99.1%

Regex catches known patterns with 99% precision but only 36% recall. Adding ML more than doubles recall while maintaining precision. The normalization layer closes the evasion gap completely.

Reproducibility

Every prediction from this benchmark is saved to per-sample JSONL files with timestamps, confidence scores, and latencies. The benchmark script supports:

  • Resume capability: re-run skips already-computed samples
  • Eval-only mode: recompute metrics from saved predictions without re-running inference
  • Structured run directories: full provenance for every experiment

The code, dataset loaders, and results are available at github.com/promptguard.

What This Means

If you're deploying an LLM-powered application and relying on a standalone ML classifier for injection detection, you have a 20% blind spot on adversarial evasion. Base64 encoding and leetspeak — techniques any script kiddie can apply — drop even strong classifiers to 30% and 10% detection respectively.

PromptGuard's multi-layered architecture closes this gap completely. The normalization layer defeats encoding evasion. The regex layer catches known patterns with near-zero latency. The ML layer generalizes to novel attacks. Together, they achieve what no single layer can.


Want to test PromptGuard against your own adversarial prompts? Start free or contact us for enterprise evaluation.