
False Positives Are the Silent Killer of AI Adoption
In traditional security, there's an old saying: "It's better to block a legitimate user than let a hacker in."
In the AI world, that maxim is wrong.
When you block a legitimate user from using your chatbot, they don't file a bug report. They don't call support. They just think your AI is "broken" or "dumb," and they leave. They tell their friends. They cancel their subscription. The damage is invisible in your security metrics and devastating in your business metrics.
False positives are the silent killer of AI adoption. They don't show up in your incident reports. They show up in your churn rate.
The "Ignore Previous Instructions" Paradox
Let's start with a concrete example of why false positives are so hard to eliminate.
The phrase "ignore previous instructions" is the canonical prompt injection attack. It's in every security training deck, every blog post, every AI safety paper.
It's also a phrase used by:
- Lawyers: "Please ignore previous instructions regarding clause 7.2 and use the updated terms."
- Teachers: "Students should ignore previous instructions for the essay and follow the new rubric."
- Customer support agents: "Ignore previous instructions from the bot—here's how to actually fix your issue."
- Developers debugging AI: "I told it to ignore previous instructions but it keeps following them. Can you help?"
If you regex-block that phrase, you break the app for lawyers, teachers, support agents, and developers. That's four legitimate use cases destroyed to catch one attack pattern.
Now multiply this across hundreds of patterns. "Write a story about hacking." "Explain how to pick a lock." "What chemicals are in household cleaners?" Every one of these is simultaneously a legitimate query and a potential attack. The difference is intent, and intent can't be read from keywords.
Why Keyword Filters Produce Unacceptable False Positive Rates
We tested a keyword-based filter against a sample of production traffic. The results were dismal:
| Metric | Keyword Filter | ML Ensemble |
|---|---|---|
| True Positive Rate | 89% | 97% |
| False Positive Rate | 8.2% | <0.1% |
| User Impact | 1 in 12 users blocked incorrectly | <1 in 1,000 |
An 8.2% false positive rate means roughly 1 in 12 legitimate users will be blocked. For a chatbot handling 10,000 conversations per day, that's 820 frustrated users—every single day. At a 5% conversion rate, that's 41 lost customers daily from your security tool alone.
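The arithmetic above generalizes to any traffic volume. A minimal sketch, using the article's numbers:

```python
def fp_impact(daily_conversations, fp_rate, conversion_rate):
    """Back-of-envelope cost of a false positive rate."""
    blocked = daily_conversations * fp_rate  # legitimate users blocked per day
    lost = blocked * conversion_rate         # would-be conversions lost per day
    return blocked, lost

# 10,000 conversations/day at an 8.2% FP rate and 5% conversion rate
blocked, lost = fp_impact(10_000, 0.082, 0.05)
print(f"{blocked:.0f} users blocked, {lost:.0f} customers lost per day")
```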
The ML ensemble reduces this by roughly 80x. Here's how.
Confidence Calibration: Making Scores Meaningful
The first problem with raw ML model outputs is that they're not probabilities. A model that outputs 0.7 is not saying "there's a 70% chance this is an attack." It's outputting a score on an arbitrary scale that happens to be between 0 and 1.
The mapping between raw scores and actual attack probability varies by:
- Model: DeBERTa and Llama-Prompt-Guard produce very different score distributions for the same inputs.
- Category: Injection scores tend to be more calibrated than toxicity scores.
- Traffic distribution: A model trained on academic datasets will be miscalibrated on production traffic.
We solve this with Platt scaling—a learned sigmoid transformation:
```
calibrated_score = sigmoid(a × raw_score + b)
```

Each model in the ensemble has its own `a` (scale) and `b` (bias) parameters, learned from production data. After calibration, a score of 0.90 means "90% of prompts with this score were actual attacks." This is what makes our confidence header (`X-PromptGuard-Confidence`) trustworthy.
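Platt scaling is a standard technique: fit the two parameters by minimizing log-loss over labeled (raw score, was-it-an-attack) pairs. A minimal self-contained sketch using plain gradient descent (a real pipeline would use a library optimizer; the learning rate and epoch count here are illustrative):

```python
import math

def fit_platt(raw_scores, labels, lr=0.1, epochs=2000):
    """Fit calibrated = sigmoid(a*x + b) to binary labels by minimizing log-loss."""
    a, b = 1.0, 0.0
    n = len(raw_scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(raw_scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - y) * x / n  # d(log-loss)/da
            grad_b += (p - y) / n      # d(log-loss)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def calibrate(raw, a, b):
    return 1.0 / (1.0 + math.exp(-(a * raw + b)))
```

Given feedback-labeled production scores, `fit_platt` learns per-model parameters, and `calibrate` maps any raw score to an actual attack probability.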
Example: Before and After Calibration
| Raw Score | Calibrated Score | Actual Attack Rate |
|---|---|---|
| 0.50 | 0.23 | 21% |
| 0.70 | 0.58 | 55% |
| 0.80 | 0.76 | 74% |
| 0.90 | 0.91 | 89% |
| 0.95 | 0.97 | 96% |
Without calibration, a threshold of 0.50 would seem reasonable—"50% confidence, probably an attack." But the calibrated score reveals that only 21% of prompts at that raw score are actual attacks. You'd be blocking 79% legitimate users. With calibration, you can set thresholds that correspond to actual precision targets.
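Setting a threshold from a precision target is then just inverting the sigmoid. A sketch (the parameters `a=10.0, b=-6.0` are hypothetical, chosen only for illustration):

```python
import math

def raw_threshold_for_precision(target_precision, a, b):
    """Invert calibrated = sigmoid(a*raw + b): return the raw score whose
    calibrated probability equals the target precision."""
    logit = math.log(target_precision / (1.0 - target_precision))
    return (logit - b) / a

# With hypothetical calibration parameters, a 95% precision target
# maps back to a raw-score threshold of roughly 0.89.
t = raw_threshold_for_precision(0.95, a=10.0, b=-6.0)
```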
The Feedback Loop: Continuous Improvement
Calibration parameters aren't set once and forgotten. They drift as attack patterns evolve and as your user base changes. Our solution is a weekly recalibration pipeline driven by user feedback.
How Feedback Works
When PromptGuard makes a mistake, users can report it:
- False positive report: "This was a legitimate prompt. You shouldn't have blocked it."
- False negative report: "This was an attack. You should have caught it."
Each report is stored with the original prompt, the decision, the confidence score, the individual model scores, and the user's correction. This creates a labeled dataset of production edge cases—exactly the data you need to improve calibration.
The Recalibration Process
Every week, a maintenance job runs automatically:
- Collect feedback: Pull all unprocessed feedback entries, grouped by model.
- Calculate error rates: For each model, compute the false positive rate and false negative rate from the feedback.
- Adjust calibration parameters:
  - If false negatives > false positives: nudge `a` up (make the model more sensitive) and `b` down (shift the threshold lower).
  - If false positives > false negatives: nudge `a` down (make the model less sensitive) and `b` up (shift the threshold higher).
- Clamp parameters: `a` is bounded to [0.3, 3.0] and `b` to [-1.0, 1.0] to prevent runaway drift.
- Mark feedback as processed.
The nudge formula is intentionally conservative:

```
a_delta = (fn_rate - fp_rate) × 0.1
b_delta = (fn_rate - fp_rate) × 0.05
```

Small adjustments, every week, driven by real data. This is how you go from 2.4% false positives to under 0.1% without manual tuning.
The Shadow Mode Safety Net
What if a recalibration makes things worse? We catch this with shadow mode.
When testing new calibration parameters, we run the new model configuration alongside the production configuration on live traffic. Both configurations evaluate every request, but only the production configuration's decision is used. The shadow configuration's decision is logged for comparison.
If the shadow configuration produces more disagreements with the production configuration than expected, we don't deploy it. If it produces fewer false positives with no increase in false negatives, we promote it to production.
This A/B testing framework ensures that calibration changes are validated against real traffic before they affect users.
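The core of shadow mode fits in one loop. A minimal sketch, assuming a decision function per configuration and a hypothetical 2% disagreement budget:

```python
def shadow_compare(requests, prod_decide, shadow_decide, max_disagreement=0.02):
    """Run a candidate configuration alongside production on live traffic.
    Only the production decision is acted on; shadow decisions are compared,
    and the candidate is rejected if disagreement exceeds the budget."""
    disagreements = 0
    for req in requests:
        prod = prod_decide(req)      # this decision is what users actually see
        shadow = shadow_decide(req)  # logged for comparison only
        if prod != shadow:
            disagreements += 1
    rate = disagreements / len(requests)
    return rate <= max_disagreement, rate
```

Promotion to production would additionally require the disagreements to skew toward fewer false positives, per the criteria above.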
Category-Specific Thresholds
One-size-fits-all thresholds are another source of false positives.
A toxicity score of 0.40 for "self-harm" content should be treated very differently from a score of 0.40 for "general toxicity." We'd rather over-block on self-harm (the cost of a false negative is catastrophic) and under-block on general toxicity (the cost of a false negative is a rude message).
Our threshold configuration reflects this:
```
self_harm:     0.25   # very aggressive — over-block is acceptable
sexual_minors: 0.25
violence:      0.30
hate_speech:   0.40
harassment:    0.45
sexual:        0.45
general:       0.50   # conservative — avoid over-blocking
```

Each threshold can be further adjusted by the preset's strictness level: strict subtracts 0.10 (more aggressive), permissive adds 0.10 (more lenient).
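Combining the category table with the strictness offset is a simple lookup. A sketch (the `"balanced"` default preset name is an assumption; the numbers are from the config above):

```python
# Category thresholds from the config above.
THRESHOLDS = {
    "self_harm": 0.25, "sexual_minors": 0.25, "violence": 0.30,
    "hate_speech": 0.40, "harassment": 0.45, "sexual": 0.45, "general": 0.50,
}
# Strictness offsets per the text: strict -0.10, permissive +0.10.
STRICTNESS_OFFSET = {"strict": -0.10, "balanced": 0.0, "permissive": 0.10}

def effective_threshold(category, strictness="balanced"):
    """Final blocking threshold for a category under a given preset."""
    return THRESHOLDS[category] + STRICTNESS_OFFSET[strictness]
```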
The Agentic Evaluator: Handling the Truly Ambiguous
Some prompts defeat both regex and ML classification. They exist in a gray zone where the confidence is between 0.4 and 0.8—too high to ignore, too low to block with conviction.
For these borderline cases, we escalate to an agentic evaluator: a larger language model (IBM Granite Guardian or Meta Llama Guard) that can reason about context.
The agentic evaluator runs when all three conditions are met:
- At least one ML model detected a threat
- The confidence is in the "unsure" zone (0.4-0.8)
- There are custom policies with exceptions that might apply
This adds latency (~200ms), but it only runs on <1% of traffic. The tradeoff is worth it: the agentic evaluator catches nuanced cases that classifiers miss, without imposing its latency cost on the 99% of requests that are clearly safe or clearly dangerous.
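The escalation gate is cheap to express. A sketch of the three-condition check (the 0.5 per-model detection cutoff is an assumption, not a documented value):

```python
def should_escalate(model_scores, confidence, has_policy_exceptions):
    """Escalate to the agentic evaluator only when all three conditions hold."""
    threat_detected = any(s > 0.5 for s in model_scores)  # assumed per-model cutoff
    unsure = 0.4 <= confidence <= 0.8                     # the "unsure" zone
    return threat_detected and unsure and has_policy_exceptions
```

Because all three conditions must hold, clearly safe and clearly dangerous requests short-circuit past the evaluator and never pay its latency.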
Measuring What Matters
We track four key metrics for detection quality:
| Metric | What It Measures | Target |
|---|---|---|
| Precision | Of all blocked requests, how many were actual attacks? | >99% |
| Recall | Of all actual attacks, how many did we catch? | >97% |
| False Positive Rate | What percentage of legitimate requests are incorrectly blocked? | <0.1% |
| Latency (p95) | How long does the security check take? | <200ms |
These metrics are computed from the feedback data and exposed in the dashboard. You can see exactly how the system is performing for your specific traffic pattern.
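The three quality metrics follow directly from confusion counts over the labeled feedback. A minimal sketch:

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall, and false positive rate from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of blocked requests, how many were attacks
    recall = tp / (tp + fn)     # of actual attacks, how many were caught
    fpr = fp / (fp + tn)        # of legitimate requests, how many were blocked
    return precision, recall, fpr
```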
Practical Advice
If you're building your own detection system—or evaluating one—here's what we've learned:
1. Log your blocks. If you aren't reviewing what you block, you have no idea who you're annoying. Every false positive is a user experience failure.
2. Calibrate your scores. Raw model outputs are not probabilities. If you're setting thresholds on uncalibrated scores, you're guessing.
3. Build a feedback mechanism. Users will tell you when you're wrong, if you give them a way to do it. That feedback is the most valuable data you'll ever collect for improving detection quality.
4. Use category-specific thresholds. "Self-harm" and "general toxicity" have very different costs of being wrong. Treat them differently.
5. Test changes in shadow mode. Never deploy a model change to production without validating it against real traffic first.
6. Prefer precision over recall. A missed attack is bad. A blocked user is worse. You can always increase sensitivity later, but you can't un-frustrate a user you blocked incorrectly.
The goal isn't zero false positives—that's impossible without also missing every attack. The goal is a false positive rate so low that it disappears into the noise of your application's normal operations. Under 0.1% is where it stops being a user experience problem and becomes a rounding error.
That's where we are. And we keep pushing it lower every week.