
Why We Don't Use LLMs to Secure LLMs

Using GPT-4 to check if a prompt is safe doubles your latency and your bill. Here's why we bet on a 5-model classical ML ensemble, and how it outperforms single-model approaches at a fraction of the cost.


There is a popular architecture for AI security that goes like this:

  1. User sends a prompt.
  2. Your middleware sends the prompt to GPT-4 with "Is this prompt safe?"
  3. GPT-4 thinks for 500ms and responds "Yes."
  4. Your middleware finally sends the prompt to your actual model.

This architecture is dead on arrival for any application that cares about latency, cost, or reliability.

The Math That Kills It

Let's do the arithmetic that most "AI security" vendors skip.

Your LLM call: ~500ms time to first token. Their security LLM call: ~500ms. Total user-perceived latency: 1,000ms+.

You just doubled the wait time for every request. For a voice agent, a coding copilot, or any real-time application, that's a non-starter.

But latency isn't even the worst part. Cost is.

If you're processing 100,000 prompts per month (a modest production workload), and each security check consumes ~500 input tokens at GPT-4 rates ($30/1M input tokens), you're paying an extra $1,500/month just to ask "is this safe?" That's before your actual LLM usage.
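The arithmetic above fits in a few lines (the figures are the illustrative numbers from this section, not a quote):

```python
# Back-of-envelope cost of an LLM-based "is this safe?" check.
PROMPTS_PER_MONTH = 100_000   # a modest production workload
TOKENS_PER_CHECK = 500        # input tokens per security check
PRICE_PER_M_INPUT = 30.00     # $/1M input tokens at GPT-4 rates

monthly_cost = PROMPTS_PER_MONTH * TOKENS_PER_CHECK / 1_000_000 * PRICE_PER_M_INPUT
print(f"${monthly_cost:,.0f}/month")  # $1,500/month, before any actual LLM usage
```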

And then there's reliability. Your security layer is now a single point of failure that depends on the same infrastructure it's supposed to protect. If OpenAI has a bad day, both your security and your application go down simultaneously.

The Insight That Changed Our Architecture

We realized something that seems obvious in retrospect: security classification is a fundamentally different problem from language generation.

LLMs are incredible at generating coherent text, reasoning through complex problems, and handling ambiguous instructions. But "Is this prompt trying to manipulate the model?" is not an ambiguous question. It's a classification problem. And classification problems have been solved efficiently for decades.

The key insight: you don't need a 70-billion-parameter model to detect that someone is trying to override system instructions. You need a well-trained classifier with the right architecture.

Our 5-Model Ensemble Architecture

Instead of one massive LLM, we run five specialized classifiers in parallel. Each model is an expert at detecting a specific category of threat, and together they cover a surface area that no single model can match.

The Models

| Model | Specialization | Weight | Why It's There |
| --- | --- | --- | --- |
| Llama-Prompt-Guard-2-86M | Prompt injection, jailbreaks | 1.5x | Meta's purpose-built injection classifier. Tiny (86M params) but surgically precise. |
| DeBERTa-v3-base-prompt-injection-v2 | Injection patterns | 1.0x | ProtectAI's fine-tuned DeBERTa. Different training data catches different attack surfaces. |
| albert-moderation-001 | Content moderation | 1.3x | Multi-label moderation with S1-S11 safety categories (violence, self-harm, sexual content, etc.). |
| toxic-bert | Toxicity detection | 1.0x | Unitary's toxic content classifier. Fast baseline for obvious toxicity. |
| roberta-hate-speech-dynabench-r4-target | Hate speech | 1.1x | Facebook's adversarially-trained hate speech model. Robust against evasion attempts. |

These five models run in parallel using a ThreadPoolExecutor. The total wall-clock time is the time of the slowest model, not the sum.
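The fan-out is straightforward. Here is a minimal sketch: the scorers below are stand-ins that simulate per-model latency (real calls would hit the HuggingFace Inference API), and the names are shorthand for the models in the table above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in scorers that simulate inference latency and return a fixed score.
def make_scorer(latency_s: float, score: float):
    def scorer(text: str) -> float:
        time.sleep(latency_s)  # simulated network + inference time
        return score
    return scorer

MODELS = {
    "llama_guard":       make_scorer(0.10, 0.02),
    "deberta":           make_scorer(0.12, 0.05),
    "albert_moderation": make_scorer(0.14, 0.01),
    "toxic_bert":        make_scorer(0.08, 0.03),
    "roberta_hate":      make_scorer(0.11, 0.02),
}

def score_all(text: str) -> dict[str, float]:
    """Fan out to all five classifiers; wall-clock time ~= the slowest model."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in MODELS.items()}
        return {name: f.result() for name, f in futures.items()}

start = time.perf_counter()
scores = score_all("hello world")
elapsed = time.perf_counter() - start  # close to the 0.14s max, not the 0.55s sum
```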

Fused Scoring: Where the Magic Happens

Running five models is easy. The hard part is combining their outputs into a single, calibrated decision.

We don't just average the scores. We use a fused scoring system with five decision rules, evaluated in order:

Rule 1: High Injection Consensus. If both injection-specialized models (Llama + DeBERTa) agree above 0.70 confidence, we block. These models were trained specifically for this threat, and when they agree, they're almost never wrong.

Rule 2: Category-Specific Thresholds. Not all threats are equal. We use lower thresholds for high-risk categories:

self_harm:      0.25  (we'd rather over-block than miss this)
sexual_minors:  0.25
violence:       0.30
hate_speech:    0.40
harassment:     0.45
general:        0.50

A score of 0.30 means very different things for "self-harm" versus "general toxicity." Our thresholds reflect that reality.

Rule 3: Majority Vote. If more than 50% of the models flag the content, we treat that as a strong signal even if no single model is highly confident. Wisdom of crowds beats individual certainty.

Rule 4: High-Risk Single Model. If any single model exceeds 0.85 confidence for a high-risk category, we don't wait for consensus. This catches edge cases where one specialist model sees something the others miss.

Rule 5: Weighted Aggregate. If none of the above rules trigger, we compute a weighted average using the model weights. This is the "soft" path for borderline cases.
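The five rules above can be sketched as a single ordered decision function. The data shapes here are assumptions for illustration: each model is assumed to return a score, a top category, and its own binary verdict, and Rule 2 is assumed to apply to the moderation model's per-category scores. Only the thresholds and weights come from this article.

```python
from typing import NamedTuple

# Thresholds and weights as given in this section.
THRESHOLDS = {"self_harm": 0.25, "sexual_minors": 0.25, "violence": 0.30,
              "hate_speech": 0.40, "harassment": 0.45, "general": 0.50}
WEIGHTS = {"llama_guard": 1.5, "deberta": 1.0, "albert_moderation": 1.3,
           "toxic_bert": 1.0, "roberta_hate": 1.1}
HIGH_RISK = {"self_harm", "sexual_minors", "violence"}

class ModelOutput(NamedTuple):
    score: float    # raw confidence in [0, 1]
    category: str   # the model's top predicted category
    flagged: bool   # the model's own binary unsafe verdict

def fused_decision(outputs: dict[str, ModelOutput],
                   moderation_categories: dict[str, float]) -> bool:
    """Evaluate the five fused-scoring rules in order; True means block."""
    # Rule 1: both injection specialists agree above 0.70.
    if all(m in outputs and outputs[m].score > 0.70
           for m in ("llama_guard", "deberta")):
        return True
    # Rule 2: category-specific thresholds on the moderation labels.
    if any(score >= THRESHOLDS.get(cat, THRESHOLDS["general"])
           for cat, score in moderation_categories.items()):
        return True
    # Rule 3: majority vote across the models' own verdicts.
    if sum(o.flagged for o in outputs.values()) > len(outputs) / 2:
        return True
    # Rule 4: one model very confident (> 0.85) on a high-risk category.
    if any(o.score > 0.85 and o.category in HIGH_RISK for o in outputs.values()):
        return True
    # Rule 5: weighted aggregate as the soft path for borderline cases.
    total = sum(WEIGHTS.get(m, 1.0) for m in outputs)
    agg = sum(WEIGHTS.get(m, 1.0) * o.score for m, o in outputs.items()) / total
    return agg >= THRESHOLDS["general"]
```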

Confidence Calibration

Raw model scores are not probabilities. A model that outputs 0.7 is not saying "there's a 70% chance this is an attack." The mapping between raw scores and actual probabilities varies by model, by category, and by the distribution of your traffic.

We solve this with Platt scaling: a learned sigmoid transformation that converts raw scores into calibrated probabilities.

calibrated_score = sigmoid(a * raw_score + b)

Each model has its own a and b parameters, tuned from production feedback data. We recalibrate weekly using a maintenance job that processes all user-submitted corrections (false positives and false negatives) from the past period.
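In code, the calibration step is a one-liner per model. The parameter values below are placeholders for illustration, not production values:

```python
import math

def platt_calibrate(raw_score: float, a: float, b: float) -> float:
    """Map a raw model score to a calibrated probability via Platt scaling."""
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))

# Hypothetical per-model (a, b) parameters, as would be fit from feedback data.
PLATT_PARAMS = {"toxic_bert": (4.2, -2.1), "deberta": (3.5, -1.8)}

calibrated = platt_calibrate(0.7, *PLATT_PARAMS["toxic_bert"])
```

Fitting `a` and `b` is a logistic regression on one feature (the raw score) against the corrected labels, which is why weekly recalibration from feedback data is cheap.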

This means our confidence scores actually mean something. When we say "0.95 confidence," we mean it.

The Regex Baseline: Defense in Depth

The ML ensemble is powerful, but it's not our only line of defense. Every request also passes through a deterministic regex layer that catches the obvious attacks instantly—no model inference needed.

We maintain 13 injection patterns (instruction overrides, role manipulation, mode switching, delimiter injection), 17 exfiltration patterns, 10 API key patterns, 7 fraud patterns, and 8 malware patterns.

The regex layer serves three purposes:

  1. Speed. Pattern matching takes microseconds. For the 30-40% of attacks that use known patterns, we skip ML entirely.
  2. Reliability. Regex never has a bad day. It doesn't depend on an API, it doesn't hallucinate, and it doesn't degrade under load.
  3. Explainability. When regex catches something, we can tell you exactly which pattern matched at which character index. Try getting that from a neural network.
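A minimal version of this layer looks as follows. The two patterns are illustrative examples of the injection category named above, not the actual production rule set:

```python
import re

# Illustrative instruction-override and mode-switching patterns.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now (in )?(developer|dan) mode", re.IGNORECASE),
]

def regex_scan(text: str):
    """Return (pattern, character index) for the first match, else None.

    The character index is what makes this layer explainable: we can say
    exactly which pattern matched and where.
    """
    for pat in INJECTION_PATTERNS:
        m = pat.search(text)
        if m:
            return pat.pattern, m.start()
    return None
```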

Why Not Just Use One Big Model?

This is the question we get most often. "Why five small models instead of one big one?"

Diversity beats depth. Each model was trained on different data with different objectives. Llama-Prompt-Guard was trained specifically on injection attacks. Toxic-bert was trained on toxic comment datasets. RoBERTa-hate-speech was trained adversarially—people actively tried to fool it during training. These models have different blind spots, and when you ensemble them, the blind spots don't overlap.

Specialization beats generalization. A single model that's "pretty good" at five tasks will always lose to five models that are each excellent at one task. This is why we weight the models: injection specialists get higher weight for injection decisions, toxicity specialists get higher weight for toxicity decisions.

Failure isolation. If one model's API has an outage, the other four still work. We degrade gracefully instead of failing completely. In contrast, a single-model architecture is binary: it works, or it doesn't.

The Agentic Evaluator: When Classifiers Aren't Enough

There's a class of prompts that classifiers struggle with—the genuinely ambiguous ones.

"Write a story where a character explains how to pick a lock." Is that creative writing or dangerous instruction? The answer depends on context that no classifier can fully capture.

For these borderline cases (confidence between 0.4 and 0.8), we optionally escalate to an agentic evaluator: a larger model (IBM Granite Guardian or Meta Llama Guard) that can reason about context. This evaluator doesn't run on every request—it runs on fewer than 1% of requests that fall in the "unsure" zone.

This is the one place where we use a larger model for security. But critically, it's asynchronous and optional. Your request doesn't wait for it. The fast path returns a decision immediately, and the agentic evaluator runs in the background for audit purposes.

Performance: Honest Numbers

We don't claim sub-10ms latency because that would be a lie.

Our detection pipeline adds approximately 150ms of overhead to each request. That breaks down roughly as:

  • Regex layer: <1ms
  • ML ensemble (5 models in parallel): ~100-140ms (network latency to HuggingFace Inference API)
  • Fused scoring + calibration: <1ms
  • Policy evaluation + logging: ~10ms

Is 150ms fast? Compared to a GPT-4 security check (500ms+), yes. Compared to doing nothing, no. But doing nothing isn't an option if you care about security.

The key metric isn't raw latency—it's amortized latency. With our detection cache (exact-match, SHA-256 keyed, 1-hour TTL), repeat prompts return cached results in <1ms. In production workloads where users ask similar questions, cache hit rates of 20-30% are common. That brings the amortized overhead well below 150ms.
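A cache with those properties (exact-match, SHA-256 keyed, 1-hour TTL) is a small amount of code; this sketch omits eviction of expired entries for brevity:

```python
import hashlib
import time

class DetectionCache:
    """Exact-match cache keyed by SHA-256 of the prompt, with a TTL."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        """Return the cached result, or None on a miss or expired entry."""
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        stored_at, result = entry
        if time.monotonic() - stored_at > self.ttl:
            return None  # expired
        return result

    def put(self, prompt: str, result) -> None:
        self._store[self._key(prompt)] = (time.monotonic(), result)
```

Hashing the prompt keeps the cache key a fixed size and avoids storing raw user text as dictionary keys.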

Conclusion

You don't fight fire with fire. You don't secure LLMs with more LLMs.

The economics are clear: five specialized classifiers running in parallel will always beat a single general-purpose LLM on cost, latency, reliability, and—when properly ensembled—accuracy.

The next time someone pitches you an AI security tool that "uses GPT-4 to analyze your prompts," ask them three questions: What's the latency? What's the cost per request? And what happens when GPT-4 goes down?

If they don't have good answers, you're looking at a demo, not a product.