How Our Multi-Model ML Ensemble Detects Attacks
Most AI security tools are binary: safe or unsafe. They give you a verdict with no explanation, no confidence score, and no way to tune the sensitivity.
We take a different approach. PromptGuard's threat detection runs a multi-model ML ensemble that classifies inputs across multiple threat categories, fuses the results with weighted voting, applies category-specific thresholds, and returns a calibrated confidence score that actually means something.
This post is a deep technical dive into how the ensemble works — the design philosophy, the fusion logic, the calibration, and the decisions behind each component.
The Ensemble: Specialized Classifiers with Complementary Coverage
Each model in the ensemble is specialized for a specific threat surface. We didn't pick models at random — we selected classifiers with complementary training data and threat coverage so their blind spots don't overlap.
The ensemble includes:
- Injection specialists — purpose-built classifiers trained on prompt injection and jailbreak datasets. We run multiple injection models trained on different corpora, so attacks that fool one model are likely caught by another.
- Content moderation classifier — a multi-label model covering safety categories including sexual content, harassment, self-harm, violence, and hate speech. Its multi-label output gives us the most granular threat categorization.
- Toxicity baseline — a model trained on one of the most well-studied toxicity datasets in the ML community. It acts primarily as a confirming signal: if the toxicity model doesn't flag something, we have strong evidence it's benign.
- Adversarially-trained hate speech model — trained through an adversarial process in which humans actively tried to craft examples that fool the model, which was then retrained on its failures. This makes it significantly more robust against evasion attempts than models trained on static datasets.
Each model in the ensemble is compact (under 150M parameters) and runs on CPU infrastructure. The models carry different weights in the fusion algorithm based on their specialization — injection specialists carry higher weight for injection decisions, content moderation models carry higher weight for safety decisions.
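As a rough illustration of how specialization-based weighting can work, here is a minimal sketch. The model names, categories, and weight values are hypothetical placeholders, not PromptGuard's actual configuration:

```python
# Hypothetical per-category model weights: injection specialists dominate
# injection decisions, moderation models dominate safety decisions.
MODEL_WEIGHTS = {
    "injection": {"injection_a": 0.35, "injection_b": 0.35,
                  "moderation": 0.15, "toxicity": 0.10, "hate": 0.05},
    "safety":    {"moderation": 0.40, "hate": 0.25,
                  "toxicity": 0.20, "injection_a": 0.10, "injection_b": 0.05},
}

def weighted_score(category: str, scores: dict[str, float]) -> float:
    """Combine per-model raw scores using the weights for this category."""
    weights = MODEL_WEIGHTS[category]
    return sum(weights[m] * scores.get(m, 0.0) for m in weights)
```

The key design property is that each category's weights sum to 1.0, so the aggregate stays on the same scale as the individual model scores.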
Parallel Inference
All models in the ensemble run simultaneously. This means the total wall-clock time is the latency of the slowest model, not the sum.
```
Sequential execution: Model1(30ms) + Model2(35ms) + Model3(25ms) + ... = 150ms+
Parallel execution:   max(30ms, 35ms, 25ms, ...) = 35ms
```

In practice, inference latency varies with API load, but parallel execution keeps the ensemble overhead well within our latency budget.
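A minimal sketch of this fan-out pattern using Python's standard thread pool. The model names and sleep-based latencies are stand-ins for real inference calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_model(name: str, latency_s: float) -> tuple[str, float]:
    """Stand-in for one classifier call; sleep simulates inference latency."""
    time.sleep(latency_s)
    return name, 0.5  # (model name, raw score)

def run_ensemble(models: dict[str, float]) -> dict[str, float]:
    """Fan out all model calls in parallel and gather their scores."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(run_model, n, t) for n, t in models.items()]
        return dict(f.result() for f in futures)

start = time.perf_counter()
scores = run_ensemble({"injection": 0.030, "moderation": 0.035, "toxicity": 0.025})
elapsed = time.perf_counter() - start  # ~max latency (35ms), not the 90ms sum
```

Threads work here because the workload is I/O-bound (waiting on inference APIs), so the GIL is not a bottleneck.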
The Fusion Algorithm
Running multiple models gives you multiple scores. The hard part is combining them into a single, actionable decision. We don't just average — we use a multi-rule fusion algorithm that evaluates a prioritized chain of decision rules.
The rules are evaluated in order, and the first rule that triggers produces the final decision:
1. Specialist consensus. When multiple injection-specialized models agree with high confidence, we block immediately. These models were trained specifically for injection, and when they agree, the probability of a false positive is extremely low.
2. Category-specific thresholds. Different threat categories have different risk profiles. We use lower thresholds for categories where the cost of a false negative is catastrophic (self-harm, child safety) and higher thresholds for categories where over-blocking is more harmful (general toxicity). This asymmetry reflects the real-world cost asymmetry of different types of failures. Each threshold is further adjusted by the project's strictness preset — strict presets lower thresholds, permissive presets raise them.
3. Majority vote. If more than half of the models flag the content — even if no individual model has high confidence — we treat the consensus as a strong signal. This catches cases where the threat is subtle enough that no single model is certain, but the pattern is clear enough that most models detect something.
4. High-confidence specialist override. If any single model has very high confidence for a high-risk category, we don't wait for consensus. This catches specialist detections — cases where one model's training data gives it unique insight that the others lack.
5. Weighted aggregate. The fallback rule. If none of the above rules triggered, we compute a weighted average using the model weights and compare it against an adaptive threshold.
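A first-rule-wins chain like this is straightforward to express as code. The sketch below is illustrative only — the thresholds, the consensus counts, and the uniform-weight fallback are simplifying assumptions, not the production values:

```python
def fuse(scores: dict[str, float], specialists: set[str],
         threshold: float) -> tuple[bool, str]:
    """Evaluate fusion rules in priority order; first trigger wins.

    scores: calibrated score per model name
    specialists: names of injection-specialized models
    threshold: category-specific (preset-adjusted) threshold
    """
    spec = [scores[m] for m in specialists if m in scores]
    # 1. Specialist consensus: two or more specialists agree with high confidence.
    if len([s for s in spec if s > 0.8]) >= 2:
        return True, "specialist_consensus"
    # 2. Category-specific threshold on the strongest signal.
    if max(scores.values()) > threshold:
        return True, "category_threshold"
    # 3. Majority vote: more than half of all models flag the input.
    if sum(s > 0.5 for s in scores.values()) > len(scores) / 2:
        return True, "majority_vote"
    # 4. High-confidence specialist override.
    if any(s > 0.95 for s in spec):
        return True, "specialist_override"
    # 5. Weighted aggregate fallback (uniform weights here for brevity).
    avg = sum(scores.values()) / len(scores)
    return avg > threshold, "weighted_aggregate"
```

Returning the rule name alongside the verdict is what makes each decision auditable: you can log exactly which rule fired and why.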
Confidence Calibration
Raw model scores are not probabilities. A model that outputs 0.7 is not saying "there's a 70% chance this is an attack." The relationship between raw scores and true probabilities varies by model, by category, and by the distribution of your traffic.
We solve this with Platt scaling: a learned sigmoid transformation that converts raw model outputs into calibrated probabilities. Each model has its own calibration parameters, tuned from production data.
After calibration, our confidence scores are true probabilities: when we report 0.90 confidence, approximately 90% of prompts with that score are actual threats.
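Platt scaling itself is a small amount of math: a sigmoid with two learned parameters per model. The parameter values below are illustrative, not fitted:

```python
import math

def platt(raw: float, a: float = -4.0, b: float = 2.0) -> float:
    """Map a raw model score to a calibrated probability.

    calibrated = 1 / (1 + exp(a * raw + b)); (a, b) are learned per model
    by fitting against labeled production outcomes.
    """
    return 1.0 / (1.0 + math.exp(a * raw + b))
```

With these example parameters, a raw score of 1.0 maps to roughly 0.88, and the mapping is monotonic, so calibration reshapes the scores without reordering them.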
Continuous Recalibration
Calibration parameters drift as attack patterns evolve and user traffic changes. We run a weekly automated recalibration process that:
- Collects all false positive and false negative reports since the last calibration
- Computes per-model error rates
- Adjusts calibration parameters conservatively based on the error distribution
- Validates changes against production traffic in shadow mode before deployment
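"Conservatively" here can mean damping: moving parameters only a fraction of the way toward the newly fitted values, so a single noisy week of feedback cannot swing behavior. A minimal sketch — the damping factor is an assumed illustrative value:

```python
def conservative_update(current: dict[str, float],
                        fitted: dict[str, float],
                        damping: float = 0.2) -> dict[str, float]:
    """Move each calibration parameter a fraction of the way to its
    freshly fitted value, limiting the impact of any one recalibration."""
    return {k: current[k] + damping * (fitted[k] - current[k]) for k in current}
```

Shadow-mode validation then acts as the safety net: the damped parameters score live traffic without affecting decisions until the change is confirmed safe.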
This creates a continuous improvement loop: user feedback drives calibration adjustment, which improves precision, which reduces the volume of feedback needed.
Pattern Boosting for Coverage Gaps
ML models have training data gaps. If a category is underrepresented in training data, the model will underdetect it. We address this with pattern boosting: deterministic patterns that boost confidence scores for specific underrepresented categories.
For example, self-harm content may be underrepresented in toxicity training datasets (due to responsible data collection practices). We maintain patterns for known indicators and boost the ML confidence score when they match.
We also apply academic context reduction: if the prompt contains markers of academic or research context ("study," "research," "analysis"), we reduce the pattern boost. A prompt about "the epidemiology of self-harm in adolescents" should not be treated the same as a prompt instructing on self-harm methods.
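The interplay of boosting and context reduction can be sketched in a few lines. The regex patterns, boost size, and reduction factor below are hypothetical placeholders, not the actual rule set:

```python
import re

# Illustrative indicator patterns for an underrepresented category.
BOOST_PATTERNS = [re.compile(r"\bhow to (harm|hurt) (myself|yourself)\b", re.I)]
ACADEMIC_MARKERS = re.compile(r"\b(study|research|analysis|epidemiology)\b", re.I)

def boosted_score(prompt: str, ml_score: float,
                  boost: float = 0.3, academic_factor: float = 0.25) -> float:
    """Raise the ML confidence when a known indicator matches, but shrink
    the boost if the prompt carries academic/research context markers."""
    if any(p.search(prompt) for p in BOOST_PATTERNS):
        if ACADEMIC_MARKERS.search(prompt):
            boost *= academic_factor  # academic context reduction
        ml_score = min(1.0, ml_score + boost)
    return ml_score
```

Because the boost is additive on top of the ML score rather than a hard block, a strong benign signal from the ensemble still dominates a weak pattern match.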
Graceful Degradation
ML inference depends on infrastructure that fails in practice: APIs time out, models return unexpected outputs, rate limits kick in. Our policy: ML failures never block requests.
If the ML inference layer returns an error, the ensemble returns "no detection" and the request proceeds with regex-only protection. We log the failure for monitoring, but the user's request is not interrupted.
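The fail-open boundary amounts to a wrapper like this; `run_ensemble` here is a hypothetical callable standing in for the ML inference layer:

```python
import logging

logger = logging.getLogger("promptguard")

def safe_detect(run_ensemble, prompt: str) -> dict:
    """Fail-open wrapper: any inference error degrades to 'no detection'
    so the request proceeds under regex-only protection."""
    try:
        return run_ensemble(prompt)
    except Exception as exc:
        logger.warning("ML inference failed, degrading to regex-only: %s", exc)
        return {"detected": False, "degraded": True}
```

The `degraded` flag is what feeds monitoring: the failure is visible to operators without ever being visible to the end user.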
This fail-open design is a deliberate choice. A security system that takes down your application whenever its ML models have a bad moment is worse than no security at all.
Why Multiple Small Models, Not One Large One
This is the question we get most often. Here's the case:
Diversity beats depth. Multiple models trained on different datasets with different architectures have different failure modes. When one model misses an attack, another catches it. When you aggregate across diverse models, the errors cancel out and the signal amplifies.
Specialization beats generalization. A single model that's "decent" at injection, toxicity, hate speech, and content moderation will always be outperformed by multiple models that are each excellent at one thing.
Failure isolation. If one model has a hiccup, the others still work. We degrade gracefully instead of failing completely.
Cost efficiency. Multiple small models (under 150M parameters each) running in parallel are dramatically cheaper than one 70B+ parameter model running on GPU. And for classification tasks, they're more accurate.
The Agentic Evaluator: The Optional Third Layer
For the borderline cases — prompts where the ensemble confidence falls in an ambiguous range — we optionally escalate to a larger language model for contextual reasoning.
These models can reason about context in ways that classifiers can't: "Is this a creative writing request or an actual threat?" "Is this academic research or social engineering?"
The agentic evaluator only runs when:
- At least one detector flagged the content
- Confidence is in the "unsure" zone
- There are custom policies with exceptions that might apply
This affects a very small fraction of traffic, so the latency cost is negligible in aggregate.
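The gating conditions above reduce to a small predicate. This sketch assumes all three conditions are conjunctive and uses illustrative boundaries for the "unsure" band:

```python
def should_escalate(flagged: bool, confidence: float,
                    has_custom_policies: bool,
                    low: float = 0.4, high: float = 0.7) -> bool:
    """Escalate to the agentic evaluator only for flagged, ambiguous
    cases where custom policy exceptions might change the outcome."""
    return flagged and low <= confidence <= high and has_custom_policies
```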
Conclusion
The PromptGuard ML ensemble is not magic. It's multiple well-chosen classifiers, running in parallel, combined with calibrated fusion logic and continuous improvement from production feedback.
The architecture reflects a simple belief: security is a precision problem, not an intelligence problem. You don't need a model that can write poetry to detect that someone is trying to override system instructions. You need multiple models that are very good at saying "this looks suspicious" and a fusion algorithm that's very good at combining their opinions.
Every decision is fully explainable. You can see exactly why a request was blocked, which detectors flagged it, and what confidence each reported.
That's the point. Security that you can't audit isn't security. It's faith.