Engineering

How Our Multi-Model ML Ensemble Detects Attacks Without Adding Latency

A technical deep dive into how PromptGuard's ensemble of specialized classifiers detects threats — covering parallel inference, weighted voting, category-specific thresholds, confidence calibration, and why multiple small models beat one large one.

PromptGuard
4 min read
ML · Architecture · Security

Most AI security tools are binary: safe or unsafe. They give you a verdict with no explanation, no confidence score, and no way to tune the sensitivity.

We take a different approach. PromptGuard's threat detection runs a multi-model ML ensemble that classifies inputs across multiple threat categories, fuses the results with weighted voting, applies category-specific thresholds, and returns a calibrated confidence score that actually means something.

This post is a deep technical dive into how the ensemble works — the design philosophy, the fusion logic, the calibration, and the decisions behind each component.

The Ensemble: Specialized Classifiers with Complementary Coverage

Each model in the ensemble is specialized for a specific threat surface. We didn't pick models at random — we selected classifiers with complementary training data and threat coverage so their blind spots don't overlap.

The ensemble includes:

  • Injection specialists — purpose-built classifiers trained specifically on prompt injection and jailbreak datasets. We run multiple injection models trained on different corpora, so attacks that slip past one model are likely caught by the others.

  • Content moderation classifier — a multi-label model covering safety categories including sexual content, harassment, self-harm, violence, and hate speech. Its multi-label output gives us the most granular threat categorization.

  • Toxicity baseline — a model trained on one of the most well-studied toxicity datasets in the ML community. It acts primarily as a confirming signal: if the toxicity model doesn't flag something, we have strong evidence it's benign.

  • Adversarially-trained hate speech model — this model was trained through an adversarial process where humans actively tried to create examples that would fool the model, and the model was retrained on its failures. This makes it significantly more robust against evasion attempts than models trained on static datasets.

Each model in the ensemble is compact (under 150M parameters) and runs on CPU infrastructure. The models carry different weights in the fusion algorithm based on their specialization — injection specialists carry higher weight for injection decisions, content moderation models carry higher weight for safety decisions.
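To make the weighting concrete, here is a minimal sketch of how per-model fusion weights might be structured. The model names, specialties, and weight values are illustrative assumptions, not PromptGuard's actual configuration.

```python
# Hypothetical ensemble configuration: each model carries a different
# fusion weight depending on the decision category being evaluated.
from dataclasses import dataclass

@dataclass(frozen=True)
class EnsembleModel:
    name: str
    specialties: tuple[str, ...]   # threat categories this model targets
    weights: dict[str, float]      # fusion weight per decision category

ENSEMBLE = [
    EnsembleModel("injection-a", ("injection",), {"injection": 0.9, "safety": 0.2}),
    EnsembleModel("injection-b", ("injection",), {"injection": 0.9, "safety": 0.2}),
    EnsembleModel("moderation", ("safety",), {"injection": 0.3, "safety": 0.9}),
    EnsembleModel("toxicity", ("toxicity",), {"injection": 0.1, "safety": 0.5}),
]

def weight_for(model: EnsembleModel, category: str) -> float:
    # Fall back to a low default weight outside the model's specialty.
    return model.weights.get(category, 0.1)
```

The key design property is that weights are a function of both model and category: an injection specialist dominates injection decisions but contributes little to safety ones.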

Parallel Inference

All models in the ensemble run simultaneously. This means the total wall-clock time is the latency of the slowest model, not the sum.

Sequential execution: Model1(30ms) + Model2(35ms) + Model3(25ms) + ... = 150ms+
Parallel execution:   max(30ms, 35ms, 25ms, ...) = 35ms

In practice, the inference latency varies based on API load, but parallel execution keeps the ensemble overhead well within our latency budget.
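The max-not-sum property falls out directly from concurrent execution. Here is a self-contained sketch using asyncio, with `asyncio.sleep` standing in for real inference calls; model names and latencies are illustrative.

```python
# Parallel inference sketch: wall-clock time is roughly max(latencies),
# not their sum, because all coroutines run concurrently.
import asyncio
import time

async def run_model(name: str, latency_s: float) -> tuple[str, float]:
    await asyncio.sleep(latency_s)   # stand-in for an inference API call
    return name, 0.5                 # (model name, raw score)

async def run_ensemble() -> dict[str, float]:
    tasks = [
        run_model("injection-a", 0.030),
        run_model("moderation", 0.035),
        run_model("toxicity", 0.025),
    ]
    results = await asyncio.gather(*tasks)  # all three run concurrently
    return dict(results)

start = time.perf_counter()
scores = asyncio.run(run_ensemble())
elapsed = time.perf_counter() - start  # close to 0.035s, not 0.090s
```

In production the same shape applies whether the concurrency comes from asyncio, threads, or parallel HTTP requests to an inference API.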

The Fusion Algorithm

Running multiple models gives you multiple scores. The hard part is combining them into a single, actionable decision. We don't just average — we use a multi-rule fusion algorithm that evaluates a prioritized chain of decision rules.

The rules are evaluated in order, and the first rule that triggers produces the final decision:

  1. Specialist consensus. When multiple injection-specialized models agree with high confidence, we block immediately. These models were trained specifically for injection, and when they agree, the probability of a false positive is extremely low.

  2. Category-specific thresholds. Different threat categories have different risk profiles. We use lower thresholds for categories where the cost of a false negative is catastrophic (self-harm, child safety) and higher thresholds for categories where over-blocking is more harmful (general toxicity). This asymmetry reflects the real-world cost asymmetry of different types of failures. Each threshold is further adjusted by the project's strictness preset — strict presets lower thresholds, permissive presets raise them.

  3. Majority vote. If more than half of the models flag the content — even if no individual model has high confidence — we treat the consensus as a strong signal. This catches cases where the threat is subtle enough that no single model is certain, but the pattern is clear enough that most models detect something.

  4. High-confidence specialist override. If any single model has very high confidence for a high-risk category, we don't wait for consensus. This catches specialist detections — cases where one model's training data gives it unique insight that the others lack.

  5. Weighted aggregate. The fallback rule. If none of the above rules triggered, we compute a weighted average using the model weights and compare it against an adaptive threshold.
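The rule chain above can be sketched as a single function that returns the first rule to fire. All thresholds, weights, model names, and category names below are hypothetical placeholders, not PromptGuard's production values.

```python
# Illustrative prioritized rule chain: rules are checked in order and the
# first one that triggers decides the verdict.
SPECIALISTS = {"injection-a", "injection-b"}
HIGH_RISK = {"self_harm", "child_safety"}
# Lower thresholds where false negatives are catastrophic, higher where
# over-blocking is more harmful.
CATEGORY_THRESHOLDS = {"self_harm": 0.30, "child_safety": 0.30, "toxicity": 0.70}

def fuse(scores: dict[str, float], category: str,
         weights: dict[str, float]) -> tuple[bool, str]:
    """Return (blocked, rule_name) for one prompt's per-model scores."""
    # Specialist consensus: two or more injection specialists agree.
    if sum(1 for m in SPECIALISTS if scores.get(m, 0.0) >= 0.80) >= 2:
        return True, "specialist_consensus"
    # Category-specific threshold used by the remaining rules.
    threshold = CATEGORY_THRESHOLDS.get(category, 0.50)
    # Majority vote: more than half of the models flag the content.
    if sum(1 for s in scores.values() if s >= threshold) > len(scores) / 2:
        return True, "majority_vote"
    # High-confidence specialist override for high-risk categories.
    if category in HIGH_RISK and max(scores.values()) >= 0.95:
        return True, "specialist_override"
    # Weighted aggregate fallback.
    total = sum(weights.get(m, 0.1) for m in scores)
    agg = sum(s * weights.get(m, 0.1) for m, s in scores.items()) / total
    return agg >= threshold, "weighted_aggregate"
```

Because the chain short-circuits, the cheap high-precision rules (consensus, majority) absorb most decisions, and the weighted aggregate only handles the genuinely ambiguous remainder.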

Confidence Calibration

Raw model scores are not probabilities. A model that outputs 0.7 is not saying "there's a 70% chance this is an attack." The relationship between raw scores and true probabilities varies by model, by category, and by the distribution of your traffic.

We solve this with Platt scaling: a learned sigmoid transformation that converts raw model outputs into calibrated probabilities. Each model has its own calibration parameters, tuned from production data.

After calibration, our confidence scores are true probabilities: when we report 0.90 confidence, approximately 90% of prompts with that score are actual threats.
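A minimal sketch of Platt scaling follows: fit a two-parameter sigmoid over (raw score, label) pairs, then push every future raw score through it. The toy gradient-descent fit and the sample data are illustrative; a production system would use a proper solver and real feedback data.

```python
# Platt scaling sketch: calibrated = sigmoid(a * raw + b), with a and b
# fit by minimizing log-loss on labeled examples via gradient descent.
import math

def platt_fit(raw: list[float], labels: list[int],
              lr: float = 0.1, steps: int = 2000) -> tuple[float, float]:
    a, b = 1.0, 0.0
    n = len(raw)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(raw, labels):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - y) * x / n   # d(log-loss)/da
            grad_b += (p - y) / n       # d(log-loss)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def calibrate(score: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy feedback data: low raw scores were benign, high ones were threats.
a, b = platt_fit([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], [0, 0, 0, 1, 1, 1])
```

Each model gets its own (a, b) pair, which is what lets a score of 0.7 from one model and 0.7 from another map to different calibrated probabilities.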

Continuous Recalibration

Calibration parameters drift as attack patterns evolve and user traffic changes. We run a weekly automated recalibration process that:

  1. Collects all false positive and false negative reports since the last calibration
  2. Computes per-model error rates
  3. Adjusts calibration parameters conservatively based on the error distribution
  4. Validates changes against production traffic in shadow mode before deployment

This creates a continuous improvement loop: user feedback drives calibration adjustment, which improves precision, which reduces the volume of feedback needed.
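Step 3's "conservative adjustment" can be sketched as a small, capped nudge to a calibration parameter based on the observed error imbalance. The step size, cap, and update rule here are hypothetical illustrations of the idea, not the production logic.

```python
# Conservative recalibration sketch: shift the calibration intercept by a
# small, capped step toward whichever error type dominated the period.
def conservative_update(b: float, false_positives: int,
                        false_negatives: int, max_step: float = 0.05) -> float:
    total = false_positives + false_negatives
    if total == 0:
        return b  # no feedback, no change
    # More FNs than FPs -> scores were too cold -> shift intercept up;
    # more FPs than FNs -> too hot -> shift it down.
    imbalance = (false_negatives - false_positives) / total
    step = imbalance * max_step
    return b + max(-max_step, min(max_step, step))
```

Capping the step is what makes the loop safe to automate: a noisy week of feedback can only move the parameters a bounded amount before shadow-mode validation catches problems.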

Pattern Boosting for Coverage Gaps

ML models have training data gaps. If a category is underrepresented in training data, the model will underdetect it. We address this with pattern boosting: deterministic patterns that boost confidence scores for specific underrepresented categories.

For example, self-harm content may be underrepresented in toxicity training datasets (due to responsible data collection practices). We maintain patterns for known indicators and boost the ML confidence score when they match.

We also apply academic context reduction: if the prompt contains markers of academic or research context ("study," "research," "analysis"), we reduce the pattern boost. A prompt about "the epidemiology of self-harm in adolescents" should not be treated the same as a prompt instructing on self-harm methods.
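The two mechanisms compose as a post-processing step on the ML score. The specific patterns, boost sizes, and academic markers below are invented for illustration; real pattern lists are much larger and category-specific.

```python
# Pattern boosting sketch: deterministic patterns raise the ML score for
# underrepresented categories, and academic-context markers shrink the boost.
import re

BOOST_PATTERNS = {
    "self_harm": [re.compile(r"\bhow to (harm|hurt) (myself|yourself)\b", re.I)],
}
ACADEMIC_MARKERS = re.compile(r"\b(study|research|analysis|epidemiology)\b", re.I)

def boosted_score(ml_score: float, category: str, text: str,
                  boost: float = 0.25) -> float:
    patterns = BOOST_PATTERNS.get(category, [])
    if not any(p.search(text) for p in patterns):
        return ml_score                  # no pattern match, score unchanged
    if ACADEMIC_MARKERS.search(text):
        boost *= 0.3                     # academic context reduces the boost
    return min(1.0, ml_score + boost)    # clamp to a valid probability
```

Keeping the boost additive and clamped means patterns can never override a confident ML verdict; they only tip borderline scores past the threshold.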

Graceful Degradation

ML inference is inherently unreliable: APIs time out, models return malformed outputs, rate limits kick in. Our policy: ML failures never block requests.

If the ML inference layer returns an error, the ensemble returns "no detection" and the request proceeds with regex-only protection. We log the failure for monitoring, but the user's request is not interrupted.
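The fail-open behavior reduces to a single wrapper around the inference call. The logger name and verdict fields below are hypothetical, but the shape is the whole pattern.

```python
# Fail-open sketch: any ML-layer failure collapses to "no detection" so
# the request proceeds with regex-only protection; the error is logged.
import logging

logger = logging.getLogger("promptguard.ml")  # hypothetical logger name

def safe_ml_verdict(run_inference, text: str) -> dict:
    try:
        return run_inference(text)
    except Exception as exc:  # timeouts, rate limits, malformed outputs
        logger.warning("ML inference failed, failing open: %s", exc)
        return {"detected": False, "confidence": 0.0, "degraded": True}
```

The `degraded` flag matters: monitoring can alert on a spike of fail-open verdicts even though no individual user request was interrupted.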

This fail-open design is a deliberate choice. A security system that takes down your application whenever its ML models have a bad moment is worse than no security at all.

Why Multiple Small Models, Not One Large One

This is the question we get most often. Here's the case:

Diversity beats depth. Multiple models trained on different datasets with different architectures have different failure modes. When one model misses an attack, another catches it. When you aggregate across diverse models, the errors cancel out and the signal amplifies.

Specialization beats generalization. A single model that's "decent" at injection, toxicity, hate speech, and content moderation will always be outperformed by multiple models that are each excellent at one thing.

Failure isolation. If one model has a hiccup, the others still work. We degrade gracefully instead of failing completely.

Cost efficiency. Multiple small models (under 150M parameters each) running in parallel are dramatically cheaper than one 70B+ parameter model running on GPU. And for classification tasks, they're more accurate.

The Agentic Evaluator: The Optional Third Layer

For the borderline cases — prompts where the ensemble confidence falls in an ambiguous range — we optionally escalate to a larger language model for contextual reasoning.

These models can reason about context in ways that classifiers can't: "Is this a creative writing request or an actual threat?" "Is this academic research or social engineering?"

The agentic evaluator only runs when:

  1. At least one detector flagged the content
  2. Confidence is in the "unsure" zone
  3. There are custom policies with exceptions that might apply

This affects a very small fraction of traffic, so the latency cost is negligible in aggregate.
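The gating logic can be sketched as a small predicate, reading the three conditions as conjunctive. The "unsure" band boundaries and parameter names are illustrative assumptions.

```python
# Escalation gate sketch: the expensive LLM evaluator only runs when a
# detector flagged the content, confidence sits in the ambiguous band,
# and custom policy exceptions might change the verdict.
UNSURE_LOW, UNSURE_HIGH = 0.40, 0.75  # hypothetical "unsure" band

def should_escalate(flagged_count: int, confidence: float,
                    has_policy_exceptions: bool) -> bool:
    return (flagged_count >= 1
            and UNSURE_LOW <= confidence <= UNSURE_HIGH
            and has_policy_exceptions)
```

Because confidently-blocked and confidently-clean traffic both fall outside the band, the LLM's per-call latency is amortized over only the rare ambiguous cases.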

Conclusion

The PromptGuard ML ensemble is not magic. It's multiple well-chosen classifiers, running in parallel, combined with calibrated fusion logic and continuous improvement from production feedback.

The architecture reflects a simple belief: security is a precision problem, not an intelligence problem. You don't need a model that can write poetry to detect that someone is trying to override system instructions. You need multiple models that are very good at saying "this looks suspicious" and a fusion algorithm that's very good at combining their opinions.

Every decision is fully explainable. You can see exactly why a request was blocked, which detectors flagged it, and what confidence each reported.

That's the point. Security that you can't audit isn't security. It's faith.