
Inside Our 5-Model ML Ensemble: How We Detect Attacks Without Adding Latency

A technical deep dive into how PromptGuard's ensemble of Llama-Prompt-Guard, DeBERTa, ALBERT, toxic-bert, and RoBERTa classifies threats—covering parallel inference, weighted voting, category-specific thresholds, confidence calibration, and why five small models beat one large one.


Most AI security tools are binary: safe or unsafe. They give you a verdict with no explanation, no confidence score, and no way to tune the sensitivity.

We take a different approach. PromptGuard's threat detection runs a 5-model ML ensemble that classifies inputs across multiple threat categories, fuses the results with weighted voting, applies category-specific thresholds, and returns a calibrated confidence score that actually means something.

This post is a deep technical dive into how the ensemble works—the models, the fusion logic, the calibration, and the design decisions behind each component. If you care about how AI security actually works under the hood, this is for you.

The Five Models

Each model in the ensemble is specialized for a specific threat surface. We didn't pick five random models—we selected models with complementary training data and threat coverage so their blind spots don't overlap.

1. Llama-Prompt-Guard-2-86M (Weight: 1.5x)

Specialization: Prompt injection and jailbreak detection
Provider: Meta
Size: 86M parameters

Why it's here: This is Meta's purpose-built injection classifier. It was trained on a curated dataset of prompt injection attacks and benign prompts, with specific attention to semantic evasion techniques (roleplay, encoding, hypothetical framing). It's the most reliable single model for injection detection in our testing.

Why the 1.5x weight: It's the gold standard for its specialty. When it fires with high confidence on injection, it's almost never wrong.

2. DeBERTa-v3-base-prompt-injection-v2 (Weight: 1.0x)

Specialization: Prompt injection patterns
Provider: ProtectAI
Size: ~86M parameters (DeBERTa-v3-base)

Why it's here: Different training data from Llama-Prompt-Guard. ProtectAI trained this model on their own corpus of injection attacks, including attack patterns from the security research community. Having two injection models with different training data means attacks that fool one model are likely caught by the other.

Label mapping: Returns INJECTION, JAILBREAK, or LABEL_1 for positive detections.

3. ALBERT-moderation-001 (Weight: 1.3x)

Specialization: Multi-label content moderation
Provider: OxyAPI
Size: ~11M parameters (ALBERT is designed to be small)

Why it's here: Covers the S1-S11 safety categories that the injection-focused models miss: sexual content, harassment, self-harm, violence/graphic, hate/threatening. It's the broadest coverage model in the ensemble.

Label mapping: S=sexual, H=hate, V=violence, HR=harassment, SH=self_harm, S3=sexual_minors, H2=hate_threatening, V2=violence_graphic.

Why the 1.3x weight: Its multi-label output gives us the most granular threat categorization.

4. toxic-bert (Weight: 1.0x)

Specialization: Toxicity baseline
Provider: Unitary AI
Size: ~110M parameters (BERT-base)

Why it's here: Trained on the Jigsaw Toxic Comment Classification dataset—one of the most well-studied toxicity datasets in the ML community. It's our baseline: if toxic-bert doesn't flag something as toxic, we have strong evidence it's benign. If it does flag it, we compare against the other models.

Role: Primarily acts as a confirming signal. Its 1.0x weight means it doesn't dominate decisions but contributes to majority voting.

5. RoBERTa-hate-speech-dynabench-r4-target (Weight: 1.1x)

Specialization: Adversarially robust hate speech detection
Provider: Facebook Research
Size: ~125M parameters (RoBERTa-base)

Why it's here: This model was trained on the Dynabench dataset, where the training data was generated through an adversarial process—humans actively tried to create examples that would fool the model, and the model was retrained on its failures. This makes it significantly more robust against evasion attempts than models trained on static datasets.

Why it matters: Hate speech evolves constantly. New slurs, coded language, and dogwhistles appear weekly. An adversarially-trained model adapts better to novel patterns than one trained on a fixed vocabulary.

Parallel Inference

All five models run simultaneously using a ThreadPoolExecutor(max_workers=5). This means the total wall-clock time for the ensemble is the latency of the slowest model, not the sum.

Sequential execution: Model1(30ms) + Model2(35ms) + Model3(25ms)
                    + Model4(30ms) + Model5(40ms) = 160ms

Parallel execution:   max(30ms, 35ms, 25ms, 30ms, 40ms) = 40ms

In practice, the inference latency varies based on the HuggingFace Inference API's load, but typical parallel execution completes in 100-140ms for all five models.
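The fan-out itself is a few lines. Here's a minimal sketch, with `classify` as a stand-in stub for the real per-model inference call (the stub's return shape is illustrative, and the ALBERT model identifier is an assumption; the others follow the article):

```python
from concurrent.futures import ThreadPoolExecutor

MODEL_IDS = [
    "meta-llama/Llama-Prompt-Guard-2-86M",
    "protectai/deberta-v3-base-prompt-injection-v2",
    "oxyapi/albert-moderation-001",  # assumed identifier
    "unitary/toxic-bert",
    "facebook/roberta-hate-speech-dynabench-r4-target",
]

def classify(model_id: str, text: str) -> dict:
    # Stand-in for a real inference call (e.g. an HTTP request to the
    # HuggingFace Inference API). Returns one score per model.
    return {"model": model_id, "score": 0.1}

def run_ensemble(text: str) -> list[dict]:
    # One worker per model: wall-clock time is bounded by the slowest
    # call, not the sum of all five.
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(classify, m, text) for m in MODEL_IDS]
        return [f.result() for f in futures]
```

Because `f.result()` blocks until each future completes, the list comprehension returns only after all five calls finish, preserving the result order regardless of which model finishes first.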

The Fusion Algorithm

Running five models gives you five scores. The hard part is combining them into a single, actionable decision. We don't just average—we use a multi-rule fusion algorithm that evaluates five decision rules in order.

Rule 1: High Injection Consensus

IF llama_injection_score >= 0.70 AND deberta_injection_score >= 0.70:
    BLOCK with confidence = weighted_average(llama, deberta)

When both injection specialists agree with high confidence, we block immediately. These two models were trained specifically for injection, and when they both fire, the probability of a false positive is extremely low.

Rule 2: Category-Specific Thresholds

Different threat categories have different risk profiles. We use lower thresholds for categories where the cost of a false negative is catastrophic:

CATEGORY_THRESHOLDS = {
    "self_harm":      0.25,   # Err on the side of caution
    "sexual_minors":  0.25,   # Zero tolerance
    "drugs":          0.30,
    "violence":       0.30,
    "weapons":        0.30,
    "hate_speech":    0.40,
    "harassment":     0.45,
    "sexual":         0.45,
    "general":        0.50,   # Higher bar for generic toxicity
}

A content moderation score of 0.30 for "self-harm" triggers a block. The same score for "general toxicity" does not. This asymmetry reflects the real-world cost asymmetry of different types of failures.

Each threshold is further adjusted by the project's strictness preset:

  • Strict: -0.10 (more aggressive)
  • Moderate: no change
  • Permissive: +0.10 (more lenient)
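Threshold lookup plus preset adjustment can be sketched as a small helper. This is a sketch under stated assumptions: the 0.05 floor clamp and the fallback to the "general" threshold for unknown categories are not from the article:

```python
CATEGORY_THRESHOLDS = {
    "self_harm": 0.25, "sexual_minors": 0.25, "drugs": 0.30,
    "violence": 0.30, "weapons": 0.30, "hate_speech": 0.40,
    "harassment": 0.45, "sexual": 0.45, "general": 0.50,
}

# Per-project strictness presets from the article.
STRICTNESS_OFFSET = {"strict": -0.10, "moderate": 0.0, "permissive": +0.10}

def effective_threshold(category: str, preset: str) -> float:
    # Unknown categories fall back to the "general" bar (an assumption).
    base = CATEGORY_THRESHOLDS.get(category, CATEGORY_THRESHOLDS["general"])
    # Clamp so a strict preset can never drive a threshold to zero.
    return max(0.05, min(1.0, base + STRICTNESS_OFFSET[preset]))
```

Under this sketch, a strict project blocks self-harm content at 0.15, while a permissive project lets generic toxicity pass until 0.60.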

Rule 3: Majority Vote

IF (models_flagging / total_models) > 0.50:
    BLOCK with confidence = weighted_average(flagging_models)

If more than half of the models flag the content—even if no individual model has high confidence—we treat the consensus as a strong signal. This catches cases where the threat is subtle enough that no single model is certain, but the pattern is clear enough that most models detect something.

Rule 4: Single-Model High Confidence

IF any_model_score >= 0.85 AND category in HIGH_RISK_CATEGORIES:
    BLOCK with confidence = that_model_score

If any single model has very high confidence (0.85+) for a high-risk category, we don't wait for consensus. This catches specialist detections—cases where one model's training data gives it unique insight that the others lack.

Rule 5: Weighted Aggregate

weighted_score = sum(model_score * model_weight) / sum(weights)
IF weighted_score >= adaptive_threshold:
    BLOCK with confidence = weighted_score

The fallback rule. If none of the above rules triggered, we compute a weighted average and compare it against an adaptive threshold.
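The five rules compose into one ordered decision function. The sketch below is a simplification, not the production implementation: the 0.50 per-model flagging cutoff in Rule 3, the high-risk category set, and the function shape are all assumptions layered on the rules above:

```python
WEIGHTS = {
    "llama_prompt_guard": 1.5, "deberta_injection": 1.0,
    "albert_moderation": 1.3, "toxic_bert": 1.0, "roberta_hate": 1.1,
}
HIGH_RISK = {"self_harm", "sexual_minors", "drugs", "violence", "weapons"}

def weighted_average(scores: dict, names: list) -> float:
    total = sum(WEIGHTS[n] for n in names)
    return sum(scores[n] * WEIGHTS[n] for n in names) / total

def fuse(scores: dict, top_category: str, category_threshold: float,
         adaptive_threshold: float = 0.50) -> tuple[bool, float]:
    # Rule 1: both injection specialists agree with high confidence.
    if scores["llama_prompt_guard"] >= 0.70 and scores["deberta_injection"] >= 0.70:
        return True, weighted_average(scores, ["llama_prompt_guard", "deberta_injection"])
    # Rule 2: category-specific threshold on the moderation score.
    if scores["albert_moderation"] >= category_threshold:
        return True, scores["albert_moderation"]
    # Rule 3: simple majority of models flagging (cutoff 0.50 is an assumption).
    flagging = [n for n, s in scores.items() if s >= 0.50]
    if len(flagging) / len(scores) > 0.50:
        return True, weighted_average(scores, flagging)
    # Rule 4: one very confident specialist in a high-risk category.
    if top_category in HIGH_RISK and max(scores.values()) >= 0.85:
        return True, max(scores.values())
    # Rule 5: weighted aggregate fallback.
    agg = weighted_average(scores, list(scores))
    return agg >= adaptive_threshold, agg
```

Evaluating the rules in this order means the cheapest, most reliable signals (the two injection specialists agreeing) short-circuit before the aggregate fallback is ever computed.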

Confidence Calibration

Raw model scores are not probabilities. A DeBERTa score of 0.7 does not mean "70% chance of injection." The relationship between raw scores and true probabilities varies by model, by category, and by the distribution of your traffic.

We solve this with Platt scaling: a learned sigmoid transformation.

import math

def calibrate(raw_score: float, a: float, b: float) -> float:
    # Platt scaling: map a raw model score to a calibrated probability.
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))

Each model has its own a (scale) and b (bias) parameters:

CALIBRATION_PARAMS = {
    "protectai/deberta-v3-base-prompt-injection-v2": {"a": 1.2, "b": -0.3},
    "unitary/toxic-bert": {"a": 1.0, "b": -0.1},
    "facebook/roberta-hate-speech-dynabench-r4-target": {"a": 1.1, "b": -0.2},
}

After calibration, our confidence scores are true probabilities: when we report 0.90 confidence, approximately 90% of prompts with that score are actual threats.

Weekly Recalibration

Calibration parameters drift as attack patterns evolve and user traffic changes. Our weekly maintenance job recalibrates using production feedback data:

  1. Collect all false positive and false negative reports since last calibration
  2. Compute per-model error rates
  3. Nudge a and b based on the error distribution:
    • More false negatives → increase sensitivity (nudge a up)
    • More false positives → decrease sensitivity (nudge a down)
  4. Clamp parameters to safe ranges (a ∈ [0.3, 3.0], b ∈ [-1.0, 1.0])

This creates a continuous improvement loop: user feedback → calibration adjustment → better precision → less user feedback needed.
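The nudge-and-clamp step can be sketched as follows. The clamp ranges come from the text; the fixed step size of 0.05 is an assumption for illustration:

```python
def recalibrate(a: float, b: float,
                false_negatives: int, false_positives: int,
                step: float = 0.05) -> tuple[float, float]:
    # Nudge the Platt-scale slope toward the dominant error type.
    if false_negatives > false_positives:
        a += step   # under-detecting: increase sensitivity
    elif false_positives > false_negatives:
        a -= step   # over-blocking: decrease sensitivity
    # Clamp to the safe ranges: a in [0.3, 3.0], b in [-1.0, 1.0].
    a = max(0.3, min(3.0, a))
    b = max(-1.0, min(1.0, b))
    return a, b
```

The clamp is what keeps a run of bad feedback weeks from pushing a model's calibration into a regime where it either flags everything or nothing.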

Pattern Boosting for Coverage Gaps

ML models have training data gaps. If a category is underrepresented in the training data, the model will underdetect it. We address this with pattern boosting: regex patterns that boost confidence scores for specific underrepresented categories.

For example, self-harm content may be underrepresented in toxicity training datasets (due to responsible data collection practices). We maintain regex patterns for self-harm indicators and boost the confidence score when they match:

PATTERN_BOOSTS = {
    "self_harm": {"boost": 0.85, "patterns": [...]},
    "drugs":     {"boost": 0.90, "patterns": [...]},
    "violence":  {"boost": 0.90, "patterns": [...]},
}

We also apply academic context reduction: if the prompt contains markers of academic or research context ("study," "research," "analysis"), we reduce the pattern boost. A prompt about "the epidemiology of self-harm in adolescents" should not be treated the same as a prompt instructing on self-harm methods.
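The boost-then-reduce flow looks roughly like this. The regex pattern, the academic marker list, and the 0.5 reduction factor are all illustrative placeholders, not the production values:

```python
import re

PATTERN_BOOSTS = {
    # Placeholder pattern, not the production set.
    "self_harm": {"boost": 0.85, "patterns": [r"\bhow to harm myself\b"]},
}
# Markers of academic or research framing (illustrative list).
ACADEMIC_MARKERS = re.compile(r"\b(study|research|analysis|epidemiology)\b", re.I)

def apply_boost(text: str, category: str, ml_score: float) -> float:
    entry = PATTERN_BOOSTS.get(category)
    if not entry or not any(re.search(p, text, re.I) for p in entry["patterns"]):
        return ml_score  # no pattern match: keep the raw ensemble score
    boost = entry["boost"]
    if ACADEMIC_MARKERS.search(text):
        boost *= 0.5  # academic context reduction (factor is an assumption)
    # The boost raises the score to a floor; it never lowers it.
    return max(ml_score, boost)
```

Note the `max`: boosting only ever raises a score, so a model that already fired with high confidence is unaffected by the pattern layer.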

Graceful Degradation

ML inference is inherently unreliable. APIs time out, models return unexpected outputs, rate limits activate. Our policy: ML failures never block requests.

If the HuggingFace API returns an error, the ML ensemble returns "no detection" and the request proceeds with regex-only protection. We log the failure for monitoring, but the user's request is not interrupted.

This fail-open design is a deliberate choice. A security system that takes down your application whenever its ML models have a bad moment is worse than no security at all.
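The fail-open wrapper amounts to a try/except around the ensemble call. Function names and the result shape here are illustrative, and the stub simulates an API outage:

```python
import logging

logger = logging.getLogger("promptguard.ml")

def run_ensemble(text: str) -> dict:
    # Stand-in for the real 5-model inference; here it simulates an outage.
    raise TimeoutError("inference API timed out")

def ml_detect(text: str) -> dict:
    # Fail-open: ML errors never block the request. When "degraded" is
    # True, the caller falls back to regex-only protection.
    try:
        return run_ensemble(text)
    except Exception:
        logger.exception("ML ensemble failed; degrading to regex-only")
        return {"detected": False, "confidence": 0.0, "degraded": True}
```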

Why Five Small Models, Not One Large One

This is the question we get most often. Here's the case:

Diversity beats depth. Five models trained on different datasets with different architectures have different failure modes. When one model misses an attack, another catches it. When you average across five diverse models, the errors cancel out and the signal amplifies.

Specialization beats generalization. A single model that's "decent" at injection, toxicity, hate speech, and content moderation will always be outperformed by five models that are each excellent at one thing.

Failure isolation. If one model's API has a hiccup, the other four still work. We degrade from 5-model to 4-model detection, not from detection to no detection.

Cost efficiency. Five small models (86M-125M parameters each) running in parallel on CPU are dramatically cheaper than one 70B+ parameter model running on GPU. And for classification tasks, they're more accurate.

The Agentic Evaluator: The Optional Third Layer

For the borderline cases—prompts where the ensemble confidence is between 0.4 and 0.8—we optionally escalate to a larger language model for contextual reasoning.

We integrate with IBM Granite Guardian (8B params, with 5B fallback) and Meta Llama Guard (8B params). These models can reason about context in ways that classifiers can't: "Is this a creative writing request or an actual threat?" "Is this academic research or social engineering?"

The agentic evaluator only runs when:

  1. At least one detector flagged the content
  2. Confidence is in the "unsure" zone
  3. There are custom policies with exceptions that might apply

This affects <1% of traffic, so the latency cost (~200ms extra) is negligible in aggregate.
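The gate can be sketched as a single predicate, assuming all three conditions must hold (the 0.4-0.8 band comes from the text; the function shape is illustrative):

```python
def should_escalate(flagged_by: list[str], confidence: float,
                    has_custom_policies: bool) -> bool:
    # Escalate to the agentic evaluator only when a detector fired,
    # the ensemble is unsure, and a custom policy exception might apply.
    return bool(flagged_by
                and 0.4 <= confidence <= 0.8
                and has_custom_policies)
```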

Conclusion

The PromptGuard ML ensemble is not magic. It's five well-chosen classifiers, running in parallel, combined with calibrated fusion logic and continuous improvement from production feedback.

The architecture reflects a simple belief: security is a precision problem, not an intelligence problem. You don't need a model that can write poetry to detect that someone is trying to override system instructions. You need five models that are very good at saying "this looks suspicious" and a fusion algorithm that's very good at combining their opinions.

Every decision is fully explainable. You can see exactly why a request was blocked, which models flagged it, and what confidence each model reported. Enterprise customers who self-host get complete access to the model configurations, fusion logic, calibration parameters, and threshold definitions.

That's the point. Security that you can't audit isn't security. It's faith.