
Inside PromptGuard's Architecture: How We Built a Production AI Firewall
When we started building PromptGuard, we had one constraint that shaped every decision: security overhead must be invisible enough that developers don't rip it out.
The industry's existing AI security tools fell into two camps. The first: "Send your prompt to GPT-4 and ask if it's safe." Too slow, too expensive, and hilariously recursive. The second: "Block any message containing 'ignore previous instructions.'" Too dumb, too many false positives, and bypassed in five minutes by anyone with a thesaurus.
We needed something in between—fast enough to sit in the critical path, smart enough to catch semantic attacks, and transparent enough that developers could debug it.
Here's how we built it.
The Request Lifecycle
Every request that flows through PromptGuard follows the same pipeline. Understanding this pipeline is key to understanding our architectural decisions.
User App → PromptGuard Proxy → [Security Pipeline] → LLM Provider → [Output Scan] → User App
Step 1: Request Intake and DoS Protection
Before we even parse the JSON, we enforce hard limits:
- Max request body: 1 MB. Anything larger gets a 413 immediately. This prevents memory exhaustion attacks where someone sends a 500MB payload.
- Max prompt length: 100,000 characters (~25K tokens). This is generous for any legitimate use case but blocks the "repeat this word 10 million times" class of token-burn attacks.
These are dumb checks. They're supposed to be. The expensive security logic shouldn't waste cycles on payloads that are obviously adversarial at the transport level.
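The intake checks above can be sketched in a few lines. This is an illustrative stand-in, not PromptGuard's actual code; the constant names and the function signature are assumptions.

```python
# Illustrative intake guard: cheap, transport-level limits that run before
# any expensive security logic. Names and signature are hypothetical.
MAX_BODY_BYTES = 1 * 1024 * 1024   # 1 MB request body cap
MAX_PROMPT_CHARS = 100_000         # ~25K tokens

def check_intake(body: bytes, prompt: str) -> tuple[bool, int]:
    """Return (allowed, http_status)."""
    if len(body) > MAX_BODY_BYTES:
        return (False, 413)        # 413 Payload Too Large, before JSON parsing
    if len(prompt) > MAX_PROMPT_CHARS:
        return (False, 413)        # blocks token-burn payloads
    return (True, 200)
```

The point of keeping these as plain length comparisons is that they cost nanoseconds, so adversarially large payloads never reach the detectors.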
Step 2: The Security Engine (7 Detectors)
Once a request passes intake validation, it enters the security engine. This is where the real work happens.
We run seven independent detectors, each specialized for a different threat type:
| Detector | Threat Type | Mechanism | Available On |
|---|---|---|---|
| InjectionDetector | Prompt injection, jailbreaks | Regex patterns + ML ensemble (5 models) | All tiers |
| PIIDetector | Personal data leaks | 14 regex patterns + Luhn validation | All tiers |
| ExfiltrationDetector | Data theft attempts | 17 regex patterns + context analysis | Paid tiers |
| ToxicityDetector | Hate speech, violence, self-harm | Regex patterns + ML ensemble | Paid tiers |
| APIKeyDetector | Leaked credentials | 10 regex patterns (OpenAI, AWS, GitHub, etc.) | Paid tiers |
| FraudDetector | Social engineering, scams | 7 regex patterns | Paid tiers |
| MalwareDetector | Destructive commands, reverse shells | 8 regex patterns | Paid tiers |
The evaluation order matters. We run the most common and most dangerous checks first (injection, exfiltration, fraud, malware), then the "redact rather than block" checks (PII, API keys), then toxicity (which requires the full ML ensemble).
Step 3: The Three-Decision Model
Every detector returns the same interface: (is_detected, reason, confidence). The security engine combines these into one of three decisions:
- ALLOW: No threats detected. Forward the request to the LLM.
- BLOCK: Threat detected with high confidence. Return a 403 with an explanation.
- REDACT: Sensitive data detected, but the request is otherwise safe. Replace PII with tokens like [EMAIL_REDACTED] and forward the sanitized version to the LLM.
The REDACT decision is crucial. Most security tools only know ALLOW and BLOCK. But if a user says "My SSN is 123-45-6789, can you help me with my tax return?", the intent is legitimate—they just shouldn't be sending their SSN to an LLM. We strip the SSN, the LLM sees [SSN_REDACTED], and the user gets their answer without exposing sensitive data.
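The detector interface and the three-decision combination can be sketched as follows. The combination logic here is a simplified stand-in (the detector names, threshold, and which detectors redact rather than block are assumptions for illustration):

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REDACT = "redact"

@dataclass
class DetectorResult:
    # Every detector returns the same (is_detected, reason, confidence) shape.
    is_detected: bool
    reason: str
    confidence: float

# Detectors whose findings are stripped rather than blocked (illustrative).
REDACTING = {"pii", "api_key"}

def combine(results: dict[str, DetectorResult],
            block_threshold: float = 0.8) -> Decision:
    """Toy combination: a high-confidence blocking detector wins outright;
    otherwise sensitive-data hits downgrade to REDACT."""
    decision = Decision.ALLOW
    for name, r in results.items():
        if not r.is_detected:
            continue
        if name in REDACTING:
            decision = Decision.REDACT
        elif r.confidence >= block_threshold:
            return Decision.BLOCK
    return decision
```

In the SSN example above, only the PIIDetector fires, so the combined decision is REDACT rather than BLOCK.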
Step 4: Policy Evaluation
Raw detector outputs feed into the Policy Engine, which applies business logic on top of security signals.
The Policy Engine supports three layers of rules:
Preset Configurations. We ship six use-case templates:
| Preset | Optimized For | Custom Patterns |
|---|---|---|
| support_bot | Customer-facing bots | Blocks password/credit card sharing patterns |
| code_assistant | Developer tools | Blocks API keys, allows code-heavy content |
| rag_system | Document Q&A | Blocks confidential/proprietary data extraction |
| creative_writing | Content generation | Relaxed thresholds for creative language |
| data_analysis | Data processing | Blocks SSN/DOB patterns, limits external access |
| default | General purpose | Balanced thresholds across all categories |
Each preset can be combined with a strictness level (strict, moderate, permissive) that adjusts detection thresholds across all detectors simultaneously.
Custom Rules. Users can define regex-based rules for their specific domain. A fintech company might add a rule to block messages containing account numbers in their proprietary format. A healthcare company might flag messages referencing specific drug names.
Custom Policies. The most powerful layer. Condition-based policies that can filter inputs, filter outputs, or transform content based on arbitrary conditions. Policies have priority ordering—higher-priority policies execute first, and a policy can override lower-priority ones.
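Priority-ordered policy evaluation can be sketched like this. The Policy shape and the first-match-wins rule are illustrative assumptions, not PromptGuard's exact semantics:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Policy:
    name: str
    priority: int                     # higher-priority policies execute first
    condition: Callable[[str], bool]  # does this policy apply to the text?
    action: str                       # e.g. "block" or "allow"

def evaluate_policies(text: str, policies: list) -> Optional[str]:
    """Illustrative evaluation: the highest-priority matching policy wins,
    overriding any lower-priority matches."""
    for policy in sorted(policies, key=lambda p: p.priority, reverse=True):
        if policy.condition(text):
            return policy.action
    return None  # no policy matched; fall through to the detector decision
```

A fintech custom rule like the account-number example above would be one such Policy whose condition is a regex match against the proprietary format.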
Step 5: Smart Routing and Provider Failover
If the request passes security checks, we forward it to the LLM provider. But we don't just blindly proxy to OpenAI.
Our SmartRouter maintains a health model of multiple LLM providers (OpenAI, Anthropic, Gemini, Mistral, Groq, Azure OpenAI) and selects the optimal provider based on:
- Health status: Circuit breaker tracking of recent errors and latencies.
- Cost: Different providers have different token pricing.
- Capability: Model availability varies by provider.
If the primary provider fails, we automatically retry with up to two alternative providers. The user's request doesn't fail just because OpenAI is having a bad five minutes.
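The failover loop reduces to something like the sketch below, assuming the health-ranked provider order has already been computed. The callable-per-provider interface is a simplification of the real SmartRouter:

```python
class ProviderError(Exception):
    """Stand-in for any provider-side failure (timeout, 5xx, rate limit)."""

def complete_with_failover(prompt: str, providers: list,
                           max_failovers: int = 2) -> str:
    """Try the primary provider, then up to `max_failovers` alternatives,
    mirroring the two-retry behavior described above."""
    last_error = None
    for provider in providers[: max_failovers + 1]:
        try:
            return provider(prompt)
        except ProviderError as exc:
            last_error = exc          # record failure, try the next provider
    raise last_error
```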
Step 6: Output Scanning
The LLM's response passes through a second security scan before reaching the user. This output scan runs PII detection, API key detection, toxicity checks, and any output-filter policies.
Why scan outputs? Because even with a clean input, the LLM can hallucinate sensitive data, echo back PII from its training data, or generate toxic content in response to an innocuous question. The output firewall catches these cases.
For streaming responses, we scan in real-time as chunks arrive. If we detect a threat mid-stream, we can cut the stream immediately.
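Mid-stream cut-off can be sketched as a generator that scans a growing buffer as chunks arrive. The `is_threat` callable here stands in for the output detectors, and the termination message is an assumption:

```python
from typing import Iterable, Iterator

def scan_stream(chunks: Iterable, is_threat) -> Iterator:
    """Yield chunks until the accumulated output trips a detector, then
    terminate the stream immediately. Illustrative only."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if is_threat(buffer):
            # Cut the stream the moment a threat appears in the buffer.
            yield "[stream terminated by output scan]"
            return
        yield chunk
```

Scanning the cumulative buffer rather than individual chunks matters: a secret split across two chunks would evade any per-chunk check.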
Step 7: Logging and Alerting
Every decision is logged with full context:
- The event ID (returned to the caller as X-PromptGuard-Event-ID)
- The decision and confidence (returned as X-PromptGuard-Decision and X-PromptGuard-Confidence)
- The threat type, if any (returned as X-PromptGuard-Threat-Type)
- A truncated content preview (max 500 characters, for privacy)
If the project has zero retention mode enabled, the content preview is omitted entirely. We log the decision but not the data.
If the decision is BLOCK, we fire alerts through two channels:
- Email alerts to the project owner (configurable: all, critical, or none)
- Webhook alerts to a configured URL (Slack-compatible JSON payload)
The ML Pipeline
The most complex component is the ML ensemble, so it deserves its own section.
Why HuggingFace Inference API
We use HuggingFace's hosted Inference API for ML inference rather than running models locally. This was a deliberate decision:
For the hosted product: Running five GPU-accelerated models per request would require expensive dedicated GPU infrastructure. The HuggingFace API gives us access to optimized model serving without managing CUDA drivers, GPU memory, and model loading.
For self-hosted deployments: Customers who need on-premises inference can configure their own model endpoints. The InferenceClient supports any HuggingFace-compatible API.
The tradeoff is latency: API calls to HuggingFace add ~100-140ms of network overhead. We mitigate this by running all five models in parallel using a ThreadPoolExecutor(max_workers=5). The wall-clock time is the slowest model, not the sum.
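The parallel fan-out reduces to a small ThreadPoolExecutor pattern. In this sketch the model callables stand in for HuggingFace Inference API calls; the dict-of-callables interface is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor

def run_ensemble(prompt: str, models: dict) -> dict:
    """Submit all model calls at once and collect their scores; wall-clock
    time is bounded by the slowest model, not the sum of all five."""
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        # .result() blocks per-future, but all calls are already in flight.
        return {name: fut.result() for name, fut in futures.items()}
```

Because the calls are network-bound rather than CPU-bound, threads (despite the GIL) are enough to get true overlap.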
Graceful Degradation
ML inference is inherently unreliable. APIs timeout, models return unexpected outputs, rate limits kick in. Our policy: ML failures never block requests.
If the HuggingFace API is down, the ML ensemble returns "no detection" and the request proceeds with only regex-based protection. This is the fail-open design philosophy: security should degrade gracefully, not catastrophically.
We log every ML failure for monitoring, and the regex layer provides a reliable baseline regardless of ML availability.
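The fail-open policy is essentially a wrapper like this (the result shape and field names are illustrative):

```python
# Sentinel returned whenever the ML layer is unavailable (illustrative shape).
NO_DETECTION = {"is_detected": False, "reason": "ml_unavailable", "confidence": 0.0}

def ml_check_fail_open(prompt: str, ml_call) -> dict:
    """Fail-open wrapper: any ML-layer exception degrades to 'no detection'
    so the request proceeds with regex-only protection."""
    try:
        return ml_call(prompt)
    except Exception:
        # In production, this failure would also be logged for monitoring.
        return dict(NO_DETECTION)
```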
Caching
The fastest security check is one that doesn't run.
Our DetectionCache uses SHA-256 hashes of the prompt text as cache keys, stored in Redis (or in-memory for single-instance deployments). If we've seen an identical prompt in the last hour, we return the cached result without running any detectors.
We intentionally chose exact-match caching over semantic/fuzzy caching. Approximate matching introduces a new attack surface: if an attacker can craft a prompt that's "similar enough" to a cached safe prompt but contains a subtle injection, fuzzy matching would let it through. Exact matching is less efficient but more correct.
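An in-memory sketch of the exact-match cache, keyed on the SHA-256 of the prompt (the Redis-backed version works the same way; the class shape here is illustrative):

```python
import hashlib
import time

class DetectionCache:
    """Exact-match detection cache: identical prompts hash to identical
    keys; a one-character change produces a different key entirely."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, result)

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        stored_at, result = entry
        if time.monotonic() - stored_at > self.ttl:
            return None               # expired; re-run the detectors
        return result

    def put(self, prompt: str, result) -> None:
        self._store[self._key(prompt)] = (time.monotonic(), result)
```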
Bot Detection
Alongside content analysis, we run behavioral analysis on every request.
The BotDetector computes a risk score from five independent signals:
- Rate limiting (40% weight): Burst detection (10/sec), minute limits (60/min), hour limits (1000/hr).
- Timing analysis (25% weight): Humans have variable response times. Bots don't. We measure the coefficient of variation of inter-request intervals—if it's suspiciously low, the traffic is likely automated.
- Payload analysis (15% weight): Identical payloads (replay attacks) and high-volume unique payloads (model extraction attempts, where >90% of 15+ payloads are unique).
- Session analysis (10% weight): Short sessions with high volume indicate scripted access.
- Reputation (10% weight): Exponential moving average of past behavior for each client fingerprint.
When the composite score exceeds 0.8, the client is temporarily blocked (stored in Redis for multi-instance consistency). Between 0.5 and 0.8, they receive a challenge (CAPTCHA via reCAPTCHA, Turnstile, or hCaptcha).
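Putting the weights and thresholds together, the composite score is a weighted sum over the five signals. The per-signal scores (each in [0, 1]) would come from the analyzers described above; this stand-in takes them as given:

```python
# Signal weights from the list above.
WEIGHTS = {
    "rate": 0.40, "timing": 0.25, "payload": 0.15,
    "session": 0.10, "reputation": 0.10,
}

def bot_risk(signals: dict) -> tuple:
    """Weighted composite of per-signal risk scores, mapped to an action.
    Missing signals default to 0 (no evidence of automation)."""
    score = sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
    if score > 0.8:
        return score, "block"        # temporary block, stored in Redis
    if score >= 0.5:
        return score, "challenge"    # reCAPTCHA, Turnstile, or hCaptcha
    return score, "allow"
```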
The Deployment Stack
PromptGuard is designed to be self-hosted. The production deployment is five containers:
| Service | Purpose |
|---|---|
| API (Python/FastAPI) | Core proxy and security engine |
| Dashboard (Next.js) | Configuration, analytics, log viewer |
| PostgreSQL | Persistent storage for users, projects, events, feedback |
| Redis | Caching, session state, bot detection, rate limiting |
| Nginx | TLS termination, reverse proxy |
Everything runs in a single docker-compose.yml. No Kubernetes required, no proprietary cloud services, no vendor lock-in.
For the hosted product, we deploy to Google Cloud Run with managed Postgres and Redis. The same codebase, different infrastructure.
Lessons Learned
1. Fail-Open is the Right Default
Early on, we defaulted to fail-closed: if anything went wrong in the security pipeline, we blocked the request. This sounded principled but was operationally disastrous. A Redis connection hiccup would take down the entire application. We switched to fail-open and never looked back.
2. Confidence Matters More Than Decisions
A binary "safe/unsafe" is almost useless for debugging. When we started returning calibrated confidence scores (via X-PromptGuard-Confidence), our users' ability to tune their security posture improved dramatically. A confidence of 0.51 means "barely suspicious"—you probably don't want to block that. A confidence of 0.98 means "definitely an attack"—block with conviction.
3. Output Scanning Is Not Optional
We initially focused only on input scanning. Then a customer reported that their bot was leaking API keys that were embedded in its training data. The input was clean—the output was the problem. Output scanning catches an entire class of threats that input scanning misses.
4. Cache Invalidation Is Hard (So Don't Do Semantic Caching)
We tried semantic caching early on—using embedding similarity to match "close enough" prompts to cached results. The false match rate was unacceptable. Two prompts that differ by one word ("Delete my account" vs. "Don't delete my account") can have high cosine similarity but opposite security implications. We stripped it out and went with exact-match hashing. Less clever, more correct.
Conclusion
PromptGuard isn't magic. It's a pipeline of well-understood components—regex matchers, ML classifiers, policy engines, caches, and circuit breakers—assembled with care and tested against real attack traffic.
The architecture reflects a simple belief: production security is an engineering problem, not an AI problem. You don't need a revolutionary new model. You need a reliable system that handles edge cases, degrades gracefully, and explains its decisions.
Every component we described here is open source. You can read the code, audit the logic, and run it on your own infrastructure. That's the point.