
Inside PromptGuard's Architecture: How We Built a Production AI Firewall

A deep engineering walkthrough of how PromptGuard inspects every prompt in ~150ms using a 7-detector pipeline, 5-model ML ensemble, multi-provider routing, and Redis-backed state—without adding complexity to your codebase.


When we started building PromptGuard, we had one constraint that shaped every decision: security overhead must be invisible enough that developers don't rip it out.

The industry's existing AI security tools fell into two camps. The first: "Send your prompt to GPT-4 and ask if it's safe." Too slow, too expensive, and hilariously recursive. The second: "Block any message containing 'ignore previous instructions.'" Too dumb, too many false positives, and bypassed in five minutes by anyone with a thesaurus.

We needed something in between—fast enough to sit in the critical path, smart enough to catch semantic attacks, and transparent enough that developers could debug it.

Here's how we built it.

The Request Lifecycle

Every request that flows through PromptGuard follows the same pipeline. Understanding this pipeline is key to understanding our architectural decisions.

User App → PromptGuard Proxy → [Security Pipeline] → LLM Provider → [Output Scan] → User App

Step 1: Request Intake and DoS Protection

Before we even parse the JSON, we enforce hard limits:

  • Max request body: 1 MB. Anything larger gets a 413 immediately. This prevents memory exhaustion attacks where someone sends a 500MB payload.
  • Max prompt length: 100,000 characters (~25K tokens). This is generous for any legitimate use case but blocks the "repeat this word 10 million times" class of token-burn attacks.

These are dumb checks. They're supposed to be. The expensive security logic shouldn't waste cycles on payloads that are obviously adversarial at the transport level.

Step 2: The Security Engine (7 Detectors)

Once a request passes intake validation, it enters the security engine. This is where the real work happens.

We run seven independent detectors, each specialized for a different threat type:

| Detector | Threat Type | Mechanism | Available On |
| --- | --- | --- | --- |
| InjectionDetector | Prompt injection, jailbreaks | Regex patterns + ML ensemble (5 models) | All tiers |
| PIIDetector | Personal data leaks | 14 regex patterns + Luhn validation | All tiers |
| ExfiltrationDetector | Data theft attempts | 17 regex patterns + context analysis | Paid tiers |
| ToxicityDetector | Hate speech, violence, self-harm | Regex patterns + ML ensemble | Paid tiers |
| APIKeyDetector | Leaked credentials | 10 regex patterns (OpenAI, AWS, GitHub, etc.) | Paid tiers |
| FraudDetector | Social engineering, scams | 7 regex patterns | Paid tiers |
| MalwareDetector | Destructive commands, reverse shells | 8 regex patterns | Paid tiers |

The evaluation order matters. We run the most common and most dangerous checks first (injection, exfiltration, fraud, malware), then the "redact rather than block" checks (PII, API keys), then toxicity (which requires the full ML ensemble).

Step 3: The Three-Decision Model

Every detector returns the same interface: (is_detected, reason, confidence). The security engine combines these into one of three decisions:

  • ALLOW: No threats detected. Forward the request to the LLM.
  • BLOCK: Threat detected with high confidence. Return a 403 with an explanation.
  • REDACT: Sensitive data detected, but the request is otherwise safe. Replace PII with tokens like [EMAIL_REDACTED] and forward the sanitized version to the LLM.
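
The combination logic can be sketched as follows. The threshold value and the set of "redact rather than block" detector types are assumptions for illustration, not PromptGuard's real configuration.

```python
# Illustrative sketch of the three-decision model. BLOCK_THRESHOLD and
# REDACT_TYPES are assumptions, not PromptGuard's real configuration.
REDACT_TYPES = {"pii", "api_key"}     # "strip, don't block" detectors
BLOCK_THRESHOLD = 0.8                 # high-confidence threats are blocked

def combine(results: list[tuple[str, bool, str, float]]) -> str:
    """Each result is (detector_type, is_detected, reason, confidence)."""
    decision = "ALLOW"
    for dtype, detected, _reason, confidence in results:
        if not detected:
            continue
        if dtype not in REDACT_TYPES and confidence >= BLOCK_THRESHOLD:
            return "BLOCK"            # any high-confidence threat wins
        if dtype in REDACT_TYPES:
            decision = "REDACT"       # sanitize, but keep the request alive
    return decision
```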

The REDACT decision is crucial. Most security tools only know ALLOW and BLOCK. But if a user says "My SSN is 123-45-6789, can you help me with my tax return?", the intent is legitimate—they just shouldn't be sending their SSN to an LLM. We strip the SSN, the LLM sees [SSN_REDACTED], and the user gets their answer without exposing sensitive data.
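
A minimal version of that redaction step looks like this. The two patterns below are deliberately simplified stand-ins; the real PIIDetector uses 14 patterns plus Luhn validation.

```python
import re

# Minimal redaction sketch. These two patterns are simplified stand-ins;
# the real PIIDetector uses 14 patterns plus Luhn validation.
REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN_REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL_REDACTED]"),
]

def redact(prompt: str) -> str:
    # Replace each sensitive match with its token before forwarding to the LLM.
    for pattern, token in REDACTIONS:
        prompt = pattern.sub(token, prompt)
    return prompt
```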

Step 4: Policy Evaluation

Raw detector outputs feed into the Policy Engine, which applies business logic on top of security signals.

The Policy Engine supports three layers of rules:

Preset Configurations. We ship six use-case templates:

| Preset | Optimized For | Custom Patterns |
| --- | --- | --- |
| support_bot | Customer-facing bots | Blocks password/credit card sharing patterns |
| code_assistant | Developer tools | Blocks API keys, allows code-heavy content |
| rag_system | Document Q&A | Blocks confidential/proprietary data extraction |
| creative_writing | Content generation | Relaxed thresholds for creative language |
| data_analysis | Data processing | Blocks SSN/DOB patterns, limits external access |
| default | General purpose | Balanced thresholds across all categories |

Each preset can be combined with a strictness level (strict, moderate, permissive) that adjusts detection thresholds across all detectors simultaneously.

Custom Rules. Users can define regex-based rules for their specific domain. A fintech company might add a rule to block messages containing account numbers in their proprietary format. A healthcare company might flag messages referencing specific drug names.

Custom Policies. The most powerful layer. Condition-based policies that can filter inputs, filter outputs, or transform content based on arbitrary conditions. Policies have priority ordering—higher-priority policies execute first, and a policy can override lower-priority ones.
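
Priority-ordered evaluation can be sketched like this. The `Policy` shape and field names are illustrative assumptions, not PromptGuard's actual schema.

```python
from dataclasses import dataclass
from typing import Callable

# Hedged sketch of priority-ordered policy evaluation; the Policy shape
# and field names are assumptions, not PromptGuard's actual schema.
@dataclass
class Policy:
    name: str
    priority: int                       # higher runs first
    condition: Callable[[str], bool]    # does this policy apply?
    action: str                         # "block", "allow", or "transform"

def evaluate(policies: list[Policy], text: str) -> str:
    # Higher-priority policies execute first and override lower ones.
    for policy in sorted(policies, key=lambda p: p.priority, reverse=True):
        if policy.condition(text):
            return policy.action        # first match wins by priority
    return "allow"                      # no policy matched
```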

Step 5: Smart Routing and Provider Failover

If the request passes security checks, we forward it to the LLM provider. But we don't just blindly proxy to OpenAI.

Our SmartRouter maintains a health model of multiple LLM providers (OpenAI, Anthropic, Gemini, Mistral, Groq, Azure OpenAI) and selects the optimal provider based on:

  • Health status: Circuit breaker tracking of recent errors and latencies.
  • Cost: Different providers have different token pricing.
  • Capability: Model availability varies by provider.

If the primary provider fails, we automatically retry with up to two alternative providers. The user's request doesn't fail just because OpenAI is having a bad five minutes.
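
The failover loop reduces to something like the sketch below, where `providers` is assumed to be pre-sorted by the SmartRouter's health/cost ranking and each provider is a callable; both are simplifying assumptions.

```python
# Failover sketch: try the top-ranked provider, then up to two alternatives.
# The pre-sorted provider list and call signature are assumptions.
def call_with_failover(providers: list, prompt: str, max_fallbacks: int = 2):
    last_error = None
    for provider in providers[: max_fallbacks + 1]:  # primary + 2 retries
        try:
            return provider(prompt)
        except Exception as exc:        # provider down, rate-limited, etc.
            last_error = exc            # record the failure, try the next one
    raise RuntimeError("all providers failed") from last_error
```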

Step 6: Output Scanning

The LLM's response passes through a second security scan before reaching the user. This output scan runs PII detection, API key detection, toxicity checks, and any output-filter policies.

Why scan outputs? Because even with a clean input, the LLM can hallucinate sensitive data, echo back PII from its training data, or generate toxic content in response to an innocuous question. The output firewall catches these cases.

For streaming responses, we scan in real-time as chunks arrive. If we detect a threat mid-stream, we can cut the stream immediately.
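
The mid-stream cutoff can be sketched as a generator that scans the cumulative output as each chunk arrives. Here `looks_unsafe` is a stand-in for the real output detectors, and the termination message is illustrative.

```python
# Sketch of mid-stream cutoff; `looks_unsafe` stands in for the real
# output detectors, and the cutoff message is illustrative.
def scan_stream(chunks, looks_unsafe):
    seen = ""
    for chunk in chunks:
        seen += chunk                  # scan cumulative output, not chunks alone
        if looks_unsafe(seen):
            yield "[stream terminated by output firewall]"
            return                     # cut the stream immediately
        yield chunk
```

Scanning the accumulated text rather than each chunk in isolation matters: a secret split across two chunks would otherwise slip through.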

Step 7: Logging and Alerting

Every decision is logged with full context:

  • The event ID (returned to the caller as X-PromptGuard-Event-ID)
  • The decision and confidence (returned as X-PromptGuard-Decision and X-PromptGuard-Confidence)
  • The threat type, if any (returned as X-PromptGuard-Threat-Type)
  • A truncated content preview (max 500 characters, for privacy)

If the project has zero retention mode enabled, the content preview is omitted entirely. We log the decision but not the data.

If the decision is BLOCK, we fire alerts through two channels:

  1. Email alerts to the project owner (configurable: all, critical, or none)
  2. Webhook alerts to a configured URL (Slack-compatible JSON payload)
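
A Slack-compatible payload for a BLOCK event might look like the sketch below. The exact JSON schema is an assumption; only the "Slack-compatible" shape (a top-level `text` field) is taken from the description above.

```python
import json

# Illustrative webhook payload for a BLOCK event. Only the top-level
# "text" field (Slack-compatible) is from the source; the rest is assumed.
def block_alert(event_id: str, threat_type: str, confidence: float) -> str:
    payload = {
        "text": (
            f":rotating_light: PromptGuard blocked a request\n"
            f"event: {event_id} | threat: {threat_type} "
            f"| confidence: {confidence:.2f}"
        )
    }
    return json.dumps(payload)  # POST this body to the configured webhook URL
```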

The ML Pipeline

The most complex component is the ML ensemble, so it deserves its own section.

Why HuggingFace Inference API

We use HuggingFace's hosted Inference API for ML inference rather than running models locally. This was a deliberate decision:

For the hosted product: Running five GPU-accelerated models per request would require expensive dedicated GPU infrastructure. The HuggingFace API gives us access to optimized model serving without managing CUDA drivers, GPU memory, and model loading.

For self-hosted deployments: Customers who need on-premises inference can configure their own model endpoints. The InferenceClient supports any HuggingFace-compatible API.

The tradeoff is latency: API calls to HuggingFace add ~100-140ms of network overhead. We mitigate this by running all five models in parallel using a ThreadPoolExecutor(max_workers=5). The wall-clock time is the slowest model, not the sum.
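
The parallel fan-out is straightforward with `concurrent.futures`. In this sketch, `query_model` stands in for the HuggingFace `InferenceClient` call; the real ensemble's model IDs and scoring are not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Parallel ensemble sketch: all model calls run concurrently, so wall-clock
# latency is the slowest call, not the sum. `query_model` stands in for
# the HuggingFace InferenceClient call.
def run_ensemble(model_ids, query_model, text):
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(query_model, mid, text) for mid in model_ids]
        return [f.result() for f in futures]  # one result per model
```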

Graceful Degradation

ML inference is inherently unreliable. APIs timeout, models return unexpected outputs, rate limits kick in. Our policy: ML failures never block requests.

If the HuggingFace API is down, the ML ensemble returns "no detection" and the request proceeds with only regex-based protection. This is the fail-open design philosophy: security should degrade gracefully, not catastrophically.

We log every ML failure for monitoring, and the regex layer provides a reliable baseline regardless of ML availability.
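
The fail-open wrapper amounts to a few lines. `infer` is a stand-in for the ensemble call; the `(is_detected, confidence)` return shape is an assumption for illustration.

```python
import logging

# Fail-open sketch: any ML failure degrades to "no detection" instead of
# blocking. `infer` and the return shape are illustrative stand-ins.
def ml_check(infer, text):
    try:
        return infer(text)             # (is_detected, confidence)
    except Exception:
        logging.exception("ML inference failed; falling back to regex only")
        return (False, 0.0)            # never block on infrastructure errors
```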

Caching

The fastest security check is one that doesn't run.

Our DetectionCache uses SHA-256 hashes of the prompt text as cache keys, stored in Redis (or in-memory for single-instance deployments). If we've seen an identical prompt in the last hour, we return the cached result without running any detectors.

We intentionally chose exact-match caching over semantic/fuzzy caching. Approximate matching introduces a new attack surface: if an attacker can craft a prompt that's "similar enough" to a cached safe prompt but contains a subtle injection, fuzzy matching would let it through. Exact matching is less efficient but more correct.
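
The exact-match cache can be sketched as below, with a plain dict standing in for Redis in the single-instance case; class and method names are illustrative, not PromptGuard's actual API.

```python
import hashlib
import time

# Exact-match detection cache sketch: SHA-256 of the prompt as the key,
# one-hour TTL. A dict stands in for Redis; names are illustrative.
class DetectionCache:
    def __init__(self, ttl: float = 3600.0):
        self.ttl, self.store = ttl, {}

    @staticmethod
    def key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        entry = self.store.get(self.key(prompt))
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]            # cache hit: skip all detectors
        return None                    # miss or expired

    def put(self, prompt: str, result) -> None:
        self.store[self.key(prompt)] = (result, time.monotonic())
```

Note that even a one-character difference produces a different SHA-256 digest, which is exactly the property that closes the fuzzy-matching attack surface.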

Bot Detection

Alongside content analysis, we run behavioral analysis on every request.

The BotDetector computes a risk score from five independent signals:

  1. Rate limiting (40% weight): Burst detection (10/sec), minute limits (60/min), hour limits (1000/hr).
  2. Timing analysis (25% weight): Humans have variable response times. Bots don't. We measure the coefficient of variation of inter-request intervals—if it's suspiciously low, the traffic is likely automated.
  3. Payload analysis (15% weight): Identical payloads (replay attacks) and high-volume unique payloads (model extraction attempts, where >90% of 15+ payloads are unique).
  4. Session analysis (10% weight): Short sessions with high volume indicate scripted access.
  5. Reputation (10% weight): Exponential moving average of past behavior for each client fingerprint.

When the composite score exceeds 0.8, the client is temporarily blocked (stored in Redis for multi-instance consistency). Between 0.5 and 0.8, they receive a challenge (CAPTCHA via reCAPTCHA, Turnstile, or hCaptcha).
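
The composite score is a weighted sum of the five signals above, each normalized to [0, 1]; the per-signal scoring functions are not shown and the signal names here are illustrative.

```python
# Weighted composite score using the five signal weights above. The
# per-signal scores (each in [0, 1]) come from the individual analyzers,
# which are not shown; signal names are illustrative.
WEIGHTS = {
    "rate": 0.40, "timing": 0.25, "payload": 0.15,
    "session": 0.10, "reputation": 0.10,
}

def risk_score(signals: dict) -> float:
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def disposition(score: float) -> str:
    if score > 0.8:
        return "block"        # temporary block, stored in Redis
    if score >= 0.5:
        return "challenge"    # CAPTCHA via reCAPTCHA, Turnstile, or hCaptcha
    return "allow"
```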

The Deployment Stack

PromptGuard is designed to be self-hosted. The production deployment is five containers:

| Service | Purpose |
| --- | --- |
| API (Python/FastAPI) | Core proxy and security engine |
| Dashboard (Next.js) | Configuration, analytics, log viewer |
| PostgreSQL | Persistent storage for users, projects, events, feedback |
| Redis | Caching, session state, bot detection, rate limiting |
| Nginx | TLS termination, reverse proxy |

Everything runs in a single docker-compose.yml. No Kubernetes required, no proprietary cloud services, no vendor lock-in.

For the hosted product, we deploy to Google Cloud Run with managed Postgres and Redis. The same codebase, different infrastructure.

Lessons Learned

1. Fail-Open is the Right Default

Early on, we defaulted to fail-closed: if anything went wrong in the security pipeline, we blocked the request. This sounded principled but was operationally disastrous. A Redis connection hiccup would take down the entire application. We switched to fail-open and never looked back.

2. Confidence Matters More Than Decisions

A binary "safe/unsafe" is almost useless for debugging. When we started returning calibrated confidence scores (via X-PromptGuard-Confidence), our users' ability to tune their security posture improved dramatically. A confidence of 0.51 means "barely suspicious"—you probably don't want to block that. A confidence of 0.98 means "definitely an attack"—block with conviction.

3. Output Scanning Is Not Optional

We initially focused only on input scanning. Then a customer reported that their bot was leaking API keys that were embedded in its training data. The input was clean—the output was the problem. Output scanning catches an entire class of threats that input scanning misses.

4. Cache Invalidation Is Hard (So Don't Do Semantic Caching)

We tried semantic caching early on—using embedding similarity to match "close enough" prompts to cached results. The false match rate was unacceptable. Two prompts that differ by one word ("Delete my account" vs. "Don't delete my account") can have high cosine similarity but opposite security implications. We stripped it out and went with exact-match hashing. Less clever, more correct.

Conclusion

PromptGuard isn't magic. It's a pipeline of well-understood components—regex matchers, ML classifiers, policy engines, caches, and circuit breakers—assembled with care and tested against real attack traffic.

The architecture reflects a simple belief: production security is an engineering problem, not an AI problem. You don't need a revolutionary new model. You need a reliable system that handles edge cases, degrades gracefully, and explains its decisions.

Every component we described here is open source. You can read the code, audit the logic, and run it on your own infrastructure. That's the point.