
Why We Open Sourced Our AI Firewall
When you deploy an LLM application, you are effectively giving every user a command line to your backend.
If you're building a chatbot, that command line is restricted—it can only generate text. If you're building an agent with tools, that command line has sudo. It can read files, query databases, send emails, and issue refunds. The only thing standing between a user and those capabilities is a system prompt that says "please don't do bad things."
That's not security. That's a suggestion.
The industry's answer to this problem has been "safety layers"—proprietary APIs that return a binary safe or unsafe verdict. We tried using them. We hated them. Here's why, and what we built instead.
The Black Box Problem
Imagine a Web Application Firewall (WAF) that didn't tell you why it blocked a request. It just returned 403 Forbidden. You check your logs—nothing. You check the vendor dashboard—"Threat Detected." You call support—"Our model flagged it as suspicious."
How do you debug that? You don't. You turn the WAF off, eat the risk, and move on.
That is the state of most AI security tools today. They are proprietary models hidden behind APIs. You send your user's prompt, and you pray the vendor's definition of "unsafe" matches yours.
We saw three fundamental problems with this approach:
1. No explainability. When a legitimate user gets blocked, you can't tell them why. You can't tell your engineering team why. You can't file a meaningful bug report. The vendor's model is a black box, and you're building your product on top of it.
2. No auditability. You're trusting a third party with every prompt your users send. Are they training on your data? Are they logging it? You don't know, because you can't see the code. For regulated industries (healthcare, finance, defense), this is a non-starter.
3. No customization. Every application has different security requirements. A creative writing tool and a banking chatbot have opposite definitions of "safe content." Black-box tools can't be tuned for your specific use case.
What We Built
PromptGuard is an open-source AI firewall. It sits between your application and your LLM provider as a transparent proxy.
Your App → PromptGuard → OpenAI / Anthropic / Gemini / etc.

Integration is one line of code—change your base_url. No SDK required, no code refactoring, no middleware to write.
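To make the proxy idea concrete, here is a minimal, stdlib-only sketch of what an OpenAI-compatible request through the proxy looks like. The host `api.promptguard.co` and the `/v1/chat/completions` path are assumptions for illustration; with the official OpenAI SDK the same change is just passing a different base_url when constructing the client.

```python
import json
import urllib.request

# Hypothetical proxy endpoint -- the real URL depends on your deployment.
PROMPTGUARD_BASE = "https://api.promptguard.co/v1"

def build_chat_request(prompt: str, model: str = "gpt-4o-mini") -> urllib.request.Request:
    """Build a standard OpenAI-style chat request aimed at the proxy.

    Because PromptGuard is a transparent proxy, the payload is unchanged;
    only the host differs from a direct provider call.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{PROMPTGUARD_BASE}/chat/completions",
        data=body,
        headers={
            "Authorization": "Bearer $OPENAI_API_KEY",  # your provider key, forwarded as-is
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request body is byte-for-byte what you would send to the provider directly, which is what makes the one-line migration possible.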
Seven Threat Detectors
We don't run one generic "is this safe?" model. We run seven specialized detectors, each focused on a specific threat type:
- Prompt Injection: Catches direct overrides ("ignore instructions"), roleplay manipulation, delimiter injection, and semantic evasion techniques.
- PII Detection: Identifies 14+ types of personally identifiable information (email, phone, SSN, credit cards with Luhn validation, passport numbers, IBAN, and more) and replaces them with safe tokens.
- Data Exfiltration: Detects attempts to extract system prompts, training data, user data, or knowledge base contents.
- Toxicity: Five-model ML ensemble covering hate speech, violence, self-harm, sexual content, and harassment.
- API Key Detection: Catches leaked credentials (OpenAI, AWS, GitHub, Google, generic API keys and tokens).
- Fraud Detection: Identifies social engineering patterns like wire transfer scams, gift card scams, and credential harvesting.
- Malware Detection: Blocks destructive commands (rm -rf, format c:), reverse shells, and encoded PowerShell payloads.
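The Luhn validation mentioned under PII detection is worth spelling out, because it is what separates a real credit-card number from any random run of 16 digits. This is the standard Luhn checksum, sketched independently here; it is not PromptGuard's actual implementation.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Checksumming candidate card numbers cuts false positives:
    most random 13-19 digit strings fail the check.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shortest valid card number length
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A detector that only pattern-matches digit runs would flag order numbers and tracking IDs; adding the checksum keeps the PII detector quiet on those.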
The Hybrid Architecture
Most AI security tools are just another LLM call. They send your prompt to GPT-4 with "is this safe?" and add 500ms to your latency.
We take a different approach. Our detection pipeline combines:
- Regex patterns for known attack signatures (fast, deterministic, zero false negatives on known patterns)
- A 5-model ML ensemble (Llama-Prompt-Guard, DeBERTa, ALBERT, toxic-bert, RoBERTa) running in parallel for semantic understanding
- Confidence calibration via Platt scaling, so our confidence scores are actually calibrated probabilities
Total overhead: approximately 150ms. Not zero, but an order of magnitude faster than an LLM-based security check.
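The Platt scaling step deserves a one-function illustration: it maps a raw detector score onto a probability by fitting a sigmoid to held-out labeled data. The coefficients below are illustrative placeholders, not fitted values from PromptGuard.

```python
import math

def platt_calibrate(raw_score: float, a: float = -4.0, b: float = 2.0) -> float:
    """Map a raw detector score in [0, 1] to a calibrated probability.

    Platt scaling fits p = 1 / (1 + exp(a * s + b)) on held-out labels;
    a and b here are made-up example coefficients.
    """
    return 1.0 / (1.0 + math.exp(a * raw_score + b))
```

The payoff is that a reported confidence of 0.9 then actually means "roughly 90% of flagged prompts at this score were true positives," which is what makes per-detector thresholds meaningful.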
Three Decisions, Not Two
Most security tools give you ALLOW or BLOCK. We add a third: REDACT.
If a user says "My SSN is 123-45-6789, can you help with my tax return?", the intent is legitimate. Blocking them would be hostile. Instead, we replace the SSN with [SSN_REDACTED], forward the sanitized prompt to the LLM, and the user gets their answer without exposing sensitive data.
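The redaction step above can be sketched in a few lines. This shows only the simplest SSN format as an example; the real detector covers many more PII types and formats.

```python
import re

# US SSN in the common XXX-XX-XXXX form. Illustrative only -- a production
# pattern would also handle unformatted digits and validate area/group numbers.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssns(prompt: str) -> str:
    """Replace SSNs with a safe token before forwarding the prompt."""
    return SSN_PATTERN.sub("[SSN_REDACTED]", prompt)
```

The sanitized prompt keeps its meaning, so the LLM can still answer the tax question without the SSN ever leaving your perimeter.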
Every decision is returned with full metadata:
- X-PromptGuard-Event-ID: Unique trace ID for debugging
- X-PromptGuard-Decision: ALLOW, BLOCK, or REDACT
- X-PromptGuard-Confidence: Calibrated confidence score (0.0-1.0)
- X-PromptGuard-Threat-Type: Which threat category was detected
Six Use-Case Presets
Every application has different security needs. A support bot needs strict PII protection. A code assistant needs to allow technical content that looks like injection. A creative writing tool needs relaxed toxicity thresholds.
We ship six composable presets: support_bot, code_assistant, rag_system, creative_writing, data_analysis, and default. Each preset can be combined with a strictness level (strict, moderate, permissive) to fine-tune thresholds across all detectors simultaneously.
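One way to picture how a preset composes with a strictness level: each preset defines per-detector thresholds, and the strictness level scales all of them at once. The preset names come from the paragraph above, but the threshold values and the merging logic below are assumptions for illustration, not PromptGuard's actual configuration.

```python
# Hypothetical per-detector thresholds (lower threshold = blocks more).
PRESET_THRESHOLDS = {
    "support_bot":      {"pii": 0.30, "prompt_injection": 0.50, "toxicity": 0.60},
    "code_assistant":   {"pii": 0.60, "prompt_injection": 0.80, "toxicity": 0.70},
    "creative_writing": {"pii": 0.50, "prompt_injection": 0.60, "toxicity": 0.90},
}

# Strictness scales every threshold: strict lowers them, permissive raises them.
STRICTNESS_SCALE = {"strict": 0.8, "moderate": 1.0, "permissive": 1.2}

def effective_thresholds(preset: str, strictness: str) -> dict:
    """Scale all detector thresholds in a preset by the strictness level."""
    scale = STRICTNESS_SCALE[strictness]
    return {det: min(1.0, t * scale)
            for det, t in PRESET_THRESHOLDS[preset].items()}
```

This is why the combinations compose cleanly: the preset encodes what your application considers risky, and strictness encodes how much risk you tolerate, without either needing to know about the other.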
Beyond Detection
PromptGuard isn't just a scanner. It's a complete security platform:
- Multi-provider routing with automatic failover across OpenAI, Anthropic, Gemini, Mistral, Groq, and Azure OpenAI
- Bot detection with behavioral analysis (rate limiting, timing analysis, payload fingerprinting, session analysis, reputation scoring)
- Red team testing with 20 built-in attack vectors you can run from the API or dashboard
- AI agent security with tool call validation, argument inspection, sequence analysis, and velocity limiting
- Webhook and email alerting when threats are detected (Slack-compatible)
- Shadow mode / A/B testing for safely evaluating detection model changes without affecting production traffic
- Feedback-driven model recalibration that automatically adjusts confidence thresholds based on false positive/negative reports
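The velocity limiting mentioned under agent security can be sketched as a sliding window over tool-call timestamps. This is a minimal illustration of the idea; the real engine also inspects arguments and call sequences, as listed above.

```python
import time
from collections import deque

class VelocityLimiter:
    """Reject tool calls once an agent exceeds max_calls per window_s seconds.

    Sliding-window sketch of velocity limiting for AI agents: a compromised
    or looping agent firing tools rapidly gets cut off.
    """

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = deque()  # timestamps of recent allowed calls

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] > self.window_s:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True
```

A limiter like this is cheap enough to run on every tool call, which matters when the agent loop itself is the attack surface.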
Why Open Source?
We chose open source for one simple reason: security tools must be auditable.
If you're trusting a tool to sit between your users and your AI, you need to be able to read the code. You need to see the regex patterns, the model configurations, the policy evaluation logic, and the data handling practices.
When PromptGuard blocks a request, you can trace the decision from the HTTP handler through the security engine to the specific detector and pattern that triggered. No black box. No "trust us." No mystery.
If you find a bypass, you can open an issue—or better yet, a pull request. The community finds and fixes vulnerabilities faster than any internal team.
The Free Tier
We offer 10,000 requests per month for free. Not a trial. Not "free for 14 days." Free forever.
We do this because we believe AI security shouldn't be gated by budget. A solo developer building a side project deserves the same injection protection as an enterprise team. The free tier includes the two most critical detectors (prompt injection and PII), the dashboard, and the full proxy functionality.
If you need ML-powered detection, all seven threat types, custom policies, or webhook alerting, the paid plans start at $49/month.
Try It
You can use the hosted version at promptguard.co—swap your base_url and you're done.
Or clone the repo from GitHub, run docker-compose up, and have the entire stack running in your VPC in five minutes. No data leaves your infrastructure. We never see your prompts.
If your AI application has users, it has an attack surface. The question isn't whether to add security—it's whether your security layer will explain itself when things go wrong.
Ours does. And you can read the source code to prove it.