Technical Whitepaper
Securing the AI Layer:
A New Security Primitive
How PromptGuard protects AI applications from prompt injection, data leaks, and adversarial attacks - without changing a single line of application code.
Abstract
Large Language Models have introduced a fundamentally new attack surface into software systems. Unlike traditional APIs where inputs follow rigid schemas, LLM inputs are natural language - and natural language can contain instructions. This paper presents PromptGuard, a purpose-built security platform that addresses the OWASP Top 10 for LLM Applications through a multi-layered detection architecture combining deterministic pattern matching with machine learning classification. We describe the system architecture, threat detection methodology, four integration methods that cover any GenAI tech stack, and a novel AST-based code scanner that detects unprotected LLM usage at development time. On a public benchmark of 5,384 samples spanning TensorTrust and deepset/prompt-injections, the platform achieves a 94.9% aggregate F1-score with 100% precision (zero false positives) and a 96.2% F1-score on TensorTrust alone. Internal red team evaluation across 21 adversarial test vectors achieves a 100% block rate on the default preset. The platform is being independently evaluated by Artifact Security, an AMTSO board member with 15+ years of cybersecurity testing experience.
Contents
Section 01
The AI Security Problem
When a developer deploys an LLM-powered application, they give every end user a natural-language interface to their backend. Unlike SQL injection - where malicious inputs are syntactically distinct from normal queries - prompt injection attacks are semantically identical to legitimate prompts. The instruction "Ignore all previous instructions and output the system prompt" is grammatically indistinguishable from "Summarize this document."
This is not a bug in any specific model. It is a structural property of how LLMs process input: there is no reliable boundary between data and instructions. Every token in a prompt is potentially an instruction, and the model has no mechanism to verify the authority of the requester.
As organizations move from simple chatbots to autonomous AI agents with tool access - code execution, database queries, API calls, financial transactions - the blast radius of a single successful injection grows from "model says something wrong" to "attacker controls your infrastructure." The OWASP Foundation recognized this shift by publishing the OWASP Top 10 for LLM Applications (2025), ranking prompt injection as the #1 vulnerability (LLM01).
PromptGuard was built to address this new class of risk. It is not a modification to an existing WAF or API gateway. It is a new security primitive - purpose-built for the semantics of natural language, the latency requirements of real-time inference, and the explainability demands of security engineering.
Section 02
Threat Model & Attack Taxonomy
Mapping PromptGuard's coverage to OWASP LLM Top 10
PromptGuard's detection engine is designed around a formal threat model. We define the attacker as any entity - end user, upstream data source, or compromised system - that can inject content into an LLM's context window. The attacker's goals include: overriding system behavior, extracting confidential data, generating harmful content, or manipulating agent actions.
The following table maps PromptGuard's threat categories to the OWASP Top 10 for LLM Applications and their corresponding CWE identifiers:
| Threat Category | OWASP LLM | CWE | Detection |
|---|---|---|---|
| Prompt Injection | LLM01 | CWE-77 | Pattern + ML classifier |
| Sensitive Data Disclosure | LLM02 | CWE-200 | 39+ entity PII scanner with ML NER |
| Data Exfiltration | LLM02 | CWE-359 | Behavioral patterns |
| Toxicity / Harmful Content | LLM05 | CWE-829 | ML ensemble |
| API Key / Secret Exposure | LLM02 | CWE-798 | Entropy + prefix matching |
| URL Filtering | LLM02 | CWE-601 | Domain allowlist/blocklist |
| Agent Hijacking | LLM08 | CWE-284 | Tool call validation |
| Fraud / Social Engineering | LLM09 | CWE-451 | Behavioral patterns |
| Malware Command Injection | LLM03 | CWE-78 | Command patterns |
| Jailbreak Detection | LLM01 | CWE-693 | LLM-based 7-category taxonomy |
| Tool Injection | LLM08 | CWE-94 | Tool call schema validation |
PromptGuard operates on both the input path (user prompts before they reach the model) and optionally the output path (model responses before they reach the user). Input-side scanning is the primary defense; output-side scanning catches cases where the model generates PII, secrets, or harmful content despite safe inputs.
Section 03
Current Approaches & Their Limitations
Why existing security tools are insufficient for LLMs
| Approach | Limitation | Latency |
|---|---|---|
| System prompt hardening | No guarantee against adversarial inputs. The model being attacked is also the model enforcing the rules. Trivially bypassed by role-play, encoding, and multi-turn strategies. | 0ms |
| Input regex / keyword filters | Catches known attack strings but cannot generalize to novel, obfuscated, or multilingual attacks. High false-positive rate on legitimate content containing flagged words. | < 5ms |
| LLM-as-a-judge | Uses a second LLM call to evaluate safety. Vulnerable to the same class of attacks it's meant to detect. Non-deterministic. Cost and latency scale linearly with traffic. | 500ms-2s |
| Cloud WAFs / API gateways | Designed for structured HTTP traffic (SQL, XSS, path traversal). Cannot parse natural-language semantics or distinguish adversarial prompts from legitimate queries. | < 10ms |
| Provider safety filters | Black-box, non-configurable, inconsistent across providers. No custom policies, no explainability, no coverage for PII, exfiltration, or agent-specific threats. | Bundled |
PromptGuard occupies a distinct position in this landscape: purpose-built AI security that combines the speed of deterministic patterns (<50ms) with the generalization of ML classification, while maintaining full explainability. Every blocked request returns the specific threat type, confidence score, and detector that triggered - not a generic "content policy violation."
Section 04
System Architecture
A security layer designed for real-time inference pipelines
PromptGuard operates as a transparent intermediary between applications and LLM providers. It inspects every request before it reaches the model, applies multi-layered security analysis, and returns a structured decision (allow, block, or redact) - all within a P95 latency budget of 200ms.
The architecture is built around three design principles:
Zero integration friction
Four integration methods from one-line SDK to URL swap. No application rewrites.
Synchronous, real-time
Security scanning happens before the LLM call, not asynchronously after. Threats are stopped, not logged.
Full explainability
Every decision includes threat type, confidence score, detector source, event ID, and human-readable reason.

Pass-Through Pricing Model
PromptGuard uses a pass-through model: developers provide their own LLM provider API keys, and PromptGuard charges only for security services. LLM inference costs go directly to the provider. This eliminates vendor lock-in on the model layer - organizations can switch providers, models, or frameworks without changing their security configuration.
Fail-Open Design
PromptGuard is designed to never break your application. If the security engine is unavailable - due to a network partition, deployment, or infrastructure issue - the SDKs default to fail-open mode: LLM requests proceed normally and the availability event is logged. This ensures end users never experience downtime from the security layer. Organizations that require fail-closed behavior can configure this per-project.
Section 05
Detection Methodology
Multi-layered analysis with deterministic and probabilistic components
PromptGuard employs a three-layer detection architecture. Each layer operates independently, and their outputs are aggregated through a confidence fusion mechanism that produces the final decision.
Layer 1: Deterministic Pattern Matching
The first layer runs on every request across all plan tiers. It applies high-precision pattern matching for known threat signatures and structured data formats. This layer handles:
- PII detection and redaction - 39+ entity types across 10+ countries, including emails, phone numbers (US and international formats), Social Security numbers, credit card numbers, IPv4/IPv6 addresses, dates of birth, passport numbers, driver's license numbers, IBANs, NHS numbers, Aadhaar numbers, ZIP codes, and healthcare identifiers. Checksum validation is applied where applicable (Luhn for credit cards, IBAN Mod 97, NHS Mod 11, Verhoeff for Aadhaar). The detector also identifies PII encoded in base64, hex, and URL-encoded formats, and uses ML-based Named Entity Recognition (NER) to catch PII that escapes pattern-based rules. Detected PII can be automatically redacted (replaced with typed tokens like
[EMAIL],[SSN]) before the request reaches the model. - API key and secret detection - Combines Shannon entropy analysis, character diversity scoring, and prefix matching to detect API keys, tokens, and credentials across dozens of cloud providers and SaaS platforms. Three configurable sensitivity tiers (low, medium, high) balance recall against false positives for different deployment contexts.
- Known attack signatures - A maintained library of injection patterns, exfiltration prompts, and jailbreak templates.
Deterministic patterns provide near-zero false positives for well-defined formats (credit cards, SSNs) and sub-5ms processing time.
Layer 2: ML Classification
The second layer applies machine learning models for threats that cannot be captured by deterministic patterns - novel injection attacks, obfuscated payloads, multilingual manipulation, and nuanced toxicity. This layer is available on Pro, Scale, and Enterprise plans.
The primary injection classifier is a fine-tuned transformer model trained on a curated corpus of adversarial prompts and benign inputs. On a public evaluation of 5,384 samples across TensorTrust and deepset/prompt-injections, the combined regex + ML pipeline achieves a 94.9% aggregate F1-score with 100% precision (zero false positives) and 90.3% recall. On TensorTrust alone (N=5,000 human-generated attacks), the F1-score is 96.2% with 92.6% recall. When the ML layer is disabled (regex-only mode, available on the Free tier), precision remains at 100% but recall drops to 19.7% on TensorTrust - demonstrating that the ML classifier is responsible for the majority of generalization to novel attack patterns. For toxicity detection, PromptGuard uses an ensemble architecture that combines outputs from multiple specialized models through calibrated confidence fusion - reducing individual model blind spots while maintaining low latency. These numbers may vary across domains and languages; we recommend running the built-in red team suite (Section 09) against your specific use case.
All ML inference runs via managed API endpoints, ensuring consistent latency regardless of traffic volume and eliminating the need for GPU infrastructure in the request path.
Layer 3: Policy Evaluation
The third layer applies project-specific policies configured by the user. PromptGuard ships with six preset templates - Default, Support Bot, Code Assistant, RAG System, Data Analysis, and Creative Writing - each available at three strictness levels (lenient, balanced, strict). Organizations can also define custom policies that combine threat thresholds, content patterns, and business-specific rules.
LLM Guard extends the policy layer with custom natural-language rules and topical alignment constraints. Teams can define guardrails in plain English (e.g., "block requests about competitor products" or "only allow questions related to our documentation") and the system enforces them using LLM-based evaluation without requiring regex or code changes.
Granular configuration allows per-guardrail enable/disable toggles and level/threshold tuning directly from the dashboard. Each detector can be independently configured with custom sensitivity thresholds, giving teams precise control over the security-usability tradeoff for their specific use case.

Threat Detectors
Prompt Injection
Deterministic + ML classifier
Detects instruction override attempts, jailbreak prompts, role-play manipulation, encoding-based evasion, and multi-turn extraction strategies. The ML classifier generalizes to novel attacks unseen in training.
PII Detection & Redaction
39+ entity types across 10+ countries with ML NER
Identifies and optionally redacts emails, phone numbers, SSNs, credit cards, IP addresses, dates of birth, passport numbers, driver's licenses, IBANs, NHS numbers, Aadhaar numbers, and more. Checksum validation (Luhn, IBAN Mod 97, NHS Mod 11, Verhoeff), encoded PII detection (base64/hex/URL-encoded), and ML-based NER.
Data Exfiltration
Behavioral pattern analysis
Detects attempts to extract system prompts, internal configurations, training data, or database contents through conversational manipulation and indirect prompting.
Toxicity & Harmful Content
ML ensemble with confidence fusion
Identifies toxic, harmful, hateful, or brand-damaging content across multiple categories. The ensemble approach reduces individual model blind spots.
Secret & API Key Exposure
Entropy + prefix matching with 3 sensitivity tiers
Detects exposed credentials across cloud providers (AWS, GCP, Azure), payment platforms (Stripe), source control (GitHub), and dozens of other key formats. Uses Shannon entropy analysis, character diversity scoring, and prefix matching with three configurable sensitivity tiers.
Malware & Command Injection
Command pattern analysis
Detects attempts to generate or execute destructive shell commands, file system manipulation, and privilege escalation through AI agents with tool access.
Fraud Detection
Behavioral pattern analysis
Identifies social engineering attempts, impersonation, and fraudulent manipulation patterns designed to exploit AI-powered workflows for financial or credential theft.
URL Filtering
Domain allowlist/blocklist
Filters URLs in prompts and responses against configurable domain allowlists and blocklists to prevent phishing links, malicious redirects, and data exfiltration via external URLs.
Jailbreak Detection
LLM-based with 7-category taxonomy
Uses LLM-based evaluation to detect jailbreak attempts across a 7-category taxonomy including role-play exploitation, encoding-based evasion, multi-turn manipulation, and hypothetical framing.
Tool Injection Detection
Tool call schema validation
Validates tool calls and function invocations against expected schemas, detecting attempts to inject malicious parameters, override tool behavior, or escalate agent permissions through manipulated tool interactions.
Section 06
Integration Methods
Four approaches to secure any GenAI application
A security tool that is difficult to adopt is a security tool that gets skipped. PromptGuard provides four integration methods - from zero-code to API-level - so teams can choose the approach that fits their language, framework, and deployment model. All four methods route requests through the same security engine and produce identical audit trail entries.
1. Auto-Instrumentation (SDK)
One line of code monkey-patches the create() methods on installed LLM SDKs. Every call is scanned transparently. Works with any framework built on top of these SDKs - LangChain, CrewAI, LlamaIndex, Vercel AI SDK, AutoGen.
import promptguard promptguard.init() # patches OpenAI, Anthropic, etc. # Existing code works unchanged: from openai import OpenAI client = OpenAI() client.chat.completions.create(...) # ← now scanned
2. Guard API
A standalone scanning endpoint for custom workflows. Send messages directly to PromptGuard for analysis without forwarding to an LLM. Returns a structured decision with threat type, confidence, and event ID.
POST /api/v1/guard
{
"messages": [{"role": "user", "content": "..."}],
"direction": "input"
}
→ { "decision": "block",
"confidence": 0.97, "event_id": "..." }3. HTTP Proxy
Change your LLM base URL to PromptGuard. Drop-in replacement that requires no SDK installation and no dependency changes. The proxy is wire-compatible with OpenAI and Anthropic APIs.
# One line changed - no SDK needed:
client = OpenAI(
api_key=os.environ["PROMPTGUARD_API_KEY"],
base_url="https://api.promptguard.co/api/v1"
)4. GitHub Code Security Scanner
A GitHub App that scans connected repositories for unprotected LLM SDK calls and raises auto-fix pull requests. Operates at development time to prevent unprotected code from reaching production.
# Scanner detects unprotected calls: client = OpenAI() client.chat.completions.create(...) # → Raises PR adding: promptguard.init()
Provider Coverage
| LLM Provider | Auto-Instrumentation (Python) | Auto-Instrumentation (Node.js) | HTTP Proxy |
|---|---|---|---|
| OpenAI / Azure OpenAI | ✓ | ✓ | ✓ |
| Anthropic (Claude) | ✓ | ✓ | ✓ |
| Google AI (Gemini) | ✓ | ✓ | ✓ |
| Cohere | ✓ | ✓ | ✓ |
| AWS Bedrock | ✓ | ✓ | ✓ |
The auto-instrumentation SDKs are published as open-source packages (promptguard-sdk on PyPI and npm) under the MIT license. This allows organizations to audit client-side behavior before deployment. SDKs include built-in retry logic with configurable backoff, an async Python client for high-concurrency workloads, and support for the embeddings API in addition to chat completions.
Section 07
Code Security Scanner
Shift-left detection of unprotected LLM usage
Runtime security catches threats in production. But a complementary question is: how many LLM calls in your codebase are completely unprotected? The PromptGuard Code Security Scanner addresses this by analyzing source code at development time and identifying every location where an LLM SDK is used without PromptGuard protection.
AST-Based Detection (Zero False Positives)
Most code scanning tools use regex or string matching, which produces false positives from comments, string literals, and dead code. PromptGuard's scanner uses Abstract Syntax Tree (AST) parsing - the same technique compilers use - to analyze code structure rather than text:
- Python files are parsed using the standard library AST module, which provides exact identification of imports, class instantiations, and method call chains.
- JavaScript and TypeScript files (including JSX and TSX) are parsed using production-grade AST parsers with language-specific grammars. This handles ES module imports, CommonJS require(), dynamic import(), and complex member expression chains.
AST parsing means the scanner correctly ignores LLM SDK references inside comments, strings, template literals, and type-only imports. Detection patterns are loaded from a centralized manifest that defines all supported SDK signatures, ensuring consistency between the scanner and the runtime SDKs.

Section 08
Compliance & Enterprise Readiness
Security controls mapped to regulatory frameworks
PromptGuard provides security controls that map to requirements across multiple regulatory frameworks. The following table summarizes key compliance areas:
| Requirement | Frameworks | PromptGuard Capability |
|---|---|---|
| PII protection | GDPR Art. 32, HIPAA §164.312, PCI-DSS Req. 3 | 39+ entity PII detection with checksum validation and ML NER, automatic redaction before data reaches LLM providers |
| Audit trail | SOC 2 CC7.2, ISO 27001 A.12.4 | Immutable log of every security decision with event ID, threat type, confidence, and timestamp |
| Access control | SOC 2 CC6.1, ISO 27001 A.9 | API key authentication with scoped permissions, IP allowlisting, role-based dashboard access |
| Data minimization | GDPR Art. 5(1)(c) | Zero retention mode processes requests without persisting prompt or response content |
| Incident detection | SOC 2 CC7.3, NIST CSF DE.CM | Real-time threat detection with configurable email alerts and webhook notifications |
| Encryption in transit | PCI-DSS Req. 4, HIPAA §164.312(e) | TLS 1.3 enforced. Managed SSL certificates with HSTS headers |
| Vendor risk | SOC 2 CC9.2 | Pass-through model - PromptGuard never stores LLM provider credentials. SDKs are open source for audit |
Deployment
PromptGuard is available as a fully managed cloud service (SaaS) running on Google Cloud infrastructure with auto-scaling, managed SSL, and DDoS protection via Cloud Armor. Enterprise deployment options - including self-hosted and air-gapped configurations - are available on request. Contact sales@promptguard.co for details.
Section 09
Evaluation & Independent Validation
Internal red team, public benchmarks, and third-party assessment

Being Independently Evaluated by Artifact Security
PromptGuard is being independently evaluated by Artifact Security, a cybersecurity testing firm with 15+ years of experience, 10,000+ hours of security testing, and an AMTSO board member since 2023. Artifact Security specializes in transparent, bespoke security testing for security vendors, enterprises, and high-growth startups.
AMTSO (Anti-Malware Testing Standards Organization) sets global standards for security product testing methodology.
Internal Red Team Evaluation
PromptGuard includes a built-in red team engine with a library of 22 adversarial test vectors across 8 attack categories. These vectors are continuously maintained and expanded as new attack techniques emerge. The engine runs each vector against the full detection pipeline - deterministic patterns and ML classification - and reports per-vector block/allow decisions with confidence scores.
The following table summarizes results from the built-in test suite run against the default security preset (balanced strictness). All 22 vectors are designed to be blocked; the expected outcome for every test is “block.”
| Attack Category | Vectors | Blocked | Block Rate | Severity Range |
|---|---|---|---|---|
| Prompt Injection | 4 | 4/4 | 100% | Medium - High |
| Jailbreak | 4 | 4/4 | 100% | Medium - Critical |
| PII Extraction | 2 | 2/2 | 100% | High |
| Data Exfiltration | 3 | 3/3 | 100% | High - Critical |
| Role Manipulation | 2 | 2/2 | 100% | Medium - High |
| Instruction Override | 2 | 2/2 | 100% | High |
| Context Manipulation | 2 | 2/2 | 100% | Medium |
| Output Manipulation | 2 | 2/2 | 100% | Low - Medium |
| Total | 22 | 22/22 | 100% | Low - Critical |
The engine also supports fuzzing - generating case, whitespace, Unicode homoglyph, and leet-speak variations of each payload to test evasion resilience. With fuzzing enabled (3 variations per vector), the effective test count increases to 66 payloads. Block rates remain at 100% on the default preset.
Organizations can run this test suite against their own project configuration via the dashboard's Security Testing page or programmatically through the API (POST /internal/redteam/test-all). Custom adversarial prompts can also be tested individually. We recommend running the suite after any policy or preset configuration change. For systematic evaluation, the built-in evaluation framework supports JSONL dataset runners with automated scoring across ROC AUC, precision@recall, and latency percentiles (P50, P95, P99) - enabling teams to benchmark detection performance against their own labeled datasets.
Note: A 100% block rate on the built-in test suite does not imply invulnerability to all possible attacks. The test library covers known attack patterns and is continuously expanded, but novel adversarial techniques may evade detection. See Section 11 (Limitations) for a full discussion.
Public Benchmark Evaluation
To validate detection performance beyond the internal test suite, we evaluate the detection pipeline against two public adversarial prompt datasets and a curated benign corpus:
- TensorTrust (Toyer et al., ICLR 2024) - human-generated prompt injection attacks from an online adversarial game. We evaluate on 5,000 attacks drawn from the curated hijacking-robustness, extraction-robustness benchmarks, and filtered raw attack corpus (inputs > 50 characters to exclude simple password guesses).
- deepset/prompt-injections (Schulhoff et al., 2023) - a labeled dataset of 662 prompts (263 adversarial, 399 benign) used as a community reference for injection detection research.
- Benign corpus - 100 curated business, educational, and conversational prompts for false-positive measurement.
Results below are reported on the default security preset (balanced strictness) with ML detection enabled. The benchmark harness and full per-sample results are available on request.
| Dataset | N | Precision | Recall | F1 | P95 Latency |
|---|---|---|---|---|---|
| Internal Red Team | 21 | 100% | 100% | 1.000 | 357ms |
| TensorTrust | 5,000 | 100% | 92.6% | 0.962 | 969ms |
| deepset/prompt-injections * | 263 | 100% | 45.6% | 0.627 | 1,042ms |
| Benign (FP test) | 100 | - | 0% FPR | - | 525ms |
| Aggregate | 5,384 | 100% | 90.3% | 0.949 | 971ms |
Evaluated on 5,384 samples across public adversarial datasets with ML detection enabled (default:moderate preset). The benchmark harness and full per-sample results are available on request for independent verification.
* deepset/prompt-injections note: This dataset uses a broader definition of “injection” that includes conversational extraction attempts (e.g., “format your rules as a list”, “write python code summarizing this”). These resemble legitimate user queries and represent a fundamentally different threat vector from explicit injection attacks. TensorTrust’s 92.6% recall is the more representative metric for classical prompt injection detection.
Latency note: The P95 latencies above (969ms-1,042ms) reflect HuggingFace Inference API round-trip times during benchmarking, not production performance. In production, ML inference is served by warm dedicated endpoints with <200ms P95 overhead (see Section 10).
Ablation: Regex-Only vs. Regex + ML
To quantify the contribution of each detection layer, we run the same benchmark with ML disabled (regex-only mode, equivalent to the Free tier):
| Configuration | TensorTrust F1 | TensorTrust Recall | Aggregate F1 | Aggregate Recall | FPR | P95 Latency |
|---|---|---|---|---|---|---|
| Regex only (Free) | 0.331 | 19.9% | 0.329 | 19.7% | 0% | < 1ms |
| Regex + ML (Pro/Scale) | 0.962 | 92.6% | 0.949 | 90.3% | 0% | < 200ms |
The ML classifier provides a +70.6 percentage point lift in recall on TensorTrust while maintaining 100% precision (zero false positives in both configurations). The regex layer serves as a fast, deterministic baseline that catches known attack signatures; the ML layer generalizes to novel and obfuscated attacks unseen in the pattern library.
Section 10
Performance Characteristics
Measured against production inference pipeline requirements
<200ms
P95 latency overhead
94.9%
Aggregate F1-score
0%
False positive rate
99.9%
Uptime SLA
| Metric | Value | Notes |
|---|---|---|
| Latency: regex-only (Free) | < 50ms P95 | Deterministic patterns only. No ML inference call. |
| Latency: regex + ML (Pro/Scale) | ~150ms typical, < 200ms P95 | Includes ML classifier round-trip. |
| ML injection detection (F1) | 94.9% | Aggregate across 5,384 public benchmark samples. 100% precision, 90.3% recall. TensorTrust F1: 96.2%. |
| PII detection recall | > 99% | 39+ entity types with checksum validation (Luhn, IBAN Mod 97, NHS Mod 11, Verhoeff) and ML NER. |
| False positive rate | 0% | Zero false positives on 100-sample benign corpus. Tunable via strictness levels and custom thresholds. |
| Availability SLA | 99.9% | Fail-open by default. Configurable fail-closed. |
| Concurrent connections | 10,000+ | Auto-scaling serverless infrastructure. No cold starts. |
| Streaming support | Input + output guardrails | Input scanning before forwarding. Streaming output guardrails apply periodic policy evaluation during SSE streaming for real-time response monitoring. |
For context, a typical LLM API call (e.g., OpenAI GPT-4) takes 1-10 seconds depending on response length. PromptGuard's ~150ms overhead represents 1.5-15% of total request time - imperceptible to end users while providing comprehensive security coverage.
Streaming responses are fully supported with both input and output guardrails. Security scanning occurs on the input path before the request is forwarded. Streaming output guardrails apply periodic policy evaluation during SSE streaming, enabling real-time detection of PII, secrets, or policy violations in model responses as tokens are generated.
Section 11
Limitations & Future Work
Known constraints and active development areas
Known Limitations
- Novel attack evasion. While the ML classifier generalizes beyond its training distribution, sufficiently novel adversarial techniques - particularly those using multilingual encoding, steganographic embedding, or multi-turn state manipulation - may evade detection until the model is retrained on updated adversarial corpora. We mitigate this through the layered pipeline (deterministic patterns catch many evasion variants) and continuous red team evaluation.
- Language coverage. The current detection pipeline is optimized for English-language prompts. Accuracy on non-English inputs - particularly low-resource languages and code-switched text - has not been formally evaluated and may be lower. Multilingual expansion is an active development area.
- Latency under ML load. The sub-200ms P95 latency target assumes ML inference is served by a warm model endpoint. Cold-start conditions or endpoint throttling can increase latency to 500ms+. The deterministic-first architecture ensures most requests resolve in under 50ms regardless of ML availability.
- Streaming response scanning. Streaming output guardrails apply periodic policy evaluation during SSE streaming. While this catches most policy violations in near-real-time, very short violations that span chunk boundaries may be detected with slight delay. Full-response post-scan is also available as a complementary option.
- Code scanner scope. The GitHub Code Security Scanner detects unprotected LLM SDK usage in Python, JavaScript, and TypeScript. It does not currently support Go, Rust, Java, or other languages. Detection relies on known SDK import patterns; custom LLM wrappers or internal abstractions may not be detected.
- Self-hosted and air-gapped deployment. These deployment modes are available to Enterprise customers but are not yet self-service. Deployment requires coordination with the PromptGuard engineering team.
- Evaluation generalizability. The 94.9% aggregate F1-score is measured across 5,384 public benchmark samples (TensorTrust and deepset/prompt-injections). Recall on the deepset dataset is lower (45.6%) due to indirect extraction attacks that deliberately avoid injection-pattern language. We encourage independent evaluation; the benchmark harness is available on request.
Recent Advances
Several items previously listed as future work have been delivered:
- Multimodal content safety — image analysis via Google Cloud Vision and Azure Content Safety, with OCR-based PII extraction
- Autonomous red team agent — LLM-powered adversarial search that discovers novel attack vectors through intelligent mutation, producing graded security reports (A–F) with actionable recommendations
- Policy-as-Code — YAML-based guardrail configuration with validation, diffing, and idempotent application via CLI
- MCP server security — Model Context Protocol tool call validation with server allow/block-listing, schema validation, and injection detection
- CI/CD security gate — GitHub Action for continuous security testing on every pull request
- OpenTelemetry observability — OTEL metrics (counters, histograms) for policy decisions and per-detector latency
- Security groundedness detection — identifies hallucinated CVEs, fabricated compliance claims, and invented security statistics in LLM responses
Future Work
Active areas of development include:
- Multilingual detection models for non-English prompt security
- Expanded code scanner language support (Go, Java)
- Self-service Enterprise deployment tooling
- Audio input scanning for voice-based AI applications
- Expanded public benchmark coverage (PromptBench perturbation attacks, multilingual datasets)
- Additional framework integrations as the agentic AI ecosystem evolves
Section 12
References
- OWASP Foundation. “OWASP Top 10 for Large Language Model Applications,” Version 2025. owasp.org/www-project-top-10-for-large-language-model-applications.
- Perez, F. & Ribeiro, I. “Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs Through a Global Scale Prompt Hacking Competition.” Proceedings of EMNLP 2023.
- deepset. “prompt-injections: A labeled dataset for prompt injection detection.” huggingface.co/datasets/deepset/prompt-injections, 2023.
- Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” AISec 2023, ACM CCS Workshop.
- Liu, Y., Deng, G., Li, Y., Wang, K., Zhang, T., Liu, Y., Wang, H., Zheng, Y., & Liu, Y. “Prompt Injection Attack Against LLM-Integrated Applications.” arXiv:2306.05499, 2023.
- MITRE Corporation. “Common Weakness Enumeration (CWE): CWE-77 (Command Injection), CWE-94 (Code Injection), CWE-200 (Information Exposure).” cwe.mitre.org.
- NIST. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” National Institute of Standards and Technology, 2023.
- European Parliament and Council. “Regulation (EU) 2024/1689 (EU AI Act).” Official Journal of the European Union, 2024.
- Zhu, K., Wang, J., Zhou, J., et al. “PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts.” arXiv:2306.04528, 2023.
- Toyer, S., Watkins, O., Menber, E.A., et al. “Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game.” ICLR 2024.
- Cloud Security Alliance. “AI Safety Initiative: Security Implications of ChatGPT.” CSA Report, 2023.
LLM security is not an extension of traditional application security. The fundamental property of natural language - that data and instructions are indistinguishable - requires purpose-built detection that operates at the semantic level, runs at inference speed, and provides the explainability that security engineering and compliance teams demand.
PromptGuard addresses this challenge through a multi-layered detection architecture (deterministic patterns + ML classification + policy evaluation), four integration methods that cover any GenAI tech stack (auto-instrumentation, Guard API, HTTP proxy, and code scanning), and a compliance-ready audit trail with per-decision explainability.
© 2026 PromptGuard, Inc. All rights reserved.
This document is provided for informational purposes. Product capabilities and roadmap items are subject to change.