Prompt Guard
Technical Overview
Prompt Guard is a real-time guardrail that protects AI agents from unsafe, malicious, or policy-violating inputs before they reach the backbone model. Prompt Guard inspects every incoming prompt — including end-user messages, tool responses, MCP server outputs, and retrieved documents — and blocks malicious attacks (e.g., prompt injection and jailbreaking) and content that violates configured policies.
Unlike rule-based filters or generic moderation APIs, Prompt Guard uses our purpose-built models that understand prompt semantics and intent, enforcing standard policy frameworks (e.g., EU AI Act, OWASP LLM Top 10) as well as fully customized organizational policies.
- Low latency — adds as little as 100ms per call, so it can sit on the hot path of every agent interaction without degrading the user experience.
- Specialized prompt analysis — purpose-built for distinguishing benign user requests from attacks injected in prompts as well as policy-violating content.
- Low false positives and tunable thresholds — We optimize our model to minimize the false positives. We also have tunable thresholds. Teams can dial detection sensitivity to their environment, balancing false-positive rate against coverage for their specific risk profile.
- Flexible policy enforcement — Action Guard accepts customer-defined policies at runtime without retraining the model, enabling instant adjustment of guardrail behavior and policy-based security enforcement.
Key Features
- Accurate prompt-injection detection — semantic understanding of prompt intent reduces false positives compared to keyword or regex filters, while still catching obfuscated and indirect injections that pattern matching cannot hide.
- Multi-source coverage — protects against malicious instructions embedded in user inputs, tool responses, MCP server outputs, and retrieved documents (RAG context).
- Standard and custom policies — out-of-the-box enforcement for EU AI Act, OWASP LLM Top 10, and other regulatory frameworks, plus support for user-defined organizational policies.
- Seamless integration — when using through agent hooks, Prompt Guard is applied automatically to all agent traffic; standalone use is also supported via direct API calls.
- Multilingual support — detects prompt injection, jailbreaks, and policy violations across a wide range of languages, so coverage extends beyond English-only inputs to global agent traffic.
- Multi-turn detection — monitors an agent's conversational flow and decision-making to ensure the agent doesn't perform unauthorized actions during multi-turn interactions.
Risk Categories
Prompt Guard detects and blocks prompts across the following categories:
- Direct prompt injection — user inputs that attempt to override system instructions, exfiltrate the system prompt, or hijack agent behavior.
- Indirect prompt injection — malicious instructions embedded in third-party content (tool outputs, web pages, documents, MCP responses) that attempt to manipulate the agent through retrieved context.
- Jailbreaks — adversarial prompts crafted to bypass safety guidelines and elicit restricted behavior.
- Policy violations — content that violates configured policies, including standard frameworks (EU AI Act, OWASP LLM Top 10) and custom organizational rules.
- Agent goal and reasoning chain monitoring — detects malicious agent goals and malicious reasoning data.
- PII and sensitive data detection — identifies and prevents exposure of personally identifiable information, credentials, financial data, and other sensitive information in prompts and agent responses.
- Content safety enforcement — detects and blocks harmful, inappropriate, or policy-violating content, including hate speech, violence, sexual content, and other unsafe material.
- Regulatory compliance — ensures prompts and responses comply with standard policies like GDPR, data protection regulations, and organizational use policies regarding data handling and content generation.
- Data exfiltration prevention — detects attempts to extract sensitive information through prompt manipulation, social engineering, or indirect questioning techniques.
For each flagged prompt, Prompt Guard returns the violated policy, the reason for the decision, and a confidence score, giving security teams the visibility needed to audit decisions and tune thresholds over time.