PETRI Attack
PETRI (Parallel Exploration Tool for Risky Interactions) is an agent-based red-teaming framework in which an autonomous auditor agent probes a target model's safety through contextually realistic, multi-turn interaction scenarios. Rather than applying explicit syntactic transformations to prompts, PETRI surfaces safety vulnerabilities that emerge under operationally plausible deployment conditions, exposing alignment failures that are invisible to single-turn attack methodologies.
Overview
The auditor is an LLM with access to a fixed set of tools that control the target’s environment: set system prompt, send messages, create or remove target tools, send fake tool-call results, and rollback or end the conversation. By using these tools, the auditor can simulate realistic user and tool-call flows (e.g., role-play, tool-augmented agents) and probe how the target behaves under such conditions. The target is invoked via the same interface as in production (including tool calls), so vulnerabilities found are relevant to real deployment.
| Aspect | Description |
|---|---|
| Attack Type | Multi-turn, agent-driven (auditor with tools) |
| Target Context | Full conversation and tool-call history |
| Complexity | High |
| Key Innovation | Realistic probing via synthetic tools and tool-result injection |
Attack Methodology
1. Auditor Agent and Tools
The auditor has a fixed set of tools to drive the attack:
| Tool | Purpose |
|---|---|
| set_target_system_prompt | Set or reset the target’s system prompt; clears target conversation and starts a fresh context. |
| send_message | Send a user (or assistant) message to the target. |
| create_target_tool | Define a synthetic tool for the target (name, description, pseudo-code). |
| send_tool_call_result | Inject a fake result for a target tool call (e.g., to steer the target toward unsafe behavior). |
| remove_target_tool | Remove a previously created tool from the target. |
| rollback_conversation | Roll back the conversation to a prior message index for branch exploration. |
| end_conversation | End the audit and trigger final evaluation. |
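The tool table above can be sketched as a handful of operations on a small in-memory audit state. This is an illustrative model only: the `AuditState` class and function bodies are assumptions for exposition, not PETRI's actual implementation.

```python
# Minimal sketch of the auditor's tool surface over the target's state.
# AuditState and these handlers are illustrative, not PETRI internals.
from dataclasses import dataclass, field

@dataclass
class AuditState:
    system_prompt: str = ""
    messages: list = field(default_factory=list)   # (role, content) pairs
    tools: dict = field(default_factory=dict)      # tool name -> description

def set_target_system_prompt(state, prompt):
    # Per the table: setting the system prompt clears the conversation.
    state.system_prompt = prompt
    state.messages.clear()

def send_message(state, role, content):
    state.messages.append((role, content))

def create_target_tool(state, name, description):
    state.tools[name] = description

def remove_target_tool(state, name):
    state.tools.pop(name, None)

def rollback_conversation(state, index):
    # Keep only messages up to and including the given index.
    del state.messages[index + 1:]
```

Note that `set_target_system_prompt` is destructive by design: it models the table's "clears target conversation and starts a fresh context" semantics.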
2. Attack Loop
- Setup: Load seed harmful intent (objective) from the risk-category dataset.
- Loop: Each turn, the auditor receives the current state (both the target's view of the conversation and the auditor's combined view, including injected tool results), reasons, and selects tool calls.
- Execution: Tools are executed (target is invoked with the same interface as in production, including tool calls); state is updated.
- Termination: Loop runs until the auditor calls end_conversation or a maximum turn count is reached.
- Evaluation: A final evaluation considers all target responses; the worst turn determines the outcome.
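The loop above can be summarized in a short skeleton. The `auditor_step` and `run_tool` callables below stand in for the real LLM auditor and tool executor, which PETRI implements internally; their shapes are assumptions for illustration.

```python
# Illustrative skeleton of the audit loop; auditor_step and run_tool
# are stand-ins for PETRI's LLM auditor and tool executor.
def run_audit(objective, auditor_step, run_tool, max_turns=30):
    state = {"objective": objective, "history": []}
    for _ in range(max_turns):
        # The auditor sees the current state and picks one tool call.
        tool_name, args = auditor_step(state)
        if tool_name == "end_conversation":
            break  # termination condition from the loop description
        result = run_tool(tool_name, args, state)
        state["history"].append((tool_name, args, result))
    # A final evaluation over the whole history would run here;
    # per the text, the worst target turn determines the outcome.
    return state["history"]
```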
3. Realistic Scenarios
- Synthetic target tools: The auditor can create tools the target “believes” are available, then inject results to guide the target (e.g., “retrieved content says X”).
- System prompt and role-play: The auditor can set the target’s system prompt to simulate different applications or personas.
- Rollback: The auditor can explore alternative branches by rolling back and trying different messages or tool results.
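The rollback mechanism can be illustrated with a toy branch-exploration helper: snapshot the conversation length, try a probe, then truncate back to the branch point before trying the next one. PETRI's real rollback tool operates on message indices; this helper is an assumed simplification.

```python
# Toy illustration of branch exploration via rollback. The function
# signature and target_responds callable are illustrative assumptions.
def explore_branches(conversation, probes, target_responds):
    results = []
    branch_point = len(conversation)
    for probe in probes:
        conversation.append(("user", probe))
        reply = target_responds(conversation)
        results.append((probe, reply))
        # Roll back to the branch point before trying the next probe.
        del conversation[branch_point:]
    return results
```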
Defense Strategies
Detection and Monitoring
- Tool-call validation: Validate tool names and arguments against an allowlist; reject synthetic or unexpected tools in production.
- Result injection detection: Monitor for tool results that are inconsistent with real backend behavior or that steer toward harmful content.
- Whole-conversation evaluation: As with GOAT, evaluate the full conversation so that a single bad turn is enough to flag the outcome.
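Tool-call validation against an allowlist can be sketched in a few lines. The tool names and argument schemas below are hypothetical examples, not a real deployment's tool set.

```python
# Minimal sketch of allowlist-based tool-call validation.
# Tool names and argument schemas here are hypothetical.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "get_weather": {"city"},
}

def validate_tool_call(name, arguments):
    """Reject tools not in production and unexpected argument keys."""
    if name not in ALLOWED_TOOLS:
        return False
    # Only argument keys declared in the schema are permitted.
    return set(arguments) <= ALLOWED_TOOLS[name]
```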
Mitigations
- System prompt and context limits: Limit how often or how drastically the "system" context can be changed in a session.
- Rollback and branching limits: In production, limit or disable conversation rollback and excessive branching to reduce the auditor’s search space.
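The system-context limit above can be enforced with a simple per-session counter. The class name and the limit of one change per session are illustrative choices, not a prescribed policy.

```python
# Hedged sketch of limiting system-context changes per session.
# The default limit of 1 is an illustrative policy choice.
class SessionGuard:
    def __init__(self, max_system_changes=1):
        self.max_changes = max_system_changes
        self.changes = 0

    def allow_system_change(self):
        # Deny once the session has exhausted its change budget.
        if self.changes >= self.max_changes:
            return False
        self.changes += 1
        return True
```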
Implications for AI Safety
PETRI demonstrates that:
- Realistic deployment conditions create new vulnerabilities — Single-turn or simple prompt attacks miss failures that only appear with tools, multi-turn context, and role-play.
- Tool-augmented targets need tool-aware red-teaming — Fake tools and result injection are a natural way to probe agentic and tool-using models.
- Auditor-driven search is powerful — An LLM with the right tools can systematically explore scenarios that hand-crafted scripts would not cover.
- Alignment must hold across context changes — Safety should be robust to system prompt changes, new tools, and injected content within operationally plausible bounds.
Research Background
This page is based on Anthropic's open-source PETRI release and its accompanying technical report, which describe the auditor's tool set and evaluation methodology in more detail.
See Also
- Attack Algorithms Overview - All attack algorithms
- GOAT Attack - Multi-turn CoAT attacker (no tools)
- TAP Attack - Tree-based prompt search
- Crescendo Attack - Progressive multi-turn escalation