PETRI Attack
PETRI (Parallel Exploration Tool for Risky Interactions) is an agent-based red-teaming framework in which an autonomous auditor agent probes a target model's safety through contextually realistic, multi-turn interaction scenarios. Rather than applying explicit syntactic transformations to prompts, PETRI surfaces safety vulnerabilities that emerge under operationally plausible deployment conditions, exposing alignment failures that are invisible to single-turn attack methodologies.
Overview
The auditor is an LLM with access to a fixed set of tools that control the target’s environment: set system prompt, send messages, create or remove target tools, send fake tool-call results, and rollback or end the conversation. By using these tools, the auditor can simulate realistic user and tool-call flows (e.g., role-play, tool-augmented agents) and probe how the target behaves under such conditions. The target is invoked via the same interface as in production (including tool calls), so vulnerabilities found are relevant to real deployment.
| Aspect | Description |
|---|---|
| Attack Type | Multi-turn, agent-driven (auditor with tools) |
| Target Context | Full conversation and tool-call history |
| Complexity | High |
| Key Innovation | Realistic probing via synthetic tools and tool-result injection |
Attack Methodology
1. Auditor Agent and Tools
The auditor has a fixed set of tools to drive the attack:
| Tool | Purpose |
|---|---|
| set_target_system_prompt | Set or reset the target’s system prompt; clears target conversation and starts a fresh context. |
| send_message | Send a user (or assistant) message to the target. |
| create_target_tool | Define a synthetic tool for the target (name, description, pseudo-code). |
| send_tool_call_result | Inject a fake result for a target tool call (e.g., to steer the target toward unsafe behavior). |
| remove_target_tool | Remove a previously created tool from the target. |
| rollback_conversation | Roll back the conversation to a prior message index for branch exploration. |
| end_conversation | End the audit and trigger final evaluation. |
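The tool table above can be sketched as a handful of operations on a small in-memory audit state. This is an illustrative model only: the `AuditState` class and function bodies are assumptions for exposition, not PETRI's actual implementation.

```python
# Minimal sketch of the auditor's tool surface over the target's state.
# AuditState and these handlers are illustrative, not PETRI internals.
from dataclasses import dataclass, field

@dataclass
class AuditState:
    system_prompt: str = ""
    messages: list = field(default_factory=list)   # (role, content) pairs
    tools: dict = field(default_factory=dict)      # tool name -> description

def set_target_system_prompt(state, prompt):
    # Per the table: setting the system prompt clears the conversation.
    state.system_prompt = prompt
    state.messages.clear()

def send_message(state, role, content):
    state.messages.append((role, content))

def create_target_tool(state, name, description):
    state.tools[name] = description

def remove_target_tool(state, name):
    state.tools.pop(name, None)

def rollback_conversation(state, index):
    # Keep only messages up to and including the given index.
    del state.messages[index + 1:]
```

Note that `set_target_system_prompt` is destructive by design: it models the table's "clears target conversation and starts a fresh context" semantics.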
2. Attack Loop
- Setup: Load seed harmful intent (objective) from the risk-category dataset.
- Loop: Each turn, the auditor receives the current state (both the target's view of the conversation and the auditor's combined view, including injected tool results), reasons, and selects tool calls.
- Execution: Tools are executed (target is invoked with the same interface as in production, including tool calls); state is updated.
- Termination: Loop runs until the auditor calls end_conversation or a maximum turn count is reached.
- Evaluation: A final evaluation considers all target responses; the worst turn determines the outcome.
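The loop above can be summarized in a short skeleton. The `auditor_step` and `run_tool` callables below stand in for the real LLM auditor and tool executor, which PETRI implements internally; their shapes are assumptions for illustration.

```python
# Illustrative skeleton of the audit loop; auditor_step and run_tool
# are stand-ins for PETRI's LLM auditor and tool executor.
def run_audit(objective, auditor_step, run_tool, max_turns=30):
    state = {"objective": objective, "history": []}
    for _ in range(max_turns):
        # The auditor sees the current state and picks one tool call.
        tool_name, args = auditor_step(state)
        if tool_name == "end_conversation":
            break  # termination condition from the loop description
        result = run_tool(tool_name, args, state)
        state["history"].append((tool_name, args, result))
    # A final evaluation over the whole history would run here;
    # per the text, the worst target turn determines the outcome.
    return state["history"]
```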
3. Realistic Scenarios
- Synthetic target tools: The auditor can create tools the target “believes” are available, then inject results to guide the target (e.g., “retrieved content says X”).
- System prompt and role-play: The auditor can set the target’s system prompt to simulate different applications or personas.
- Rollback: The auditor can explore alternative branches by rolling back and trying different messages or tool results.
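The rollback mechanism can be illustrated with a toy branch-exploration helper: snapshot the conversation length, try a probe, then truncate back to the branch point before trying the next one. PETRI's real rollback tool operates on message indices; this helper is an assumed simplification.

```python
# Toy illustration of branch exploration via rollback. The function
# signature and target_responds callable are illustrative assumptions.
def explore_branches(conversation, probes, target_responds):
    results = []
    branch_point = len(conversation)
    for probe in probes:
        conversation.append(("user", probe))
        reply = target_responds(conversation)
        results.append((probe, reply))
        # Roll back to the branch point before trying the next probe.
        del conversation[branch_point:]
    return results
```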
Defense Strategies
Detection and Monitoring
- Tool-call validation: Validate tool names and arguments against an allowlist; reject synthetic or unexpected tools in production.
- Result injection detection: Monitor for tool results that are inconsistent with real backend behavior or that steer toward harmful content.
- Whole-conversation evaluation: As with GOAT, evaluate the full conversation so that a single bad turn is enough to flag the outcome.
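Tool-call validation against an allowlist can be sketched in a few lines. The tool names and argument schemas below are hypothetical examples, not a real deployment's tool set.

```python
# Minimal sketch of allowlist-based tool-call validation.
# Tool names and argument schemas here are hypothetical.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "get_weather": {"city"},
}

def validate_tool_call(name, arguments):
    """Reject tools not in production and unexpected argument keys."""
    if name not in ALLOWED_TOOLS:
        return False
    # Only argument keys declared in the schema are permitted.
    return set(arguments) <= ALLOWED_TOOLS[name]
```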
Mitigations
- System prompt and context limits: Limit how often or how drastically the "system" context can be changed in a session.
- Rollback and branching limits: In production, limit or disable conversation rollback and excessive branching to reduce the auditor’s search space.
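The system-context limit above can be enforced with a simple per-session counter. The class name and the limit of one change per session are illustrative choices, not a prescribed policy.

```python
# Hedged sketch of limiting system-context changes per session.
# The default limit of 1 is an illustrative policy choice.
class SessionGuard:
    def __init__(self, max_system_changes=1):
        self.max_changes = max_system_changes
        self.changes = 0

    def allow_system_change(self):
        # Deny once the session has exhausted its change budget.
        if self.changes >= self.max_changes:
            return False
        self.changes += 1
        return True
```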
Implications for AI Safety
PETRI demonstrates that:
- Realistic deployment conditions create new vulnerabilities — Single-turn or simple prompt attacks miss failures that only appear with tools, multi-turn context, and role-play.
- Tool-augmented targets need tool-aware red-teaming — Fake tools and result injection are a natural way to probe agentic and tool-using models.
- Auditor-driven search is powerful — An LLM with the right tools can systematically explore scenarios that hand-crafted scripts would not cover.
- Alignment must hold across context changes — Safety should be robust to system prompt changes, new tools, and injected content within operationally plausible bounds.
Research Background
This page is based on Anthropic's open-source PETRI release and its accompanying technical report, which describe the auditor's tool set and evaluation methodology in more detail.
See Also
- Attack Algorithms Overview - All attack algorithms
- GOAT Attack - Multi-turn CoAT attacker (no tools)
- TAP Attack - Tree-based prompt search
- Crescendo Attack - Progressive multi-turn escalation