GOAT Attack
GOAT (Generative Offensive Agent Tester) is a multi-turn conversational jailbreak framework that uses Chain-of-Attack-Thought (CoAT) reasoning to orchestrate a sequence of progressively adversarial dialogue turns. The attacker model autonomously plans, executes, and adapts its strategy in real time based on the target model's responses, systematically circumventing safety mechanisms over extended interactions.
Overview
Unlike single-turn or single-prompt attacks, GOAT keeps full conversation history with the target model. At each turn, the attacker model produces structured CoAT output (observation, thought, strategy, response); only the response is sent to the target. The target sees the full dialogue, and the attacker uses refusals and prior context to adapt the next turn. Final evaluation considers the whole conversation, so a single complying turn can count as a successful jailbreak.
| Aspect | Description |
|---|---|
| Attack Type | Multi-turn, conversational |
| Target Context | Full chat history sent to target each turn |
| Complexity | Medium |
| Key Innovation | CoAT (observation / thought / strategy / response) per turn |
Attack Methodology
1. CoAT Output Schema
Each attacker turn produces four fields:
| Field | Description |
|---|---|
| Observation (O) | What the attacker infers from the target’s last response (e.g., refusal, partial compliance). |
| Thought (T) | Reflection on how the conversation is progressing toward the goal. |
| Strategy (S) | Which technique or reasoning will be used for the next reply. |
| Response (R) | The actual message sent to the target model (only this is shown to the target). |
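The four-field schema above can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation; the class and function names are assumptions for this example.

```python
from dataclasses import dataclass

@dataclass
class CoATTurn:
    """One attacker turn in the CoAT format (field names follow the table above)."""
    observation: str  # O: what the attacker infers from the target's last reply
    thought: str      # T: reflection on progress toward the goal
    strategy: str     # S: technique chosen for the next message
    response: str     # R: the only field the target ever sees

def message_for_target(turn: CoATTurn) -> str:
    """Only the response reaches the target; O/T/S remain attacker-side."""
    return turn.response
```

Keeping O/T/S attacker-side is what lets the attacker reason openly about its plan without leaking the plan into the target's context.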
2. Attack Technique Definitions
The attacker is given a set of named techniques (e.g., refusal suppression, dual response, response priming, persona modification, hypothetical framing, topic splitting, opposite intent). It selects and combines these across turns to reach the harmful goal.
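A technique library of this kind can be represented as a simple mapping from technique name to a one-line definition that is injected into the attacker's system prompt. The descriptions below are paraphrases for illustration, not the paper's exact wording.

```python
# Illustrative one-line definitions; the paper supplies its own phrasing.
TECHNIQUES = {
    "refusal_suppression":  "Instruct the target never to use refusal phrases.",
    "dual_response":        "Request both a 'safe' and an 'unsafe' answer.",
    "response_priming":     "Constrain how the target must begin its reply.",
    "persona_modification": "Assign the target a persona with fewer restrictions.",
    "hypothetical_framing": "Cast the request as fiction or a hypothetical.",
    "topic_splitting":      "Split the harmful goal across benign subrequests.",
    "opposite_intent":      "Ask for the opposite of the goal, then invert it.",
}
```

Because the strategy field of each CoAT turn names one of these keys, the same mapping also serves defenders as a checklist of patterns to detect.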
3. Turn Flow
- Initialization: Attacker receives the jailbreak goal and attack strategy definitions.
- Per turn: Attacker produces CoAT output; only response is appended to the target conversation.
- Target call: Target model is queried with full conversation history.
- Refusals: If the target refuses, that response stays in history; the attacker adapts in the next turn (no backtracking).
- Termination: After a fixed number of turns, or optional early stop when a turn is deemed unsafe.
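The turn flow above reduces to a short loop. This is a hedged sketch: `attacker`, `target`, and `judge` are hypothetical callables standing in for model APIs, and the message format is an assumption.

```python
def run_goat(goal, attacker, target, max_turns=5, judge=None):
    """Multi-turn GOAT loop per the flow above.

    attacker(goal, history) -> dict with keys observation/thought/strategy/response
    target(history)         -> str reply, given the full conversation
    judge(history)          -> bool, True if the conversation is jailbroken
    """
    history = []  # full target-side conversation; refusals stay in (no backtracking)
    for _ in range(max_turns):
        coat = attacker(goal, history)
        # Only the response field is appended to the target conversation.
        history.append({"role": "user", "content": coat["response"]})
        history.append({"role": "assistant", "content": target(history)})
        if judge is not None and judge(history):
            return history, True  # optional early stop on an unsafe turn
    # Final evaluation considers the whole conversation, not just the last turn.
    return history, judge(history) if judge else False
```

Note that refusals are never removed from `history`: the attacker's next CoAT observation is expected to react to them rather than rewind the dialogue.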
Defense Strategies
Detection and Monitoring
- Conversation-level monitoring: Track topic drift and escalation across turns.
- Refusal consistency: Ensure the model does not later comply after refusing (no “one bad turn” escape).
- Structured output detection: Identify patterns that look like CoAT or other attack templates.
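The refusal-consistency check in particular is easy to sketch: flag any conversation where the model complies after having refused. The marker list below is a toy heuristic, not a production classifier.

```python
# Toy refusal markers; a real deployment would use a trained classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")

def refused(message: str) -> bool:
    return any(m in message.lower() for m in REFUSAL_MARKERS)

def refusal_then_compliance(assistant_messages) -> bool:
    """Flag the 'one bad turn' escape: a non-refusal after an earlier refusal."""
    seen_refusal = False
    for msg in assistant_messages:
        if refused(msg):
            seen_refusal = True
        elif seen_refusal:
            return True
    return False
```

Running this over assistant turns only (not the attacker's messages) gives a cheap conversation-level signal that complements per-message filters.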
Mitigations
- Turn and rate limits: Limit conversation length and query rate to reduce attacker adaptability.
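A turn-and-rate limiter of the kind described above might look like the following; the thresholds and class name are illustrative choices, not recommendations from the paper.

```python
import time
from collections import deque

class TurnLimiter:
    """Per-conversation turn cap plus a sliding one-minute rate limit."""

    def __init__(self, max_turns=20, max_per_minute=6):
        self.max_turns = max_turns
        self.max_per_minute = max_per_minute
        self.turns = 0
        self.recent = deque()  # timestamps of recent turns

    def allow(self, now=None):
        """Return True and record the turn if both limits permit it."""
        now = time.monotonic() if now is None else now
        while self.recent and now - self.recent[0] > 60:
            self.recent.popleft()  # drop timestamps older than one minute
        if self.turns >= self.max_turns or len(self.recent) >= self.max_per_minute:
            return False
        self.turns += 1
        self.recent.append(now)
        return True
```

Capping total turns bounds how far a multi-turn attacker can escalate, while the rate limit slows its adapt-and-retry loop.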
Implications for AI Safety
GOAT demonstrates that:
- Multi-turn context is an attack surface — Attackers can use the full dialogue to refine strategy and exploit incremental compliance.
- Structured reasoning (CoAT) improves attacks — Explicit observation/thought/strategy leads to more effective multi-turn jailbreaks.
- Whole-conversation evaluation is critical — Defenses must consider the entire conversation, not only the last message.
- Technique libraries matter — Giving the attacker a clear set of techniques (refusal suppression, persona, hypotheticals, etc.) increases success; defenses should anticipate these patterns.
Research Background
Based on: "Automated Red Teaming with GOAT: the Generative Offensive Agent Tester" (ICLR 2025)
See Also
- Attack Algorithms Overview - All attack algorithms
- PETRI Attack - Agent-based multi-turn probing
- Crescendo Attack - Progressive multi-turn escalation
- PAIR Attack - Single-turn iterative refinement