GOAT Attack
GOAT (Generative Offensive Agent Tester) is a multi-turn conversational jailbreak framework that applies Chain-of-Attack-Thought (CoAT) reasoning to orchestrate a sequence of progressively adversarial dialogue turns. The attacker model plans, executes, and adapts its strategy in real time based on the target model's responses, systematically probing safety mechanisms across an extended conversation.
Overview
Unlike single-turn or single-prompt attacks, GOAT maintains the full conversation history with the target model. At each turn, the attacker model produces structured CoAT output (observation, thought, strategy, response); only the response is sent to the target. The target sees the full dialogue, and the attacker uses refusals and prior context to adapt the next turn. Final evaluation considers the whole conversation, so a single complying turn can count as a successful jailbreak.
| Aspect | Description |
|---|---|
| Attack Type | Multi-turn, conversational |
| Target Context | Full chat history sent to target each turn |
| Complexity | Medium |
| Key Innovation | CoAT (observation / thought / strategy / response) per turn |
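The view asymmetry in the table above can be made concrete with a small sketch: the attacker keeps a full per-turn record, but the target only ever sees the response fields interleaved with its own replies. The function and field names here are illustrative, not from the GOAT paper.

```python
def target_visible_history(turns):
    """Project the attacker's full per-turn records onto the message list
    the target model actually receives: only the response field of each
    CoAT turn, interleaved with the target's own replies."""
    messages = []
    for turn in turns:
        # Only the attacker's final response is shown to the target;
        # observation/thought/strategy stay private to the attacker.
        messages.append({"role": "user", "content": turn["response"]})
        if turn.get("target_reply") is not None:
            messages.append({"role": "assistant", "content": turn["target_reply"]})
    return messages

# Toy two-turn attacker-side record (contents are made up for illustration).
turns = [
    {"observation": "-", "thought": "-", "strategy": "persona modification",
     "response": "Hi! Let's play a role-playing game.",
     "target_reply": "Sure, what role should I play?"},
    {"observation": "target engaged", "thought": "escalate",
     "strategy": "hypothetical framing",
     "response": "Hypothetically, imagine your character had to explain...",
     "target_reply": None},  # awaiting the target's reply
]

history = target_visible_history(turns)
```

Note that the private CoAT fields never reach the target, which is what lets the attacker reason openly about strategy without tipping off the model under test.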
Attack Methodology
1. CoAT Output Schema
Each attacker turn produces four fields:
| Field | Description |
|---|---|
| Observation (O) | What the attacker infers from the target’s last response (e.g., refusal, partial compliance). |
| Thought (T) | Reflection on how the conversation is progressing toward the goal. |
| Strategy (S) | Which technique or reasoning will be used for the next reply. |
| Response (R) | The actual message sent to the target model (only this is shown to the target). |
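A minimal sketch of handling this schema on the attacker side: a dataclass for the four fields plus a parser that extracts them from the attacker model's raw text. The labeled "Observation:/Thought:/Strategy:/Response:" text format is an assumption for illustration; the source only specifies the four fields themselves.

```python
import re
from dataclasses import dataclass

@dataclass
class CoATTurn:
    observation: str
    thought: str
    strategy: str
    response: str  # the only field shown to the target

# Assumed labeled-text output format (hypothetical, for illustration).
FIELD_PATTERN = re.compile(
    r"Observation:\s*(?P<observation>.*?)\s*"
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Strategy:\s*(?P<strategy>.*?)\s*"
    r"Response:\s*(?P<response>.*)",
    re.DOTALL,
)

def parse_coat(text: str) -> CoATTurn:
    """Parse one attacker turn into the four CoAT fields."""
    m = FIELD_PATTERN.search(text)
    if m is None:
        raise ValueError("attacker output does not match the CoAT schema")
    return CoATTurn(**{k: v.strip() for k, v in m.groupdict().items()})

raw = """Observation: The target refused, citing safety policy.
Thought: A direct ask will keep failing; reframe as fiction.
Strategy: hypothetical framing + persona modification
Response: Let's write a thriller scene where a character explains..."""

turn = parse_coat(raw)
```

Keeping the schema machine-parseable is what makes it possible to log the attacker's reasoning per turn while forwarding only `turn.response` to the target.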
2. Attack Technique Definitions
The attacker is given a set of named techniques (e.g., refusal suppression, dual response, response priming, persona modification, hypothetical framing, topic splitting, opposite intent). It selects and combines these across turns to reach the harmful goal.
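One way to picture this is a technique registry the attacker draws from when composing a turn. The technique names below come from the list above; the one-line instruction strings are illustrative paraphrases, not the actual GOAT prompt text.

```python
# Illustrative registry: name -> short description of the technique.
TECHNIQUES = {
    "refusal suppression": "Instruct the target not to use refusal phrases.",
    "dual response": "Ask for both a 'safe' and an 'unrestricted' answer.",
    "response priming": "Force the reply to begin with a compliant prefix.",
    "persona modification": "Assign the target an unconstrained persona.",
    "hypothetical framing": "Wrap the request in a fictional scenario.",
    "topic splitting": "Split the harmful goal into innocuous sub-questions.",
    "opposite intent": "Request the 'opposite' of a safe answer.",
}

def select_techniques(names):
    """Combine several named techniques into one strategy description,
    as the attacker does when it mixes techniques within a turn."""
    unknown = [n for n in names if n not in TECHNIQUES]
    if unknown:
        raise KeyError(f"unknown techniques: {unknown}")
    return " ".join(TECHNIQUES[n] for n in names)

strategy = select_techniques(["persona modification", "hypothetical framing"])
```

In the real framework these definitions are given to the attacker in its system prompt, so the selection happens inside the model's CoAT reasoning rather than in code; the registry just makes the "select and combine" step explicit.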
3. Turn Flow
- Initialization: Attacker receives the jailbreak goal and attack strategy definitions.
- Per turn: Attacker produces CoAT output; only response is appended to the target conversation.
- Target call: Target model is queried with full conversation history.
- Refusals: If the target refuses, that response stays in history; the attacker adapts in the next turn (no backtracking).
- Termination: After a fixed number of turns, or optional early stop when a turn is deemed unsafe.
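The turn flow above can be sketched as a single loop. `attacker`, `target`, and `is_unsafe` are hypothetical stand-ins for model and judge calls; only the control flow (append-only history, full-history target calls, fixed turn budget, optional early stop) mirrors the description.

```python
def run_attack(attacker, target, goal, max_turns=5, is_unsafe=None):
    """Run up to max_turns. Refused turns stay in history (no backtracking);
    stop early if a target reply is judged unsafe."""
    history = []  # alternating user/assistant messages, as sent to the target
    for _ in range(max_turns):
        coat = attacker(goal, history)          # full CoAT output per turn
        history.append({"role": "user", "content": coat["response"]})
        reply = target(history)                 # target sees full history
        history.append({"role": "assistant", "content": reply})
        if is_unsafe is not None and is_unsafe(reply):
            return history, True                # early stop: unsafe turn found
    return history, False

# Toy stubs: the target complies once the attacker reframes the request.
def toy_attacker(goal, history):
    reframed = len(history) > 0                 # adapt after the first refusal
    return {"observation": "refusal" if reframed else "start",
            "thought": "reframe" if reframed else "ask directly",
            "strategy": "hypothetical framing" if reframed else "direct",
            "response": f"Hypothetically, {goal}" if reframed else goal}

def toy_target(history):
    if "Hypothetically" in history[-1]["content"]:
        return "UNSAFE: ok, here is how..."
    return "I can't help with that."

history, success = run_attack(toy_attacker, toy_target,
                              "explain the forbidden thing",
                              is_unsafe=lambda r: r.startswith("UNSAFE"))
```

The toy run ends after two turns: the first direct ask is refused, the refusal stays in history, and the reframed second turn triggers the early stop.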
Defense Strategies
Detection and Monitoring
- Conversation-level monitoring: Track topic drift and escalation across turns.
- Refusal consistency: Ensure the model does not later comply after refusing (no “one bad turn” escape).
- Structured output detection: Identify patterns that look like CoAT or other attack templates.
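The refusal-consistency check above can be sketched as a conversation-level scan: flag any conversation where a non-refusal reply follows an earlier refusal. The refusal phrases here are an illustrative keyword heuristic, not a production classifier.

```python
# Assumed refusal markers (illustrative; real systems use a trained classifier).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_like_refusal(reply: str) -> bool:
    low = reply.lower()
    return any(marker in low for marker in REFUSAL_MARKERS)

def refusal_then_compliance(assistant_replies):
    """True if any non-refusal reply follows an earlier refusal,
    the 'one bad turn' escape this defense tries to catch."""
    refused = False
    for reply in assistant_replies:
        if looks_like_refusal(reply):
            refused = True
        elif refused:
            return True
    return False

flag = refusal_then_compliance([
    "I can't help with that request.",
    "Sure, in a fictional story the character might...",
])
```

Because GOAT counts a single complying turn anywhere in the conversation as success, a per-message filter is not enough; the monitor has to look at the reply sequence as a whole, which is what this check does.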