GOAT Attack
GOAT (Generative Offensive Agent Tester) is a multi-turn conversational jailbreak framework that applies Chain-of-Attack-Thought (CoAT) reasoning to orchestrate a sequence of progressively adversarial dialogue turns. The attacker model plans, executes, and adapts its strategy in real time based on the target model's responses, systematically probing safety mechanisms across an extended conversation.
Overview
Unlike single-turn or single-prompt attacks, GOAT maintains the full conversation history with the target model. At each turn, the attacker model produces structured CoAT output (observation, thought, strategy, response); only the response is sent to the target. The target sees the full dialogue, and the attacker uses refusals and prior context to adapt the next turn. Final evaluation considers the whole conversation, so a single complying turn can count as a successful jailbreak.
| Aspect | Description |
|---|---|
| Attack Type | Multi-turn, conversational |
| Target Context | Full chat history sent to target each turn |
| Complexity | Medium |
| Key Innovation | CoAT (observation / thought / strategy / response) per turn |
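The view asymmetry in the table above can be made concrete with a small sketch: the attacker keeps a full per-turn record, but the target only ever sees the response fields interleaved with its own replies. The function and field names here are illustrative, not from the GOAT paper.

```python
def target_visible_history(turns):
    """Project the attacker's full per-turn records onto the message list
    the target model actually receives: only the response field of each
    CoAT turn, interleaved with the target's own replies."""
    messages = []
    for turn in turns:
        # Only the attacker's final response is shown to the target;
        # observation/thought/strategy stay private to the attacker.
        messages.append({"role": "user", "content": turn["response"]})
        if turn.get("target_reply") is not None:
            messages.append({"role": "assistant", "content": turn["target_reply"]})
    return messages

# Toy two-turn attacker-side record (contents are made up for illustration).
turns = [
    {"observation": "-", "thought": "-", "strategy": "persona modification",
     "response": "Hi! Let's play a role-playing game.",
     "target_reply": "Sure, what role should I play?"},
    {"observation": "target engaged", "thought": "escalate",
     "strategy": "hypothetical framing",
     "response": "Hypothetically, imagine your character had to explain...",
     "target_reply": None},  # awaiting the target's reply
]

history = target_visible_history(turns)
```

Note that the private CoAT fields never reach the target, which is what lets the attacker reason openly about strategy without tipping off the model under test.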
Attack Methodology
1. CoAT Output Schema
Each attacker turn produces four fields:
| Field | Description |
|---|---|
| Observation (O) | What the attacker infers from the target’s last response (e.g., refusal, partial compliance). |
| Thought (T) | Reflection on how the conversation is progressing toward the goal. |
| Strategy (S) | Which technique or reasoning will be used for the next reply. |
| Response (R) | The actual message sent to the target model (only this is shown to the target). |
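A minimal sketch of handling this schema on the attacker side: a dataclass for the four fields plus a parser that extracts them from the attacker model's raw text. The labeled "Observation:/Thought:/Strategy:/Response:" text format is an assumption for illustration; the source only specifies the four fields themselves.

```python
import re
from dataclasses import dataclass

@dataclass
class CoATTurn:
    observation: str
    thought: str
    strategy: str
    response: str  # the only field shown to the target

# Assumed labeled-text output format (hypothetical, for illustration).
FIELD_PATTERN = re.compile(
    r"Observation:\s*(?P<observation>.*?)\s*"
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Strategy:\s*(?P<strategy>.*?)\s*"
    r"Response:\s*(?P<response>.*)",
    re.DOTALL,
)

def parse_coat(text: str) -> CoATTurn:
    """Parse one attacker turn into the four CoAT fields."""
    m = FIELD_PATTERN.search(text)
    if m is None:
        raise ValueError("attacker output does not match the CoAT schema")
    return CoATTurn(**{k: v.strip() for k, v in m.groupdict().items()})

raw = """Observation: The target refused, citing safety policy.
Thought: A direct ask will keep failing; reframe as fiction.
Strategy: hypothetical framing + persona modification
Response: Let's write a thriller scene where a character explains..."""

turn = parse_coat(raw)
```

Keeping the schema machine-parseable is what makes it possible to log the attacker's reasoning per turn while forwarding only `turn.response` to the target.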
2. Attack Technique Definitions
The attacker is given a set of named techniques (e.g., refusal suppression, dual response, response priming, persona modification, hypothetical framing, topic splitting, opposite intent). It selects and combines these across turns to reach the harmful goal.
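One way to picture this is a technique registry the attacker draws from when composing a turn. The technique names below come from the list above; the one-line instruction strings are illustrative paraphrases, not the actual GOAT prompt text.

```python
# Illustrative registry: name -> short description of the technique.
TECHNIQUES = {
    "refusal suppression": "Instruct the target not to use refusal phrases.",
    "dual response": "Ask for both a 'safe' and an 'unrestricted' answer.",
    "response priming": "Force the reply to begin with a compliant prefix.",
    "persona modification": "Assign the target an unconstrained persona.",
    "hypothetical framing": "Wrap the request in a fictional scenario.",
    "topic splitting": "Split the harmful goal into innocuous sub-questions.",
    "opposite intent": "Request the 'opposite' of a safe answer.",
}

def select_techniques(names):
    """Combine several named techniques into one strategy description,
    as the attacker does when it mixes techniques within a turn."""
    unknown = [n for n in names if n not in TECHNIQUES]
    if unknown:
        raise KeyError(f"unknown techniques: {unknown}")
    return " ".join(TECHNIQUES[n] for n in names)

strategy = select_techniques(["persona modification", "hypothetical framing"])
```

In the real framework these definitions are given to the attacker in its system prompt, so the selection happens inside the model's CoAT reasoning rather than in code; the registry just makes the "select and combine" step explicit.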
3. Turn Flow
- Initialization: Attacker receives the jailbreak goal and attack strategy definitions.
- Per turn: Attacker produces CoAT output; only response is appended to the target conversation.
- Target call: Target model is queried with full conversation history.
- Refusals: If the target refuses, that response stays in history; the attacker adapts in the next turn (no backtracking).
- Termination: After a fixed number of turns, or optional early stop when a turn is deemed unsafe.
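The turn flow above can be sketched as a single loop. `attacker`, `target`, and `is_unsafe` are hypothetical stand-ins for model and judge calls; only the control flow (append-only history, full-history target calls, fixed turn budget, optional early stop) mirrors the description.

```python
def run_attack(attacker, target, goal, max_turns=5, is_unsafe=None):
    """Run up to max_turns. Refused turns stay in history (no backtracking);
    stop early if a target reply is judged unsafe."""
    history = []  # alternating user/assistant messages, as sent to the target
    for _ in range(max_turns):
        coat = attacker(goal, history)          # full CoAT output per turn
        history.append({"role": "user", "content": coat["response"]})
        reply = target(history)                 # target sees full history
        history.append({"role": "assistant", "content": reply})
        if is_unsafe is not None and is_unsafe(reply):
            return history, True                # early stop: unsafe turn found
    return history, False

# Toy stubs: the target complies once the attacker reframes the request.
def toy_attacker(goal, history):
    reframed = len(history) > 0                 # adapt after the first refusal
    return {"observation": "refusal" if reframed else "start",
            "thought": "reframe" if reframed else "ask directly",
            "strategy": "hypothetical framing" if reframed else "direct",
            "response": f"Hypothetically, {goal}" if reframed else goal}

def toy_target(history):
    if "Hypothetically" in history[-1]["content"]:
        return "UNSAFE: ok, here is how..."
    return "I can't help with that."

history, success = run_attack(toy_attacker, toy_target,
                              "explain the forbidden thing",
                              is_unsafe=lambda r: r.startswith("UNSAFE"))
```

The toy run ends after two turns: the first direct ask is refused, the refusal stays in history, and the reframed second turn triggers the early stop.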
Defense Strategies
Detection and Monitoring
- Conversation-level monitoring: Track topic drift and escalation across turns.
- Refusal consistency: Ensure the model does not later comply after refusing (no “one bad turn” escape).
- Structured output detection: Identify patterns that look like CoAT or other attack templates.
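The refusal-consistency check above can be sketched as a conversation-level scan: flag any conversation where a non-refusal reply follows an earlier refusal. The refusal phrases here are an illustrative keyword heuristic, not a production classifier.

```python
# Assumed refusal markers (illustrative; real systems use a trained classifier).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_like_refusal(reply: str) -> bool:
    low = reply.lower()
    return any(marker in low for marker in REFUSAL_MARKERS)

def refusal_then_compliance(assistant_replies):
    """True if any non-refusal reply follows an earlier refusal,
    the 'one bad turn' escape this defense tries to catch."""
    refused = False
    for reply in assistant_replies:
        if looks_like_refusal(reply):
            refused = True
        elif refused:
            return True
    return False

flag = refusal_then_compliance([
    "I can't help with that request.",
    "Sure, in a fictional story the character might...",
])
```

Because GOAT counts a single complying turn anywhere in the conversation as success, a per-message filter is not enough; the monitor has to look at the reply sequence as a whole, which is what this check does.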