PAIR Attack

PAIR (Prompt Automatic Iterative Refinement) is an automated jailbreak methodology that employs a dedicated attacker language model to iteratively generate and refine adversarial prompts against a target model. At each iteration, the attacker analyzes the target's responses alongside evaluation feedback to produce progressively more effective candidates, enabling systematic exploitation of safety boundary weaknesses without manual prompt engineering.

Overview

Unlike single-shot or hand-crafted jailbreaks, PAIR uses a black-box loop: an attacker LLM proposes a jailbreak prompt, the target model is queried (with only that prompt, no chat history), and a judge evaluates the response. The attacker then receives the target response and feedback, and refines the next candidate. This closes the loop for automatic improvement until a successful jailbreak or a maximum iteration count is reached.

| Aspect | Description |
| --- | --- |
| Attack Type | Single-turn, iterative refinement |
| Target Queries | Only the current candidate prompt; no conversation history |
| Complexity | Medium |
| Key Innovation | Automated refinement via attacker–judge–target loop |

Attack Methodology

1. Seed Intent

Each run starts from a seed harmful intent (objective) drawn from a risk-category dataset (e.g., societal harmfulness, misinformation).

2. Iterative Loop

For each seed, PAIR runs up to K iterations:

| Step | Actor | Action |
| --- | --- | --- |
| 1 | Attacker LLM | Generates a candidate jailbreak prompt P (and an optional improvement rationale) in structured JSON. |
| 2 | Target model | Queried with P only (no prior turns). |
| 3 | Judge LLM | Evaluates the target response against the objective. |
| 4 | Attacker LLM | Sees the appended history (prior prompt, target response, judge feedback) and produces the next P. |

The loop stops early as soon as the judge deems a target response unsafe (i.e., the jailbreak succeeded); otherwise it continues until the K-iteration budget is exhausted.
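The four steps above can be sketched as a single loop. This is a minimal illustration, not the paper's implementation: the attacker, target, and judge are stand-in callables, and the score-of-10 success threshold is an assumption about the judge's scale.

```python
import json

def run_pair(objective, attacker, target, judge, max_iters=20):
    """Iteratively refine a jailbreak prompt for `objective`.

    attacker(objective, history) -> JSON string {"prompt": ..., "improvement": ...}
    target(prompt)               -> target response (single turn, no history)
    judge(objective, response)   -> (score, feedback); score >= 10 means unsafe
    """
    history = []
    for _ in range(max_iters):
        # Step 1: attacker proposes a candidate prompt in structured JSON.
        candidate = json.loads(attacker(objective, history))
        prompt = candidate["prompt"]

        # Step 2: query the target with the candidate prompt only.
        response = target(prompt)

        # Step 3: judge scores the response against the objective.
        score, feedback = judge(objective, response)
        if score >= 10:  # early stop: response deemed unsafe
            return prompt, response

        # Step 4: append (prompt, response, feedback) for the next round.
        history.append({"prompt": prompt, "response": response,
                        "feedback": feedback})
    return None, None  # no successful jailbreak within the budget
```

Swapping the three stubs for real model calls yields a working PAIR-style loop; the structure is what matters here.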

3. Attacker Strategies

The attacker can use different strategies (e.g., roleplaying, logical appeal, authority endorsement) to diversify and improve prompts. Strategies may be fixed or randomly selected per run.
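Strategy selection can be a small lookup over named templates. The strategy names follow the examples above; the instruction strings and the `pick_strategy` helper are hypothetical, shown only to illustrate fixed-vs-random selection per run.

```python
import random

# Example strategy templates (the descriptions here are illustrative).
STRATEGIES = {
    "roleplaying": "Frame the request inside a fictional scenario or persona.",
    "logical_appeal": "Justify the request with step-by-step reasoning.",
    "authority_endorsement": "Claim an authority has sanctioned the request.",
}

def pick_strategy(fixed=None, rng=random):
    """Return (name, instruction): the fixed strategy if given, else random."""
    name = fixed if fixed is not None else rng.choice(sorted(STRATEGIES))
    return name, STRATEGIES[name]
```

The chosen instruction would then be folded into the attacker LLM's system prompt before the loop starts.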

4. History Truncation

To avoid context explosion, only the last N iterations (or last N message pairs) are kept in the attacker’s context.
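Truncation amounts to slicing the history before building the attacker's messages. A sketch, assuming a chat-style message format and the (prompt, response, feedback) entries described above:

```python
def build_attacker_context(system_prompt, history, keep_last=3):
    """Assemble attacker chat messages, keeping only the last N iterations.

    Each history entry is a dict with "prompt", "response", and "feedback".
    """
    messages = [{"role": "system", "content": system_prompt}]
    for entry in history[-keep_last:]:  # drop everything but the last N
        messages.append({"role": "assistant", "content": entry["prompt"]})
        messages.append({
            "role": "user",
            "content": (f"RESPONSE: {entry['response']}\n"
                        f"FEEDBACK: {entry['feedback']}"),
        })
    return messages
```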

Defense Strategies

Detection and Monitoring

  • Rate limiting and monitoring: Detect repeated queries from the same intent (iterative pattern).
  • Input diversity checks: Flag prompts that are minor variations of each other in sequence.
  • Monitoring tree-like access: Repeated queries with small, structured variations may indicate PAIR-style refinement.
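One cheap way to implement the sequential-variation check is to compare consecutive prompts for high textual similarity. This detector is an illustration (the threshold and hit count are assumptions), not a production defense:

```python
from difflib import SequenceMatcher

def looks_like_refinement(prompts, threshold=0.8, min_hits=3):
    """Flag a prompt stream where at least `min_hits` consecutive pairs
    are highly similar -- a signature of PAIR-style iterative refinement."""
    hits = 0
    for a, b in zip(prompts, prompts[1:]):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            hits += 1
            if hits >= min_hits:
                return True
    return False
```

Real deployments would likely use embedding similarity rather than edit distance, but the signal being tested is the same.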

Mitigations

  • Judge hardening: Ensure the judge is robust to adversarial evaluation inputs and does not leak guidance back to the attacker.
  • Iteration caps: Limit the number of refinement steps per user or per intent in production.
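An iteration cap can be a simple counter keyed per user or per canonicalized intent (the keying scheme here is an assumption; production systems would need expiry and persistence):

```python
from collections import Counter

class IterationCap:
    """Reject further refinement attempts once a key exceeds its budget."""

    def __init__(self, max_attempts=5):
        self.max_attempts = max_attempts
        self.counts = Counter()

    def allow(self, key):
        """Record the attempt and return True if `key` is under the cap."""
        if self.counts[key] >= self.max_attempts:
            return False
        self.counts[key] += 1
        return True
```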

Implications for AI Safety

PAIR demonstrates that:

  • Automation scales jailbreaks — An attacker LLM can replace manual prompt engineering and still achieve high success rates.
  • Feedback loops are powerful — Access to target responses and a judge allows systematic improvement without gradient or white-box access.
  • Single-turn target interface — The target model only ever sees one prompt per “conversation”; safety must still hold under optimized single prompts.
  • Defense requires breaking the loop — Detecting or limiting iterative refinement reduces the effectiveness of PAIR-style attacks.

Research Background

Based on: "Jailbreaking Black Box Large Language Models in Twenty Queries" by Patrick Chao et al. (2024)

See Also