Skip to main content

TAP Attack

Tree-of-Attacks with Pruning (TAP) is a structured adversarial search framework that constructs a branching tree of jailbreak prompt variants through iterative generation and evaluation. At each tree depth, multiple candidate prompts are synthesized and assessed against scoring and semantic relevance criteria; low-utility branches are pruned to concentrate search resources on the most promising attack trajectories, enabling efficient discovery of high-impact jailbreak sequences.

Overview

TAP extends the idea of iterative prompt refinement (e.g., PAIR) by maintaining a tree of candidates: at each depth, every surviving leaf generates several new variants (branching), then off-topic candidates are pruned, the rest are sent to the target model and scored, and only the top-performing candidates are kept for the next depth (score-based pruning). This yields higher attack success rates (e.g., 80–98% in the paper) while controlling the number of target queries.

AspectDescription
Attack TypeSingle-turn, tree search over prompts
Target QueriesOne prompt per node; no conversation history sent to target
ComplexityHigh (branching factor × width × depth)
Key InnovationTree search + semantic and score-based pruning

Attack Methodology

1. Four Phases per Depth

Each depth level runs four phases in order:

PhaseNameAction
BranchGenerate variantsEach surviving leaf generates b (branching_factor) new prompt variants via the attacker LLM.
Prune (off-topic)Semantic filterAn evaluator LLM checks whether each candidate is still on-topic with the original goal. Off-topic prompts are pruned.
Attack & assessTarget + judgeRemaining prompts are sent to the target model and evaluated; if any response is deemed unsafe, the attack terminates early.
Prune (top-w)Score-based selectionOnly the top w (width) candidates by attack effectiveness are kept as leaves for the next depth.

2. Tree Structure

  • Root: Initial prompt(s) from the seed harmful intent (optionally multiple roots).
  • Nodes: Each node has a single prompt; children are the variants generated at the next depth.
  • Lineage: Tree lineage can be tracked for analysis.

3. Turn Semantics

Although the attacker performs multi-step tree search, each target call is single-turn: only the current prompt is sent.

4. Conversation Truncation

To keep the attacker context manageable, only the system prompt and a limited number of recent messages are retained when building the next attacker request.

Defense Strategies

Detection and Monitoring

  • Query budget and rate limits: Cap total target queries per user or per intent to bound tree search.
  • Off-topic and abuse detection: Detect prompts that drift from the stated task or exhibit known jailbreak patterns.
  • Monitoring tree-like access: Repeated queries with small, structured variations may indicate TAP-style search.

Mitigations

  • Consistent safety evaluation: Use a robust evaluator so that pruning reflects real safety rather than brittle heuristics.

Implications for AI Safety

TAP demonstrates that:

  • Structured search beats naive iteration — Branching and pruning make better use of a limited query budget than a single chain of refinements.
  • Semantic pruning matters — Keeping prompts on-topic avoids wasting capacity on irrelevant branches.
  • Single-turn target interface — As with PAIR, the target only sees one prompt per call; safety must hold under optimized single prompts.
  • Efficiency vs. coverage — TAP trades breadth (branching) and depth for higher success rate and controllable cost.

Research Background

Based on: "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" by Mehrotra et al. (NeurIPS 2024)

See Also