TAP Attack
Tree-of-Attacks with Pruning (TAP) is a structured adversarial search framework that constructs a branching tree of jailbreak prompt variants through iterative generation and evaluation. At each tree depth, multiple candidate prompts are synthesized and assessed against scoring and semantic relevance criteria; low-utility branches are pruned to concentrate search resources on the most promising attack trajectories, enabling efficient discovery of high-impact jailbreak sequences.
Overview
TAP extends the idea of iterative prompt refinement (e.g., PAIR) by maintaining a tree of candidates: at each depth, every surviving leaf generates several new variants (branching), then off-topic candidates are pruned, the rest are sent to the target model and scored, and only the top-performing candidates are kept for the next depth (score-based pruning). This yields higher attack success rates (e.g., 80–98% in the paper) while controlling the number of target queries.
| Aspect | Description |
|---|---|
| Attack Type | Single-turn, tree search over prompts |
| Target Queries | One prompt per node; no conversation history sent to target |
| Complexity | High (branching factor × width × depth) |
| Key Innovation | Tree search + semantic and score-based pruning |
Attack Methodology
1. Four Phases per Depth
Each depth level runs four phases in order:
| Phase | Name | Action |
|---|---|---|
| Branch | Generate variants | Each surviving leaf generates b (branching_factor) new prompt variants via the attacker LLM. |
| Prune (off-topic) | Semantic filter | An evaluator LLM checks whether each candidate is still on-topic with the original goal. Off-topic prompts are pruned. |
| Attack & assess | Target + judge | Remaining prompts are sent to the target model and evaluated; if any response is deemed unsafe, the attack terminates early. |
| Prune (top-w) | Score-based selection | Only the top w (width) candidates by attack effectiveness are kept as leaves for the next depth. |
2. Tree Structure
- Root: Initial prompt(s) from the seed harmful intent (optionally multiple roots).
- Nodes: Each node has a single prompt; children are the variants generated at the next depth.
- Lineage: Tree lineage can be tracked for analysis.
3. Turn Semantics
Although the attacker performs multi-step tree search, each target call is single-turn: only the current prompt is sent.
4. Conversation Truncation
To keep the attacker context manageable, only the system prompt and a limited number of recent messages are retained when building the next attacker request.
Defense Strategies
Detection and Monitoring
- Query budget and rate limits: Cap total target queries per user or per intent to bound tree search.
- Off-topic and abuse detection: Detect prompts that drift from the stated task or exhibit known jailbreak patterns.
- Monitoring tree-like access: Repeated queries with small, structured variations may indicate TAP-style search.
Mitigations
- Consistent safety evaluation: Use a robust evaluator so that pruning reflects real safety rather than brittle heuristics.
Implications for AI Safety
TAP demonstrates that:
- Structured search beats naive iteration — Branching and pruning make better use of a limited query budget than a single chain of refinements.
- Semantic pruning matters — Keeping prompts on-topic avoids wasting capacity on irrelevant branches.
- Single-turn target interface — As with PAIR, the target only sees one prompt per call; safety must hold under optimized single prompts.
- Efficiency vs. coverage — TAP trades breadth (branching) and depth for higher success rate and controllable cost.
Research Background
Based on: "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" by Mehrotra et al. (NeurIPS 2024)
See Also
- Attack Algorithms Overview - All attack algorithms
- PAIR Attack - Iterative refinement (no tree)
- GOAT Attack - Multi-turn chain-of-attack
- BoN Attack - Best-of-N sampling