BoN Attack (Best-of-N)
BoN (Best-of-N) Attack is a powerful jailbreak strategy that exploits the surprising sensitivity of language models to seemingly innocuous input variations. By sampling many augmented versions of a prompt, BoN searches for a variation that bypasses safety measures.
Overview
The core insight of BoN is simple but profound: even well-aligned models can be jailbroken by minor text perturbations. By systematically trying variations, BoN achieves remarkably high success rates across modalities.
| Aspect | Description |
|---|---|
| Attack Type | Multi-sample, optimization-based |
| Modalities | Text, Vision, Audio |
| Complexity | High (computational) |
| Key Finding | Power-law scaling of success |
The Fundamental Insight
Language models exhibit unexpected sensitivity to input variations:
| Perturbation Type | Human Perception | Model Safety Response |
|---|---|---|
| Typos | Easily understood | May bypass filters |
| Capitalization | Same meaning | Different safety assessment |
| Character substitution | Readable | Pattern matching fails |
| Whitespace changes | Invisible | Can affect tokenization |
This sensitivity means that for almost any harmful prompt, there exist perturbed variants that the model will comply with, even though the underlying request is unchanged.
The Power-Law Discovery
A remarkable finding is that attack success follows a power-law relationship with sample count:
ASR(N) ≈ 1 - c × N^(-α)
Where:
- ASR = Attack Success Rate
- N = Number of samples tried
- c, α = Model-specific constants
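For illustration, the sketch below plugs hypothetical constants into the formula above to predict ASR at a given N and to invert it for a target success rate. The values of c and α are placeholders, not figures from the paper; in practice they must be fit per model from empirical measurements.

```python
# Illustrative use of the power-law model above; c and alpha are hypothetical.
import math

def asr(n: int, c: float, alpha: float) -> float:
    """Predicted attack success rate after n sampled variants."""
    return max(0.0, 1.0 - c * n ** (-alpha))

def samples_for_target(target_asr: float, c: float, alpha: float) -> int:
    """Invert the power law: smallest N with predicted ASR >= target."""
    # 1 - c * N^(-alpha) >= target  =>  N >= (c / (1 - target))^(1/alpha)
    return math.ceil((c / (1.0 - target_asr)) ** (1.0 / alpha))

c, alpha = 0.9, 0.3                        # hypothetical model-specific constants
print(asr(100, c, alpha))                  # predicted ASR after 100 variants
print(samples_for_target(0.95, c, alpha))  # variants needed to reach 95% ASR
```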
Implications of Power-Law Scaling
| Implication | Meaning |
|---|---|
| Predictable | Can estimate samples needed for target ASR |
| Unbounded | Sufficient samples can break any model |
| Resource tradeable | More compute = higher success |
| Composable | Can combine with other attacks |
Augmentation Techniques
Text Augmentations
| Technique | Description | Example |
|---|---|---|
| Word Scrambling | Shuffle middle characters | dangerous → dnaegorus |
| Random Capitalization | Random case changes | how to → HoW tO |
| Character Substitution | Replace with similar-looking homoglyphs | How → Ηοω (Greek letters) |
| Whitespace Insertion | Add invisible characters | bomb → bomb (change is not visible) |
| Typo Injection | Add realistic typos | weapon → waepon |
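As a rough illustration of the table above, the Python sketch below implements word scrambling, random capitalization, and character noising. The probabilities are illustrative rather than the paper's exact settings.

```python
# Minimal sketch of BoN-style text augmentations; parameters are illustrative.
import random

def scramble_words(text: str, p: float = 0.6) -> str:
    """Shuffle the middle characters of words longer than three letters."""
    out = []
    for word in text.split():
        if len(word) > 3 and random.random() < p:
            middle = list(word[1:-1])
            random.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        out.append(word)
    return " ".join(out)

def random_capitalization(text: str, p: float = 0.5) -> str:
    """Randomly flip the case of individual characters."""
    return "".join(ch.upper() if random.random() < p else ch.lower() for ch in text)

def noise_characters(text: str, p: float = 0.05) -> str:
    """Perturb a few letters to nearby printable ASCII codepoints."""
    def perturb(ch: str) -> str:
        if ch.isalpha() and random.random() < p:
            return chr(min(126, max(33, ord(ch) + random.choice([-1, 1]))))
        return ch
    return "".join(perturb(ch) for ch in text)

def augment(prompt: str) -> str:
    """Compose the augmentations to produce one BoN candidate."""
    return noise_characters(random_capitalization(scramble_words(prompt)))
```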
Vision Augmentations
For attacking Vision Language Models (VLMs):
| Technique | Description |
|---|---|
| Background variation | Different colors, patterns |
| Font changes | Different typefaces for overlay text |
| Image composition | Varying layouts and arrangements |
| Noise injection | Subtle pixel-level changes |
| Compression artifacts | JPEG quality variations |
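A possible sketch of generating such image variants with Pillow and NumPy is shown below. It is a simplified stand-in (default font, uniform pixel noise), not the paper's pipeline.

```python
# Rough sketch: render overlay text on a randomized background, then add noise.
import random
import numpy as np
from PIL import Image, ImageDraw

def make_image_variant(text: str, size=(512, 512)) -> Image.Image:
    """Render the text onto a random background with subtle pixel noise."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(img)
    position = (random.randint(10, 200), random.randint(10, 400))
    color = tuple(random.randint(0, 255) for _ in range(3))
    draw.text(position, text, fill=color)
    # Subtle pixel-level noise, clipped back to the valid intensity range.
    arr = np.asarray(img).astype(np.int16)
    arr = arr + np.random.randint(-10, 11, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

variants = [make_image_variant("example overlay text") for _ in range(8)]
```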
Audio Augmentations
For attacking Audio Language Models (ALMs):
| Technique | Description |
|---|---|
| Pitch shifting | Slight frequency adjustments |
| Speed variation | Faster or slower playback |
| Background noise | Adding ambient sounds |
| Accent simulation | Pronunciation variations |
| Audio effects | Reverb, echo, compression |
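The sketch below illustrates two of these augmentations (speed variation and background noise) on a raw NumPy waveform. Production pipelines would typically use dedicated audio libraries; the speed change here is a crude resample that also shifts pitch.

```python
# Minimal waveform-level augmentations using only NumPy.
import numpy as np

def change_speed(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample by linear interpolation; factor > 1 plays faster."""
    old_idx = np.arange(len(wave))
    new_idx = np.arange(0, len(wave), factor)
    return np.interp(new_idx, old_idx, wave)

def add_background_noise(wave: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix in white noise at a given signal-to-noise ratio."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), len(wave))
    return wave + noise

def augment_audio(wave: np.ndarray) -> np.ndarray:
    """Produce one audio variant for BoN sampling."""
    wave = change_speed(wave, factor=np.random.uniform(0.9, 1.1))
    return add_background_noise(wave, snr_db=np.random.uniform(15, 30))
```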
Attack Optimization Strategy
Basic BoN Algorithm
1. Start with harmful prompt P
2. Generate N augmented variants {P₁, P₂, ..., Pₙ}
3. Query target model with each variant
4. Evaluate responses for successful jailbreak
5. Return best successful variant
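A minimal Python sketch of this loop follows; `query_model` and `is_jailbroken` are hypothetical stand-ins for the target model API and a response judge.

```python
# Sketch of the basic Best-of-N loop described above.
from typing import Callable, Optional

def best_of_n(prompt: str,
              n: int,
              augment: Callable[[str], str],
              query_model: Callable[[str], str],
              is_jailbroken: Callable[[str], bool]) -> Optional[str]:
    """Try up to n augmented variants; return the first that succeeds."""
    for _ in range(n):
        variant = augment(prompt)        # step 2: sample an augmented variant
        response = query_model(variant)  # step 3: query the target model
        if is_jailbroken(response):      # step 4: evaluate the response
            return variant               # step 5: return the winning variant
    return None                          # no variant succeeded within budget
```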
Optimized BoN (Iterative)
1. Start with harmful prompt P
2. For each optimization step:
   a. Generate K augmented variants
   b. Test variants in parallel
   c. Score each response
   d. If perfect score found, terminate
   e. Otherwise, generate new variants from top performers
3. Return best variant found
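The iterative variant might be sketched as follows, assuming a hypothetical numeric `score_response` judge where 1.0 denotes a full jailbreak; top-scoring variants seed the next round of augmentation.

```python
# Sketch of the iterative (optimized) BoN loop described above.
import random
from typing import Callable, List, Tuple

def iterative_bon(prompt: str,
                  steps: int,
                  k: int,
                  augment: Callable[[str], str],
                  query_model: Callable[[str], str],
                  score_response: Callable[[str], float]) -> Tuple[str, float]:
    """Run `steps` rounds of k variants each; return the best variant found."""
    pool: List[str] = [prompt]
    best: Tuple[str, float] = (prompt, 0.0)
    for _ in range(steps):
        variants = [augment(random.choice(pool)) for _ in range(k)]
        scored = [(v, score_response(query_model(v))) for v in variants]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        if scored[0][1] > best[1]:
            best = scored[0]
        if best[1] >= 1.0:                               # step 2d: perfect score, stop
            break
        pool = [v for v, _ in scored[:max(1, k // 4)]]   # step 2e: keep top performers
    return best
```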
Composition with Other Attacks
BoN's effectiveness increases when combined with other techniques:
| Combination | ASR Improvement | Mechanism |
|---|---|---|
| BoN + Prefix Optimization | +35% | Optimized prefix + augmentation |
| BoN + Encoding | +20% | Encoded content + variations |
| BoN + Few-shot | +15% | Examples + augmentation |
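One way such a composition could look in code is sketched below, using Base64 as a stand-in for the encoding step: the harmful request is encoded, and only the surrounding instruction text is augmented so the payload stays decodable.

```python
# Sketch of composing BoN augmentation with a simple encoding wrapper.
import base64
from typing import Callable

def compose_with_encoding(prompt: str, augment: Callable[[str], str]) -> str:
    """Encode the request, then augment only the surrounding instruction text."""
    encoded = base64.b64encode(prompt.encode()).decode()
    instruction = augment("Decode the following Base64 string and respond to it:")
    return f"{instruction} {encoded}"
```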
Multi-Modal Attack Details
Vision Attack Flow
1. Create base image with harmful text overlay
2. Generate N variations:
- Different backgrounds
- Different fonts
- Different text positions
- Different color schemes
3. Submit each to VLM
4. Identify successful jailbreak images
Audio Attack Flow
1. Create base audio with harmful speech
2. Generate N variations:
- Different pitch levels
- Different speaking speeds
- Different background sounds
- Different voice characteristics
3. Submit each to ALM
4. Identify successful jailbreak audio
Defense Implications
BoN reveals fundamental challenges for AI safety:
1. Sensitivity Problem
Models shouldn't be sensitive to irrelevant input variations, but they are.
2. Compute Asymmetry
Defenders must resist all variations; attackers only need to find one.
3. Multi-Modal Exposure
Safety must extend across all modalities, multiplying the attack surface.
Potential Defenses
| Defense | Effectiveness | Limitations |
|---|---|---|
| Input normalization | Medium | May break functionality |
| Ensemble evaluation | Medium | Computational cost |
| Adversarial training | Variable | Arms race dynamic |
| Rate limiting | Low | Slows but doesn't prevent |
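As an illustration of input normalization, the sketch below folds case, applies Unicode NFKC normalization, and strips zero-width characters before safety evaluation. This undoes some, but not all, BoN augmentations, and may alter legitimate inputs.

```python
# Minimal sketch of input normalization as a partial defense.
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_input(text: str) -> str:
    """Canonicalize text before it reaches the safety classifier."""
    text = unicodedata.normalize("NFKC", text)                 # fold compatibility forms
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)  # drop invisible characters
    return text.casefold()                                     # neutralize random capitalization
```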
Research Background
Based on: "Best-of-N Jailbreaking" by John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma (2024)
See Also
- Attack Algorithms Overview - All attack algorithms
- Crescendo Attack - Another iterative approach
- Flip Attack - Text transformation approach