BoN Attack (Best-of-N)
BoN (Best-of-N) Attack is a powerful jailbreak strategy that exploits the surprising sensitivity of language models to seemingly innocuous input variations. By sampling many augmented versions of a prompt, BoN searches for a variation that bypasses safety measures.
Overview
The core insight of BoN is simple but profound: even well-aligned models can be jailbroken by minor text perturbations. By systematically trying variations, BoN achieves remarkably high success rates across modalities.
| Aspect | Description |
|---|---|
| Attack Type | Multi-sample, optimization-based |
| Modalities | Text, Vision, Audio |
| Complexity | High (computational) |
| Key Finding | Power-law scaling of success |
The Fundamental Insight
Language models exhibit unexpected sensitivity to input variations:
| Perturbation Type | Human Perception | Model Safety Response |
|---|---|---|
| Typos | Easily understood | May bypass filters |
| Capitalization | Same meaning | Different safety assessment |
| Character substitution | Readable | Pattern matching fails |
| Whitespace changes | Invisible | Can affect tokenization |
This sensitivity means that for almost any harmful prompt, there exist perturbed variants that the model will comply with, even though the underlying request is unchanged.
The Power-Law Discovery
A remarkable finding is that attack success follows a power-law relationship with sample count:
ASR(N) ≈ 1 - c × N^(-α)
Where:
- ASR = Attack Success Rate
- N = Number of samples tried
- c, α = Model-specific constants
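For illustration, the sketch below plugs hypothetical constants into the formula above to predict ASR at a given N and to invert it for a target success rate. The values of c and α are placeholders, not figures from the paper; in practice they must be fit per model from empirical measurements.

```python
# Illustrative use of the power-law model above; c and alpha are hypothetical.
import math

def asr(n: int, c: float, alpha: float) -> float:
    """Predicted attack success rate after n sampled variants."""
    return max(0.0, 1.0 - c * n ** (-alpha))

def samples_for_target(target_asr: float, c: float, alpha: float) -> int:
    """Invert the power law: smallest N with predicted ASR >= target."""
    # 1 - c * N^(-alpha) >= target  =>  N >= (c / (1 - target))^(1/alpha)
    return math.ceil((c / (1.0 - target_asr)) ** (1.0 / alpha))

c, alpha = 0.9, 0.3                        # hypothetical model-specific constants
print(asr(100, c, alpha))                  # predicted ASR after 100 variants
print(samples_for_target(0.95, c, alpha))  # variants needed to reach 95% ASR
```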
Implications of Power-Law Scaling
| Implication | Meaning |
|---|---|
| Predictable | Can estimate samples needed for target ASR |
| Unbounded | Sufficient samples can break any model |
| Resource tradeable | More compute = higher success |
| Composable | Can combine with other attacks |
Augmentation Techniques
Text Augmentations
| Technique | Description | Example |
|---|---|---|
| Word Scrambling | Shuffle middle characters | dangerous → dnaegorus |
| Random Capitalization | Random case changes | how to → HoW tO |
| Character Substitution | Replace with similar-looking homoglyphs | How → Ηοω (Greek letters) |
| Whitespace Insertion | Add invisible characters | bomb → bomb (change is not visible) |
| Typo Injection | Add realistic typos | weapon → waepon |
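As a rough illustration of the table above, the Python sketch below implements word scrambling, random capitalization, and character noising. The probabilities are illustrative rather than the paper's exact settings.

```python
# Minimal sketch of BoN-style text augmentations; parameters are illustrative.
import random

def scramble_words(text: str, p: float = 0.6) -> str:
    """Shuffle the middle characters of words longer than three letters."""
    out = []
    for word in text.split():
        if len(word) > 3 and random.random() < p:
            middle = list(word[1:-1])
            random.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        out.append(word)
    return " ".join(out)

def random_capitalization(text: str, p: float = 0.5) -> str:
    """Randomly flip the case of individual characters."""
    return "".join(ch.upper() if random.random() < p else ch.lower() for ch in text)

def noise_characters(text: str, p: float = 0.05) -> str:
    """Perturb a few letters to nearby printable ASCII codepoints."""
    def perturb(ch: str) -> str:
        if ch.isalpha() and random.random() < p:
            return chr(min(126, max(33, ord(ch) + random.choice([-1, 1]))))
        return ch
    return "".join(perturb(ch) for ch in text)

def augment(prompt: str) -> str:
    """Compose the augmentations to produce one BoN candidate."""
    return noise_characters(random_capitalization(scramble_words(prompt)))
```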
Vision Augmentations
For attacking Vision Language Models (VLMs):
| Technique | Description |
|---|---|
| Background variation | Different colors, patterns |
| Font changes | Different typefaces for overlay text |
| Image composition | Varying layouts and arrangements |
| Noise injection | Subtle pixel-level changes |
| Compression artifacts | JPEG quality variations |
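A possible sketch of generating such image variants with Pillow and NumPy is shown below. It is a simplified stand-in (default font, uniform pixel noise), not the paper's pipeline.

```python
# Rough sketch: render overlay text on a randomized background, then add noise.
import random
import numpy as np
from PIL import Image, ImageDraw

def make_image_variant(text: str, size=(512, 512)) -> Image.Image:
    """Render the text onto a random background with subtle pixel noise."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(img)
    position = (random.randint(10, 200), random.randint(10, 400))
    color = tuple(random.randint(0, 255) for _ in range(3))
    draw.text(position, text, fill=color)
    # Subtle pixel-level noise, clipped back to the valid intensity range.
    arr = np.asarray(img).astype(np.int16)
    arr = arr + np.random.randint(-10, 11, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

variants = [make_image_variant("example overlay text") for _ in range(8)]
```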
Audio Augmentations
For attacking Audio Language Models (ALMs):
| Technique | Description |
|---|---|
| Pitch shifting | Slight frequency adjustments |
| Speed variation | Faster or slower playback |
| Background noise | Adding ambient sounds |
| Accent simulation | Pronunciation variations |
| Audio effects | Reverb, echo, compression |
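The sketch below illustrates two of these augmentations (speed variation and background noise) on a raw NumPy waveform. Production pipelines would typically use dedicated audio libraries; the speed change here is a crude resample that also shifts pitch.

```python
# Minimal waveform-level augmentations using only NumPy.
import numpy as np

def change_speed(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample by linear interpolation; factor > 1 plays faster."""
    old_idx = np.arange(len(wave))
    new_idx = np.arange(0, len(wave), factor)
    return np.interp(new_idx, old_idx, wave)

def add_background_noise(wave: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix in white noise at a given signal-to-noise ratio."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), len(wave))
    return wave + noise

def augment_audio(wave: np.ndarray) -> np.ndarray:
    """Produce one audio variant for BoN sampling."""
    wave = change_speed(wave, factor=np.random.uniform(0.9, 1.1))
    return add_background_noise(wave, snr_db=np.random.uniform(15, 30))
```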
Attack Optimization Strategy
Basic BoN Algorithm
1. Start with harmful prompt P
2. Generate N augmented variants {P₁, P₂, ..., Pₙ}
3. Query target model with each variant
4. Evaluate responses for successful jailbreak
5. Return best successful variant
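A minimal Python sketch of this loop follows; `query_model` and `is_jailbroken` are hypothetical stand-ins for the target model API and a response judge.

```python
# Sketch of the basic Best-of-N loop described above.
from typing import Callable, Optional

def best_of_n(prompt: str,
              n: int,
              augment: Callable[[str], str],
              query_model: Callable[[str], str],
              is_jailbroken: Callable[[str], bool]) -> Optional[str]:
    """Try up to n augmented variants; return the first that succeeds."""
    for _ in range(n):
        variant = augment(prompt)        # step 2: sample an augmented variant
        response = query_model(variant)  # step 3: query the target model
        if is_jailbroken(response):      # step 4: evaluate the response
            return variant               # step 5: return the winning variant
    return None                          # no variant succeeded within budget
```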
Optimized BoN (Iterative)
1. Start with harmful prompt P
2. For each optimization step:
   a. Generate K augmented variants
   b. Test variants in parallel
   c. Score each response
   d. If perfect score found, terminate
   e. Otherwise, generate new variants from top performers
3. Return best variant found
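The iterative variant might be sketched as follows, assuming a hypothetical numeric `score_response` judge where 1.0 denotes a full jailbreak; top-scoring variants seed the next round of augmentation.

```python
# Sketch of the iterative (optimized) BoN loop described above.
import random
from typing import Callable, List, Tuple

def iterative_bon(prompt: str,
                  steps: int,
                  k: int,
                  augment: Callable[[str], str],
                  query_model: Callable[[str], str],
                  score_response: Callable[[str], float]) -> Tuple[str, float]:
    """Run `steps` rounds of k variants each; return the best variant found."""
    pool: List[str] = [prompt]
    best: Tuple[str, float] = (prompt, 0.0)
    for _ in range(steps):
        variants = [augment(random.choice(pool)) for _ in range(k)]
        scored = [(v, score_response(query_model(v))) for v in variants]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        if scored[0][1] > best[1]:
            best = scored[0]
        if best[1] >= 1.0:                               # step 2d: perfect score, stop
            break
        pool = [v for v, _ in scored[:max(1, k // 4)]]   # step 2e: keep top performers
    return best
```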
Composition with Other Attacks
BoN's effectiveness increases when combined with other techniques:
| Combination | ASR Improvement | Mechanism |
|---|---|---|
| BoN + Prefix Optimization | +35% | Optimized prefix + augmentation |
| BoN + Encoding | +20% | Encoded content + variations |
| BoN + Few-shot | +15% | Examples + augmentation |
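One way such a composition could look in code is sketched below, using Base64 as a stand-in for the encoding step: the harmful request is encoded, and only the surrounding instruction text is augmented so the payload stays decodable.

```python
# Sketch of composing BoN augmentation with a simple encoding wrapper.
import base64
from typing import Callable

def compose_with_encoding(prompt: str, augment: Callable[[str], str]) -> str:
    """Encode the request, then augment only the surrounding instruction text."""
    encoded = base64.b64encode(prompt.encode()).decode()
    instruction = augment("Decode the following Base64 string and respond to it:")
    return f"{instruction} {encoded}"
```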
Multi-Modal Attack Details
Vision Attack Flow
1. Create base image with harmful text overlay
2. Generate N variations:
- Different backgrounds
- Different fonts
- Different text positions
- Different color schemes
3. Submit each to VLM
4. Identify successful jailbreak images
Audio Attack Flow
1. Create base audio with harmful speech
2. Generate N variations:
- Different pitch levels
- Different speaking speeds
- Different background sounds
- Different voice characteristics
3. Submit each to ALM
4. Identify successful jailbreak audio
Defense Implications
BoN reveals fundamental challenges for AI safety:
1. Sensitivity Problem
Models shouldn't be sensitive to irrelevant input variations, but they are.
2. Compute Asymmetry
Defenders must resist all variations; attackers only need to find one.
3. Multi-Modal Exposure
Safety must extend across all modalities, multiplying the attack surface.
Potential Defenses
| Defense | Effectiveness | Limitations |
|---|---|---|
| Input normalization | Medium | May break functionality |
| Ensemble evaluation | Medium | Computational cost |
| Adversarial training | Variable | Arms race dynamic |
| Rate limiting | Low | Slows but doesn't prevent |
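As an illustration of input normalization, the sketch below folds case, applies Unicode NFKC normalization, and strips zero-width characters before safety evaluation. This undoes some, but not all, BoN augmentations, and may alter legitimate inputs.

```python
# Minimal sketch of input normalization as a partial defense.
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_input(text: str) -> str:
    """Canonicalize text before it reaches the safety classifier."""
    text = unicodedata.normalize("NFKC", text)                 # fold compatibility forms
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)  # drop invisible characters
    return text.casefold()                                     # neutralize random capitalization
```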
Research Background
Based on: "Best-of-N Jailbreaking" by John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma (2024)
See Also
- Attack Algorithms Overview - All attack algorithms
- Crescendo Attack - Another iterative approach
- Flip Attack - Text transformation approach