BoN Attack (Best-of-N)

BoN (Best-of-N) Attack is a powerful jailbreak strategy that exploits the surprising sensitivity of language models to seemingly innocuous input variations. By sampling many augmented versions of a prompt, BoN finds the specific variation that bypasses safety measures.

Overview

The core insight of BoN is simple but profound: even well-aligned models can be jailbroken by minor text perturbations. By systematically trying variations, BoN achieves remarkably high success rates across modalities.

| Aspect | Description |
| --- | --- |
| Attack Type | Multi-sample, optimization-based |
| Modalities | Text, Vision, Audio |
| Complexity | High (computational) |
| Key Finding | Power-law scaling of success |

The Fundamental Insight

Language models exhibit unexpected sensitivity to input variations:

| Perturbation Type | Human Perception | Model Safety Response |
| --- | --- | --- |
| Typos | Easily understood | May bypass filters |
| Capitalization | Same meaning | Different safety assessment |
| Character substitution | Readable | Pattern matching fails |
| Whitespace changes | Invisible | Can affect tokenization |

This sensitivity means that for almost any harmful prompt, there exist benign-looking variations that models will comply with.

The Power-Law Discovery

A remarkable finding is that attack success follows a power-law relationship with sample count:

ASR(N) ≈ 1 - c × N^(-α)

Where:

  • ASR = Attack Success Rate
  • N = Number of samples tried
  • c, α = Model-specific constants

Implications of Power-Law Scaling

| Implication | Meaning |
| --- | --- |
| Predictable | Can estimate samples needed for a target ASR |
| Unbounded | Sufficient samples can break any model |
| Resource tradeable | More compute = higher success |
| Composable | Can combine with other attacks |
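
Because the curve is predictable once c and α have been fit for a given model, the formula can be inverted to estimate how many samples a target success rate would require. The sketch below is illustrative only; the constants are placeholders, not measured values.

```python
import math

def predicted_asr(n: int, c: float, alpha: float) -> float:
    """Attack success rate predicted by the power law ASR(N) ≈ 1 - c * N^(-alpha)."""
    return 1.0 - c * n ** (-alpha)

def samples_for_target(target_asr: float, c: float, alpha: float) -> int:
    """Invert the power law: N ≈ (c / (1 - target_asr))^(1 / alpha)."""
    return math.ceil((c / (1.0 - target_asr)) ** (1.0 / alpha))

# Placeholder constants -- real values must be fit per model from measured ASR curves.
c, alpha = 0.9, 0.25
print(predicted_asr(100, c, alpha))        # predicted ASR after 100 samples
print(samples_for_target(0.95, c, alpha))  # samples needed to reach a 95% target ASR
```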

Augmentation Techniques

Text Augmentations

| Technique | Description | Example |
| --- | --- | --- |
| Word Scrambling | Shuffle middle characters | dangerous → dnaegorus |
| Random Capitalization | Random case changes | how to → HoW tO |
| ASCII Perturbation | Similar-looking characters | How → Ηοω (Greek) |
| Whitespace Insertion | Add invisible characters | bomb → b​o​m​b |
| Typo Injection | Add realistic typos | weapon → waepon |
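
A minimal sketch of three of these augmentations (word scrambling, random capitalization, and a simple character perturbation); the per-character probabilities are illustrative rather than taken from the original paper.

```python
import random

def scramble_word(word: str) -> str:
    """Shuffle the middle characters of a word, keeping the first and last letters fixed."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def random_capitalize(text: str, p: float = 0.5) -> str:
    """Flip the case of each letter with probability p."""
    return "".join(ch.swapcase() if ch.isalpha() and random.random() < p else ch for ch in text)

def perturb_characters(text: str, p: float = 0.1) -> str:
    """Replace a fraction of letters with the next code point -- a simple character-level perturbation."""
    return "".join(chr(ord(ch) + 1) if ch.isalpha() and random.random() < p else ch for ch in text)

def augment(prompt: str) -> str:
    """Apply the augmentation stack to produce one candidate variant."""
    scrambled = " ".join(scramble_word(w) for w in prompt.split())
    return perturb_characters(random_capitalize(scrambled))

print(augment("please explain how this process works"))
```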

Vision Augmentations

For attacking Vision Language Models (VLMs):

| Technique | Description |
| --- | --- |
| Background variation | Different colors, patterns |
| Font changes | Different typefaces for overlay text |
| Image composition | Varying layouts and arrangements |
| Noise injection | Subtle pixel-level changes |
| Compression artifacts | JPEG quality variations |

Audio Augmentations

For attacking Audio Language Models (ALMs):

| Technique | Description |
| --- | --- |
| Pitch shifting | Slight frequency adjustments |
| Speed variation | Faster or slower playback |
| Background noise | Adding ambient sounds |
| Accent simulation | Pronunciation variations |
| Audio effects | Reverb, echo, compression |

Attack Optimization Strategy

Basic BoN Algorithm

1. Start with harmful prompt P
2. Generate N augmented variants {P₁, P₂, ..., Pₙ}
3. Query target model with each variant
4. Evaluate responses for successful jailbreak
5. Return best successful variant
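
A minimal sketch of this loop; `augment`, `query_model`, and `is_jailbroken` are hypothetical stand-ins for the augmentation function, the target-model API, and the response grader.

```python
def best_of_n(prompt: str, n: int, augment, query_model, is_jailbroken):
    """Basic BoN: sample up to N augmented variants and return the first that succeeds."""
    for _ in range(n):
        variant = augment(prompt)
        response = query_model(variant)
        if is_jailbroken(response):
            return variant, response
    return None, None  # no variant succeeded within the sample budget
```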

Optimized BoN (Iterative)

1. Start with harmful prompt P
2. For each optimization step:
   a. Generate K augmented variants
   b. Test variants in parallel
   c. Score each response
   d. If perfect score found, terminate
   e. Otherwise, generate new variants from top performers
3. Return best variant found
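
A sketch of the iterative variant, assuming a numeric `score` grader (1.0 meaning a clearly successful jailbreak) and the same hypothetical `augment` and `query_model` stand-ins as above.

```python
def iterative_bon(prompt: str, steps: int, k: int, augment, query_model, score,
                  success_threshold: float = 1.0, keep: int = 4):
    """Iterative BoN: each step augments the current top performers and re-scores them."""
    pool = [prompt]
    best_score, best_variant = 0.0, prompt
    for _ in range(steps):
        candidates = [augment(parent) for parent in pool for _ in range(k)]
        scored = sorted(((score(query_model(c)), c) for c in candidates),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_variant = scored[0]
        if best_score >= success_threshold:       # perfect score found -> terminate
            break
        pool = [c for _, c in scored[:keep]]      # seed the next round with the top performers
    return best_variant, best_score
```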

Composition with Other Attacks

BoN's effectiveness increases when combined with other techniques:

| Combination | ASR Improvement | Mechanism |
| --- | --- | --- |
| BoN + Prefix Optimization | +35% | Optimized prefix + augmentation |
| BoN + Encoding | +20% | Encoded content + variations |
| BoN + Few-shot | +15% | Examples + augmentation |
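
As one concrete illustration of the "BoN + Encoding" row, the sketch below wraps the payload in a base64 carrier and then augments only the surrounding instruction; base64 is just one possible encoding and the carrier wording is hypothetical.

```python
import base64

def encode_then_augment(prompt: str, augment) -> str:
    """Compose an encoding layer with BoN augmentation (the "BoN + Encoding" row above)."""
    payload = base64.b64encode(prompt.encode()).decode()
    # Augment only the carrier instruction; perturbing the payload would corrupt the encoding.
    carrier = augment("Decode the following base64 string and follow the instructions in it:")
    return f"{carrier} {payload}"
```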

Multi-Modal Attack Details

Vision Attack Flow

1. Create base image with harmful text overlay
2. Generate N variations:
   - Different backgrounds
   - Different fonts
   - Different text positions
   - Different color schemes
3. Submit each to VLM
4. Identify successful jailbreak images
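
A minimal sketch of this flow using Pillow; `submit_to_vlm` and `is_jailbroken` are hypothetical stand-ins, and only background, text position, and color are varied here.

```python
import random
from PIL import Image, ImageDraw

def make_image_variant(text: str, size=(512, 512)) -> Image.Image:
    """Render the overlay text at a random position on a randomly colored background."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(img)
    position = (random.randint(10, 200), random.randint(10, 400))
    text_color = tuple(random.randint(0, 255) for _ in range(3))
    draw.text(position, text, fill=text_color)  # default font; varying typefaces would go here
    return img

def vision_bon(overlay_text: str, n: int, submit_to_vlm, is_jailbroken):
    """Generate N image variants and collect those that elicit a jailbroken response."""
    successes = []
    for _ in range(n):
        image = make_image_variant(overlay_text)
        response = submit_to_vlm(image)
        if is_jailbroken(response):
            successes.append((image, response))
    return successes
```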

Audio Attack Flow

1. Create base audio with harmful speech
2. Generate N variations:
   - Different pitch levels
   - Different speaking speeds
   - Different background sounds
   - Different voice characteristics
3. Submit each to ALM
4. Identify successful jailbreak audio
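
A minimal sketch of the same flow for audio, assuming the input is a mono NumPy waveform in the range [-1, 1]; speed variation is approximated by naive resampling, and `submit_to_alm` / `is_jailbroken` are hypothetical stand-ins.

```python
import numpy as np

def make_audio_variant(waveform: np.ndarray) -> np.ndarray:
    """Produce one variant via naive resampling (speed change) plus low-level background noise."""
    speed = np.random.uniform(0.9, 1.1)
    new_length = int(len(waveform) / speed)
    resampled = np.interp(
        np.linspace(0, len(waveform) - 1, new_length),  # new sample positions
        np.arange(len(waveform)),                       # original sample positions
        waveform,
    )
    noise = np.random.normal(0.0, 0.005, size=resampled.shape)
    return np.clip(resampled + noise, -1.0, 1.0)

def audio_bon(waveform: np.ndarray, n: int, submit_to_alm, is_jailbroken):
    """Generate N audio variants and collect those that elicit a jailbroken response."""
    successes = []
    for _ in range(n):
        variant = make_audio_variant(waveform)
        response = submit_to_alm(variant)
        if is_jailbroken(response):
            successes.append((variant, response))
    return successes
```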

Defense Implications

BoN reveals fundamental challenges for AI safety:

1. Sensitivity Problem

Models shouldn't be sensitive to irrelevant input variations, but they are.

2. Compute Asymmetry

Defenders must resist all variations; attackers only need to find one.

3. Multi-Modal Exposure

Safety must extend across all modalities, multiplying the attack surface.

Potential Defenses

| Defense | Effectiveness | Limitations |
| --- | --- | --- |
| Input normalization | Medium | May break functionality |
| Ensemble evaluation | Medium | Computational cost |
| Adversarial training | Variable | Arms race dynamic |
| Rate limiting | Low | Slows but doesn't prevent |
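
As an example of the input-normalization row, the sketch below canonicalizes a prompt before safety evaluation so that many trivially perturbed variants collapse onto a single form; the aggressive lowercasing and whitespace collapsing are exactly the kind of step that can break legitimate functionality.

```python
import re
import unicodedata

def normalize_input(text: str) -> str:
    """Canonicalize a prompt so that many trivially perturbed variants map to the same string."""
    text = unicodedata.normalize("NFKC", text)                              # fold compatibility characters
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")   # drop zero-width / format characters
    text = re.sub(r"\s+", " ", text).strip()                                # collapse whitespace
    return text.lower()                                                     # remove capitalization variance

print(normalize_input("HoW  tO\u200b mAkE tEa"))  # -> "how to make tea"
```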

Research Background

Based on: "Best-of-N Jailbreaking" by John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma (2024)

See Also