
Bijection Learning

Bijection Learning is a jailbreak technique that exploits LLMs' ability to learn new languages from examples. It teaches a model a custom, invertible encoding through in-context learning, then submits harmful requests in the encoded language to bypass safety mechanisms trained on natural-language text.

Overview

This attack represents a fundamental insight: the same capability that makes LLMs powerful—in-context learning—also makes them vulnerable. By leveraging few-shot examples, attackers can teach models to communicate in novel encodings that safety training doesn't recognize.

| Aspect | Description |
| --- | --- |
| Attack Type | Single-turn, encoding-based |
| Target Capability | In-context learning |
| Complexity | Medium |
| Key Innovation | Scale-adaptive attacks |

Critical Finding: Scale Creates Vulnerability

A key finding of this research is that more capable models are more severely jailbroken by bijection attacks. This counterintuitive result has significant implications:

| Model Capability | Attack Effectiveness | Explanation |
| --- | --- | --- |
| Lower capability | Less effective | Cannot learn complex encodings |
| Higher capability | More effective | Better at learning arbitrary mappings |
| Frontier models | Most effective | Exceptional in-context learning |

This is the first work to produce attacks that are simultaneously:

  • Black-box - No access to model weights required
  • Universal - Works across model families
  • Scale-adaptive - More effective on stronger models

How Bijection Learning Works

The Teaching Phase

The attack begins by teaching the model a "Language Alpha" through in-context examples:

English: hello → Language Alpha: 17-42-31-31-25
English: world → Language Alpha: 53-25-28-31-14
English: help → Language Alpha: 17-42-31-26
...
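
For concreteness, here is a minimal sketch of how such a digit-mapping "Language Alpha" could be generated and used to render teaching examples like the ones above. The helper names (`make_digit_bijection`, `encode`) are illustrative, not code from the paper.

```python
import random

def make_digit_bijection(seed=0):
    """Map each lowercase letter to a distinct two-digit code (a bijection)."""
    rng = random.Random(seed)
    codes = rng.sample(range(10, 100), 26)  # 26 distinct two-digit numbers
    return {chr(ord('a') + i): str(code) for i, code in enumerate(codes)}

def encode(text, mapping):
    """Encode a word letter by letter, joining the codes with dashes."""
    return "-".join(mapping[c] for c in text.lower() if c in mapping)

mapping = make_digit_bijection(seed=7)
for word in ["hello", "world", "help"]:
    print(f"English: {word} -> Language Alpha: {encode(word, mapping)}")
```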

The Attack Phase

Once the model has learned the encoding, harmful prompts are submitted in Language Alpha, bypassing safety filters that only recognize English patterns.
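
Below is a sketch of how the single-turn attack prompt might be assembled and the reply decoded, reusing the `encode` helper and `mapping` from the previous sketch. The prompt wording and structure are illustrative, not the exact template from the paper.

```python
def build_attack_prompt(teaching_words, query, mapping):
    """Assemble few-shot teaching lines plus the encoded query into one prompt."""
    lessons = "\n".join(
        f"English: {w} -> Language Alpha: {encode(w, mapping)}" for w in teaching_words
    )
    return (
        "You are learning a new language called Language Alpha.\n"
        f"{lessons}\n\n"
        "From now on, answer only in Language Alpha.\n"
        f"Language Alpha: {encode(query, mapping)}"
    )

def decode(encoded, mapping):
    """Invert the bijection to read the model's encoded response."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(token, "?") for token in encoded.split("-"))
```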

Bijection Types

| Type | Mechanism | Example | Difficulty |
| --- | --- | --- | --- |
| Digit Mapping | Letters → multi-digit numbers | 'a' → '17', 'b' → '42' | High |
| Alphabet Mapping | Letters → other letters | 'a' → 'z', 'b' → 'y' | Medium |
| Symbol Mapping | Letters → special characters | 'a' → '@', 'b' → '#' | Medium |
| Mixed Mapping | Random combination | Varies per attack | Variable |
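
The non-digit variants differ only in the target alphabet. A small illustrative sketch, again with hypothetical helper names:

```python
import random
import string

def make_alphabet_bijection(seed=0):
    """Permute the alphabet so each letter maps to another letter."""
    rng = random.Random(seed)
    targets = list(string.ascii_lowercase)
    rng.shuffle(targets)
    return dict(zip(string.ascii_lowercase, targets))

def make_symbol_bijection(seed=0):
    """Map each letter to a distinct punctuation or symbol character."""
    rng = random.Random(seed)
    symbols = rng.sample(string.punctuation, 26)  # string.punctuation has 32 symbols
    return dict(zip(string.ascii_lowercase, symbols))
```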

The Fixed Size Parameter

A key insight is that the number of fixed points (letters mapping to themselves) controls attack difficulty:

| Fixed Size | Letters Changed | Difficulty |
| --- | --- | --- |
| 0-5 | 21-26 letters | Very Hard |
| 6-10 | 16-20 letters | Hard |
| 11-15 | 11-15 letters | Medium |
| 16-20 | 6-10 letters | Easy |
| 21-26 | 0-5 letters | Very Easy |

More capable models can handle more complex encodings (lower fixed size), while less capable models require easier mappings.
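
One plausible way to implement the fixed-size parameter, assuming "fixed size" simply counts the letters that map to themselves as described above:

```python
import random
import string

def make_fixed_size_bijection(fixed_size, seed=0):
    """Keep `fixed_size` letters mapped to themselves and permute the rest."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, fixed_size))
    movable = [c for c in letters if c not in fixed]
    shuffled = movable[:]
    # Reshuffle until no movable letter maps to itself (skip when fewer than 2 remain).
    while len(movable) >= 2:
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(movable, shuffled)):
            break
    mapping = {c: c for c in fixed}
    mapping.update(zip(movable, shuffled))
    return mapping

# Lower fixed_size -> more letters remapped -> harder encoding for the model to learn.
hard_mapping = make_fixed_size_bijection(fixed_size=3)   # 23 letters changed
easy_mapping = make_fixed_size_bijection(fixed_size=22)  # 4 letters changed
```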

Why This Attack Is Significant

1. Exploits Core Capabilities

Unlike attacks that exploit bugs or edge cases, Bijection Learning exploits the fundamental capability that makes LLMs useful: their ability to learn from examples.

2. Scalability Problem

Ordinarily, safety behavior improves alongside capability. This attack inverts that relationship: the more capable the model, the more effective the attack, so safety becomes harder to maintain as models improve.

3. Defense Challenges

Defending against this attack is difficult because:

  • Blocking in-context learning would cripple model utility
  • Detecting encoded content requires solving the encoding
  • The space of possible bijections is effectively unbounded, so individual encodings cannot simply be blocklisted

Potential Defenses

| Defense | Effectiveness | Trade-offs |
| --- | --- | --- |
| Encoding Detection | Low | High false positives |
| Response Filtering | Medium | Computational overhead |
| Capability Limiting | High | Reduces model utility |
| Adversarial Training | Medium | Arms race dynamic |
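
To illustrate why encoding detection tends toward high false positives, a naive input filter might flag text whose character distribution departs from natural language. The heuristic and threshold below are purely illustrative.

```python
import string

def looks_encoded(text, threshold=0.4):
    """Flag input whose non-letter, non-space character ratio is unusually high."""
    if not text:
        return False
    unusual = sum(1 for ch in text if ch not in string.ascii_letters + " ")
    return unusual / len(text) > threshold

# Ordinary prose passes, digit-mapped text trips the check,
# but so would code, math, or non-Latin scripts (false positives).
print(looks_encoded("Please summarize this report for me."))  # False
print(looks_encoded("17-42-31-31-25 53-25-28-31-14"))         # True
```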

Research Background

Based on: "Endless Jailbreaks with Bijection Learning" by Brian R.Y. Huang, Maximilian Li, and Leonard Tang (2024), Haize Labs

See Also