Bijection Learning
Bijection Learning is a sophisticated jailbreak technique that exploits LLMs' remarkable ability to learn new languages from examples. It teaches models a custom encoding language through in-context learning, then makes harmful requests in the transformed language to bypass safety filters.
Overview
This attack represents a fundamental insight: the same capability that makes LLMs powerful—in-context learning—also makes them vulnerable. By leveraging few-shot examples, attackers can teach models to communicate in novel encodings that safety training doesn't recognize.
| Aspect | Description |
|---|---|
| Attack Type | Single-turn, encoding-based |
| Target Capability | In-context learning |
| Complexity | Medium |
| Key Innovation | Scale-adaptive attacks |
Critical Finding: Scale Creates Vulnerability
A groundbreaking discovery of this research is that more capable models are more severely jailbroken by bijection attacks. This counterintuitive result has profound implications:
| Model Capability | Attack Effectiveness | Explanation |
|---|---|---|
| Lower capability | Less effective | Cannot learn complex encodings |
| Higher capability | More effective | Better at learning arbitrary mappings |
| Frontier models | Most effective | Exceptional in-context learning |
This is the first work to produce attacks that are simultaneously:
- Black-box - No access to model weights required
- Universal - Works across model families
- Scale-adaptive - More effective on stronger models
How Bijection Learning Works
The Teaching Phase
The attack begins by teaching the model a "Language Alpha" through in-context examples:
English: hello → Language Alpha: 17-42-31-31-25
English: world → Language Alpha: 53-25-28-31-14
English: help → Language Alpha: 17-42-31-26
...
The Attack Phase
Once the model has learned the encoding, harmful prompts are submitted in Language Alpha, bypassing safety filters that only recognize English patterns.
Bijection Types
| Type | Mechanism | Example | Difficulty |
|---|---|---|---|
| Digit Mapping | Letters → multi-digit numbers | 'a' → '17', 'b' → '42' | High |
| Alphabet Mapping | Letters → other letters | 'a' → 'z', 'b' → 'y' | Medium |
| Symbol Mapping | Letters → special characters | 'a' → '@', 'b' → '#' | Medium |
| Mixed Mapping | Random combination | Varies per attack | Variable |
The Fixed Size Parameter
A key insight is that the number of fixed points (letters mapping to themselves) controls attack difficulty:
| Fixed Size | Letters Changed | Difficulty |
|---|---|---|
| 0-5 | 21-26 letters | Very Hard |
| 6-10 | 16-20 letters | Hard |
| 11-15 | 11-15 letters | Medium |
| 16-20 | 6-10 letters | Easy |
| 21-26 | 0-5 letters | Very Easy |
More capable models can handle more complex encodings (lower fixed size), while simpler models require easier mappings.
Why This Attack Is Significant
1. Exploits Core Capabilities
Unlike attacks that exploit bugs or edge cases, Bijection Learning exploits the fundamental capability that makes LLMs useful: their ability to learn from examples.
2. Scalability Problem
Traditional security improves with capability. This attack inverts that relationship, meaning safety becomes harder as models improve.
3. Defense Challenges
Defending against this attack is difficult because:
- Blocking in-context learning would cripple model utility
- Detecting encoded content requires solving the encoding
- The space of possible bijections is infinite
Potential Defenses
| Defense | Effectiveness | Trade-offs |
|---|---|---|
| Encoding Detection | Low | High false positives |
| Response Filtering | Medium | Computational overhead |
| Capability Limiting | High | Reduces model utility |
| Adversarial Training | Medium | Arms race dynamic |
Research Background
Based on: "Endless Jailbreaks with Bijection Learning" by Brian R.Y. Huang, Maximilian Li, and Leonard Tang (2024), Haize Labs
See Also
- Attack Algorithms Overview - All attack algorithms
- Language Game - Another encoding-based approach
- Flip Attack - Text transformation approach