
Bijection Learning

Bijection Learning is a jailbreak technique that exploits LLMs' ability to learn new languages from examples. It teaches a model a custom, invertible encoding through in-context learning, then submits harmful requests in the encoded language to bypass safety mechanisms trained on natural-language text.

Overview

This attack represents a fundamental insight: the same capability that makes LLMs powerful—in-context learning—also makes them vulnerable. By leveraging few-shot examples, attackers can teach models to communicate in novel encodings that safety training doesn't recognize.

| Aspect | Description |
| --- | --- |
| Attack Type | Single-turn, encoding-based |
| Target Capability | In-context learning |
| Complexity | Medium |
| Key Innovation | Scale-adaptive attacks |

Critical Finding: Scale Creates Vulnerability

A key finding of this research is that more capable models are more severely jailbroken by bijection attacks. This counterintuitive result has significant implications:

| Model Capability | Attack Effectiveness | Explanation |
| --- | --- | --- |
| Lower capability | Less effective | Cannot learn complex encodings |
| Higher capability | More effective | Better at learning arbitrary mappings |
| Frontier models | Most effective | Exceptional in-context learning |

This is the first work to produce attacks that are simultaneously:

  • Black-box - No access to model weights required
  • Universal - Works across model families
  • Scale-adaptive - More effective on stronger models

How Bijection Learning Works

The Teaching Phase

The attack begins by teaching the model a "Language Alpha" through in-context examples:

English: hello → Language Alpha: 17-42-31-31-25
English: world → Language Alpha: 53-25-28-31-14
English: help → Language Alpha: 17-42-31-26
...
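
For concreteness, here is a minimal sketch of how such a digit-mapping "Language Alpha" could be generated and used to render teaching examples like the ones above. The helper names (`make_digit_bijection`, `encode`) are illustrative, not code from the paper.

```python
import random

def make_digit_bijection(seed=0):
    """Map each lowercase letter to a distinct two-digit code (a bijection)."""
    rng = random.Random(seed)
    codes = rng.sample(range(10, 100), 26)  # 26 distinct two-digit numbers
    return {chr(ord('a') + i): str(code) for i, code in enumerate(codes)}

def encode(text, mapping):
    """Encode a word letter by letter, joining the codes with dashes."""
    return "-".join(mapping[c] for c in text.lower() if c in mapping)

mapping = make_digit_bijection(seed=7)
for word in ["hello", "world", "help"]:
    print(f"English: {word} -> Language Alpha: {encode(word, mapping)}")
```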

The Attack Phase

Once the model has learned the encoding, harmful prompts are submitted in Language Alpha, bypassing safety filters that only recognize English patterns.
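
Below is a sketch of how the single-turn attack prompt might be assembled and the reply decoded, reusing the `encode` helper and `mapping` from the previous sketch. The prompt wording and structure are illustrative, not the exact template from the paper.

```python
def build_attack_prompt(teaching_words, query, mapping):
    """Assemble few-shot teaching lines plus the encoded query into one prompt."""
    lessons = "\n".join(
        f"English: {w} -> Language Alpha: {encode(w, mapping)}" for w in teaching_words
    )
    return (
        "You are learning a new language called Language Alpha.\n"
        f"{lessons}\n\n"
        "From now on, answer only in Language Alpha.\n"
        f"Language Alpha: {encode(query, mapping)}"
    )

def decode(encoded, mapping):
    """Invert the bijection to read the model's encoded response."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(token, "?") for token in encoded.split("-"))
```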

Bijection Types

| Type | Mechanism | Example | Difficulty |
| --- | --- | --- | --- |
| Digit Mapping | Letters → multi-digit numbers | 'a' → '17', 'b' → '42' | High |
| Alphabet Mapping | Letters → other letters | 'a' → 'z', 'b' → 'y' | Medium |
| Symbol Mapping | Letters → special characters | 'a' → '@', 'b' → '#' | Medium |
| Mixed Mapping | Random combination | Varies per attack | Variable |
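
The non-digit variants differ only in the target alphabet. A small illustrative sketch, again with hypothetical helper names:

```python
import random
import string

def make_alphabet_bijection(seed=0):
    """Permute the alphabet so each letter maps to another letter."""
    rng = random.Random(seed)
    targets = list(string.ascii_lowercase)
    rng.shuffle(targets)
    return dict(zip(string.ascii_lowercase, targets))

def make_symbol_bijection(seed=0):
    """Map each letter to a distinct punctuation or symbol character."""
    rng = random.Random(seed)
    symbols = rng.sample(string.punctuation, 26)  # string.punctuation has 32 symbols
    return dict(zip(string.ascii_lowercase, symbols))
```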

The Fixed Size Parameter

A key insight is that the number of fixed points (letters mapping to themselves) controls attack difficulty:

| Fixed Size | Letters Changed | Difficulty |
| --- | --- | --- |
| 0-5 | 21-26 letters | Very Hard |
| 6-10 | 16-20 letters | Hard |
| 11-15 | 11-15 letters | Medium |
| 16-20 | 6-10 letters | Easy |
| 21-26 | 0-5 letters | Very Easy |

More capable models can handle more complex encodings (lower fixed size), while less capable models require easier mappings.
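
One plausible way to implement the fixed-size parameter, assuming "fixed size" simply counts the letters that map to themselves as described above:

```python
import random
import string

def make_fixed_size_bijection(fixed_size, seed=0):
    """Keep `fixed_size` letters mapped to themselves and permute the rest."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, fixed_size))
    movable = [c for c in letters if c not in fixed]
    shuffled = movable[:]
    # Reshuffle until no movable letter maps to itself (skip when fewer than 2 remain).
    while len(movable) >= 2:
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(movable, shuffled)):
            break
    mapping = {c: c for c in fixed}
    mapping.update(zip(movable, shuffled))
    return mapping

# Lower fixed_size -> more letters remapped -> harder encoding for the model to learn.
hard_mapping = make_fixed_size_bijection(fixed_size=3)   # 23 letters changed
easy_mapping = make_fixed_size_bijection(fixed_size=22)  # 4 letters changed
```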

Why This Attack Is Significant

1. Exploits Core Capabilities

Unlike attacks that exploit bugs or edge cases, Bijection Learning exploits the fundamental capability that makes LLMs useful: their ability to learn from examples.

2. Scalability Problem

Ordinarily, safety behavior improves alongside capability. This attack inverts that relationship: the more capable the model, the more effective the attack, so safety becomes harder to maintain as models improve.

3. Defense Challenges

Defending against this attack is difficult because:

  • Blocking in-context learning would cripple model utility
  • Detecting encoded content requires solving the encoding
  • The space of possible bijections is effectively unbounded, so individual encodings cannot simply be blocklisted

Potential Defenses

| Defense | Effectiveness | Trade-offs |
| --- | --- | --- |
| Encoding Detection | Low | High false positives |
| Response Filtering | Medium | Computational overhead |
| Capability Limiting | High | Reduces model utility |
| Adversarial Training | Medium | Arms race dynamic |
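
To illustrate why encoding detection tends toward high false positives, a naive input filter might flag text whose character distribution departs from natural language. The heuristic and threshold below are purely illustrative.

```python
import string

def looks_encoded(text, threshold=0.4):
    """Flag input whose non-letter, non-space character ratio is unusually high."""
    if not text:
        return False
    unusual = sum(1 for ch in text if ch not in string.ascii_letters + " ")
    return unusual / len(text) > threshold

# Ordinary prose passes, digit-mapped text trips the check,
# but so would code, math, or non-Latin scripts (false positives).
print(looks_encoded("Please summarize this report for me."))  # False
print(looks_encoded("17-42-31-31-25 53-25-28-31-14"))         # True
```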

Research Background

Based on: "Endless Jailbreaks with Bijection Learning" by Brian R.Y. Huang, Maximilian Li, and Leonard Tang (2024), Haize Labs

See Also