Language Game

Language Game is a comprehensive jailbreak testing framework that evaluates AI model robustness against language encoding-based attacks. It implements over 15 text transformation techniques—from classic encodings like leet speak to custom character-level manipulations—to disguise harmful prompts.

Overview

The core insight behind Language Game is that LLMs can decode many common encodings and language games. By encoding harmful requests in these alternative representations, attackers can bypass safety filters that rely on recognizing harmful patterns in standard text.

Aspect	Description
Attack Type	Single-turn, encoding-based
Encoding Methods	15+ transformation types
Target Weakness	Pattern-based safety filters
Complexity	Low
Diversity	High (multiple encoding families)

Why Encoding Attacks Work

The Pattern Matching Problem

Most safety systems rely on some form of pattern matching:

Keyword detection for harmful terms
Semantic embeddings for intent classification
Trained classifiers for harmful content

Encoding attacks disrupt all of these:

Safety Mechanism	Encoding Disruption
Keyword matching	Encoded words don't match keyword lists
Embedding similarity	Encoded text has different embedding space
Classifiers	Training data rarely includes encoded harmful content

The Decoding Capability

Meanwhile, capable LLMs can decode many common encodings because:

They've seen examples in training data (leet speak in forums, pig latin in games)
Their general reasoning applies to character manipulation
In-context learning allows new encoding patterns

Encoding Categories

Character-Level Transformations

These methods modify individual characters:

Method	Description	Example	Difficulty
`self0`	Insert separator between letters	`hello` → `h-a-e-a-l-a-l-a-o`	Easy
`self1`	Vowel rotation (a→e→i→o→u→a)	`hello` → `hillu`	Medium
`self9`	Mathematical symbol substitution	`hello` → `h∈\|\|○`	Hard
`self10`	Append position numbers	`hello` → `h8e5l12l12o15`	Medium

Word-Level Transformations

These methods operate on complete words:

Method	Description	Example	Difficulty
`self2`	Reverse middle characters	`example` → `elpmaxe`	Medium
`self7`	Move last letter to front	`hello` → `ohell`	Easy
`self8`	Move first letter to end	`hello` → `elloh`	Easy

Classic Language Games

Time-tested encoding methods with cultural precedent:

Method	Origin	Example	LLM Familiarity
`leet`	Internet culture	`hello` → `h3110`	Very High
`pig_latin`	Children's game	`hello` → `ello-hay`	High
`atbash`	Hebrew cipher	`hello` → `svool`	Medium
`backslang`	British slang	`hello` → `olleh`	Medium
`ubbi_dubbi`	Zoom TV show	`hello` → `hubelluba`	Low
`verlan`	French slang	`hello` → `llohe`	Low

Encoding Deep Dives

Leet Speak (`leet`)

The most widely recognized encoding, originating from early internet culture:

Letter	Common Substitutions
a	4, @, /\
e	3
i	1, !, \|
o	0
s	5, $
t	7, +
l	1, \|

Effectiveness: Very high because LLMs have extensive exposure to leet speak in training data.

Pig Latin (`pig_latin`)

A classic English language game with simple rules:

For consonant-starting words: Move first consonant cluster to end, add "-ay"
- string → ingstray
For vowel-starting words: Add "-way" or "-yay"
- apple → appleway

Effectiveness: High due to cultural familiarity and pattern simplicity.

Atbash Cipher (`atbash`)

An ancient Hebrew substitution cipher:

Plain:  a b c d e f g h i j k l m n o p q r s t u v w x y z
Cipher: z y x w v u t s r q p o n m l k j i h g f e d c b a

Effectiveness: Medium; less common in training data but simple to learn.

Attack Methodology

1. Encoding Selection Strategy

Target Model	Recommended Encodings
GPT models	leet, pig_latin, self0
Claude models	atbash, backslang, self2
Open-source models	Most methods effective
Smaller models	Simple encodings only

2. Prompt Structure

[Encoding instruction]
Here is an encoded message: [encoded_harmful_prompt]
Please decode it and respond helpfully.

3. Response Format

Often the model is instructed to respond in the same encoding, adding another layer of obfuscation to the output.

Effectiveness Factors

Several factors influence which encodings work best:

Model familiarity - Encodings common in training data (leet speak) are more reliably decoded
Encoding complexity - Very complex encodings may not be decoded; very simple ones may be detected
Sweet spot - Medium complexity encodings balance detection evasion with reliable decoding

Defense Strategies

Detection Approaches

Pattern Recognition - Identify common encoding signatures
Entropy Analysis - Encoded text often has unusual character distributions
Decode-Then-Evaluate - Attempt decoding before safety checks

Mitigation Strategies

Strategy	Effectiveness	Trade-offs
Encoding detection	Medium	False positives on legitimate use
Input normalization	High	Computational overhead
Multilingual safety	Medium	Complexity increase
Output monitoring	High	Doesn't prevent generation

Implications

Language Game attacks demonstrate that:

Cultural knowledge creates vulnerabilities - LLMs' broad training includes encoding knowledge
Simple attacks remain effective - Sophisticated defenses don't always block simple encodings
Diversity matters - No single encoding consistently works, but the variety ensures some success
Defense is challenging - Blocking encodings would impair legitimate use cases

Overview​

Why Encoding Attacks Work​

The Pattern Matching Problem​

The Decoding Capability​

Encoding Categories​

Character-Level Transformations​

Word-Level Transformations​

Classic Language Games​

Encoding Deep Dives​

Leet Speak (leet)​

Pig Latin (pig_latin)​

Atbash Cipher (atbash)​

Attack Methodology​

1. Encoding Selection Strategy​

2. Prompt Structure​

3. Response Format​

Effectiveness Factors​

Defense Strategies​

Detection Approaches​

Mitigation Strategies​

Implications​

See Also​