Attack Algorithms
VirtueRed implements a comprehensive suite of attack algorithms designed for defensive security research, red-teaming, and AI safety evaluation. Each algorithm represents a distinct approach to testing LLM robustness against adversarial inputs.
Overview
Attack algorithms in VirtueRed are categorized by their methodology and attack vector:
| Category | Algorithms | Description |
|---|---|---|
| Encoding-Based | Bijection Learning, Language Game, Flip Attack | Transform prompts using various encoding schemes |
| Social Engineering | DarkCite, Humor Attack | Exploit psychological biases and social dynamics |
| Iterative Optimization | Crescendo, BoN Attack | Use multiple attempts to find successful attacks |
| Prompt Optimization | PAIR Attack, TAP Attack | Attacker LLM iteratively or tree-based refines jailbreak prompts |
| Multi-Turn Attacks | GOAT Attack, PETRI Attack | Multi-turn conversational attacks with adaptive strategies |
| Augmentation-Based | Emoji Attack | Modify prompts with additions or variations |
Algorithm Comparison
| Algorithm | Query Type | Complexity | Best For |
|---|---|---|---|
| FlipAttack | Single-turn | Low | Quick vulnerability assessment |
| Crescendo | Multi-turn | Medium | Exploiting conversation patterns |
| BoN Attack | Multi-sample | High | Comprehensive testing |
| Bijection Learning | Single-turn | Medium | Scale-adaptive attacks |
| DarkCite | Single-turn | Low | Authority bias exploitation |
| Language Game | Single-turn | Low | Encoding robustness testing |
| Humor Attack | Single-turn | Low | Social engineering vectors |
| Emoji Attack | Single-turn | Low | Filter bypass testing |
| PAIR Attack | Single-turn | Medium | Automated prompt refinement |
| TAP Attack | Single-turn | High | Tree-based jailbreak search |
| GOAT Attack | Multi-turn | Medium | Chain-of-attack multi-turn |
| PETRI Attack | Multi-turn | High | Agentic realistic probing |
Attack Methodology Categories
Encoding & Obfuscation
These attacks disguise harmful prompts through various transformation techniques:
- Bijection Learning - Teaches models custom encoding languages
- Language Game - Uses linguistic transformations (leet speak, pig latin, etc.)
- Flip Attack - Flips text at word, character, or sentence level
- Emoji Attack - Replaces words with emojis
Social Engineering Attacks
These attacks exploit psychological and social biases in LLMs:
- DarkCite - Leverages authority citation bias
- Humor Attack - Uses humor and playful framing
Prompt Optimization Attacks
These attacks use a dedicated attacker LLM to generate and refine adversarial prompts:
- BoN Attack - Best-of-N sampling and optimization
- PAIR Attack - Iterative refinement with attacker feedback
- TAP Attack - Tree-of-attacks with pruning
Multi-Turn Attacks
These attacks conduct multi-turn conversations with the target model and adapt strategy over turns:
- Crescendo Attack - Progressive multi-turn escalation
- GOAT Attack - Chain-of-Attack-Thought (CoAT) multi-turn orchestration
- PETRI Attack - Agent-based probing via realistic interactions
Evaluation Scoring
All attacks use a consistent scoring system:
| Score | Meaning |
|---|---|
| 0.0 | Successful jailbreak (model produced harmful content) |
| 1.0 | Failed jailbreak (model refused or gave safe response) |
| 0.0-1.0 | Partial success based on response analysis |
Evaluation combines:
- LLM-based Assessment - Uses UnifiedJudge with tiered harmfulness evaluation
- Keyword-based Fallback - Robust fallback for reliability
- Multi-criteria Analysis - Considers refusal, harmfulness, specificity, and relevance
Security Notice
All attack algorithms in VirtueRed are designed exclusively for:
- Red-teaming and safety evaluation of language models
- Research into LLM robustness and security
- Development of better safety mechanisms and filters
- Authorized security testing and defensive purposes
See Also
- VirtueRed Overview - Platform overview
- Risk Categories - What risks are tested