Attack Algorithms
VirtueRed implements a comprehensive suite of attack algorithms designed for defensive security research, red-teaming, and AI safety evaluation. Each algorithm represents a distinct approach to testing LLM robustness against adversarial inputs.
Overview
Attack algorithms in VirtueRed are categorized by their methodology and attack vector:
| Category | Algorithms | Description |
|---|---|---|
| Encoding-Based | Bijection Learning, Language Game, Flip Attack | Transform prompts using various encoding schemes |
| Social Engineering | DarkCite, Humor Attack | Exploit psychological biases and social dynamics |
| Iterative Optimization | Crescendo, BoN Attack | Use multiple attempts to find successful attacks |
| Augmentation-Based | Emoji Attack | Modify prompts with additions or variations |
Algorithm Comparison
| Algorithm | Query Type | Complexity | Best For |
|---|---|---|---|
| FlipAttack | Single-turn | Low | Quick vulnerability assessment |
| Crescendo | Multi-turn | Medium | Exploiting conversation patterns |
| BoN Attack | Multi-sample | High | Comprehensive testing |
| Bijection Learning | Single-turn | Medium | Scale-adaptive attacks |
| DarkCite | Single-turn | Low | Authority bias exploitation |
| Language Game | Single-turn | Low | Encoding robustness testing |
| Humor Attack | Single-turn | Low | Social engineering vectors |
| Emoji Attack | Single-turn | Low | Filter bypass testing |
Attack Methodology Categories
Encoding-Based Attacks
These attacks disguise harmful prompts through various transformation techniques:
- Bijection Learning - Teaches models custom encoding languages
- Language Game - Uses linguistic transformations (leet speak, pig latin, etc.)
- Flip Attack - Flips text at word, character, or sentence level
Social Engineering Attacks
These attacks exploit psychological and social biases in LLMs:
- DarkCite - Leverages authority citation bias
- Humor Attack - Uses humor and playful framing
Iterative Optimization Attacks
These attacks use multiple attempts or turns to find successful jailbreaks:
- Crescendo Attack - Progressive multi-turn escalation
- BoN Attack - Best-of-N sampling with augmentations
Augmentation-Based Attacks
These attacks modify prompts with additions:
- Emoji Attack - Replaces words with emojis
Evaluation Scoring
All attacks use a consistent scoring system:
| Score | Meaning |
|---|---|
| 0.0 | Successful jailbreak (model produced harmful content) |
| 1.0 | Failed jailbreak (model refused or gave safe response) |
| 0.0-1.0 | Partial success based on response analysis |
Evaluation combines:
- LLM-based Assessment - Uses UnifiedJudge with tiered harmfulness evaluation
- Keyword-based Fallback - Robust fallback for reliability
- Multi-criteria Analysis - Considers refusal, harmfulness, specificity, and relevance
Security Notice
All attack algorithms in VirtueRed are designed exclusively for:
- Red-teaming and safety evaluation of language models
- Research into LLM robustness and security
- Development of better safety mechanisms and filters
- Authorized security testing and defensive purposes
See Also
- VirtueRed Overview - Platform overview
- Risk Categories - What risks are tested