Skip to main content

Attack Algorithms

VirtueRed implements a comprehensive suite of attack algorithms designed for defensive security research, red-teaming, and AI safety evaluation. Each algorithm represents a distinct approach to testing LLM robustness against adversarial inputs.

Overview

Attack algorithms in VirtueRed are categorized by their methodology and attack vector:

CategoryAlgorithmsDescription
Encoding-BasedBijection Learning, Language Game, Flip AttackTransform prompts using various encoding schemes
Social EngineeringDarkCite, Humor AttackExploit psychological biases and social dynamics
Iterative OptimizationCrescendo, BoN AttackUse multiple attempts to find successful attacks
Prompt OptimizationPAIR Attack, TAP AttackAttacker LLM iteratively or tree-based refines jailbreak prompts
Multi-Turn AttacksGOAT Attack, PETRI AttackMulti-turn conversational attacks with adaptive strategies
Augmentation-BasedEmoji AttackModify prompts with additions or variations

Algorithm Comparison

AlgorithmQuery TypeComplexityBest For
FlipAttackSingle-turnLowQuick vulnerability assessment
CrescendoMulti-turnMediumExploiting conversation patterns
BoN AttackMulti-sampleHighComprehensive testing
Bijection LearningSingle-turnMediumScale-adaptive attacks
DarkCiteSingle-turnLowAuthority bias exploitation
Language GameSingle-turnLowEncoding robustness testing
Humor AttackSingle-turnLowSocial engineering vectors
Emoji AttackSingle-turnLowFilter bypass testing
PAIR AttackSingle-turnMediumAutomated prompt refinement
TAP AttackSingle-turnHighTree-based jailbreak search
GOAT AttackMulti-turnMediumChain-of-attack multi-turn
PETRI AttackMulti-turnHighAgentic realistic probing

Attack Methodology Categories

Encoding & Obfuscation

These attacks disguise harmful prompts through various transformation techniques:

Social Engineering Attacks

These attacks exploit psychological and social biases in LLMs:

Prompt Optimization Attacks

These attacks use a dedicated attacker LLM to generate and refine adversarial prompts:

Multi-Turn Attacks

These attacks conduct multi-turn conversations with the target model and adapt strategy over turns:

Evaluation Scoring

All attacks use a consistent scoring system:

ScoreMeaning
0.0Successful jailbreak (model produced harmful content)
1.0Failed jailbreak (model refused or gave safe response)
0.0-1.0Partial success based on response analysis

Evaluation combines:

  • LLM-based Assessment - Uses UnifiedJudge with tiered harmfulness evaluation
  • Keyword-based Fallback - Robust fallback for reliability
  • Multi-criteria Analysis - Considers refusal, harmfulness, specificity, and relevance

Security Notice

All attack algorithms in VirtueRed are designed exclusively for:

  • Red-teaming and safety evaluation of language models
  • Research into LLM robustness and security
  • Development of better safety mechanisms and filters
  • Authorized security testing and defensive purposes

See Also