Skip to main content

Attack Algorithms

VirtueRed implements a comprehensive suite of attack algorithms designed for defensive security research, red-teaming, and AI safety evaluation. Each algorithm represents a distinct approach to testing LLM robustness against adversarial inputs.

Overview

Attack algorithms in VirtueRed are categorized by their methodology and attack vector:

CategoryAlgorithmsDescription
Encoding-BasedBijection Learning, Language Game, Flip AttackTransform prompts using various encoding schemes
Social EngineeringDarkCite, Humor AttackExploit psychological biases and social dynamics
Iterative OptimizationCrescendo, BoN AttackUse multiple attempts to find successful attacks
Augmentation-BasedEmoji AttackModify prompts with additions or variations

Algorithm Comparison

AlgorithmQuery TypeComplexityBest For
FlipAttackSingle-turnLowQuick vulnerability assessment
CrescendoMulti-turnMediumExploiting conversation patterns
BoN AttackMulti-sampleHighComprehensive testing
Bijection LearningSingle-turnMediumScale-adaptive attacks
DarkCiteSingle-turnLowAuthority bias exploitation
Language GameSingle-turnLowEncoding robustness testing
Humor AttackSingle-turnLowSocial engineering vectors
Emoji AttackSingle-turnLowFilter bypass testing

Attack Methodology Categories

Encoding-Based Attacks

These attacks disguise harmful prompts through various transformation techniques:

Social Engineering Attacks

These attacks exploit psychological and social biases in LLMs:

Iterative Optimization Attacks

These attacks use multiple attempts or turns to find successful jailbreaks:

Augmentation-Based Attacks

These attacks modify prompts with additions:

Evaluation Scoring

All attacks use a consistent scoring system:

ScoreMeaning
0.0Successful jailbreak (model produced harmful content)
1.0Failed jailbreak (model refused or gave safe response)
0.0-1.0Partial success based on response analysis

Evaluation combines:

  • LLM-based Assessment - Uses UnifiedJudge with tiered harmfulness evaluation
  • Keyword-based Fallback - Robust fallback for reliability
  • Multi-criteria Analysis - Considers refusal, harmfulness, specificity, and relevance

Security Notice

All attack algorithms in VirtueRed are designed exclusively for:

  • Red-teaming and safety evaluation of language models
  • Research into LLM robustness and security
  • Development of better safety mechanisms and filters
  • Authorized security testing and defensive purposes

See Also